The Merge Join is a Physical Operation when joining 2 sets of data that are in the same order.
There is an interesting Clustered Index in the Adventure Works database. The SaleOrderDetail table’s primary key is a combination of the SalesOrderID and SalesOrderDetailID, not just the identity field of the table (SalesOrderDetailID). Most developers I work with usually create the primary key (by default is the clustered index) on the Identity fields with a unique constraint on the business key of the table.
The compound primary key might take more space, but it looks to help query plans have more options when optimizing.
The following query will use a Merge Join after doing a clustered index scan on SalesOrderHeader and SalesOrderDetail.
SELECT *
FROM Sales.SalesOrderHeader soh
INNER JOIN Sales.SalesOrderDetail sod
ON sod.SalesOrderID = soh.SalesOrderID
Now, of course, no one does a SELECT *, right? Maybe not, but you can see that since the first column in the SalesOrderHeader and SalesOrderDetail clustered indexes is the SalesOrderID, the query optimizer can Merge the 2 data sets together because they are in the same order. The optimizer knows this is the best join.
A benefit to the Merge Join operation is once the first piece of data (SalesOrderID) is merged with Detail data, the data can be passed to the next operator, which can improve performance if further processing is needed.
Say I want a list of SalesOrderIDs by ProductID and retrieve the Product Name from the product tables
SELECT sod.SalesOrderID,
p.ProductID, p.Name
FROM Production.Product p
INNER JOIN Sales.SalesOrderDetail sod
ON sod.ProductID = p.ProductID
The SalesOrderDetail Scan is on a non-clustered index on ProductID. This is how the Merge is able to be used with the Clustered Index scan of the Product table
Below is the info on the Scan Operators which shows the Object (index) used in the operator.
What happens when we add a column to this query – Order Qty.
SELECT sod.SalesOrderID, sod.OrderQty,
p.ProductID, p.Name
FROM Production.Product p
INNER JOIN Sales.SalesOrderDetail sod
ON sod.ProductID = p.ProductID
The optimizer determines the Hash Match physical join is more efficient. The Cost is 1.7226.
An option here is to create a covering index by adding the OrderQty column to a non-cluster index using the INCLUDE statement of the Create Index statement.
CREATE NONCLUSTERED INDEX [IX_SalesOrderDetail_ProductIDIncludeOrderQty] ON [Sales].[SalesOrderDetail]
([ProductID] ASC )
INCLUDE (OrderQty)
You can force a Merge Join with a query hint to see the difference without the Covering Index.
-- Force Merge join
SELECT sod.SalesOrderID, sod.OrderQty,
p.ProductID, p.Name
FROM Production.Product p
INNER MERGE JOIN Sales.SalesOrderDetail sod
ON sod.ProductID = p.ProductID
The Cost is now 6.3887 and the query runs with parallelism. And, a Sort operator (74% of query cost) is needed if you do not have the covering index above.
Query Hints should be left to the experts.
If the data sets are not in the same order based on the Index (clustered or non-clustered), the optimizer can still use a Sort operator to order the data sets in the same logical order to use a Merge Operation.
The AdventureWorks database is available from CodePlex at http://msftdbprodsamples.codeplex.com/
I first learned Execution Plans from a free PDF provided by Red Gate with author Grant Fritchey. You can get this PDF at http://www.simple-talk.com/sql/performance/execution-plan-basics/
Happy Query Tuning from Louisiana!!!
No comments:
Post a Comment