CodeBetter.Com
CodeBetter.Com
RSS 2.0 via Feedburner
           Do you Twitter? Follow us @CodeBetter

Raymond Lewallen

Framework Design, Agile Coach, President Oklahoma City Developers Group, Microsoft MVP C#, TDD, Continuous Integration, Patterns and Practices, Domain Driven Design, Speaker, VB.Net, C# and Sql Server

New INTERSECT and EXCEPT operators in Sql Server 2005 T-SQL

This is the first post of many to come that talk about new enhancements in the Sql Server 2005 T-Sql language.

INTERSECT

Let say you have a products table.  I’m going to use the products table from the AdventureWorks database.  Now let’s say you imported a list of products into the database into a table called productduplicates.  Just for a reason, let’s say you manufacturer sent you a full list of products, and you need to see which products you already have in your database.

Normally, if you didn’t have matching primary keys on both tables, like a productId, you would do an inner join on some columns that uniquely identify the rows, as such:

INNER JOINselect a.[name], a.productnumber
 from production.product a
inner join production.productduplicates b
 on a.[name] = b.[name]
 and a.productnumber = b.productnumber

Execution plan for inner joinHash Match(Inner Join, HASH:([b].[Name], [b].[ProductNumber])=([a].[Name], [a].[ProductNumber]), RESIDUAL:([AdventureWorks].[Production].[Product].[Name] as [a].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] as [b].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber] as [a].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber] as [b].[ProductNumber]))
  |--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates] AS [b]))       
  |--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [a]))     

Now you can replace this with a simple INTERSECT statement like the following:

INTERSECTselect [name], productnumber from production.product
intersect
select [name], productnumber from production.productduplicates

Execution plan for INTERSECTHash Match(Right Semi Join, HASH:([AdventureWorks].[Production].[productduplicates].[Name], [AdventureWorks].[Production].[productduplicates].[ProductNumber])=([AdventureWorks].[Production].[Product].[Name], [AdventureWorks].[Production].[Product].[ProductNumber]), RESIDUAL:([AdventureWorks].[Production].[Product].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]))
  |--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates])) 
  |--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]))  

Just like if you were using a UNION for two select statements, the same number of columns must exist in both select statements, and they must be of the same datatype.  In both above examples, you get a list of products that exist in both tables, so you know which products from the manufacturer you already have listed.

I haven’t seen any performance difference in the two, but I’ve only played with this on small tables.  The execution plan does show that while you are doing a hash match inner join on the first query, the second query performs a hash match right semi join.  Both hash matches perform 50% of the work in the query.  The other 50% is split evenly between the table scan and index scan in both queries.  So what does the fact that a right semi match is performed when using INTERSECT?  I tells me its doing the same thing as if we were to use the EXISTS, like so:

EXISTSselect a.[name], a.productnumber
 from production.product a
where exists(select b.[name], b.productnumber from production.productduplicates b
 where a.[name] = b.[name]
 and a.productnumber = b.productnumber)

EXCEPT

Same scenario, if you wanted to know which products you have that do not exist in the duplicates table, you would normally do a NOT EXISTS like so:

NOT EXISTSselect a.[name], a.productnumber
 from production.product a
where not exists(select b.[name], b.productnumber from production.productduplicates b
 where a.[name] = b.[name]
 and a.productnumber = b.productnumber)

Execution plan for NOT EXISTSHash Match(Right Anti Semi Join, HASH:([b].[Name], [b].[ProductNumber])=([a].[Name], [a].[ProductNumber]), RESIDUAL:([AdventureWorks].[Production].[Product].[Name] as [a].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] as [b].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber] as [a].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber] as [b].[ProductNumber]))
  |--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates] AS [b]))
  |--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [a]))

And you get back the list of products that you have that aren’t in the duplicates list.  You can flip this around to figure out which products are in the duplicates that you don’t have in your products.

Let’s simply the code a little bit by using the EXCEPT operator.  This looks identical to the INTERSECTS query:

EXCEPTselect [name], productnumber from production.product
except
select [name], productnumber from production.productduplicates

Execution plan for exceptHash Match(Right Anti Semi Join, HASH:([AdventureWorks].[Production].[productduplicates].[Name], [AdventureWorks].[Production].[productduplicates].[ProductNumber])=([AdventureWorks].[Production].[Product].[Name], [AdventureWorks].[Production].[Product].[ProductNumber]), RESIDUAL:([AdventureWorks].[Production].[Product].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]))
  |--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates]))
  |--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]))

Again, can’t find a performance difference.  In all queries shows, the reads and duration all average out to be virtually identical.  You’d have to test this on some large tables (high number of rows) to figure out if there is a performance difference at all.

Looking at the execution plans for everything, it doesn’t seem that there is a any gain from using INTERSECT and EXCEPT over JOINS or EXISTS/NOT EXISTS except for one big thing: Much simpler code!



Comments

Rick.Stavanja.com said:

# August 1, 2005 3:00 AM

Rick.Stavanja.com said:

# August 1, 2005 3:03 AM

Rick.Stavanja.com said:

Raymond Lewallen wrote a nice piece today about the new INTERSECT and EXCEPT operators in the upcoming

# September 14, 2006 2:33 AM

Leave a Comment

(required)  
(optional)
(required)  

Enter the numbers above:
Add

About Raymond Lewallen

Working primarily in the public sector during his career, Raymond has designed and built several high profile enterprise level applications for all levels of the government. Raymond now works as a solutions architect for EMC. Raymond is an agile coach, Microsoft MVP C# and also president of the Oklahoma City Developers Group and Oklahoma Agile Developers Group. Raymond spends a lot of his time learning and teaching such things as Test Driven Development, Domain Driven Design, Design Patterns and Extreme Programming practices and principles, to name a few. Raymond is also an advocate of Alt.Net. Raymond is primarily a framework guy, so don't ask him anything about UI :) Check out Devlicio.us!