Sponsored By Aspose - File Format APIs for .NET

Aspose are the market leader of .NET APIs for file business formats – natively work with DOCX, XLSX, PPT, PDF, MSG, MPP, images formats and many more!

New INTERSECT and EXCEPT operators in Sql Server 2005 T-SQL

This is the first post of many to come that talk about new enhancements in the Sql Server 2005 T-Sql language.

INTERSECT

Let say you have a products table.  I’m going to use the
products table from the AdventureWorks database.  Now let’s say
you imported a list of products into the database into a table called
productduplicates.  Just for a reason, let’s say you manufacturer
sent you a full list of products, and you need to see which products
you already have in your database.

Normally, if you didn’t have matching primary keys on both tables,
like a productId, you would do an inner join on some columns that
uniquely identify the rows, as such:

INNER JOIN

select a.[name], a.productnumber
 from production.product a
inner join production.productduplicates b
 on a.[name] = b.[name]
 and a.productnumber = b.productnumber

Execution plan for inner join

Hash
Match(Inner Join, HASH:([b].[Name], [b].[ProductNumber])=([a].[Name],
[a].[ProductNumber]),
RESIDUAL:([AdventureWorks].[Production].[Product].[Name] as
[a].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] as
[b].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber]
as
[a].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]
as [b].[ProductNumber]))
  |–Table
Scan(OBJECT:([AdventureWorks].[Production].[productduplicates] AS
[b]))       
 
|–Clustered Index
Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]
AS [a]))     

Now you can replace this with a simple INTERSECT statement like the following:

INTERSECT

select [name], productnumber from production.product
intersect
select [name], productnumber from production.productduplicates

Execution plan for INTERSECT

Hash Match(Right
Semi Join,
HASH:([AdventureWorks].[Production].[productduplicates].[Name],
[AdventureWorks].[Production].[productduplicates].[ProductNumber])=([AdventureWorks].[Production].[Product].[Name],
[AdventureWorks].[Production].[Product].[ProductNumber]),
RESIDUAL:([AdventureWorks].[Production].[Product].[Name]=[AdventureWorks].[Production].[productduplicates].[Name]
AND
[AdventureWorks].[Production].[Product].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]))

  |–Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates])) 
  |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]))  

Just like if you were using a UNION for two select statements, the
same number of columns must exist in both select statements, and they
must be of the same datatype.  In both above examples, you get a
list of products that exist in both tables, so you know which products
from the manufacturer you already have listed.

I haven’t seen any performance difference in the two, but I’ve only
played with this on small tables.  The execution plan does show
that while you are doing a hash match inner join on the first query,
the second query performs a hash match right semi join.  Both hash
matches perform 50% of the work in the query.  The other 50% is
split evenly between the table scan and index scan in both
queries.  So what does the fact that a right semi match is
performed when using INTERSECT?  I tells me its doing the same
thing as if we were to use the EXISTS, like so:

EXISTS

select a.[name], a.productnumber
 from production.product a
where exists(select b.[name], b.productnumber from production.productduplicates b
 where a.[name] = b.[name]
 and a.productnumber = b.productnumber)

EXCEPT

Same scenario, if you wanted to know which products you have that do
not exist in the duplicates table, you would normally do a NOT EXISTS
like so:

NOT EXISTS

select a.[name], a.productnumber
 from production.product a
where not exists(select b.[name], b.productnumber from production.productduplicates b
 where a.[name] = b.[name]
 and a.productnumber = b.productnumber)

Execution plan for NOT EXISTS

Hash
Match(Right Anti Semi Join, HASH:([b].[Name],
[b].[ProductNumber])=([a].[Name], [a].[ProductNumber]),
RESIDUAL:([AdventureWorks].[Production].[Product].[Name] as
[a].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] as
[b].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber]
as
[a].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]
as [b].[ProductNumber]))
  |–Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates] AS [b]))
  |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [a]))

And you get back the list of products that you have that aren’t in
the duplicates list.  You can flip this around to figure out which
products are in the duplicates that you don’t have in your products.

Let’s simply the code a little bit by using the EXCEPT operator.  This looks identical to the INTERSECTS query:

EXCEPT

select [name], productnumber from production.product
except
select [name], productnumber from production.productduplicates

Execution plan for except

Hash Match(Right
Anti Semi Join,
HASH:([AdventureWorks].[Production].[productduplicates].[Name],
[AdventureWorks].[Production].[productduplicates].[ProductNumber])=([AdventureWorks].[Production].[Product].[Name],
[AdventureWorks].[Production].[Product].[ProductNumber]),
RESIDUAL:([AdventureWorks].[Production].[Product].[Name]=[AdventureWorks].[Production].[productduplicates].[Name]
AND
[AdventureWorks].[Production].[Product].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]))

  |–Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates]))
  |–Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]))

Again, can’t find a performance difference.  In all queries
shows, the reads and duration all average out to be virtually
identical.  You’d have to test this on some large tables (high
number of rows) to figure out if there is a performance difference at
all.

Looking at the execution plans for everything, it doesn’t seem that
there is a any gain from using INTERSECT and EXCEPT over JOINS or
EXISTS/NOT EXISTS except for one big thing: Much simpler code!

This entry was posted in Most Popular, Sql Development. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

Leave a Reply