LINQ Intersect() 2.7x faster with HashSet

Sometime one shouldn’t be too optimistic with .NET Fx optimization. I knew that the extension method IEnumerable.Count() was optimized to check if the enumerable was a collection, in such case the Collection.Count property was called instead (see Count() impl at the end of the post).

So when it comes to set operations such as Intersect Union… if one of the sequence involved was an HashSet<T>, I assumed that the obvious optimization of using the O(1) HashSet<T>.Contains() magic, was implemented. But actually, by decompiling I discovered that it was not implemented. A Set<T> was created each time the code achieve a common set operation, no matter the type of sequences. In the case where one wishes to do set operations with an existing HashSet<T> this is unfortunate. The gain one can expect is around 2.7, with variations depending on the sequences collateral size. This comes from the fact that with the HashSet<T> optimization, the operation cost is O(other sequence size).

  • If the HashSet<T> is bigger than the other sequence, gains factor can be high (like 15x).
  • If the HashSet<T> is smaller than the other sequence, gains factor tends to 1x.
  • If the HashSet<T> size is comparable to the other sequence size, the gain factor is around 2.7.

Here are quantitative measures:

nbElemHashsetA=1000  nbElemB=1000  intersectionSize=800  nbLoop=10000
Gain : 283%
nbElemHashsetA=1000  nbElemB=1000  intersectionSize=200  nbLoop=10000
Gain : 270%
nbElemHashsetA=1000  nbElemB=100  intersectionSize=50  nbLoop=10000
Gain : 1539%
nbElemHashsetA=100  nbElemB=1000  intersectionSize=50  nbLoop=10000
Gain : 127%

Here is the code optimization:

Here you can copy\paste the whole program used for measure.

Note 1: If one has information on how the internal Set<T> compare to the public HashSet<T> I am interested. I can see Set<T> relies also on Hash computation.

18June2011: After analysis, this class Set<T> is just a lightweight HashSet<T>. The hash algorithm used is strictly identical to the one of HashSet<T>. Since performances are strictly equivalent, I am still wondering its reason of being there??

Note 2: In the case of doing operations on two sequences (not HashSet<T>), since the second sequence will be transformed into a Set<T> under the hood, it appears that longSeq.Intersect(shortSeq) is preferable than shortSeq.Intersect(longSeq) (both in terms of memory and performance). Here also the .NET Fx Impl could be optimized by checking if seqs are both ICollections<T> or ICollection<T>, and if yes, compare the relative size and do the optimization.

Note 3: A Simple IEnumerable<T>.ToHashSet() extension method would be welcomed in the .NET Fx.

Note 4: I feedback on MS Connect here, please vote up.

Note 5: As Jon Skeet just noted, it’s worth noting that you need to make sure all the EqualityComparers are appropriate… otherwise you could be intersecting (say) a list of strings with a case-insensitive HashSet<string>, wanting a case-sensitive comparison, and you wouldn’t get the right results.

Note 6: Here is the IEnumerable<T>Count() pseudo impl:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
public static int Count<TSource>(this IEnumerable<TSource> source) {
if (source == null) {
throw Error.ArgumentNull("source");
}
var collectionT = source as ICollection<TSource>;
if (collectionT != null) {
return collectionT.Count;
}
var collection = source as ICollection;
if (collection != null) {
return collection.Count;
}
int num = 0;
using (IEnumerator<TSource> enumerator = source.GetEnumerator()) {
while (enumerator.MoveNext()) {
num++;
}
}
return num;
}
view raw LINQCount.cs hosted with ❤ by GitHub

This entry was posted in Uncategorized. Bookmark the permalink. Follow any comments here with the RSS feed for this post.
  • http://www.facebook.com/anders.borum Anders Borum

    Bart, who’s talking about SQL Server here?

  • http://www.facebook.com/anders.borum Anders Borum

    It’s interesting to see Microsoft has closed the ticket as “won’t fix”. The fact that the reviewer of the ticket thought you were talking about LINQ to SQL just painted a picture of the person not even caring to check this page (as linked from the ticket).

    Closing the ticket with no clarification message comes across a bit arrogant I think.

  • Anonymous

    approval

  • http://twitter.com/bartczernicki Bart Czernicki

    This is what SQL Server does behind the scenes for you if you do not have an index on a table (a heap).  Alternatively, you could actually write indexes for LINQ by overriding the “join” syntax.  I found that improves performance more than 2.7x…more like 15x+…