Sometime one shouldn’t be too optimistic with .NET Fx optimization. I knew that the extension method IEnumerable.Count() was optimized to check if the enumerable was a collection, in such case the Collection.Count property was called instead (see Count() impl at the end of the post).
So when it comes to set operations such as Intersect Union… if one of the sequence involved was an HashSet<T>, I assumed that the obvious optimization of using the O(1) HashSet<T>.Contains() magic, was implemented. But actually, by decompiling I discovered that it was not implemented. A Set<T> was created each time the code achieve a common set operation, no matter the type of sequences. In the case where one wishes to do set operations with an existing HashSet<T> this is unfortunate. The gain one can expect is around 2.7, with variations depending on the sequences collateral size. This comes from the fact that with the HashSet<T> optimization, the operation cost is O(other sequence size).
- If the HashSet<T> is bigger than the other sequence, gains factor can be high (like 15x).
- If the HashSet<T> is smaller than the other sequence, gains factor tends to 1x.
- If the HashSet<T> size is comparable to the other sequence size, the gain factor is around 2.7.
Here are quantitative measures:
nbElemHashsetA=1000 nbElemB=1000 intersectionSize=800 nbLoop=10000
Gain : 283%
nbElemHashsetA=1000 nbElemB=1000 intersectionSize=200 nbLoop=10000
Gain : 270%
nbElemHashsetA=1000 nbElemB=100 intersectionSize=50 nbLoop=10000
Gain : 1539%
nbElemHashsetA=100 nbElemB=1000 intersectionSize=50 nbLoop=10000
Gain : 127%
Here is the code optimization:
Here you can copy\paste the whole program used for measure.
Note 1: If one has information on how the internal Set<T> compare to the public HashSet<T> I am interested. I can see Set<T> relies also on Hash computation.
18June2011: After analysis, this class Set<T> is just a lightweight HashSet<T>. The hash algorithm used is strictly identical to the one of HashSet<T>. Since performances are strictly equivalent, I am still wondering its reason of being there??
Note 2: In the case of doing operations on two sequences (not HashSet<T>), since the second sequence will be transformed into a Set<T> under the hood, it appears that longSeq.Intersect(shortSeq) is preferable than shortSeq.Intersect(longSeq) (both in terms of memory and performance). Here also the .NET Fx Impl could be optimized by checking if seqs are both ICollections<T> or ICollection<T>, and if yes, compare the relative size and do the optimization.
Note 3: A Simple IEnumerable<T>.ToHashSet() extension method would be welcomed in the .NET Fx.
Note 4: I feedback on MS Connect here, please vote up.
Note 5: As Jon Skeet just noted, it’s worth noting that you need to make sure all the EqualityComparers are appropriate… otherwise you could be intersecting (say) a list of strings with a case-insensitive HashSet<string>, wanting a case-sensitive comparison, and you wouldn’t get the right results.
Note 6: Here is the IEnumerable<T>Count() pseudo impl: