Comparing the Performance of Hash Table Implementations

 
One of my favorite programming constructs is the hash table. It feels almost like magic to be able to test whether a collection contains a particular object independently of the size of the collection (hence the term constant time often used to characterize hash tables). Personally, I use hash tables intensively, and they are the cornerstone of the performance of VisualNDepend and Code Query Language (CQL). Performance is critical here since the goal of NDepend is to let .NET developers know anything about their code bases in real time, even on applications of millions of lines of code that span multiple Visual Studio solutions (anything about their code bases includes architecture/dependencies/layering facts, 82 code metric values, changes between two versions, state mutability facts, encapsulation facts…).
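
To make this concrete, here is a minimal C# sketch (not part of the benchmark code) of the constant-time containment test that makes hash tables so useful:

using System;
using System.Collections.Generic;

class ContainmentDemo
{
    static void Main()
    {
        // Fill a set with a million strings; lookup cost stays roughly
        // constant no matter how many items have been inserted.
        var set = new HashSet<string>();
        for (int i = 0; i < 1000000; i++)
        {
            set.Add("Item" + i);
        }

        // A containment test hashes the key and probes a single bucket,
        // instead of scanning the whole collection.
        Console.WriteLine(set.Contains("Item123456")); // True
        Console.WriteLine(set.Contains("NotThere"));   // False
    }
}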

 

I won’t go through all the details of hash tables here (load factor, collisions, buckets, hash computation, hashing on immutable state, caveats, etc.). Jeff Atwood did it very well in his post Hashtables, Pigeonholes, and Birthdays. Frankly, if you are not sure about hash table concepts, or if you don’t use them on a daily basis, go read his post and understand why you are about to use hash tables extensively from now on. Also, if you don’t believe me that hash tables are central to good programming, then believe the Microsoft engineers who provided the method Object.GetHashCode(), meaning that the hash concept applies to every .NET object.
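
As a reminder of what hashing on immutable state looks like in practice, here is a small, hypothetical example (the type name and fields are made up for illustration, not taken from the NDepend code base):

using System;

// A small immutable type: hashing only immutable fields guarantees that
// the hash code cannot change while the object sits inside a hash table.
sealed class MethodId
{
    readonly string m_Name;
    readonly int m_Arity;

    public MethodId(string name, int arity)
    {
        m_Name = name;
        m_Arity = arity;
    }

    public override bool Equals(object obj)
    {
        var other = obj as MethodId;
        return other != null && other.m_Name == m_Name && other.m_Arity == m_Arity;
    }

    public override int GetHashCode()
    {
        // Classic combination of the fields' hash codes.
        return m_Name.GetHashCode() ^ m_Arity;
    }
}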

 

I was curious to compare the performance of the different hash table implementations available to .NET developers. My motivation was to see whether it is worth switching from the .NET Framework hash table implementations to alternative ones.

 

Basically there are two classes in the .NET Framework: System.Collections.Generic.Dictionary<K,V> and System.Collections.Generic.HashSet<K> (I don’t count the non-generic System.Collections.Hashtable). There are also the implementations available in three libraries: C5 (by the IT University of Copenhagen), PowerCollections (by Wintellect) and Mono (by Novell), with the exception that PowerCollections doesn’t seem to provide its own hash dictionary.
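
For reference, the candidate containers are created roughly as follows. The C5 and PowerCollections type names below are quoted from memory of those libraries and should be checked against the versions you actually reference; the Mono.* names refer to the Mono sources compiled under a Mono prefix, as explained further down.

using System.Collections.Generic;

class Candidates
{
    static void Build()
    {
        // .NET Framework implementations.
        var bclDictionary = new Dictionary<string, string>();
        var bclSet = new HashSet<string>();

        // C5 (IT University of Copenhagen).
        var c5Dictionary = new C5.HashDictionary<string, string>();
        var c5Set = new C5.HashSet<string>();

        // PowerCollections (Wintellect): a hash-based set, but no hash dictionary.
        var powerSet = new Wintellect.PowerCollections.Set<string>();

        // Mono (Novell): same API shape as the BCL classes, compiled here
        // under a Mono. prefix to avoid name collisions.
        //var monoDictionary = new Mono.Dictionary<string, string>();
        //var monoSet = new Mono.HashSet<string>();
    }
}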

 

Find below the detailed results. The conclusion is that the .NET Framework implementations are a bit cheaper than the Mono ones, while the C5 and PowerCollections implementations lag behind. Basically, the .NET implementations are between 1.4 and 1.7 times faster on insertion than the Mono ones. Both are equivalent when it comes to testing containment in HashSet<T>, and the Mono implementation of Dictionary<K,V>.TryGetValue(…) is a tiny bit better than the .NET one. One has to consider that the big motivation for using hash tables is when there are a lot of containment tests and retrievals, since these operations are done in constant time. Thus, the Mono weakness on insertion is not that serious, and this is good news for all of us who don’t have access to the class System.Collections.Generic.HashSet<K> because we are still running on .NET v2.0: we can just use the Mono HashSet<T> implementation.

 

Concerning memory, if you look at the results you’ll see that the Mono collections seem up to two times cheaper than the .NET ones, but these results are biased. Indeed, some array objects are maintained internally and I cannot distinguish who references what. I tried to run the memory tests separately, but some arrays are maintained by the CLR and it is still not possible to infer a clear result.

 

Here is the C# source file used to do the tests. Notice that I prefixed the Mono collections with Mono. to avoid collisions with the .NET Framework collections. All the hash tables tested use string generic parameters, and we also ran the tests with Int32 values to get a glimpse of how they behave with a value type. If you have feedback on this testing methodology, please leave it in the comments. I know that the biggest concern is the fact that a single insertion into a hash table can sometimes take a long time, because internally it occasionally triggers a re-arrangement of keys proportional to the number of elements. The only way to smooth out these unwanted effects was to run the tests for several collection sizes.
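
For readers who don’t want to open the file, here is a rough approximation of what such a timing loop looks like; this is not the actual benchmark source, and the sizes and key format are only illustrative:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class HashSetBenchmark
{
    static void Main()
    {
        // Run the same test at several sizes to dilute the cost of the
        // occasional internal resize that a single insertion can trigger.
        foreach (int count in new[] { 10000, 100000, 1000000 })
        {
            string[] keys = new string[count];
            for (int i = 0; i < count; i++) keys[i] = "String" + i;

            var watch = Stopwatch.StartNew();
            var set = new HashSet<string>();
            for (int i = 0; i < count; i++) set.Add(keys[i]);
            watch.Stop();
            Console.WriteLine("Insert {0}: {1} ms", count, watch.ElapsedMilliseconds);

            watch = Stopwatch.StartNew();
            int found = 0;
            for (int i = 0; i < count; i++) if (set.Contains(keys[i])) found++;
            watch.Stop();
            Console.WriteLine("Contains {0}: {1} ms ({2} found)", count, watch.ElapsedMilliseconds, found);
        }
    }
}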

 

The results for both performance and memory profiling were obtained with the great JetBrains dotTrace profiler. Concerning memory profiling, I have to say that I usually use the SciTech .NET Memory Profiler, which has some awesome real-time profiling capabilities with almost no overhead. I chose dotTrace here because its result presentation is nicer and consistent with the performance results I wanted to present.

Concerning performance profiling, Red Gate recently released its .NET Performance Profiler v4.1, which has some great visualization features, but honestly I haven’t found the time yet to dig into it and see whether I could prefer it over my good friend dotTrace.

 

 

 

Hash Table: Performance Profiling Results

Performance for 1M strings.

Performance for 100K strings.

Performance for 10K strings.

Performance for 1000 Int32.

Performance for 10K Int32.

Hash Table: Memory Profiling

Memory for 100 elements

Memory for 1000 elements

Memory for 10M elements

Memory for 100M elements

Memory for 1000M elements

Comment from Thomas Wang:

    One thing that I seldom see covered is the observation on hash table load factors. Load factor does not appear to be linearly correlated with performance. I measured hash table insertion performance on stock hardware while varying the load factor: I saw a 5% slow-down using a load factor of 4, and only a 60% slow-down using a load factor of 16…