Performance Contest #1 Results and Analysis

 


Sorry for the delay in getting the results out. There is a really funny story behind it: I ran 7 hours of tests in debug mode, aggregated the results, and got everything ready before I realized what I had done :-/


You can find the original contest here http://codebetter.com/blogs/gregyoung/archive/2006/07/27/147736.aspx.



Before I announce the winners I want to look at various optimizations for the problem, some of which were used and some of which were not. Many seemed to have realized that the suggestion towards threading was, in this case, a red herring (well, sort of; more on that below). Large parts of this problem's time are spent in the creation of objects, the hash function, and tokenizing the string. Let's look at some strategies for coming up with a good solution.


Order is Important!


One of the key optimizations for the tokenizer is realizing that the order of processing is extremely important. If we know anything about the domain our code will run in, we can often make domain-specific optimizations simply by changing the ordering of our statements.


In the English language, for instance, letters appear an order of magnitude more frequently than punctuation. To optimize for this we should check our most common cases first. This is better shown with a simple example. The following two functions do exactly the same thing, but the second is much faster than the first.










        // 'count' is assumed to be a simple tally type, e.g.
        // class count { public int spaces, numbers, uppercase, lowercase; }

        static count CountWrongOrder(string s) {
            count ret = new count();
            foreach (char c in s) {
                if (c == ' ' || c == '\t' || c == '\n') {
                    ret.spaces++;
                } else if (char.IsDigit(c)) {
                    ret.numbers++;
                } else if (char.IsUpper(c)) {
                    ret.uppercase++;
                } else if (char.IsLower(c)) {
                    ret.lowercase++;
                }
            }
            return ret;
        }

        static count CountRightOrder(string s) {
            count ret = new count();
            foreach (char c in s) {
                if (char.IsLower(c)) {
                    ret.lowercase++;
                } else if (c == ' ' || c == '\t' || c == '\n') {
                    ret.spaces++;
                } else if (char.IsUpper(c)) {
                    ret.uppercase++;
                } else if (char.IsDigit(c)) {
                    ret.numbers++;
                }
            }
            return ret;
        }


Right and wrong order


 


Run across the string "This is some normal English text. Occasionally you will also get a number such as 2" (concatenated 10,000 times), the second version is almost twice the speed of the first. In the first example our most common case (a lowercase letter) has to fail through all of the other conditions, whereas in the second example it is our first condition. The goal should be to check the mutually exclusive conditions in the order that statistical analysis of your data says is correct.
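
If you are not sure what the right order is, measure it. Below is a minimal sketch (not from any contest entry; the helper name is mine) that tallies how often each branch would fire for a sample of your data, so you can order the mutually exclusive checks from most to least frequent.

        // Hypothetical helper: tally how often each character class occurs in a
        // sample so the branches above can be ordered from most to least common.
        // Requires: using System;
        static void MeasureBranchFrequencies(string sample) {
            int lower = 0, spaces = 0, upper = 0, digits = 0, other = 0;
            foreach (char c in sample) {
                if (char.IsLower(c)) lower++;
                else if (c == ' ' || c == '\t' || c == '\n') spaces++;
                else if (char.IsUpper(c)) upper++;
                else if (char.IsDigit(c)) digits++;
                else other++;
            }
            Console.WriteLine("lower={0} spaces={1} upper={2} digits={3} other={4}",
                lower, spaces, upper, digits, other);
        }

Measuring branch frequencies (sketch)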


ToCharArray() ?!


I saw many submissions calling Data.ToCharArray() to get a character array representing the data so they could read it. There is another method on the string object that does a similar job and, although hidden from you, is much faster than ToCharArray(). Consider the following test.








        // requires: using System; using System.Diagnostics;

        static int ORCharsNewArray(string s) {
            char[] tmp = s.ToCharArray();
            int ret = 0;
            foreach (char c in tmp) {
                ret |= c;
            }
            return ret;
        }

        static int ORCharsSameArray(string s) {
            int ret = 0;
            foreach (char c in s) {
                ret |= c;
            }
            return ret;
        }

        static void Main(string[] args) {
            string foo = "123456789012345678901234567890";
            Stopwatch s = new Stopwatch();
            s.Start();
            for (int i = 0; i < 100000; i++) {
                int bar = ORCharsNewArray(foo);
            }
            s.Stop();
            Console.WriteLine(s.ElapsedTicks);
            s.Reset();
            s.Start();
            for (int i = 0; i < 100000; i++) {
                int bar = ORCharsSameArray(foo);
            }
            s.Stop();
            Console.WriteLine(s.ElapsedTicks);
        }


ToCharArray vs get_Chars


 


In debug mode they have nearly identical performance, but in release the ORCharsSameArray function is about 3 times faster. When you use a foreach on a string (or if you index into it) a special method, get_Chars, is called (you can't call this directly yourself). We can see this occurring by looking at the IL in question.








.method private hidebysig static int32 ORCharsSameArray(string s) cil managed
{
      .maxstack 2
      .locals init (
            [0] int32 num1,
            [1] char ch1,
            [2] string text1,
            [3] int32 num2)
      L_0000: ldc.i4.0
      L_0001: stloc.0
      L_0002: ldarg.0
      L_0003: stloc.2
      L_0004: ldc.i4.0
      L_0005: stloc.3
      L_0006: br.s L_0018
      L_0008: ldloc.2
      L_0009: ldloc.3
      L_000a: callvirt instance char string::get_Chars(int32)
      L_000f: stloc.1
      L_0010: ldloc.0
      L_0011: ldloc.1
      L_0012: or
      L_0013: stloc.0
      L_0014: ldloc.3
      L_0015: ldc.i4.1
      L_0016: add
      L_0017: stloc.3
      L_0018: ldloc.3
      L_0019: ldloc.2
      L_001a: callvirt instance int32 string::get_Length()
      L_001f: blt.s L_0008
      L_0021: ldloc.0
      L_0022: ret
}


IL of ORCharsSameArray


 


This method lets you view the data inside the string as if it were an array of chars. Since the data is only being read, there is no need to generate a separate array. The reason the two versions are roughly the same speed in debug is that the method does not get inlined; in release mode it is inlined and becomes nearly as efficient as unsafe code.
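
As a side note, indexing the string directly compiles down to the same get_Chars call, so a plain for loop behaves like the foreach version above; a quick sketch:

        static int ORCharsIndexer(string s) {
            int ret = 0;
            for (int i = 0; i < s.Length; i++) {
                ret |= s[i]; // the indexer compiles down to string::get_Chars(i)
            }
            return ret;
        }

Indexer version (equivalent to the foreach)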


Threading (The Red Herring?)


A few people tried threaded solutions; only one did it efficiently. Garmon got the algorithm right.


The problem with threading is locking. In order to maintain a hash properly between two threads one needs to set up critical sections around it. The big problem is that on every probe one has to worry about the other thread resizing the hash. To get around this one creates a hash per thread; since each thread has its own hash, it can operate on it without locking, as the other thread will never touch it.


Once the two threads are complete one must then merge the two hashes. This will always be an O(n) operation at best, generally O(hashsize), which is why threading was not much of a win here unless you get absolutely huge data sets with lots of repetition. It takes a lot to make up for the O(n) merge, the creation of the second hash, the delay in actually starting the thread, and the additional memory overhead of having two hashes.
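
To make the shape of that concrete, here is a minimal sketch of the per-thread approach (this is not Garmon's actual code; the method names and the pre-split word arrays are assumptions for illustration): each thread counts into its own dictionary with no locking, and the dictionaries are merged once both finish.

        // Requires: using System; using System.Collections.Generic; using System.Threading;
        static Dictionary<string, int> CountWithTwoThreads(string[] firstHalf, string[] secondHalf) {
            // each thread owns its own dictionary, so no locking is needed while counting
            Dictionary<string, int> counts1 = new Dictionary<string, int>(StringComparer.Ordinal);
            Dictionary<string, int> counts2 = new Dictionary<string, int>(StringComparer.Ordinal);

            Thread worker = new Thread(delegate() { CountInto(secondHalf, counts2); });
            worker.Start();
            CountInto(firstHalf, counts1);
            worker.Join();

            // the O(n) merge is the price paid for avoiding locks
            foreach (KeyValuePair<string, int> pair in counts2) {
                int existing;
                counts1.TryGetValue(pair.Key, out existing);
                counts1[pair.Key] = existing + pair.Value;
            }
            return counts1;
        }

        static void CountInto(string[] words, Dictionary<string, int> counts) {
            foreach (string word in words) {
                int existing;
                counts.TryGetValue(word, out existing);
                counts[word] = existing + 1;
            }
        }

Per-thread hashes merged at the end (sketch)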


The key to threads being successful is a high rate of repetition in the data!


Use a Map


All of the submissions used if conditionals to tokenize the data. There were cases such as:








            if ((c >= 'a') && (c <= 'z')) {
                builder.Append(char.ToUpper(value));
            }


Conditional Tokenizing


 


Could we pre-generate this information into a table and simply do a table lookup?








        enum Operations {
            EndWord = 0x0100,
            MoveNext = 0x0200
        }

        static string FormatEntry(int value) {
            return string.Format("0x{0:x4}", value);
        }

        static void MainToRun(string[] args) {
            Console.WriteLine("UInt16 [] map = {");
            for (int i = 0; i < 255; i++) {
                char c = (char)i;
                if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
                    Console.Write(FormatEntry(c | (int)Operations.MoveNext));
                } else if (c >= 'A' && c <= 'Z') {
                    Console.Write(FormatEntry(char.ToLower(c) | (int)Operations.MoveNext));
                } else if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                    Console.Write(FormatEntry(0 | (int)Operations.EndWord));
                } else {
                    Console.Write(FormatEntry(0));
                }
                if (i < 254) {
                    Console.Write(", ");
                }
                if ((i + 1) % 8 == 0) {
                    Console.Write("\n");
                }
            }
            Console.Write("\n}\n");
        }


Code to generate map


 


Note that this code is performing any transformations we may want (lowercasing, for example) and is also storing some additional information in the high bits of the UInt16 (we only use 8 bits for the char itself). In particular it stores a bit that tells us whether what was read is a word terminator, and a bit that tells us whether we should advance the current position of the output buffer. This code will produce output similar to the following:








        static readonly UInt16 [] map = {
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0100, 0x0100, 0x0000, 0x0000, 0x0100, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0100, 0x0000, 0x0000, 0x0000, 0x0224, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x022d, 0x0000, 0x0000,
0x0230, 0x0231, 0x0232, 0x0233, 0x0234, 0x0235, 0x0236, 0x0237,
0x0238, 0x0239, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0261, 0x0262, 0x0263, 0x0264, 0x0265, 0x0266, 0x0267,
0x0268, 0x0269, 0x026a, 0x026b, 0x026c, 0x026d, 0x026e, 0x026f,
0x0270, 0x0271, 0x0272, 0x0273, 0x0274, 0x0275, 0x0276, 0x0277,
0x0278, 0x0279, 0x027a, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0261, 0x0262, 0x0263, 0x0264, 0x0265, 0x0266, 0x0267,
0x0268, 0x0269, 0x026a, 0x026b, 0x026c, 0x026d, 0x026e, 0x026f,
0x0270, 0x0271, 0x0272, 0x0273, 0x0274, 0x0275, 0x0276, 0x0277,
0x0278, 0x0279, 0x027a, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
        };


Map outputted by example


 


Now that we have this map saved off we can write an extremely elegant tokenizer that is also faster than our prior incarnations which used conditionals.








        public unsafe WordEntry[] CountWords(string _Text) {
            table.Clear();
            int loc = 0;
            int lastloc = 0;
            byte[] buf = new byte[_Text.Length * 2];
            fixed (byte* buffer = buf) {
                fixed (char* c = _Text) {
                    char* current = c;
                    char* stop = c + _Text.Length;
                    while (current < stop) {
                        UInt16 val = map[*current];
                        int add = (val >> 9);            // MoveNext bit: should we emit this char?
                        if (add > 0) {
                            buffer[loc] = (byte) (val & 0xFF);
                            loc++;
                        } else if (loc != lastloc && ((val >> 8) & 1) == 1) {
                            // EndWord bit: count the word we just finished
                            table.Increment(buffer + lastloc, loc - lastloc);
                            loc += 4;
                            lastloc = loc;
                        }
                        current++;
                    }
                    if (loc != lastloc) {
                        table.Increment(buffer + lastloc, loc - lastloc);
                    }
                    return table.ToArray();
                }
            }
        }


Map Based Parser


 


Clean, compact, and efficient … what more can you ask for?


A Side Challenge


I gave myself a side challenge yesterday: could I tokenize the string two chars at a time with only a single branch in my loop? It took me a little while to come up with an answer, but one does exist. Here is the code.








                    while (current < stop) {
                        Int32 chars = *current;
                        chars = (chars >> 16) | ((chars & 0xFF) << 8);
                        Int32 processed = parsermap[chars];
                        Int16* tmp2 = (Int16*)tmp;
                        *tmp2 = (Int16)(processed & 0xFFFF);
                        int add = (processed >> 19); //number to add to output pointer
                        tmp += add;
                        tmp2 = (Int16*)current;
                        tmp2 += (processed >> 17) & 0x03; //number to add to buffer pointer (1 or 2)
                        current = (Int32*)tmp2;
                        if ((processed & 0x010000) > 0) {
                            Int32* t = (Int32*) tmp;
                            *t = 0;
                            table.Increment(last, (int) (tmp - last));
                            tmp += 4;
                            last = tmp;
                        }
                    }


Parse 2 chars at a time with only one if statement?


 


In order to really follow this code you will also need to look at the code that generates the 64k map it references (no, I will not post the 64k map here :)). The complete code and the map generator can be found in the MappedWordCounter.cs file. If people would like I can explain further how it works, but basically it sets high bits in the map to define how far to move both the buffer and the output pointers forward (it also stores a bit flag indicating whether the 2 chars contained a word terminator, as in the previous tokenizing example). There is however one odd case which is worth discussing.


Consider reading two characters where the first is an ignored or termination character. With an ignored character you don't want to end up with a "hole" in your array, so you must not advance the output buffer. In this case one is added to the input buffer position, leaving it on an odd character, and zero is added to the output since you don't want to include the character. This means that you read and write the second character twice. A further special case occurs when you have two ignored or termination characters in a row, except there the buffer moves two positions while the output moves zero.
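
For reference, here is a hedged sketch of how a single entry of that 64k map could be packed so that it is consistent with the parser loop above (the real generator is in MappedWordCounter.cs; the field positions here are simply read off the shifts in that loop, so treat this as an illustration rather than the actual code):

        // Pack one parsermap entry for a pair of input characters.
        //   low 16 bits : the (up to) two output bytes, first character in the low byte
        //   bit 16      : set when the pair contained a word terminator
        //   bits 17-18  : how many input chars to advance (1 or 2)
        //   bits 19+    : how many bytes to advance the output pointer (0, 1 or 2)
        static Int32 PackEntry(byte out1, byte out2, bool endsWord, int inputAdvance, int outputAdvance) {
            return out1
                 | (out2 << 8)
                 | (endsWord ? 0x010000 : 0)
                 | (inputAdvance << 17)
                 | (outputAdvance << 19);
        }

Possible packing of a 2-char map entry (illustration only)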


Initial Hash Table Size


In the tests where a new object is created for every call, the hash table size becomes extremely important. If the table is too big it allocates useless memory and has to clear that memory. If the table is too small it has to go through numerous growth operations, which are a fairly big hit. I have yet to find a "good" way of sizing the table except for using domain information such as "every 9 letters or so will on average produce a new word".


In general it is better to err a bit on the side of caution and allocate slightly more memory for your hash table than to err in the other direction and force a growth of your table.


I made YoungCompressed use a 150,000-entry table on every call; check out its performance characteristics compared to the other entries, which were using a table sized at text.Length / 7.
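
As a sketch of that kind of sizing heuristic (the +16 fudge factor and the helper name are illustrative, not from any entry):

        // Size the table from the input using the rough "one word per ~7 characters"
        // domain guess, rounding up a little rather than risking a growth operation.
        static Dictionary<string, int> CreateTable(string text) {
            int estimatedWords = (text.Length / 7) + 16;
            return new Dictionary<string, int>(estimatedWords, StringComparer.Ordinal);
        }

Sizing the hash from the input length (sketch)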


The Default Equality Comparison is SLOW


No submissions used this, but it can quickly speed up many of them. I am going to pick on Brandon Grossutti's submission here, but the optimization applies to every entry that uses a Dictionary<,>. Simply changing Brandon's code from


Dictionary<string, int> MyWords = new Dictionary<string, int>();


to


Dictionary<string, int> MyWords = new Dictionary<string, int> (1, StringComparer.Ordinal);


took about 7% off the total running time of the submission. This is an important one to remember.
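
If you want to verify the gain on your own data, a micro-benchmark along these lines will show it (a sketch only; the 7% figure above was measured against the full submission, not this toy loop):

        // Requires: using System; using System.Collections.Generic; using System.Diagnostics;
        static void CompareComparers(string[] words) {
            Dictionary<string, int> withDefault = new Dictionary<string, int>();
            Dictionary<string, int> withOrdinal = new Dictionary<string, int>(StringComparer.Ordinal);

            Stopwatch sw = Stopwatch.StartNew();
            foreach (string w in words) { int v; withDefault.TryGetValue(w, out v); withDefault[w] = v + 1; }
            Console.WriteLine("default comparer: {0} ticks", sw.ElapsedTicks);

            sw = Stopwatch.StartNew();
            foreach (string w in words) { int v; withOrdinal.TryGetValue(w, out v); withOrdinal[w] = v + 1; }
            Console.WriteLine("ordinal comparer: {0} ticks", sw.ElapsedTicks);
        }

Comparing the default and ordinal comparers (sketch)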


Write a Custom Hash Table


It is well known that code which is designed to be generic will often be less efficient than code tailor made for the job at hand (in other words generality comes at a loss of performance). The .NET dictionary and hash classes are both very generic. By writing your own version of these classes there is a performance gain to be had. I wrote a few varying forms of my hash table that are included with the code.


The first version of the table works in a very similar fashion to the standard table. It stores the string keys as char* and uses a modulus-based system for finding the next available slot (i.e. slot = (slot + 1) % tablesize). This method is quite fast, but there are a few optimizations that can be made.


One of these optimizations is to get rid of the modulus, which is a slow operation on most machines (and it is used extremely often: on every probe). In order to get rid of the modulus we need to place a special requirement on the hash table. To be exact, you need to constrain the size of the hash table to be a power of two (this can be seen in "bithash" and is in my overall submission).








        /// <summary>
        /// calculates the next highest power of two
        /// </summary>
        /// <param name="_Number"></param>
        private static UInt32 CalculateNextPowerOfTwo(uint _Number) {
            for (int i = 8; i < 32; i++) {
                uint current = (uint) 1 << i;
                if (current > _Number) {
                    return current;
                }
            }
            return uint.MaxValue;
        }


Code to find the next power of two above a given number


 


If the table size is a power of two you can use a bit mask instead of a modulus to ensure that your current value stays within the bounds of your hash. Once the size is assured to be a power of two, one can calculate the bitmask for the hash table as hashsize - 1. All that is then required is the mask operation, which can be seen here:


uint slot = (uint)(Hash & m_HashMask);


The bit mask is significantly faster on most architectures; this change alone yielded nearly a 10% gain for the algorithm on my machine.
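
Side by side, the probe step changes as shown in the sketch below (IsMatchOrEmpty is a hypothetical stand-in for the real table's slot check; hashMask is tablesize - 1 with tablesize a power of two):

        // Linear probing with a mask instead of a modulus (sketch).
        static uint FindSlot(uint hash, uint hashMask) {
            uint slot = hash & hashMask;          // instead of: hash % tablesize
            while (!IsMatchOrEmpty(slot)) {
                slot = (slot + 1) & hashMask;     // instead of: (slot + 1) % tablesize
            }
            return slot;
        }

        // Placeholder so the sketch compiles; the real check compares the stored
        // key at 'slot' or detects an empty slot.
        static bool IsMatchOrEmpty(uint slot) {
            return true;
        }

Mask-based probing (sketch)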


Buffer?


Since I am using char* in my hash table, I do not need to create a dedicated string for each key (and I can treat the data as a uint32* when I need to do things like comparisons). The big problem you will run into here is that you need to keep the memory that the string lives in pinned. This can be done either explicitly or by holding a GCHandle for the object (as I do with the actual data in the hash table). If you do not, you can end up with seemingly random bugs that occur when the GC compacts the heap (the string you were pointing to has moved).


In order to get around the nightmare of trying to maintain GCHandles my tokenizer puts the data into a big output buffer. This output buffer is unsafe and it is pinned for the duration of the operation (so we know it won’t move).


This methodology is not very general but it does help prevent a lot of other overhead.
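
For completeness, here is a minimal sketch of the two pinning options mentioned above: pinning with fixed for the duration of a call, or holding a GCHandle when the pointer has to outlive the block (the method names are mine):

        // Requires: using System.Runtime.InteropServices;

        // Option 1: pin only for the duration of the operation.
        static unsafe void UsePinnedBuffer(byte[] buffer) {
            fixed (byte* p = buffer) {
                // p is stable here; the GC cannot move 'buffer' inside this block
            }
        }

        // Option 2: hold a GCHandle so the pointer stays valid across calls.
        // The caller must eventually call handle.Free() or the buffer stays pinned forever.
        static unsafe byte* PinLongTerm(byte[] buffer, out GCHandle handle) {
            handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
            return (byte*)handle.AddrOfPinnedObject().ToPointer();
        }

Pinning with fixed vs. a GCHandle (sketch)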


Equality Compare (Custom Hash)


One of the slowest items in the code is the equality compare of an entry in the hash table. Consider the following code:








                    for (int j = 0; j < _Length; j++) {
                        if (_Word[j] != Current->Chars[j]) {
                            //failed
                        }
                    }


 


Naïve Equality Compare


 


This code goes character by character comparing the string in our hash to the string we are looking up (keep in mind that this gets called at least once for every probe of the hash). A better way of doing this is to unroll the loop a bit by using a uint32* to perform the comparison; that way we can compare two characters on every iteration of the loop instead of just one (remember that a character is 16 bits).


We will, however, run into a problem comparing two characters at a time: what happens if we have a string with an odd number of characters? We really only have two choices here. The first option is to loop to one character short of the end and then check the last character separately; this is a generally good solution, but it can be optimized further if you can control the data coming in.


Since there is only one client for this hash, we can definitely control the data coming in. Imagine the words "cat", "dog", and "the" in memory as shown below.

















| C | A | T | D | O | G | T | H | E |   |   |   |   |


In this case we would need to use the method above (where we special-case the last character and read it separately). Since we control the memory layout, however, we can avoid this by "aligning" the memory as follows.

















| C | A | T | 0 | D | O | G | 0 | T | H | E | 0 |


 


By aligning our memory we have ensured that we can always use just the integer-based comparison to compare two characters at once! There is no need to special-case anything; for a word of odd length we simply compare the last letter together with the padding character past it. Since both words are padded the same way, this optimization is completely safe so long as we are diligent about using the same padding character every time.
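
Put together, the padded comparison can then be written something like this (a sketch; both buffers are assumed to be padded out to an even number of characters with the same padding character):

        // Compare two padded char buffers two characters (one uint) at a time.
        static unsafe bool EqualsPadded(char* word, char* entry, int lengthInChars) {
            uint* a = (uint*)word;
            uint* b = (uint*)entry;
            int blocks = (lengthInChars + 1) / 2; // round odd lengths up; padding makes this safe
            for (int i = 0; i < blocks; i++) {
                if (a[i] != b[i]) {
                    return false;
                }
            }
            return true;
        }

Padded two-characters-at-a-time compare (sketch)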


Compress Data


As I said above, a character is 16 bits. We do not, however, care about the high bits, as we are only dealing with English text, which can be treated as ASCII characters (8 bits). To take advantage of this we can store the data as a byte array. Since we are storing the data as bytes we can now compare four characters at a time when we have to compare strings.


Since we are using a map to handle translations for us, the conversion on the way in is pretty much free (it's just different values in our map). In general we should also save memory as we are only using 8 bits per character, but our alignment is now 4 bytes instead of the 2 it was in the char example. If we happen across an edge case of many one-character words we may actually end up using more memory.


There is also a downside to this optimization. When it comes time for us to take our data out of the hash table to create our return value we have to convert all of the data back from being a byte array of ASCII characters to being UTF-16 characters that .NET knows and loves. The code for this can be seen here.








                        UInt16* stop = (UInt16*) (ptr + Current->Length);
                        while (ptr < stop) {
                            UInt16 chars = *ptr;
                            // expand two packed ASCII bytes back into two UTF-16 chars
                            *buf = ((chars >> 8) << 16) | (chars & 0xFF);
                            ptr++;
                            buf++;
                        }


Decompress byte data into a string 2 chars at a time


 


So in short, this optimization saves us time on the compare and saves us memory in most cases, but costs us time to get the strings back out. For English data it generally comes out about even, but if you are dealing with longer words and/or high rates of repetition then this optimization can really pay off.


Change the Hash Algorithm


If you look through my code, I use a version of the string.GetHashCode() method pulled out with Reflector to hash my strings (Alois does this as well). One place I sought to optimize was this algorithm, as the hash function gets called an insane number of times (at a minimum once per word). I tried two other algorithms: a variant of a Zobrist hash and a fairly simple hash that I came across. I am including the source of all three.








        public static unsafe uint CalculateHash(char* _String, int _Length) {
            uint num1 = 0x15051505;
            uint num2 = num1;
            uint* numPtr1 = (uint*)_String;
            for (int num3 = (int)_Length; num3 > 0; num3 -= 4) {
                num1 = (((num1 << 5) + num1) + (num1 >> 0x1b)) ^ numPtr1[0];
                if (num3 <= 2) {
                    break;
                }
                num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr1[1];
                numPtr1 += 2;
            }
            return (num1 + (num2 * 0x5d588b65));
        }


Modified dotnet hash


 








        public unsafe static Int32 Hash(char* str, int length) {
            UInt32* ptr = (UInt32*)str;
            UInt32* stop = (UInt32*)(str + length);
            UInt32 Hash = 0;
            while (ptr < stop) {
                Hash ^= (hashmap[ptr[0] & 0xFFFF] ^ hashmap[ptr[0] >> 16]);
                Hash ^= (hashmap[ptr[1] & 0xFFFF] ^ hashmap[ptr[1] >> 16]);
                ptr += 2;
            }
            return (Int32) Hash;
        }


Zobrist variant


 








        public static unsafe uint CalculateHash(char* _String, int _Length) {
            int i = 0;
            UInt32* tmp = (UInt32*)_String;
            while (_Length > 0) {
                i = (int) ((i << 3) ^ (*tmp++));
                _Length -= 4;
            }
            return (uint) i;
        }


Simple Hash


 


I pulled these into a separate application and wrote a simple performance measurement. For 10,000,000 identical strings I received the following results.
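
The measurement itself was nothing fancy; a harness along these lines reproduces the shape of the numbers (a sketch calling the CalculateHash shown above; the word being hashed and the iteration count are arbitrary):

        // Requires: using System; using System.Diagnostics;
        static unsafe void TimeHash(string word) {
            Stopwatch sw = Stopwatch.StartNew();
            fixed (char* p = word) {
                for (int i = 0; i < 10000000; i++) {
                    CalculateHash(p, word.Length);
                }
            }
            sw.Stop();
            Console.WriteLine("{0:0.00} seconds", sw.Elapsed.TotalSeconds);
        }

Timing harness (sketch)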
















Algorithm        Time (in seconds)
DotNetHash       2.21
ZobristVariant   2.47
Simple           1.4


 


This would lead one to believe that the simple hash may actually be the better overall choice. In practice, however, it produces slower overall performance because it does not hash as well, which causes more collisions in the hash table (its simplicity creates other issues as well). I have to say that I am fairly impressed with the .NET implementation performance-wise, and I continued to use it.


Results


The results varied extremely from entry to entry. Here is an explanation of the various tests (you can download them from the "attachment" at the bottom or from http://codebetter.com/blogs/gregyoung/attachment/148292.ashx).


1)      LotsOfWords (run 200 times): this test uses a string of 110,000 words separated by spaces. The thought behind this test is to try to catch people "cheating" with hashing algorithms and the like; it is also a worst-case scenario for things like buffer management if you don't use a rolling buffer. Used with a new object every time, this test gives information on the effectiveness of the initial hash sizing (if the hash starts off too small this test will suffer greatly due to copies).


2)      FewWordLotsOfTimes (run 200 times): this test uses a small string that contains the oddities listed in the original post. It is designed to verify that the hash counts properly (including for large numbers of items) and that the tokenizer works properly. Used with a new object every time, this test also gives information on the sizing of the hash: if you make your hash too big initially you will suffer from creating and initializing a huge buffer on every iteration.


3)      Small.txt (run 500,000 times): this reads the file "small.txt", which is a short paragraph on premature optimization (scored as small).

4)      Premature.txt (run 200,000 times): this reads the file "premature.txt", which is a medium-length document discussing premature optimization (scored as medium).

5)      Warandpeace.txt (run 2,000 times): this reads the file "warandpeace.txt", which is the full text of War and Peace (scored as large).

6)      Bible.txt (run 2,000 times): this reads the file "bible.txt", which is the full text of the Bible. This test is interesting since the Bible has a higher rate of repetition than War and Peace.

7)      Walden.txt (run 2,000 times): this reads the file "walden.txt", which is the full text of Walden by Henry Thoreau. This test is interesting because it has a lower rate of repetition than War and Peace.


A few people had some slight problems with things like "$"; it was my mistake to update the spec after the fact. My guess is that they never saw the update after the first day, so I let it slide and corrected it where I could.


Wilson was run separately from the rest. It was much slower than the others as it used an ArrayList, resulting in an O(n) lookup. I do have to say, though, that it passed the tests with flying colors right off the bat (including items such as "$").


My generalized entry here would be YoungBitCompressed. It uses most of the optimizations listed above (power-of-two hash, buffer alignment, compressed buffers (as bytes), and a map-based tokenizer). The other entry of interest is YoungCompressed, which I left optimized for larger data sets (it always uses a 150k hash table). It is quite quick on the larger items (almost as fast as the power-of-two hash table) but very slow on the shorter data sets.


Now for the results (all times are in seconds) *drum roll* (when looking at these remember to multiply by 1.7 or so and have a laugh at the time I wasted running them without JIT optimizations :().


 


 



Counter                            Safe?   LotsOfWords   LotsofTimes   Small        Medium    Large     Bible     Walden
Young Mapped Compressed            No      13.71         38.09         26.92        156.96    234.4     354.78    54.84
YoungCompressed(const 150k hash)   No      14.07         29.66         1010.91      510.15    202.45    270.72    51.15
YoungBitCompressed                 No      14.96         30.07         21.03        127.92    189.3     265.54    43.58
Young                              No      15.7          36.36         26.63        148.82    223.06    303.39    50.23
Grossutti                          Yes     21.5          159.5         66.75        479.88    770.13    1296.41   174.95
Bowen                              Yes     82.28         257.28        100.6        739.53    1389.11   1842.78   362.4
Bushman                            Yes     25.63         110.78        65.35        406.65    621.41    857.29    144.66
Idzi                               Yes     15.49         136.32        53.09        381.15    698.01    1044.31   150.39
Kraus                              No      17.6          84.21         47.99        320.49    488.15    684.18    101.45
Garmon                             Yes     40.37         103.84        85.94        661.95    911.94    1078.82   193.9
Wilson                             Yes     34800         1200          1000 (est)   1200      DNR       DNR       DNR

(Wilson's times are extrapolated.)


 


Final Scores



Entry       Small   Medium   Large   Total
Kraus       10      10       10      30
Idzi        8       8        7       23
Bushman     7       7        8       22
Grossutti   6       6        6       18
Garmon      5       5        5       15
Bowen       0       0        0       0
Wilson      0       0        0       0


 


Congratulations!


Kevin Idzi and Steve Bushman: this was a very heated race for second. Steve won the large category, but Kevin squeaked out on top of the smaller categories to win by a single point in a photo finish. Kevin takes second place and a copy of Donald Knuth's TAOCP V4F3, "Generating All Combinations and Partitions", plus the bonus for safe code only.


 


And of course, Alois Kraus will receive a copy of Things a Computer Scientist Rarely Talks About for winning all three categories. Way to go, Alois!


Great job guys!


I hope everyone has had fun and learned something through this; I know I have in both regards. Let me know any questions/comments as I get ready for the next challenge a bit later in the week!


After speaking with a few people I believe that next time I will choose a slightly smaller problem (although my entry is only about 140 lines of code total). I will be posting another problem this week, but before posting it I would like to get people's opinions on whether:


1)      this problem size is OK

2)      the time frame is OK; I understand that people are busy, and getting 150 lines of heavily optimized code written can often be a difficult task


 


 


 


 


Please remember that my code is there for reference only; it is not a "right" answer, nor is it as optimized as it could possibly be. I am quite sure my entry could still be optimized by a factor of two. Maybe someone can come through and finish this up with some smart micro-optimizations? :)


10 Responses to Performance Contest #1 Results and Analysis

  1. kevinidzi says:

    Yes – that is because that was what was submitted. As far as the contract, the interface just said WordEntry objects, what I changed didn’t nullify the interface itself (just the object being passed through the interface :) ) I should have asked if that was doable before using it, I admit – I like the way you grouped all of the projects together and setup the single solution. I wasn’t sure how you were going to package up the outputs. I’ll keep that in mind next time :)

  2. Greg says:

    Alois, I saw that article today as well and thought wow, I wonder if those could have been useful :) It has been too long since I have used that (I think once in a data structures class way back when :))

    Kevin: The struct was defined originally and was used in the interface contract :)

    I have to admit I was hoping to see someone do a slick optimization where they also created a hash that indexed a set of array chunks then used buffer.BlockCopy to copy them to the new array :) I never got a chance to do that (it would be in my ToArray method). This would have been a nice optimization.

    Of course you are also comparing to Alois running with a struct still yes?

  3. kevinidzi says:

    In looking over the tests that were run, I found out that my tests were run with the WordEntry as a struct, not as a class (which is what I submitted). I re-ran the numbers on my machine and got some interesting results:
    (i hope i can use the pre tag!)

    Idzi (used in original results)
    	LotsOfWords Time=		00:00:15.6431977 
    	SameItemLotsOfTimes Time=	00:02:13.2396571 
    	File:data\small.txt Time=	00:00:54.7412453 
    	File:data\premature.txt Time=	00:06:51.8044892 
    	File:data\warandpeace.txt Time=	00:12:20.8279974 
    	File:data\bible.txt Time=	00:17:11.8194954 
    	File:data\walden.txt Time=	00:02:40.8796759 
    
    Idzi (matching submission)
    	LotsOfWords Time=		00:00:20.2047430 
    	SameItemLotsOfTimes Time=	00:01:30.7247786 
    	File:data\small.txt Time=	00:00:43.7896991 
    	File:data\premature.txt Time=	00:05:12.9812222 
    	File:data\warandpeace.txt Time=	00:09:34.5107560 
    	File:data\bible.txt Time=	00:12:41.5713927 
    	File:data\walden.txt Time=	00:01:49.0632337 
    
    Kraus
    	LotsOfWords Time=		00:00:17.9200076 
    	SameItemLotsOfTimes Time=	00:01:22.9423289 
    	File:data\small.txt Time=	00:00:47.5256895 
    	File:data\premature.txt Time=	00:05:17.1119066 
    	File:data\warandpeace.txt Time=	00:08:01.0490335 
    	File:data\bible.txt Time=	00:10:44.7056665 
    	File:data\walden.txt Time=	00:01:36.6565953 
    
    Young
    	LotsOfWords Time=		00:00:16.5721912 
    	SameItemLotsOfTimes Time=	00:00:35.3185721 
    	File:data\small.txt Time=	00:00:28.5897805 
    	File:data\premature.txt Time=	00:02:33.6622168 
    	File:data\warandpeace.txt Time=	00:03:52.5732323 
    	File:data\bible.txt Time=	00:05:11.5162662 
    	File:data\walden.txt Time=	00:00:48.9111570 
    

    The end result looks like the code I submitted actually came ahead of Alois for the small and premature tests (it was a very tight race). Although mine fell behind for the others – but by not nearly as much as the version that was actually tested.

    So that kind of goes to what I mentioned before, here is a managed app which is very similar in some of the benchmarks as an unsafe application. The remaining gaps with the unsafe version can be quickly removed by updating just a few lines of code (same algorithm). Of course, I can’t get the perf that Greg got – but it seems nobody else went that far either. So I guess if you have the ‘perfect’ algorithm, it might be best in unsafe mode, but many times you can be leapfrogged by something safe just because the approach is a bit different.

  4. Alois says:

    To participate at this contest successfully (read try to win) I did put about 8 hours of work into this. I think I would not have been able to do this without having a free week where I had the time. When you program already 60h per week and then add another 8 really challenging hours to your weekend your doctor will diagnose a heart threatening caffeine level in your blood and a strong reaction to unfiltered sun light ;-).
    I think that a smaller problem would allow more people to participate at these contests.
    One word about optimizations. The concept of Bloom Filters could help.

    http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
    Joe Duffy has written a nice article about them as well:
    http://www.bluebytesoftware.com/blog/

    I think that with Bloom Filters we could build an even better hash table, at least for texts with a very low repetition count.

    Yours,
    Alois Kraus

  5. Greg says:

    Kevin, a general rule of thumb (depending on the algorithm obviously) is about a 10-15% loss compared to a native high level language such as C++ (not including JIT time). Obviously native in, say, assembler is a whole different story …

    Most of my changes were algorithmic. I mainly used unsafe as some of the algorithmic changes would not have done well in safe code (example: string bufferring + aligning)

    It’s funny you should mention this though .. stay tuned :)

  6. kevinidzi says:

    I didn’t choose unsafe because although it might get the maximal performance in some cases – As Greg has officially schooled us pretty well here on that 😉 I think it is more to do with algorithms. I think with the right algorithms, safe code could come very close to unsafe code. And looking at this solution, mine was far from that mark. .

    It’s nice to see all of the different ideas and methods. After watching Rico (http://blogs.msdn.com/ricom/archive/2005/05/18/419815.aspx) go toe to toe against a very optimized c++ (which eventually won out), it is clear that unsafe mode is only as good as your algorithm. Having a safe managed application which is running an excellent algorithm, and which might be faster than a mediocre unsafe application is a good thing :)

  7. Alois says:

    All points. Ohhhh 😉

    Yessssss. Thank you for the Book Greg. Although you have a solution which does mine beat by a factor two still. But lucky me you can’t win your own book ;-).

    Bushmango: I am really interested why nobody did choose unsafe code to get maximal performance.

  8. Greg says:

    Frans .. you could precalc those items .. of course in the case of the shift .. the 1 cycle it takes to shift is less than the number of cycles to do a look up.

    Let’s propose that you replaced the 2 shifts and an and with 2 table look ups (or another table lookup and a split apart of the data).

    Let’s presume that the table lookups takes one cycle (which they don’t).

    It would save 1 clock cycle … Don’t get me wrong its savings but I don’t think that is the place to look for optimizations. A bigger concern of mine is that the compiler generates the following code on the read.

                           UInt16 val = map[*current];
    000000c7  mov         eax,dword ptr ds:[02277444h]
    000000cc  movzx       edx,word ptr [edi]
    000000cf  cmp         edx,dword ptr [eax+4]
    000000d2  jae         00000161
    000000d8  movzx       edx,word ptr [eax+edx*2+8]

    Now this is a place where some optimization can be done! I will post a version that uses an array instead of a pointer to avoid this.

    Also in going through looking at some disassembled code …

    “UInt16 val = map[*current];
    if ((val >> 9) > 0) {
    buffer[loc] = (byte) (val & 0xFF);
    loc++;
    } else if (loc != lastloc && ((val >> 8) & 1) == 1) {
    table.Increment(buffer + lastloc, loc - lastloc);
    loc += 4;
    lastloc = loc;”

    is slightly faster than the original code (by 1 cycle) as the add variable is no longer used in that example and does not really require an assignment (I am kind of surprised that the optimizer didn’t pick up on this).

  9. bushmango says:

    One point! oooooooooh!

  10. FransBouma says:

    Your code:
    “UInt16 val = map[*current];
    int add = (val >> 9);
    if (add > 0) {
    buffer[loc] = (byte) (val & 0xFF);
    loc++;
    } else if (loc != lastloc && ((val >> 8) & 1) == 1) {
    table.Increment(buffer + lastloc, loc - lastloc);
    loc += 4;
    lastloc = loc;”

    You use a pre-calc table, though you still have to shift TWICE, and at least ONCE. As this shift is occuring a hecka of a lot of times, it’s IMHO significant to pre-calc the shifts to either a more clever pre-calc table or to use different checks with the table data. :)
