Performance Challenge: #1 Word Count

I had not seen a good .NET performance challenge in a while so I figured that I would start one. Obviously the real benefit will come in about two weeks when all of these items are analyzed and discussed (I wager everyone involved will learn something!). If this works out well I will try to run these either bi-weekly or monthly. Let me know your opinion in comments below.


The Prizes


Aside from becoming alpha nerd for a few weeks, the winner of this contest will receive a book from the very well known computer scientist Donald E. Knuth. Just to give things an interesting twist the book has absolutely nothing to do with optimizations. The prize is a copy of Things a Computer Scientist Rarely Talks About, it is the transcripts of Knuth’s 1999 lecture series at MIT discussing the relations between faith and science. If you wish to watch the web casts of these lectures you still can here.


The runner up will receive a copy of Donald Knuth’s TAOCP V4F3 “Generating all Combinations and Partitions” as this book will help them win a future contest.


There will also be a prize TBD for the fastest safe version of the code.


I will do a submission but I cannot win a prize.


The Problem


Wc (word count) is an old unix program. One of its main tasks is to go through a set of data and count the number of repetitions of a given word. For instance given the sentence “Hello, yes hello” it would tell you that the word Hello was used twice and that the word yes was used once. While seemingly trivial this problem offers many interesting aspects such as dealing with the trade off of memory usage for large vs small data sets.


Hint: since it is a classic problem there is a lot of research on it


Here is a naive version of a word counter with a test harness if you can’t beat this you need more optimization J








    class Program {


        static Dictionary<string, int> WordCounts = new Dictionary<string, int>();


        static void Main(string[] args) {


            string test = “The quick brown fox jumps over the lazy dog”;


            Array.ForEach<string>(test.ToUpper().Split(‘ ‘), new Action<string>(AddToHash));


            foreach (string current in WordCounts.Keys) {


                Console.WriteLine(current + “:” + WordCounts[current]);


            }


        }


        static void AddToHash(string _Item) {


            if (!WordCounts.ContainsKey(_Item)) {


                WordCounts.Add(_Item, 0);


            }


            WordCounts[_Item] += 1;


        }


    }


 


Listing 1: Example of a naive word counter


 


Some notes.


1)      Capitalization differences do not equate to a difference.


2)      You must handle normal punctuation properly there will be no apostrophes/quotes in the test data as they bring up many questions on their own.


3)      Code will be run on a 32 bit processor (don’t use 64 bit stuff hoping for a 64 bit processor)


4)      My processor is a dual core (hint)


5)      You must use the IWordCounter interface


6)      You do not need to sort your data before returning


7)      If you do not pass unit tests performance tests will not be run.


UPDATE: here are some test strings to show … the quick sample code meets some but not all of these

“Hello! Greg” results in Hello 1, Greg 1
“Hello!Greg” results in HelloGreg 1
“Hello\nGreg”  results in Hello 1, Greg 1
“A.D.D.” results in ADD 1
“Wow, how great!” results in Wow 1, how 1, great 1
“wow     \n\n\n    great” results in wow 1, great 1
“it was a man-eating shark.” results in it 1, was 1, a 1, man-eating 1, shark 1

numbers can be a bit odd .. although the data has no numbers in it I will describe the behavior (I will also show how to avoid this in a parser if you wanted to). If you wanted to you could support a special case for numbers but I have NO test data with numbers in it .. this is how they will turn out if you follow the general rule above of ignoring punctuation.



1,000,000 results in 1000000 1 
$1,000,000 results in $1000000 1 
$ 1,000,000 results in $ 1, 1000000 1 
“1.2 1.20 120″ results in 12 1, 120 2


 


Submission Process


Submissions must use the IWordCounter interface provided here, this is to help me by making unit/performance testing a bit easier.








using System;


using System.Collections.Generic;


using System.Text;


 


namespace ConsoleApplication39 {


    struct WordEntry {


        public string Word;


        public int Count;


    }


 


    interface IWordCounter {


        WordEntry[] CountWords(string _Text);


    }


 


    class WordComparer : IComparer<WordEntry> {


        public int Compare(WordEntry x, WordEntry y) {


            return x.Word.CompareTo(y.Word);


        }


    }


 


    class WordCounter :IWordCounter {


        Dictionary<string, int> WordCounts = new Dictionary<string, int>();


       


        private void AddToHash(string _Item) {


            if (!WordCounts.ContainsKey(_Item)) {


                WordCounts.Add(_Item, 0);


            }


            WordCounts[_Item] += 1;


        }


 


        public WordEntry [] CountWords(string _Text) {


            List<WordEntry> Words = new List<WordEntry>();


            Array.ForEach<string>(_Text.ToUpper().Split(‘ ‘), new Action<string>(AddToHash));


            foreach (string current in WordCounts.Keys) {


                WordEntry w;


                w.Word = current;


                w.Count = WordCounts[current];


                Words.Add(w);


            }


            return Words.ToArray();


        }


    }


    class Program {


       


        static void Main(string[] args) {


            string test = “The quick brown fox jumps over the lazy dog”;


            IWordCounter Counter = new WordCounter();


            WordEntry [] Counts = Counter.CountWords(test);


            Array.Sort(Counts, new WordComparer());


            Array.ForEach<WordEntry>(Counts, new Action<WordEntry>(PrintEntry));


        }


 


        static void PrintEntry(WordEntry _Entry) {


            Console.WriteLine(_Entry.Word + ” “ + _Entry.Count);


        }


    }


}


 


Listing 2: Naïve example using required interface


 


Submissions are to be EMAILED to me at Gregory<insertmylastname>1@gmail.com before midnight (EST) on Aug 10th, I will have results ready by the 13th. Do not post your submission in comments or I will be forced to delete it. I will post when I receive entries so you can be sure you entry has actually reached me. Also it is not your best interest to do this as someone could come through and optimize one line of your code to beat you 😛


I will the day that the contest ends create a post including performance results for all code. I will also post a zip file of all submissions (including my performance harness & unit tests so others can view it).


Scoring


3 sets of data will be run through the submissions each will be run many times with a new object constructed for each iteration. There will be one very small set of data, one medium set of data and one large set of data. Each of these sections is worth the same number of points.


1.       10 points


2.       8 points


3.       7 points


4.       6 points


5.       5 points


The submission with the most combined points from the three sections will win. If there is a point based tie, the entry that did better on section three (the large set) will win.


Why three datasets? Optimizations for the largest data set often come at the expense of the smallest, what is sought after is the best all around solution.


Additional Rules


1)      Only one submission per person to be evaluated, you can submit multiples for discussion if you come across an interesting tidbit that you think is worth sharing


2)      Items will be run on the MS 2.0.50727 CLR (there can be no tricks such as adding IL instructions to mono)


3)      You may not call out to unmanaged code!


4)      You CAN use unsafe code


5)      No you can’t know what the data is beforehand.


6)      You must pass unit tests in order to be included (what good is fast if it doesn’t work?)


If you have any questions, need clarification, or have thoughts on the overall idea of these competitions please post them here.


                                                                            

This entry was posted in Contests. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

12 Responses to Performance Challenge: #1 Word Count

  1. Greg says:

    Sorry for delay over the weekend .. results will be up today.

  2. Greg says:

    submissions are closed.

  3. Greg says:

    More submissions.

    Garmon Edmon http://weblogs.asp.net/egarmon/ will be known as Edmon
    Alois Kraus http://geekswithblogs.net/akraus1/ will be known as Kraus
    Steve Bushman will be known as Bushman
    Kevin Idzi http://kevin.idzi.com from CostCo will be known as Idzi
    Blaine Wildon will be known as Wilson
    Brandon Grossutti will be known as Grossutti
    My sumission will be known as Young

    Still 7 hours for submissions.

  4. bushmango says:

    My submission should be in your inbox

  5. bushmango says:

    Hmm these new rules almost completely killed my tokenizer….

  6. johnwood says:

    Man I wish I had time to implement this! Next time be nice to me and choose a problem that doesn’t involve optimizing a tokenizer too :)

  7. Greg says:

    First entry has been received from Eric Bowen (http://scrappydog.com/blogs/blog/default.aspx) of dingineer (http://digineer.com)

    This submission will now be known as Bowen.

  8. Greg says:

    OK. I apologize for my quick response a few hours ago which I am forced to change do to some things I didn’t think of (namely abbreviations). Some people have brought up some really fun edge conditions with punctuation and my term of “handle properly”. If you did it that way .. it should be 1-2 lines of code to change (and again I apologize).

    I am updating the original post to include a bit more specific information. Properly will mean that it does not get included with the word.

    here are some test strings to show …

    “Hello! Greg” results in Hello 1, Greg 1
    “Hello!Greg” it results in HelloGreg 1
    “A.D.D.” results in ADD 1
    “Wow, how great!” results in Wow 1, how 1, great 1
    “wow     \n\n\n    great” results in wow 1, great 1
    “it was a man-eating shark.” results in it 1, was 1, a 1, man-eating 1, shark 1

    numbers can be a bit odd .. although the data has no numbers in it I will describe the behavior (I will also show how to avoid this in a parser if you wanted to). If you wanted to you could support a special case for numbers but I have NO test data with numbers in it .. this is how they will turn out if you follow the general rule above of ignoring punctuation.

    1,000,000 results in 1000000 1
    $1,000,000 results in $1000000 1
    $ 1,000,000 results in $ 1, 1000000 1
    “1.2 1.20 120″ results in 12 1, 120 2

    If you think I am missing a case here let me know otherwise the general rule is to simply remove punctuation

    The data is english data only. Data is passed as a .NET string which means it will be in the form of utf 16 but it will only be english characters.

  9. bushmango says:

    Also, is this for English-only, ascii only, or do we have to support all of unicode?

  10. bushmango says:

    How about you give us one of those fabled unit tests? You know, so we can learn some Test Driven Development =) That would allow us to make sure we understand and interpret the rules the same.

  11. Greg says:

    *DELETED*

  12. legobuff says:

    Question on note 2… does “handle normal punctuation properly” also mean that I need to handle money also? or will this be strictly words?

    And thank you for the mental exercise.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>