Sponsored By Aspose - File Format APIs for .NET

Aspose are the market leader of .NET APIs for file business formats – natively work with DOCX, XLSX, PPT, PDF, MSG, MPP, images formats and many more!

An in depth look at for loops

My last two posts have been about Performance Measurement and Viewing Unmanaged Code in VS.NET, these posts while rather interesting have really been to set us up for the this discussion. If you have not read these posts you may want to as I will be using information from them often in this post.


There are a lot of posts on the Internet discussing varying methods of looping using for loops and which perform best. These posts also generally give advice as to how you should handle your looping based upon performance metrics. Most of these types of posts suffer from an affliction I have discussed previously; they look at IL (Intermediate Language) to rationalize performance which is just flat out wrong as subtle differences can cause JIT (Just in Time) optimizations to go haywire.


In this post we will look at the unmanaged code produced by varying constructs to come up with a definitive answer to the performance question. We will also look at some other optimization methods that people claim happen but do not always happen and discuss some reasons why they may not be happening and of course set some rules for fast looping.


In trying to apply this the version of your framework is important as the JIT may vary from version to version, I do not believe anything here will change based upon the platform but that is a possibility. I used version 2.0.50727 for all examples in this document. Other versions such as Rotor, Mono, or 1.x  will most likely show differing behaviors.


Hoisting?


Before I start getting into code I would like to discuss the concept of hoisting as it will be the focal point of this entire discussion. When dealing with loops we can break any given loop into three sections


1)      The preamble (setting up for the loop, doing things such as setting our counter to 0)


2)      The code within the loop


3)      The code to increment our counter and check for the end of iteration possibly jumping back to step two


For the rest of this post I will color code the disassembled assembly as section one (green), section two (blue), section 3 (yellow) to make things easier to follow


Any code that is in step two or three will be run N times as the loop is executed. The code in step one will only be executed a single time when we first enter the loop. The concept of hoisting involves moving code out of steps two and three and into step one. Hoisting obviously gives you a huge performance gain as the code will only be run once. 


Simple Hoisting Example


If you read alot articles on for loop performance they will tell you to not manually hoist because the JIT will end up doing it for you. In order to test this I have created the following test code (which can be used in conjunction with the harness discussed in the previous Performance Measurement post)








    class Program {


        static int[] foo = new int[100000];


        static int Dummy;


        static void Test1NoHoist() {


            int total = 0;


            for (int i = 0; i < foo.Length; i++) {


                total|=i;


            }


            Dummy = total;


        }


 


        static void Test1WithHoist() {


            int total = 0;


            int length = foo.Length;


            for (int i = 0; i < length; i++) {


                total|=i;


            }


            Dummy = total;


        }


 


        static void Main(string[] args) {


            TestHarness.Test(“Test1NoHoist”, 10000, new TestHandler(Test1NoHoist));


            TestHarness.Test(“Test1WithHoist”, 10000, new TestHandler(Test1WithHoist));


            Console.ReadLine();


        }


    }


Listing 1: Basic manual hoisting example


 


You will quickly notice that in Test1WithHoist we manually took the check Length property on foo and moved it to before the loop; this is a common optimization if you come from a C/C++ world. The thought behind this is that since Length translates to get_Length() that we are saving ourselves from having to call the method n times (remember that this would be in the second or third section and everything in the third section is called n times).


An astute reader will notice an oddity dealing with the “Dummy” variable that is assigned after the loop. We will come back this a bit later on… good catch 😉


The by running our performance test, we can see that the code executes in roughly equivalent time. 
















Test


Total Time (ns)


Average (ns)


Test1NoHoist


820583612.35


82058.36


Test1WithHoist


812735363.70


81273.53


 


Let’s take a look at the generated code to offer proof that the two bits of code should run in equivalent time (if you want to see the results for yourself remember to follow the instructions for getting JIT optimized code from Viewing Unmanaged Code in VS.NET.













Test1NoHoist


Test1WithHoist


00000000  xor         ecx,ecx


00000002  xor         edx,edx


00000004  mov         eax,dword ptr ds:[022B1EC4h]


00000009  mov         eax,dword ptr [eax+4]


0000000c  test        eax,eax


0000000e  jle         00000019


00000010  or          ecx,edx


00000012  add         edx,1


00000015  cmp         eax,edx


00000017  jg          00000010


00000019  mov         dword ptr ds:[00912FE8h],ecx


0000001f  ret             


 


 


00000000  xor         ecx,ecx


00000002  mov         eax,dword ptr ds:[022B1EC4h]


00000007  mov         edx,dword ptr [eax+4]


0000000a  xor         eax,eax


0000000c  test        edx,edx


0000000e  jle         00000019


00000010  or          ecx,eax


00000012  add         eax,1


00000015  cmp         eax,edx


00000017  jl          00000010


00000019  mov         dword ptr ds:[00912FE8h],ecx


0000001f  ret             


Listing 2: Disassembled versions of our functions


 


 


A kind of neat item in these two bits of code is that the only real difference is that EAX and EDX are interchanged … This makes them look a bit more different than they are but rest assured they are pretty much identical. In both cases, the access to the stop variable has been hoisted out of the loop (in section three, both are comparing their counter to a register that was preloaded in section one). To better illustrate the point, here is a disassembly of the first for loop (without manual hoisting) when run without JIT optimizations (i.e. it will not be hoisted).








0000002a  xor         esi,esi


0000002c  nop             


0000002d  jmp         00000034


0000002f  nop             


00000030  or          edi,esi


00000032  nop             


00000033  inc         esi 


00000034  mov         eax,dword ptr ds:[02275A34h] ;inlined length


00000039  cmp         esi,dword ptr [eax+4]


0000003c  setl        al  


0000003f  movzx       eax,al


00000042  mov         ebx,eax


00000044  test        ebx,ebx


00000046  jne         0000002F


 


Listing 3: First loop (no hoisting with optimizations disabled) non-important code removed


 


If this does not convince you to NEVER release code that is not being optimized I don’t know what will. It should also give you an idea of how much work the JIT optimizer really does. What has happened here is that the property was inlined, but it was inlined into section three. The optimizer was smart enough to realize this and to move it up to the preamble.  +1 for the optimizer


Non-Trivial Hoisting


Now that we have gone through a simple example, let’s try a more difficult one for the optimizer. In this example we will use a method GetUpperBound(0) which will not be inlined for us, let’s take a look at how well the JIT handles it. Here is the testing code to add to our previous code.








        static void Test1NoHoistNoInlining() {


            int total = 0;


            for (int i = 0; i < foo.GetUpperBound(0); i++) {


                total |= i;


            }


            Dummy = total;


        }


 


        static void Test1WithHoistNoInlining() {


            int total = 0;


            int length = foo.GetUpperBound(0);


            for (int i = 0; i < length; i++) {


                total |= i;


            }


            Dummy = total;


        }


Listing 4: Tests without inlining available


 


Next we should add the following to our Main to run the tests.








            TestHarness.Test(“Test1NoHoist”, 10000, new TestHandler(Test1NoHoistNoInlining));


            TestHarness.Test(“Test1WithHoist”, 10000, new TestHandler(Test1WithHoistNoInlining));


 


Listing 5: Calling our new methods


 


When we run the tests in release mode, we get quite different results than previously. I am including the previous results in the table for comparison.
























Test


Total Time (ns)


Average (ns)


Test1NoHoist


820583612.35


82058.36


Test1WithHoist


812735363.70


81273.53


Test1NoHoistNoInlining


25457518879.48


2545751.88


Test1WithHoistNnInlining


824641533.67


82464.15


 


Interesting, not manually hoisting makes our function 32x slower. Before we even get into code on this one I will put my money on the optimizer not hoisting the call. I believe however there are some reasons for this that we will discuss after looking through the code.













Test1NoHoistNoInlining


Test1WithHoistNoInlining


00000004  xor         esi,esi


00000006  mov         ecx,dword ptr ds:[022B1EC4h]


0000000c  xor         edx,edx


0000000e  cmp         dword ptr [ecx],ecx


00000010  call        792661F8


00000015  test        eax,eax


00000017  jle         00000031


00000019  or          edi,esi


0000001b  add         esi,1


0000001e  mov         ecx,dword ptr ds:[022B1EC4h]


00000024  xor         edx,edx


00000026  cmp         dword ptr [ecx],ecx


00000028  call        792661F8


0000002d  cmp         eax,esi


0000002f  jg          00000019


 


 


00000003  mov         ecx,dword ptr ds:[022B1EC4h]


00000009  xor         edx,edx


0000000b  cmp         dword ptr [ecx],ecx


0000000d  call        792661A8


00000012  mov         edx,eax


00000014  xor         eax,eax


00000016  test        edx,edx


00000018  jle         00000023


0000001a  or          esi,eax


0000001c  add         eax,1


0000001f  cmp         eax,edx


00000021  jl          0000001A             


Listing 6: Disassembled versions



 


As we suspected the code on the left hand side is not automatically hoisting the call of the function for us.  It would seem that the JIT will not automatically hoist method calls for us, that instead we have to explicitly state that we want them hoisted on our own.


It seems that that the JIT cannot get to the point of having a value in a register or memory that it can consider its result and as such feels the need to make the call on every iteration. This would make sense in general as I may be depending upon the behavior of being called at every interval. Consider the following code.








        static int t = 0;


        static bool KeepGoing() {


            Console.WriteLine(“Still Going!”);


            t++;


            return t < 10;


        }


 


 


        static void TestShortMethod() {


            for (; KeepGoing();) {


                System.Threading.Thread.Sleep(100);


            }


        }


Listing 7: Odd but valid code


 


Obviously this is nasty code but it is still valid. This is a simplified example it appears that it was a conscious decision to not support these types of situations as they can quickly become extremely complex from the JIT point of view, my guess is that it won’t deal with anything beyond simply returning a variable which will be inlined into a simple read anyway.  In my opinion, something like this should still be inlined (and left in section 3) to avoid the overhead of setting up the call on every iteration but I may be missing some other case that makes this more difficult. Based upon these results we can create the following rule.


If you wish to use a method call for your stop condition that does anything more complex than simply returning a variable or is not inlinable for other reasons such as being virtual; you should hoist it manually by placing it in the first part of your for loop in order to make it explicit to the JIT that you do not wish the behavior to be called on every iteration.


 Mark Lubischer brought up an excellent point here. We can in fact also hoist the call by using for like this.










        static int MarkWithoutInlining() {


            int total = 0;


            int[] length = new int[10000];


 


            for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {


                total |= i;


            }


            return total;


        }


 


This is a much better way of handling our hoisting as it better defines the scope of our variable while the behavior is in fact equivalent to our original hoist.







        static int MarkWithoutInlining() {


            int total = 0;


00000000  push        edi 


00000001  push        esi 


00000002  push        ebx 


00000003  xor         ebx,ebx


            int[] length = new int[10000];


00000005  mov         edx,2710h


0000000a  mov         ecx,7915982Ah


0000000f  call        FFB21D98


00000014  mov         edi,eax


 


            for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {


00000016  xor         esi,esi


00000018  mov         ecx,edi


0000001a  xor         edx,edx


0000001c  cmp         dword ptr [ecx],ecx


0000001e  call        792664C8


00000023  test        eax,eax


00000025  jle         00000039


00000027  mov         edx,dword ptr [edi+4]


                total |= length[i];


0000002a  cmp         esi,edx


0000002c  jae         0000003F


0000002e  or          ebx,dword ptr [edi+esi*4+8]


            for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {


00000032  add         esi,1


00000035  cmp         esi,eax


00000037  jl          0000002A


            }


            return total;


00000039  mov         eax,ebx


0000003b  pop         ebx 


0000003c  pop         esi 


0000003d  pop         edi 


0000003e  ret             


0000003f  call        792B42E9


00000044  int         3   


 


A Bit About Foreach


Foreach is not an IL concept, it is a compiler concept. Compilers generate normal for loops when they see a foreach being used to iterate an array. If you are interested in seeing how foreach loops work I would highly recommend taking a look with ILDASM or Reflector. I will not discuss heavily foreach loops as they have at this moment no difference when they reach the native level.


Foreach could in the future offer many benefits over a general for loop. Since the foreach loop is explicitly stating that you want to iterate through the array (not allowing things like addition and subtraction to your counter), other things such as hoisting array bounds checks (which we will discuss shortly) would be much more easily accomplished. I would imagine that in the future foreach will be the preferred iteration construct.


I will assure you though that as of now foreach is not faster than the equivalent for loop when dealing with arrays, in fact both VB.NET and C# insure that they are identical. There is one case where foreach will be slower than a for loop and that is if you do not actually use the item within your loop, the for version will obviously be faster since it never loads the variable where as the foreach does this implicitly on every iteration. Foreach can however offer you great performance increases when dealing with other types that implement IEnumerable (not including collections as they simply perform an indexed lookup in their enumerator).


Logically one can come up with a great example for this when looking at a linked list. The for loop will cause ∑ n total operations on the iteration where as the foreach will only cause n. The reason for this is every index operation would have to start at the beginning of the list and iterate n nodes in order to return the nth node where as the enumerator will simply remember the last node it was on and give the next node. For this reason we will add our second rule.


When dealing with items that are enumerable as opposed to dealing with arrays or collections; prefer foreach to a for loop as the enumerator will often offer a much faster way of enumerating than using an index.


Array Bounds Check Hoisting


The check to find out whether or not we are valid to continue is not the only thing that can be hoisted from inside of a loop. In fact every time the variable is used in a comparison within the loop is an opportunity for hoisting to occur. The first example of this type of hoist we will look at is array bounds checking.


Array bounds are checked automatically by the CLR in order to prevent things like buffer overflows from occurring. Every time that you access an array in safe code you will in fact have a comparison occur to insure that you are within the range of the array. If you are not within a valid range an IndexOutOfRangeException will occur as opposed to writing happily beyond the end of your array as many other language such as C would do.


The problem with array bounds checks is that they are extremely redundant when dealing with loops. Consider the following code that shows what is happening.








        static void SampleArrayBoundsCheck() {


            int total = 0;


            for (int i = 0; i < foo.Length; i++) {


                if (i < foo.Length) {


                    total |= foo[ i ];


                }


            }


        }


Listing 9: Sample Array Bounds Code


 


Naturally you are not issuing these checks in your code, but this example helps to make what’s really going on a bit clearer. When looking at this code anyone who has read the first few chapters of a C# book would scratch their head and wonder why all of the redundancy has been placed into the code. It is obvious that if our counter has a constraint to stay below foo.Length that it will in fact always succeed the conditional of being below foo.Length. To test how intelligent the JIT is with handling this situation we can use the following code. Note that for this test we will simply be looking at the native code generated as opposed to measuring performance.








        static int [] SampleArrayBoundsCheck() {


            int [] Destination = new int[foo.Length];


            for (int i = 0; i < foo.Length; i++) {


                Destination[ i ]= foo[ i ];


            }


            return Destination;


        }


Listing 10: Test to see if array bound checks hoist


 


This code is simply copying the static array that we have been using previously to another array. This code should be particularly good to test as it in fact has two bounds checks occurring within it (one for each array).








        static int [] SampleArrayBoundsCheck() {


            int [] Destination = new int[foo.Length];


00000000  push        edi 


00000001  push        esi 


00000002  push        ebx 


00000003  push        ebp 


00000004  push        eax 


00000005  mov         edi,dword ptr ds:[022B1EC4h]


0000000b  mov         ebx,dword ptr [edi+4]


0000000e  mov         edx,ebx


00000010  mov         ecx,7915982Ah


00000015  call        FFB21FC0


0000001a  mov         esi,eax


            for (int i = 0; i < foo.Length; i++) {


0000001c  xor         edx,edx


0000001e  test        ebx,ebx


00000020  jle         00000045


00000022  mov         ebp,dword ptr [edi+4]


00000025  mov         eax,dword ptr [esi+4]


00000028  mov         dword ptr [esp],eax


                Destination[ i ]= foo[ i ];


0000002b  cmp         edx,ebp


0000002d  jae         0000004D


0000002f  mov         ecx,dword ptr [edi+edx*4+8]


00000033  mov         eax,dword ptr [esp]


00000036  cmp         edx,eax


00000038  jae         0000004D


0000003a  mov         dword ptr [esi+edx*4+8],ecx


            for (int i = 0; i < foo.Length; i++) {


0000003e  add         edx,1


00000041  cmp         ebx,edx


00000043  jg          0000002B


            }


            return Destination;


00000045  mov         eax,esi


00000047  pop         ecx 


00000048  pop         ebp 


00000049  pop         ebx 


0000004a  pop         esi 


0000004b  pop         edi 


0000004c  ret              


0000004d  call        792B4511


00000052  int         3   


Listing 11: Disassembly of simple array bound checks


 


Unfortunately even this most trivial of examples does not hoist the array bounds checks. The jumps can clearly be seen on lines 2D and 38.  As to why this does not work I am not sure but perhaps it is because code like this could exist which could cause a buffer overflow? Perhaps the amount of time detect this situation has been deemed too much.








        static int [] SampleArrayBoundsCheck() {


            int [] Destination = new int[foo.Length];


            for (int i = 0; i < 100000; i++) {


                i += 100000000;


                Destination[ i ]= foo[ i ];


            }


            return Destination;


        }


Listing 12: Buffer overfow example


 


Checking before applying the optimization would require a good amount of overhead and if it did not check to make sure anyone was messing with our counter in the interior of the loop then we could run into code like this which would cause a buffer overflow (who knows what we just wrote over, maybe we would get an exception or maybe we just caused a funny character in a string some place, boy do I miss the days of C/Pascal in embedded systems J).  Let’s try something even simpler, Brad Abrams says it works so it must?










        static int SampleArrayConstantBoundsCheck() {


            int foo2 = 0;


            for (int i = 0; i < 1000; i++) {


              foo2 = foo[ i ];


            }


            return foo2;


        }


        static int SampleArrayConstantBoundsCheck() {


            int foo2 = 0;


00000000  push        esi 


            for (int i = 0; i < 1000; i++) {


00000001  xor         edx,edx


00000003  mov         ecx,dword ptr ds:[022B1EC4h]


00000009  mov         esi,dword ptr [ecx+4]


              foo2 = foo[ i ];


0000000c  cmp         edx,esi


0000000e  jae         00000021


00000010  mov         eax,dword ptr [ecx+edx*4+8]


            for (int i = 0; i < 1000; i++) {


00000014  add         edx,1


00000017  cmp         edx,3E8h


0000001d  jl          0000000C


0000001f  pop         esi 


            }


            return foo2;


00000020  ret             


00000021  call        792B44A9


00000026  int         3   


Listing 13: Simplest possible example


 


Again, no love from the JIT optimizer, we cannot possibly make this example any simpler but it still has the bounds checks firmly placed within our loop. The JIT optimizer does not hoist array bounds checks for you. My guess is that it will not do this do to threading reasons, I would assume the process to actually be thread safe (as it is only copying something of reference size which is assured to be atomic but I am probably missing something. There is some similar functionality which happens that I think may cause the confusion that array bounds hoist are actually occurring. Let’s take a look at this other optimization, but first we can make a new rule.


If you are dealing with array accesses in a highly performant area and the bounds checks being in the loop are too much. You will have to handle your iteration in unsafe code to remove the bounds checks.


Local Array Bounds Check Removal


The optimization that the JIT does support is removing bounds checks for locally created arrays. Since the array has been created locally and is only known by a local reference, the JIT can be sure that the array cannot possibly change. In these cases the JIT will completely remove bounds checks. Let’s take a look at an example.










        static int TestLocalArrayBoundsCheckRemoval() {


            int [] Test = new int[10000];


            int total = 0;


            for(int i=0;i<Test.Length;i++) {


                total |= Test[ i ];


            }


            return total;


        }


        static int TestLocalArrayBoundsCheckRemoval() {


            int[] Test = new int[10000];


00000000  push        esi 


00000001  mov         edx,2710h


00000006  mov         ecx,7915982Ah


0000000b  call        FFB21FF0


00000010  mov         ecx,eax


            int total = 0;


00000012  xor         esi,esi


            for(int i=0;i<Test.Length;i++) {


00000014  xor         edx,edx


00000016  mov         eax,dword ptr [ecx+4]


00000019  test        eax,eax


0000001b  jle         00000028


                total |= Test[ i ];


0000001d  or          esi,dword ptr [ecx+edx*4+8]


            for(int i=0;i<Test.Length;i++) {


00000021  add         edx,1


00000024  cmp         eax,edx


00000026  jg          0000001D


            }


            return total;


00000028  mov         eax,esi


0000002a  pop         esi 


0000002b  ret             


Listing 14: Removal of local array bounds


 


What is really neat here is that we have in fact created unmanaged code which does not even check for out of bounds access. This code will never throw an exception (it doesn’t even have code to throw an exception). This optimization is even better than hoisting as it has no overhead, the hoisting would make it a single comparison; this had no comparison. My guess is that people were getting confused between this and hoisting. If you do not use a length property or a constant in your loop this optimization will not occur, as an example the following code will not remove the bounds checks










        static int SampleArrayConstantBoundsCheckRemoval() {


            int foo2 = 0;


            int size = foo.GetUpperBound(0) ;


            for (int i = 0; i < size; i++) {


                foo2 = foo[ i ];


            }


            return foo2;


        }


 


        static int TestLocalArrayBoundsCheckRemoval() {


            int[] Test = new int[10000];


00000000  push        edi 


00000001  push        esi 


00000002  mov         edx,2710h


00000007  mov         ecx,7915982Ah


0000000c  call        FFB21FF0


00000011  mov         esi,eax


            int total = 0;


00000013  xor         edi,edi


            int size = Test.GetUpperBound(0);


00000015  mov         ecx,esi


00000017  xor         edx,edx


00000019  cmp         dword ptr [ecx],ecx


0000001b  call        79266720


00000020  mov         edx,eax


            for(int i=0;i<size;i++) {


00000022  xor         eax,eax


00000024  test        edx,edx


00000026  jle         0000003A


00000028  mov         ecx,dword ptr [esi+4]


                total |= Test[ i ];


0000002b  cmp         eax,ecx


0000002d  jae         0000003F


0000002f  or          edi,dword ptr [esi+eax*4+8]


            for(int i=0;i<size;i++) {


00000033  add         eax,1


00000036  cmp         eax,edx


00000038  jl          0000002B


            }


            return total;


0000003a  mov         eax,edi


0000003c  pop         esi 


0000003d  pop         edi 


0000003e  ret             


0000003f  call        792B4541


00000044  int         3   


Listing 15: Manual hoisting causes Array Bounds Check Removal to fail


 


As we can see the manual hoisting caused the optimization to fail. So manual hoisting is not always a good thing as it can break some JIT patterns such as this. This optimization will also work for writes.










        static int [] TestLocalArrayBoundsCheckWrite() {


            int[] Test = new int[10000];


            for(int i=0;i<Test.Length;i++) {


                Test[ i ] = i;


            }


            return Test;


        }


        static int [] TestLocalArrayBoundsCheckWrite() {


            int[] Test = new int[10000];


00000000  mov         edx,2710h


00000005  mov         ecx,7915982Ah


0000000a  call        FFB21FF0


0000000f  mov         ecx,eax


            for(int i=0;i<Test.Length;i++) {


00000011  xor         edx,edx


00000013  mov         eax,dword ptr [ecx+4]


00000016  test        eax,eax


00000018  jle         00000025


                Test[ i ] = i;


0000001a  mov         dword ptr [ecx+edx*4+8],edx


            for(int i=0;i<Test.Length;i++) {


0000001e  add         edx,1


00000021  cmp         eax,edx


00000023  jg          0000001A


            }


            return Test;


00000025  mov         eax,ecx


00000027  ret             


Listing 16: Bounds Check Removal for writes


 


As we can see, it again removed all of the bounds checking. Of course as we have seen earlier, this will only work for an array that is created in the same method. If we move the creation of the array out of the method this optimization will no longer occur. This leads us to our third rule.


Try to keep creation and initialization of arrays within the same method as the JIT can often times remove the bounds checks during your initialization.


Another oddity


Were you one of the people I said “good catch” to earlier? Remember those odd “Dummy” variables were using in the first set of examples


The JIT is extremely smart in some optimizations. For example it realizes if you are doing calculations and never use the result, it will remove the code for you. It does however have a slight problem when dealing with loops. Let’s try removing those Dummy calls from our previous code and see what happens. Let’s take a look at only one of them since we have already shown that they produce identical results.










        static void Test1NoHoist() {


            int total = 0;


            for (int i = 0; i < foo.Length; i++) {


                total|=i;


            }


        }


              static void Test1NoHoist() {


            int total = 0;


00000000  xor         edx,edx


            for (int i = 0; i < foo.Length; i++) {


00000002  mov         eax,dword ptr ds:[022B1EC4h]


00000007  mov         eax,dword ptr [eax+4]


0000000a  test        eax,eax


0000000c  jle         00000015


0000000e  add         edx,1


00000011  cmp         eax,edx


00000013  jg          0000000E


00000015  ret             


Figure 17: Our method without the dummy value being set


 


I have color coded the example as with the rest of the example.  You will notice that the blue section is missing. The JIT optimizer has correctly realized that we were in fact never using our total variable after we performed all of the calculation on it. It has however failed to realize that by taking out our code, it has in fact left a loop which does nothing. Nothing super interesting here, this is just something I came across while writing this.


Summary


We have looked through quite a few JIT optimizations in this post and have made a few rules which that can help us in situations where measuring becomes very difficult as we need to know what to measure against. Let’s go back through our rules, keep in mind that these rules are not version agnostic so using 1.x, mono, or rotor you will likely get different results!


1)      If you wish to use a method call for your stop condition that does anything more complex than simply returning a variable or is not inlinable for other reasons such as it being virtual, you should hoist it manually in order by placing it in the first part of your for loop to make it explicit to the JIT that you do not wish the method to be called on every iteration.


2)      When dealing with items that are enumerable as opposed to dealing with arrays or collections; prefer foreach to a for loop as the enumerator will often offer a much faster way of enumerating than using an index.


3)      If you are dealing with array access in a highly performant area and the bounds checks being in the loop are too much overhead. You will have to handle your iteration in unsafe code to remove the bounds checks (in other words, it is impossible to write performant looping code in VB.NET)


4)      Try to keep creation and initialization of arrays within the same method as the JIT can often times remove the bounds checks during your initialization.


Hopefully by following these rules you can save yourself some time the next time you are optimizing code as many of these rules deal with things that are difficult to measure, but remember measurement is key.


I hope you enjoyed reading about these optimizations as much as I enjoyed writing about them.

This entry was posted in Under The Covers. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

7 Responses to An in depth look at for loops

  1. ole johnny says:

    Any worthwhile updates on the latest .Net 3.5 JIT optimization?

    Given the following code:
    int[] data= new int[100000000];
    for (int i = 0; i < 100000000; i++)
    {
    data[i]= i;
    }
    I get less than 1 GB per second transfere rate on my laptop which should be good for 8GB a second (got two identical memory modules). I have tried various alternatives including unsafe code as well as adding multiple arrays within the loop. I am running optimized release mode from outside visual studio.

    Is this as it should be, why, how and when to utilize max memory transfere?

    If this simple code cant get closer to the max I dont know what will…

    Appreciate your articles, absolutely great reading!
    OJ

  2. Greg says:

    I am curious on your source derek … this is not the case.

  3. derekdb says:

    My understanding was that the Jit optimizer only hoisted bounds checks when the code directly tested against array.length. I’ve actually seen code of the form
    for (int i=0; i < foo.length; i++) { if (i < 1000) break; … with the claim that this hoisted the bounds check out of the loop. I can’t verify this right now, so I leave that as an exersize for the reader..

  4. Greg says:

    I have some further information on the removal case that I will be posting .. I am waiting to hear back on a few things before I post. Basically if you assign a static/instance to a local variable removal will occur (but often times your explictness is also optimized out). I am trying to figure out if this is a bug or not and whether or not this should be considerred a best practice in 2.0 as the same optimization may in fact incur a penalty dealing with other JITs.

  5. Mark Lubischer says:

    Thanks for the follow up!

  6. Greg says:

    That’s a good question, you are correct that I forgot to mention this, I will add a section to add this.

    This will have 2 distinct behaviors (one when inlined one when not)

    With inlining it is smart enough to realize what you are doing (and to undo it)

               for (int i = 0, j = length.Length; i < j; i++) {
    00000014  xor         edx,edx
    00000016  mov         eax,dword ptr [ecx+4]
    00000019  test        eax,eax
    0000001b  jle         00000026
                   total |= i;
    0000001d  or          esi,edx
               for (int i = 0, j = length.Length; i < j; i++) {
    0000001f  add         edx,1
    00000022  cmp         edx,eax
    00000024  jl          0000001D
               }

    This is identical to the code produced but i<length.Length

    Without inlining it will put it into the preamble of the loop producing functional equivalent results to the other hoisted examples (although slightly different orderring and obviously the variable maintains a better scope) 

              for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {
    00000013  xor         esi,esi
    00000015  mov         ecx,eax
    00000017  xor         edx,edx
    00000019  cmp         dword ptr [ecx],ecx
    0000001b  call        792666A8
    00000020  test        eax,eax
    00000022  jle         0000002D
                   total |= i;
    00000024  or          edi,esi
               for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {
    00000026  add         esi,1
    00000029  cmp         esi,eax
    0000002b  jl          00000024
               }

    The key thing to notice is that the JIT still does not realize what we are doing with array bounds hoists …

                for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {
    00000016  xor         esi,esi
    00000018  mov         ecx,edi
    0000001a  xor         edx,edx
    0000001c  cmp         dword ptr [ecx],ecx
    0000001e  call        792664C8
    00000023  test        eax,eax
    00000025  jle         00000039
    00000027  mov         edx,dword ptr [edi+4]
                    total |= length[i];
    0000002a  cmp         esi,edx
    0000002c  jae         0000003F
    0000002e  or          ebx,dword ptr [edi+esi*4+8]
                for (int i = 0, j = length.GetUpperBound(0); i < j; i++) {
    00000032  add         esi,1
    00000035  cmp         esi,eax
    00000037  jl          0000002A
                }

    Good catch!

    Cheers,

    Greg

  7. Mark Lubischer says:

    Well written article, quite interesting to see the optimization done by the compiler.

    As I was reading through it, I was wondering about how a locally scoped hoist would be handled by the JIT optimizer:

    int total = 0;
    int[] length = new int[10000];

    for (int i = 0, j = length.Length; i < j; i++)
    {
    total |= i;
    }

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>