Bug ?: A bit more on for loops

OK I had left this out of my previous post an in depth look at for loops as I considered the behavior to be a JIT bug. I discussed this with some very smart people (including quite a few at Microsoft) and they agreed but my problem report was marked with “by design” so now I should explain what happens and the hoops you have to jump through in order to work around the issue.


We have discussed before the concept of bounds check removal. Since we know that the array in question has a length l, and we are storing the reference to that array in a register we know that the length cannot possibly change. We know that the length cannot possibly change because the size of an array object is immutable, we can change the data within the array but not the actual size.  There is some discussion to be had on this point and future implementations but I will leave that for another discussion.


Let’s look at some C# code that illustrates this point. I am going to discuss VB.NET and other languages later as it applies to them as well (even more so than C#).








class Program {


    static int[] foo = new int[10000];


    static int WillNotRemoveChecksStaticReference() {


        int total = 0;


        for (int i = 0; i < foo.Length; i++) {


            total |= foo[ i ];


        }


        return total;


    }


    static int WillRemoveChecksLocal() {


        int total = 0;


        int[] bar = new int[10000];


        for (int i = 0; i < bar.Length; i++) {


            total |= bar[ i ];


        }


        return total;


    }


    static int WillRemoveChecksPassedIn(int[] bar) {


        int total = 0;


        for (int i = 0; i < bar.Length; i++) {


            total |= bar[ i ];


        }


        return total;


    }


    static void Main(string[] args) {


        WillNotRemoveChecksStaticReference();


        WillRemoveChecksForStaticReference();


        WillRemoveChecksLocal();


        WillRemoveChecksPassedIn(foo);


    }


}


Listing 1: C# Code to explore


 


The problem is easily seen when we look through the WillNotRemoveChecksStaticReference method. Here is the disassembly of optimized code for that method. I have marked the code using the same standard as in the last post (green is preamble, blue is the code in the loop, yellow is the code to increment the counter and see if the loop should be ended).








    static int WillNotRemoveChecksStaticReference() {


        int total = 0;


00000000  push        edi 


00000001  push        esi 


00000002  xor         ecx,ecx


        for (int i = 0; i < foo.Length; i++) {


00000004  xor         edx,edx


00000006  mov         esi,dword ptr ds:[02275A44h]


0000000c  mov         eax,dword ptr [esi+4]


0000000f  test        eax,eax


00000011  jle         00000025


00000013  mov         edi,dword ptr [esi+4]


            total |= foo[ i ];


00000016  cmp         edx,edi


00000018  jae         0000002A


0000001a  or          ecx,dword ptr [esi+edx*4+8]


        for (int i = 0; i < foo.Length; i++) {


0000001e  add         edx,1


00000021  cmp         eax,edx


00000023  jg          00000016


        }


        return total;


00000025  mov         eax,ecx


00000027  pop         esi 


00000028  pop         edi 


00000029  ret             


0000002a  call        79443579


0000002f  int         3   


Listing 2: Disassembly of WillNotRemoveChecksStaticReference


 


We can quickly see here that the bounds check is in fact not being removed as it would be if the array were either a local or passed in. This means that there is a compare and a branch added to every iteration of the loop. In the previous article we learned how expensive this can be.


If you go to my problem report you will see that someone named TAG has commented on it that this is impossible to have happen (or would require huge amounts of complex logic). On first glance I had suspected something similar; my guess (and the guess of nearly everyone I have talked to about this) was that it was due to volatility. Consider for a moment that there was another thread executing that changed foo to now be a smaller array, this could feasibly make our code write out into never-never land if it did not check every iteration of the loop as the size of the array could change .. right?


Then again maybe not!


The operation would in fact always be safe to remove the bounds check from so long as we are storing our pointer to the array in a register. I am fairly certain the JIT knows if it stored the reference in a register (it did apply this optimization so it is not a huge jump in logic to think that it knows that it did this).  Since we are storing our pointer to the array in a register VOLATILITY DOES NOT APPLY (sorry for bold caps but this is an important point). Any change to the original reference will not affect our loop; our register will still point to the old “safe” copy of the array that we originally read the length from.


The reason I imagine that the JIT does not do this optimization automatically is the condition that exists if we cannot put the pointer to the array into a register. In this case the operation is no longer safe as we will see the change to the array if it is made from another thread.


In order to help prove the point (and show the workaround) we can in fact force the JIT to do exactly what we want by using the following C# code.








    static int WillRemoveChecksForStaticReference() {


        int total = 0;


        int[] bar = foo;


        for (int i = 0; i < bar.Length; i++) {


            total |= bar[ i ];


        }


        return total;


    }


Listing 3: C# code that does remove bounds checking


 


Before continuing, would you delete the seemingly useless code this contains in a single threaded app (or any app)? It does however make a very subtle difference in our disassembled code.








    static int WillRemoveChecksForStaticReference() {


        int total = 0;


00000000  push        esi 


00000001  xor         ecx,ecx


        int[] bar = foo;


00000003  mov         esi,dword ptr ds:[02275A44h]


        for (int i = 0; i < bar.Length; i++) {


00000009  xor         edx,edx


0000000b  mov         eax,dword ptr [esi+4]


0000000e  test        eax,eax


00000010  jle         0000001D


            total |= bar[ i ];


00000012  or          ecx,dword ptr [esi+edx*4+8]


        for (int i = 0; i < bar.Length; i++) {


00000016  add         edx,1


00000019  cmp         eax,edx


0000001b  jg          00000012


        }


        return total;


0000001d  mov         eax,ecx


0000001f  pop         esi 


00000020  ret             


Listing 4: Disassembly of WillRemoveChecksForStaticReference


 


Although the code itself is quite different, you will notice that the disassembly is almost identical. The only differences deal with the bounds removal and the placement of a single line of assembly.


In the first bit of code we had in the loop initialization


00000006  mov         esi,dword ptr ds:[02275A44h]


 


In the second bit of code we had


00000003  mov         esi,dword ptr ds:[02275A44h]


 


Just before the loop started, the JIT could very easily have done the bounds removal optimization for us as opposed to making us write the very non-intuitive code setting the static to a local first. The only situation where the two differ in behavior is if our array is not being moved into a register (esi in this case). In fact the JIT in the first case is essentially treating our array as if it “partially volatile”. If we needs this type of behavior we would probably mark our array volatile ourselves.


So after all of this I will create another rule for optimizing loops.


Never loop on an array that is or is contained in a field or static, always copy the reference to a local first to be explicit about the way that you want to handle changes to the original array on another thread.


Hopefully this will change in a future version of the JIT (though I wonder now since my issue was marked by design). It is still a good best practice to do this as you are being explicit in that you want to loop through every item of the array you started with, and that you are not interested in any changes to that array made by another thread during that process.


Although it is a good best practice to always use a local reference in order to be explicit I still think that the JIT should handle this for you in the case that it is optimizing to a register.  I think I have also shown fairly well that this optimization is not impossible but would in fact be relatively simple (simple branch on whether or not the array reference has been optimized to a register).

This entry was posted in Under The Covers. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

7 Responses to Bug ?: A bit more on for loops

  1. Greg says:

    Jeff .. you have brought up many points, I will do my best to address them.

    1) for vs while vs etc. This is all handled by the C# compiler .. these things all look the same when viewed as IL. There is no such thing as a “for” in IL. The JIT only operates on the IL level so I don’t really see these items causing too much of an issue. These items can be fairly complex but they are complex at the compiler level (not so much at the JIT level). To see better what I am referring to, run ildasm on the examples from the original for article (or the examples listed here).

    2) Arrays .. remember that there are specific instructions for different types of arrays in IL so it knows the difference between a SZ and MD case. http://www.codeproject.com/dotnet/arrays.asp is a great article on this. Since it is being given IL instructions to deal with a SZ array it knows that the array does not have multiple dimensions and that it is 0 based etc.

    3) “So since the length is not part of an array’s type, what really is it.”, for SZ arrays which is what we are dealing with here the size of the array is kept with the array instance it is located at the object reference + 4. We can see this being accessed in the following instruction.

    0000000b  mov         eax,dword ptr [esi+4] //access array length

    The length of an array is immutable, you can only change the array to be a different array (which would have a different length). Since a reference to the array is being held in a register it does not have to deal with the volatility problem it might have to deal with if it were directly reading from memory (i.e. if another thread changes the original reference to point to another array, it will still be pointing at the array it started off with; it will not be updated to point to the new array).

    4) “Because of Garbage collection and shifting memory around might also be the reason for this additional check.”

    I am going to write a post on this but I will give you the abbreviated version here, I had a concern that was partially related (in code with bounds removed that garbage collection could possibly remove the item). This is however coverred in Jeffrey Richter’s CLR via C# (chapter 20). Registers are considerred to be GC roots (see pages around 462-464 for a detailed explanation of what happens). The garbage collector will in fact come into managed code and change register values in the case that memory shifts.

    “Naturally, moving the objects in memory invalidates all variables and CPU registers that contain pointers to the objects. So the garbage collector must revisit all of the application’s roots and modify them so that each root’s value points to the object’s new memory location.” p464

    As such in the case of a memory move the esi register in this case (which points to the array’s object reference) would be updated to reflect the new memory location. Since the array’s length is bound to the instance of the array the length would move as well. The code internal to the loop would continue working with the new memory address as the offset being dealt with is calculated every iteration through the loop based upon the value in esi and the loop counter (edx) in the following line.

    00000012  or          ecx,dword ptr [esi+edx*4+8]

    So if the garbage collector were to interrupt midway through looping it is quite possible that you could end up with the situation ..

    esi = 100000 edx = 0 write to 100008
    esi = 100000 edx = 1 write to 100012
    esi = 100000 edx = 2 write to 100016
    GC
    esi = 110000 edx = 3 write to 110016
    esi = 110000 edx = 4 write to 110020

    notice that GC would have changed the ESI register

  2. Jeff Parker says:

    Oh yeah and one more thing to note which is in the spec,

    Because context is required to determine the type of an array initializer, it is not possible to use an array initializer in an expression context without explicitly stating the type of the array.

    So also since the context of the array can not be determined ahead of time this also might be why you are seeing this odd behavior. Another thought though would be using the fixed keyword for the array The fixed statement prevents the garbage collector from relocating a movable variable. The fixed statement is only permitted in an unsafe context. Fixed can also be used to create fixed size buffers.

    Because of Garbage collection and shifting memory around might also be the reason for this additional check.

  3. Jeff Parker says:

    Well I am not disagreeing with the whole hoisting thing and that it is faster what I am saying is that there is a lot of complexity in a for loop over an array that there may be something in there by design that causes this to happen that you or I are not thinking of. I have been very curious of your posts and some some of my own research and experiments and well can’t think of a reason for this behavior, however I am not a compiler optimization expert either. The biggest thing I can think of why you are seeing this behavior is that an array is a reference type this includes the length of the array.

    Via the spec it defines a for loop as actually a while loop with everything auto generated.

    for-initializer ;
    while ( for-condition ) {
    embedded-statement ;
    LLoop:
    for-iterator ;
    }

    Which is exactly what you are showing. However on the flip site of things the for loop allows any of the following constructs

    for ( ; ; ) embedded-statement
    for ( for-initializer ; ; ) embedded-statement
    for ( ; for-condition ; ) embedded-statement
    for ( ; ; for-iterator ) embedded-statement
    for ( for-initializer ; for-condition ; ) embedded-statement
    for ( ; for-condition ; for-iterator ) embedded-statement
    for ( for-initializer ; ; for-iterator ) embedded-statement
    for ( for-initializer ; for-condition ; for-iterator ) embedded-statement

    Now adding in the complexity of an array, which array can be single dimensional, Multidimensional, or also rectangular and Jagged. The main key I think though to these array oddities you are seeing though in the spec it states

    “On the other hand, the size of the array—as represented by the length
    of each of its dimensions—is not part of an array’s type. This split is made clear in the language syntax, as
    the length of each dimension is specified in the array creation expression rather than in the array type.”

    So since the length is not part of an array’s type, what really is it. This is where I have kind of stopped research for a while. A question for another day. But I firmly believe the problem lies in the array not really in the for loop itself.

  4. Greg says:

    Jeff I am not sure that I am following the relation to your special case in the optimizer. Perhaps you can explain a bit more why you think there is a relation for this particular case.

  5. Jeff Parker says:

    Well if I would have to make a guess at why you are seeing this behavior and why it is by design it is because of the spec. For Loops have every Argument marked as optional. This might be the compilers way of handling it.

    for (;;) {
    }

    Is valid code. So I would have to guess somewhere in the JIT compiler there is some logic we do not fully understand handling this and why it is by design. However I can’t say for sure.

  6. johnwood says:

    This is fascinating stuff. I haven’t gone through all of your previous posts but it seems what you’re describing is quite an obvious optimization the JIT could perform. One quick thought, in the WillNotRemoveChecksStaticReference disassembly wouldn’t it also make more sense to pull out [esi+4] just once, putting it into edi, then doing a mov eax, edi to copy the value into eax, and then testing eax for 0… therefore eliminating an extra memory look up per iteration? Or is an extra instruction more costly than looking stuff up in the L1?

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>