OK I had left this out of my previous post an in depth look at for loops as I considered the behavior to be a JIT bug. I discussed this with some very smart people (including quite a few at Microsoft) and they agreed but my problem report was marked with “by design” so now I should explain what happens and the hoops you have to jump through in order to work around the issue.
We have discussed before the concept of bounds check removal. Since we know that the array in question has a length l, and we are storing the reference to that array in a register we know that the length cannot possibly change. We know that the length cannot possibly change because the size of an array object is immutable, we can change the data within the array but not the actual size. There is some discussion to be had on this point and future implementations but I will leave that for another discussion.
Let’s look at some C# code that illustrates this point. I am going to discuss VB.NET and other languages later as it applies to them as well (even more so than C#).
class Program { static int[] foo = new int[10000]; static int WillNotRemoveChecksStaticReference() { int total = 0; for (int i = 0; i < foo.Length; i++) { total |= foo[ i ]; } return total; } static int WillRemoveChecksLocal() { int total = 0; int[] bar = new int[10000]; for (int i = 0; i < bar.Length; i++) { total |= bar[ i ]; } return total; } static int WillRemoveChecksPassedIn(int[] bar) { int total = 0; for (int i = 0; i < bar.Length; i++) { total |= bar[ i ]; } return total; } static void Main(string[] args) { WillNotRemoveChecksStaticReference(); WillRemoveChecksForStaticReference(); WillRemoveChecksLocal(); WillRemoveChecksPassedIn(foo); } } |
Listing 1: C# Code to explore |
The problem is easily seen when we look through the WillNotRemoveChecksStaticReference method. Here is the disassembly of optimized code for that method. I have marked the code using the same standard as in the last post (green is preamble, blue is the code in the loop, yellow is the code to increment the counter and see if the loop should be ended).
static int WillNotRemoveChecksStaticReference() { int total = 0; 00000000 push edi 00000001 push esi 00000002 xor ecx,ecx for (int i = 0; i < foo.Length; i++) { 00000004 xor edx,edx 00000006 mov esi,dword ptr ds:[02275A44h] 0000000c mov eax,dword ptr [esi+4] 0000000f test eax,eax 00000011 jle 00000025 00000013 mov edi,dword ptr [esi+4] total |= foo[ i ]; 00000016 cmp edx,edi 00000018 jae 0000002A 0000001a or ecx,dword ptr [esi+edx*4+8] for (int i = 0; i < foo.Length; i++) { 0000001e add edx,1 00000021 cmp eax,edx 00000023 jg 00000016 } return total; 00000025 mov eax,ecx 00000027 pop esi 00000028 pop edi 00000029 ret 0000002a call 79443579 0000002f int 3 |
Listing 2: Disassembly of WillNotRemoveChecksStaticReference |
We can quickly see here that the bounds check is in fact not being removed as it would be if the array were either a local or passed in. This means that there is a compare and a branch added to every iteration of the loop. In the previous article we learned how expensive this can be.
If you go to my problem report you will see that someone named TAG has commented on it that this is impossible to have happen (or would require huge amounts of complex logic). On first glance I had suspected something similar; my guess (and the guess of nearly everyone I have talked to about this) was that it was due to volatility. Consider for a moment that there was another thread executing that changed foo to now be a smaller array, this could feasibly make our code write out into never-never land if it did not check every iteration of the loop as the size of the array could change .. right?
Then again maybe not!
The operation would in fact always be safe to remove the bounds check from so long as we are storing our pointer to the array in a register. I am fairly certain the JIT knows if it stored the reference in a register (it did apply this optimization so it is not a huge jump in logic to think that it knows that it did this). Since we are storing our pointer to the array in a register VOLATILITY DOES NOT APPLY (sorry for bold caps but this is an important point). Any change to the original reference will not affect our loop; our register will still point to the old “safe” copy of the array that we originally read the length from.
The reason I imagine that the JIT does not do this optimization automatically is the condition that exists if we cannot put the pointer to the array into a register. In this case the operation is no longer safe as we will see the change to the array if it is made from another thread.
In order to help prove the point (and show the workaround) we can in fact force the JIT to do exactly what we want by using the following C# code.
static int WillRemoveChecksForStaticReference() { int total = 0; int[] bar = foo; for (int i = 0; i < bar.Length; i++) { total |= bar[ i ]; } return total; } |
Listing 3: C# code that does remove bounds checking |
Before continuing, would you delete the seemingly useless code this contains in a single threaded app (or any app)? It does however make a very subtle difference in our disassembled code.
static int WillRemoveChecksForStaticReference() { int total = 0; 00000000 push esi 00000001 xor ecx,ecx int[] bar = foo; 00000003 mov esi,dword ptr ds:[02275A44h] for (int i = 0; i < bar.Length; i++) { 00000009 xor edx,edx 0000000b mov eax,dword ptr [esi+4] 0000000e test eax,eax 00000010 jle 0000001D total |= bar[ i ]; 00000012 or ecx,dword ptr [esi+edx*4+8] for (int i = 0; i < bar.Length; i++) { 00000016 add edx,1 00000019 cmp eax,edx 0000001b jg 00000012 } return total; 0000001d mov eax,ecx 0000001f pop esi 00000020 ret |
Listing 4: Disassembly of WillRemoveChecksForStaticReference |
Although the code itself is quite different, you will notice that the disassembly is almost identical. The only differences deal with the bounds removal and the placement of a single line of assembly.
In the first bit of code we had in the loop initialization
00000006 mov esi,dword ptr ds:[02275A44h]
In the second bit of code we had
00000003 mov esi,dword ptr ds:[02275A44h]
Just before the loop started, the JIT could very easily have done the bounds removal optimization for us as opposed to making us write the very non-intuitive code setting the static to a local first. The only situation where the two differ in behavior is if our array is not being moved into a register (esi in this case). In fact the JIT in the first case is essentially treating our array as if it “partially volatile”. If we needs this type of behavior we would probably mark our array volatile ourselves.
So after all of this I will create another rule for optimizing loops.
Never loop on an array that is or is contained in a field or static, always copy the reference to a local first to be explicit about the way that you want to handle changes to the original array on another thread.
Hopefully this will change in a future version of the JIT (though I wonder now since my issue was marked by design). It is still a good best practice to do this as you are being explicit in that you want to loop through every item of the array you started with, and that you are not interested in any changes to that array made by another thread during that process.
Although it is a good best practice to always use a local reference in order to be explicit I still think that the JIT should handle this for you in the case that it is optimizing to a register. I think I have also shown fairly well that this optimization is not impossible but would in fact be relatively simple (simple branch on whether or not the array reference has been optimized to a register).
EPi4Wb
Jeff .. you have brought up many points, I will do my best to address them.
1) for vs while vs etc. This is all handled by the C# compiler .. these things all look the same when viewed as IL. There is no such thing as a “for” in IL. The JIT only operates on the IL level so I don’t really see these items causing too much of an issue. These items can be fairly complex but they are complex at the compiler level (not so much at the JIT level). To see better what I am referring to, run ildasm on the examples from the original for article (or the examples listed here).
2) Arrays .. remember that there are specific instructions for different types of arrays in IL so it knows the difference between a SZ and MD case. http://www.codeproject.com/dotnet/arrays.asp is a great article on this. Since it is being given IL instructions to deal with a SZ array it knows that the array does not have multiple dimensions and that it is 0 based etc.
3) “So since the length is not part of an array’s type, what really is it.”, for SZ arrays which is what we are dealing with here the size of the array is kept with the array instance it is located at the object reference + 4. We can see this being accessed in the following instruction.
0000000b mov eax,dword ptr [esi+4] //access array length
The length of an array is immutable, you can only change the array to be a different array (which would have a different length). Since a reference to the array is being held in a register it does not have to deal with the volatility problem it might have to deal with if it were directly reading from memory (i.e. if another thread changes the original reference to point to another array, it will still be pointing at the array it started off with; it will not be updated to point to the new array).
4) “Because of Garbage collection and shifting memory around might also be the reason for this additional check.”
I am going to write a post on this but I will give you the abbreviated version here, I had a concern that was partially related (in code with bounds removed that garbage collection could possibly remove the item). This is however coverred in Jeffrey Richter’s CLR via C# (chapter 20). Registers are considerred to be GC roots (see pages around 462-464 for a detailed explanation of what happens). The garbage collector will in fact come into managed code and change register values in the case that memory shifts.
“Naturally, moving the objects in memory invalidates all variables and CPU registers that contain pointers to the objects. So the garbage collector must revisit all of the application’s roots and modify them so that each root’s value points to the object’s new memory location.” p464
As such in the case of a memory move the esi register in this case (which points to the array’s object reference) would be updated to reflect the new memory location. Since the array’s length is bound to the instance of the array the length would move as well. The code internal to the loop would continue working with the new memory address as the offset being dealt with is calculated every iteration through the loop based upon the value in esi and the loop counter (edx) in the following line.
00000012 or ecx,dword ptr [esi+edx*4+8]
So if the garbage collector were to interrupt midway through looping it is quite possible that you could end up with the situation ..
esi = 100000 edx = 0 write to 100008
esi = 100000 edx = 1 write to 100012
esi = 100000 edx = 2 write to 100016
GC
esi = 110000 edx = 3 write to 110016
esi = 110000 edx = 4 write to 110020
notice that GC would have changed the ESI register
Oh yeah and one more thing to note which is in the spec,
Because context is required to determine the type of an array initializer, it is not possible to use an array initializer in an expression context without explicitly stating the type of the array.
So also since the context of the array can not be determined ahead of time this also might be why you are seeing this odd behavior. Another thought though would be using the fixed keyword for the array The fixed statement prevents the garbage collector from relocating a movable variable. The fixed statement is only permitted in an unsafe context. Fixed can also be used to create fixed size buffers.
Because of Garbage collection and shifting memory around might also be the reason for this additional check.
Well I am not disagreeing with the whole hoisting thing and that it is faster what I am saying is that there is a lot of complexity in a for loop over an array that there may be something in there by design that causes this to happen that you or I are not thinking of. I have been very curious of your posts and some some of my own research and experiments and well can’t think of a reason for this behavior, however I am not a compiler optimization expert either. The biggest thing I can think of why you are seeing this behavior is that an array is a reference type this includes the length of the array.
Via the spec it defines a for loop as actually a while loop with everything auto generated.
for-initializer ;
while ( for-condition ) {
embedded-statement ;
LLoop:
for-iterator ;
}
Which is exactly what you are showing. However on the flip site of things the for loop allows any of the following constructs
for ( ; ; ) embedded-statement
for ( for-initializer ; ; ) embedded-statement
for ( ; for-condition ; ) embedded-statement
for ( ; ; for-iterator ) embedded-statement
for ( for-initializer ; for-condition ; ) embedded-statement
for ( ; for-condition ; for-iterator ) embedded-statement
for ( for-initializer ; ; for-iterator ) embedded-statement
for ( for-initializer ; for-condition ; for-iterator ) embedded-statement
Now adding in the complexity of an array, which array can be single dimensional, Multidimensional, or also rectangular and Jagged. The main key I think though to these array oddities you are seeing though in the spec it states
“On the other hand, the size of the array—as represented by the length
of each of its dimensions—is not part of an array’s type. This split is made clear in the language syntax, as
the length of each dimension is specified in the array creation expression rather than in the array type.”
So since the length is not part of an array’s type, what really is it. This is where I have kind of stopped research for a while. A question for another day. But I firmly believe the problem lies in the array not really in the for loop itself.
Jeff I am not sure that I am following the relation to your special case in the optimizer. Perhaps you can explain a bit more why you think there is a relation for this particular case.
Well if I would have to make a guess at why you are seeing this behavior and why it is by design it is because of the spec. For Loops have every Argument marked as optional. This might be the compilers way of handling it.
for (;;) {
}
Is valid code. So I would have to guess somewhere in the JIT compiler there is some logic we do not fully understand handling this and why it is by design. However I can’t say for sure.
This is fascinating stuff. I haven’t gone through all of your previous posts but it seems what you’re describing is quite an obvious optimization the JIT could perform. One quick thought, in the WillNotRemoveChecksStaticReference disassembly wouldn’t it also make more sense to pull out [esi+4] just once, putting it into edi, then doing a mov eax, edi to copy the value into eax, and then testing eax for 0… therefore eliminating an extra memory look up per iteration? Or is an extra instruction more costly than looking stuff up in the L1?