2009.07.02

(apologies in advance — this will be a rather technical post. It’s eventually intended for fdiv.net, but while it’s getting migrated I figured I’d plop it down here.)

Instance variables (or, more briefly, ivars) are a pretty simple concept in object-oriented languages. They’re data carried along inside an object, helping to define its state. While it would be fun to elaborate more on this, this particular article isn’t intended to teach the basics of OOP. So if you’re unsure of what an ivar is, this article probably isn’t for you.

In Objective-C (and possibly other languages), ivars have an interesting property: they behave as though they were volatile. volatile is one of those shady regions of C and C-derived languages (like Objective-C) that few people talk about. What it means is that the data behind the variable can change at any moment, so the compiler should reload it from memory whenever it’s used. (The alternative is to load it once and keep it in a register; that’s typically simpler and faster, but sometimes wrong if the data is in fact volatile.) This is all well and good, and it performs exactly as intended. However, all that reloading brings some negative performance characteristics along with it.
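To make the semantics concrete, here’s a minimal C sketch (the names here are mine, not from any real code):

```c
// With 'volatile', the compiler must reload 'done' from memory on
// every pass through the loop. Without it, an optimizer would be free
// to read 'done' once, keep it in a register, and spin forever even
// if another thread (or a signal handler) sets it.
volatile int done = 0;

void spin_until_done(void) {
    while (!done) {
        /* wait: each test of 'done' is a fresh load from memory */
    }
}
```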

So, in a nutshell: we’ve got a Vertex structure, which holds an x, y, and z coordinate. Then, we have a PointCloud object, which holds a bunch of points. It also stores a max vertex, and a min vertex. These would be used to create crude axis-aligned bounding boxes, or for set normalization or something.
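In plain-C terms, the layout looks something like this (field names are assumptions; in the Objective-C original these would be the PointCloud’s ivars):

```c
#include <stddef.h>

// One point in the cloud.
typedef struct {
    float x, y, z;
} Vertex;

// A C stand-in for the PointCloud object. In Objective-C these fields
// are ivars, and every access compiles down to a load or store through
// the hidden self pointer, much like cloud->max would be here.
typedef struct {
    Vertex *vertices;  // the points themselves
    size_t  count;     // how many points are stored
    Vertex  max, min;  // extreme corners, for bounding boxes etc.
} PointCloud;
```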

And our task is finding the maximum and minimum x, y, and z values in the cloud. (bonus points for using the phrase “in the cloud” without using buzzwords ;). The algorithm to do this is pretty simple: set our max/min value to the initial point in the cloud, and then check every other point to see if it’s higher than our max, or lower than our min. If it is, set the max or min as needed, and continue. In Computer Science class, they’ll tell you that this algorithm is O(N), which doesn’t suck too severely. I believe it’s the optimal way to solve this problem (for an unordered set, at least), but I could be mistaken.

Anyway, to solve our problem, I have supplied 3 example methods: calculateMaxMin1, calculateMaxMin2, and calculateMaxMin3. They all follow the same pattern, but they have measurably different performance characteristics.

Let’s talk about these methods. The first method uses only ivars. The loop checks against count (an ivar), and it updates max/min (ivars) based on the vertices array (also an ivar).
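Since the original listing isn’t reproduced here, this is a plain-C sketch of what calculateMaxMin1 does: every access goes through the object pointer, the way ivar access goes through self (the struct layout and names are assumptions):

```c
typedef struct { float x, y, z; } Vertex;
typedef struct { Vertex *vertices; unsigned count; Vertex max, min; } PointCloud;

// Ivar-style version: max, min, and count are all re-read through
// 'self' on every iteration, so the compiler keeps reloading them
// from memory instead of caching them in registers.
void calculateMaxMin1(PointCloud *self) {
    self->max = self->min = self->vertices[0];
    for (unsigned i = 1; i < self->count; i++) {
        Vertex v = self->vertices[i];
        if (v.x > self->max.x) self->max.x = v.x;
        if (v.y > self->max.y) self->max.y = v.y;
        if (v.z > self->max.z) self->max.z = v.z;
        if (v.x < self->min.x) self->min.x = v.x;
        if (v.y < self->min.y) self->min.y = v.y;
        if (v.z < self->min.z) self->min.z = v.z;
    }
}
```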

The second method, calculateMaxMin2, is basically the same as calculateMaxMin1, except that instead of re-reading the vertexCount ivar on every pass through the loop, it caches the count in a local variable called “lCount”.
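Again as a hedged C sketch (same assumed layout as before), only the loop bound changes:

```c
typedef struct { float x, y, z; } Vertex;
typedef struct { Vertex *vertices; unsigned count; Vertex max, min; } PointCloud;

// Like calculateMaxMin1, but the count is read from the object once
// and cached in lCount, so the loop-bound reload disappears.
void calculateMaxMin2(PointCloud *self) {
    unsigned lCount = self->count;  // cache the ivar in a local
    self->max = self->min = self->vertices[0];
    for (unsigned i = 1; i < lCount; i++) {
        Vertex v = self->vertices[i];
        if (v.x > self->max.x) self->max.x = v.x;
        if (v.y > self->max.y) self->max.y = v.y;
        if (v.z > self->max.z) self->max.z = v.z;
        if (v.x < self->min.x) self->min.x = v.x;
        if (v.y < self->min.y) self->min.y = v.y;
        if (v.z < self->min.z) self->min.z = v.z;
    }
}
```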

The third and final method, calculateMaxMin3, still follows the above pattern, but uses local variables for the count, the max/min values, and the vertices array.
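One more hedged C sketch, under the same assumed layout:

```c
typedef struct { float x, y, z; } Vertex;
typedef struct { Vertex *vertices; unsigned count; Vertex max, min; } PointCloud;

// Everything the loop touches is a local: the count, the vertices
// pointer, and the running max/min. The results are written back to
// the object exactly once, after the loop finishes.
void calculateMaxMin3(PointCloud *self) {
    unsigned lCount    = self->count;
    Vertex  *lVertices = self->vertices;
    Vertex   lMax = lVertices[0], lMin = lVertices[0];
    for (unsigned i = 1; i < lCount; i++) {
        Vertex v = lVertices[i];
        if (v.x > lMax.x) lMax.x = v.x;
        if (v.y > lMax.y) lMax.y = v.y;
        if (v.z > lMax.z) lMax.z = v.z;
        if (v.x < lMin.x) lMin.x = v.x;
        if (v.y < lMin.y) lMin.y = v.y;
        if (v.z < lMin.z) lMin.z = v.z;
    }
    self->max = lMax;
    self->min = lMin;
}
```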

A casual glance at the code would lead one to believe that the running time for any of the three methods would be similar (with the last one possibly taking a teeny-tiny bit longer because it’s doing a couple more copies… probably immeasurably small though, as in 15 nanoseconds tops). Instead of guessing though, we’ll run it and see what happens.
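The original timing program isn’t shown here; a rough C harness in the same spirit (clock() is coarse but portable, and the function and variable names are mine) might look like:

```c
#include <stdlib.h>
#include <time.h>

typedef struct { float x, y, z; } Vertex;

// Fill n vertices with pseudo-random coordinates in [0, 1].
static void fill_random(Vertex *pts, size_t n) {
    for (size_t i = 0; i < n; i++) {
        pts[i].x = rand() / (float)RAND_MAX;
        pts[i].y = rand() / (float)RAND_MAX;
        pts[i].z = rand() / (float)RAND_MAX;
    }
}

// Time one max/min pass over n points; returns elapsed seconds and
// writes the extremes through the out-parameters.
static double time_maxmin(const Vertex *pts, size_t n,
                          Vertex *outMax, Vertex *outMin) {
    clock_t start = clock();
    Vertex max = pts[0], min = pts[0];
    for (size_t i = 1; i < n; i++) {
        Vertex v = pts[i];
        if (v.x > max.x) max.x = v.x;
        if (v.y > max.y) max.y = v.y;
        if (v.z > max.z) max.z = v.z;
        if (v.x < min.x) min.x = v.x;
        if (v.y < min.y) min.y = v.y;
        if (v.z < min.z) min.z = v.z;
    }
    *outMax = max;
    *outMin = min;
    return (clock() - start) / (double)CLOCKS_PER_SEC;
}
```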

Running the program (for 4,194,304 points), we get output along these lines:

Method1: 0.044 seconds
Method2: 0.039 seconds
Method3: 0.025 seconds

Interpreting the results, we see that Method1 takes about 0.044 seconds to run, Method2 takes a little less, at 0.039 seconds, and Method3 takes a stunning 0.025 seconds to complete the exact same task.

Why does Method3 take so much less time? It takes just 55% as long as Method1, yet it’s doing the same amount of work!

The answer lies in how the compiler/optimizer treats ivars compared to local variables.

Let’s pull out some assembly output to see what’s happening behind the scenes. (This is going to get a bit more complicated, but don’t worry!)

Note all the jumps (jbe, highlighted in red), the overall length (offsets +59 to +180, i.e. 121 bytes in total), and how our ivars are loaded and stored (movss (%ecx,%eax),%xmm0 loads max.x, for example, highlighted in blue).

No jump instructions, except for the loop’s jb at the end. The size runs from +62 to +142 (80 bytes, about a third smaller!), and most of our variables live in registers (note how xmm0 through xmm7 are all utilized, and %ebp (stack spillage) is only touched twice, at 0xf4 and 0xf8). Aside from the blue-highlighted load/store lines, everything lives in a register, ready for screaming-fast access. (Granted, most ivars will live in L1 cache, so they’re only half as fast as registers, but as we see from the profiling above, even that half-speed hit is measurable!)

So, let’s analyze this a bit. Our loop code’s smaller (so it can complete about 33% more loops per unit of time). But that’s only part of the benefit. We also get no jumps. No jumps means no mis-predicted jumps, which helps the CPU pipeline stay loaded. And finally, we have almost no memory loading except for the vertex data itself (and a tiny bit of stack spillage). This means more CPU-RAM (or CPU-L1 Cache even) bandwidth is spent on loading relevant data, instead of reloading/storing ivars all the time.

Why are ivars implemented this way? Because a method needs to work on fresh data at all times: if another thread comes along and changes an ivar on the object, it wouldn’t make sense to keep using stale data. Local variables, in contrast, aren’t at all likely to get modified by another thread (you can do it, but it takes a rather contrived setup). As such, the compiler knows exactly when a local’s value changes, and can let it live on the CPU for as long as possible, for optimum speed.

Disclaimer stuff: This is a silly problem to solve, but it illustrates the point nicely. There are better solutions (using SSE to check four values at once and then coalescing at the end, for example), but that’s not the point. Here we see how we can speed up loop-intensive code (by almost a factor of 2!) simply by avoiding ivars in heavily trafficked loops.