More about cache contention

By Jim, on December 29th, 2011

When I started working on yesterday’s blog entry about cache contention, I built an example program that used an array to illustrate the problem. That is, rather than having a struct that contains four counters, I just allocated an array. It made the code somewhat simpler, as you can see here.

The purpose of the Multiplier in that program will become evident soon.

Run with a single thread, that code executes in about 1,700 ms on my work machine–same as the version that uses a struct. But run with two threads, the code takes a whopping 25 seconds! At first I thought that this was evidence of cache contention, so I changed the Multiplier to 8, which spreads out the counters so that they’re guaranteed to be on different cache lines. That is, the first thread will access c[0], and the second thread will access c[8].

That change did indeed improve the run time. The two thread case went from 25 seconds to about 12 seconds. Cache contention was part of the problem, but certainly not all of it. Remember, the two-thread version of my first test yesterday ran in about 2,200 ms on my system.

I ruled out array indexing overhead, figuring that if it was a problem it would show up in the single-threaded case. After ruling out everything else, I was left with two possibilities: either there’s some explicit mutual exclusion going on in the runtime, or there’s some other cache contention that I didn’t take into account.

It turns out that there is more cache contention. You just can’t see it directly because it has to do with the way that arrays are allocated.

When the runtime allocates an array, it allocates enough space for the array contents plus a little extra memory to hold metadata: the number of dimensions, the size of each dimension, etc. This is all contained in a single memory allocation. The layout of an array in memory looks like this:

The array metadata is smaller than 64 bytes, so the chances of the first array element sharing the same cache line as part or all of the metadata is very high.

That’s half of the problem. The other half of the problem is that whenever code needs to access an element in the array, it has to read the metadata in order to do bounds checking and to compute the offset into the memory block. So whenever you write x = a[i] or a[i] = x, the code accesses that metadata.

If the first array element is on the same cache line as the parts of the metadata used for indexing, then every time you modify that first element, any other thread’s access to the metadata is going to incur a wait for the cache to be flushed. Modifying the first array element invalidates the cached metadata.

The reason it’s worse with arrays than with yesterday’s program is because every time the code executes --c[x], it actually makes two array accesses: one to read the current value, and one to write the modified value. Every array access makes multiple requests for the metadata, so there can be multiple stalls per iteration. That’s not true when accessing fields in a structure like we did yesterday.

The solution is to put some padding in the front of the array, as well as between the items we’re incrementing. As it stands right now, the indexes being used are 0, 8, 16, and 24. Shifting that right by eight elements would give us 8, 16, 24, and 32. That’s an easy change to make, as you can see here.

I just showed the parts of the program that have to change. Instead of computing the index using Multiplier * i, I used Multiplier * (i + 1), the counters array is allocated with Multiplier * (numThreads + 1).

If you compile and run that program, you’ll see that the results for one, two, and three threads are almost identical. The result with four threads will be slightly higher, again because system services take a small amount of processor time away from the test program.