2010.07.31

A few days ago I was chatting with some friends, and the topic of caching came up. I mentioned MESI, which is the basis for modern multicore cache coherence (there are variants like MOESI and MERSI, but the general idea is the same). It then occurred to me that I’ve never actually made a test to see the effects of MESI in action. To begin, here’s an easy-as-pie pathological performance demonstration:

That above nonsense is me running the same piece of code (analyzed later) compiled with optimization (taking ~12.6 seconds to complete), and without optimization (taking ~5.6 seconds to complete). Wait a minute. The optimized build takes over 2x as long? If that’s not a teaser, I don’t know what is.

Modern CPUs have a few levels of cache between their registers (the fastest memory around) and the hard drive (the slowest, unless you count the network). You’ve probably heard of one of the most important caches, RAM — it’s filled with stuff from the hard drive (and temporary working data) because accessing the hard drive is so laughably slow that if you understood it you’d never refer to your computer making that grinding noise as “thinking” — it’s actually doing quite the opposite: not thinking, but waiting for the slow disk to feed it more data.

Compared to the CPU, the hard drive is a glacier — it takes hundreds of thousands of clock cycles to get a response. It’s like sending a letter across the ocean on a ship. Compared to the CPU, even RAM, which is orders of magnitude faster than the hard drive, is really slow — it takes hundreds of cycles to get a response. So to work around the slowness of RAM, a few more caches are in place: L1, L2, and sometimes L3. These are made of very fast, very expensive memory, and they’re often very small. L1 and L2 live on the CPU, with L1 being per-core, and L2 usually being shared among all cores on the CPU, and on some newer processors L3 is also on the CPU and also shared (but sometimes it’s off the CPU). The L1 cache is further divided into L1-I and L1-D, for Instructions and Data, but that’s a topic for a different day. Generally L1 will be the smallest and the fastest, L2 will be slower but larger, and L3 will be slower but larger still.

RAM is divided into bytes. In the old days, you’d even read and write it as individual bytes. (Some fancy-pants microcontrollers even offered bit addressing, but they typically worked on bytes and just hid that detail from you.) These days, it’s read and written in larger chunks (there are some exceptions, but generally reads and writes work on much, much larger units). These chunks are cache lines. A cache line on a current x86 processor is 64 bytes. So when your program reads the byte at location 0, behind your back the cache will also fetch bytes 1-63 for you. This may seem totally bizarre and wasteful, but the reason for it is that programs which ask for a value at a certain location will often ask for nearby locations shortly after (a phenomenon referred to as “spatial locality” or “locality of reference”). The cache will hold on to that entire line until it seems like the program’s not interested in it any longer, at which point it evicts it, and asks for a new line. You probably undergo a similar optimization when you’re putting on your socks and shoes: you grab them in pairs, and put them on one at a time. You could put one on, and then go and grab the second and put it on, but it’s easier to just get them all at once.

When a CPU core holds a cache line, it’s free to modify it as it needs to — this is known as the M state (for Modified). When a line is “M”, it no longer matches what’s in other caches and RAM — this is ok, as long as it makes it back there eventually. If a line in cache holds no valid data, it’s Invalid, or in the I state. If a CPU holds a line unmodified, and no one else has it, it’s “Exclusive”, or E state. And finally, when a line is held by 2 or more cores, it’s in the Shared, or S, state — this is where it begins to get complicated.

Let’s say that 2 cores are working on the same cache line. Core 1 changes one of the bytes. This means that Core 2’s version is Invalid (I state!). So when Core 2 needs to read a byte from that line, it notes that it needs to get the latest version — this involves asking Core 1 for its latest copy. Core 2 then changes a byte, invalidating Core 1’s line, so Core 1 then asks for Core 2’s copy, and the cycle repeats. This, as you can imagine, gets really slow — the cache line is bouncing around between cores, continually invalidating each other, causing them to stall.

For memory that doesn’t get written to (the instructions that make up the program, for example), this isn’t a problem — the lines are Shared (S state), and never modified, so both cores can happily operate simultaneously, each using their own identical copy with no stalls or invalidations.

What we’re doing is spawning 2 threads, and having each of them count at a certain location. Note that our timing is not particularly accurate — it’s intended to give us the general magnitude of the duration, not the fine-grained timing information we have in other profiling posts.

When optimization is disabled, the compiler doesn’t do any cleanup, and just uses the variables as they’re described at the top (value1, value2, padding1, padding2). Because of the padding variables (64 bytes of each, to be exact), we guarantee that our value1 and value2 variables are at least 64 bytes apart (thus, on different cache lines) in memory — this allows each CPU core to operate on a different cache line, never interrupting the other core.

When we enable optimization, the compiler reasons about the code some. It notices that we never actually use padding1 and padding2, and so it removes them (normally, fewer variables will improve performance, as the remaining live variables will be together in cache). In this case, it puts value1 and value2 adjacent (on the same cache line), unaware that they’re about to be hammered by 2 different cores with 2 different caches. So even though the threads are operating on distinctly separate regions of memory, those regions are on the same cache line, and thus performance issues emerge.

One simple way to fix this is to tell the compiler to align each of the variables to its own cache line using the “aligned” attribute. This will ensure that we have plenty of space around our values.

Variables aren’t typically the source of cache line contention like this; the more frequent culprits are spinlocks and mutexes — thankfully, on OS X mutexes are big enough to be cache-line aligned already, but spinlocks are not. So if you have a large block of spinlock definitions at the top of some source file, it might be useful to separate them some if contention actually rears its head.