Understanding Violations of Sequential Consistency

Memory consistency models have an almost mythical aura. They can puzzle the most experienced programmers and lead to bugs that are incredibly hard to understand and fix. If you have written multithreaded code, it is likely that you have stumbled upon memory model woes. Chances are that you have also lost bets with your colleagues because of memory consistency model disputes. In this blog post I will discuss some of the rationale of why memory models were created and give some specific examples of how that affects you.

Memory Consistency Models

First things first, lets define what a memory model is. A memory consistency model defines what values a given read operation may return. The simplest memory model is the Sequential Consistency model, in which each execution behaves as if there were a single global sequence of memory operations, and the operations of a given thread appeared to all threads in the same order as they appear in the program (program order). It is the most natural model for normal humans to think about, because the execution behaves as if it were run on a multitasking uniprocessor.

Consider the following example:

Two processors communicating through shared memory

The question is, what value can the read of data in P2 return? The most obvious answer here is 42. But what would you say if P2 observed the writes to data and flag in the opposite order? P2 could actually read data as “0″, which is surprising and not allowed by the sequential consistency memory model. And yet…

The main problem with sequential consistency is that it prevents systems from reordering memory operations to hide long latency operations and improve performance. For example, when a cache miss is being serviced, the processor could execute another memory access that comes after it in program order. That access may hit the cache and therefore complete earlier than the delayed access. Abandoning sequential consistency at the processor level results in dramatically improved performance.

But processors are not the only source of memory operation reordering. Many compiler optimizations effectively reorder code, e.g., loop-invariant code motion, common sub-expression elimination, etc. Furthermore, memory models of languages and memory models of the hardware they run on need not be the same. This is why compilers and synchronization libraries need to insert fences in the generated code. They have to map the language memory model to the hardware model. For example, Java and C++11 (the upcoming C++ standard) support memory models that guarantee sequential consistency for programs free of data races (although C++ also offers ways of relaxing sequential consistency without introducing races–the so called weak atomics).

Due the difficulty in improving performance under sequential consistency, a variety of “relaxed” memory models were conceived. For example, in the Weak Ordering memory model, there is no guarantee that a processor will observe another processor’s memory operations in program order. This is where a “memory fence” (a.k.a. “memory barrier”) comes into play. When a fence instruction is executed, it guarantees that all memory operations prior to it in program order are completed (and visible to other processors) before any operation after the fence in program order is allowed to proceed. You would be bored and stop reading if I described the multitude of consistency models in this post. However, I do encourage you to read more about memory models in this very nice tutorial by Sarita Adve and Kourosh Gharachorloo. Also, Paul McKenney’s paper has a nice table summarizing the ordering relaxations in modern microprocessors.

The x86 Memory Model

Now let’s talk about some of highlights of the x86 memory model. A big disclaimer first. This can change and probably does change between models, so it is always a good idea to check the manuals before endeavoring in sensitive code (8-8 Vol. 3 in this manual for Intel and Section 7.2 in this manual for AMD).

In a nutshell, recent implementations of the X86 ISA (P6 and on) follow, roughly, what is normally termed total-store order (TSO), which is stronger than “processor consistency” (and what people often think the x86 model to be). Its key ordering properties are:

reads are not reordered with respect to reads;

writes are not reordered with respect to reads that come earlier in the program order;

writes are not reordered with respect to most writes (excluding, e.g., multiple writes implicit in string operations like REP MOVSB);

reads may be reordered with respect to writes that come earlier in program order as long as those writes are to a different memory location;

reads are not reordered with respect to I/O instructions, locked instructions and other serializing instructions.

There are no guarantees whatsoever of ordering between writes of different processors, the outcome of concurrent writes to the same memory location is non-deterministic. Increment instructions have no atomicity guarantees; moreover, even some write operations that update multiple bytes are not guaranteed to be atomic (see Andy’s blog post). For example, if a write operation to multiple bytes happen to cross a cache line boundary, the operation is not guaranteed to be atomic.

Here is an example of how the x86 memory model can get you in trouble:

The outcome t1 == 0 and t2 == 0 is possible on the x86

An execution whose final state is t1 == 0 and t2 == 0 is allowed. Such an outcome is unintuitive, non-sequentially consistent, because there is no serialized execution that leads to this state. In any serialized execution, there will be an assignment in one processor (A = 1 or B = 1) prior to a read in the other processor (t1 = B or t2 = A).

Another way to look at the problem is to try to build a happens-before graph of the execution. In this representation, each node is an executed instruction. A directed edge from instruction P to instruction Q is drawn if Q has observed the effects of P, and P has not observed the effects of Q, so P “happens-before” Q. (Note that, in the C++11 model, “happens-before” has a slightly different meaning than what is used in this post.)

Here is the happens-before graph for the example above when the outcome is t1 == 0 and t2 == 0:

An attempt to draw happens-before arrows for non-sequentially-consistent execution

Edge (1) exists because the read t1 = B in P1 did not observe the write B = 1 in P2. The same applies to edge (2). Edges (3) and (4) are there because of program order. Since there is a cycle in the happens-before graph, there is no serialized order that would satisfy the happens-before relationship. Therefore, the execution is not sequentially consistent. What happened in this example is that the read operation t1 = B in P1 proceeded before the write operation in A = 1 had a chance to complete and become visible to P2.

Here is another example of how the x86 memory model leads to surprising results:

The outcome t2 == 0 and t4 == 0 is possible on the x86

The snippet of execution above might lead to an unexpected state where t2 == t4 == 0. Lets look at this from the perspective of P1. This may happen because the processor can quickly forward the value from the pending write to A (A = 1) to the read of A (t1 = A) on the same processor. In the meanwhile, the other processor, P2, can’t access this pending write (because it’s still waiting in the P1’s private write buffer) and reads the old value (t4 = A). Note that this outcome cannot be explained by a simple reordering of t2 = B and t1 = A! Intuitive, huh?

One final example for you to noodle about. Consider a boiled-down version of the Dekker’s mutual exclusion algorithm for two threads:

Unintuitive outcome of a boiled-down Dekker's algorithm: both threads enter the critical section

The gist of the algorithm is to use two flag variables, one for each processor, flag1 and flag2. P1 sets flag1 when it is attempting to enter the critical section, it then checks if flag2 is set; if it is not set, it means P2 has not attempted to enter the critical section, so P1 can safely enter it. Because the x86 memory model allows reordering of loads with respect to earlier stores, the read of flag2 can proceed before the setting of flag1 is completed, which can lead to both processors entering critical sections, since P2 might have just set flag2!

That is it! I hope this helped you get a better grasp of what a memory consistency model is and understand a few of the key aspects of the x86 model. And, if you come across something that looks like a memory consistency bug, try building that happens-before graph to find cycles and remember to look at the manual 🙂 . Have fun!

Share this:

Like this:

Related

4 Responses to “Understanding Violations of Sequential Consistency”

Thanks Luis. I’d like to point out that this TSO is a good model for “normal” x86 loads/stores to WB cacheable memory.

As soon as you start using non-temporal loads/stores or other memory types, the rules are a lot more complicated. But don’t worry about it. Unless the programmer specifically ask for these the compiler and OS won’t do anything except normal loads/stores to WB cacheable memory.

I disagree with “Note that this outcome cannot be explained by a simple reordering of t2 = B and t1 = A!”, as the manual says “(…) reads can be carried out speculatively and in any order, reads can pass buffered writes (…)”. IMHO, it’s correct to say that load-load reorderings are not allowed except for store buffer forwarding, which means that in practice load-load reorderings are in fact allowed. (There should be courses about the exegesis of hardware manuals.)

Another key aspect is that “memory ordering respects transitive visibility”. Not having this property can lead to even less intuitive scenarios!

With your permission I would like to quote this post in a talk I’ll be giving next Wednesday ( http://seminaire.lrde.epita.fr/ – Reusable Generic Look Ahead Multithreaded Cache – a case study for a high resolution player ). Shall I ?