Understanding the C/C++ memory model

I know that a lot of people struggle with understanding the memory model introduced in C11/C++11. In a recent conversation I was alerted to the existence of this blog post, which while correct (by my understanding) is in my opinion aimed at those who already have a good understanding of some of the underlying concepts. I’m going to try to set out my understanding of the model, which is hopefully correct, in more straightforward terms and with a more “ground up” explanation.

Edit: some redditors complained that the description of cache operation (such as it was) in the original version of this post contained technical inaccuracies, if interpreted as a description of current actual mainstream architecture. I believe those inaccuracies were not important for the understanding of the C/C++ memory model, however, they have now been rectified. Still: If you really want to understand anything about how processor architecture works, this is not the article for you. Rather, this is about a software model that is designed to abstract away much of that detail, and is deliberately light on hardware-level specifics.

Introduction: Program order versus memory order

You have your C/C++ program and it stores values (sets a variable or stores a value through a pointer) and it loads values (reads a variable or reads a value through a pointer). The order in which your program specifies these loads and stores is the “program order”. However, a number of things can cause the order in which the stores and loads actually reach your system’s main memory to differ from the program order. It’s worth noting that the effect of these can never be visible to a single-threaded program, because the compiler must make sure that the code behaves as if the loads/stores occur in the order that the program specifies. Some examples:

the compiler will often cache values in registers rather than storing them to memory immediately;

the compiler may re-order reads and writes to memory when it can determine that such re-ordering won’t affect program behaviour (so long as the program obeys the language rules);

the processor may perform out-of-order execution in some cases;

even when the generated code stores values to memory rather than a register, the store might go into a processor cache and stay there for some time before it goes through to main memory;

similarly, loads may read values which are already in the cache, which means they don’t need to be read from main memory (at all);

finally, when a cache writes lines back out to main memory it may do so in an order that is different to the order in which the values entered the cache.

The good news is, you don’t really need to understand any of the above points: all you need to understand is that loads and stores can be re-ordered before they go to memory, but this won’t affect the actual behaviour of a single-threaded program.

Now for the important part: in a multi-threaded program, some of these re-orderings – together with some additional issues associated with the methods used by the hardware for maintaining cache coherency – can also cause different threads to see loads and stores made by other threads as occurring in a different order. It’s probably worth noting that a program in which this actually happened could well be exhibiting undefined behaviour – what should happen is that the program uses the memory consistency primitives to make sure that it sees a “correct” ordering. We’re coming to that now.

Atomic operations

If we forget about standard synchronisation tools such as mutexes, then there are two essential mechanisms to making ordering consistent between threads: atomic operations, and memory barriers.

Atomic operations are operations which complete indivisibly. One classic example is the “atomic increment”. A “classic” (non-atomic) increment can be broken into three separate parts: it loads a value from memory, it increments the value, and then it stores the value back to the same location in memory. In multi-threaded programs, the problem with this is that the value might be changed by another thread in between the load and the store, which results in the wrong value being stored; the intermediate store is lost. In particular, consider two threads that are both trying to increment a value starting at 0:

Thread A loads the value 0

Thread B loads the value 0

Thread A increments the value, calculating a result of 1

Thread B also calculates a result of 1

Thread A stores the value 1

Thread B stores the value 1

The problem is that the value has gone from 0 to 1, despite there being two increments – one of the increments got lost. (Again, if this actually happened in a C/C++ program, that program would be exhibiting undefined behaviour). An atomic increment happens as one step, atomically (indivisibly). So, to avoid the problem above, you could use an atomic integer type and an atomic increment operation.

Another classic atomic operation is “compare and exchange”. This can be used to implement a spin lock (a kind of mutex) – you treat a value of 0 as “unlocked” and a value of 1 as “locked”. Then, to lock the mutex, a thread can atomically compare the current value with 0 and, if it is equal to 0, set it to 1. The operation returns the original value – if 0, the lock was successfully acquired; if 1, another thread holds the lock and the operation must be retried until that thread releases it (typically this would be done in a loop, “spinning” on the lock until it becomes available).

Before we go any further, let’s note the following important points about atomic operations:

Atomic operations on a particular object (variable) always have a total ordering which is consistent with program ordering. This means that every atomic operation from every thread which operates on the same atomic variable happens either before or after every other such operation, and if a thread performs two operations one after the other then they will appear in the same sequence in the total ordering.

If an atomic operation is performed in one thread and another thread then performs another atomic operation on the same variable, the second thread will definitely see the effect of the operation performed by the first (assuming there are no other threads stepping in to perform another operation in the meantime).

What atomic operations do not necessarily do is enforce ordering on any loads and stores that occur around them (I say necessarily because, as you may be aware, you can specify memory ordering constraints when you perform an atomic operation*. We’re getting to that).

Consider the case of the spin lock that we discussed above: the lock is probably designed to protect some data structure. So you might have a sequence that looks like this:

Thread A acquires lock

Thread A stores values to data structure

Thread A releases lock

Thread B acquires lock

Thread B reads values from data structure

Thread B releases lock

The problem is that the atomic operations used to implement the lock by themselves don’t impose ordering on steps 2 and 5, so thread B might not see the values written by thread A, or even worse, might see some of the values but not all of them (this is a data race, and is not actually allowed to happen in a correct program; a correct program would use the appropriate memory ordering controls to prevent it from happening).

To prevent re-ordering of non-atomic loads and stores, we use barriers (aka fences).

Barriers

A barrier prevents loads and stores from being re-ordered past the barrier, in one or in both directions. Specifically: an acquire barrier operation prevents loads/stores which are after the barrier from being re-ordered to appear before the barrier, and a release barrier prevents loads/stores which are before the barrier from being moved after the barrier (there are other types of barrier operation, but I won’t cover them here right now).

Through the use of appropriate barriers, we can almost fix our spinlock example:

Thread A acquires lock

Thread A issues an acquire barrier

Thread A stores values to the data structure

Thread A issues a release barrier

Thread A then releases the lock

Thread B acquires the lock

Thread B issues an acquire barrier

Thread B reads values from the data structure

Thread B issues a release barrier

Thread B then releases the lock

There is still one minor issue, which is that the operations which actually acquire and release the lock could be moved past their corresponding barriers (e.g. step 5 could be re-ordered in front of step 4). There are two ways that C++11 resolves this:

First, you can combine atomic operations which acquire or release some lock with a barrier, so that the operation and the barrier are essentially indivisible (this is done by passing the appropriate memory ordering constraint constant as a parameter to the atomic operation);

Second, a standalone barrier (std::atomic_thread_fence) actually acts as a stronger barrier than what is stated above. Specifically, a standalone release barrier actually also prevents subsequent stores – including atomic stores – from being moved in front of the barrier, and a standalone acquire barrier also prevents earlier loads from being moved behind the barrier.

Conclusion and key take-aways

Some key points that I wanted to cover, which some more advanced tutorials tend to gloss over:

Memory barriers are useless without some kind of synchronisation (which atomic operations provide). Memory barriers do not themselves provide any synchronisation. Atomic operations themselves do not provide a memory barrier;

Atomic operations can have a memory ordering constraint specified as a parameter. This effectively creates a barrier of the specified type, and forces the atomic operation to stick on the appropriate side of that barrier.

There are of course a lot of things I didn’t cover here, but there’s plenty of other material out there which covers the concepts in more detail. I hope the explanations here, which I’ve tried to reduce to the bare essentials, are helpful to some.

* Edit: it should be noted that the default ordering constraint on atomic operations is actually memory_order_seq_cst, which implies both acquire and release semantics (depending on the operation) and provides additional guarantees beyond that. So, while an atomic operation doesn’t itself imply a memory barrier, an atomic operation with the default memory order constraint does. It may be preferable to use acquire/release semantics explicitly, however, since the default constraint can affect performance.

This is wrong. It is true that atomic operations on many platforms do not serve as a full memory barrier. However on many of them atomic load serves as a load barrier and atomic store serves as a release barrier. This is certainly the case for x86. The C11/C++11 memory model removes any platform differences and requires that any compliant implementation treat atomics as an acquire/release barrier.

> Atomic operations can have a memory ordering constraint specified as a parameter. This effectively creates a barrier of the specified type, and forces the atomic operation to stick on the appropriate side of that barrier.

No. C11 mandates that atomic operations always create acquire/release barriers. The parameters there are to relax the barrier semantics in performance-critical code on platforms where full acquire/release barriers are too expensive; this is almost exclusively limited to ARM.

Just stick with the default unless you are 1) on ARM and 2) seeing your performance hurt.

>> Atomic operations themselves do not provide a memory barrier
> This is wrong. It is true that atomic operations on many platforms do not serve as a full memory barrier. However on many of them atomic load serves as a load barrier and atomic store serves as a release barrier.

It is not a statement about hardware. It is a statement about the language-defined model. I believe it is correct. Furthermore, it is generally unsafe to assume particular characteristics even when the hardware characteristics are known. An appropriate memory barrier also prevents the compiler itself from re-ordering loads/stores across the barrier.

>> Atomic operations can have a memory ordering constraint specified as a parameter. This effectively creates a barrier of the specified type, and forces the atomic operation to stick on the appropriate side of that barrier.

> No. C11 mandates that atomic operations always create acquire/release barriers. The parameters there are to relax the barrier semantics in performance-critical code on platforms where full acquire/release barriers are too expensive

The default is actually memory_order_seq_cst, which is stronger than acquire/release. However the statement you quoted specifically concerns the case where a specific semantic is supplied – one of memory_order_acquire or memory_order_release, i.e. the two I actually discuss – and again, I believe it is correct.

> I do agree “sequentially consistent acquire and release” is not a full barrier. But it doesn’t mean atomics don’t produce barriers at all.

That is exactly wrong.
1. “memory_order_seq_cst” does imply a full barrier (and more besides).
2. On the other hand, atomic operations with no memory ordering constraint (i.e. using any atomic_* function with the constraint parameter specified as memory_order_relaxed) do not provide a barrier.

In my understanding: there are (at least) three types of memory barriers proposed in C/C++11 for building data-race-free programs: acquire, release and a full memory barrier. The acquire and release are of the “sequentially consistent” flavor; more on that later.

The acquire barrier prevents loads and stores from being reordered to before the barrier, but allows them to be moved to after it. An atomic load is, by default, an acquire.

The release barrier prevents loads and stores from being reordered to after the barrier, but allows them to be moved to before it. An atomic store is, by default, a release.

The above two rules mean that more loads and stores can be moved into a critical section but no loads and stores can be moved out.

According to the above rules, an acquire cannot be moved past a release. But a release barrier can be moved past an acquire (possibly of a different critical section). This is why we need memory_order_seq_cst. It simply closes this loophole and prevents any acquire/release barriers from being reordered. This is called “sequentially consistent acquire/release” and it allows for building sequentially consistent data-race-free programs.

A full barrier would simply prevent any loads and stores from being reordered in either direction around the barrier. This kind of barrier puts many more restrictions on the compiler and CPU than a “one-directional” acquire/release kind.

> On the other hand, atomic operations with no memory ordering constraint (i.e. using any atomic_* function with the constraint parameter specified as memory_order_relaxed) do not provide a barrier.

This is correct. However the default is memory_order_seq_cst. I don’t see any reason to use non-default here.

We were discussing (ed.) the memory_order_seq_cst constraint, which does provide a full barrier if used with a read-modify-write operation (atomic_compare_exchange_weak for example). But you’re right that it does not do so with just a load or store operation (I believe a program that could observe this limitation would have to be pretty funky).

> But a release barrier can be moved past acquire (possibly of a different critical section)

Not, as far as I understand, if they are correctly protected by the same atomic variable, because operations on a single atomic variable observe a total ordering, and the atomic operations are pegged to the barrier. From the post: “… the operation and the barrier are essentially indivisible”.

> This is correct. However the default is memory_order_seq_cst. I don’t see any reason to use non-default here

For performance. Even if you’re lucky enough to have a single target architecture, and it’s one where the default ordering doesn’t impose a significant cost, memory_order_seq_cst still prevents the compiler from re-arranging loads/stores around the barrier in a way that a weaker ordering would not, thus limiting optimisation.

By all means, stick with the default ordering constraint if you’re more comfortable doing that, though.

> Not, as far as I understand, if they are correctly protected by the same atomic variable

Note, that’s a single atomic variable. If you have multiple atomic variables used for some kind of locking, then you may need sequential consistency. But that’s not what the examples in the post are showing.

If you can point at a specific part of the video that you think contradicts something I’ve said, I’m happy to look at it, but I’m not prepared to watch all 80 minutes just for the sake of this conversation.

When dealing with things like background I/O, it’s often helpful to have a function receive a pointer to a block of memory and initiate some process that operates upon it outside the control of the compiler, and then have a function or functions that can indicate whether such an operation is complete and/or forcibly terminate it. User code would be expected not to use the memory in question while the external operation is in progress. The usefulness of this would be severely undermined if it were necessary to declare memory in question “volatile”.

Is there any nice way to achieve that on an implementation that can’t meaningfully implement an atomic 64-bit compare-and-swap? Note that “atomic” operations that aren’t lock-free will be meaningless on a free-standing implementation whose environment provides no standard means of coordination.

I don’t think I fully understand your question. What you’re talking about sounds a lot like POSIX asynchronous I/O, and as far as I understand, use of that interface requires neither volatile nor compare-and-swap (and I don’t see why it would). Could you elaborate?

For this to work for data transmission, it’s necessary that the compiler not reorder any operations on the buffer across calls to start_sending or send_bytes_left. Unfortunately, declaring tx_data and tx_count volatile is not sufficient to do that. I think the memory barriers in <stdatomic.h> may suffice, but the Standard provides no means for an implementation to indicate that it supports anything in <stdatomic.h> unless it can implement everything. It allows implementations to indicate that certain operations aren’t done in lock-free fashion, but presupposes that they can be meaningfully handled in some possibly-blocking fashion.

I understand. I’m a little fuzzy on the specifics, but I think memory barriers should suffice, yes.

> It allows implementations to indicate that certain operations aren’t done in lock-free fashion, but presupposes that they can be meaningfully handled in some possibly-blocking fashion.

I don’t think that matters; the support for memory barriers isn’t optional (except as part of atomics as a whole).

Edit: and in case this was significant in your question, take the statement in the post about memory barriers being useless without synchronisation as meaning that they are useless for inter-thread communication, not that they do nothing at all.

A problem I see with the memory barriers in <stdatomic.h> is that the Standard provides no mechanism by which an implementation can say that it supports a basic memory barrier without also saying that it supports things in <stdatomic.h> that will be unsupportable on many implementations. The Standard allows implementations to say that their implementations of things like compare-and-swap aren’t lock-free, but doesn’t allow them to say that those operations won’t work reliably between threads.

In embedded systems code, what a programmer often needs is a guarantee that a compiler won’t reorder certain things. If the programmer configures the hardware cache circuitry, the programmer can make sure that any cache synchronization operations that are needed get performed, but if a compiler moves a non-qualified store across a volatile store to a “force synchronization” register, the hardware synchronization won’t do any good.

Personally, I think there should be a recognizable category of implementations that guarantee that if code calls a function which performs a volatile access, anything that happens before the function call will be done before the first access in the function, and anything that occurs after the function call won’t happen until after the last volatile access in the function. Such a guarantee would be “naturally” supported by any compiler which either avoids reordering things across any volatile access, or avoids reordering things across function boundaries, and would ensure compatibility with a lot of code that would otherwise need memory barriers. If memory barriers need to be done manually, however, there should be a simple way of handling a basic compiler barrier (something *every* implementation should be capable of supporting) that would be available even on platforms that couldn’t handle any other aspects of <stdatomic.h>.