Dealing with Benign Data Races the C++ Way

Data races lead to undefined behavior; but how bad can they really be? In my previous post I talked about benign data races and I gave several examples taken from the Windows kernel. Those examples worked because the kernel was compiled with a specific compiler for a specific processor. But in general, if you want your code to be portable, you can’t have data races, period.

You just cannot reason about something that is specifically defined to be “undefined.” So, obviously, you cannot prove the correctness of a program that has data races. However, very few people engage in proofs of correctness. In most cases the argument goes, “I can’t see how this can fail, therefore it must be correct” (maybe not in so many words). I call this “proof by lack of imagination.” If you want to become a concurrency expert, you have to constantly stretch your imagination. So let’s do a few stretches.

One of the readers of my previous post, Duarte Nunes, posted a very interesting example of a benign data race. Here’s the code:
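A sketch of Duarte’s recursive lock follows; ‘SpinLock’ and ‘GetCurrentThreadId’ are illustrative placeholder names, and the comments mark which accesses happen under the lock:

int GetCurrentThreadId();  // placeholder: returns a nonzero id for the calling thread

struct SpinLock {          // placeholder: an ordinary, non-recursive lock
    bool TryEnter();
    void Exit();
};

class RecursiveLock {
public:
    RecursiveLock() : owner(0), count(0) {}

    bool TryEnter() {
        int self = GetCurrentThreadId();
        if (owner == self) {      // RACY READ: 'owner' is read outside the lock
            ++count;              // under the lock (this thread already holds it)
            return true;
        }
        if (lock.TryEnter()) {
            owner = self;         // under the lock
            return true;
        }
        return false;
    }

    void Exit() {                 // called only after the lock has been acquired
        if (count != 0) {
            --count;              // under the lock
            return;
        }
        owner = 0;                // under the lock, before the release
        lock.Exit();
    }

private:
    int owner;
    int count;
    SpinLock lock;
};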

The parts executed under the lock are marked with comments in the listing (in a correct program, ‘Exit’ will always be called after the lock has been acquired). Notice that the variable ‘count’ is always accessed under the lock, so no data races there. However, the variable ‘owner’ may be read outside of the lock, in ‘TryEnter’. That’s the data race we are talking about.

Try to analyze this code and imagine what could go wrong. Notice that the compiler or the processor can’t be too malicious. The code still has to work correctly if the data race is removed, for instance if the racy read is put under the lock.

Here’s an attempt at the “proof” of correctness. First, Duarte observed that “The valid values for the ‘owner’ variable are zero and the id of any thread in the process.” That sort of makes sense, doesn’t it? Now, the only way the racy read can have any effect is if the value of ‘owner’ is equal to the current thread’s ID. But that’s exactly the value that could have been written only by the current thread, and only under the lock.

There are two possibilities when reading ‘owner’: either we are still under that lock, in which case the read is not at all racy; or we have already released the lock. But the release of the lock happens only after the current thread zeroes ‘owner.’

Notice that this is a single-thread analysis and, within a single thread, events must be ordered (no need to discuss memory fences or acquire/release semantics at this point). A read following a write in the same thread cannot see the value that was there before the write. That would break regular single-threaded programs. Of course, other threads may have overwritten this zero with their own thread IDs, but never with the current thread ID. Or so the story goes…

Brace yourself: here’s an example of how a compiler or the processor may “optimize” the program:
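A sketch of such a transformation, applied to the ‘Exit’ method above (the value 42 is arbitrary; think of the compiler momentarily reusing ‘owner’ as scratch storage):

void Exit() {
    if (count != 0) {
        --count;
        return;
    }
    owner = 42;   // dummy write: unobservable from this thread,
    owner = 0;    // because it is immediately overwritten
    lock.Exit();
}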

You might argue that this is insane and no compiler in its right mind would do a thing like this; but the truth is: It’s a legal program transformation. The effect of this modification is definitely not observable in the current thread. Neither is it observable by other threads in the absence of data races. Now, the unfortunate thread whose ID just happens to be 42 might observe this value and take the wrong turn. But it can only observe it through a racy read. For instance, it would never see this value if it read ‘owner’ under the lock. Moreover, it would never see it if the variable ‘owner’ were defined as ‘atomic’:

std::atomic<int> owner(0);

Stores and loads of atomic variables are, by default, sequentially consistent. Unfortunately sequential consistency, even on an x86, requires memory fences, which can be quite costly. It would definitely be overkill in our example. So here’s the trick: Tell the compiler to forgo sequential consistency on a per-read/write basis. For instance, here’s how you read an atomic variable without imposing any ordering constraints:

owner.load(std::memory_order_relaxed)

Such ‘relaxed’ operations will not introduce any memory fences, at least not on any processor I know about.

Here’s the version of the code that is exactly equivalent to the original, except that it’s correct and portable:
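A sketch mirroring the earlier one; the only substantive change is the type of ‘owner’ and the explicit relaxed loads and stores:

#include <atomic>

class RecursiveLock {
public:
    RecursiveLock() : owner(0), count(0) {}

    bool TryEnter() {
        int self = GetCurrentThreadId();
        // A relaxed load: still unordered, but no longer a data race.
        if (owner.load(std::memory_order_relaxed) == self) {
            ++count;
            return true;
        }
        if (lock.TryEnter()) {
            owner.store(self, std::memory_order_relaxed);  // under the lock
            return true;
        }
        return false;
    }

    void Exit() {                 // called only after the lock has been acquired
        if (count != 0) {
            --count;
            return;
        }
        owner.store(0, std::memory_order_relaxed);  // under the lock
        lock.Exit();
    }

private:
    std::atomic<int> owner;
    int count;
    SpinLock lock;
};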

So what has changed? Can’t the compiler still do the same dirty trick on us, and momentarily store 42 in the ‘owner’ variable? No, it can’t! Since the variable is declared ‘atomic,’ the compiler can no longer assume that the write can’t be observed by other threads.

The new version has no data races: The (draft) Standard specifically states that ‘atomic’ variables don’t contribute to data races, even in their most relaxed form.

With those changes, I believe that our “proof” of correctness may actually be turned into a more rigorous proof using the axioms of the C++ memory model (although I’m not prepared to present one). We can have our cake (correct, portable code) and eat it too (no loss of performance). And, by the way, the same trick may be used in the case of lossy counters from my previous post.
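For instance, a sketch of a race-free lossy counter (assuming, as before, that losing an occasional increment is acceptable):

#include <atomic>

std::atomic<long> hits(0);

void CountHit() {
    // A relaxed load followed by a relaxed store: two threads may read the
    // same old value and both store old + 1, losing a count, but there is
    // no data race and therefore no undefined behavior.
    long old = hits.load(std::memory_order_relaxed);
    hits.store(old + 1, std::memory_order_relaxed);
}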

Warning: I do not recommend this style of coding, or the use of weak atomics, to anybody who is not implementing operating system kernels or concurrency libraries.

Acknowledgments

I’d like to thank Luis Ceze, Hans Boehm, and Anthony Williams for useful remarks and for verifying my assumptions about the C++ memory model.

Announcement

In an effort to raise awareness of concurrent programming, I’ll be hosting a free webinar, “The Language of Concurrency,” Tuesday, May 24, 2011, at 12pm PDT. It will be a broad overview of the domain and an explanation of common terms, such as message passing, data races, atomics, sequential consistency, etc. Nothing sophisticated, but you might recommend it to your coworkers who are still learning the ropes or need a refresher. You can pre-register here, get an email reminder, and download the slides. If this is successful, I’ll start working on some more advanced topics (suggestions are welcome).


18 Responses to “Dealing with Benign Data Races the C++ Way”

Bartosz, I’m going to take issue with the language in the C++ draft standard. Those atomic operations are definitely still data races. It is absolutely possible for the owner.load() and owner.store() to occur in any order, and there is no other memory location providing synchronization.

What the C++ draft standard is really stating is that neither the compiler nor the hardware is allowed to move _other_ memory operations across the owner.load() and owner.store() operations. It’s also ensuring that owner is only read/written by owner.load() and owner.store(), and that the observed data values and times of observation are as if the compiler/CPU fully fenced those operations.

So I think there are still data races in this example. But now they are the ones introduced by the programmer rather than ones introduced by the compiler and/or CPU. And the C++ draft language needs to be updated to indicate that distinction.

That said, I fully advocate protecting yourself against the compiler/CPU doing horrible things to your multithreaded code. For those who are using C/C++ compilers that predate the new C++ draft standard, you can get much the same effect by using helper routines. (I’ll insert some gcc inline asm stuff here as soon as I verify the syntax)

Or you can rely on the compiler/optimizer behaving in ways that you have seen the compiler behave in the past… which carries some inherent risk. But this is how the Linux kernel deals with the issue, via smp_mb() and pals.

As promised, some x86 gcc syntax to ensure that the program execution only experiences the races that you, the programmer, intended.

/* This routine provides an x86 function that will always, regardless of
 * gcc optimization, do a single access to read a pointer from memory.
 *
 * It also fences the access sufficiently to allow Dekker's algorithm to
 * work on x86 cacheable memory (the classic definition of sequential
 * consistency used by CPU architects).
 */
void *fenced_load_ptr(void *const *address)   /* illustrative signature */
{
    void *value;

    // we explicitly tell gcc to clobber all memory, forcing any variable
    // that is visible outside this function to be consistent in memory.
    // this also prevents the compiler from caching one of those variables
    // across this load.
    asm ("mfence ; mov %[src], %[dest]"
         : [dest] "=r" (value)
         : [src] "m" (*address)
         : "memory");

    // note that the mfence prevents this load from being moved above any
    // earlier stores. This is one of the few reorderings that x86 CPUs
    // do when doing normal loads and stores to cacheable memory.

    // generally, other than Dekker's, this mfence is just unnecessary overhead
    return value;
}

/* The store counterpart: a single fenced write of a pointer to memory. */
void fenced_store_ptr(void **address, void *value)   /* illustrative signature */
{
    // we explicitly tell gcc to clobber all memory, forcing any variable
    // that is visible outside this function to be consistent in memory.
    // this also prevents the compiler from caching one of those other
    // variables across this store.
    asm ("mov %[src], %[dest] ; mfence"
         : [dest] "=m" (*address)
         : [src] "r" (value)
         : "memory");

    // note that the mfence prevents later loads from being moved above
    // this store. This is one of the few reorderings that x86 CPUs
    // do when doing normal loads and stores to cacheable memory.

    // generally, other than Dekker's, this mfence is just unnecessary overhead
}

In the C++0x standard the term “data race” has a very particular meaning. A “data race” occurs where you have multiple accesses to the same memory location, at least one of which is a write, at least one of which is non-atomic, and which have no defined ordering between them.

By definition therefore, the load() and store() operations on owner do not introduce data races if owner is atomic. This is important because a data race produces undefined behaviour in and of itself. You might still have a race condition, which might be benign, or might have consequences that are problematic, but there is no “data race”.
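(To illustrate the distinction with a sketch, assume two threads run Writer and Reader concurrently:)

#include <atomic>

int plain = 0;
std::atomic<int> flag(0);

void Writer() {
    plain = 1;                                  // non-atomic write
    flag.store(1, std::memory_order_relaxed);   // atomic write
}

void Reader() {
    int a = plain;                                // data race: undefined behaviour
    int b = flag.load(std::memory_order_relaxed); // no data race; at worst a
                                                  // benign race condition
}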

With respect to the particular example, it is assumed that the lock.TryEnter() and lock.Exit() calls behave as proper mutex locks, and therefore introduce ordering constraints with any necessary fences.

The use of atomic loads and stores guarantees that the value of owner read by a given thread is indeed a value that has been written at some point by some thread.

The ordering rules then provide the appropriate guarantees: a value written inside a lock will be seen by any thread that acquires the lock; if a thread sees a given value of a variable, it can’t subsequently read an earlier value; and so forth. These rules apply to atomic operations with memory_order_relaxed just as much as to any other operations, and Bartosz’s reasoning is correct.

For those that don’t think the compiler can introduce dummy writes like “owner=42”, I know from experience that it can.

I’ve seen compilers use variables as scratch storage for other values (e.g. for saving registers, or intermediate results) at high optimization levels, where the code includes an unconditional store to the overwritten variable after its use as scratch storage. I discovered this because I was trying to track down a bug, and couldn’t understand why a given pointer variable had such a random value in it at a particular point in the code when viewed through the debugger. In the disassembly there were lots of spurious writes of seemingly-random values, as the compiler was reusing the variable as scratch storage.

Goodness gracious, this is such a strange approach to the problem. What is wrong with creating a supervisory task which locks and unlocks under command of the locking manager? The locking manager can click one button to lock and another button to unlock.

Hans Boehm, who is the author of the C++ memory model, provided these additional comments (Thanks Hans!):

[quote]
1) I think a more likely breaking “transformation” here is that the owner field may be misaligned or, for an embedded processor, larger than the atomically updateable word size. In either case, updates could become visible to observers in two sections. This means the observed value could be different from any value actually stored. In unlikely cases, it could match the thread id of the querying thread.

2) Memory_order_relaxed is sometimes a bit more expensive than an ordinary load. On Itanium, it should be implemented with an ld.acq. The difference is that it still requires “cache coherence”, i.e. operations on an individual variable must look like they are performed in a single consistent order. Ordinary loads on Itanium don’t have that property. It’s conceivable that we may see more architectures like this, since the data-race-free programming model allows some hardware optimizations for ordinary data operations that people previously considered risky. I have no idea whether this is enough of a benefit that other architectures will bother. I do hope that architectures will eventually move away from fence-based memory ordering.
[end quote]

Interesting! I must say that I based the example on C# code that I have, where the .NET memory model forbids the JITter from introducing reads or writes.

I don’t think a sane processor would fabricate writes, but that “optimization” seems fair for a compiler. Doesn’t C++0x have compiler-only barriers, aside from the “volatile” modifier? (I’m still waiting on my dead-tree edition of C++ Concurrency in Action 🙂)

The specific optimization Anthony mentioned is a bit weird. Using a variable in shared memory as scratch storage, even if the compiler assumes that it’s safe, might impact shared variables on the same cache line and lead to false sharing. Maybe that happened a few years ago, when concurrency was not given particular importance?

C++0x has std::atomic_signal_fence(), which is the closest to a compiler barrier you can get — it provides ordering between a thread and a signal handler on that thread.
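(A sketch of its intended use, assuming POSIX signals; atomic_signal_fence constrains only the compiler, not the hardware:)

#include <atomic>
#include <csignal>

volatile std::sig_atomic_t ready = 0;
int payload;

void OnSignal(int) {
    if (ready) {
        // compiler-only barrier: payload was written before 'ready' was set
        std::atomic_signal_fence(std::memory_order_acquire);
        int p = payload;
        (void)p;
    }
}

int main() {
    std::signal(SIGUSR1, OnSignal);
    payload = 42;
    // keep the compiler from reordering the store to 'payload' past the
    // store to 'ready'; no hardware fence is emitted
    std::atomic_signal_fence(std::memory_order_release);
    ready = 1;
}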

The code where the compiler used the variable as scratch was strictly single-threaded, so false sharing wasn’t an issue. However, it demonstrates that compilers can do this; whether they do so in particular code is thus just a matter of how things turn out.

From a normal user perspective, I find it useful to distinguish between semantics and legal code at 1) the language level and 2) the machine level.

In particular, it’s clear that the intended semantics we are converging on with language standards, Java excepted, is that there is no such thing as a benign data race: all language level data races are errors/result in unspecified behavior. (Java’s memory model, from what I remember when I was abusing it in implementation, allows transformations which, at the very least, lead to unclear outcomes due to them being not obviously disallowed/incorrect.) This is desirable from a user point of view: it’s a simpler rule and enables compiler optimizations. From the implementer’s point of view, allowing language level data races means certain optimizations need to be extremely paranoid – e.g. seemingly benign user level code such as assignment to a variable of its own value, or multiple writes of the same value or arguably only-valid-values, can cause incorrect code to be generated (classic example being a spurious write enabling legal register promotion).

From a machine perspective, it’s clear that there is such a thing as a well-defined benign data race. Indeed, previous work has been done in sorting the sheep from the goats here. However, from my user point of view, it seems clear that we should segregate code which depends on these machine details from code that does not: we may code it in assembly, or what we lovingly know as “close your eyes and pretend this is really C and not assembly with curly brackets”. It’s often simpler to audit small assembly code sequences than to continually verify that a compiler is producing a valid equivalent sequence from higher level code which deliberately uses races (in practice I find sometimes the hardest part is to convince the toolchain to do /exactly/ what we want and not try to be “clever” – particularly as compilers increase in “sophistication” in their “optimization”). Pragmatically, we may know what the toolchain is doing and choose to write it in “C” but I think it’s obvious we are aware in these cases that we’re not writing Standard C or C++; if we are not aware of this then it’s almost certainly an error on our part.

Now there’s an obvious problem for those tools working at the machine code level in being able to (definitively) see whether code came from a compiler or was hand-written in assembly. The desired behavior from an analysis tool would be to allow blessing of certain code fragments and disallow data races everywhere else.

The C++ data-race-free promise is great, even if the name is kind of objectionable for a machine hacker like myself: reasoning about a machine providing sequential consistency is a huge advantage for developers, especially those targeting architectures with weaker memory models than the x86 provides.

Niall: to your last point, it would be great to have compilers emit debugging data describing the location of instructions that the compiler considers to be atomic. You can’t figure this out from an x86 instruction stream, and it’s probably at least very hard on architectures with weaker memory models.

Dynamic analysis tools such as our product, Jinx, could certainly consult this information were it to exist (does it?) and filter out report entries for things that look like races at the machine level but aren’t at the language level. Jinx would still need an awareness of machine-level races, because that’s how it explores the space of outcomes for a compiled program.

Presumably if this annotation were based on commonly-available assembly directives you could also “bless” the assembly code your team is writing. This would extend the DRF promise to your team such that you can also use dynamic analysis tools to point out violations when they occur.

@Peter wrote:
> Niall: to your last point, it would be great to have compilers emit
> debugging data describing the location of instructions that the
> compiler considers to be atomic. You can’t figure this out from an
> x86 instruction stream, and it’s probably at least very hard on
> architectures with weaker memory models.

Certainly for memory_order_relaxed, memory_order_acquire, memory_order_release, memory_order_acq_rel, and memory_order_consume, I think you are hosed: everything just looks like an x86 MOV. But for memory_order_seq_cst, any store should be LOCKed (but any overlapping read need not be). Pre-caffeine thought, handwaving furiously: if you are tracing execution and see a LOCKed store, then any overlapping reads could be checked for races under the assumption of DRF being desired.

As you note, it would be nice to augment the debug info with atomicity assumptions but I’m not aware of compilers producing that at the moment (but need to check).

> Presumably if this annotation were based on commonly-available
> assembly directives you could also “bless” the assembly code your
> team is writing.

Yes, that would be ideal. It’s comforting to explicitly bless the parts where we know we sail close to the wind and assume all other abuses are errors, or at the very least unintentional potential errors (for my purposes that’s the same thing).

Actually, with these hypothetical debug annotations, we could communicate some more interesting hints beyond “trust us, we think we know what we’re doing”. (Leaving aside any suggestion that that is exactly the point where you should get absolutely paranoid in checking our code!)

We’ve been known to play games such as having hot buffers we fill with data and then, when some particular buffer is full, copying via non-temporal stores to the desired final location for that buffer. Why this two-step? To avoid partial writes caused by overwhelming the limited number of write-combining buffers when writing to multiple streams: we batch ’em up and “guarantee” write combining to a particular stream in a single memory burst, and then signal availability. Missing an sfence would be rather suboptimal, and we’d happily communicate that, after certain writes, any read prior to the fence is an error, if that would be useful to a tool.
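(A sketch of that two-step pattern with SSE2 intrinsics; the function name and the ‘ready’ flag are illustrative:)

#include <emmintrin.h>   // _mm_stream_si128, _mm_sfence
#include <stddef.h>

/* Copy a filled hot buffer to its final home with non-temporal stores,
 * then fence before publishing; both pointers must be 16-byte aligned. */
void publish_buffer(__m128i *dst, const __m128i *src, size_t nvecs,
                    volatile int *ready)
{
    for (size_t i = 0; i < nvecs; ++i)
        _mm_stream_si128(&dst[i], src[i]);  // bypasses the cache via WC buffers
    _mm_sfence();   // drain the write-combining buffers before signaling
    *ready = 1;     // signal availability
}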

@Peter: Now that you mention sequential consistency: C++ provides only a limited DRF guarantee. If your program has no data races and no weak atomics, then it is sequentially consistent. But once you use one of the “memory_order_relaxed,” etc., operations, you break sequential consistency. You can still reason about the semantics, but you have to use a much more complex axiomatic memory model. This is why I didn’t offer a proof of correctness in my post.

Here’s the relevant passage from the draft Standard (29.3.8):

[ Note: memory_order_seq_cst ensures sequential consistency only for a program that is free of data races and uses exclusively memory_order_seq_cst operations. Any use of weaker ordering will invalidate this guarantee unless extreme care is used. In particular, memory_order_seq_cst fences ensure a total order only for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomic operations with weaker ordering specifications. —end note ]
