Benign Data Races

Can a data race not be a bug? In the strictest sense I would say it’s always a bug. A correct program written in a high-level language should run the same way on every processor present, past, and future. But there is no prescription, or even a convention, about what a processor should (or shouldn’t) do when it encounters a race. This is usually described in higher-level language specs by the ominous phrase: “undefined behavior.” A data race could legitimately reprogram your BIOS, wipe out your disk, and stop the processor’s fan, causing a multi-core meltdown.

Data race: Multiple threads accessing the same memory location without intervening synchronization, with at least one thread performing a write.

However, if your program is only designed to run on a particular family of processors, say the x86, you might allow certain types of data races for the sake of performance. And as your program matures, i.e., goes through many cycles of testing and debugging, the proportion of buggy races to benign races keeps decreasing. This becomes a real problem if you are using a data-race detection tool that cannot distinguish between the two. You get swamped by false positives.

Microsoft Research encountered and dealt with this problem when running their race detector called DataCollider on the Windows kernel (see Bibliography). Their program found 25 actual bugs, and almost an order of magnitude more benign data races. I’ll summarize their methodology and discuss their findings about benign data races.

Data Races in Windows Kernel

The idea of the program is very simple. Put a hardware breakpoint on a shared memory access and wait for one of the threads to stumble upon it. This is a code breakpoint, which is triggered when the particular code location is executed. The x86 also supports data breakpoints, which are triggered when the program accesses a specific memory location. So when a thread hits the code breakpoint, DataCollider installs a data breakpoint on the location the thread was just accessing. It then stalls the current thread and lets all other threads run. If any one of them hits the data breakpoint, it’s a race (as long as one of the accesses is a write). Consider this: If there was any synchronization between the two accesses, the second thread would have been blocked from accessing that location. Since it wasn’t, we have a classic data race.

Notice that this method might not catch all data races, but it doesn’t produce false positives. Except, of course, when the race is considered benign.
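The check described above can be sketched as a toy simulation. This is not DataCollider's code: the Access record and the find_race function are hypothetical names, and the list of accesses made "during the stall" stands in for what the hardware data breakpoint would observe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    thread: int      # id of the thread performing the access
    address: int     # memory location touched
    is_write: bool   # True for a store, False for a load


def find_race(trapped, during_stall):
    """Return the first access that races with `trapped`, or None.

    `trapped` is the access that hit the code breakpoint. `during_stall`
    lists what the other threads did while the trapped thread was paused
    (an assumption standing in for the data breakpoint firing).
    """
    for other in during_stall:
        if other.thread == trapped.thread:
            continue  # the trapped thread is stalled, so it cannot run
        same_location = other.address == trapped.address
        one_is_write = other.is_write or trapped.is_write
        if same_location and one_is_write:
            return other
    return None
```

Two reads of the same location are not reported, which matches the definition of a data race: at least one of the accesses has to be a write.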

There are other interesting details of the algorithm. One is the choice of code locations for installing breakpoints. DataCollider first analyzes the program’s assembly code to create a pool of memory accesses. It discards all thread-local accesses and explicitly synchronized instructions (for instance, the ones with the LOCK prefix). It then randomly picks locations for breakpoints from this pool. Notice that rarely executed paths are as likely to be sampled as the frequently executed ones. This is important because data races often hide in less frequented places.
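A minimal sketch of the sampling step, assuming the static analysis has already produced a list of memory-access instructions. The Instruction record, its fields, and pick_breakpoints are hypothetical names for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    location: int        # code address of the memory access
    thread_local: bool   # e.g. touches only the current thread's stack
    lock_prefixed: bool  # explicitly synchronized (x86 LOCK prefix)


def pick_breakpoints(instructions, count, rng):
    """Choose up to `count` code locations uniformly from the pruned pool."""
    pool = [
        ins for ins in instructions
        if not ins.thread_local and not ins.lock_prefixed
    ]
    # Sampling is over code locations, not over executions, so a cold
    # path has the same chance of being picked as a hot loop.
    return rng.sample(pool, min(count, len(pool)))
```

Passing the random number generator in (for example random.Random(0)) keeps a run reproducible, which is convenient when experimenting with the sketch.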

Pruning Benign Races

90% of data races caught by DataCollider in the Windows kernel were benign. For several reasons it’s hard to say how general this result is. First, the kernel had already been tested and debugged for some time, so many low-hanging concurrency bugs had already been picked. Second, operating system kernels are highly optimized for a particular processor and might use all kinds of tricks to improve performance. Finally, kernels often use unusual synchronization strategies. Still, it’s interesting to see what shape benign data races take.

It turns out that half of the false positives came from lossy counters. There are many places where statistics are gathered: counting certain kinds of events, either for reporting or for performance enhancements. In those situations losing a few increments is of no relevance. However, not all counters are lossy and, for instance, a data race in reference counting is a serious bug. DataCollider uses a simple heuristic to detect lossy counters: they are the ones that are always incremented. A reference counter, on the other hand, is as often incremented as decremented.
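The heuristic can be sketched in a few lines. How DataCollider actually gathers its evidence is not described here, so the input (a list of signed changes observed at one memory location) and the function name are assumptions.

```python
def looks_like_lossy_counter(deltas):
    """Guess whether a location is a statistics counter.

    `deltas` holds the signed changes observed at one memory location
    (a hypothetical input). A location that is only ever incremented is
    treated as a lossy counter; any decrement suggests something like a
    reference count, where a lost update is a real bug.
    """
    return len(deltas) > 0 and all(d > 0 for d in deltas)
```

For example, looks_like_lossy_counter([1, 1, 1]) is True, while a location that sees [1, -1, 1] is not pruned.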

Another benign race happens when one thread reads a particular bit in a bitfield while another thread updates another bit. A bit update is a read-modify-write (RMW) sequence: The thread reads the previous value of the bitfield, modifies one bit, and writes the whole bitfield back. Other bits are overwritten in the process too, but their new values are the same as the old values. A read from another thread of any of the non-changed bits does not interfere with the write, at least not on the x86. Of course if yet another thread modified one of those bits, it would be a real bug, and it would be caught separately. The pruning of this type of race requires analysis of surrounding code (looking for the masking of other bits).
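To make the two cases concrete, here is a sequential replay of both interleavings. It is a sketch only: FLAG_A, FLAG_B, and the function names are made up for illustration, and the "threads" are simulated by ordering the loads and stores by hand.

```python
FLAG_A = 0b01
FLAG_B = 0b10


def set_bit(word, mask):
    # Read-modify-write: all other bits are written back unchanged.
    return word | mask


def read_bit(word, mask):
    return bool(word & mask)


def benign_case(word):
    """A reader of FLAG_B races with a writer of FLAG_A."""
    before = read_bit(word, FLAG_B)   # reader runs before the store
    word = set_bit(word, FLAG_A)      # writer's RMW
    after = read_bit(word, FLAG_B)    # reader runs after the store
    return before == after            # same answer either way


def buggy_case(word):
    """Two writers race on different bits of the same word."""
    seen_by_t1 = word                     # both writers load the old value
    seen_by_t2 = word
    word = set_bit(seen_by_t1, FLAG_A)    # T1 stores
    word = set_bit(seen_by_t2, FLAG_B)    # T2 stores, overwriting T1's bit
    return word
```

benign_case returns True for any starting word, whereas buggy_case(0) returns 0b10: the update to FLAG_A is lost, which is the real bug the paragraph above warns about.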

The Windows kernel also has some special variables that are racy by design: current time is one such example. DataCollider has these locations hard-coded and automatically prunes them away.

There are benign races that are hard to prune automatically, and those are left for manual pruning (in fact, DataCollider reports all races, it just de-emphasizes the ones it considers benign). One of them is the double-checked locking pattern (DCLP), where a thread makes a non-synchronized read to be later re-confirmed under the lock. This pattern happens to work on the x86, although it definitely isn’t portable.
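For readers who have not seen DCLP, its shape is sketched below. LazyValue and its factory argument are hypothetical names. Note that Python only shows the control flow: the hazard that makes the pattern non-portable in C or C++ (a reader observing the published pointer before the object's initialization is visible) comes from memory ordering that this sketch does not model.

```python
import threading


class LazyValue:
    """Lazily create a value with double-checked locking."""

    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._value = None

    def get(self):
        value = self._value              # first check: unsynchronized read
        if value is None:
            with self._lock:
                value = self._value      # second check: re-confirm under the lock
                if value is None:
                    value = self._factory()
                    self._value = value  # publish for later lock-free readers
        return value
```

The first read is the racy one a detector would flag; the second read under the lock is what keeps the factory from running twice.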

Finally, there is the interesting case of idempotent writes: two racing writes that happen to write the same value to the same location. Even though such scenarios are easy to prune, the implementers of DataCollider decided not to prune them because more often than not they led to the uncovering of concurrency bugs. Below is a table that summarizes various cases.

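Benign race                       | Differentiate from                | Pruned?
----------------------------------|-----------------------------------|--------
Lossy counter                     | Reference counting                | Yes
Read and write of different bits  | Read and write of the whole word  | Yes
Deliberately racy variables       |                                   | Yes
DCLP                              |                                   | No
Idempotent writes                 |                                   | No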

Conclusion

In the ideal world there would be no data races. But a concurrency bug detector must take into account the existence of benign data races. In the early stages of product testing the majority of detected races are real bugs. It’s only when chasing the most elusive of concurrency bugs that it becomes important to weed out benign races. But it’s the elusive ones that bite the hardest.

The classic examples of deliberate data races that come to mind are numerical algorithms: chaotic relaxation and asynchronous iterative methods. There are also systems where cooperating threads or processes are maintaining a best estimate matrix of probabilities and may race to provide updates.

(Of course, there are also some nasty results showing degraded convergence rates as the number of processors increases.)

One instance of a benign race that I’ve wanted to look into would be doing multi-threaded unsynchronized path compression on a disjoint-set forest. This would be an idempotent write as it would depend on any value that is valid now remaining valid (but possibly sub-optimal) indefinitely.
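The invariant this idea relies on can be illustrated with an ordinary, single-threaded find with path compression. This is only a sketch of the find side: it says nothing about concurrent union operations or about what a real memory model would allow, and the list-based parent representation is an assumption.

```python
def find(parent, x):
    """Return the root of x, compressing the path along the way.

    `parent` is a list where parent[i] is the parent of node i and a
    root points to itself.
    """
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        nxt = parent[x]
        # Each store replaces one valid ancestor of x with another one
        # (the root), so a reader that sees either value still ends up
        # at the same root.
        parent[x] = root
        x = nxt
    return root
```

Starting from parent = [0, 0, 1, 2], find(parent, 3) returns 0 and leaves every node pointing at the root. If a slower thread later stored an older ancestor, say parent[3] = 1, find(parent, 3) would still return 0; the path would merely be longer.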

Only if you are extremely careful! First, there is lock-free programming, which is very hard and error-prone but usually race free. Then there is lock-free programming with intentional races, which is another order of magnitude harder.

Interesting topic, Bartosz! Speaking about lossy counters, I personally consider them a bug, always. Practically, there are no cases in which one could accept a raced counter as a reliable value. This is not because it loses some increments, but because the error is likely to grow indefinitely when multiple writers are involved.
For the sake of performance, I personally like the use of sparse counters, which are per-cache-line distributed counters. Sparse counters are lossy only when a thread/core performs a sampling to compute the actual value. They are affected by races as well, but since by design there’s a single writer per counter part, the error of the sampling has an upper bound that depends on the number of threads.
I think sparse counters are benign races. My two cents, Nicola.
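A minimal sketch of the sparse-counter idea described in this comment, assuming one slot per writer thread. SparseCounter and run_writers are hypothetical names, and a Python list does not reproduce the per-cache-line padding a C or C++ version would need.

```python
import threading


class SparseCounter:
    """One slot per writer; readers sum the slots to sample the total."""

    def __init__(self, writers):
        # In C or C++ each slot would be padded to its own cache line.
        self._slots = [0] * writers

    def increment(self, writer_id):
        # Only the thread that owns writer_id ever stores to this slot,
        # so no increment can be overwritten by another writer.
        self._slots[writer_id] += 1

    def sample(self):
        # Racy by design: increments in flight may be missed, but
        # nothing is permanently lost.
        return sum(self._slots)


def run_writers(counter, writers, per_writer):
    """Start one thread per slot and wait for all of them to finish."""
    def work(writer_id):
        for _ in range(per_writer):
            counter.increment(writer_id)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Once all writers have finished, sample() returns the exact total; a sample taken while they are still running can only lag behind the true value.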

Hmm.. I don’t know if this constitutes valid proof, but we can reason about the allowed state transitions. The valid values for the “owner” variable are 0 and the id of any thread in the process. A thread only reads its id if the thread itself wrote it there after taking the lock. Conversely, the thread stops reading its id only after it set “owner” to 0, which happens before the lock is released. I’m assuming the lock implementation provides correct acquire/release semantics and also that when a thread sets “owner” to 0 and goes to read it on a subsequent TryEnter(), any in-between context switch provides the appropriate barriers for the thread to _not_ read its id if it is scheduled on another CPU, but I guess this is the hard part to prove.

You also assume that the write and the read of the variable are atomic, don’t you? This is implied by your statement ‘The valid values for the “owner” variable are 0 and the id of any thread in the process.’
