Just because you can't write a test doesn't mean it's not broken. Undefined behaviour which usually happens to work as expected (C and C++ are full of that), race conditions, potential reordering due to a weak memory model... – CodesInChaos 7 hours ago

@CodesInChaos if it can't be reproduced then the code written to 'fix' it can't be tested either. And putting untested code into live is a worse crime in my opinion – RhysW 5 hours ago

...has me wondering if there are any good general ways to consistently trigger, in a test case, problems caused by race conditions that occur only very infrequently in production.

5 Answers

After having been in this crazy business since about 1978, having spent almost all of that time in embedded real-time computing, working multitasking, multithreaded, multi-whatever systems, sometimes with multiple physical processors, having chased more than my fair share of race conditions, my considered opinion is that the answer to your question is quite simple.

No.

There's no good general way to trigger a race condition in testing.

Your ONLY hope is to design them completely out of your system.

When and if you find that someone else has stuffed one in, you should stake him out on an anthill, and then redesign to eliminate it. After you have designed his faux pas (pronounced f***up) out of your system, you can go release him from the ants. (If the ants have already consumed him, leaving only bones, put up a sign saying "This is what happens to people who put race conditions into XYZ project!" and LEAVE HIM THERE.)

There is no way to be absolutely sure various kinds of undefined behavior (in particular race conditions) don't exist.

However, there are a number of tools that show up a good number of such situations. You may be able to prove that a problem exists currently with such tools, even though you cannot prove that your fix is valid.

Some interesting tools for this purpose:

Valgrind is a memory checker. It finds memory leaks, reads of uninitialized memory, uses of dangling pointers and out-of-bounds accesses.

Helgrind is a thread safety checker. It finds race conditions.

Both work by dynamic instrumentation, i.e. they take your program as-is and execute it in a virtualized environment. This makes them unintrusive, but slow.

UBSan is an undefined behavior checker. It finds various cases of C and C++ undefined behavior, such as integer overflows, out-of-range shifts and similar stuff.

MSan is a memory checker. Its goals are similar to Valgrind's.

TSan is a thread safety checker. Its goals are similar to Helgrind's.

These three are built into the Clang compiler and generate code at compile time. This means that you need to integrate them into your build process (in particular, you have to compile with Clang), which makes them much harder to initially set up than *grind, but on the other hand they have a much lower runtime overhead.
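As a rough illustration of the kind of bug TSan is aimed at, here is a minimal sketch. The function names `racy_count` and `safe_count` are made up for this example. Assuming a file that calls `racy_count` from `main`, building it with something like `clang++ -std=c++17 -g -fsanitize=thread race.cpp -pthread` and running it should normally produce a data race report, while a program that only calls `safe_count` should stay quiet.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: two threads bump a plain int with no synchronization.
// This is a data race (undefined behavior). The total often looks right anyway,
// which is why an ordinary test rarely catches it.
int racy_count(int iterations) {
    assert(iterations >= 0);
    int counter = 0;
    auto work = [&counter, iterations] {
        for (int i = 0; i < iterations; ++i) {
            ++counter;  // unsynchronized read-modify-write
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;  // may or may not equal 2 * iterations
}

// Same shape, but the counter is atomic, so there is no data race.
int safe_count(int iterations) {
    assert(iterations >= 0);
    std::atomic<int> counter{0};
    auto work = [&counter, iterations] {
        for (int i = 0; i < iterations; ++i) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter.load();
}
```

Only `safe_count` has a result you can assert on in a test; the point of running `racy_count` under a checker is that the tool can complain even on runs where the final number happens to be correct.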

All the tools I listed work on Linux and some of them on MacOS. I don't think any work on Windows reliably yet.

The best tool I know for this sort of problem is an extension of Valgrind called Helgrind.

Basically Valgrind simulates a virtual processor and runs your binary (unmodified) on top of it, so it can check every single access to memory. Using that framework, Helgrind watches system calls to infer when an access to a shared variable is not properly protected by a mutual exclusion mechanism. That way it can detect a theoretical race condition even if it has not actually happened.
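To make the "not properly protected" case concrete, here is a small sketch of inconsistent locking. The `Account` type and `run_deposits` function are invented for illustration. Assuming a program that calls `run_deposits(true)` from `main`, running the compiled binary with `valgrind --tool=helgrind ./your_program` would be expected to report a possible race on `balance`, whether or not the final total came out wrong on that run.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical shared object: `mutex` is supposed to guard `balance`.
struct Account {
    std::mutex mutex;
    long balance = 0;

    // Correct path: the write happens while holding the mutex.
    void deposit(long amount) {
        assert(amount > 0);
        std::lock_guard<std::mutex> guard(mutex);
        balance += amount;
    }

    // Buggy path: the same write without taking the mutex.
    void deposit_unlocked(long amount) {
        assert(amount > 0);
        balance += amount;
    }
};

// Runs two depositing threads. With use_buggy_path == true, one thread skips
// the lock, which is the situation a race checker is meant to flag.
long run_deposits(bool use_buggy_path) {
    Account account;
    std::thread a([&account] {
        for (int i = 0; i < 1000; ++i) {
            account.deposit(1);
        }
    });
    std::thread b([&account, use_buggy_path] {
        for (int i = 0; i < 1000; ++i) {
            if (use_buggy_path) {
                account.deposit_unlocked(1);
            } else {
                account.deposit(1);
            }
        }
    });
    a.join();
    b.join();
    return account.balance;  // both threads are joined, so this read is safe
}
```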

ThreadSanitizer is a similar tool. It works differently than Helgrind, which gives it the advantage of being much faster, but requires integration into the toolchain.
– Sebastian Redl, Apr 26 '13 at 17:16

Exposing a multi-threading bug requires forcing different threads of execution to perform their steps in a particular interleaved order. Usually this is hard to do without manual debugging or manipulating the code to get some kind of "handle" to control this interleaving. But changing code that behaves unpredictably will often influence that unpredictability, so this is hard to automate.
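A sketch of what such a "handle" can look like, under some assumptions: the `Counter` class and its `between_read_and_write` hook are hypothetical, and the hook is exactly the kind of code manipulation described above. Each step takes the mutex, so there is no data race in the C++ sense, but the read-then-write is still not atomic, and the hook lets a test force the bad interleaving every time.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical counter whose increment is a non-atomic read-then-write.
class Counter {
public:
    // `between_read_and_write` is a test-only hook (an assumption of this sketch).
    void increment(const std::function<void()>& between_read_and_write = [] {}) {
        int seen;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            seen = value_;
        }
        between_read_and_write();  // another thread can slip in here
        {
            std::lock_guard<std::mutex> guard(mutex_);
            value_ = seen + 1;
        }
    }

    int value() {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }

private:
    std::mutex mutex_;
    int value_ = 0;
};

// Forces the order: A reads, B does a full increment, A writes.
int force_lost_update() {
    Counter counter;
    std::promise<void> a_has_read;
    std::promise<void> b_is_done;
    std::future<void> a_has_read_signal = a_has_read.get_future();
    std::future<void> b_is_done_signal = b_is_done.get_future();

    std::thread a([&] {
        counter.increment([&] {
            a_has_read.set_value();   // A now holds a stale value
            b_is_done_signal.wait();  // park A until B has finished
        });
    });
    std::thread b([&] {
        a_has_read_signal.wait();     // do not start until A has read
        counter.increment();
        b_is_done.set_value();
    });
    a.join();
    b.join();

    int result = counter.value();
    assert(result == 1);  // two increments, but B's update was overwritten
    return result;
}
```

The cost is visible in the signature of `increment`: production code now carries a parameter that exists only for tests.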

A nice trick is described by Jaroslav Tulach in Practical API Design: if you have logging statements in the code under question, manipulate the consumer of those logging statements (e.g. an injected pseudo-terminal) so that it accepts the individual log messages in a particular order based on their content. This allows you to control the interleaving of steps in different threads without having to add anything to production code that isn't already there.
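The following is only a loose sketch of the idea in the paragraph above, not code from the book. It assumes the code under test already logs through an injectable sink; `LogSink`, `Counter`, `OrderingSink` and the message strings are all invented here. The test sink refuses to return from a log call until a chosen other message has arrived, which is enough to pin down the interleaving without adding any hook to the counter itself.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <utility>

using LogSink = std::function<void(const std::string&)>;

// Hypothetical code under test: the log calls are assumed to exist already.
class Counter {
public:
    explicit Counter(LogSink log) : log_(std::move(log)) {}

    void increment(const std::string& who) {
        int seen;
        { std::lock_guard<std::mutex> guard(mutex_); seen = value_; }
        log_(who + ": read");
        { std::lock_guard<std::mutex> guard(mutex_); value_ = seen + 1; }
        log_(who + ": wrote");
    }

    int value() {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }

private:
    LogSink log_;
    std::mutex mutex_;
    int value_ = 0;
};

// Test-side log consumer that blocks the caller based on message content.
class OrderingSink {
public:
    // Configure before starting threads: a call logging `message` will not
    // return until `prerequisite` has arrived at the sink.
    void hold_until(const std::string& message, const std::string& prerequisite) {
        assert(rules_.count(message) == 0);  // one rule per message in this sketch
        rules_[message] = prerequisite;
    }

    void accept(const std::string& message) {
        std::unique_lock<std::mutex> lock(mutex_);
        arrived_.insert(message);  // record arrival first, then maybe block
        changed_.notify_all();
        auto rule = rules_.find(message);
        if (rule != rules_.end()) {
            changed_.wait(lock, [&] { return arrived_.count(rule->second) > 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    std::set<std::string> arrived_;
    std::map<std::string, std::string> rules_;
};

// Both threads read 0 before either writes, so one update is lost.
int lost_update_via_logging() {
    OrderingSink sink;
    sink.hold_until("B: read", "A: read");   // B may not write before A has read
    sink.hold_until("A: read", "B: wrote");  // A may not write before B has written
    Counter counter([&sink](const std::string& message) { sink.accept(message); });
    std::thread a([&counter] { counter.increment("A"); });
    std::thread b([&counter] { counter.increment("B"); });
    a.join();
    b.join();
    return counter.value();
}
```

With these two rules the function returns 1 rather than 2 on every run. The scheme depends on the log messages being stable and specific enough to match on, which is a real constraint in practice.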

I have done something similar before, using injected repositories to sleep the threads that call them in specific orders to force the interleaving I want. Having written code that does it, I'm inclined to +1 @John's answer above. Seriously, this stuff is so painful to employ correctly, and it still gives only best-guess guarantees because there could be slightly different interleavings with different results; the better approach is to just eliminate all possible race conditions through static analysis and/or careful combing of the code for any and all shared state.
– Jimmy Hoffa, Apr 25 '13 at 14:48