Just because you can't write a test doesn't mean it's not broken. Undefined behaviour which usually happens to work as expected (C and C++ are full of that), race conditions, potential reordering due to a weak memory model... – CodesInChaos 7 hours ago

@CodesInChaos if it can't be reproduced, then the code written to 'fix' it can't be tested either. And putting untested code into live is a worse crime in my opinion – RhysW 5 hours ago

...has me wondering whether there are any good, general ways to consistently trigger, in a test case, problems caused by race conditions that occur only very infrequently in production.

6 Answers

After having been in this crazy business since about 1978, having spent almost all of that time in embedded real-time computing, working multitasking, multithreaded, multi-whatever systems, sometimes with multiple physical processors, having chased more than my fair share of race conditions, my considered opinion is that the answer to your question is quite simple.

No.

There's no good general way to trigger a race condition in testing.

Your ONLY hope is to design them completely out of your system.

When and if you find that someone else has stuffed one in, you should stake him out on an anthill, and then redesign to eliminate it. After you have designed his faux pas (pronounced f***up) out of your system, you can go release him from the ants. (If the ants have already consumed him, leaving only bones, put up a sign saying "This is what happens to people who put race conditions into XYZ project!" and LEAVE HIM THERE.)

The best tool I know for this sort of problem is an extension of Valgrind called Helgrind.

Basically, Valgrind simulates a virtual processor and runs your binary (unmodified) on top of it, so it can check every single access to memory. Using that framework, Helgrind watches the program's threading calls (e.g. the pthread API) to infer when an access to a shared variable is not properly protected by a mutual-exclusion mechanism. That way it can detect a theoretical race condition even if it has not actually happened.

ThreadSanitizer is a similar tool. It works differently from Helgrind, which makes it much faster, but it requires integration into the toolchain.
– Sebastian Redl Apr 26 '13 at 17:16

There is no way to be absolutely sure various kinds of undefined behavior (in particular race conditions) don't exist.

However, there are a number of tools that show up a good number of such situations. You may be able to prove that a problem exists currently with such tools, even though you cannot prove that your fix is valid.

Some interesting tools for this purpose:

Valgrind is a memory checker. It finds memory leaks, reads of uninitialized memory, uses of dangling pointers and out-of-bounds accesses.

Helgrind is a thread safety checker. It finds race conditions.

Both work by dynamic instrumentation, i.e. they take your program as-is and execute it in a virtualized environment. This makes them unintrusive, but slow.

UBSan is an undefined behavior checker. It finds various cases of C and C++ undefined behavior, such as integer overflows, out-of-range shifts and similar stuff.

MSan is a memory checker. Its goals are similar to Valgrind's, focusing on reads of uninitialized memory.

TSan is a thread safety checker. Its goals are similar to Helgrind's.

These three are built into the Clang compiler and instrument your code at compile time. This means that you need to integrate them into your build process (in particular, you have to compile with Clang), which makes them much harder to set up initially than the *grind tools, but on the other hand they have a much lower runtime overhead.

All the tools I listed work on Linux, and some of them work on macOS. I don't think any of them work reliably on Windows yet.

Exposing a multi-threading bug requires forcing different threads of execution to perform their steps in a particular interleaved order. Usually this is hard to do without manual debugging or manipulating the code to get some kind of "handle" to control this interleaving. But changing code that behaves unpredictably will often influence that unpredictability, so this is hard to automate.

A nice trick is described by Jaroslav Tulach in Practical API Design: if you have logging statements in the code under question, manipulate the consumer of those logging statements (e.g. an injected pseudo-terminal) so that it accepts the individual log messages in a particular order based on their content. This allows you to control the interleaving of steps in different threads without having to add anything to production code that isn't already there.

I have done something similar before, using injected repositories to sleep the calling threads in specific orders to force the interleaving I want. Having written code that does this, I'm inclined to +1 @John's answer above. Seriously, this stuff is painful to employ correctly, and it still gives only best-guess guarantees, because slightly different interleavings could have different results; the better approach is to eliminate all possible race conditions through static analysis and/or careful combing of the code for any and all shared state.
– Jimmy Hoffa Apr 25 '13 at 14:48

It seems most of the answers here mistake this question for "how do I automatically detect race conditions?" when the question is really "how do I reproduce race conditions in testing when I find them?"

The way to do it is to introduce synchronization points in your code that are used for testing only. For example, if a race condition occurs when Event X happens in between Event A and Event B, then, for testing your application, write some code that waits for Event X to happen after Event A happens. You will likely need some way for your tests to talk to your application, to tell it "hey, I'm testing this thing, so wait for this event at this location".

I'm using node.js and mongo, where some actions involve creating consistent data in multiple collections. In these cases, my unit tests make a call to the application to tell it "set up a wait for Event X"; once the application has set it up, the test for Event X runs, and the tests subsequently tell the application "I'm done with the wait for Event X" so the rest of the tests can run normally.