Tuesday, January 10, 2012

[This article was originally posted on AltDevBlogADay]
Let's say we are given the task of optimizing a serial program to run on a multi-core system. In the process, we've isolated one block of the code where we can achieve a very nice speedup by delegating it to a second thread, thus allowing the main thread to get a head start on the next portion of work. At some later point in time the main thread then syncs to this worker thread before it uses the results. Since the original serial code is all working correctly, we're trying to be as minimally invasive as possible so as to not inadvertently break anything; we decide to keep all the data structures as they are and allow the worker thread to directly modify the single 'global' copy of the data. Also, since we're interested in performance, any fine grained, per variable, locking is frowned upon. All of this should be fine, we reason, since as the worker thread is modifying the data, the main thread won't need access to it, and we have a sync in place for when the main thread does eventually need access to guarantee the worker thread is done. Here is the scenario in code:

Main thread:
// single shared variable here, but may be hundreds of variables in practice
int shared_var = 2;
uint32_t task = SubmitAsyncTask(DoWorkerTask);

// main thread goes to work on next task...

SyncOnTask(task);

// prints 5
printf("%d\n", shared_var);

Worker thread:
DoWorkerTask()
{
shared_var += 3;
}

All is well until a new feature request comes in months down the road which causes us to innocently modify some of the shared data during a portion of the frame which we really shouldn't have:

Main thread:
// single shared variable here, but may be hundreds of variables in practice
int shared_var = 2;
uint32_t task = SubmitAsyncTask(DoWorkerTask);

// main thread goes to work on next task...

// BAD, we shouldn't be touching this variable during this portion of the frame shared_var = 5;

SyncOnTask(task);

// is this 5, or 8?
printf("%d\n", shared_var);

Don't let the simplicity of this contrived example fool you, the same access pattern, when intermixed with another 100k lines of code, may not be quite so easy to spot.

From my experience, this is a pretty common pattern in games programming. It manages to completely avoid the overhead from any fine grain synchronization primitives, such as critical sections, when accessing data. Instead, it depends on an implicit contract between the main thread and the worker thread, which goes something like: "I, the main thread, promise not to modify any elements belonging to a small, well defined, subset of data which I will share with you, during any point in the frame that it is possible that you, the worker thread, may be transforming it. In return, you, the worker thread, may only modify this small, well defined, subset of this data and nothing else."

Already we see that this contract is pretty fragile at best, filled with good intentions, but with little in the way of actual concrete enforcement. Maybe it's documented with comments, maybe not. Either way, the code as it stands is tough to maintain and modify. Fellow co-workers are likely to dread the thought of going near it.

The problem is that it's non-obvious, from just looking at any block of code partaking in this contract, just what data it's allowed to touch at that point in time. It may even be quite tough to reason about when in a frame it even gets executed, and even trickier if that execution point jumps around from frame to frame, due to unpredictable events. Comments can only get you so far, ideally we'd like some compiler guarantees about what we can and can't access. I don't know how to make the compiler detect these cases, but the next best thing, runtime checking, is actually quite simple.

During some recent delving into what's been going on in the C++11 world with regards to concurrency (lambdas, you rock!), I came across this surprisingly straightforward solution to the problem (the relevant section is around 19:45). Overlooking the syntactical C++ sugar coating, the concept itself is very basic and translates to any language; let's simply transfer ownership of the shared data when we cross these thread boundaries. We can do this by determining what data actually needs to be shared, putting it into its own class/struct, and then ensuring that only one pointer can ever point to it at any one instance in time.

In C++11, the machinery for this is built-in by making use of std::unique_ptr and the move semantics which r-value references allow (not important for the understanding of this article). It's perhaps simpler to visualize in C though; taking the previous example and wrapping the shared data up, as described, might look something like the code below:

Notice how the conditions of the contract between the main and worker thread are now more explicit. We can see which data is intended to be shared between threads since they're accessed through specific pointers. We can even search in a text editor for all the spots in a program in which we are altering any particular block of shared data. Furthermore, we've introduced the concept of data ownership. Either the main thread or the worker thread owns the data at any point in time. If the main thread makes any attempt to modify a shared variable whilst the worker thread owns it, we'll dereference a null pointer and can readily see in the debugger where the race condition happened, rather than having to backtrack from the artifacts the race leaves behind at a later part in the frame (on someone else's machine; on a bug which can only be reproduced once every 10 hours of testing; on a final code built; oh, and we have to hit gold by next week; just sayin').

The DataPointerAcquire function should be called on the thread which wants to take ownership of the shared data. Later, a matching call to DataPointerRelease should be also called from the same thread when the threads want to relinquish ownership. Those functions might looks something like:
// called on the thread which wants to take
// ownership of the shared data
template <class T>
inline void DataPointerAcquire(T **old_owner, T **new_owner)
{
assert(*old_owner);
T *ptr = *old_owner;
*old_owner = nullptr;
MemoryBarrier();
*new_owner = ptr;
}

// make the master pointer (orig_owner) visible to all other threads
MemoryBarrier();

*thread_ptr = nullptr;
}

You may have noticed there is now an extra deference required to access the data. This, however, seems a small price to pay for the additional peace of mind and maintainability, both for you, and your co-workers.