Monday, October 24, 2011

Concurrent programming is hard, and many abstractions have been developed over the years to properly manage concurrent resources. The .NET base class libraries have mutexes, spinlocks, semaphores, compare-and-set (CAS) instructions, and more.

The Problem

Unfortunately, many of these standard abstractions are rather opaque, so predicting their behaviour is difficult, and enforcing a particular thread schedule is nearly impossible. For example, consider the issue of "fairness". Many abstractions to deal with threads are not fair, which is to say that they are not guaranteed to release threads in FIFO order, so a thread could starve.

Furthermore, these opaque abstractions do not allow the application to easily specify domain-specific scheduling behaviour. Suppose threads T0, T1, and T2 are blocked on a resource X, in that order. X becomes available, and according to FIFO order, T0 should acquire control next. However, a domain-specific requirement may state that T2 should be processed first, but there is no way to check the wait queue for the presence of T2 when using any of .NET's thread abstractions. We'd have to build some auxiliary data structures to log the fact that T2 is waiting for X, and force T0 and T1 to relinquish their control voluntarily, but these data structures already exist as the locking queue, so we're duplicating work. Furthermore, both T0 and T1 must run before T2 to relinquish control, so we're wasting valuable CPU time. It's rather wasteful all around.

What I want to cover here is a pattern consisting of a few simple primitives which forms a general concurrent API. Using this API, we can easily implement all of the above threading abstractions, while still allowing arbitrary scheduling policies. I devised this approach while developing a software transactional memory (STM) library for C#, since STM requires deadlock detection and is rather sensitive to scheduling policies. None of this is achievable using the standard concurrent abstractions on the CLR, without modifying the VM.

The paper entitled Light-weight Locks really crystallized the path I was on, and gelled the final components I was missing into an elegant, general solution. The solution I present here is merely one of the abstractions found in that paper.

The Solution

The approach is simple, and consists of only three fundamental concepts:

A lock-free list is associated with each resource being accessed concurrently. If structured as I describe below, managing this list requires no memory allocation.

An AutoResetEvent is associated with each Thread via thread-static data. Each time the thread must block, it calls WaitOne on its own AutoResetEvent after pushing itself onto the resource's lock-free list.

A thread releasing its control of a resource walks the list of waiting threads, and wakes up the ones that should assume control next by calling Set on that Thread's AutoResetEvent.

AutoResetEvent is fundamentally thread-safe, and serves as a combined signal and mutex, and the only other point of contention is the lock-free list. The list requires only one CAS to push a waiter on the front, and occasionally a CAS to pop the front element [1].

Using only the above, we can construct a mutex abstraction, like Monitor, that allows exclusive access to a resource, but permits arbitrary scheduling policy, such as FIFO, LIFO, or something custom (like priorities that don't suffer from inversion). Even more, this requires no allocation at runtime. Here's all we need [2]:

What makes this work is the property that a thread can only be on one blocked list at any given time. This means we need only one piece of thread-static data to track which list it's on, and one AutoResetEvent to block and unblock the thread. Now suppose Locked has two public operations: Enter and Exit, which mimic the behaviour of Monitor.Enter and Monitor.Exit [3]:

The SelectNext method can provide a variety of domain-specific selection behaviour, such as LIFO, FIFO, etc. to dequeue the next lock owner. All that's required is that the thread releasing the resource select the candidate from the list and assign the lock to it, and then signal it to continue via AutoResetEvent. Unblock is not shown for brevity, but it's not too complicated.

The paper I linked to above uses this pattern to implement all the common concurrency abstractions, using very compact structures (2 bytes per mutex, 4 bytes per rw-lock, 4-bytes per condition variable).

I use it in my STM library to dynamically detect deadlocks and abort transactions, and I will soon extend this to implement reader-writer lock semantics similar to a pessimistic version of TLRW.

Regardless, associating a kernel AutoResetEvent, or pthread signal+mutex on Linux, permits far more flexible and compact concurrency primitives. I recommend this "MetaThread" API be exposed in any language's standard threading library.

The Future

This design can also inform the design of concurrency abstractions in kernels. Consider a reader-writer lock where a writer currently has exclusive access to the resource, and a list of N readers is waiting for the release of the lock. The writer must then invoke AutoResetEvent.Set N times, once for each waiting reader. That's N user-kernel transitions, which can be quite expensive for large N [4].

The ideal solution would be to group a number of individual AutoResetEvents so we only need to make one user-kernel transition to signal the whole group, sort of like a multicast for EventWaitHandles. This would provide efficient single thread and multithread suspend/resume semantics.

Footnotes

[1] A CAS may be needed to remove elements further in the list if more than one thread can operate on a resource at a time.
[2] I'm skipping a few steps to simplify the presentation. MetaThread.Current must actually be lazily-initialized. If I were starting a new programming language, this structure would be part of the standard Thread abstraction. Furthermore, when implementing abstractions which allow multiple threads through, you will probably need more synchronization. For instance, a reader-writer lock requires an atomically incremented/decremented read counter. See the paper linked above for more details.
[3] I'm again simplifying here for presentation, so I apologize for any errors. This code is adapted from my STM library, so I haven't tested this version thoroughly, but it's at least derived from correct code. There are many optimizations to avoid redundant CAS operations, but this is the simplest presentation.
[4] N here is the number of threads, which in general are probably less than a hundred, even for heavily threaded workloads. Still, user-kernel transitions can cost upwards of 1,000 cycles, so wasting 100,000 cycles is nothing to scoff at.