Monday, October 24, 2011

The Most General Concurrent API

Concurrent programming is hard, and many abstractions have been developed over the years to properly manage concurrent resources. The .NET base class libraries have mutexes, spinlocks, semaphores, compare-and-set (CAS) instructions, and more.

The Problem

Unfortunately, many of these standard abstractions are rather opaque, so predicting their behaviour is difficult, and enforcing a particular thread schedule is nearly impossible. For example, consider the issue of "fairness". Many abstractions to deal with threads are not fair, which is to say that they are not guaranteed to release threads in FIFO order, so a thread could starve.

Furthermore, these opaque abstractions do not allow the application to easily specify domain-specific scheduling behaviour. Suppose threads T0, T1, and T2 are blocked on a resource X, in that order. X becomes available, and according to FIFO order, T0 should acquire control next. However, a domain-specific requirement may state that T2 should be processed first, but there is no way to check the wait queue for the presence of T2 when using any of .NET's thread abstractions. We'd have to build some auxiliary data structures to log the fact that T2 is waiting for X, and force T0 and T1 to relinquish their control voluntarily, but these data structures already exist as the locking queue, so we're duplicating work. Furthermore, both T0 and T1 must run before T2 to relinquish control, so we're wasting valuable CPU time. It's rather wasteful all around.

What I want to cover here is a pattern consisting of a few simple primitives which forms a general concurrent API. Using this API, we can easily implement all of the above threading abstractions, while still allowing arbitrary scheduling policies. I devised this approach while developing a software transactional memory (STM) library for C#, since STM requires deadlock detection and is rather sensitive to scheduling policies. None of this is achievable using the standard concurrent abstractions on the CLR, without modifying the VM.

The paper entitled Light-weight Locks really crystallized the path I was on, and gelled the final components I was missing into an elegant, general solution. The solution I present here is merely one of the abstractions found in that paper.

The Solution

The approach is simple, and consists of only three fundamental concepts:

A lock-free list is associated with each resource being accessed concurrently. If structured as I describe below, managing this list requires no memory allocation.

An AutoResetEvent is associated with each Thread via thread-static data. Each time the thread must block, it calls WaitOne on its own AutoResetEvent after pushing itself onto the resource's lock-free list.

A thread releasing its control of a resource walks the list of waiting threads, and wakes up the ones that should assume control next by calling Set on that Thread's AutoResetEvent.

AutoResetEvent is inherently thread-safe and serves as a combined signal and mutex, so the only other point of contention is the lock-free list. The list requires only one CAS to push a waiter onto the front, and occasionally a CAS to pop the front element [1].

Using only the above, we can construct a mutex abstraction, like Monitor, that allows exclusive access to a resource, but permits arbitrary scheduling policy, such as FIFO, LIFO, or something custom (like priorities that don't suffer from inversion). Even more, this requires no allocation at runtime. Here's all we need [2]:
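As a minimal sketch of the two structures this describes, assuming hypothetical member names (Next, Signal, owner, blocked, Value) beyond the MetaThread, MetaThread.Current and Locked<T> names used in this post, the declarations could look roughly like this. It is an illustration of the layout, not the code from my STM library:

```csharp
using System;
using System.Threading;

// Per-thread record: a link for whichever wait list the thread is on,
// plus the event it blocks on. One instance per thread, reused for every wait,
// so enqueueing a waiter never allocates.
sealed class MetaThread
{
    [ThreadStatic] static MetaThread current;   // one slot per thread

    public MetaThread Next;                     // next waiter in the resource's list
    public readonly AutoResetEvent Signal = new AutoResetEvent(false);

    // Lazily initialized, because [ThreadStatic] fields start out null on each thread.
    public static MetaThread Current => current ?? (current = new MetaThread());
}

// A value guarded by two words of lock state.
sealed class Locked<T>
{
    public MetaThread owner;    // thread holding the lock, or null when free
    public MetaThread blocked;  // head of the lock-free list of waiting threads
    public T Value;
}
```

The list is intrusive: the node that gets linked into blocked is the waiting thread's own MetaThread, which is why no allocation is needed at runtime.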

What makes this work is the property that a thread can only be on one blocked list at any given time. This means we need only one piece of thread-static data to track which list it's on, and one AutoResetEvent to block and unblock the thread. Now suppose Locked has two public operations: Enter and Exit, which mimic the behaviour of Monitor.Enter and Monitor.Exit [3]:
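A hedged sketch of Enter and Exit along these lines follows. The FIFO policy in SelectNext, the retry after pushing, and the body of Unblock are my assumptions for this illustration, and the lock is not reentrant. Treat it as one way the pieces could fit together rather than a tested implementation:

```csharp
using System;
using System.Threading;

sealed class MetaThread
{
    [ThreadStatic] static MetaThread current;
    public MetaThread Next;
    public readonly AutoResetEvent Signal = new AutoResetEvent(false);
    public static MetaThread Current => current ?? (current = new MetaThread());
}

class Locked<T>
{
    MetaThread owner;    // current holder, null when free
    MetaThread blocked;  // head of the waiter list, newest first
    public T Value;

    public void Enter()
    {
        var self = MetaThread.Current;
        if (Interlocked.CompareExchange(ref owner, self, null) == null) return;
        MetaThread head;
        do  // push ourselves on the front of the list with a CAS
        {
            head = Volatile.Read(ref blocked);
            self.Next = head;
        } while (Interlocked.CompareExchange(ref blocked, self, head) != head);
        // The lock may have been freed while we were pushing, so retry once.
        if (Interlocked.CompareExchange(ref owner, self, null) == null)
            Unblock(self);          // we own the lock now, so leave the list
        else
            self.Signal.WaitOne();  // the releasing thread hands us ownership
    }

    public void Exit()
    {
        var self = MetaThread.Current;
        while (true)
        {
            var next = SelectNext(Volatile.Read(ref blocked));
            if (next != null)
            {
                Unblock(next);
                Volatile.Write(ref owner, next);  // assign the lock, then wake
                next.Signal.Set();
                return;
            }
            Interlocked.Exchange(ref owner, null);           // release
            if (Volatile.Read(ref blocked) == null) return;  // nobody arrived late
            // A waiter arrived late; take the lock back so we can hand it over.
            if (Interlocked.CompareExchange(ref owner, self, null) != null) return;
        }
    }

    // FIFO policy: the oldest waiter is at the tail. Override for other policies.
    protected virtual MetaThread SelectNext(MetaThread head)
    {
        var node = head;
        while (node != null && node.Next != null) node = node.Next;
        return node;
    }

    // Only ever called by the current owner, so interior links need no CAS.
    void Unblock(MetaThread node)
    {
        if (Interlocked.CompareExchange(ref blocked, node.Next, node) != node)
        {
            var prev = Volatile.Read(ref blocked);
            while (prev.Next != node) prev = prev.Next;
            prev.Next = node.Next;
        }
        node.Next = null;
    }
}
```

Because only the lock owner ever removes nodes, the head is the only link that competes with concurrent pushes; everything deeper in the list can be unlinked with plain writes.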

The SelectNext method can provide a variety of domain-specific selection behaviour, such as LIFO, FIFO, etc. to dequeue the next lock owner. All that's required is that the thread releasing the resource select the candidate from the list and assign the lock to it, and then signal it to continue via AutoResetEvent. Unblock is not shown for brevity, but it's not too complicated.

The paper I linked to above uses this pattern to implement all the common concurrency abstractions, using very compact structures (2 bytes per mutex, 4 bytes per rw-lock, 4 bytes per condition variable).

I use it in my STM library to dynamically detect deadlocks and abort transactions, and I will soon extend this to implement reader-writer lock semantics similar to a pessimistic version of TLRW.

Regardless, associating a kernel AutoResetEvent (or a pthread condition variable plus mutex on Linux) with each thread permits far more flexible and compact concurrency primitives. I recommend this "MetaThread" API be exposed in any language's standard threading library.

The Future

This design can also inform the design of concurrency abstractions in kernels. Consider a reader-writer lock where a writer currently has exclusive access to the resource, and a list of N readers is waiting for the release of the lock. The writer must then invoke AutoResetEvent.Set N times, once for each waiting reader. That's N user-kernel transitions, which can be quite expensive for large N [4].
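To make the cost concrete, here is a small sketch of that release path, assuming hypothetical names (WakeAll, Release, Demo) and a plain array standing in for the wait list. It shows that waking N blocked readers takes N separate Set calls:

```csharp
using System;
using System.Threading;

static class WakeAll
{
    // Wake every waiter. Each Set is a separate call on a kernel event,
    // so the number of calls grows linearly with the number of waiters.
    public static int Release(AutoResetEvent[] waiters)
    {
        int calls = 0;
        foreach (var w in waiters) { w.Set(); calls++; }
        return calls;
    }

    // Start n "readers" that each block on their own event, then release them all.
    // Returns how many readers resumed.
    public static int Demo(int n)
    {
        var events = new AutoResetEvent[n];
        var readers = new Thread[n];
        int resumed = 0;
        for (int i = 0; i < n; i++)
        {
            var e = events[i] = new AutoResetEvent(false);
            readers[i] = new Thread(() => { e.WaitOne(); Interlocked.Increment(ref resumed); });
            readers[i].Start();
        }
        Release(events);
        foreach (var r in readers) r.Join();
        return resumed;
    }
}
```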

The ideal solution would be to group a number of individual AutoResetEvents so we only need to make one user-kernel transition to signal the whole group, sort of like a multicast for EventWaitHandles. This would provide efficient single thread and multithread suspend/resume semantics.

Footnotes

[1] A CAS may be needed to remove elements further in the list if more than one thread can operate on a resource at a time.
[2] I'm skipping a few steps to simplify the presentation. MetaThread.Current must actually be lazily-initialized. If I were starting a new programming language, this structure would be part of the standard Thread abstraction. Furthermore, when implementing abstractions which allow multiple threads through, you will probably need more synchronization. For instance, a reader-writer lock requires an atomically incremented/decremented read counter. See the paper linked above for more details.
[3] I'm again simplifying here for presentation, so I apologize for any errors. This code is adapted from my STM library, so I haven't tested this version thoroughly, but it's at least derived from correct code. There are many optimizations to avoid redundant CAS operations, but this is the simplest presentation.
[4] N here is the number of threads, which in general is probably fewer than a hundred, even for heavily threaded workloads. Still, a user-kernel transition can cost upwards of 1,000 cycles, so wasting 100,000 cycles is nothing to scoff at.

Concurrent programs can be executed sequentially on a single processor by interleaving the execution steps of each computational process, or executed in parallel by assigning each computational process to one of a set of processors that may be close or distributed across a network. [Wikipedia]

These are shared-memory concurrency abstractions. If you're interested in scaling shared-memory abstractions across a network, you have to resort to distributed shared memory. Again, that's distributed programming, which often requires a whole different set of abstractions.

So if I understand correctly, you only need two pointers (plus a bit of thread-local data) to implement a lock that supports any fair scheduling policy? Nice. I wonder if it uses more or less memory than a standard .NET lock (if you use lock(x) in C#, does it cause a permanent memory allocation associated with x, or does it use some lock technique that only needs one word of storage, in which case the first word of the two-word object header could be used?)

Oh, does your library offer a 2-word 'struct' version of the lock? That's the only way to let people have two-word overhead.

I realize that using a struct can be dangerous since you can't prevent accidental copying of the lock; that's why you should offer both a low-level struct version (nongeneric, no 'value' field) and a high-level generic class version. Then people can use the struct version as an optimization if they need very fine-grained locking (e.g. their program will have over 100,000 locks in it).

In the library I've been developing, there are two collection classes that I offer in 'struct' form, InternalList and InternalDList (a deque). They are intended as primitives to be used inside other collection classes, e.g. I also wrote a B+ tree which uses InternalDList inside each leaf node, and a V-List that uses InternalList. High-level code should of course use classes instead (List and DList, the latter being a wrapper for InternalDList).

@Qwertie, the CLR reserves at least a 32-bit header in every object for storing lock information (might actually be a 64-bit header, I don't recall exactly). It tries to acquire this lock word in a loop using a CAS. After X number of failures, it allocates a WaitHandle and blocks on that in a queue rooted at that object.

It's clear that the WaitHandle should just be part of any thread abstraction, so we only need to create one for the entire thread lifetime.

The Locked<T> class represents the headers needed in an object if I were to do this at a low level. Right now it uses two words, one for each MetaThread pointer, to keep it simple. You can replace these with two Int15, and use those to indirect to a global thread table. This latter technique is used in the paper above to make lock space overhead very small.

You'll find that most of the purely functional collections in my Sasa library are only available as structs. The internals use classes, but I keep the public interface as a struct so I can properly handle empty collections without the clients triggering NullReferenceExceptions.

I doubt the suggested solution -- C++/CLI with gcroot in an unmanaged block -- is viable for a high-performance data structure. I don't know how it's implemented, but if the garbage collector sees it as a "root" then it probably requires a word or two of bookkeeping per gcroot, plus gcroot would probably need to do some processing in its constructor and destructor. I'm just guessing tho.

Yes, my B+ tree isn't designed for maximum performance but rather maximum flexibility. For instance, it supports an indexer and RemoveAt, O(1) freezing and cloning, and there is an unsorted version called A-List which supports appending and splitting.