Some time ago Terry Wilmarth posted a blog post about Aggregators in TBB. In this post I want to explore the design space of the pattern. The post contains lots of code; below I will mostly use the term combiner.

So what is a combiner? It is a mutual exclusion primitive similar to a mutex. But when using a combiner you explicitly pass in the critical section function, which allows more flexibility with respect to its execution. Namely, a thread can execute the critical section on behalf of another thread. Here is a simple usage example:

combiner_t* c = combiner_create(&my_critical_section);
...
combiner_execute(c, &arg);
// at this point my_critical_section(&arg) function has been executed,
// but not necessarily by the current thread.

A reasonable question is: how is this better than a mutex? If we combine/aggregate several critical sections from different threads (this is where the name comes from) into a single critical section and give it to a single thread to execute, it can have a significant impact on cache performance. Note that if there are no combining opportunities, the primitive behaves exactly like a mutex -- lock, execute the critical section in the current thread, check that there are no combining opportunities, unlock, return.

Now we are ready to outline the applicability of the primitive. It can provide benefits in moderate-to-high contention scenarios; in low contention scenarios there are few combining opportunities, and in very high contention scenarios other alternatives should be considered to reduce contention in the first place (partitioning, replication, batching, privatization, etc). And I would expect it to be useful for large complex data structures (e.g. trees), because for simple data structures like queues there are efficient non-blocking algorithms (note that a combiner in its standard form is still blocking), and for small data structures the improved cache locality may not pay off.

Let’s create a simple implementation and then see how we can improve it and what aspects we can vary. Let’s start with the interface:
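The full implementation is attached to the post rather than inlined here. As an illustration, here is a minimal sketch of how such a combiner could look, assuming GCC atomic builtins; the names combiner_t, combiner_create and combiner_execute follow the usage example above, while the COMBINER_LOCKED sentinel and the combiner_demo driver are my additions for illustration:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct combiner_arg_t {
    struct combiner_arg_t* next;  // non-zero while the operation is pending
    void* data;                   // operation-specific payload
} combiner_arg_t;

typedef struct combiner_t {
    void (*fn)(combiner_arg_t*);  // the critical section
    combiner_arg_t* head;         // 0: free, otherwise: list of pending args
} combiner_t;

// Sentinel that terminates a batch: a pending arg must keep next != 0,
// because waiters spin on next becoming 0.
#define COMBINER_LOCKED ((combiner_arg_t*)1)

combiner_t* combiner_create(void (*fn)(combiner_arg_t*)) {
    combiner_t* c = calloc(1, sizeof(*c));
    c->fn = fn;
    return c;
}

void combiner_execute(combiner_t* c, combiner_arg_t* arg) {
    // Push the operation onto the list; the thread that pushes onto an
    // empty list becomes the combiner.
    combiner_arg_t* cmp = __atomic_load_n(&c->head, __ATOMIC_RELAXED);
    do {
        arg->next = cmp ? cmp : COMBINER_LOCKED;
    } while (!__atomic_compare_exchange_n(&c->head, &cmp, arg, 0,
                                          __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));
    if (cmp != 0) {
        // Another thread is the combiner; wait until it executes our op.
        while (__atomic_load_n(&arg->next, __ATOMIC_ACQUIRE) != 0) {
        }
        return;
    }
    for (;;) {
        // Detach the batch accumulated so far; keep the lock held.
        combiner_arg_t* batch =
            __atomic_exchange_n(&c->head, COMBINER_LOCKED, __ATOMIC_ACQ_REL);
        while (batch != COMBINER_LOCKED) {
            combiner_arg_t* next = batch->next;
            c->fn(batch);
            // Signal completion; the owner may reuse arg after this store.
            __atomic_store_n(&batch->next, 0, __ATOMIC_RELEASE);
            batch = next;
        }
        // Unlock if no new operations arrived meanwhile; otherwise repeat.
        combiner_arg_t* expected = COMBINER_LOCKED;
        if (__atomic_compare_exchange_n(&c->head, &expected, 0, 0,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return;
    }
}

// Hypothetical test driver: nthreads threads each add their 'data' value
// to a shared counter nops times.
static long counter;
static void add_op(combiner_arg_t* arg) { counter += (long)(size_t)arg->data; }

typedef struct { combiner_t* c; int nops; } demo_t;

static void* demo_thread(void* p) {
    demo_t* d = p;
    for (int i = 0; i < d->nops; i++) {
        combiner_arg_t arg = {0, (void*)1};
        combiner_execute(d->c, &arg);
    }
    return 0;
}

long combiner_demo(int nthreads, int nops) {
    counter = 0;
    combiner_t* c = combiner_create(add_op);
    pthread_t t[32];
    demo_t d = {c, nops};
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], 0, demo_thread, &d);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], 0);
    free(c);
    return counter;
}
```

Note that the head pointer serves both as the pending-operation list and as the lock: a non-zero head means a combiner is active, so a thread that pushes onto an empty list automatically takes the combiner role.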

This simple implementation can be fine for computational applications, but it may not be suitable for other types of applications, because a single combiner executes a potentially unbounded number of operations. Indeed, I've observed a single thread execute millions of operations when swamped with requests from 31 concurrently executing threads.

Bounding

What we want is to bound the number of operations executed by a single thread. If we bound it by, say, 32, it still provides good amortization of overheads. To do this we need to introduce an additional combiner parameter -- limit:
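The bounded version is likewise in the attached source. Here is one possible sketch of the hand-off logic, building on the simple lock-free-list combiner above (GCC builtins again; the handoff field and the bounded_demo driver are my additions for illustration). When the combiner reaches the limit, it hands the rest of the batch, together with the combiner role, to the owner of the first unexecuted argument:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct combiner_arg_t {
    struct combiner_arg_t* next;  // non-zero while the operation is pending
    int handoff;                  // set when we become the new combiner
    void* data;
} combiner_arg_t;

typedef struct combiner_t {
    void (*fn)(combiner_arg_t*);
    int limit;                    // max operations executed by one combiner
    combiner_arg_t* head;         // 0: free, otherwise: list of pending args
} combiner_t;

#define COMBINER_LOCKED ((combiner_arg_t*)1)

void combiner_execute(combiner_t* c, combiner_arg_t* arg) {
    combiner_arg_t* batch;
    int count = 0;
    arg->handoff = 0;
    combiner_arg_t* cmp = __atomic_load_n(&c->head, __ATOMIC_RELAXED);
    do {
        arg->next = cmp ? cmp : COMBINER_LOCKED;
    } while (!__atomic_compare_exchange_n(&c->head, &cmp, arg, 0,
                                          __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));
    if (cmp == 0) {
        // We locked the combiner; grab whatever accumulated so far.
        batch = __atomic_exchange_n(&c->head, COMBINER_LOCKED, __ATOMIC_ACQ_REL);
        goto combine;
    }
wait:
    while (__atomic_load_n(&arg->next, __ATOMIC_ACQUIRE) != 0) {
        if (__atomic_load_n(&arg->handoff, __ATOMIC_ACQUIRE)) {
            // The previous combiner reached its limit and handed the rest
            // of its batch (starting with our own arg) over to us.
            arg->handoff = 0;
            count = 0;
            batch = arg;
            goto combine;
        }
    }
    return;
combine:
    for (;;) {
        while (batch != COMBINER_LOCKED) {
            if (count >= c->limit && batch != arg) {
                // Hand off: the owner of 'batch' becomes the new combiner.
                // Our own operation may still be pending deeper in the list.
                __atomic_store_n(&batch->handoff, 1, __ATOMIC_RELEASE);
                goto wait;
            }
            combiner_arg_t* next = batch->next;
            count++;
            c->fn(batch);
            __atomic_store_n(&batch->next, 0, __ATOMIC_RELEASE);
            batch = next;
        }
        combiner_arg_t* expected = COMBINER_LOCKED;
        if (__atomic_compare_exchange_n(&c->head, &expected, 0, 0,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
            return;
        batch = __atomic_exchange_n(&c->head, COMBINER_LOCKED, __ATOMIC_ACQ_REL);
    }
}

// Hypothetical test driver, as before.
static long counter;
static void add_op(combiner_arg_t* arg) { counter += (long)(size_t)arg->data; }
typedef struct { combiner_t* c; int nops; } demo_t;

static void* demo_thread(void* p) {
    demo_t* d = p;
    for (int i = 0; i < d->nops; i++) {
        combiner_arg_t arg = {0, 0, (void*)1};
        combiner_execute(d->c, &arg);
    }
    return 0;
}

long bounded_demo(int nthreads, int nops, int limit) {
    counter = 0;
    combiner_t c = {add_op, limit, 0};
    pthread_t t[32];
    demo_t d = {&c, nops};
    for (int i = 0; i < nthreads; i++) pthread_create(&t[i], 0, demo_thread, &d);
    for (int i = 0; i < nthreads; i++) pthread_join(t[i], 0);
    return counter;
}
```

The lock is never explicitly released on hand-off: the new combiner simply inherits it, and the old combiner falls back to waiting for its own operation if it has not been executed yet.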

While the TBB algorithm prevents execution of an unbounded number of operations by a single thread, the number can still be arbitrarily high (bounded only by the number of threads), and at the same time it misses lots of combining opportunities. First, the combiner cannot aggregate more than one operation from a single thread (they will necessarily go into different batches). And second, if a thread submits an operation while another thread is combining, the operation won't be joined to the batch (while it possibly could be). We will see the effects of this weak combining in the evaluation section.

Async operations

Combiners provide another interesting opportunity -- operations can be executed asynchronously. That is, if a thread is not interested in the result of an operation (e.g. it just wants to insert/remove a node from a container), it can submit the operation into the queue and return immediately. This makes the algorithm fully non-blocking: threads do not wait for each other. Async operations require only a minimal modification to the algorithm:

void combiner_execute(combiner_t* c, combiner_arg_t* arg, int async) {
    // Enqueue the operation or become the combiner.
    // The same as above, omitted.
    // ...
    if (cmp) {
        // If the caller does not need the result, just return.
        if (async)
            return;
        while (__atomic_load_n(&arg->next, __ATOMIC_ACQUIRE) != 0) {
        }
    } else {
        // Combiner algorithm is the same.
        // ...
    }
}

However, this raises the question of who owns the arg parameter: how do we allocate it and how do we free it? In the previous algorithms it is natural to allocate arg on the stack of the caller thread. But this is impossible with async operations, because arg may be used after the caller thread has destroyed the stack frame. Several solutions are possible. The simplest option is to allocate arg with malloc() and free() it after execution. If, for example, we want to insert a node into a container, then we can use the node itself as the arg. However, I want to show a general and efficient solution -- each thread has N local arg objects and reuses them as previous async operations finish:
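The pool implementation is in the attached source; the following is an illustrative sketch (the names arg_pool_t and combiner_arg_acquire are mine). Since the combiner clears next after executing an operation, a completed async operation is recognized by next == 0, and a thread can lazily reclaim its arg objects:

```c
#include <stddef.h>

typedef struct combiner_arg_t {
    struct combiner_arg_t* next;  // non-zero while the operation is in flight
    void* data;
} combiner_arg_t;

enum { ARGS_PER_THREAD = 8 };     // N local arg objects per thread

typedef struct arg_pool_t {
    combiner_arg_t args[ARGS_PER_THREAD];
    unsigned pos;                 // round-robin position to spread the search
} arg_pool_t;

static __thread arg_pool_t arg_pool;  // zero-initialized: all args are free

// Returns a thread-local arg whose previous async operation (if any)
// has completed; spins if all N operations are still in flight.
combiner_arg_t* combiner_arg_acquire(void) {
    arg_pool_t* p = &arg_pool;
    for (;;) {
        for (unsigned i = 0; i < ARGS_PER_THREAD; i++) {
            combiner_arg_t* a = &p->args[(p->pos + i) % ARGS_PER_THREAD];
            // next == 0 means the combiner has finished executing the op.
            if (__atomic_load_n(&a->next, __ATOMIC_ACQUIRE) == 0) {
                p->pos = (p->pos + i + 1) % ARGS_PER_THREAD;
                return a;
            }
        }
        // All N args are busy: too many async operations in flight, wait.
    }
}
```

N also acts as a natural throttle: a thread cannot have more than N unfinished async operations outstanding.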

As we will see in the evaluation section, asynchronous operations provide a good performance improvement. Note that if a thread must observe the effects of its previous operations (e.g. insert a node into a container, then search for that node), the batches of operations must be reversed to obtain FIFO order. Otherwise, with async operations, the search can be executed before the insertion.
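Since the pending operations form a singly-linked list collected in LIFO order, the reversal is the standard in-place list reversal. A sketch, assuming a sentinel-terminated arg list (the names are mine):

```c
typedef struct combiner_arg_t {
    struct combiner_arg_t* next;  // non-zero while the operation is pending
    void* data;
} combiner_arg_t;

// Sentinel terminating a batch (a pending arg must keep next non-zero).
#define COMBINER_LOCKED ((combiner_arg_t*)1)

// Batches are collected in LIFO order; reverse before execution to get FIFO.
combiner_arg_t* combiner_reverse_batch(combiner_arg_t* batch) {
    combiner_arg_t* prev = COMBINER_LOCKED;
    while (batch != COMBINER_LOCKED) {
        combiner_arg_t* next = batch->next;
        batch->next = prev;
        prev = batch;
        batch = next;
    }
    return prev;
}
```

Overwriting next during the reversal is safe: waiters spin until next becomes exactly 0, and that store happens only after the operation is executed.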

Flat combining

There are two recent interesting research papers on combining; one of them is “Flat Combining and the Synchronization-Parallelism Tradeoff“. The idea is as follows. In the algorithms above, threads enqueue operations into a central queue with a CAS operation, which inevitably incurs additional cost per operation. In flat combining, each thread has its own persistent descriptor that is enqueued into the list (array) once; then, in order to submit an operation, the thread uses an atomic store (instead of a CAS). As a result, the combiner has to poll all thread descriptors to find the “armed“ ones:
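A minimal sketch of the idea (not the paper's exact algorithm; the fc_* names, the fixed thread count, and the fc_demo driver are mine): each thread owns a padded descriptor slot, submission is a plain atomic store into the slot, and whoever holds the lock polls all slots:

```c
#include <pthread.h>

enum { FC_MAX_THREADS = 8 };

typedef struct fc_slot_t {
    void* arg;                    // non-zero: an "armed" pending operation
    char pad[64 - sizeof(void*)]; // avoid false sharing between slots
} fc_slot_t;

typedef struct flat_combiner_t {
    void (*fn)(void*);            // the critical section
    int lock;
    char pad[64 - sizeof(void (*)(void*)) - sizeof(int)];
    fc_slot_t slots[FC_MAX_THREADS];
} flat_combiner_t;

void fc_execute(flat_combiner_t* c, int tid, void* arg) {
    fc_slot_t* slot = &c->slots[tid];
    // Arm our descriptor with a plain atomic store -- no CAS.
    __atomic_store_n(&slot->arg, arg, __ATOMIC_RELEASE);
    for (;;) {
        // The combiner clears the slot after executing the operation.
        if (__atomic_load_n(&slot->arg, __ATOMIC_ACQUIRE) == 0)
            return;
        // Otherwise try to become the combiner ourselves.
        if (__atomic_load_n(&c->lock, __ATOMIC_RELAXED) == 0 &&
            __atomic_exchange_n(&c->lock, 1, __ATOMIC_ACQUIRE) == 0)
            break;
    }
    // We are the combiner: poll all descriptors for armed operations.
    for (int i = 0; i < FC_MAX_THREADS; i++) {
        void* a = __atomic_load_n(&c->slots[i].arg, __ATOMIC_ACQUIRE);
        if (a != 0) {
            c->fn(a);
            __atomic_store_n(&c->slots[i].arg, 0, __ATOMIC_RELEASE);
        }
    }
    __atomic_store_n(&c->lock, 0, __ATOMIC_RELEASE);
}

// Hypothetical test driver: each thread adds 1 to a shared counter.
static long fc_counter;
static void fc_add(void* arg) { fc_counter += (long)(size_t)arg; }

typedef struct { flat_combiner_t* c; int tid, nops; } fc_demo_t;

static void* fc_thread(void* p) {
    fc_demo_t* d = p;
    for (int i = 0; i < d->nops; i++)
        fc_execute(d->c, d->tid, (void*)1);
    return 0;
}

long fc_demo(int nthreads, int nops) {
    flat_combiner_t c = {0};
    c.fn = fc_add;
    fc_counter = 0;
    pthread_t t[FC_MAX_THREADS];
    fc_demo_t d[FC_MAX_THREADS];
    for (int i = 0; i < nthreads; i++) {
        d[i] = (fc_demo_t){&c, i, nops};
        pthread_create(&t[i], 0, fc_thread, &d[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], 0);
    return fc_counter;
}
```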

The advantage of the algorithm is that threads do not need to execute a CAS to submit operations. The downside is that the combiner needs to poll descriptors (potentially useless work). The algorithm seems to be designed for SPARC machines, where write sharing is cheap but CAS operations are expensive. As the authors say, the algorithm degrades quickly as contention decreases (the combiner senselessly polls descriptors just to find no pending operations). So this algorithm is ideal for a synthetic benchmark with no thread-local work.

The second approach, remote core locking, uses a dedicated combiner thread that executes all critical sections. Threads again use an atomic store to submit operations (no CAS), and another advantage is that the dedicated combiner thread always has the protected data structure in its cache (other threads do not even touch it). The downsides are that single-threaded latency increases (worker threads always need to communicate with the dedicated thread) and the dedicated thread always burns a CPU. This algorithm seems to be good as a band-aid for legacy applications that suddenly find themselves executing on highly parallel machines.
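As a rough sketch of the dedicated-thread scheme (my names, assuming per-client slots as in the flat combining sketch; the rcl_demo driver is hypothetical), the server thread endlessly polls client slots:

```c
#include <pthread.h>
#include <stddef.h>

enum { RCL_MAX_CLIENTS = 8 };

typedef struct rcl_slot_t {
    void* arg;                    // non-zero: a pending operation
    char pad[64 - sizeof(void*)]; // avoid false sharing between clients
} rcl_slot_t;

typedef struct rcl_server_t {
    void (*fn)(void*);
    int stop;
    rcl_slot_t slots[RCL_MAX_CLIENTS];
} rcl_server_t;

// The dedicated combiner thread: endlessly polls client slots and executes
// pending operations; the protected data never leaves its cache.
static void* rcl_server_thread(void* p) {
    rcl_server_t* s = p;
    while (!__atomic_load_n(&s->stop, __ATOMIC_ACQUIRE)) {
        for (int i = 0; i < RCL_MAX_CLIENTS; i++) {
            void* a = __atomic_load_n(&s->slots[i].arg, __ATOMIC_ACQUIRE);
            if (a != 0) {
                s->fn(a);
                __atomic_store_n(&s->slots[i].arg, 0, __ATOMIC_RELEASE);
            }
        }
    }
    return 0;
}

// Client side: a plain atomic store to submit, then wait for completion.
void rcl_execute(rcl_server_t* s, int tid, void* arg) {
    __atomic_store_n(&s->slots[tid].arg, arg, __ATOMIC_RELEASE);
    while (__atomic_load_n(&s->slots[tid].arg, __ATOMIC_ACQUIRE) != 0) {
    }
}

// Hypothetical single-client driver.
static long rcl_counter;
static void rcl_add(void* arg) { rcl_counter += (long)(size_t)arg; }

long rcl_demo(int nops) {
    rcl_server_t s = {0};
    s.fn = rcl_add;
    rcl_counter = 0;
    pthread_t srv;
    pthread_create(&srv, 0, rcl_server_thread, &s);
    for (int i = 0; i < nops; i++)
        rcl_execute(&s, 0, (void*)1);
    __atomic_store_n(&s.stop, 1, __ATOMIC_RELEASE);
    pthread_join(srv, 0);
    return rcl_counter;
}
```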

Implementation notes

Some implementation notes before moving on to the evaluation. As always, it is important to put cache-line padding in the proper places to prevent false sharing. In particular, the original flat combining algorithm does not use padding between thread descriptors, but on Intel processors I've observed a significant speedup when the padding is added. In general, I've observed up to a 2x speed difference with and without the padding.

Since some of the combiner algorithms use linked lists of arguments, it's beneficial to use software prefetching. That is, while we are executing the current operation, we prefetch the next one. The actual implementation of the execution loop I used in the benchmarks is as follows:
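The benchmark source is attached; the loop looks roughly like this (a sketch assuming a sentinel-terminated arg list; the batch_demo driver is mine):

```c
typedef struct combiner_arg_t {
    struct combiner_arg_t* next;  // non-zero while the operation is pending
    void* data;
} combiner_arg_t;

#define COMBINER_LOCKED ((combiner_arg_t*)1)

// Execute a detached batch of operations; while the current operation
// runs, the next argument is already being prefetched.
void combiner_execute_batch(void (*fn)(combiner_arg_t*), combiner_arg_t* batch) {
    while (batch != COMBINER_LOCKED) {
        combiner_arg_t* next = batch->next;
        if (next != COMBINER_LOCKED)
            __builtin_prefetch(next, 1, 3);  // prefetch for write, keep cached
        fn(batch);
        // Signal completion of the operation to the waiting thread.
        __atomic_store_n(&batch->next, 0, __ATOMIC_RELEASE);
        batch = next;
    }
}

// Tiny single-threaded demo: count executed operations.
static long batch_count;
static void count_op(combiner_arg_t* arg) { (void)arg; batch_count++; }

long batch_demo(int n) {
    combiner_arg_t args[64];
    combiner_arg_t* head = COMBINER_LOCKED;
    for (int i = 0; i < n; i++) {
        args[i].next = head;
        head = &args[i];
    }
    batch_count = 0;
    combiner_execute_batch(count_op, head);
    return batch_count;
}
```

The key point is reading next before executing the operation: the completion store allows the owner to reuse or destroy the arg, so the link must not be touched afterwards.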

I used a simple active spin loop to wait for completion. That is fine for benchmarks; however, in a real implementation it may be beneficial to use something more elaborate (e.g. at least sched_yield() after some number of iterations).
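For example, the wait could look like this (a sketch; the iteration threshold is arbitrary and should be tuned for the workload):

```c
#include <sched.h>
#include <stddef.h>

typedef struct combiner_arg_t {
    struct combiner_arg_t* next;  // non-zero while the operation is pending
    void* data;
} combiner_arg_t;

// Spin for a while, then start yielding the CPU to other threads.
void combiner_wait(combiner_arg_t* arg) {
    int spin = 0;
    while (__atomic_load_n(&arg->next, __ATOMIC_ACQUIRE) != 0) {
        if (++spin > 1000)
            sched_yield();  // arbitrary threshold
    }
}
```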

Full source code for all algorithms and the benchmark driver is attached to the post.

Evaluation

In the benchmark all threads execute N operations each; the number of threads varies over 1, 2, 4, 8, 16, 32. The protected operation is a traversal of a linked list of length 30. For benchmarking I used a machine with 2 Intel Xeon E5-2690 CPUs running at 2.90GHz (16 HT cores total).

The first experiment is with per-thread local work consisting of 100 division instructions:

The second experiment is with per-thread local work consisting of 1000 division instructions (fewer combining opportunities):

As expected, remote core locking has bad single-threaded performance (2-6 times slower), but behaves acceptably under contention. Flat combining degrades as local work increases (more senseless polling). The TBB combining algorithm suffers from missed combining opportunities: e.g. with 4 threads the TBB algorithm combines a mean of 1.75 operations, while the proposed "Bounded" algorithm combines 3.40 operations, which explains the difference in performance. The async version performs best, which is not surprising since it better interleaves critical sections and local work. Note that async operations can be applied to the other algorithms as well (e.g. flat combining).