Measuring Parallel Performance: Optimizing a Concurrent Queue

By Herb Sutter, December 01, 2008

When it comes to scalability and concurrency, more is always better.

Example 3: Reducing Head Contention

In Examples 1 and 2, the producer was responsible for lazily removing the nodes consumed since the last call to Produce. But that's bad for performance for several reasons, notably because it forces a producer to touch both ends of the queue, so every thread that uses the queue, whether producer or consumer, has to touch the queue's head end. Even though a producer and a consumer don't use the same spinlocks and so can run fully concurrently with respect to each other, the fact that they touch the same memory inherently adds invisible contention, as updates to the memory containing the head nodes have to be propagated to all threads on other cores, not just to consumer threads that naturally have to touch the head end to do their work.

In Example 3, we'll let each consumer be responsible for trimming the node it consumed (which it was touching anyway), which gives better locality. The first thing we notice is that we can get rid of divider, itself a source of contention because it was used by both consumers and producers.

Next, Produce becomes simpler because we can eliminate the lazy cleanup code. However, just eliminating that code leads to a very subtle pitfall because one existing line also has to change. Can you see why?
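As a rough illustration, the simplified queue might look like the following sketch. It assumes C++11 std::atomic with the default sequentially consistent ordering. The class name LowLockQueue, the lock names producerLock and consumerLock, and the details of Consume are assumptions made for illustration, not necessarily the article's exact code. Lines A and B in Produce are labeled to match the discussion that follows.

```cpp
#include <atomic>

// Hypothetical sketch of the Example 3 design: no divider, and each
// consumer trims the node it has just moved past.
template <typename T>
class LowLockQueue {
    struct Node {
        explicit Node(T* val) : value(val), next(nullptr) {}
        T* value;                 // payload; null for the dummy head node
        std::atomic<Node*> next;  // written by a producer, read by consumers
    };

    Node* first;                     // dummy head; touched only by consumers
    std::atomic<bool> consumerLock;  // spinlock shared by consumers (assumed name)
    Node* last;                      // tail; touched only by producers
    std::atomic<bool> producerLock;  // spinlock shared by producers (assumed name)

public:
    LowLockQueue() : consumerLock(false), producerLock(false) {
        first = last = new Node(nullptr);  // the queue always holds one dummy node
    }
    LowLockQueue(const LowLockQueue&) = delete;
    LowLockQueue& operator=(const LowLockQueue&) = delete;

    ~LowLockQueue() {
        while (first != nullptr) {  // free the dummy and any unconsumed nodes
            Node* tmp = first;
            first = tmp->next;
            delete tmp->value;
            delete tmp;
        }
    }

    void Produce(const T& t) {
        Node* tmp = new Node(new T(t));
        while (producerLock.exchange(true)) {}  // acquire exclusivity
        last->next = tmp;                       // A: publish the new item
        last = tmp;                             // B: assign from tmp, not last->next
        producerLock = false;                   // release exclusivity
    }

    bool Consume(T& result) {
        while (consumerLock.exchange(true)) {}  // acquire exclusivity
        Node* oldFirst = first;
        Node* theNext = first->next;
        if (theNext != nullptr) {      // an item is available
            T* val = theNext->value;   // take the payload out
            theNext->value = nullptr;  // theNext becomes the new dummy
            first = theNext;
            consumerLock = false;      // release exclusivity
            result = *val;
            delete val;
            delete oldFirst;           // the consumer trims the old dummy
            return true;
        }
        consumerLock = false;          // release exclusivity
        return false;                  // queue was empty
    }
};
```

A producer thread would call q.Produce(item), and a consumer would poll q.Consume(item), retrying or backing off whenever it returns false.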

Changing Responsibilities Can Introduce Bugs

Note that line B used to be last = last->next;. That was always slightly inefficient because it needlessly reread last (a holdover from the original code written by someone else). Now, if left unchanged, it becomes something much worse: a small race window. Now that there's no divider and consumers clean up consumed nodes, the way consumers know there's an item available to be consumed is to check first->next; if it's not null, it's okay to go ahead and consume a node and delete what used to be the first one, because that node is no longer needed. The trouble arises when a sequence like the following occurs:

The producer executes line A, publishing the new node.

The consumer performs an entire call to Consume for the just-published node, including deleting the now-unnecessary previous first node before it.

Then the producer executes the unchanged line B, which dereferences last:

last = last->next; // B: update last
// oops: accesses freed memory.

The key is that the act of publishing the new node (line A) not only advertises that the new node is ready to be consumed, but also implicitly transfers ownership of the preceding node to the consumer. Hence, line B must not dereference last again, but should just assign from tmp directly.

"But," someone might object, "will this interleaving really happen? After all, A-B is a very small window for a call to Consume to fit into." True, it won't happen often. Based on experience, however, I can report that under heavy stress on a multicore system, this tends to fail once for every few tens of millions of items moving through the queue. This was the only race I wrote (that I know of) when putting these examples together, and it was a real pain to reproduce and diagnose.

Moral: When you change responsibilities for cleanup, code that used to be innocuous can suddenly turn into a subtle race window.

Measuring Example 3

But back to the main event: How much does moving the cleanup responsibility and reducing contention on the head of the queue really help? Again, before seeing my results, consider how much, and why, you think this is likely to affect throughput, scalability, contention, and the oversubscription penalty.

Figure 3 shows the Example 3 performance results. The effects are mainly on the left-hand small object graph, with only incremental improvements for large objects. For small objects, peak throughput has improved by nearly another factor of two, and we've again improved scalability and actually get close to reaching the dashed line, which represents our capacity for getting more work done using more cores. There is some dropoff due to contention as we exceed about 20 active threads (e.g., 12 producers and 8 consumers), and for the first time we can actually see the oversubscription wall on the left-hand graph beyond 24 threads. Although we'd like to scale that wall, right now we're happy to just be able to approach it in the first place!
