Lock-Free Data Structures. Yet Another Treatise

Queues come in many flavors. They can differ in the number of producers and consumers (single/multi producer × single/multi consumer, four variants). They can be bounded, built on a pre-allocated buffer, or unbounded, built on a list. They may or may not support priorities. They can be lock-free, wait-free or lock-based, with strict (fair) or relaxed (unfair) FIFO ordering, and so on. These types of queues are described in detail in an article by Dmitry Vyukov. As a rule, the more specialized the requirements to a queue are, the more efficient its algorithm can be. In this article, we will consider the most general variant: a multi-producer/multi-consumer unbounded concurrent queue without priority support.

I guess the queue is the favorite data structure of researchers. On the one hand, it is really simple; on the other hand, it is not as easy as the stack, since it has two ends, not just one. And where there are two ends, interesting problems arise: how do we manage them in a multithreaded environment? The number of publications with different variations of the queue algorithm is off the charts, so it is impossible to cover them all. I will dwell briefly on the most popular ones, starting with the classic queue.

Classic Queue

A classic queue is a list (no matter if it’s a singly or doubly linked list) with two ends – the head and the tail. We read from the head and write to the tail.

Don't look for concurrency here: there is none. It is just an illustration of how simple our subject is. The rest of the article shows what happens to simple algorithms when they are adapted to a concurrent environment.
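As an illustration of that simplicity, here is a minimal single-threaded list-based queue. It is a hypothetical sketch written for this article (the names are not from libcds): a singly linked list with a head we read from and a tail we write to, and nothing else.

```cpp
#include <cassert>

// A minimal single-threaded queue: a singly linked list with
// a head (for dequeue) and a tail (for enqueue). Illustrative only.
template <typename T>
class naive_queue {
    struct node {
        T      value;
        node * next = nullptr;
        explicit node( T v ) : value( v ) {}
    };
    node * m_pHead = nullptr; // we read from the head
    node * m_pTail = nullptr; // we write to the tail

public:
    ~naive_queue() {
        while ( m_pHead ) {
            node * p = m_pHead;
            m_pHead = p->next;
            delete p;
        }
    }

    void enqueue( T v ) {
        node * p = new node( v );
        if ( m_pTail )
            m_pTail->next = p;  // link after the current tail
        else
            m_pHead = p;        // the queue was empty
        m_pTail = p;
    }

    bool dequeue( T& v ) {
        if ( !m_pHead )
            return false;       // the queue is empty
        node * p = m_pHead;
        v = p->value;
        m_pHead = p->next;
        if ( !m_pHead )
            m_pTail = nullptr;  // the queue became empty
        delete p;
        return true;
    }
};
```

In a single thread all of this is trivial; the rest of the article is about what it costs to keep exactly this behavior when many threads hit both ends at once.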

The classic lock-free queue algorithm is the one by Michael & Scott (1996).

The code below is from the libcds library and is slightly abridged; for the full code, refer to the cds::intrusive::MSQueue class. The comments are in the code; I tried to keep them from being boring.

bool enqueue( value_type& val )
{
    /*
        Implementation detail: node_type and value_type in my implementation
        are not the same and require conversion from one type to the other.
        For simplicity, assume that node_traits::to_node_ptr
        is just static_cast<node_type *>( &val )
    */
    node_type * pNew = node_traits::to_node_ptr( val );
    typename gc::Guard guard;   // A guard, for example, a Hazard Pointer
    back_off bkoff;             // Back-off strategy (a template-argument class)
    node_type * t;
    // As always in lock-free, we keep trying until we get it right...
    while ( true ) {
        /*
            Protect m_pTail, as we are going to read its fields and
            do not want to end up reading from freed memory
        */
        t = guard.protect( m_pTail, node_to_value() );
        node_type * pNext = t->m_pNext.load( memory_model::memory_order_acquire );
        /*
            An interesting detail: the algorithm allows m_pTail
            to point to an element other than the last one
            and relies on subsequent calls to set the tail right.
            A typical example of threads helping each other.
        */
        if ( pNext != nullptr ) {
            // Oops! The tail is lagging: we have to clean up
            // after another thread by advancing it
            m_pTail.compare_exchange_weak( t, pNext,
                std::memory_order_release, std::memory_order_relaxed );
            /*
                We must start over even if this CAS failed:
                a failed CAS means that m_pTail has been changed
                by someone else after we read it.
            */
            continue;
        }
        node_type * tmp = nullptr;
        if ( t->m_pNext.compare_exchange_strong( tmp, pNew,
                std::memory_order_release, std::memory_order_relaxed ))
        {
            // We have successfully linked a new element into the queue
            break;
        }
        /*
            The CAS failed, which means somebody got there before us.
            Contention has been detected, so, not to aggravate it,
            we back off for a very brief moment
            by calling the back_off functor
        */
        bkoff();
    }
    /*
        We can maintain an item counter... if we want to.
        As can be seen, this counter is not very accurate:
        the element has already been linked, and only now do we bump the counter.
        Such a counter only gives the order of magnitude of the size;
        it cannot be used as an emptiness check.
    */
    ++m_ItemCounter;
    /*
        Finally, we try to advance the m_pTail tail.
        Whether we succeed or not does not matter:
        if we fail, somebody will clean up after us,
        see the 'Oops!' above and below, in dequeue
    */
    m_pTail.compare_exchange_strong( t, pNew,
        std::memory_order_acq_rel, std::memory_order_relaxed );
    /*
        This algorithm always returns true.
        Others, for example bounded queues, can
        return false when the queue is full.
        For interface consistency, enqueue
        always returns a success flag.
    */
    return true;
}
value_type * dequeue()
{
    node_type * pNext;
    back_off bkoff;
    // dequeue needs two Hazard Pointers
    typename gc::template GuardArray<2> guards;
    node_type * h;
    // Keep trying until we succeed...
    while ( true ) {
        // Read and guard the head m_pHead
        h = guards.protect( 0, m_pHead, node_to_value() );
        // ...and the element that follows the head
        pNext = guards.protect( 1, h->m_pNext, node_to_value() );
        // Check: is what we have just read still consistent?
        if ( m_pHead.load( std::memory_order_relaxed ) != h ) {
            // No: somebody has managed to change everything under us.
            // Start all over again
            continue;
        }
        /*
            The sign that the queue is empty.
            Note that, unlike the tail, the head
            is always in the right place.
        */
        if ( pNext == nullptr )
            return nullptr; // the queue is empty
        /*
            We read the tail here, but we do not have to protect it with
            a Hazard Pointer, as we are not interested in the contents
            it points to (the fields of the node)
        */
        node_type * t = m_pTail.load( std::memory_order_acquire );
        if ( h == t ) {
            /*
                Oops! The tail is not in place: here is the head,
                the element following it,
                and yet the tail points to the head.
                We should help here...
            */
            m_pTail.compare_exchange_strong( t, pNext,
                std::memory_order_release, std::memory_order_relaxed );
            // After helping, we have to start over,
            // so the result of this CAS does not matter
            continue;
        }
        // The key step: link in the new head,
        // that is, move down the list
        if ( m_pHead.compare_exchange_strong( h, pNext,
                std::memory_order_release, std::memory_order_relaxed ))
        {
            // Success! Leave the infinite loop
            break;
        }
        /*
            We failed, which means somebody interfered.
            Not to disturb the others, we step back for a moment
        */
        bkoff();
    }
    // Update the not-very-useful item counter,
    // see the comment in enqueue
    --m_ItemCounter;
    // Invoke the 'dispose the h element' functor
    dispose_node( h );
    /*
        Here is an interesting point:
        we return the element that follows the former head.
        Note that pNext is still in the queue;
        it is our new head!
    */
    return pNext;
}

As you can see, the queue is represented by a singly linked list from head to tail.

What is the important point of this algorithm? It manages two pointers, the head and the tail, using the normal (single-word, not double) CAS. This is achieved by the fact that the queue is never physically empty. Look at the code: are there any nullptr checks of the head or tail? You won't find them. To provide physical (though not logical) non-emptiness, one dummy element is added in the queue constructor; it becomes both the head and the tail. dequeue returns the element that becomes the new dummy element (the new head), while the former dummy element (the former head) is removed.
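The dummy-element trick can be sketched as follows. This is a schematic skeleton with names modeled on the listing above, not the actual libcds source: head and tail both start at the dummy node, so neither pointer is ever nullptr, and logical emptiness is simply "the head has no successor".

```cpp
#include <atomic>
#include <cassert>

// Simplified node: only the next pointer matters for this sketch
struct node_type {
    std::atomic<node_type *> m_pNext{ nullptr };
};

// Skeleton of the MSQueue invariant: the queue is never physically empty
struct msqueue_skeleton {
    std::atomic<node_type *> m_pHead;
    std::atomic<node_type *> m_pTail;
    node_type m_Dummy; // the dummy element lives inside the queue object

    msqueue_skeleton() {
        // Head and tail both point to the dummy from the start,
        // so no nullptr checks of head/tail are ever needed
        m_pHead.store( &m_Dummy, std::memory_order_relaxed );
        m_pTail.store( &m_Dummy, std::memory_order_relaxed );
    }

    bool empty() const {
        // Logical emptiness: the head element has no successor
        return m_pHead.load( std::memory_order_acquire )
                   ->m_pNext.load( std::memory_order_acquire ) == nullptr;
    }
};
```

The price of the trick is exactly the intrusive-queue caveat discussed next: what dequeue hands back is the node that has just become the new dummy.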

This should be taken into account when designing an intrusive queue: the returned pointer is still a part of the queue, and we can only dispose of it during the next dequeue.

Secondly, the algorithm assumes that the tail may point not to the last element. Every time we read the tail, we check whether it has a next element m_pNext. If this pointer is not nullptr, the tail is not in place and we should move it forward. But there is another pitfall here: the tail could end up pointing to an element before the head (the head and the tail crossing). To avoid this, we implicitly check m_pTail->m_pNext in the dequeue method: we have read the head and the element m_pHead->m_pNext that follows it, and have made sure that pNext != nullptr. If we then see that the head is equal to the tail, it means there is something after the tail (namely pNext), so the tail is lagging and we should move it forward. This is a typical case of threads helping each other, which is very common in lock-free programming.

Memory ordering

I will shamefully hide my confession behind the spoiler. The code provided above is not a model of how to place memory ordering on atomic operations. The thing is, from the viewpoint of the C++11 memory model, I have never seen a detailed analysis of any algorithm more complex than Treiber's stack. Therefore, the memory orderings are placed largely by intuition. My intuition is backed by many years of running tests, and not only on x86. I assume (and even suspect) that weaker barriers would suffice for this code, so I am ready to discuss this point.

In 2000, a small optimization of the algorithm was proposed. It was noticed that in the dequeue method of MSQueue the tail was read at each iteration of the loop, which is unnecessary: the tail needs to be read (to verify that it really points to the last element) only after a successful update of the head. This can be expected to reduce the pressure on m_pTail under certain kinds of load. This optimization is present in libcds as the cds::intrusive::MoirQueue class.

Baskets queue

An interesting variation of MSQueue was introduced in 2007. Nir Shavit, a fairly well-known researcher in the lock-free world, and his colleagues took a different approach to optimizing the classic lock-free queue of Michael & Scott.

They presented the queue as a set of logical baskets. For some short period of time, each basket is available for inserting a new element. When that interval passes, a new basket is created.

Each basket is an unordered set of elements. This definition would seem to violate the basic property of the queue, FIFO; that is, the queue becomes unfair. FIFO is observed for baskets, but not for the elements within them. If the interval during which a basket is open for insertion is short enough, we can ignore the disorder of the items in it.

How do we determine the duration of this interval? Actually, we don't have to, say the authors of the Baskets Queue. Look at MSQueue. In the enqueue operation, when contention is high and the CAS that changes the tail fails, that is, exactly where MSQueue invokes back-off, we cannot determine the order in which the items will be added to the queue, as they are being added concurrently. That is the logical basket. It turns out that the abstraction of logical baskets is a kind of back-off strategy.

I do not like reading miles of code in review articles, so I will not provide the code here; the MSQueue example has already shown how verbose lock-free code is. Those who want to see the implementation can refer to the cds::intrusive::BasketQueue class in the libcds library, file cds/intrusive/basket_queue.h. Meanwhile, to explain the algorithm, I will provide another picture borrowed from the work of Nir Shavit & Co:

Threads A, B and C want to insert items into the queue. They see that the tail is in the right place (remember that in MSQueue the tail may point not to the last element) and try to change it concurrently.

Thread A is the winner: it has inserted its item. Threads B and C are the losers; their CAS on the tail failed. Therefore, both of them begin to insert their items into the basket, using the previously read, old value of the tail.

Thread B has managed to perform its insertion first. Meanwhile, thread D also invokes enqueue and successfully adds its item, changing the tail.

Thread C also completes its insertion successfully. Note that it added its item into the middle of the queue! During the insertion, it uses the old tail pointer that it read upon entering the operation, before its unsuccessful CAS.

It should be noted that with such insertions, an item could be added before the head of the queue. For example, look at the item before C in picture 4 above: while thread C is inside enqueue, another thread can delete that item before C. To prevent this situation, logical deletion is applied, that is, deleted elements are marked with a special deleted flag. Since the flag and the pointer to the item must be readable atomically, the flag is stored in the least significant bit of the pNext pointer to the item. This is acceptable, since memory allocation in modern systems is at least 4-byte aligned, so the two least significant bits of a pointer are always zero. Thus we have arrived at the marked pointers approach, widely applied in lock-free data structures; we will come across it more than once in the future. With logical deletion, that is, setting the least significant bit of pNext to 1 with a CAS, we exclude the possibility of inserting an item before the head: insertion is also carried out by CAS, and since the deleted item contains 1 in its least significant bit, that CAS will fail (of course, when inserting an item we pass not the whole marked pointer but only its most significant bits that contain the address, assuming the least significant bit is zero).
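The marked-pointer technique itself fits in a few lines. The sketch below is illustrative (the names are invented for this article, not the libcds interface): the deleted flag is packed into bit 0 of the pointer, and a CAS that expects a clean pointer naturally fails against a marked one.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct node {
    node * next = nullptr;
};

// Set the "logically deleted" mark in the least significant bit.
// Safe because aligned allocations always have this bit equal to zero.
inline node * make_marked( node * p ) {
    return reinterpret_cast<node *>( reinterpret_cast<uintptr_t>( p ) | 1 );
}

// Strip the mark to get the real address back
inline node * clear_mark( node * p ) {
    return reinterpret_cast<node *>( reinterpret_cast<uintptr_t>( p ) & ~uintptr_t( 1 ));
}

// Test whether the pointer carries the deleted mark
inline bool is_marked( node * p ) {
    return ( reinterpret_cast<uintptr_t>( p ) & 1 ) != 0;
}
```

The point of the trick is the interplay with CAS: an insertion CAS passes the clean (unmarked) pointer as the expected value, so once some thread has marked the pointer as deleted, every insertion attempt on it fails atomically.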

The last improvement introduced by BasketQueue concerns the physical deletion of items. It was observed that changing the head at each successful dequeue can be unfavorable, as each change is yet another CAS, which, as you know, is quite heavy. Therefore, the head is changed only when there are several logically deleted elements (three by default in the libcds implementation), or when the queue becomes empty. Thus, we can say that in BasketQueue the head moves in hops.

All these optimizations are designed to improve the throughput of the classic lock-free queue under high contention.

Optimistic Approach

In 2004, Nir Shavit and Edya Ladan-Mozes introduced another optimization of MSQueue, which they called optimistic.

Warning!

If anyone wants to read the original article, be careful: there are two articles of the same name, from 2004 and 2008. The 2004 article provides some strange (and seemingly unworkable) [pseudo-]code of the queue. The 2008 article provides completely different code, nice and workable.

They have noticed that the dequeue operation in the algorithm of Michael and Scott requires just one CAS, while enqueue requires two (see the picture above).

The second CAS in enqueue can significantly affect performance even at low load, as CAS is a quite heavy operation in modern processors. Is it possible to somehow get rid of it?

Let us see where the two CASs in MSQueue::enqueue come from. The first CAS links the new item to the tail, changing pTail->pNext. The second one moves the tail forward. Could we change the pNext field with a plain atomic store instead of a CAS? Yes, if the direction of our singly linked list were different: not from the head to the tail, but vice versa. Then we could use an atomic store for the assignment pNew->pNext = pTail, and then change pTail with a CAS. But if we reverse the direction, how do we perform dequeue? There is no pHead->pNext link anymore, since the list direction has changed.

The authors of the optimistic queue suggested using a doubly linked list.

But there is one problem: no efficient CAS-based algorithm for a lock-free doubly linked list is known. Algorithms for DCAS are known, but DCAS is not implemented in hardware. There is an algorithm emulating MCAS (CAS over M arbitrary memory cells) on top of CAS, but it is inefficient (it requires 2M + 1 CAS operations) and is mostly of theoretical interest.

The authors came up with the following solution: the link going from the tail to the head (next, the kind of link the queue itself does not need, but which allows getting rid of the first CAS in enqueue) will always be consistent. For the reverse direction, from the head to the tail, the most important link, prev, may be inconsistent, meaning that its violation is permissible: upon finding such a violation, we can always restore the correct list by following the next links. How do we detect a violation? It is really simple: pHead->prev->next != pHead. If this inequality is found in dequeue, the auxiliary procedure fix_list is invoked:

[taken from the cds::intrusive::OptimisticQueue class of the libcds library]

fix_list traverses the queue from the tail to the head, using the known-correct pNext links, and corrects the pPrev links.
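In simplified form, fix_list amounts to walking the trusted next chain from the tail towards the head and rewriting each prev link on the way. The sketch below assumes plain atomic pointers; the real cds::intrusive::OptimisticQueue code additionally deals with version tags and Hazard Pointers, all omitted here.

```cpp
#include <atomic>
#include <cassert>

struct node_type {
    std::atomic<node_type *> m_pNext{ nullptr }; // tail -> head, always consistent
    std::atomic<node_type *> m_pPrev{ nullptr }; // head -> tail, may be stale
};

// Walk from the tail towards the head along the trusted m_pNext links,
// restoring every m_pPrev encountered on the way
void fix_list( node_type * pTail, node_type * pHead )
{
    node_type * pCur = pTail;
    while ( pCur != pHead ) {
        node_type * pNext = pCur->m_pNext.load( std::memory_order_acquire );
        if ( !pNext )
            break; // the list changed under us; the caller will retry
        // The node closer to the head gets its prev link repaired
        pNext->m_pPrev.store( pCur, std::memory_order_release );
        pCur = pNext;
    }
}
```

After this pass, the invariant pHead->prev->next == pHead holds again and dequeue can proceed.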

Violations of the head-to-tail list (the prev pointers) are caused by delays rather than by heavy load, a delay being the preemption or interruption of a thread by the operating system. Take a look at the following code for OptimisticQueue::enqueue:
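A simplified sketch of the optimistic enqueue is shown below. This is a schematic single-file version written for this article (plain atomic pointers, no version tags, no back-off class), not the actual libcds code, but it shows the key idea: a plain store replaces the first CAS of MSQueue, and the prev link is written only after the tail CAS succeeds, which is exactly where a preempted thread leaves the prev chain temporarily broken.

```cpp
#include <atomic>
#include <cassert>

struct node_type {
    std::atomic<node_type *> m_pNext{ nullptr }; // tail -> head, always correct
    std::atomic<node_type *> m_pPrev{ nullptr }; // head -> tail, may lag
};

struct optimistic_queue_sketch {
    std::atomic<node_type *> m_pTail;
    node_type m_Dummy; // the initial dummy element

    optimistic_queue_sketch() {
        m_pTail.store( &m_Dummy, std::memory_order_relaxed );
    }

    void enqueue( node_type * pNew ) {
        while ( true ) {
            node_type * t = m_pTail.load( std::memory_order_acquire );
            // A plain atomic store instead of the first CAS of MSQueue
            pNew->m_pNext.store( t, std::memory_order_relaxed );
            // The only CAS: move the tail to the new element
            if ( m_pTail.compare_exchange_weak( t, pNew,
                     std::memory_order_release, std::memory_order_relaxed ))
            {
                // prev is written after the CAS; if the thread is preempted
                // right here, the prev chain is temporarily broken and
                // dequeue will repair it with fix_list
                t->m_pPrev.store( pNew, std::memory_order_release );
                return;
            }
            // The CAS failed: another thread moved the tail; retry
        }
    }
};
```

The window between the successful CAS and the prev store is precisely the "delay" the article speaks about: rare, but it must be survivable, hence fix_list.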

So we are optimistic: we build the pPrev list (the most important one for us) hoping for success. If a mismatch between the forward and the reverse lists is found, we have to spend time reconciling them (running fix_list).

So, what is the bottom line? Both enqueue and dequeue now have one CAS each. The price paid for this is running fix_list whenever a violation of the list is detected. Whether this price is large or small, the experiments will tell.

You can find the code in the file cds/intrusive/optimistic_queue.h, the cds::intrusive::OptimisticQueue class of the libcds library.

Wait-Free Queue

To cover the subject of the classic queue in full, we should mention the wait-free queue algorithm.

Wait-free is the strictest requirement of all: the execution time of every operation must be finite and predictable. In practice, wait-free algorithms are often (surprise!) much inferior in performance to their less strict brothers, lock-free and obstruction-free, while surpassing them in code size and complexity.

The structure of many wait-free algorithms is fairly standard: instead of performing an operation directly (enqueue/dequeue in our case), a thread first announces it, storing a descriptor of the operation and its arguments in some publicly accessible shared storage, and then starts helping concurrent threads: it browses the descriptors in the storage and tries to perform what is written in them. As a result, under high load several threads perform the same work, and only one of them will be the winner.

The difficulty of implementing such algorithms in C++ lies mainly in how to implement this storage and how to avoid allocating memory for the descriptors.
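The announce-then-help pattern can be sketched as follows. This is a schematic illustration of the pattern only, not the actual wait-free queue algorithm: each thread owns a slot in a shared descriptor array, publishes its pending operation there, and any thread scanning the array may complete it, with a CAS deciding the single winner.

```cpp
#include <atomic>
#include <cstddef>
#include <cassert>

constexpr size_t MAX_THREADS = 8;

// An operation descriptor: what to do, with which argument, and whether done
struct op_desc {
    enum kind_t { None, Enqueue, Dequeue };
    std::atomic<kind_t> kind{ None };
    std::atomic<int>    arg{ 0 };
    std::atomic<bool>   done{ false };
};

// The publicly accessible shared storage: one slot per thread
op_desc g_Announce[ MAX_THREADS ];

// Step 1: a thread announces its operation instead of performing it
void announce_enqueue( size_t tid, int value ) {
    g_Announce[ tid ].arg.store( value, std::memory_order_relaxed );
    g_Announce[ tid ].done.store( false, std::memory_order_relaxed );
    // The release store publishes the whole descriptor
    g_Announce[ tid ].kind.store( op_desc::Enqueue, std::memory_order_release );
}

// Step 2: every thread scans the storage and helps pending operations.
// Several threads may attack the same descriptor; exactly one CAS wins.
void help_all() {
    for ( size_t i = 0; i < MAX_THREADS; ++i ) {
        if ( g_Announce[ i ].kind.load( std::memory_order_acquire ) != op_desc::None
             && !g_Announce[ i ].done.load( std::memory_order_acquire ))
        {
            // ...here the helper would perform the announced operation
            // on the queue on behalf of thread i, then mark it completed:
            bool expected = false;
            g_Announce[ i ].done.compare_exchange_strong( expected, true );
        }
    }
}
```

The actual queue manipulation is elided; the point is the shape of the protocol: announce, scan, help, and let one CAS per descriptor decide the winner, which is what bounds the number of steps any single operation can take.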

The libcds library has no implementation of the wait-free queue, as the authors of that queue provide quite disappointing performance data in their research.

Test Results

For this article, I decided to provide test results for the algorithms described above.

The tests are synthetic; the test machine is a dual-processor server running Debian Linux: two Intel Xeon X5670 @ 2.93 GHz, 6 cores per processor plus hyperthreading, 24 logical processors in total. At the time of the tests, the machine was almost free, about 90% idle.

The compiler is GCC 4.8.2, the optimization is -O3 -march=native -mtune=native.

The tested queues are from the cds::container namespace; thus they are not intrusive, which means memory is allocated for each element. We compare them to the standard std::queue based on std::deque and std::queue based on std::list, synchronized by a mutex.

The element type T is a structure of two integers. All lock-free queues are based on Hazard Pointers.

Endurance Test

This test is quite specific. It performs 10 million enqueue and 10 million dequeue operations in total. In the first part, 75% of the operations are enqueue and the remaining 25% are dequeue (that is, 10 million enqueue and 2.5 million dequeue). In the second part, only dequeue is performed, the remaining 7.5 million times, until the queue becomes empty.

The idea behind this design was to minimize the impact of the allocator's cache, if the allocator has one.

The absolute figures are the test times:

What can we say? The first thing that catches the eye is that the mutex-based std::queue over std::deque turned out to be the fastest. How come? I think it is all about working with memory: std::deque allocates memory in blocks of N elements rather than per element. This suggests that we should eliminate the impact of the allocator in the tests, as it introduces quite long delays and usually has a mutex (or several) inside. Well, libcds has intrusive versions of all the containers, which do not allocate memory for their elements; we should try testing those.

As for our lock-free queues, it is clear that all the optimizations of MSQueue we considered have borne fruit, although the gains are not that great.

Producer/Consumer Test

This test is quite practical. There are N producers and N consumers of the queue. In total, 10 million write operations and, accordingly, 10 million read operations are performed. The number of threads in the charts is the total number of producer and consumer threads.

The absolute figures are the test times:

Lock-free queues behave more decently here. The winner is OptimisticQueue, which suggests that all the assumptions behind its design proved correct.

This test is closer to reality, as the elements never pile up in the queue en masse. That is also why, I think, the internal optimizations of the allocator come into play. And that is fine: in the end, a queue exists not to accumulate huge numbers of elements, but to buffer bursts of activity, and the normal state of a queue is being empty.

Bonus About the Stack

Since we’re talking about tests…

Between this article and the previous one about lock-free stacks, I implemented the elimination back-off for Treiber's stack. The implementation itself, or rather the path from the description of the approach and the [pseudo-]code to a finished product in C++, deserves a separate article (which most likely will never be written, as it would contain too much code). In the end, the result is what the authors of elimination back-off described, but if you look at the code, it is completely different. So far it is only available in the libcds repository.

I will also provide the results of the synthetic tests. The test machine is the same.

Producer/consumer test: some threads write to the stack (push) while others read from it (pop). There is an equal number of producers and consumers; the number of threads is their total. 10 million push and 10 million pop operations are performed. For the standard stacks, synchronization is done with a mutex.

The absolute time of the test:

I think the chart speaks for itself.

What provided such a performance boost with elimination back-off? Seemingly, the fact that push and pop annihilate each other. But if we look at the internal statistics (all container classes in libcds have internal statistics, disabled by default), we will see that only 10-15 thousand push/pop pairs out of 10 million annihilated (at 64 threads), which is about 0.1%, while the total number of attempts (that is, of entries into the elimination back-off) is about 35 thousand! It turns out that the main benefit of elimination back-off is that some threads fall asleep when the load is too high (in these tests, a passive thread sleeps in the elimination back-off for about 5 milliseconds), automatically reducing the overall pressure on the stack.

Summary

We have reviewed the classic lock-free queue, represented as a list of elements. Such a queue is characterized by two points of contention: the head and the tail. We have considered the classic algorithm of Michael and Scott as well as some of its improvements. I hope the optimizations we looked at were interesting and may prove useful in everyday life.

Judging by the test results, we can conclude that, despite the queues being lock-free, the magic CAS did not yield any special performance gain. Therefore, we need to look for other approaches that abandon the bottlenecks (the head and the tail) and somehow parallelize the work with queues.
