Mutexes and Condition Variables using Futexes

Mutexes and Condition Variables differ from spinlocks and spin read-write locks because they require threads to be able to sleep in some sort of wait-queue. In order for this to happen, some communication is required with the operating system kernel via system calls. Since system calls are relatively slow compared to atomic instructions, we would like to minimize their number and avoid as many userspace-kernelspace context switches as possible. The "futex" API in Linux is aimed at satisfying this goal.

Programming using raw futexes is not simple. There are many race conditions one needs to be aware of, and part of the API is broken beyond repair. (Notably the FUTEX_FD option which associates a file descriptor to a futex can miss wakeups.) Thus wrapping futexes in a highly efficient library which implements the easier to understand and use mutex and condition variable primitives is helpful.

Ulrich Drepper has written a seminal article on the use of futexes called Futexes Are Tricky. It describes how one may construct a mutex library using them, and several of the pitfalls along the way. However, this document is now out of date due to expansion of the futex API over the past few years. In addition, the man pages for the futex system call are also out of date. Instead, to understand what is really available, one needs to investigate the Linux Kernel commit logs in git.

Futexes are Tricky

Implementing mutexes is more complex than spin locks. A spin lock may only have two states: locked and unlocked. A mutex must have a third state which describes whether or not the lock is contended, with some threads waiting inside the kernel. In his article, Ulrich defines these three states to be the integers 0 for unlocked, 1 for locked, and 2 for locked and contended. By constructing a state diagram we can look at all possible transitions, and thus write some code.

Note that one may naively want to use the faster ticket spinlock algorithm as a base for a mutex implementation. However, ticket locks are problematic with sleeping waiters. Since only the "correct" waiter can proceed after an unlock, queue-jumping is impossible. Thus the full overhead of a context switch is always required as soon as the lock becomes contended. A further problem is that the futex implementation doesn't allow one to choose exactly which thread is woken up. The best one can do is use the undocumented FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET operations and have effectively 32 different wait-lists. Unfortunately, the results are not particularly fast, so a more traditional design is better.

To implement mutexes we first require a definition of the kernel syscall interface. Unfortunately, the futex include file does not provide one, as futexes are a low-level interface typically used from assembly language. However, this doesn't stop us making our own:
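A minimal sketch of such a wrapper is shown here. It assumes Linux with glibc, and simply forwards all six futex arguments through the generic syscall() function; the name sys_futex matches the call quoted in the comments at the end of this article.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>   /* FUTEX_WAIT, FUTEX_WAKE, ... */
#include <sys/syscall.h>   /* SYS_futex */
#include <time.h>          /* struct timespec */
#include <unistd.h>        /* syscall() */

/* glibc exports no futex() function, so go through syscall() directly. */
long sys_futex(void *addr1, int op, int val1,
               const struct timespec *timeout, void *addr2, int val3)
{
    return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}
```

For example, sys_futex(&word, FUTEX_WAKE, 1, NULL, NULL, 0) returns the number of threads woken, which is 0 when nobody is sleeping on word.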

Not all of the parameters are used in every futex operation (which is multiplexed by the "op" parameter). So this is slightly inefficient, but since system calls are quite slow compared to a few extra asm mov instructions, this doesn't matter much.

For data hiding, define a type for the mutex state. (It needs to be an integer because futexes operate on a 32-bit integer.) Note that the glibc pthread_mutex_t is much more complex. It has to handle multiple different types of mutex, such as the recursive or error-checking varieties. Here, we assume that your code only needs the simplest and most common variety, and is process-local.

Since we only have one type of mutex, we ignore the pthread_mutexattr_t parameter passed in mutex_init. This also means that mutex_destroy is a no-op since we don't check for the error case of destroying an in-use mutex with waiters. Since these functions always return 0, you may want to change them to return void if plug-in compatibility with pthreads is not required.
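As a rough sketch of the type and the two trivial functions described above (the MUTEX_INITIALIZER name is an assumption chosen by analogy with pthreads):

```c
#include <pthread.h>   /* only for the pthread_mutexattr_t type */

typedef int mutex;     /* 0 = unlocked, 1 = locked, 2 = locked and contended */

#define MUTEX_INITIALIZER 0   /* assumed name, mirrors PTHREAD_MUTEX_INITIALIZER */

int mutex_init(mutex *m, const pthread_mutexattr_t *a)
{
    (void) a;          /* only one kind of mutex, so attributes are ignored */
    *m = 0;
    return 0;
}

int mutex_destroy(mutex *m)
{
    (void) m;          /* nothing to free; destroying an in-use mutex is not detected */
    return 0;
}
```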

The lock and unlock operations in this design are a little faster in most cases than the algorithm shown in the "Futexes are Tricky" tutorial. Here, we try extra hard to avoid going into the kernel in the unlock operation. By spinning for a short while, we can see if some other thread takes the lock. If so, we can convert to a contended lock and then exit, all without making a context switch. For some use-cases this is much faster than not spinning.

Note that the wait and wake operations used here are the private versions, FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE. The currently undocumented FUTEX_PRIVATE_FLAG makes the operations process-local. Normal futexes need to obtain mmap_sem inside the kernel to compare addresses between processes. If a futex is process-local, then this semaphore isn't required and a simple comparison of virtual addresses is all that's needed. Thus using the private flag speeds things up somewhat by reducing in-kernel contention.
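The lock and unlock operations described above might look something like the following. This is a sketch rather than a definitive implementation: it uses GCC's __atomic builtins in place of hand-written atomic helpers, the spin counts of 100 and 200 are the ones quoted in the comments below, and futex_op is a helper name made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef int mutex;   /* 0 = unlocked, 1 = locked, 2 = locked and contended */

/* Spin-wait hint; the "memory" clobber forces spin loops to re-read memory. */
#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax() __asm__ __volatile__("pause" ::: "memory")
#else
#define cpu_relax() __asm__ __volatile__("" ::: "memory")
#endif

/* Hypothetical helper: a futex call that needs only three arguments. */
static long futex_op(mutex *m, int op, int val)
{
    return syscall(SYS_futex, m, op, val, NULL, NULL, 0);
}

int mutex_lock(mutex *m)
{
    int c = 0;

    /* Spin for a while, trying to move 0 -> 1 without entering the kernel. */
    for (int i = 0; i < 100; i++) {
        c = 0;
        if (__atomic_compare_exchange_n(m, &c, 1, 0,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return 0;
        cpu_relax();
    }

    /* c is the last value seen (1 or 2): mark the lock as contended. */
    if (c == 1)
        c = __atomic_exchange_n(m, 2, __ATOMIC_ACQUIRE);

    while (c) {
        /* The kernel puts us to sleep only while the word still reads 2. */
        futex_op(m, FUTEX_WAIT_PRIVATE, 2);
        c = __atomic_exchange_n(m, 2, __ATOMIC_ACQUIRE);
    }
    return 0;
}

int mutex_unlock(mutex *m)
{
    /* Uncontended case: nobody to wake. */
    if (__atomic_exchange_n(m, 0, __ATOMIC_RELEASE) == 1)
        return 0;

    /* Contended: spin briefly in case another thread grabs the lock. */
    for (int i = 0; i < 200; i++) {
        if (__atomic_load_n(m, __ATOMIC_RELAXED)) {
            int expected = 1;
            /* A new owner exists: leave state 2 so that it wakes the sleepers. */
            if (__atomic_compare_exchange_n(m, &expected, 2, 0,
                                            __ATOMIC_RELAXED, __ATOMIC_RELAXED)
                || expected == 2)
                return 0;
        }
        cpu_relax();
    }

    /* Nobody took the lock, so wake one sleeper. */
    futex_op(m, FUTEX_WAKE_PRIVATE, 1);
    return 0;
}
```

A thread that wakes from the kernel always takes the lock in state 2, which is conservative: it cannot know whether other sleepers remain, so its own unlock takes the slow path.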

Condition Variables

The resulting code above is quite fast, faster than the standard default mutexes in the glibc pthreads library. However, a mutex on its own is of limited use. To correct that we need to implement the condition variable primitives. The implementation of those inside glibc is extremely complex due to the extra error-checking done. The extra state requires a lock to protect it, and a variety of complex futex operations have been added to the kernel to try and avoid some of the overhead introduced by this internal lock.

If we use a similar philosophy to the mutex code above, we can construct much simpler and faster condition variables. By not checking for errors we can avoid a large amount of overhead. The only internal state we require is a pointer to the mutex "attached" to the condition variable, and a sequence number that we use as a lock against concurrent wakes and sleeps.

A thread waiting on the condition will sleep on the seq sequence-lock. We can then wake up a single thread in a cond_signal and all the threads in a cond_broadcast. Since cond_wait is specified to return with the mutex locked, we can optimize by transferring the waiters from one wait-queue to another in the broadcast operation. By always waking up at least one thread, we make sure that the mutex is set into the correct contended state.

The wake-up operations turn into relatively small wrappers over futex system calls:
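A sketch of how these wrappers might be written follows. The struct layout (a mutex pointer, the sequence number, and the pad field asked about in the comments below) is inferred from the text, and the __atomic builtins are an assumption standing in for hand-written atomic helpers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef int mutex;

typedef struct cv {
    mutex *m;   /* mutex attached by the first waiter, NULL until then */
    int seq;    /* bumped on every wake; waiters sleep on this word */
    int pad;    /* reserved space, see the discussion in the comments */
} cv;

int cond_signal(cv *c)
{
    /* Any waiter that read the old seq will now refuse to sleep. */
    __atomic_fetch_add(&c->seq, 1, __ATOMIC_SEQ_CST);

    /* Wake at most one sleeping thread. */
    syscall(SYS_futex, &c->seq, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
    return 0;
}

int cond_broadcast(cv *c)
{
    mutex *m = __atomic_load_n(&c->m, __ATOMIC_ACQUIRE);

    /* No attached mutex means nobody has ever waited. */
    if (!m)
        return 0;

    __atomic_fetch_add(&c->seq, 1, __ATOMIC_SEQ_CST);

    /* Wake one thread and move the rest onto the mutex's wait-queue.
     * For FUTEX_REQUEUE the fourth argument is the maximum number of
     * waiters to move, not a timeout. */
    syscall(SYS_futex, &c->seq, FUTEX_REQUEUE_PRIVATE, 1,
            (long) INT_MAX, m, 0);
    return 0;
}
```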

Every time we do a wake operation, we increment the seq variable. Thus the only thing we need to do to prevent missed wakeups is to check that this doesn't change whilst falling asleep. This isn't 100% fool-proof. If exactly 2^32 wake operations happen between the read of the seq variable and the execution of the FUTEX_WAIT system call, the counter wraps around to the same value and we will have a bug. However, this is extremely unlikely due to the amount of time it would take to generate that many calls.

The cond_wait operation is slightly tricky. We need to make sure that after we wake up, we change the mutex into the contended state. Thus we cannot use the normal mutex lock function, and have to use an inline version with this constraint. The only other thing we need to do is save the value of the mutex pointer for later. This can be done in a lock-free manner with a compare-exchange instruction.
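The following sketch shows one way cond_wait could be put together for the three-state mutex. To stay short it uses a simplified unlock without the spin phase (mutex_unlock_simple and futex_op are names invented for this example), and it returns EINVAL when a different mutex is already attached, which is an assumed error convention.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef int mutex;   /* 0 = unlocked, 1 = locked, 2 = locked and contended */
typedef struct cv { mutex *m; int seq; int pad; } cv;

static long futex_op(int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Simplified unlock (no spinning), just enough for this sketch. */
static void mutex_unlock_simple(mutex *m)
{
    if (__atomic_exchange_n(m, 0, __ATOMIC_RELEASE) == 2)
        futex_op(m, FUTEX_WAKE_PRIVATE, 1);
}

int cond_wait(cv *c, mutex *m)
{
    /* Read seq while still holding the mutex, before any wake can be missed. */
    int seq = __atomic_load_n(&c->seq, __ATOMIC_ACQUIRE);
    mutex *attached = __atomic_load_n(&c->m, __ATOMIC_ACQUIRE);

    if (attached != m) {
        mutex *expected = NULL;

        if (attached)
            return EINVAL;

        /* First waiter: attach the mutex with a compare-exchange. */
        if (!__atomic_compare_exchange_n(&c->m, &expected, m, 0,
                                         __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)
            && expected != m)
            return EINVAL;
    }

    mutex_unlock_simple(m);

    /* Sleeps only if no signal or broadcast has bumped seq in the meantime. */
    futex_op(&c->seq, FUTEX_WAIT_PRIVATE, seq);

    /* Re-acquire, always leaving the mutex in the contended state (2),
     * because requeued waiters may be sleeping on it. */
    while (__atomic_exchange_n(m, 2, __ATOMIC_ACQUIRE))
        futex_op(m, FUTEX_WAIT_PRIVATE, 2);

    return 0;
}
```

As with pthread_cond_wait, callers would be expected to re-check their predicate in a loop, since a wait can return without the condition being true.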

This completes a condition variable implementation. How fast is it? We can use one of the tests in the glibc source code to benchmark this. tst-cond18.c implements a job-server like application. A main thread repeatedly uses pthread_cond_signal and pthread_cond_broadcast to wake up other threads which then unlock and lock the protecting mutex. It does this for 20 seconds, whilst maintaining a count of the number of successful wakeups. We can alter the program to print out this number on exit.

The default mutexes in glibc perform 890k operations on this machine in 20 seconds. The mutex + condition variable implementation above does about 2 million operations in the same time, so we've made a fair amount of improvement. However, we aren't quite done with optimization...

Optimized Mutexes

We can further improve performance by changing the state diagram for the mutex implementation. We will add a fourth state: Contended and unlocked. This fourth state complements the other three, and allows us to specify two bits in the mutex integer. The first bit states whether or not the mutex is locked. The second, whether it is contended.

The obvious thing to do now is to change the locked + contended value to 3, and add the new state as the value 2. However, we can do better. By moving the two bits into separate bytes, we can more efficiently operate on them by using byte-addressing instructions. This allows some operations to avoid the lock prefix, and increase performance.

We set the low bit in the least significant byte to hold the lock status. The low bit in the next least significant byte of the integer will hold the contended status. This gives the values of the four states on our little-endian machine as: 0 unlocked and uncontended, 1 locked, 256 unlocked and contended, 257 locked and contended.

To implement this without worrying too much about aliasing problems we use a union.
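A sketch of the union and of lock and unlock operations built on it is given below. It assumes a little-endian x86 machine and GCC's __atomic builtins, and it deliberately mixes byte-sized and word-sized accesses to the same word, a technique whose safety is debated in the comments at the end of this article. The futex_op helper name is invented for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax() __asm__ __volatile__("pause" ::: "memory")
#else
#define cpu_relax() __asm__ __volatile__("" ::: "memory")
#endif

typedef union mutex {
    unsigned u;                     /* the whole word the futex sleeps on */
    struct {
        unsigned char locked;       /* byte 0 on little-endian */
        unsigned char contended;    /* byte 1 on little-endian */
    } b;
} mutex;

static long futex_op(mutex *m, int op, int val)
{
    return syscall(SYS_futex, &m->u, op, val, NULL, NULL, 0);
}

int mutex_lock(mutex *m)
{
    /* Fast path: a byte-sized exchange on the locked flag only. */
    for (int i = 0; i < 100; i++) {
        if (!__atomic_exchange_n(&m->b.locked, 1, __ATOMIC_ACQUIRE))
            return 0;
        cpu_relax();
    }

    /* Slow path: set both flags (257) and sleep while the lock is held. */
    while (__atomic_exchange_n(&m->u, 257, __ATOMIC_ACQUIRE) & 1)
        futex_op(m, FUTEX_WAIT_PRIVATE, 257);
    return 0;
}

int mutex_unlock(mutex *m)
{
    unsigned one = 1;

    /* Locked and uncontended: go from 1 to 0 and exit. */
    if (__atomic_load_n(&m->u, __ATOMIC_RELAXED) == 1 &&
        __atomic_compare_exchange_n(&m->u, &one, 0, 0,
                                    __ATOMIC_RELEASE, __ATOMIC_RELAXED))
        return 0;

    /* Contended: a release store of one byte, a plain mov on x86. */
    __atomic_store_n(&m->b.locked, 0, __ATOMIC_RELEASE);

    /* Spin and hope someone takes the lock. */
    for (int i = 0; i < 200; i++) {
        if (__atomic_load_n(&m->b.locked, __ATOMIC_RELAXED))
            return 0;
        cpu_relax();
    }

    /* Nobody did: clear the contended flag and wake one sleeper, which
     * sets the flag again when it takes the lock with the value 257. */
    __atomic_store_n(&m->b.contended, 0, __ATOMIC_RELAXED);
    futex_op(m, FUTEX_WAKE_PRIVATE, 1);
    return 0;
}
```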

Notice how the unlock operation may not require any interlocked instructions at all under contention. This can greatly improve performance in that case. We can also avoid using compare-exchange instructions in the lock operation, and use simpler exchange operations instead.

We can now optimize the condition variable implementation to use the new type of mutexes. Fortunately, the only code that actually cares about the internals of the mutex type is the cond_wait operation. We change it from exchanging and waiting on the value 2, to exchanging the value 257:
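Only the re-acquire loop at the end of cond_wait has to change. As a sketch, pulled out into a helper whose name (mutex_relock_contended) is made up for this example, and assuming the little-endian union layout described earlier:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef union mutex {
    unsigned u;
    struct {
        unsigned char locked;
        unsigned char contended;
    } b;
} mutex;

/* Take the lock after a condition-variable wakeup, always leaving it in
 * the locked and contended state (257), since requeued waiters may still
 * be sleeping on the mutex. */
void mutex_relock_contended(mutex *m)
{
    while (__atomic_exchange_n(&m->u, 257, __ATOMIC_ACQUIRE) & 1)
        syscall(SYS_futex, &m->u, FUTEX_WAIT_PRIVATE, 257, NULL, NULL, 0);
}
```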

Running the same tst-cond18.c benchmark gives 4.2 million wakeups in 20 seconds. This is twice as fast as the previous mutex and condition variable code, and more than four times faster than the standard ones in the pthreads library. However, we still aren't quite done with optimization...

Mutexes and Condition Variables in Assembly Language

Some of the performance in the pthreads library is gained by using assembly language. Having improved the algorithms pretty much as far as they can go in C, we can now use the same trick of moving to asm to get extra speed. The most obvious change is that we no longer need to use the syscall C function, and can use the syscall instruction directly. This allows us to avoid worrying about the values of parameters which aren't required in the futex operations we are using.

The mutex algorithm conversion is relatively straightforward. (Strictly, we don't need the initialization and destruction functions in asm as the C compiler will do a good job with them, but it doesn't hurt.)

The condition variable functions are also all quite simple except for cond_wait. At least on this machine, it was faster to not inline the call to mutex_unlock within it. Since we know which registers mutex_unlock will modify, we can hide extra parameters in registers it doesn't touch. This saves the time it would take to create and tear down a stack frame.

Unfortunately, the speed increase isn't all that much. The above assembly code does 4.5 million wakeups in the 20-second run of tst-cond18.c. This is a few percent faster than the C version. It seems that gcc optimizes the user-space operations well enough, and the extra code spent in the system call setup is vastly outweighed by the overhead of the system calls themselves. Anyway, the above code is about 5 times faster in the benchmark than the implementation in glibc. The disadvantage is that the above is not nearly as flexible, only supporting a single type of mutex.

Timed Waits

Conspicuously absent from the above have been the timed wait operations pthread_mutex_timedlock and pthread_cond_timedwait. The problem with these is a subtle conflict between the requirements of userspace and kernelspace. In userspace, one would like the guarantee that an operation could complete before a given time. This is often a requirement for real-time systems. The problem with this is that userspace can only obtain the time via another system call.

The gettimeofday function uses a vsyscall, so is quite fast, but the problem is that the process may be scheduled out immediately after obtaining the time. Since we would like to meet the deadline even if we are scheduled out on the way to waiting, userspace tends to want absolute timeouts. This is the reason why the timed wait operations in the pthreads API are absolute.

The kernel sees things differently. There, much effort has gone into requiring as few kernel-userspace context switches as possible. Since a relative timeout means that userspace doesn't need to call gettimeofday in order to use it, it is a more efficient interface. Thus sleeping calls like select, poll and FUTEX_WAIT use relative timeouts.

Thus there is a mismatch between the pthreads API and the kernel. An application which wants to wait no more than five seconds for a lock first needs to call gettimeofday once to obtain the time. Then the pthreads library needs to call it again to convert back to a relative time. Finally, after the wait, the pthreads library needs to call it a third time to check for timeout. (The system call may return ETIMEDOUT, but it is possible to be scheduled out just after a system call that didn't time out, and then be restarted after the time limit has expired.)
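The conversion step in the middle is simple arithmetic. A small sketch, using struct timespec and a hypothetical helper name (deadline_to_relative), with the current time passed in by the caller:

```c
#include <errno.h>
#include <time.h>

/* Convert an absolute deadline into the relative timeout the kernel wants.
 * Returns 0 and fills *rel, or ETIMEDOUT if the deadline has already passed. */
int deadline_to_relative(const struct timespec *deadline,
                         const struct timespec *now,
                         struct timespec *rel)
{
    rel->tv_sec = deadline->tv_sec - now->tv_sec;
    rel->tv_nsec = deadline->tv_nsec - now->tv_nsec;

    /* Borrow one second if the nanosecond field went negative. */
    if (rel->tv_nsec < 0) {
        rel->tv_sec -= 1;
        rel->tv_nsec += 1000000000L;
    }

    return (rel->tv_sec < 0) ? ETIMEDOUT : 0;
}
```

The cost is not in this arithmetic but in obtaining the value of now each time it is needed.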

This results in many calls to gettimeofday being executed even for an application that doesn't really care all that much about exactly how long something waits, as long as it doesn't wait too long. So to keep things simple we've avoided making these functions. If you want a timed delay, simply pass a relative time, as a struct timespec pointer, in the fourth parameter of the futex system call when using FUTEX_WAIT.
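A minimal sketch of such a timed wait is shown below. The helper name futex_wait_for and its convention of returning the errno value are assumptions made for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Sleep on *addr while it still holds `expected`, for at most `rel`
 * (a relative timeout). Returns 0 if woken, otherwise the errno value:
 * ETIMEDOUT on timeout, EAGAIN if *addr did not match, EINTR on a signal. */
int futex_wait_for(int *addr, int expected, const struct timespec *rel)
{
    if (syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
                rel, NULL, 0) == 0)
        return 0;
    return errno;
}
```

For instance, waiting on a word that nobody ever wakes, with a ten millisecond timeout, returns ETIMEDOUT once the time has elapsed.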

Summary

By avoiding much of the complexity introduced by the pthreads library supporting multiple mutex types, we have constructed a single-purpose but extremely fast mutex and condition variable library. The resulting code is up to five times faster in the contended case. In constructing the algorithms, we only required the simpler futex operations. Contrary to the statements in "Futexes are Tricky", FUTEX_CMP_REQUEUE wasn't needed as the use of FUTEX_REQUEUE doesn't have a race in this particular condition variable implementation.

Comments

Dmitry V'jukov said...

Hi,
Thanks for the interesting post!

In the second mutex implementation you use atomic operations of different sizes, the problem is that Intel IA-32/Intel64 docs prohibit that:

3A/I 8.1.2.2
"Software should access semaphores (shared memory used for signalling between
multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access."
-----

In the second mutex_unlock() there is a code:
/* We need to wake someone up */
m->b.contended = 0;
sys_futex(m, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
So we reset 'contended' and wake-up only single waiter, I do not quite get as to who will wake up other waiters. Since contended==0, some threads can stay blocked. What I am missing?
-----

Also, you may want to try applying the 'wakeup throttling' technique to it; it reduces the number of syscalls and reduces contention in an oversubscribed environment. Search for 'wakeup throttling' and you will find the paper on the Java locks implementation. In short: if we unblock a waiter, and it has not yet re-acquired the mutex, then we do not unblock other waiters -> we know that there is a thread still coming, and it will unblock other waiters if required.

> In the second mutex implementation you use atomic operations of different sizes, the problem is that Intel IA-32/Intel64 docs prohibit that.

Hmmm... I haven't seen that constraint before. At least in my experimentation, when any access is atomic it synchronizes everything, and you are ok. The "lock" prefix acts like an mfence instruction. When two accesses are of different sizes and neither are atomic then there are problems... I'm guessing the second situation is what that paragraph is trying to describe. It'd be interesting to see if there is any x86 implementation where this doesn't work. The Linux Kernel ticket spinlocks use the multi-sized access trick for unlock as well. If they didn't work I'm sure there would be some mention of the problem there.

I tried the obvious spin without cmpxchg in testing the algorithm. I was surprised to find it was actually slower! I'm not sure why. This was the C version though... perhaps the asm-only version will have different characteristics. (In my spinlock article the obvious way is faster though...)

The trick with the contended flag is that it is set in the xchg instruction under wakeup in the mutex_lock() function. If there are no waiters, then it stays unset - which is the correct answer, if there is a waiter it converts back to being in the contended state. This is the protocol Uli uses in his "Futexes are Tricky" article.

The algorithm here has some sort of wakeup throttling already. Under unlock, it spins a little bit waiting for a locker to appear. This avoids making a system call if someone grabs the lock in the meantime. I'm not quite sure how to do the same sort of trick that the Java implementation does. The difference here is that the wait-queue is in kernel space rather than userspace. We don't actually know if anyone is really waiting, or if anyone else is just about to acquire the lock. Counting the number of waiters is possible... but it slows down the important non-contended case in my tests.

It'd be interesting to try some of the other ideas though. :-)

Dmitry V'jukov said...

Regarding multi-sized accesses, I don't know how it possibly can break, but the statement is still there. Perhaps they meant a situation when a variable crosses cache-line boundary.

> I tried the obvious spin without cmpxchg in testing the algorithm
Mmmm... What algorithm do you mean? And it's slower than what?

> The trick with the contended flag is that
Aha! I see.

> Counting the number of waiters is possible... but it slows down the important non-contended case in my tests.
How can counting of waiters possibly slow-down non-contended case w/o waiters? ;)

Of course the slowdown bug could have been the fact that I didn't reinitialize c properly. Hmmm...

You are right, if you do the same trick of putting the locked bit and contended counts in different bytes, then it all should work. I didn't think of that. The idea that it was bad to count the contenders came from trying to implement a ticket lock mutex. That had abysmal performance for all sorts of reasons.

It looks like there are a few more low-hanging fruit to be picked. Perhaps the goal of 6x faster than standard pthreads is obtainable. :-P

This does 3.6 million "spins".
for (i = 0; i < 100; i++)
{
    if (!xchg_8(&m->b.locked, 1)) return 0;

    for (;i < 100; i++)
    {
        if (!m->b.locked) break;
        cpu_relax();
    }
}

I really would have thought the second would be better. Perhaps it just isn't waiting as long, since the non-locked operations are faster. Increasing the timeout to 1000 helps... improving to 4.0 million spins, but increasing the timeout further hurts performance. (Using a timeout of 1000 in the first case also gives about 4 million spins.)

Dmitry V'jukov said...

> You are right, if you do the same trick of putting the locked bit and contended counts in different bytes, then it all should work.

Since you want to use 32 bits for the lock. (That's what a futex will address.) You probably want a layout more like:

struct mutex
{
    char lock;
    char pending;
    unsigned short waiters;
};

You might also be able to use the lsb of waiters for the pending bit if that works better. Of course, apparently Microsoft has done some experiments that show that not waiting for the woken-up thread is faster due to the lock-convoy problem... I guess more benchmarks are in order, since Linux has faster syscalls, and the convoy problem therefore shouldn't be as bad.

Keeping N at 100, I varied M from 1 through to 200 stepping up in powers of two or so. There was no statistical difference from M=1 up to and including M=20, at M = 40 it was noticeably slower, and kept on getting worse from there. It looks like cpu_relax() [rep; nop] is a good-enough hardware back-off operation on this machine without a loop. Surprising, I know. :-/

Chris Anderson said...

This is slower, it seems, but this should comply with the restraint: "Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths."

Your code looks good... except for one tiny issue. You need to alter the cpu_relax() macro to have a dependency on memory otherwise the compiler may turn the loop in mutex_unlock() into an infinite one.

Something like:

#define cpu_relax() asm volatile("pause\n": : :"memory")

will do.

Chris Anderson said...

Thanks. In my microbench (8 producer threads feeding 8 consumer threads), this is twice as fast as posix locks, but it depends on heavy contention. If the contention isn't there, then the difference isn't very much. The spinlock/nanosleep backoff case is better until the contention is insignificant, although spinlock/nanosleep should produce more latency spikes than cond/locking. I need to measure that at some point. Code: http://www.eetbeetee.org/pc.c

Right, if there is no contention everyone uses a fast-path where a single locked instruction is executed, so no speed difference should be visible. (The extra overhead is minuscule compared to the bus-lock time.)

The algorithms here, and the one you've tested, really shine when there is contention. The reason for that is the improved unlock logic, where we can avoid calling into the kernel some of the time. By exploiting the (I think) legal behaviour of using variable-sized memory accesses, we can avoid a bus-locked instruction in the unlock path, further speeding things up.

Like you, I get a 2x speedup for the kernel-avoiding unlock, but I also get a 4-5x improvement if the key unlock step is a simple memory store. (Assuming heavy contention of course.)

Chris Anderson said...

The voodoo in the unlock... I think I might try to calculate the 200 loop after the barrier() as a percentage of futex() syscall overhead, see if it matters on different hardware. Anyway, I'm trying to come closer to the nanosleep spinlock contention case.

sfuerst said...

My guess is that it is hardware and kernel dependent. On windows, the fastest mutex algorithm I found (using keyed events) doesn't have the 200-loop spin.

hi said...

Thanks for the article, but what header do I have to include for all those cmpxchg, xchg_32 etc functions? And how do you link?

sfuerst said...

Have a look at the spin lock and read-write lock article, it gives definitions for all of the atomic functions used here in terms of gcc intrinsics and inline asm. They have been abstracted into functions to make them more compiler-agnostic and the code more readable.

n00b said...

Hi,

I have couple of doubts about above code.
1)in cond_signal()
Futex_wake is called for cv->seq + 1
In cond_wait()
Futex_wait is called for the threads waiting for cv->seq. how does both point to same set of threads work?

2)cv->seq is an integer.But syscall expects it to be an address which points to the list of threads waiting.Please clarify

1) cond_signal increments the sequence number so that any threads about to call futex_wait inside cond_wait will no longer match cv->seq. Since they don't match, they will immediately wake up. (Remember, the futex wait syscall will check the value before sleeping.) This prevents missed wakeups.

2) cv->seq is indeed an integer. However, its address is passed to the futex syscall when needed.

3) I suppose it should be. However, because the mutex type is a union, it doesn't matter in practice because the locked struct member is first.

Note that the cv implementation here basically assumes that most cond_signal and cond_broadcast calls will have at least some threads waiting. If this is not the case, it is possible to have faster code that doesn't enter the kernel at the cost of an extra atomic instruction. Have a look at the "events" code to see how this might work.

Samy Al Bahra said...

> In the second mutex implementation you use atomic operations of different sizes, the problem is that Intel IA-32/Intel64 docs prohibit that.

I've seen this cause breakage on Nehalem boxes before. You may be able to reproduce this, see http://carte.repnop.org/releases/ck-0.0.1.tar.gz. If you remove the barriers from ck_bytelock.h, you may encounter a race condition after several runs.

Samy Al Bahra said...

I forgot to mention, a unit test (which should eventually break without barriers) is in regressions/ck_bytelock/validate/validate.c.

Borislav Trifonov said...

If one is to attempt to modify the above mutex implementation to use FUTEX_LOCK_PI and FUTEX_UNLOCK_PI, does the futex implementation use for TID comparison all bytes of the futex integer? If so, how do we store the contended flag?
Also, it's not clear how trylock would change to include FUTEX_TRYLOCK_PI

sfuerst said...

Have a read of the docs for the priority inheritance version in the kernel. The upper bit is free for the presence of waiters or not:

#define FUTEX_WAITERS 0x80000000

Unfortunately, this means that the unlock can't use a simple non-atomic store. (Which is why the algorithm here is so fast.)

Borislav Trifonov said...

Can unlock be safely called on an unlocked mutex? The futex call will get executed even though there are no waiters.

sfuerst said...

Double-unlocking a mutex is really dumb. What happens when another thread locks it in the mean-time?

enosys said...

What is the pad for in the condition variable? Both the mutex and sequence number are ints, so why the 3-int sizing?

However, the real reason it is there though is to have ABI compatibility when Linux finally adds 64bit futexes. Having a 32bit roll-over problem is rare... a 64bit roll-over bug is basically impossible.

Joseph said...

I have tried both the ticket spinlock and the mutex described here against pthreads spinlocks and mutexes, and I don't see a 5 fold performance improvement. In fact, pthread locks come out faster.

The test locks a spinlock or mutex, increments a counter, and unlocks, a million times. Time seems to double for each new thread, and totals are longer than with pthreads.

Fred Z said...

Hi,

thanks for your articles, it's the top of what I found on the web 8)

I have a question about your union structures, as you use pthread futex & atomic instructions, it's pretty cross platform, but on big-endian OS, union and 257 value may not work properly ?

Thanks by advance.

John J. said...

Good stuff. I was porting a library from windows to linux pthreads and was disappointed with the mutex/condition performance. This code is considerably faster. I'll need to go over the algorithm thoroughly to reassure myself of the correctness, but so far it appears pretty solid. Is there a more "correct" version available in light of kernel changes and the comments above?

Thanks for making this available. I'll have a look at other aspects of the website as well.