Oracle Blog

Blog for roland

Real-time Java and futexes on Linux

We've just made available an early access release of Sun's Java RTS 2.1 on Linux. Java RTS (Real-Time System) is Sun's implementation of the Real-Time Specification for Java (RTSJ), based on Java SE and the HotSpot virtual machine. The product has been available on Solaris (SPARC and x86) for some time, and we've been working on making it available on Linux. Our development and test systems run fairly standard Linux distributions with RT-enabled kernels.

I contributed to this porting effort, and one Linux subsystem I had to take a closer look at is the futex subsystem. Java RTS on Linux relies only on the POSIX APIs: we use pthread mutexes (with the priority inheritance protocol) and condition variables. On recent Linux systems, the pthread synchronization primitives are built on top of the clever futex mechanism.

A futex keeps part of its state in user-land and part at the kernel level. This allows synchronization primitives to be implemented with a fast path entirely in user-land when possible (such as a lock operation on an uncontended mutex) and a fall-back path that goes through the kernel: Linux provides a single futex system call, and a command passed as an argument designates which operation to perform on the futex. Four commands are of interest to us: FUTEX_LOCK_PI/FUTEX_UNLOCK_PI implement the lock/unlock operations of priority inheritance mutexes, and FUTEX_WAIT/FUTEX_WAKE implement the pthread condition variable wait (optionally with a timeout) and signal operations.

The user-land part of the futex is a 32-bit integer whose value is the user-land part of its state. When the application calls the futex syscall to operate on the futex, the futex "object" is identified by the virtual address, in the process address space, of that 32-bit integer. As an example, a PI mutex is implemented with a single futex. The 32-bit integer is initialized to zero: mutex unlocked. The fast-path locking operation consists of atomically setting the futex's value to the thread id of the new owner if the futex's value is zero. This is performed with a compare-and-exchange or similar instruction. Unlocking the futex is similarly performed by atomically resetting the futex's value to zero if the value is still the thread id of the current thread. If the fast-path locking fails because the futex is already owned by another thread, the thread calls the futex syscall with the FUTEX_LOCK_PI command. The futex syscall then takes care of changing the user-land futex value (so that the owner thread cannot unlock through the fast path but is forced into the futex syscall with the FUTEX_UNLOCK_PI command), suspending the thread until the futex is available, and finally updating the user-land value to the thread id of the new owner.

A futex can be shared between processes. If it is not, the command passed to the futex syscall should be or'ed with the FUTEX_PRIVATE_FLAG constant.

Interestingly, and probably because a Java VM, and the RT VM in particular, puts an unusual load on the system, pretty much all the problems we have had so far with Linux itself were related to the futex mechanism. One of them is a performance issue on one of our own real-time Java benchmarks. The benchmark programs a real-time thread so that it is woken up at a particular absolute time in the future, using RTSJ APIs. Internally, the VM uses condition variables and mutexes in the process of waking up the thread when the absolute time is reached. We measure the absolute time at which the thread is effectively woken up and back to executing Java code. The difference between the measured time and the requested time is called the latency. Worst-case latency (and not mean latency) is what defines performance here.

The benchmark is run with and without some load, including load that triggers the garbage collector. Without the load, the latency is always very low. With the load, the latency is sometimes ten times higher. Again, worst-case behavior is all that matters to us in this case. What we found is that the extra latency happens in the futex syscall, and that the extra delay is caused by a non-real-time thread in the VM performing calls to mmap. Indeed, in the kernel sources we found that the futex syscall synchronizes with mmap calls. This should not happen with process-private futex operations (commands marked with the private flag). A pthread mutex or condition variable can be marked either shared between processes or private to a single one. We use only process-private mutexes and condition variables, so we expect them to be handled only by private futex commands. Inspecting the glibc code, we discovered that the process-private attribute of the futex is not always taken advantage of, even in the most recent glibc releases. Doing some more investigation, we found that VM memory locking (through the mlock(2) call, used to prevent indeterminism caused by paging activity) exacerbates the problem. On a simple C test case, we measured that mmap'ing a 100 MB region in a non-real-time thread could cause a latency of up to 1 second in a real-time thread.

So why do we have so much mmap activity in our real-time VM? When run with one of the standard HotSpot GCs and lots of garbage produced, the GC repeatedly grows and shrinks the heap, which is done with mmap calls. In any case, mmap is widely used (for instance, malloc is implemented with mmap), so even if we use our real-time GC (which does not grow/shrink the heap) instead of one of HotSpot's GCs, suffering extra latency in a real-time thread caused by a non-real-time thread's mmaps is still possible. Fixing the process-private attribute of pthread mutexes/condition variables is thus important for improved and guaranteed latency of complex real-time systems on Linux.