Lets say you're interested in using HotSpot as a vehicle for synchronization research ...

What follows is a brief overview of HotSpot's synchronization subsystem.

First, we'll assume you're starting with the 1.6 sources. That's a good idea as I overhauled the bulk of the synchronization subsystem between 1.5 and 1.6. As a caveat, I should also point out that the HotSpot C++ sources contain Monitor and Mutex classes. Monitors and Mutexes are used for internal native synchronization only - there's no relationship between Monitors and Mutexes and Java-level monitors. Java-level synchronization uses an entirely different subsystem.

I'll try to give a brief overview of how java-level synchronization works in hotspot. The following description relates to 1.6, but the general concepts still apply to 1.5 and 1.4. First, most synchronization is handled through what we call "fast-path" code. We have two just-in-time compilers (JITs) and an interpreter, all of which will emit fast-path code. The two JITs are "C1", which is the -client compiler, and "C2", which is the -server compiler. I assume the reader is more interested in C2 as it's higher performance. C1 and C2 both emit fast-path code directly at the synchronization site. In the normal case when there's no contention, the synchronization operation will be completed entirely in the fast-path. If, however, we need to block or wake a thread (in monitorenter or monitorexit, respectively), the fast-path code will call into the slow-path. The slow-path implementation is in native C++ code while the fast-path is emitted by the JITs.

The fast-path is emitted by cpu/i486/vm/i486.ad "FastLock" and cpu/sparc/vm/assembler_sparc compiler_lock_object() for IA32 and SPARC, respectively. The slow-path is implemented in the native C++ objectSynchronizer and objectMonitor classes. See synchronizer.cpp. In 1.6 the synchronization system is mostly platform-independent (synchronizer.cpp), with a very narrow platform-specific park()-unpark() abstraction used to block and wake threads. The platform-independent code also relies on compare_and_swap and various platform-specific memory barrier primitives. The park-unpark operators are implemented in the ParkEvent native class in os_solaris.cpp, os_linux.cpp and os_win32.cpp.

Per-object synchronization state is encoded as follows. Each object instance has a two-word header followed by the constituent non-static member fields. The header consists of a so-called mark word followed by a class pointer. The class pointer is the same for all instances of a type and points to a type-specific structure containing class static fields, vtable, runtime type information (RTTI), etc., The mark work is multiplexed, containing the identity hashcode value (if assigned), garbage collector "age" bits, and synchronization information. See markOop.hpp for details.

The synchronization state is encoded in the low-order bits of the mark:

Neutral: Unlocked

Biased: Locked/Unlocked + Unshared

Tantamount to deferring unlock until contention Avoids CAS atomic latency in common case 2nd thread must revoke bias from 1st

Stack-Locked: Locked + Shared but uncontended

The mark points to displaced mark word on the owner thread's stack. (The actual type is called a BasicLock). Stack-locking is clever in that he address in the mark conveys ownership, and the target of the address provides for access to the displaced mark work. We only use stack-locking when the analysis phase of the JIT can prove that lock operations are balanced. Note that javac only emits balanced lock operations, so it's extremely rare to encounter unbalanced locking. If we find that a method has unbalanced locking we run it in the interpreter in perpetuity.

Inflated: Locked/Unlocked + Shared and contended

Threads are blocked in monitorenter or wait(). The mark points to heavy-weight "objectmonitor" structure. Note that HotSpot inflates on-demand, only as necessary. Deflation occurs at stop-the-world garbage collection time. Conceptually, the objectMonitor serves as an optional extension to the object. An objectMonitor is associated with an object on-demand.

Note that the biased state is new in 1.6 and some 1.5 update releases. It serves solely to avoid local CAS latency (100s-1000s of instructions on some processors, particularly some Intel CPUs). In 1.6 we also have lock coarsening (again to deal with CAS latency) as well as lock elision through rather simplistic escape analysis. If we can prove an object is thread or method local we can elide the synchronization operations on the object, as well possibly allocating the object on-stack instead of in the heap. Coarsening reduces (amortizes) CAS latency, but at the possible risk of increasing critical section length and decreasing potential parallelism. See Amdahl's law.

For synchronization, all the mark word encodings in HotSpot are pointer-based. The low order bits of the mark serve as a tag, and the upper bits constitute a pointer to either a BasicLock (for stack-locked objects) or a objectMonitor (for inflated objects). Some sources says that HotSpot uses thin-locks, but that's not accurate and has never been the case. Thin-locks were devised by David Bacon at IBM research. Briefly, a thin lock is value-based and encodes an owner thread-id, recursion count, etc., into a single object header word. Thin locks are typically revert to inflation when the object is contended.

In the text above I mentioned balanced locking. Balanced locking deserves additional explanation as it enables certain optimizations. Balanced locking simply means that the JIT can prove that enter and exit operations in the bytecode are applied in a stack-like manner. For example, {enter A; enter B; ....; exit B; exit A} is properly balanced, whereas {enter A; enter B; ... exit A; exit B;} is not. If the JIT can prove balance, it will be able to associate each enter and exit bytecode with a so-called "BasicLock". The JIT allocates BasicLocks in the frame containing the synchronization site. Javac will never generate unbalanced locking operations so it's rare. Note too, that javac will always wrap a synchronized region in an implicit try-finally block, just to make sure that control doesn't escape the block without releasing the lock. Interestingly, the verifier -- which is really just an abstract interpreter -- doesn't require balanced locking. If we can't prove balance then the code in question will run in the interpreter forever; we'll never bother compiling it. The synchronization code in the interpreter can tolerate unbalanced lock operations. Once the JIT has established that all synchronization operations in a method are balanced, that bytecode is eligible for stack-locking and biased locking.

I'll now describe the operation of stack-locking. The inlined enter code emitted by the JIT will first fetch & examine the mark word. We call this step mark word "triage". (That code first checks to see if the object is biased, but we'll ignore biased locking for a moment and remain focused on stack-locking). If the mark word is neutral as determined by the low-order bits of the mark, we'll try to use stack locking. First, in anticipation of a successful compare-and-swap (CAS), the code will store the just-fetched mark value into the on-frame word that was allocated by the JIT and is associated with that particular bytecode offset. Next, the inline enter code will attempt to CAS the address of that on-frame location over the mark word. If successful, we've locked the object. The mark word now points to a word on the stack and the original mark word value (which may contain a hashcode value) is said to be displaced. In a sense the mark word identifies the current owner (simply by virtue of pointing into the owner's stack) as well as providing access to the original displaced mark value. The inlined exit path will reverse the operation, using CAS to try to restore the displaced mark value from the on-frame location back into the actual mark word in the object.

The JIT also emits fast-path operations for the inflated case, where the object's mark work points to a native objectMonitor structure. The objectMonitor is, in a sense, and optional extension to an object, containing synchronization metadata. Briefly, we need to inflate (associate or bind an object to an otherwise unused objectmonitor) whenever we need to maintain lists of threads that are blocked on entry or are a wait()ing on the object. Inflation occurs on-demand. Deflation occurs at all stop-the-world safepoints, where the JVM will attempt to scavenge and deflate otherwise idle objectMonitors. The objectMonitor has an "owner" field. The object is locked if and only if the owner field is non-null. The fast-path enter code uses CAS to try to swing the owner field from null to non-null.

The fast path code emitted by the JITs currently provides for biased locking (where an enter-exit pair requires 0 atomics), stack-locking (where an enter-exit pair requires 2 atomic CAS operations) and inflated operations where there's no need to block or wake threads (which requires 1 atomic for an enter-exit pair). Rather oddly, the inflated fast-path is currently faster than the stack-locking fast-path. The stack-locked encoding is also redundant with the inflated+locked encoding. Given that, I'd like to eventually eliminate stack locks and switch to single lock encoding such as that found in relaxed locks or ExactVM's Meta-Lock. Relaxed locks, for instance, inflates at lock-time and deflates aggressively at unlock-time, removing the need to scavenge.

Obviously, this was a rather abridged description of hotspot's synchronization subsystem. It's rather complicated, but it's also by far the fastest (in 1.6) of all known JVMs, as well as being considerably better (lower latency and better scalability) than native pthreads synchronization implementations.

Assuming you're interested in transactional memory, the most basic approach in a JVM would be to start a transaction (usually abridged "txn") in monitorenter and try to commit in monitorexit. And of course you'll need to fetch and examine the mark, making it part of the txn's read-set in order to monitor that the object is unlocked (in the traditional mutual exclusion sense) at the start of the txn and that the object remains unlocked over the course of the txn. (We call this "commuting" a synchronized block to a hardware txn). And of course you need to deal with interesting issues such as aborts, infeasible txns (too long, traps, wait()), etc., in which case you'd probably want to revert to normal mutual exclusion locking. Given that, I suspect that the inlined fast-path code would be of the most interest.

Join the discussion

Comments ( 6 )

Derek Morr Saturday, August 19, 2006

Very nice blog entry. The links to the presentations at the end don't work. Could you please fix them? Thanks!

Dave Dice Sunday, August 27, 2006

Derek, Thanks - the links should be fixed.

guest Monday, August 28, 2006

Could I contact you directly about using hotspot for a research project?

Alessandro Wednesday, May 30, 2007

Hi Dave,

I have some questions on park/unpark methods and couldn't find any answers on the web. I hope you can help.

java.util.concurrent APIs do not lean on wait and notify, but on park and unpark.Are there any major weaknesses of wait and notify that park and unpark help to overcome?

Are there big differences in the way a thread is managed by the scheduler in case it stopped by park or by wait?

Is it recommended to move wait/notify based code to java.util.concurrent APIs because of performance and scalability reasons?

As the park method may exit "for no reason" is there any guarantee that placing it in a cycle won't give rise to a loop?

Thank you in advance!

Cheers,

Alessandro

Dave Wednesday, May 30, 2007

Hi Alessandro,Generally I consider park-unpark to be more primitive and fundamental than wait-notify -- they're very useful for constructing other higher level synchronization primitives, but I'd advise programmers against using them directly. They're also easy to misuse. That is, even though park-unpark are exposed via j.u.c.LockSupport, I'd recommend using wait-notify or with j.u.c conditions. All the tools, books, accumulated wisdom, programming idioms, etc., assume the programmer will use wait-notify or perhaps j.u.c. When a thread calls wait() it'll eventually block on an internal (non-visible) form of park(). So there's really no appreciable difference to the scheduler. Ultimately, all blocking synchronization operators reduce to the same underlying primitives in the JVM. (Although the implementation of the park-unpark abstraction is platform-specific, of course).You'll enjoy the same scalability with wait-notify as with park-unpark so I'd recommend that you use whichever is more convenient. See also http://blogs.sun.com/dave/entry/java_util_concurrent_reentrantlock_vsAs for park() returning for "no reason", that's the same as wait() -- both admit the possibility of spurious wakeup and must be used in a loop that checks the condition of interest. RegardsDave

Alessandro Wednesday, May 30, 2007

Dave,

Thank you very much for your answer.

I guessed that wait/notify were less efficient because, upon notify, the waiting thread is still blocked until the notifying thread releases the monitor. So I thought we would have had to migrate our implementation to park/unpark in order to improve scalability. Our code is based on a staged event-driven architecture that employes queues and thread pools. This implies that wait and notify can be called many times in a second (potentially thousands of calls) by a small number of threads (tens of threads).