Locks with LIFO admission order

Why would we ever want a lock with a LIFO admission policy?

First, a LIFO lock provides a useful measure of real-world scalability. Let's say we have a set of threads that each iterate as follows: acquire some lock L; execute a fixed-length critical section body of duration C; release L; and finally execute a non-critical section of length N. We run T threads concurrently, and at the end of a measurement interval we report the total number of iterations completed, as well as per-thread iteration counts. Amdahl's law says the maximum ideal speedup relative to a single thread should be (N+C)/C. We can run our experiments, varying the thread count, measure aggregate throughput, and compare to see how close we come to Amdahl's bound. Assuming we have a homogeneous system and ignoring any potential superlinear effects, the observed peak speedup will be capped by Amdahl's bound. And if we use a fair FIFO lock, such as MCS, the threads will all have approximately equal completion counts.

It's worth noting that Amdahl's law is sometimes misapplied to locks and critical sections. In the classic Amdahl model, during the serial phase no other threads may be executing concurrently, while with locks, when one thread is in the critical section other threads may be executing concurrently in their non-critical sections. That is, classic Amdahl's law applies to barriers. See also Gustafson's law, Gunther's universal scaling law, and in particular Eyerman's model. Critically, though, the maximum speedup bounds still hold.

Now let's say we switch to a LIFO lock. Ideally, the aggregate throughput will be the same as we saw with the FIFO lock. If N=30 and C=10, then the ideal speedup is (30+10)/10 = 4X. If we run with 10 threads under a LIFO lock, when we examine the distribution of per-thread completion counts we expect to see 4 threads dominate with about equal performance, and 6 threads should have starved completely.
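As a rough illustration of the experiment just described, here is a small Python sketch that replaces real threads with an idealized discrete-time model. The function name `simulate`, its parameters, and the tie-breaking rule (a thread arriving at the very instant the lock is released counts as already waiting) are my own assumptions for the sake of the example; a real machine will show lock overhead and noise that this model ignores.

```python
def simulate(threads, ncs, cs, acquisitions, policy):
    """Return per-thread completion counts for an idealized contended lock.

    Hypothetical model: zero lock overhead, a fixed critical section `cs`,
    a fixed non-critical section `ncs`, and a thread that arrives at the
    same instant the lock is released counts as already waiting.
    """
    arrival = [0] * threads      # time at which each thread next reaches the lock
    done = [0] * threads         # completed iterations per thread
    now = 0
    for _ in range(acquisitions):
        waiting = [i for i in range(threads) if arrival[i] <= now]
        if not waiting:          # lock is idle: advance to the next arrival
            now = min(arrival)
            waiting = [i for i in range(threads) if arrival[i] <= now]

        def key(i):
            return (arrival[i], i)

        # LIFO admits the most recent arrival, FIFO the oldest one.
        owner = max(waiting, key=key) if policy == "lifo" else min(waiting, key=key)
        now += cs                    # owner executes the critical section
        done[owner] += 1
        arrival[owner] = now + ncs   # then its non-critical section
    return done


print(simulate(10, 30, 10, 400, "fifo"))  # → [40, 40, 40, 40, 40, 40, 40, 40, 40, 40]
print(simulate(10, 30, 10, 400, "lifo"))  # → [0, 0, 0, 0, 0, 0, 100, 100, 100, 100]
```

With N=30 and C=10 the FIFO run spreads the 400 acquisitions evenly, while the LIFO run concentrates them on (N+C)/C = 4 threads and starves the other 6, matching the expectation above.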
This gives us another non-analytic, empirical way to gauge the maximum speedup over a lock. Put another way, can we figure out how many threads we can "squeeze" or pack into a contended lock before we hit saturation? We keep increasing T until some threads show evidence of starvation. This lets us discern the N/C ratio, since roughly (N+C)/C threads fit before the lock saturates. Of course we could try to derive the ratio using FIFO locks, varying T, and using Amdahl's law, but in practice there are quite a few confounding factors. The LIFO approach gives us a fairly direct reading of the number of threads that will "fit" before we reach over-saturation.

LIFO locks are also useful in their own right. While they are deeply unfair, they work very well with spin-then-park waiting strategies. If we imagine the lock as implemented with a stack of waiting threads, threads near the head are most likely to be spinning, and are also most likely to be granted the lock next. If the lock is over-saturated, then under a LIFO policy, ownership will circulate over just a subset of the contending threads. In turn this can reduce cache pressure and yield benefits arising from thermal and energy bounds. Of course we have to take measures to ensure long-term eventual fairness, but many locks intentionally trade off short-term fairness for throughput. (See our "Cohort" locks, for example).

A possibly systemic downside to LIFO locks is that arrivals and departures may need to access the same lock metadata, creating an acute coherence hot-spot. With a contended MCS lock, for instance, an unlock operation doesn't need to access the "tail" field.

I wondered if there was a LIFO analog to the classic FIFO ticket lock and put the question to my colleagues in Oracle Labs' Scalable Synchronization Research Group, and collected some of the designs, which I'll report below. It's an interesting exercise and puzzle, and hard to resist for concurrency folks. Alex Kogan, Victor Luchangco, Tim Harris, Yossi Lev and I contributed.
Any mistakes are mine.

The most basic LIFO algorithm I could think of was to implement an explicit stack of waiting threads with a central "head" pointer which points to the most recently arrived thread. The approach is pretty obvious and yields a true LIFO admission order. Expressed in a pidgin Python/C++ dialect and assuming atomic<T> for all shared variables, the following sketch describes that form. The lock is very similar to the Thread::MuxAcquire() and ::MuxRelease() primitives that I wrote for the HotSpot JVM. (Those are internal locks used by the JVM to get over a bootstrapping phase where the normal native C++ HotSpot Monitor:: and Mutex:: classes aren't yet initialized). We call this form "E3". (I apologize for the crude listings that follow. Oracle's blog software explicitly filters out remote JavaScript, so I'm unable to use civilized pretty-print facilities such as GitHub's "gist" mechanism).

The next few variations forgo explicit nodes, and as such, we'll have global spinning. The broad inspiration for this family is the CLH lock, where a thread knows the adjacent thread on the queue, but the queue is implicit. We call the following "E5" because it was the 5th experimental version.

The first thing to notice is that the "P" encoding can result in two waiting phases in Acquire(): arriving threads may first wait while Head == P and then wait for their specific turn. The interlock protocol used to hand off ownership feels rather synchronous. The P state is effectively a lock that freezes out arrivals until the successor manages to depart. In addition, a group of threads could be waiting while Head == P, but subsequently "enqueue" themselves in an order that differs from their arrival order, so we don't have strict pedantic LIFO. (See also FCFE = First-Come-First-Enabled). We can streamline E5 slightly, yielding E5B:

The next version, E6, eliminates the P encoding and switches to a seqlock-like lock for the hand-off transition. The lock instance requires just a single "Next" field. When the low-order bit of Next is set, arrivals are frozen during the hand-off. E6 appears to be the fastest on non-NUMA systems, possibly because the paths are relatively tight.

For E7 we revert to using an "inner lock" to protect an explicit stack of waiting threads. An MCS or CLH lock works nicely for that purpose. E7 provides local spinning and, depending on the inner lock implementation, is true LIFO. We use an encoding of Head == 1 to indicate the lock is held but no threads are waiting.
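To make the "inner lock plus explicit stack" idea concrete, here is a minimal Python sketch in the same spirit. It is not the E7 listing: it assumes a plain `threading.Lock` as the inner lock instead of MCS or CLH, it parks each waiter on its own `threading.Event` rather than spinning locally, and it tracks the "held, no waiters" state with a boolean and an empty list instead of the Head == 1 encoding. The names `LifoLock` and `demo_order` are made up for the example.

```python
import threading
import time


class LifoLock:
    """LIFO lock: an inner lock guards an explicit stack of waiters."""

    def __init__(self):
        self._inner = threading.Lock()  # inner lock protecting the fields below
        self._held = False
        self._stack = []                # last element = most recent arrival

    def acquire(self):
        with self._inner:
            if not self._held:          # uncontended: take the lock directly
                self._held = True
                return
            me = threading.Event()      # private flag, so each waiter waits locally
            self._stack.append(me)
        me.wait()                       # ownership is handed over by release()

    def release(self):
        with self._inner:
            if not self._stack:         # nobody waiting: lock becomes free
                self._held = False
                return
            successor = self._stack.pop()   # most recent arrival goes next
        successor.set()                 # direct hand-off; _held stays True


def demo_order():
    """Queue three waiters in a known order and record the admission order."""
    lock = LifoLock()
    order = []

    def worker(i):
        lock.acquire()
        order.append(i)
        lock.release()

    lock.acquire()
    workers = []
    for i in range(3):
        t = threading.Thread(target=worker, args=(i,))
        t.start()
        while len(lock._stack) != i + 1:    # wait until worker i is on the stack
            time.sleep(0.001)
        workers.append(t)
    lock.release()
    for t in workers:
        t.join()
    return order


print(demo_order())  # → [2, 1, 0]
```

Because every arrival and departure goes through the inner lock, the admission order is exactly the reverse of the order in which waiters were pushed, at the cost of funnelling all traffic through that one lock.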

E9 uses a latch -- really a lock that allows asymmetric acquire and release. Specifically, if thread T1 acquires the latch then T2 may subsequently release the latch. (This is precisely the "thread-oblivious" property required by top-level cohort locks). For our purposes we can use a ticket lock. Our lock structure contains an inner ticket lock and Depth and Admit fields.
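The thread-oblivious property is easy to see in a ticket lock, since it records no owner identity. The Python sketch below illustrates only that inner latch and makes no attempt to reproduce E9's Depth and Admit logic. The names `TicketLatch` and `handoff_demo` are hypothetical, and because Python has no atomic fetch-and-add, a small `threading.Lock` stands in for one here.

```python
import threading
import time


class TicketLatch:
    """Ticket lock usable as a latch: release() may run on any thread."""

    def __init__(self):
        self._guard = threading.Lock()  # stands in for an atomic fetch-and-add
        self._next_ticket = 0
        self._now_serving = 0

    def acquire(self):
        with self._guard:               # emulates ticket = fetch_add(next_ticket, 1)
            ticket = self._next_ticket
            self._next_ticket += 1
        while self._now_serving != ticket:
            time.sleep(0)               # global spinning: yield and re-check

    def release(self):
        # No owner is recorded, so whichever thread calls this simply
        # admits the holder of the next ticket.
        self._now_serving += 1


def handoff_demo():
    latch = TicketLatch()
    latch.acquire()                     # T1 (the caller) acquires ticket 0
    helper = threading.Thread(target=latch.release)
    helper.start()                      # T2 releases on T1's behalf
    helper.join()
    latch.acquire()                     # succeeds at once: ticket 1 is being served
    latch.release()
    return latch._now_serving


print(handoff_demo())  # → 2
```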