Sunday, November 16, 2014

I recently added a synchronization primitive mechanism in my HdrHistogram and LatencyUtils code, which I think has generic use for some very common operations. Specifically, when wait-free writers are updating stuff that background analyzers or loggers needs to look at. I've isolated it in what I now call a WriterReaderPhaser. The name is very intentional, and we'll get to that in a moment. And to the code (all 66 actual lines of it, 200 with elaborate comments). But first, I'll stray into some "how did this come about" storytelling.

WriterReaderPhaser is a new (I think) synchronization primitive: It provides a straightforward interface and API to coordinate wait-free writing to a shared data structure with blocking reading operations of the same data. Readers view a stable (i.e. non changing, coherent) data set while writers continue to modify data without waiting. And readers are guaranteed forward progress, and will only block for other readers and for writers that may have been "in flight" at the time the reader establishes a stable view of the data.

How did this come about?

This sometimes happens when I build stuff: I find myself in need of some behavior that I thought would be common, but for which I can't find an existing implementation, or a name, or a description. This can obviously be ascribed to my weak Google-fu skills, but after a while I give up and just build the thing, because "it's not that complicated". So I build a one-off implementation into whatever I am doing at the time, and move on with life. At some later point, I find myself needing the same thing again. And since I had already solved that problem once, I go back to my old code and (let's be honest) copy-and-paste my first implementation into whatever new thing I'm working on. Sometimes the little guy on my right shoulder wins over the other guy, and I come back and refactor the behavior into a separate class and build an API for more generic use, at which point the "does this deserve it's own library? It's own repo?" thinking starts, coupled with much Yak Shaving [1]. Sometimes the guy on the left shoulder wins, and I actually get on with the real work I was supposed to be doing. I'll leave it to you to decide which little guy is red and which is white.

Sometimes (usually much later) I realize that what I built was actually new. That even though I thought it was a common use case, and built my version simply out of impatience or frustration at not finding something I could use as-is, I may actually be the first person to solve it. Most of those times, this realization is quickly followed by someone showing me a paper or a piece of code that is 30 years old that makes me go "oh... right.". But sometimes that doesn't happen. Sometimes it really is new.

HdrHistogram itself started this way. It was nothing more than about 100 lines of code in a one-off "JitterMeter" tool I was playing with, which needed to record latencies very quickly and report accurate percentiles with many nines in them. Then I found myself building all sorts of variations on jitter meters and sharing them (jHiccup is an evolved version with a better name). And then I found that people (myself included) were taking the code and ripping out just the histogram trick inside, because they needed a histogram that was actually useful for talking about latencies. Recognizing that a fast histogram with good precision and accurate and fine grained quantile reporting capability is actually a very common use case, I decided to build a Yak shaving co-op on github and called it HdrHistogram. The first Yak hair I produced was Java-colored but others have recently added other colors and breeds.

HdrHistogram is a [presumably] successful example of this process going the distance. More often than not, it doesn't. That's probably what my stale repos on github with 2 stars and no forks represent.

WriterReaderPhaser is currently about halfway through this precarious process, but at this point I'm pretty sure it's not going to die. It's a class on it's own, but not yet it's own library. Certainly not it's own repo yet. It will need to find a home, but org.giltene.stuff is probably not where it needs to end up. Since it's so short, this blog entry is as good a home as any for now.

Most importantly, it looks like it may actually be a new and generically useful synchronization primitive. More accurately: nobody has shown me that "oh... right." link or paper yet, and I'm done holding my breath for now.

So what is WriterReaderPhaser about?

Have you ever had a need for logging or analyzing data that is actively being updated? Have you ever wanted to do that without stalling the writers (recorders) in any way? If so, then WriterReaderPhaser is for you.

I'm not talking about logging messages or text lines here. I'm talking about data. Data larger than one word of memory. Data that holds actual interesting state. Data that keeps being updated, but needs to be viewed in a stable and coherent way for analysis or logging. Data like frame buffers. Data like histograms. Data like usage counts. Data that changes.

Existing solutions

Sure, you can use channels, queues or magic rings to move data updates and safely process them in background copies of the data. You can use persistent data structures and all sorts of immutable trickery. But those are expensive. As in orders of magnitude more expensive than updating in-cache state in place. When this data thing you want to look at could be updated millions of times per second, you invariably end up with some sort of double-buffered (or multi buffered) scheme: Updates are done to an active copy, and analysis is done "in the background" on stable, inactive copies.

Double buffered schemes usually involve some sort of "phase flipping". At some point the notion of which copy is active changes. Writers update the "new" active copy, and readers access a stable and coherent copy that used to be active, but now isn't. It's this phase flipping that usually comes in the way of keeping writers from blocking.

There are all sorts of variations on how to do this flipping. We can obviously use some form of mutual exclusion lock to protect the writes and the flip. But then writers will block each other, and be blocked by the flipping operation. We can use ReaderWriter locks backwards: where the state being protected by the ReaderWriter lock would be the notion of which data set is the "active" one (the one writers write to). In this scheme writers take the read lock for the duration of their active state modification operations, while readers take the write lock to flip the roles of active and inactive data sets. This can be [much] better than complete mutual exclusion when multiple writers are involved, since writers no longer block other writers, but readers still block writers during a flip. Also, when you start asking yourself "what does 'read' mean again in this context?" that is a good sign you have a problem. Most people write buggier code when standing on their head and juggling. I'm sure there are a whole bunch of other schemes people use, but in my looking around thus far, I didn't find any examples that were non-blocking for the writers.

Why did I care?

The thing I actually wanted to double-buffer was a histogram. And not just any histogram. A fixed-footprint histogram that supports lossless recording of experienced latencies, such that later computation of precise percentiles will be possible, all the way to the as-many-9s-as-there-are-in-the-data level. The very purpose of such a histogram is often to capture and analyze latency outlier behavior. The recording operation cannot be allowed to be a cause of the very outliers it is trying to measure. For the latency recording mechanism to have any susceptibility to blocking or locking would be unacceptable.

These latency histograms are basically non-blocking data structures with tens (or hundreds) of kilobytes of state that is rapidly being mutated by critical path "writer" code. But I wanted to log their contents over intervals that are short enough to be interesting for monitoring purposes, and for later time based analysis. In order to log the latency information being captured, I needed a logging "reader" to somehow gain access to a stable, coherent "snapshot" of the latency data that was recorded during some prior interval. To do this, I needed a way for the reader to flip the roles of the active and inactive histograms, but I needed to do that without ever blocking the writers. This is a classic case of an asymmetric synchronization need. I'm fine blocking, delaying and pausing the reader. I just can't afford for the writers to ever block or otherwise delay the execution of the thread they are recording in.

In comes WriterReaderPhaser. And the best starting point for understanding what it does is to dissect the name:

The Phaser part is there because it's main function is to coordinate phase shifts between the writers and the readers. Besides, I couldn't bring myself to call this thing a lock. It's not a lock. Not in it's most important function, which is phase shift coordination. Writers remain lock-free in all cases (they actually remain wait free on architectures that support atomic increment operations). They never block or lock. Calling WriterReaderPhaser a lock would be like calling an AtomicLong an "add lock" because someone could also construct a spin-lock around it....

The WriterReader part is a reversal of the commonly used ReaderWriter (or ReadWrite) term. ReaderWriter locks are asymmetric, but in the reverse direction of what I needed: they enable [relatively] smooth reader operation while causing the writers to block. The really cool wait-free Left-Right which Martin Thompson had pointed me to achieves perfectly smooth reader operation, but that's still not what I needed. WriterReaderPhaser works for the exactly reversed need: Writers remain non-blocking and perfectly smooth, while only readers suffer.

The desired behaviors I was looking for in a WriterReaderPhaser were:

1. Writers remaining lock-free at all times. Ideally they will remain wait-free at all times.

2. A Reader can coordinate a phase flip and access to the inactive data such that:

2.1 Other readers will not flip a phase while this reader is still interested in the inactive data.

2.2 No writer modification will be made to the inactive data after the phase flip operation is complete, and for as long as the reader is interested in the inactive data.

2.3 Readers are guaranteed forward progress (even in the presence of heavy and continuous writer activity, and even when there is no writer activity at all).

Defining WriterReaderPhaser:

With these high level desired behaviors stated, lets clearly define the qualities and guarantees that a well implemented WriterReaderPhaser primitive would provide to users, and the relevant rules that users must adhere to in order to maintain those qualities and guarantees:

A WriterReaderPhaser instance provides the following 5 operations:

writerCriticalSectionEnter

writerCriticalSectionExit

readerLock

readerUnlock

flipPhase

When a WriterReaderPhaser instance is used to protect an actively updated data structure [or set of data structures] involving [potentially multiple] writers and [potentially multiple] readers , the assumptions on how readers and writers act are:

There are two sets of data structures (an "active" set and an "inactive" set)

Writing is done to the perceived active version (as perceived by the writer), and only within critical sections delineated by writerCriticalSectionEnter and writerCriticalSectionExit operations.

Only readers switch the perceived roles of the active and inactive data structures. They do so only while holding the readerLock, and the switch is only done before execution a flipPhase.

Readers do not hold onto readerLock indefinitely.

Only readers perform readerLock and readerUnlock.

Writers do not remain in their critical sections indefinitely.

Only writers perform writerCriticalSectionEnter and writerCriticalSectionExit.

Only readers perform flipPhase operations, and only while holding the readerLock.

When the above assumptions are met, WriterReaderPhaser guarantees that the inactive data structures are not being modified by any writers while being read while under readerLock protection after a flipPhase operation.

The following progress guarantees are provided to writers and readers that adhere to the above stated assumptions:

flipPhase operations are guaranteed to make forward progress, and will only be blocked by writers whose critical sections were entered prior to the start of the reader's flipPhase operation, and have not yet exited their critical sections.

readerLock only blocks for other readers that are holding the readerLock.

Example use

Imagine a simple use case where a large set of rapidly updated counters is being modified by writers, and a reader needs to gain access to stable interval samples of those counters for reporting and other analysis purposes.

The counters are represented in a volatile array of values (it is the array reference that is volatile, not the value cells within it):

volatile long counts[];

...

A writer updates a specific count (n) in the set of counters:

writerCriticalSectionEnter

counts[n]++; // should use atomic increment if multi-writer

writerCriticalSectionExit

A reader gains access to a stable set of counts collected during an interval, reports on it, and accumulates it:

long intervalCounts[];

long accumulated_counts[];

...

readerLock

reset(interval_counts);

long tmp[] = counts;

counts = interval_counts;

interval_counts = tmp;

flipPhase // At this point, interval_counts content is stable

report_interval_counts(interval_counts);

accumulated_counts.add(interval_counts);

readerUnlock

A working implementation

Under the hood, my WriterReaderPhaser implementation achieves these qualities in a fairly straightforward way, by using a dual set of epoch counters (and "odd" set and "even" set) to coordinate the phase flip operations, coupled with a read lock that is used purely to protect readers from each other in multi-reader situations: i.e. to prevent one reader from flipping a phase or changing the notion of active o inactive data while another reader is still operating on it. Many other implementation mechanisms are possible, but this one is certainly sufficient for the job at hand.

Rather than describe the logic in text, it is easiest to list it as code at this point. Below is the entire WriterReaderPhaser class as implemented in my current HdrHistogram repository, spelled out in Java code (most of which is detailed comments). The mechanism can obviously be ported to any language and envrionment that can provide support to atomic increment and atomic swap operations. It's the API and documentation (in the case the details in the JavaDoc comments) that is more important. A simple example of how this is used in practice can be found in HdrHistogram's various interval histogram recorders, like the original (and probably simplest example) in IntervalHistogramRecorder.java, or its more recent replacements in DoubleRecorder.java and Recorder.java which add some unrelated and more complicated logic that deals with safely avoiding some copy costs on getIntervalHistogram() variants.

And yes, it is now all in the public domain.

Enjoy.

[1] For an apparent etymology of the term "Yak Shaving", read the example story attributed here.

The discussion in question actually started from a question about what a good choice of hardware for running a low latency application may looks like these days, but then evolved into other subjects (as many of the best discussions on the group do), one of which was allocation rates.

Several smart and experienced people on the group chimed in and shared their hard earned wisdom, a lot of which came down to recommendations like "keep your allocation rates low", and "reducing allocation rates is one of the best tools to improve application behavior/performance". Specific numbers were cited (e.g. "My current threshold ... is ~300-400MB/sec").

This made that big "Java applications work hard to use less than 5% of today's toy server capabilities" chip I carry on my shoulder itch. I decided to scratch the itch by pointing out that one thing (and one thing only) is making people work hard to keep their apps within those silly limits: It's all about GC Pauses.

To support my claim, I went on a trivial math spree to show that even today's "toy" commodity servers can easily accommodate a rate of allocation 50x higher than the levels people try and contain their applications within, and that the only bad thing about a higher allocation rate is higher pause artifacts. In the poor systems that have those pause artifacts, of course....

The rest of the below is the actual posting:

...

These "keep allocation rates down to 640KB/sec" (oh, right, you said 300MB/sec) guidelines are purely driven by GC pausing behavior. Nothing else.

Kirk (and others) are absolutely right to look for such limits when pausing GCs are used. But the *only* thing that makes allocation rates a challenge in todays Java/.NET (and other GC based) systems is GC pauses. All else (various resources spent or wasted) falls away with simple mechanical sympathy math. Moore's law is alive and well (for now). And hardware-related sustainable allocation rate follows it nicely. 20+GB/sec is a very practical level on current systems when pauses are not an issue. And yes, that's 50x the level at which people seem to "tune" for by crippling their code or their engineers...

1. From a speed and system resources spent perspective, sustainable allocation rate roughly follows Moore's law for the past 5 Xeon CPU generations.

1.1 From a CPU speed perspective:

The rate of sustainable allocation of a single core (at a given frequency) is growing very slowly over time (not @ Moore's law rates, but still creeping up with better speed at similar frequency, e.g. Haswell vs. Nehalem).

The number of cores per socket is growing nicely, and with it the overall overall CPU power per socket (@ roughly Moore's law). (e.g. from 4 cores per socket in late 2009 to 18 cores per socket in late 2014).

The overall CPU power available to sustain allocation rate per socket (and per 2 socket system, for example) is therefore growing at roughly Moore's law rates.

1.2 From a cache perspective:

L1 and L2 cache size per core have been fixed for the past 6 years in the Xeon world.

The L3 cache size per core is growing fairly slowly (not at Moore's law rates), but the L3 cache per socket has been growing slightly faster than number of cores per socket. (e.g. from 8MB/4_core_socket in 2009 to 45MB/18_core_socket in late 2014).

The cache size per socket has been growing steadily at Moore's law rates.

With the cache space per core growing slightly over time, the cache available for allocation work per core remains fixed or better.

1.3 From a memory bandwidth point of view:

The memory bandwidth per socket has been steadily growing, but at a rate slower than Moore's law. E.g. A late 2014 E5-2690 V3 has a max bandwidth of 68GB/sec. per socket. A late 2009 E5590 had 32GB/sec of max memory bandwidth per socket. That's a 2x increase over a period of time during which CPU capacity grew by more than 4x.

However, the memory bandwidth available (assume sustainable memory bandwidth is 1/3 or 1/2 of max), is still WAY up there, at 1.5GB-3GB/sec/core (that's out of a max of about 4-8GB/sec per core, depending on cores/socket chosen).

So while there is a looming bandwidth cap that may hit us in the future (bandwidth growing slower than CPU power), It's not until we reach allocation levels of ~1GB/sec/core that we'll start challenging memory bandwidth in current commodity server architectures.

From a memory bandwidth point of view, this translates to >20GB/sec of comfortably sustainable allocation rate on current commodity systems..

1.4 From a GC *work* perspective:

From a GC perspective, work per allocation unit is a constant that the user controls (with ratio or empty to live memory).

On Copying or Mark/Compact collectors, the work spent to collect a heap is linear to the live set size (NOT the heap size).

The frequency at which a collector has to do this work roughly follows:

allocation_rate / (heap_size - live_set_size)

The overall work per time unit is therefore follows allocation rate (for a given live_set_size and heap_size).

And the overall work per allocation unit is therefore a constant (for a given live_set_size and heap_size)

The constant is under the user's control. E.g. user can arbitrarily grow heap size to decrease work per unit, and arbitrarily shrink memory to go the other way (e.g. if they want to spend CPU power to save memory).

This math holds for all current newgen collectors, which tend to dominate the amount of work spent in GC (so not just in Zing, where it holds for both newgen and olden).

But using this math does require a willingness to grow the heap size with Moore's law, which people have refused to do for over a decade. [driven by the unwillingness to deal with the pausing effects that would grow with it]

[BTW, we find it to be common practice, on current applications and on current systems, to deal with 1-5GB/sec of allocation rate, and to confortably do so while spend no more than 2-5% of overall system CPU cycles on GC work. This level seems to be the point where most people stop caring enough to spend more memory on reducing CPU consumption.]

2. From a GC pause perspective:

This is the big bugaboo. The one that keeps people from applying all the nice math above. The one that keeps Java heaps and allocation rates today at the same levels they were 10 years ago. The one that seems to keep people doing "creative things" in order to keep leveraging Moore's law and having programs that are aware of more than 640MB of state.

GC pauses don't have to grow with Moore's law. They don't even have to exist. But as long as they do, and as long as their magnitude grows with the attempt to linearly grow state and allocation rates. Pauses will continue to dominate people's tuning and coding decisions and motivations. [and by magnitude, we're not talking about averages. We're talking about the worst thing people will accept during a day.]

GC pauses seem to be limiting both allocation rates and live set sizes.

The live set size part is semi-obvious: If your [eventual, inevitable, large] GC pause grows with the size of your live set or heap size, you'll cap your heap size at whatever size causes the largest pause you are willing to bear. Period.

The allocation rate part requires some more math, and this differs for different collector parts:

2.2 For the newgen parts of collector:

By definition, a higher allocation rate requires a linearly larger newgen sizes to maintain the same "objects die in newgen" properties. [e.g. if you put 4x as many cores to work doing the same transactions, with the same object lifetime profiles, you need 4x as much newgen to avoid promoting more things with larger pauses].

While "typical" newgen pauses may stay just as small, a larger newgen linearly grows the worst-case amount of stuff that a newgen *might* promote in a single GC pauses, and with it grows the actual newgen pause experienced when promotion spikes occur.

Unfortunately, real applications have those spikes every time you read in a bunch of long-lasting data in one shot (like updating a cache or a directory, or reading in a new index, or replicating state on a failover),

Oldgen collectors that pause for *everything* (like ParallelGC) actually don't get worse with allocation rate. They are just so terrible to begin with (pausing for ~1 second per live GB) that outside of batch processing, nobody would consider using them for live sets larger than a couple of GB (unless they find regularly pausing for more than a couple of seconds acceptable).

Most Oldgen collectors that *try* to not pause "most" of the time (like CMS) are highly susceptible to allocation rate and mutation rate (and mutation rate tends to track allocation rate linearly in most apps). E.g. the mostly-concurrent-marking algorithms used in CMS and G1 must revisit (CMS) or otherwise process (G1's SATB) all references mutated in the heap before it finishes. The rate of mutation increases the GC cycle time, while at the same time the rate of allocation reduces the time the GC has in order to complete it's work. At a high enough allocation rate + mutation rate level, the collector can't finish it's work fast enough and a promotion failure or a concurrent mode failure occurs. And when that occurs, you get that terrible pause you were trying to avoid.

As a result, even for apps that don't try to maintain "low latency" and only go for "keep the humans happy" levels, most current mostly-concurrently collectors only remain mostly-concurrent within a limited allocation rate. Which is why I suspect these 640KB/sec (oh, right, 300MB/sec) guidelines exist.

Bottom line:

When pauses are not there to worry about, sustaining many GB/sec of allocation is a walk in the park on today's cheap, commodity servers. It's pauses, and only pauses, that make people work so hard to fit their applications in a budget that is 50x smaller than what the hardware can accommodate. People that do this do it for good reason. But it's too bad they have to shoulder the responsibility for badly behaving garbage collectors. When they can choose (and there is a choice) to use collectors that don't pause, the pressure to keep allocation rates down changes, moving the "this is too much" lines up by more than an order of magnitude.

Saturday, March 29, 2014

HashMaps are great. And fast. Well, fast most of the time. If you keep growing them, you'll get elevator music every once in a while. Then they go fast again. For a while.

Wouldn't it be nice if HashMaps didn't stall your code even when they were resizing?

Some background: As those of you who have read my various past rants may have noticed, I spend a lot of my time thinking about the behavior of {latency, response-time, reaction-time}. In addition to trying to better understand or teach about the behavior (with monitoring and measurement tools like HdrHistogram, LatencyUtils, and jHiccup), I actually work on things that try to improve bad behavior. For some definitions of "improve" and "bad". Eliminating pausing behavior in GC was the lowest hanging fruit, but more recent work has focused on eliminating pauses due to other things that stand out once those pesky GC blips are gone. Things like at-market-open deoptimzations. Things like lock deflation, lock de-biasing, class unloading, weak reference processing, and all sorts of TTSP (time to safepoint) issues. I've also learned a lot about how to bring down Linux's contribution to latency spikes.

But the JVM and the OS are not the only things that cause latency spikes. Sometimes it's your code, and the code is doing something "spiky". In my day job, I keep running into in actual, real-world low latency system code that is typically super-fast, but occasionally spikes in actual work latency due to some rare but huge piece of work that needs to be done. This is most often associated with some state accumulation. Once we eliminate GC pauses (which tend to dominate latency spikes, but also tend to simply disappear when Zing is applied), we get to see the things that were hiding in the GC noise. We often run into "nice" patterns of growing latency spikes at growing intervals, with a near-perfect doubling in both magnitude and interval between the spikes. This happens so often that we've studied the common causes, and (by far) the most common culprits seem to be HashMaps. The kind used to accumulate something during the day, and which resize in powers-of-two steps as a result.

I've had "build a Pauseless HashMap" on my weekend project list for over a year now, but finally got around to actually building it (at the request of a friend on the mechanical sympathy group). There are probably at least 17 ways to skin a HashMap so it won't stall puts and gets when it resizes, but this is my simple take on it:

This is also a classic example of how GC makes this sort of concurrent programming thing both fast and simple. This is a classic case of an asymmetric speed need between two concurrent actors that share mutable state. I worked hard to make the fast path get() and put() cheap, and managed (I think) to not even use volatiles in the fast path. In doing this, I shifted all the cost I could think of to the background work, where latency doesn't matter nearly as much. This sort of trick would be much harder (or slower) to do if GC wasn't there to safely pick up the junk behind us, as it would (invariably, I think) add a need for additional synchronizing work in the fast path.