Saturday, May 5, 2007

Nonblocking-Reader/Blocking-Writer in Java

So it's Saturday. I should be off chasing women or motorcycle racing or something, but instead I'm at home obsessing over non-blocking threadsafe algorithms.

(yeah. again.)

In particular, I had an email exchange with Doug Lea (my ex-Ph.D. advisor) regarding java.util.concurrent.locks' ReentrantReadWriteLock (which Doug also happens to be the author of). The beauty of this class is that it provides a solution to the classic Reader/Writer problem. That is, it allows many threads interested in reading something to operate simultaneously (as reading does not conflict with reading). However, it also gives you a write lock that, when held, blocks all readers and other writers. This follows the assumption that writing could corrupt simultaneous readers or writers.

This wasn't very easy in Java prior to JDK 1.5.

Our discussion was over a microbenchmark I wrote to prove to myself how great it was that I could have a flock of readers all banging on a map (or something) simultaneously. I wrote (or borrowed) a standard fully synchronized map wrapper:
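A minimal sketch of such a wrapper, assuming a plain HashMap underneath (the class name is mine):

```java
import java.util.HashMap;
import java.util.Map;

// A fully synchronized map wrapper: every operation takes the same
// monitor, so readers block other readers as well as writers.
public class SynchronizedMap<K, V> {
    private final Map<K, V> map = new HashMap<K, V>();

    public synchronized V get(K key) {
        return map.get(key);
    }

    public synchronized V put(K key, V value) {
        return map.put(key, value);
    }

    public synchronized int size() {
        return map.size();
    }
}
```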

In the read-write-lock version, readers will not wait for other readers, but they will wait for writers (and writers wait for everyone).
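A sketch of that read-write-locked wrapper, using ReentrantReadWriteLock (again, the names here are mine):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Read-write-locked map wrapper: any number of readers may hold the
// read lock at once; a writer takes the exclusive write lock.
public class ReadWriteMap<K, V> {
    private final Map<K, V> map = new HashMap<K, V>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public V get(K key) {
        lock.readLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public V put(K key, V value) {
        lock.writeLock().lock();
        try {
            return map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```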

This version does do better, but maybe not as much better as I would have thought. Pounding the code with 50 simultaneous readers runs about 6.5 times faster (i.e., I got 6.5 times as many operations in a one-second interval) than the fully synchronized version.

Interestingly, with 5 readers and 20 writers, the synched version does about 10000 writes and 1200 reads. The reader/writer version does 300 writes and 10000 reads (roughly the opposite). In other words, the writers (probably expectedly) really get hurt in the reader/writer configuration. (If you're complaining that microbenchmarks suck, you can stop. Highly contended parallel access to a map is pretty real - and if nothing else, I was really just trying to make threads fight, and this seemed like a reasonable way to do it.)

Doug quickly pointed out that the reader/writer lock has a fair bit of overhead as compared to a simple synchronized method and that the map operations are so fast, they simply don't hold the lock long enough to cause much contention.

I probably should have quit there and played some more Desktop Tower Defense, but I tried something else. It seems that (in true double-checked-locking form) if a reader can know that no writers are about, then there's no need for it to do any locking at all. So I played around awhile with a non-blocking threadsafe reader. This didn't go so well at first - I found counterexamples to every version of my code right as I typed the final semicolon.
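A sketch of the pattern described below - the class and names are mine, and the original may differ. A single AtomicInteger serves as a version stamp that is odd while a write is in flight, so readers can detect overlapping writers and retry:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Nonblocking-reader/blocking-writer map. The AtomicInteger "version"
// is odd while a write is in progress and even when the map is quiet.
// Atomic reads/writes are volatile reads/writes, so they provide the
// happens-before edges the readers rely on.
public class NonblockingReadMap<K, V> {
    private final Map<K, V> map = new HashMap<K, V>();
    private final AtomicInteger version = new AtomicInteger(0);

    // Writers serialize on the monitor; bumping the version to an odd
    // value tells readers a write is in flight.
    public synchronized V put(K key, V value) {
        version.incrementAndGet();      // now odd: write active
        try {
            return map.put(key, value);
        } finally {
            version.incrementAndGet();  // now even: write complete
        }
    }

    // Readers never block: read, then verify no writer ran meanwhile.
    // Caveat (raised in the comments below): reading a plain HashMap
    // while a writer mutates it is still theoretically unsafe, since
    // the reader may traverse inconsistent internal state.
    public V get(K key) {
        while (true) {
            int v = version.get();
            if ((v & 1) == 1) {
                continue;               // a write is active; retry
            }
            V result = map.get(key);
            if (version.get() == v) {
                return result;          // no writer intervened
            }
            // a write overlapped our read; discard and try again
        }
    }
}
```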

The write lock used in the put method is pretty standard. Only one writer can be active at any time. The reader code (i.e., the get method) is the tricky part. Effectively, a reader won't start while it knows that a writer is active. And, once it's done, it checks again to see if a writer is now active or was while it was running. If so, it throws away its result and tries again. In theory, a reader might reread the same value 1000 (or infinitely many) times before it's satisfied that the data is good. In practice, it only happens a very few times.

In threadsafe (blocking) code, we're used to coding such that all important code can only happen when it's safe. Nonblocking code works more like a "teenager model" - it says, "Let's try it, and then when we're done, we'll see if what we did was OK. If not, we'll try it again until it is OK."

If the very busy-looking do or while loops seem odd to you, they're quite normal in non-blocking threadsafe code. See Brian Goetz's great introduction to nonblocking algorithms. In the right circumstances (as Brian is happy to point out), nonblocking code can smoke blocking code even if it ends up busy-looping awhile - and, of course, still be 100% correct. That last part is often the hard bit.

I love this kind of all-hell's-breaking-loose code. I created some tests to try to expose a race condition, and so far it looks good. I ran this idea by Jeremy Manson ("Dr. Memory Model"), and he confirmed for me that the use of atomic ints above provides the memory consistency I need (keep in mind, Jeremy is not a big fan of lock-free algorithms, it seems). My empirical tests have verified it so far - I can't make this code race (though I can easily make races by changing just tiny little bits). I'm going to play with some benchmarks and see what I can drum up.

(Edit: Honestly, if you're looking for a non-blocking hashtable, the above is thought-provoking but not terribly clever. You'd be wise to check out Cliff Click's Nonblocking Hashtable implementation. It's more comprehensive, proven, and very likely faster.)

I recall something about writes on unprotected variables potentially being partial. That is, if a non-volatile variable is concurrently updated, a reader thread may see neither the initial value nor the updated value but instead something intermediate.

Although I can't imagine a hardware implementation that would cause it, it seems plausible that a reference may be partially written and thus invalid. If that occurs, the JVM is probably within its rights to do all sorts of unhappy things (like throw an unexpected Exception or Error).

In order for a try-and-see approach to work, the JVM must, at the very least, guarantee the consistency of all references.

markus: I've seen Cliff's hashtable - in fact, I attended his talk at Google. I've followed Cliff's work for years; it was a treat to finally meet him, and his hashtable is excellent - it does things far smarter than my code above.

alex: I have done only informal performance tests and plan on doing further testing. As expected, the results are fantastic (which is why I want to be very careful about confirming them). This isn't really that amazing, however - nonblocking algorithms do tend to destroy blocking algorithms. But it isn't so clear-cut either - CPU utilization spikes too.

Logan: I've been very worried about the Java memory model's consistency issues (note this code wouldn't even compile in JDK 1.4, but even if it did, it probably then wouldn't work). That's why I asked Jeremy Manson, who is truly an expert in this area. My understanding from my discussion with him is that the operations on the AtomicInteger provide the memory barriers I require.

martti: Damn, you caught me. My thought was that the assignment to value in the get is loop-invariant. That is, a compiler (or runtime) might realize that it's silly to keep doing that calculation within the loop body when it doesn't look like it will change, and, as an optimization, move the statement out of the loop (breaking the whole point of the AtomicInteger as a guard).

To counter that, I added a spurious check in the while loop to see if value was then equal to some public static variable (which it theoretically never could be - which is OK, but the compiler (hopefully) wouldn't know that). Note that some decompilers do something similar.

Anyway, I got the idea from Jeremy that because of the AtomicInteger usage, the compiler won't do the loop-invariant code motion I feared. My empirical tests verified this.
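The key point is that AtomicInteger.get() is a volatile read, and the memory model forbids caching a volatile read across loop iterations. A small illustration of the distinction (names are mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class HoistDemo {
    // A plain field: the JIT may legally treat reads of it as
    // loop-invariant and hoist them out of a spin loop, so the loop
    // might never observe another thread's update.
    public int plainFlag = 0;

    // AtomicInteger.get() is a volatile read; the memory model
    // forbids caching it across iterations, so the loop below must
    // perform a fresh read each time around.
    public final AtomicInteger atomicFlag = new AtomicInteger(0);

    public void spinUntilSet() {
        while (atomicFlag.get() == 0) {
            // busy-wait; each iteration re-reads atomicFlag
        }
    }
}
```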

I may be wrong or missing something, but if there is more than one field write as a consequence of calling map.put(), or if a field write is not atomic, then you are not guaranteed a consistent view of the map during the read operation (the map.get()). The JLS doesn't guarantee an atomic application or view of fields that are not read under the same lock; in other words, your reader threads could see the map in an inconsistent state.

While I would bet that HashMap is probably not written in a way that suffers from that problem, it seems you have a problem in theory, anyway.

I think this is probably what Logan was referring to.

To reiterate - while it seems that you have the proper barriers to prevent threads from accessing critical sections of code at the incorrect times, you haven't addressed the memory model and the possibility of a partial update having occurred.

It seems you have thread barriers, but not memory barriers. You could end up reading partial flushes because you are not using any kind of construct that would ensure a coherent read of the map when doing map.get() (according to the JLS, synchronized doesn't just prevent more than one thread from entering critical sections, it also ensures that all mutations made under the same lock are visible by other threads acquiring that lock.)

HashMap itself may not have this problem (I don't know what its implementation is internally), but if you tried to do what you are doing at a general level, multiple field writes made under the write lock could be read in incoherent states by the reader threads.

Re: Partial writes on non-volatile variables: you are guaranteed to never see a broken/partial ref (breaks type-safety, it might be mistaken for something real). In practice only misaligned longs & doubles on X86 get split this way.
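For reference, the rule being described is JLS §17.7 ("Non-atomic Treatment of double and long"): reference reads and writes are always atomic, but a non-volatile long or double may legally be written as two separate 32-bit halves. A small illustration (the class name is mine):

```java
// Illustrates JLS §17.7: a non-volatile long may legally be read or
// written as two 32-bit halves (typically only on 32-bit JVMs), so a
// concurrent reader could observe a "torn" value that is neither the
// old nor the new one. References and 32-bit primitives never tear.
public class TearingDemo {
    // May tear under concurrent access: no atomicity guarantee.
    public long unsafeValue;

    // volatile restores atomicity for long/double, and also adds the
    // usual visibility (happens-before) guarantees.
    public volatile long safeValue;
}
```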

I am not referring to a partial ref or write. I mean updating more than one field at a time. You are not guaranteed to see a consistent view of more than one write unless you have a memory barrier. Synchronized provides the memory barrier, but I don't see how this code does.

After reviewing http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/atomic/package-summary.html and http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html, I've convinced myself that there are in fact the appropriate memory barriers provided by AtomicInteger.

:) i have done a ReadWriteMap impl as well. Several. One is a wrapper, like synchronized map, and one is an extension with the RWLock built in. Some other cool things you could consider are to enable the collections (don't just stop with Map) to take an optional RWLock argument in the constructor so that you can have a series of data structures inherently locked with the same lock (can be helpful if you need to lock and update multiple structures at once).
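That shared-lock idea might be sketched like this (the names are hypothetical; the commenter's actual implementation isn't shown):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A map that takes its ReadWriteLock in the constructor, so several
// structures can be guarded by one lock and updated together.
public class SharedLockMap<K, V> {
    private final Map<K, V> map = new HashMap<K, V>();
    private final ReadWriteLock lock;

    public SharedLockMap(ReadWriteLock lock) {
        this.lock = lock;
    }

    public V get(K key) {
        lock.readLock().lock();
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public V put(K key, V value) {
        lock.writeLock().lock();
        try {
            return map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With two such maps built on the same ReentrantReadWriteLock, a caller can take the shared write lock once and update both atomically (the write lock is reentrant, so each map's own lock acquisition simply nests).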

i'm surprised you got the results you did for the write lock, though. i get much more efficient writes with a concurrent lock than with a synchronized block in my microbenchmarks. :) at this point, i don't even touch synchronized collections anymore unless absolutely necessary.

the other fun item of note, if you are interested, is the memory consumption. if you have lots of maps, synchronized is, of course, best, RWMap next, and ConcurrentHashMap last. now, one way around that, to have concurrency (if you want that sort of thing) is to do an adaptive concurrent map which resizes its segments if it notices that it's being utilized by many threads (ideally, concurrently). i've also done one of those which i've found a few choice places to use. i'll have to check out cliff's map, though, i'm curious to see what its memory profile is like. (as you may guess, the system i'm working on has many many many maps it maintains, so the size of individual maps is important...hence my using prime sized maps instead of ^2, among other things).

now, question...you might actually know this b/c you seem to know the right people (how come I don't know the right people? :/ i really need to get back to school at the right school.... ) i'm really interested in a non-blocking LRU map...know of one?