Sunday, March 30, 2008

How fast is Java Volatile? or Atomic? or Synchronized?

This is a question I've wondered awhile. And I've gone ahead and made a stab at finding out. One question that I had was I had always heard that AMD chips have an on-cpu memory controller and Intel's don't (at least not until they're upcoming Nehalem line). Now I'm first to admit I'm talking above my knowledge (please correct profusely).

But the idea was that some operations (notably I've heard volatile reads which I didnt test here) are very similar regardless of volatility.

I spent a few hours making up a benchmark. I'm publishing the benchmark below for one important reason - to get it right. This is big disclaimer: Don't trust these results - at least with any precision. I do conclude that contended synchronization is expensive - thats a pretty expected result so hopefully we're somewhere on the right track.

However, Java benchmarks are very hard to right. So hard in fact that they do a hardness-overflow and end up looking easy - which is precisely what makes them so hard to get right. Hence - given that only I have seen this benchmark, its pretty likely to be broken some (and hence why its published here for public review to help fix).

With all that...

What I've tested here are storage operations. Specifically:

Storage to a local variable (mostly used as a control)Storage to an instance variableStorage to a static variableStorage to a static volatile variableStorage to a static AtomicLong variableStorage to a static variable protected by syncrhonization

I tested this on 3 different CPUs, all with JDK1.6 and variously on Linux and Windows. One very important point is do *not* compare the absolute numbers in the graphs directly. In no case are comparing the same CPU across 2 operating systems. The CPUs are wildly different in capability.

So we can explain more with a graph:

This is the run on a dual-core dual-Opteron machine. Note the blue-bar is local variable storage. As you'd expect it simply destroys everything else. This is very probably because the smart JVM went in and realized it could optimize the assignment out of the loop. Note the blue-bar just gets happier and happier as we add cores. The threads are completely unaffected by each other and we get actual linear increases in power. Even at 5 threads (note we have 4 cores) we get some speedup.

The orange and yellow bars are instance variable and static variable stores respectively. Some level of indirection or inability to optimize jumps in and makes things a bit more sane. Surprising we get little or no improvement as we add threads.

The three thread-visible methods of store are so slow they barely show up. This brings up an important point, let's assume you wrote an application and made it correctly thread safe. Using synchronization or volatiles or whatever. You *cannot* optimize away synchronization in the name of performance. It simply can't be done. (conversely, if you could then it wasn't correctly thread-safe to start). You can however, at times, replace one type of synchronization primitive with another.

That being said, lets take a closer look at the same graph with just the barrier memory store operations.

Volatile and Atomic are neck-and-neck. Synchronization is still a big loser, especially after we contend giving it two threads.

One other anecdote - while running the non-memory-barrier benchmarks, my CPU meter showed 100% user space usage. With these benchmarks however, it went to 30-40% kernel cpu usage.

Ok, jumping to the Core2Quad Extreme CPU. Definitely a faster processor but with a different memory architecture.

Local store again goes flying. Oddly the instance store doesn't do much but the static store increases nicely with the cores. And, once again, we can't even fricken see the barriered writes.. so here they are.

Look how great the JVM does knowing that just one thread is out there. It can really optimize the code to elide or at least marginalize the impact. Surprisingly, synchronization still gets nailed across the board.

Note that according to what I have so far, a synchronized static store is something 50 times slower than a simple static store. And 10 or so times slower than a volatile static store.

One more.. and its a fun one. Windows XP running on a Dothan. (Funny, I realized I hadn't directly known that Dothan was single core, but I just assumed it after I saw the graphs.)

Crazy. Add a few threads and performance doesn't even budge. Of course, with only one core a system can completely avoid all the complex architecture keeping cores in sync. Why its local writes are worse as compared to instance beats me.

Also, although I said don't compare graphs, this little single core Dothan beats the Core2Quad in all barrier-ed writes after 2 threads. Note that on the local writes, the Core2Quad is something like 50 time faster. But even on the simple static volatile store - at 2 threads, the Dothan is now twice as fast.

So. Pardon my informalness here - I'm actually quite expecting feedback to break this code in grand ways and force me to redo all the runs and rewrite all the text (which I'll be happy to do if we get a clean bench out of this). I have no intention of making very specific claims once this benchmark is firmed up - but I would like to have a "sense" of the costs of these different operations.

4 comments:

If you are not already a member or have not already done so, you may want to post your benchmark to the Java Concurrency Interest mailing list. This kind of thing is just up their alley. You can join at http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest.

Local store is kept directly in the processor registers. It scales with the number of processors being used, as expected. The additional benefit from the fifth thread may be noise, or it may be that there are enough registers available for more than one of your threads, so the processor can do something fancy about keeping instructions going from two threads in parallel at times.

Instance and static stores likely write to cache which isn't promptly flushed to main memory. However, the hardware architecture of these processors may have coherent caches, which means there's hardware that keeps the caches for all processors in synch; the cache sniffing is likely what limits the speed for these writes, which thus doesn't scale with the number of processors in use. The only puzzling thing is the scaling of static store versus instance store in one test.

The volatile and atomic store use memory barriers, which likely translate to low level cache flush instructions. They're limited by the speed of main memory, which, being shared between processors, doesn't scale with number of processors. The interesting thing is that the JVM doesn't know about the coherent cache architecture, which it could take advantage of to avoid the memory flushes (though that might break JNI or something). The synchronized store additionally involves grabbing a lock, slowing it down further.

a)There is a lot that can be going on behind the scenes in the jvm here to start with. Such as similar to your other benchmark where the jvm will know that is running on a single-core cpu and can skip on emiting lock-prefixes on some instructions. It can even turn memory barriers into no-ops etc. See:http://www.infoq.com/articles/memory_barriers_jvm_concurrency

And according to Dave Dice (JVM guy at Sun):"On MP IA32 and AMD64 systems we have to use locked atomic instructions (e.g., lock:cmpxchg) to acquire locks. On uniprocessors we can do without the lock: prefix. lock:ed instructions have considerable latency on some Intel processors (>500 cycles)."

So running on most x86 one-core machines (also when just allowing the process to use one core?) it will be a lot quicker to do synchronization.

b) Hyperthread cores are paired with a real core and usually provide faster context switching etc but that's all. No additional ALUs etc.

c) The cpu itself, such as: c1) An i7 shares the rather large and quick L3 cache among cores. Running one core only you are effectively running with ONE very big L3 cache instead of it being split up between cores.

c2) Running on a few cores or just one core, the i7 cpu overclocks the used cores.

c3) I7 runs synchronisations primitives a lot faster, LOCK COMPXCHG instructions in 60% of the time compared to the previous generation for example.