Saturday, 13 August 2011

False Sharing && Java 7

In my previous post on False Sharing I suggested it can be avoided by padding the cache line with unused long fields. It seems Java 7 got clever and eliminated or re-ordered the unused fields, thus re-introducing false sharing. I've experimented with a number of techniques on different platforms and found the following code to be the most reliable.
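The code block has not survived here; a reconstruction of the kind of test being described might look like the following sketch (the thread count, iteration count, and most names are assumptions, not necessarily the original values). Each thread hammers its own counter, and the counters sit next to each other in an array, so without padding they end up sharing 64-byte cache lines:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a false-sharing test. Each thread writes only its own
// counter, but the counters are adjacent in an array.
public final class FalseSharing implements Runnable {
    public static final int NUM_THREADS = 4;
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    // The hot value lives in the AtomicLong superclass; the six long
    // fields in the subclass pad out the rest of a 64-byte cache line
    // so that two counters cannot share one line.
    public static class PaddedAtomicLong extends AtomicLong {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;
    }

    private static final PaddedAtomicLong[] longs =
        new PaddedAtomicLong[NUM_THREADS];
    static {
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new PaddedAtomicLong();
        }
    }

    private final int arrayIndex;

    public FalseSharing(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    @Override
    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].set(i); // write to this thread's counter only
        }
    }

    public static void main(final String[] args) throws InterruptedException {
        final Thread[] threads = new Thread[NUM_THREADS];
        final long start = System.nanoTime();
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new FalseSharing(i));
            threads[i].start();
        }
        for (final Thread t : threads) {
            t.join();
        }
        System.out.println("duration = " + (System.nanoTime() - start));
    }
}
```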

With this code I get similar performance results to those stated in the previous False Sharing article. The padding in PaddedAtomicLong above can be commented out to see the false sharing effect.

I think we should all lobby the powers that be inside Oracle to have intrinsics added to the language so we can have cache line aligned and padded atomic classes. This and some other low-level changes would help make Java a real concurrent programming language. We keep hearing them say multi-core is coming. I say it is here and Java needs to catch up.

You are totally right in that we have no guarantee how Java objects will be placed on the heap. This is the source of the problem with false sharing. If you have counters or pointers that are used across threads, it is vitally important to ensure they are in different cache lines or the application will not scale up with cores. The point of the padding is to protect the counters/pointers so they never get placed in the same cache line.

If I knew up front that would be a possibility. Often when designing large systems we don't know such things, or we have created a library used by another application.

It is hard to create a small enough example with sufficient context. The above example illustrates how bad things can be when it happens. If you pad such structures then you need not worry about where they get located in memory. Only the highly contended concurrent structures need such padding.

After a good discussion with Jeff Hain we have discovered the JVM cannot be trusted when it comes to re-ordering code. What happens before the volatile seems to be safe but what happens after is not reliable. We have a better solution by replacing the VolatileLong with the following and using the normal methods of AtomicLong.
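The replacement class is not shown here; a sketch of the idea, with names assumed from the earlier article, contrasts the unreliable layout with the fix. Padding declared after a plain volatile field is what proved undependable; extending AtomicLong puts the hot value in the superclass so the subclass padding always follows it:

```java
import java.util.concurrent.atomic.AtomicLong;

// The padding-after-a-volatile approach that proved unreliable: the
// fields declared after the volatile may be re-ordered or stripped.
final class VolatileLong {
    public volatile long value = 0L;
    public long p1, p2, p3, p4, p5, p6; // not dependable
}

// The replacement: inherit the value from AtomicLong so it sits in the
// superclass, with the padding fields in the subclass after it.
class PaddedAtomicLong extends AtomicLong {
    public volatile long p1, p2, p3, p4, p5, p6 = 7L;
}
```

Callers then just use the normal AtomicLong methods (`get()`, `set()`, `incrementAndGet()`, and so on) on the padded subclass.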

Oh how I wish the Java powers that be would get serious about this stuff and add some intrinsics that are cache line aligned and padded. This has been the biggest source of performance bugs in the Disruptor.

Agree that it would be great if there was a way to declare a field such that it is guaranteed to occupy its own cache line, and let the JVM deal with placing the right padding in the object layout. What you do here with artificial padding is the next best thing, but as you know, what actually happens to object layout is JVM-implementation specific.

Being paranoid, I'd add some more stuff to your padding technique to make it harder to get rid of the padding fields through optimization. A hypothetically smart enough JVM would still be able to prove away the need to actually implement the padding fields p1...p7 in your code above: the PaddedAtomicLong class is only visible from the final (and therefore non-extendable) FalseSharing class. The compiler can therefore "know" that it is looking at all the code that will ever see the padding fields, and can prove that no behavior actually depends on p1...p7. The JVM could then implement the above without taking any space for the padding fields...

You can probably defeat this by making the PaddedAtomicLong class visible outside of FalseSharing, either directly or by adding an accessor method on FalseSharing that would depend on all the p1...p7 values and could theoretically be called by someone outside.
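One way to apply that suggestion (a sketch; the method name here is made up): expose a public accessor whose result depends on every padding field. Since a caller in another class could theoretically invoke it, the JVM cannot prove the fields are dead and must keep them in the object layout:

```java
import java.util.concurrent.atomic.AtomicLong;

public final class FalseSharing {
    public static class PaddedAtomicLong extends AtomicLong {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;
    }

    private static final PaddedAtomicLong[] longs = new PaddedAtomicLong[4];
    static {
        for (int i = 0; i < longs.length; i++) {
            longs[i] = new PaddedAtomicLong();
        }
    }

    // Publicly visible and dependent on all the padding values, so an
    // optimizer cannot prove the fields unobservable and strip them.
    public static long sumPaddingToPreventOptimisation(final int index) {
        final PaddedAtomicLong v = longs[index];
        return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
    }
}
```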

I've often used an array of 15 longs for example and put the value at index 7 to achieve the same result. However this does not work if the value requires volatile semantics. For that you need to use AtomicLongArray or similar. Based on my measurements the indirection and bounds checking costs for this can be significant in an algorithm.
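A sketch of those two variants (the class names here are made up for illustration): a plain long[15] with the value at index 7 for non-volatile counters, and AtomicLongArray at the same index when volatile semantics are required:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Non-volatile variant: one long[15] with the hot value at index 7,
// leaving 56 bytes of padding on either side of it within the array.
final class ArrayPaddedCounter {
    private final long[] slots = new long[15];

    public void add(final long delta) { slots[7] += delta; }
    public long get() { return slots[7]; }
}

// Volatile variant: AtomicLongArray gives the same layout, at the cost
// of the indirection and bounds checks mentioned above.
final class AtomicArrayPaddedCounter {
    private final AtomicLongArray slots = new AtomicLongArray(15);

    public long increment() { return slots.incrementAndGet(7); }
    public long get() { return slots.get(7); }
}
```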

I know a bunch of folk are proposing an @Contended annotation for Java to mark fields so they can be cache line aligned and padded. I hope it comes soon. :-)

Using index 7 (the 8th element) ensures there will be 56 bytes of padding on either side of the value. Those 56 bytes plus the 8 for the value ensure nothing else can share the same 64-byte cache line, regardless of the array's starting location in memory.

FWIW - I have a CentOS 5.8 VM running on Hyper-V with Oracle Java 1.6.0_22, and this version of the false sharing test with the padded atomic is about 4 times faster than VolatileLong. A typical run for me with the padded atomic long:

How do we ensure this will fit exactly inside the cache line (which is 64 bytes, I presume)? Does this mean that the PaddedAtomicLong object itself will take up 16 bytes (8 bytes for the object, but what about the remaining 8 bytes)?

public final static class VolatileLong {   // header = 16 bytes
    public volatile long value = 0L;       // 8 bytes
    public long p1, p2, p3, p4, p5, p6;    // 6 * 8 = 48 bytes
}

That seems to total 16 + 8 + 48 = 72 bytes, which is larger than a 64-byte cache line. As far as I know the header will have 2 words, since this is a 64-bit system. Will an object of type VolatileLong be stored across 2 cache lines? If that is the case, when two different objects of type VolatileLong are stored one after another in memory, will the second object be stored from byte 9 of the second cache line (since the first object ends at byte 8 of the second cache line), or is another 7 bytes added as padding so that it starts from byte 16?