Wednesday, 17 October 2012

Compact Off-Heap Structures/Tuples In Java

In my last post I detailed the implications of the patterns your code uses to access main memory. Since then I've had a lot of questions about what can be done in Java to enable a more predictable memory layout. There are patterns that can be applied using array backed structures, which I will discuss in another post. This post will explore how to simulate a feature sorely missing in Java - arrays of structures similar to what C has to offer.

Structures are very useful, both on the stack and the heap. To my knowledge it is not possible to simulate this feature on the Java stack. Not being able to do this on the stack is such a shame because it greatly limits the performance of some parallel algorithms, however that is a rant for another day.

In Java, all user defined types have to exist on the heap. The Java heap is managed by the garbage collector in the general case, however there is more to the wider heap in a Java process. With the introduction of direct ByteBuffer, memory can be allocated that is not tracked by the garbage collector, so that it can be made available to native code for tasks like avoiding the copying of data to and from the kernel for IO. One reasonable approach to managing structures is therefore to fake them within such a ByteBuffer. This can allow compact data representations, but has performance and size limitations. For example, it is not possible to have a ByteBuffer greater than 2GB, and all access is bounds checked, which impacts performance. An alternative exists using Unsafe that is both faster and not size constrained like ByteBuffer.
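As a minimal sketch of the Unsafe route: the class has no public constructor, so the singleton is usually fetched reflectively via its `theUnsafe` field. Note that `sun.misc.Unsafe` is unsupported API and the memory returned is entirely your responsibility.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeDemo {
    public static final Unsafe UNSAFE = getUnsafe();

    // Unsafe has no public constructor; grab the singleton via reflection.
    private static Unsafe getUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new RuntimeException("Unsafe not available", e);
        }
    }

    public static void main(String[] args) {
        long address = UNSAFE.allocateMemory(1024); // off-heap, invisible to the GC
        UNSAFE.putLong(address, 42L);               // write a primitive at a raw address
        long value = UNSAFE.getLong(address);
        UNSAFE.freeMemory(address);                 // our job now, not the collector's
        System.out.println(value);                  // prints 42
    }
}
```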

The approach I'm about to detail is not traditional Java. If your problem space is dealing with big data, or extreme performance, then there are benefits to be had. If your data sets are small, and performance is not an issue, then run away now to avoid getting sucked into the dark arts of native memory management.

The benefits of the approach I'm about to detail are:

Significantly improved performance

More compact data representation

Ability to work with very large data sets while avoiding nasty GC pauses[1]

With all choices there are consequences. By taking the approach detailed below you take responsibility for some of the memory management yourself. Getting it wrong can lead to memory leaks, or worse, you can crash the JVM! Proceed with caution...

Suitable Example - Trade Data

A common challenge faced in finance applications is capturing and working with very large volumes of order and trade data. For the example I will create a large table of in-memory trade data that can have analysis queries run against it. This table will be built using 2 contrasting approaches. Firstly, I'll take the traditional Java approach of creating a large array referencing individual Trade objects. Secondly, I'll keep the usage code identical but replace the large array and Trade objects with an off-heap array of structures that can be manipulated via the Flyweight pattern.

If for the traditional Java approach I used some other data structure, such as a Map or Tree, then the memory footprint would be even greater and the performance lower.
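A minimal sketch of what such a flyweight over Unsafe-allocated memory can look like. The field set, offsets, and 20-byte record size here are illustrative assumptions, not the exact 42-byte layout of the original benchmark; the point is that one reused flyweight instance views any record by pointer arithmetic, with no per-record object allocation.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Flyweight over a contiguous off-heap block of fixed-size records.
// Illustrative layout: tradeId (8) + price (8) + quantity (4) = 20 bytes.
public class TradeFlyweight {
    private static final Unsafe UNSAFE = getUnsafe();

    private static final long RECORD_SIZE = 20;
    private static final long TRADE_ID_OFFSET = 0;
    private static final long PRICE_OFFSET = 8;
    private static final long QUANTITY_OFFSET = 16;

    private final long baseAddress;
    private long recordAddress;

    public TradeFlyweight(long recordCount) {
        baseAddress = UNSAFE.allocateMemory(recordCount * RECORD_SIZE);
        recordAddress = baseAddress;
    }

    // Re-point the flyweight at record i; moving it is just arithmetic.
    public TradeFlyweight select(long index) {
        recordAddress = baseAddress + (index * RECORD_SIZE);
        return this;
    }

    public long getTradeId()       { return UNSAFE.getLong(recordAddress + TRADE_ID_OFFSET); }
    public void setTradeId(long v) { UNSAFE.putLong(recordAddress + TRADE_ID_OFFSET, v); }
    public long getPrice()         { return UNSAFE.getLong(recordAddress + PRICE_OFFSET); }
    public void setPrice(long v)   { UNSAFE.putLong(recordAddress + PRICE_OFFSET, v); }
    public int getQuantity()       { return UNSAFE.getInt(recordAddress + QUANTITY_OFFSET); }
    public void setQuantity(int v) { UNSAFE.putInt(recordAddress + QUANTITY_OFFSET, v); }

    // Off-heap memory must be freed explicitly; the GC will not do it.
    public void destroy() { UNSAFE.freeMemory(baseAddress); }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```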

1. Significantly improved performance

The evidence here is pretty clear cut. Using the off-heap structures approach is more than an order of magnitude faster. At the most extreme, look at the 5th run on a Sandy Bridge processor, where we see a 43.2 times difference in duration to complete the task. It is also a nice illustration of how well Sandy Bridge does with predictable access patterns to data. Not only is the performance significantly better, it is also more consistent. As the heap becomes fragmented, and thus access patterns become more random, the performance degrades, as can be seen in the later runs with the standard Java approach.

2. More compact data representation

For our off-heap representation each record requires 42 bytes. To store 50 million of these, as in the example, we require 2,100,000,000 bytes. The memory required by the JVM heap is:

memory required = total memory - free memory - base JVM needs

2,883,248,712 = 3,817,799,680 - 810,551,856 - 123,999,112

This implies the JVM needs ~40% more memory to represent the same data. The reason for this overhead is the array of references to the Java objects plus the object headers. In a previous post I discussed object layout in Java.
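The heap-side figures above can be gathered with `Runtime`: used memory is total minus free, and the "base JVM needs" term is simply the same measurement taken before the data is allocated. A sketch of the technique (the array size here is arbitrary):

```java
public class HeapUsage {
    // used heap = total heap - free heap, as in the equation above
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();          // stands in for "base JVM needs"
        long[] data = new long[10_000_000];     // ~80 MB of heap data
        long after = usedHeapBytes();
        System.out.printf("data cost ~%d bytes%n", after - before);
        if (data.length == 0) throw new AssertionError(); // keep 'data' live
    }
}
```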

When working with very large data sets this overhead can become a significant limiting factor.

3. Ability to work with very large data sets while avoiding nasty GC pauses

The sample code above forces a GC cycle before each run, which can improve the consistency of the results in some cases. Feel free to remove the call to System.gc() and observe the implications for yourself. If you run the tests with the following command line arguments then the garbage collector will output in painful detail what happened.

-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics
From analysing the output I can see the application underwent a total of 29 GC cycles. The pause times are listed below, extracted from the lines of output indicating when the application threads were stopped.

With System.gc() before each run
================================
Total time for which application threads were stopped: 0.0085280 seconds
Total time for which application threads were stopped: 0.7280530 seconds
Total time for which application threads were stopped: 8.1703460 seconds
Total time for which application threads were stopped: 5.6112210 seconds
Total time for which application threads were stopped: 1.2531370 seconds
Total time for which application threads were stopped: 7.6392250 seconds
Total time for which application threads were stopped: 5.7847050 seconds
Total time for which application threads were stopped: 1.3070470 seconds
Total time for which application threads were stopped: 8.2520880 seconds
Total time for which application threads were stopped: 6.0949910 seconds
Total time for which application threads were stopped: 1.3988480 seconds
Total time for which application threads were stopped: 8.1793240 seconds
Total time for which application threads were stopped: 6.4138720 seconds
Total time for which application threads were stopped: 4.4991670 seconds
Total time for which application threads were stopped: 4.5612290 seconds
Total time for which application threads were stopped: 0.3598490 seconds
Total time for which application threads were stopped: 0.7111000 seconds
Total time for which application threads were stopped: 1.4426750 seconds
Total time for which application threads were stopped: 1.5931500 seconds
Total time for which application threads were stopped: 10.9484920 seconds
Total time for which application threads were stopped: 7.0707230 seconds
Without System.gc() before each run
===================================
Test run times
0 - duration 12120ms
1 - duration 9439ms
2 - duration 9844ms
3 - duration 20933ms
4 - duration 23041ms
Total time for which application threads were stopped: 0.0170860 seconds
Total time for which application threads were stopped: 0.7915350 seconds
Total time for which application threads were stopped: 10.7153320 seconds
Total time for which application threads were stopped: 5.6234650 seconds
Total time for which application threads were stopped: 1.2689950 seconds
Total time for which application threads were stopped: 7.6238170 seconds
Total time for which application threads were stopped: 6.0114540 seconds
Total time for which application threads were stopped: 1.2990070 seconds
Total time for which application threads were stopped: 7.9918480 seconds
Total time for which application threads were stopped: 5.9997920 seconds
Total time for which application threads were stopped: 1.3430040 seconds
Total time for which application threads were stopped: 8.0759940 seconds
Total time for which application threads were stopped: 6.3980610 seconds
Total time for which application threads were stopped: 4.5572100 seconds
Total time for which application threads were stopped: 4.6193830 seconds
Total time for which application threads were stopped: 0.3877930 seconds
Total time for which application threads were stopped: 0.7429270 seconds
Total time for which application threads were stopped: 1.5248070 seconds
Total time for which application threads were stopped: 1.5312130 seconds
Total time for which application threads were stopped: 10.9120250 seconds
Total time for which application threads were stopped: 7.3528590 seconds

It can be seen from the output that a significant proportion of the time is spent in the garbage collector. When your threads are stopped your application is not responsive. These tests have been done with default GC settings. It is possible to tune the GC for better results, but this can be a highly skilled and significant effort. The only JVM I know of that copes well, by not imposing long pause times even under high-throughput conditions, is the Azul concurrent compacting collector.

When profiling this application, I can see that the majority of the time is spent allocating the objects and promoting them to the old generation because they do not fit in the young generation. The initialisation costs can be removed from the timing but that is not realistic. If the traditional Java approach is taken the state needs to be built up before the query can take place. The end user of an application has to wait for the state to be built up and the query executed.

This test is really quite trivial. Imagine working with similar data sets but at the 100 GB scale.

Note: When the garbage collector compacts a region, then objects that were next to each other can be moved far apart. This can result in TLB and other cache misses.

Side Note On Serialization

A huge benefit of using off-heap structures in this manner is how easily they can be serialised to network, or storage, by a simple memory copy, as I have shown in a previous post. This way we can completely bypass intermediate buffers and object allocation.
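That memory-copy serialisation can be sketched with `Unsafe.copyMemory`: a whole off-heap record block goes into a heap `byte[]` (ready for a channel write) in one call, with no per-field marshalling. This is a sketch, not the post's exact code; a null source base object tells Unsafe the offset is an absolute native address.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class StructCopy {
    public static final Unsafe UNSAFE = getUnsafe();

    // Copy 'length' bytes from an off-heap address into a heap byte array.
    public static byte[] toByteArray(long address, int length) {
        byte[] buffer = new byte[length];
        UNSAFE.copyMemory(null, address, buffer, Unsafe.ARRAY_BYTE_BASE_OFFSET, length);
        return buffer;
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        long address = UNSAFE.allocateMemory(8);
        UNSAFE.putLong(address, 0x0102030405060708L);
        byte[] bytes = toByteArray(address, 8);  // 8 raw bytes, no object graph walk
        UNSAFE.freeMemory(address);
        System.out.println(bytes.length);        // prints 8
    }
}
```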

Conclusion

If you are willing to do some C style programming for large datasets it is possible to control the memory layout in Java by going off-heap. If you do, the benefits in performance, compactness, and avoiding GC issues are significant. However, this is an approach that should not be used for all applications. Its benefits are only noticeable for very large datasets, or at the extremes of performance in throughput and/or latency.

I hope the Java community can collectively realise the importance of supporting structures both on the heap and the stack. John Rose has done some excellent work in this area defining how tuples could be added to the JVM. His talk on Arrays 2.0 from the JVM Language Summit this year is really worth a watch. John discusses options for arrays of structures, and structures of arrays, in his talk. If the tuples, as proposed by John, were available then the test described here could have comparable performance and be a more pleasant programming style. The whole array of structures could be allocated in a single action thus bypassing the copy of individual objects across generations, and it would be stored in a compact contiguous fashion. This would remove the significant GC issues for this class of problem.

Lately, I was comparing standard data structures between Java and .Net. In some cases I observed a 6-10X performance advantage to .Net for things like maps and dictionaries when .Net used native structure support. Let's get this into Java as soon as possible!

It is also pretty obvious from the results that if we are to use Java for real-time analysis on big data, then our standard garbage collectors need to significantly improve and support true concurrent operations.

[1] - To my knowledge the only JVM that deals well with very large heaps is Azul Zing

With a plain byte[] I'd expect it to be slower on access because of the bounds checking and the byte-to-primitive conversions that need to constantly happen. A native order ByteBuffer is a better approach but still slower than using Unsafe. Being limited to 2GB for a normal byte[] is not exactly what I would call big data :-) You would first need to select one of many 2GB arrays to address into, thus adding another level of indirection when dealing with a large heap.

Yes, I pointed out that GC is one of the biggest costs when you run a profiler on this type of problem. Real world apps have to allocate the objects, and for big data applications they will not all fit in the young generation and thus must be promoted.

To be fair both approaches have to allocate the memory and initialise it to be like-for-like. A bunch of forced GCs and sleeps is not real world behaviour.

The world is very interesting when you measure and profile :-) Thanks for the feedback.

Yes, real world apps allocate memory. But they do it the same way for POJO and direct backed storage. So, if you intend to allocate once and in big chunks, you can get the benefit from direct memory, but you also get the benefit of pre-allocating the array and entries in POJO style. Or, if your app allocates objects incrementally, you'll have much trouble with unsafe.allocate on many small chunks -- I would bet it is much slower than the usual Java thread-local-buffer based allocation.

By the way, about Java-array-backed storage: I've just done benchmarks with a long[] backed "trade array", and it is very close to the direct-buffer one. Yes, the world is a very interesting place when you conjecture and verify :)

For the long[] approach do you use a slot for each variable or do you pack when smaller? For example 2 ints pack into a long. Can you post a link to code showing the comparison? I'm curious to know what approach is taken and if it will work for the full range of data types.

I've packed two ints into 1 long, yes. For the last char I use a full long cell. I'm also curious about serialization/deserialization of different primitive types: your example was quite simple since it was long-dominated.

Your "long[]" approach works well if your data is mostly longs. If you have a lot of smaller primitives then the packing costs will start to dominate. Unsafe works well when you have the full range of data types and also has the advantage of supporting volatile and CAS operations. With large data sets you want to pack them to get the most out of memory.
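The two-ints-in-one-long packing under discussion can be sketched as a pair of shift/mask helpers (the cost being debated is exactly these operations on every field access):

```java
public class IntPacking {
    // Two 32-bit ints occupy one 64-bit long[] slot: a high half and a low half.
    public static long pack(int high, int low) {
        return ((long) high << 32) | (low & 0xFFFF_FFFFL); // mask stops sign extension
    }

    public static int unpackHigh(long packed) {
        return (int) (packed >>> 32); // unsigned shift recovers the high half
    }

    public static int unpackLow(long packed) {
        return (int) packed; // truncation keeps the low 32 bits
    }

    public static void main(String[] args) {
        long slot = pack(-7, 42);
        System.out.println(unpackHigh(slot) + " " + unpackLow(slot)); // prints -7 42
    }
}
```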

The big advantage with any of the large array/buffer approaches is they are directly allocated in the old gen and avoid the generational copy cost.

Well, generally I agree. But there are also some tricks to use: you are not bound to longs, you could choose the primitive by your data's dominant field type. I think it is not a very complex task to make some fitness function which accounts for the cost of packing/unpacking against the cost of storing/loading data in smaller chunks than the native hardware word, and so gives you the best storage type for a given set of fields and types.

From the other side, I would rely on JIT support. I have not examined the assembly yet, but couldn't the JIT optimize pack-and-store/load-and-unpack by using wider/narrower load/store instructions? It seems like a not very complex heuristic.

About old gen -- well, do you actually see copying? If you allocate a 16GB long[] it does not fit into the young gen at all -- it will be allocated in the old gen from the beginning, afaik.

As I said in my previous comment that the large array *is* allocated directly in old gen, thus avoiding the copy. With the POJO approach all those individual objects need to be copied a number of times in the promotion via the survivor spaces.

By the way, my point is that you can get most of C-like structure performance and compactness with plain Java arrays (although it will be an unusual Java style), without black magic. This is much safer, and it may be even faster, or at least very close to direct memory. It seems like direct memory gives you a real benefit only in a few corner cases.

You are right that things can be very close with standard arrays. It is safer and easier to debug. I'd only recommend this approach if you really really need to :-) In fact I often recommend standard array approaches with flyweights for column-oriented approaches. Column-oriented in arrays makes excellent indices for searching.

The cases where direct memory has real benefits is if:

1. Your memory needs are greater than 2-8GB depending on data types.
2. You want to directly serialise by copying to input or output buffers without conversion.
3. You are performing concurrent operations on fields and need volatile and/or Compare and Swap semantics. This is very common in uber large memory applications.
4. You use a wide range of data types, especially Strings as byte or char arrays.
5. The address you use is to a memory-mapped file that you are using for storage, therefore gaining transparent persistence.
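The memory-mapped file case can be sketched with plain NIO, no Unsafe required for the basics: writes land in the OS page cache and reach the file without any explicit serialisation step. The temp file and 4 KiB mapping size here are just placeholders for a real store.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStore {
    // Map the start of the file, write a long at offset 0, read it back.
    public static long roundTrip(File file, long value) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            MappedByteBuffer map = raf.getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            map.order(ByteOrder.nativeOrder());
            map.putLong(0, value);   // a "field" in record 0
            map.force();             // flush dirty pages to storage explicitly
            return map.getLong(0);
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("trades", ".dat"); // stand-in for a real store
        file.deleteOnExit();
        System.out.println(roundTrip(file, 42L)); // prints 42
    }
}
```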

Thanks for the interaction and feedback! Hopefully those reading are learning from the comments.

"If you have a lot of smaller primitives then the packing costs will start to dominate" -- shouldn't the JIT be good at packing/unpacking? They are simple multiply/and/or/plus/shift operations, and in addition an Intel core has several units to compute in the pipeline, so shouldn't the pack/unpack cost be very small? If not, maybe the JIT compiler can be improved in this regard.

Pack and unpack operations will be relatively cheap compared to a cache miss. Have you considered that lots of other work needs to be done as part of the computation, so it is better not to use all the execution units unnecessarily when we have alternatives?

I don't understand the JIT reference. The JIT cannot turn these into packed structures because it would not be using the primitive array as specified. It also cannot make work magically go away. Am I missing something in this point?

http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/cpu/x86/vm/templateTable_x86_64.cpp shows that if unpack/pack use OR/AND/SHIFT/ADD it will be cheap. However, to access a byte at a certain offset of the array -- the "[index]" -- I hope it is not using this (the baload opcode); would it involve the multiply by HeapWordSize in base_offset_in_bytes? Upsetting.

We are expanding on the scope of the post but that is OK. On a real project I would byte align the beginning of each structure so addressing can be done with shifts. This is also necessary to ensure the words for concurrent access are aligned just as Java object fields are.

I'm working with some JVM implementers on a new intrinsic that treats a class like the following as if it is a real array of structures.

Agreed, variable length data such as a char array can be 0 padded to an array length that is a power of 2; base 10 is for humans, not memory (need a sufficient cache size? it will be hard to decide 64 -> 128, ouch). "return partitions[partitionIndex][partitionOffset];" Still, there has to be magic behind the getter in the above line, right? I guess this JVM implementation is not Oracle HotSpot :) Would this intrinsic also be efficient in terms of wiping the region clean and copying to another instance?

When using 64-bit addressing, the "magic" behind the getter is not very difficult as an intrinsic. The issue with the above is the 32-bit index addressing limitation on Java arrays. This class specifies the behaviour, and not the implementation, for the intrinsic.

This can be made very efficient for copy and reset operations with contiguous memory layout.

I know of a number of JVM implementers who are looking at memory layout with structures, arrays, and object co-location. It would be great to see this work become generally available.

I feel the same way. The JVM works best if you only use the heap for scratch space and very long lived immutable state. Don't let anything make it out of the young generation and if you can, also avoid copying anything to survivor spaces as well.

I wish someone would implement a red-black tree or hash table based on this design.

You should test with a larger memory machine. It gets more interesting then. With increased volume the TLB and other caches get put under greater stress. At 10X memory compared to you, I'm seeing ~70% delta on just the iteration.

The real killer comes in with real world applications where the heap is fragmented and those objects are all over the place. The direct memory will always be together. I'd need to write a lot more code to show that in action. No other objects are allocated in this test to mix up the heap.

All memory management is happening inside the JVM, the min and max sizes are set at startup. In my experience it is the processor rather than the OS making the difference here. Sandy Bridge has an improved cache and memory ordering buffer compared to Nehalem, and it will get a lot better again in Haswell.

Very interesting! Have you looked at HugeCollections library (http://code.google.com/p/vanilla-java/wiki/HugeCollections)? What do you think of it? It seems to provide a similar functionality. The main difference is that it is column-oriented as opposed to your example, which is row-oriented.

I've not tried HugeCollections personally. Column-oriented is great for some classes of problem and row-oriented is great for others. I use whichever is most appropriate for the problem I need to solve. Neither is perfect for all scenarios.

1. Took your code.
2. Changed the 5 reps to NUM_REPS where NUM_REPS == 500.
3. Removed the System.gc(); call in the main loop, only to see how this might run in a non-testing environment.
4. NUM_RECORDS = 3 * 1000 * 1000

DirectMemory has less variation (i.e. coefficient of variation)

See http://screencast.com/t/mGBmoqOyETcG.

a) The top panel is JavaMemoryLayout, the bottom panel is DirectMemoryLayout.
b) Dropped the first 10 observations for both.
c) The y-axis is log(runtime, 10).
d) These are 490 observations plotted in order of occurrence.

Very very impressive. Though could you explain the cycles in the DirectMemoryLayout?

Is the y-axis the time to complete the run? If it is, the issue might be that the OS memory is being fragmented as the process constantly grows and shrinks in the direct case. The plain Java version has a constant memory requirement.

Try taking the memory allocation out of the loop in the direct case, which is more like a real application, and see what it does.

I believe many "Big Memory" solutions and KV stores use similar techniques with ByteBuffers that are either heap based or memory-mapped files. If structures/tuples are added to the language, like John Rose has described, then none of these workarounds would be necessary.

One thing that strikes me right now is that we could use UnsafeMemory to alleviate some full GC pain. Assume that we have something like a Disruptor that we need to provide with byte buckets (byte arrays of, say, 1500 bytes). If we allocate them from the heap, they will reside in the old gen but will never be released, just increasing the live data set of the old gen. OTOH, if we allocate those from UnsafeMemory, they will not clutter the old gen, potentially lowering full GC times. I know I definitely could use something like that right now as the full GCs don't release all that much memory.

Going off-heap is very common in low-latency applications to avoid the GC overhead. This is especially important if you need to be fast right out the gate.

Since byte buffers don't contain object references there is no card marking and thus limited GC interaction once promoted. The majority of the cost will come in allocation and promotion to the old generation.

I have multiple publishers... Is it still possible to use chained blocks or variable sized records in that case? I cannot figure out where to put the pointer to the next block. If by block you mean a RingBuffer Event, at translate time we haven't claimed the next seqNum yet, but if you mean DirectMemoryObj, the get() function only works reliably for multiple producers if the blocks are fixed length. Am I missing something?

The most dramatic speed and memory usage improvement comes when you compare standard Java data structures with Unsafe-based ones: up to 8-10 times less memory usage, and zero GC on 50-100GB heaps (off-heaps, of course).

There is one question on Unsafe.allocateMemory/freeMemory implementation. Is it platform malloc/free or the interface to JVM malloc/free? Platform (default Linux glibc, for example) is prone to significant memory fragmentation and not suitable for long running server applications. Many native Linux (and I suppose Windows) applications use alternative memory allocators (custom, tcmalloc, jemalloc etc).

The implementation of Unsafe.allocateMemory will likely be JVM specific and therefore you cannot rely on an underlying implementation. Best to allocate large contiguous chunks and manage them yourself. I would not use this for object level allocations.
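Allocating one large chunk and carving it up yourself can be as simple as a bump (arena) allocator. This is a sketch only: single-threaded, no freeing of individual allocations, everything released at once with the region.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Bump allocator over one large Unsafe region: one malloc up front,
// then each allocation is a single addition; everything is freed together.
public class Arena {
    private static final Unsafe UNSAFE = getUnsafe();

    private final long base;
    private final long limit;
    private long next;

    public Arena(long capacityBytes) {
        base = UNSAFE.allocateMemory(capacityBytes);
        next = base;
        limit = base + capacityBytes;
    }

    public long allocate(long size) {
        long aligned = (size + 7) & ~7L;     // keep every block 8-byte aligned
        if (next + aligned > limit) {
            throw new OutOfMemoryError("arena exhausted");
        }
        long address = next;
        next += aligned;
        return address;
    }

    // Releases the whole region; individual blocks are never freed.
    public void destroy() { UNSAFE.freeMemory(base); }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```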

Total time for which application threads were stopped: 0.8967600 seconds
Total time for which application threads were stopped: 1.0796190 seconds
Total time for which application threads were stopped: 1.6817910 seconds
Total time for which application threads were stopped: 0.2799290 seconds
Total time for which application threads were stopped: 2.5433730 seconds

That is one way to avoid the generational copy issues :-) GC can be played with for a microbenchmark. The real challenge is how to do this well without lots of tricks and be part of a real application. The only true answer is to add support for arrays of structs to Java which cannot come soon enough.

"To my knowledge it is not possible to simulate this feature on the Java stack." Being pedantic for a moment...

You can create pseudo-structures on the stack using primitives.

It's just that it's not terribly useful in the Java world. Kind of like trying to model functional programming in Java -- you can do it, but it's clunky and painful.

Here's an ugly example of using the stack for structures in Java. Instead of defining a "car" structure, we use primitive parameters to simulate passing a car structure on the stack from one procedure to the next.

This example is somewhat modeled after creating a temporary structure on the stack in C, doing some work with it, then throwing it away.
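A sketch of the idea described above: the "car" exists only as primitive locals and parameters, so nothing is heap allocated (the field names and formula are made up for illustration).

```java
public class StackStruct {
    // The "car" is three primitives passed together; in C this would be
    // struct car { int weightKg; double fuelLitres; double kmPerLitre; }.
    public static double rangeKm(int weightKg, double fuelLitres, double kmPerLitre) {
        // work on the pseudo-structure's "fields" directly from the stack frame
        double penalty = weightKg > 2000 ? 0.9 : 1.0;
        return fuelLitres * kmPerLitre * penalty;
    }

    public static void main(String[] args) {
        // "create" a temporary car on the stack, use it, then throw it away
        System.out.println(rangeKm(1500, 50.0, 12.0)); // prints 600.0
    }
}
```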

By "on the stack" I mean the actual stack that will be generated in native code after the optimiser has run. No heap objects at all so no GC interaction, and the structs/objects are *in* the stack frame for memory locality. Your pseudo objects above are on the heap away from the stack frame and need to be GC'ed. Real stack based data is key to the scalability of many true parallel algorithms to avoid contention.

Firstly, I would like to thank you for your blog and the excellent posts you put there. I wanted to use the Unsafe approach in my project, but I encountered one issue which I can't resolve. I'm trying to create a byte array, which I then copy to off-heap memory, getting a pointer to it. Everything works fine with 32-bit Java, but with 64-bit Java I'm getting a JVM crash :(. Do you have any explanation for why the simple test program below is crashing on 64-bit, but not on 32-bit? Thanks a lot for your help!

You cannot safely treat off-heap memory like real objects. You must use Unsafe or direct buffers to read and write the bytes as primitives, and not as real objects.

I suspect this is working on 32-bit because the address is real. On 64-bit the JVM uses compressed object references that are treated as special offsets, based on byte alignment, rather than real addresses. This *may* work on 32GB+ heaps but is still wrong.

thank you very much for your reply. If I disable compressed oops with -XX:-UseCompressedOops option everything works fine on 64bit too. Do you still think that I should not use the "Pointer" class at all?

The reason why I use it is the following scenario:

The application is receiving messages from the middleware. The middleware supports writing the messages directly into the byte array on the given offset - something like int recv(byte[] bytearray, int offset). So with "Pointer" class I can pass the off-heap memory as byte array to recv method and middleware will write directly to off-heap memory - instead of writing to temporary managed byte array and then rewriting this array to off heap through Unsafe.

What do you think about this approach? I think the only question is what happen if the "Pointer" instance is garbage collected - but this should not be an issue if I prevent it.

I still think this is evil :-/ The garbage collector will try and mark this pointer as reachable when it clearly is not. You have just deferred the crash to some random point in the future. Going off-heap is a "handle with care situation". Please please please do not do things to deceive the runtime. It will only end in tears.

Why don't you allow the middleware to write off-heap then use a flyweight to access the data? Or simply use a direct ByteBuffer?

I agree that letting the GC mark the pointer as reachable is wrong, but if I write the code so that I always keep the reference to the "Pointer" instance until the application is terminated then it should not be a big issue. But I have to agree that the whole "Pointer" approach is evil :).

Btw, I can't allow the middleware to write to off-heap directly because it is not internally developed middleware - we are using zeromq (http://www.zeromq.org/).

The issue with the GC marking a pointer as reachable is that many VMs do this by updating a state table at some given offset from the current page containing the object. It may also try to move the fake object during compaction. Off-heap memory must be treated as raw bytes and cannot be safely treated as real Java objects.

We are organising training spending for next year at my work. I would like to attend your concurrency course if you are running it. I had a look at the instil site for the course and it is listed as TBC, so I am registering my interest here to encourage you to run it :)

Having recently been advised by yourself (many thanks) to beware of cache line alignment issues, I wondered how you would reason about alignment in the above case (assuming a 64-byte cache line):

1) The trade structure size is 42.
2) unsafe.allocateMemory will return an address that "will never be zero, and will be aligned for all value types" --> will be 16 bytes aligned, from what I can find.

This means your address can start in one of 4 locations on a cache line (0, 16, 32, 48), resulting in a variety of conditions in which one of the fields is split across 2 cache lines. E.g. if the address is cache aligned then the second record has its 4th field split between line one and line two. I understand this can result in loss of atomicity of updates to the field, meaning a half formed field could be visible on write. I assume this can be corrected by padding the structure such that this cannot happen (to a size which is a multiple of 16). Am I correct in my reasoning so far? If you agree that the issue exists, how would it manifest (perf issue? correctness issue)?

Thanks again,
Nitsan

This is the reason why Java will align objects on 8-byte boundaries and then carefully organise the fields like I described in a previous blog on false sharing.

1) You could get issues if the word you are using for coordination in a lock-free concurrent algorithm is split across cache lines, thus making loads and stores not atomic.

2) You can also get performance issues reading fields that are split across cache lines, or even not aligned on word boundaries depending on processor.

I did not want to make this article more complex, but when building systems I allocate my off-heap objects on word boundaries and do similarly for their fields. Maybe I'll do a follow up with more detail on this.
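For example, a 42-byte record could be padded so that every record starts on an 8-byte boundary, given that allocateMemory returns an at-least-8-byte-aligned base. A sketch of the rounding helper (48 being the next multiple of 8 above 42):

```java
public class Alignment {
    // Round size up to the next multiple of 8 so that, with an 8-byte aligned
    // base address, no 8-byte field placed on an 8-byte offset spans a cache line.
    public static long align8(long size) {
        return (size + 7) & ~7L;
    }

    public static void main(String[] args) {
        System.out.println(align8(42)); // prints 48
    }
}
```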

1) Would you not get similar issues of correctness on regular reads/writes?
2) Would you not see performance issues for writes?

I agree with the sentiment of keeping it simple, but as your blog is widely read, and highly regarded, I think many people would not be aware of the issue and would copy your example pretty much as is. But perhaps I'm just projecting my ignorance on others :).

Regular loads and stores will work as expected but may suffer performance issues as words get assembled that span cache lines. Store costs are more likely to be hidden by the store buffer and write combining buffers than load costs.

Cache alignment is only a performance issue. Never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non atomic updates within a single cache line can still be seen non-atomically by other threads.

The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).

The more complex atomic operations (e.g. AtomicLong.addAndGet()) that both read and write a memory field within a single atomic operation are guaranteed to provide atomicity regardless of memory layout.

In practice, JVM implementations typically force fields to not cross cache line boundaries by simply aligning all fields to their field size. This is a common necessity since most architectures do not support unaligned types in memory, and do not support unaligned atomic memory operations.

BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on an x86, a LOCK operation that spans the boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).

I have significant concerns about these microbenchmarks and the use of sequential integers in the test data. In the past, I've seen very significant skewing of test results occur when that is done. I just did a test to verify that these tests also suffer from the same flaw.

In each init() method I added a seeded random (to ensure the same values are being compared):

java.util.Random r = new Random(NUM_RECORDS);

and for each use of "i" in constructing the trade record, I replaced "i" with r.nextInt(NUM_RECORDS). Doing this had little impact on the traditional Java test (probably because GC predominates). The initial test runs had been ~10 seconds on my box; that stayed at ~10 seconds, rising to ~20+ seconds for the final two runs. However, the DirectMemoryTest had been around 920ms on my box, but after making the change the times jumped to 4.5 seconds. I think that 5-fold difference is pretty significant and should be more carefully guarded against. The direct memory is still clearly faster, but the advantage was pretty significantly eroded. I'd suggest that pseudo random numbers are far more representative of many (most?) real world applications (certainly any trading system) than sequentially increasing numbers.

Your use of random here will likely defeat the hardware prefetcher. For bulk operations you want the support of the hardware prefetcher to hide memory latency. I did that very deliberately. I've blogged about this in more detail.

Yes, I read that post. First, I think this isn't about hardware prefetching since we're not fetching trades based on the contents of previous trades. We're still iterating over memory (the trades themselves) sequentially. Note I am not randomly accessing the trades, simply setting the values on the trades to random numbers. The only use of random (in my modification) was in the init() method. e.g.: