Thursday, 5 July 2012

Native C/C++ Like Performance For Java Object Serialisation

Do you ever wish you could turn a Java object into a stream of bytes as fast as it can be done in a native language like C++? If you use standard Java Serialization you could be disappointed with the performance. Java Serialization was designed for a very different purpose than serialising objects as quickly and compactly as possible.

Why do we need fast and compact serialisation? Many of our systems are distributed and we need to communicate by passing state between processes efficiently. This state lives inside our objects. I've profiled many systems and often a large part of the cost is the serialisation of this state to-and-from byte buffers. I've seen a significant range of protocols and mechanisms used to achieve this. At one end of the spectrum are the easy to use but inefficient protocols like Java Serialisation, XML and JSON. At the other end of this spectrum are the binary protocols that can be very fast and efficient, but they require a deeper understanding and more skill.

In this article I will illustrate the performance gains that are possible when using simple binary protocols and introduce a little known technique available in Java to achieve similar performance to what is possible with native languages like C or C++.

The three approaches to be compared are:

Java Serialization: The standard method in Java of having an object implement Serializable.

Binary via ByteBuffer: A simple protocol using the ByteBuffer API to write the fields of an object in binary format. This is our baseline for what is considered a good binary encoding approach.

Binary via Unsafe: Introduction to Unsafe and its collection of methods that allow direct memory manipulation. Here I will show how to get similar performance to C/C++.
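To make the ByteBuffer baseline concrete, here is a minimal sketch of that approach. The Trade class, its fields, and the wire layout are my own illustration, not the article's actual test object:

```java
import java.nio.ByteBuffer;

public final class ByteBufferExample
{
    // A hypothetical value object: one primitive field, one array field
    static final class Trade
    {
        long id;
        double price;
        long[] quantities;
    }

    static void write(final ByteBuffer buffer, final Trade trade)
    {
        buffer.putLong(trade.id);
        buffer.putDouble(trade.price);
        buffer.putInt(trade.quantities.length); // length prefix for the array
        for (final long quantity : trade.quantities)
        {
            buffer.putLong(quantity);
        }
    }

    static Trade read(final ByteBuffer buffer)
    {
        final Trade trade = new Trade();
        trade.id = buffer.getLong();
        trade.price = buffer.getDouble();
        final int length = buffer.getInt();
        trade.quantities = new long[length];
        for (int i = 0; i < length; i++)
        {
            trade.quantities[i] = buffer.getLong();
        }
        return trade;
    }

    public static void main(final String[] args)
    {
        final Trade trade = new Trade();
        trade.id = 42L;
        trade.price = 99.5d;
        trade.quantities = new long[] {1, 2, 3};

        final ByteBuffer buffer = ByteBuffer.allocate(64);
        write(buffer, trade);
        buffer.flip(); // switch from writing to reading

        final Trade copy = read(buffer);
        System.out.println(copy.id + " " + copy.price + " " + copy.quantities.length);
    }
}
```

Note that each field is written explicitly in a fixed order; reader and writer must agree on that order, which is the "deeper understanding and skill" the binary end of the spectrum demands.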

To write and read back a single relatively small object on my fast 2.4 GHz Sandy Bridge laptop can take ~10,000ns using Java Serialization, whereas when using Unsafe this can come down to well less than 100ns even accounting for the test code itself. To put this in context, when using Java Serialization the costs are on par with a network hop! Now that would be very costly if your transport is a fast IPC mechanism on the same system.

There are numerous reasons why Java Serialisation is so costly. For example it writes out the fully qualified class and field names for each object plus version information. Also ObjectOutputStream keeps a collection of all written objects so they can be conflated when close() is called.
Java Serialisation requires 340 bytes for this example object, yet we only require 185 bytes for the binary versions. Details for the Java Serialization format can be found here. If I had not used arrays for the majority of data, then the serialised object would have been significantly larger with Java Serialization because of the field names. In my experience text based protocols like XML and JSON can be even less efficient than Java Serialization. Also be aware that Java Serialization is the standard mechanism employed for RMI.

The real issue is the number of instructions to be executed. The Unsafe method wins by a significant margin because in Hotspot, and many other JVMs, the optimiser treats these operations as intrinsics and replaces the call with assembly instructions to perform the memory manipulation. For primitive types this results in a single x86 MOV instruction which can often happen in a single cycle. The details can be seen by having Hotspot output the optimised code as I described in a previous article.

Now it has to be said that "with great power comes great responsibility" and if you use Unsafe it is effectively the same as programming in C, and with that can come memory access violations when you get offsets wrong.

Adding Some Context

"What about the likes of Google Protocol Buffers?", I hear you cry out. These are very useful libraries and can often offer better performance and more flexibility than Java Serialisation. However they are not remotely close to the performance of using Unsafe like I have shown here. Protocol Buffers solve a different problem and provide nice self-describing messages which work well across languages. Please test with different protocols and serialisation techniques to compare results.

Also the astute among you will be asking, "What about Endianness (byte-ordering) of the integers written?" With Unsafe the bytes are written in native order. This is great for IPC and between systems of the same type. When systems use differing formats then conversion will be necessary.

How do we deal with multiple versions of a class, or determine what class an object belongs to? I want to keep this article focused, but let's say a simple integer to indicate the implementation class is all that is required for a header. This integer can be used to look up the appropriate implementation for the de-serialisation operation.
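A sketch of that lookup scheme, assuming a hypothetical Decoder interface and type-id registry of my own invention:

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public final class TypeHeaderExample
{
    interface Decoder
    {
        Object decode(ByteBuffer buffer);
    }

    // Registry mapping integer type ids to deserialisation implementations
    private static final Map<Integer, Decoder> DECODERS = new HashMap<>();

    static
    {
        // id 1 -> a "price" message carrying a single long payload
        DECODERS.put(1, (buffer) -> buffer.getLong());
    }

    static void encodePrice(final ByteBuffer buffer, final long price)
    {
        buffer.putInt(1);      // type id header
        buffer.putLong(price); // payload
    }

    static Object decode(final ByteBuffer buffer)
    {
        final int typeId = buffer.getInt(); // read the header first
        final Decoder decoder = DECODERS.get(typeId);
        if (null == decoder)
        {
            throw new IllegalArgumentException("unknown type id: " + typeId);
        }
        return decoder.decode(buffer);
    }

    public static void main(final String[] args)
    {
        final ByteBuffer buffer = ByteBuffer.allocate(16);
        encodePrice(buffer, 12345L);
        buffer.flip();
        System.out.println(decode(buffer));
    }
}
```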

An argument I often hear against binary protocols, and for text protocols, is what about being human readable and debugging? There is an easy solution to this. Develop a tool for reading the binary format!

Conclusion

In conclusion it is possible to achieve the same native C/C++ like levels of performance in Java for serialising an object to-and-from a byte stream by effectively using the same techniques. The UnsafeMemory class, for which I've provided a skeleton implementation, could easily be expanded to encapsulate this behaviour and thus protect oneself from many of the potential issues when dealing with such a sharp tool.
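The original UnsafeMemory listing is not reproduced here; what follows is my own minimal sketch along the same lines. It assumes a JVM where sun.misc.Unsafe can be obtained via reflection and where the five-argument copyMemory overload exists (JDK 7 and later):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public final class UnsafeMemory
{
    private static final Unsafe UNSAFE;
    private static final long BYTE_ARRAY_OFFSET;
    private static final long LONG_ARRAY_OFFSET;

    static
    {
        try
        {
            // Unsafe.getUnsafe() is restricted; the usual back door is reflection
            final Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            UNSAFE = (Unsafe)field.get(null);
        }
        catch (final Exception ex)
        {
            throw new RuntimeException(ex);
        }

        BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
        LONG_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(long[].class);
    }

    private final byte[] buffer;
    private int pos = 0;

    public UnsafeMemory(final byte[] buffer)
    {
        this.buffer = buffer;
    }

    public void reset()
    {
        pos = 0;
    }

    public void putLong(final long value)
    {
        UNSAFE.putLong(buffer, BYTE_ARRAY_OFFSET + pos, value);
        pos += 8;
    }

    public long getLong()
    {
        final long value = UNSAFE.getLong(buffer, BYTE_ARRAY_OFFSET + pos);
        pos += 8;
        return value;
    }

    public void putLongArray(final long[] values)
    {
        putLong(values.length); // length prefix
        final int bytes = values.length << 3;
        UNSAFE.copyMemory(values, LONG_ARRAY_OFFSET, buffer, BYTE_ARRAY_OFFSET + pos, bytes);
        pos += bytes;
    }

    public long[] getLongArray()
    {
        final int length = (int)getLong();
        final long[] values = new long[length];
        final int bytes = length << 3;
        UNSAFE.copyMemory(buffer, BYTE_ARRAY_OFFSET + pos, values, LONG_ARRAY_OFFSET, bytes);
        pos += bytes;
        return values;
    }

    public static void main(final String[] args)
    {
        final UnsafeMemory memory = new UnsafeMemory(new byte[64]);
        memory.putLong(42L);
        memory.putLongArray(new long[] {1, 2, 3});

        memory.reset();
        System.out.println(memory.getLong());
        System.out.println(memory.getLongArray().length);
    }
}
```

Note there is no bounds checking here: a wrong offset or an undersized buffer corrupts memory or crashes the JVM, which is exactly the "sharp tool" caveat above.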

Now for the burning question. Would it not be so much better if Java offered an alternative Marshallable interface to Serializable, offering natively what I've effectively done here with Unsafe?

No, Akka doesn't use this technique, and I'm not sure where that'd be beneficial right now as in-memory message passing is done via references.

Akka Serialization is completely pluggable and configurable, so you can write your own serializers, use other people's serializers and mix and match at will. The Akka Remote Protocol uses Protocol Buffers to allow for platform independence, but the payloads use Akka Serialization.

So it's a bit better than standard serialization, and Externalizable can be improved upon by overriding the class descriptor writer, but there's not much point. It's not going to beat ByteBuffer, let alone UnsafeMemory.

They found that Externalizable gives the best results of all, beating even Kryo!

I wonder what caused this huge discrepancy.

Possibly it was something with the environment (hardware, OS, JVM, etc).

Another possibility is the seemingly minor choice of what streams to use. The source code that you posted uses ObjectInput/ObjectOutput interfaces. What concrete classes did you use? The only ones built into the JDK are ObjectInputStream/ObjectOutputStream.

In contrast, download the Thrift benchmarks as described here http://code.google.com/p/thrift-protobuf-compare/source/checkout

Then open up the class ...\thrift-protobuf-compare-read-only\tpc\src\serializers\JavaExtSerializer.java

There you will find that it defines custom ObjectInput/ObjectOutput implementations that essentially are just DataInputStream/DataOutputStream.

Could it be that DataXxxStream has vastly lower overhead than ObjectXxxStream, even when all you do is write/read primitives like you do? Skimming the ObjectOutputStream source code, I note that primitives are written using an underlying BlockDataOutputStream. (This is also mentioned near the end of that class's javadocs.) In contrast, if you look at the source code of DataOutputStream, you will see that methods like writeInt write directly to the underlying OutputStream.
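A sketch of the shim that comment describes: an ObjectOutput that is really just a DataOutputStream, so primitive writes go straight through without ObjectOutputStream's block-data machinery. The class name DataObjectOutput is mine, not the benchmark's, and it deliberately supports primitives only:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutput;
import java.io.OutputStream;

// DataOutputStream already provides every ObjectOutput method except writeObject
final class DataObjectOutput extends DataOutputStream implements ObjectOutput
{
    DataObjectOutput(final OutputStream out)
    {
        super(out);
    }

    public void writeObject(final Object obj) throws IOException
    {
        // General object graphs still need ObjectOutputStream proper
        throw new UnsupportedOperationException("primitives only");
    }
}

public final class DataStreamExample
{
    public static void main(final String[] args) throws IOException
    {
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        final DataObjectOutput out = new DataObjectOutput(baos);
        out.writeInt(7);
        out.writeLong(11L);
        out.flush();

        final DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        System.out.println(in.readInt() + " " + in.readLong());
    }
}
```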

So, I suspect that your results could be improved a lot with a tiny code adjustment. Would you mind re-running?

On another note, I have not played with it yet, but here is a project that claims to be a drop-in replacement for JDK serialization, but as fast or faster than Kryo: http://code.google.com/p/fast-serialization/

Followup to my own reply: looking closer at the thrift benchmark JavaExtSerializer code, I think that they somewhat cheated.

If you look closer, you will see that they only write then read an object of a single type (MediaContent). Consequently, their deserialize method does not determine what type of object is next on the input stream; it assumes that a single MediaContent instance is there.

So, the solution that they give is NOT a general solution for writing a series of objects of potentially different types to a stream. It needs to be more complicated (e.g. when writing, first write what type of object is coming).

That said, their approach of only effectively using a DataXxxStream is intriguing. Nickman, I would still like to see what you get if you try that.

Thanks for posting these measurements. Very helpful to know what the tradeoff is. I will probably continue to use ByteBuffers for the bounds checking. So much parallelism is available that I don't see Unsafe as the right tradeoff for serialization to disk or network.

I think the right way to view these results is in terms of the number of megabytes of serialization you can do per core per second. My napkin math says 974 megabytes per core. By way of comparison, when I tested Snappy I got the advertised 250 megabytes/second, and LZ4 compression claims to do 300 megabytes/sec.

As you point out, these numbers matter if you are doing IPC on a local system, and if you have 10GbE there is some fat to trim, but you can scale your way out of this particular bottleneck. I'll bet there is more fat to trim in the cache misses incurred by local IPC and the associated coordination than in the serialization scheme itself. That too is on the order of 100s of nanoseconds.

It would be nice if you could collaborate with the team behind [https://github.com/eishay/jvm-serializers] (there is a mailing list). It is much easier to get a decent perspective that way; especially since while (de)serialization is sometimes a significant cost, quite often it really is not (which was implied in the article too).

Apologies for the chart formatting. My understanding is that the serializer runner in the jvm-serializers test suite does indeed do warmup runs, and runs each test 500 times against each serializer. I had the test run against the serializers I'm most familiar with: Java and Kryo.

The two numbers I was interested in were the serialization and deserialization costs, which are the second and third numbers in the chart above (in nanoseconds, I believe). We currently use an older version of Kryo than the one in the test, so I was curious just how much faster, if at all, using Unsafe would be. From the test results above it would seem that the serialization times are slightly faster and the deserialization times are half of what Kryo achieves.

For any performance test it is important to do multiple runs of over 10,000 iterations to allow the optimiser to kick in, otherwise you are just testing the interpreter. An option is to vary the CompileThreshold.

From the documentation for the -server JVM: -XX:CompileThreshold=10000 is the number of method invocations/branches before compiling [-client: 1,500].
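A minimal sketch of that warmup advice; the work() method and the iteration counts are purely illustrative, not a serialisation benchmark:

```java
public final class WarmupExample
{
    static long work(final long x)
    {
        return x * 31 + 7;
    }

    public static void main(final String[] args)
    {
        long sink = 0;

        // Warm up: comfortably exceed the default -server CompileThreshold of 10,000
        // so the measured loop runs JIT-compiled code, not the interpreter
        for (int i = 0; i < 100_000; i++)
        {
            sink += work(i);
        }

        final long start = System.nanoTime();
        final int runs = 1_000_000;
        for (int i = 0; i < runs; i++)
        {
            sink += work(i);
        }
        final long avgNanos = (System.nanoTime() - start) / runs;

        // Print sink as well so the loops cannot be eliminated as dead code
        System.out.println("ns per op: " + avgNanos + " (sink=" + sink + ")");
    }
}
```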

If you want a specific byte ordering with Unsafe, you can use Long/Integer.reverseBytes(). On Intel these are both implemented as intrinsics (BSWAP) and are quite efficient. You can statically evaluate whether you need to reverse the bytes for your platform by writing out a long in the desired byte order and reading it in again. Store this in a static final field and the value can be read via Unsafe and reversed if required.
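A sketch of that idea. For brevity I detect the platform order with ByteOrder.nativeOrder() rather than the write-and-read-back probe described above, and I arbitrarily pick big endian as the wire format:

```java
import java.nio.ByteOrder;

public final class EndianExample
{
    // Wire format is big endian here; reverse only on little-endian hosts
    private static final boolean REVERSE =
        ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;

    static long toWire(final long value)
    {
        return REVERSE ? Long.reverseBytes(value) : value;
    }

    static long fromWire(final long value)
    {
        return REVERSE ? Long.reverseBytes(value) : value;
    }

    public static void main(final String[] args)
    {
        final long value = 0x0102030405060708L;

        // The round trip is the identity regardless of platform order
        System.out.println(fromWire(toWire(value)) == value);
    }
}
```

Because REVERSE is a static final boolean, the JIT can fold the branch away entirely, so the common same-endian path costs nothing.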

Hi Martin, I've recently written a similar article on my blog and updated it after reading this one. I've made a full set of ByteBuffer performance tests: heap/direct buffers; little/big endian; working on separate elements of an array or using bulk methods; Java 6 b30/Java 7 b2. I've tested byte[] and long[] processing times separately, so I ended up with 64 time measurements for ByteBuffers and 16 for Unsafe. As it turned out, direct byte buffers with native byte order used to serialise arrays of any primitive type are nearly indistinguishable in performance from Unsafe. You can read my article here: http://vorontsovm.blogspot.com.au/2012/07/javaniobytebuffer-javaiodatainputstream.html

Thanks Mike, this is interesting. I've found direct ByteBuffers can give sufficient performance for large byte arrays when intrinsics get applied. By the way, you should test across all JVMs; I was very surprised at the differences! Hotspot and Azul rock but the others (JRockit, IBM, et al.) are somewhat lacking.

Most JDK libraries now support the use of ByteBuffer and Channel; unfortunately, many 3rd party libraries only take a byte[] in their API. This is why Unsafe is more interesting to me. I can use it to manipulate *any* type of buffer I get a reference to.

Very nice article on a topic that is frequently overlooked, I think. On second thought, however, I wonder whether comparing JSE serialization to a custom protocol is of any value. Are we comparing apples to apples here? JSE serialisation only requires you to implement the correct interface. Your object can, but does not need to, have custom IO code at all. This is a pretty big contrast to the ByteBuffer and UnsafeMemory implementations. Secondly, JSE serialization has additional logic to serialize object references and (re)construct object graphs etc. I see no equivalent of that feature in the other 2 test cases. I have added a test case for the Hessian library, which gives results in line with those for JSE serialization.

Java Serialization is not a fair comparison at all. I added that test case for context to what people would likely do if they were not aware of custom binary protocols. I could have shown how much slower XML is even than Java Serialization to be really silly.

Yes, serialising a graph of objects does require additional code and this can be implemented with a simple key protocol. However in a high-performance environment, if you are passing complex graphs for basic communication then I'd have to counsel a step back and question the design.

The point of the article is to show how it is possible to get native like performance in a managed runtime environment like Java. If you are working in a high-performance domain then you need to be thinking like this. If performance is not an issue then ease of coding is probably your best investment.

Someone let me know when they've modified Kryo to use Unsafe in the com.esotericsoftware.kryo.io Output and Input classes. I'd love to see the results then. I'd do it myself, but I'm sure someone else has already thought of this minor mod.

Very good article. I'm a fan of Kryo's. But I've also ported "Python Construct", an excellent library for parsing and building custom binary protocols, to Java. Speed hasn't been a requirement so far, but if someone wants to develop the low level further and use Unsafe rather than ByteBuffer, that'd be a useful contribution:

It's interesting to see that I'm not the only person featuring the sun.misc.Unsafe idea. I started a Github project (Lightning) a few months ago to implement a serialization framework that uses as many Unsafe constructs as possible but can fail over to reflection. Another thing it does is generate marshaller/unmarshaller bytecode at runtime, to run at native speed the moment HotSpot jumps in. The last performance feature is the missing constructor invocation. This means ONLY value objects can be transferred with this serializer, but for most cases this is enough.

Another reason for the framework was a fail-fast principle: it is used in clusters and all cluster nodes need to have the same codebase. So the master node builds a class-definition container holding all information on registered/configured classes and attributes.

The current implementation only features JGroups integration, so when a new node connects, the cluster master transfers its own class-definition container to the node and the new cluster member tests its own classes against the definitions. If one class fails, the new node is automatically disconnected from the cluster.

The whole project started as a prototype to find out if my expectations about the speed improvement would be correct, but the implementation was good enough to make it a real project.

The building time is a one-time operation when generating the class-definition container and generating the bytecode marshallers. This is normally done at startup time of the cluster node. The possible differences in byte size depend on whether a field was randomized to null or filled with a value.

For interested people, I would like to see others joining the project or offering ideas on what to add or how to improve the framework. I'm open to questions and opinions as well. You can contact me at m e [at] n o c t a r i u s [DOT] c o m or on Google+. I hope to see some reaction from the blog's owner.

Interesting - I was thinking about something like this for Apache DirectMemory - but you're already ahead on the way ;) We now use protostuff as the default serializer - do you think you could get better figures?

I already thought about adding a serializer for DirectMemory (https://github.com/noctarius/Lightning/issues/10). It would be nice to have a chat about how to integrate both systems in a nice way :) Feel free to contact me at the above mail address.

Very timely article for software I am developing right now. However, we are trying to implement memory-mapped buffers to allow paging, in order to make our application work on devices with limited physical memory. Is there (a reasonable) way I could leverage Unsafe operations as an alternative to using a MappedByteBuffer?

You can use reflection to get the address of the memory inside the MappedByteBuffer, then use Unsafe to manipulate it. However, on some non-mainstream JVMs Unsafe may not have intrinsics applied, so you are better off using the ByteBuffer interface. Best to test for your own requirements and platform.
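A sketch of that trick, shown here with a direct ByteBuffer (a MappedByteBuffer works the same way, since both store their native address in the JDK-internal Buffer.address field). This relies on unsupported JDK internals and may break on other JVMs or future versions; I read the field through Unsafe to sidestep access checks:

```java
import java.lang.reflect.Field;
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.misc.Unsafe;

public final class DirectAddressExample
{
    private static final Unsafe UNSAFE;

    static
    {
        try
        {
            final Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            UNSAFE = (Unsafe)field.get(null);
        }
        catch (final Exception ex)
        {
            throw new RuntimeException(ex);
        }
    }

    // Buffer.address holds the native base address of direct/mapped buffers
    static long addressOf(final Buffer buffer) throws Exception
    {
        final Field field = Buffer.class.getDeclaredField("address");
        return UNSAFE.getLong(buffer, UNSAFE.objectFieldOffset(field));
    }

    public static void main(final String[] args) throws Exception
    {
        final ByteBuffer buffer = ByteBuffer.allocateDirect(64);
        final long address = addressOf(buffer);

        // Write through the raw address, read back through the buffer API
        UNSAFE.putLong(address, 42L);
        buffer.order(ByteOrder.nativeOrder());
        System.out.println(buffer.getLong(0));
    }
}
```

Note that writing through the raw address bypasses all bounds checking, and the address is only valid while the buffer (or mapping) is still reachable and not yet cleaned.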

Guys, I just came across this article which follows up on this one with some brief benchmarks: http://java.dzone.com/articles/fast-java-file-serialization

One surprising claim made in that article is that standard Java serialization to a File can be sped up by ~4X if you use a FileOutputStream(RandomAccessFile.getFD()) instead of the usual BufferedOutputStream(FileOutputStream).

What? That claim was news to me! Anyone else ever seen a performance claim like that before?

It cannot be because RandomAccessFile natively implements DataOutput, since a FileOutputStream wraps and hides that API...

I have found that benchmarking file I/O is really nasty due to disk caching. You gotta be super careful between benchmarks to do stuff to ruin the cache before doing a new benchmark. Perhaps these results are just not trustworthy?

I've implemented a serialization library that optionally uses Unsafe to speed things up. Additionally, it is possible to inject hints into the serialization algorithm using annotations. In a real project, doing handcrafted serialization of complex structures is costly in terms of man-days and resulting bugs. Theoretically it should be possible to achieve similar (or in most cases better) performance by using a mix of an efficient generic serialization implementation and annotation hints.

Not really an apples to apples comparison. For example, the byte buffer approach does not serialize the class type itself, in other words it is not polymorphic. It also uses a fixed size, which oftentimes is not possible, especially if the object is bigger than the allocated size.

For an apples to apples comparison you should use Externalizable, but only invoke the writeExternal and readExternal methods and never call #writeObject on the stream with the object to be externalized. Make sense? And of course, if you really want to be fair, you will either reuse the byte array for the ObjectOutputStream/InputStream or allocate a new ByteBuffer in the loop, as this is how most code would work.