Monday, 5 May 2014

Simple Binary Encoding

Financial systems communicate by sending and receiving vast numbers of messages in many different formats. When people use terms like "vast" I normally think, "really... how many?" So let's quantify "vast" for the finance industry. Market data feeds from financial exchanges can typically emit tens or hundreds of thousands of messages per second, and aggregate feeds like OPRA can peak at over 10 million messages per second, with volumes growing year-on-year. This presentation gives a good overview.

In this crazy world we still see significant use of ASCII encoded presentations, such as FIX tag value, and some slightly more sane binary encoded presentations like FAST. Some markets even commit the sin of sending out market data as XML! Well, I cannot complain too much, as they have at times provided me a good income writing ultra-fast XML parsers.

Last year the CME, who are a member of the FIX community, commissioned Todd Montgomery, of 29West LBM fame, and myself to build the reference implementation of the new FIX Simple Binary Encoding (SBE) standard. SBE is a codec aimed at addressing the efficiency issues in low-latency trading, with a specific focus on market data. The CME, working within the FIX community, have done a great job of coming up with an encoding presentation that can be this efficient. Maybe a suitable atonement for the sins of past FIX tag value implementations. Todd and I worked on the Java and C++ implementations, and later we were helped on the .NET side by the amazing Olivier Deheurles at Adaptive. Working on a cool technical problem with such a team is a dream job.

SBE Overview

SBE is an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. Of the many applications I profile with performance issues, message encoding/decoding is often the most significant cost. I've seen many applications that spend significantly more CPU time parsing and transforming XML and JSON than executing business logic. SBE is designed to make this part of a system the most efficient it can be. SBE follows a number of design principles to achieve this goal. Adhering to these design principles sometimes means that features available in other codecs are not offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable length fields, such as strings, as fields grouped at the end of a message.

The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.

The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation to avoid the resulting issues in reclamation. This applies for both managed runtime and native languages. SBE is totally allocation free in all three language implementations.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. Such users would be better off with another codec more suited to string encoding.

Message Structure

A message must be capable of being read or written sequentially to preserve the streaming access design principle, i.e. with no need to backtrack. Some codecs insert location pointers for variable length fields, such as string types, that have to be indirected for access. This indirection comes at a cost of extra instructions plus losing the support of the hardware prefetchers. SBE's design allows for pure sequential access and copy-free native access semantics.

Figure 1

SBE messages have a common header that identifies the type and version of the message body to follow. The header is followed by the root fields of the message which are all fixed length with static offsets. The root fields are very similar to a struct in C. If the message is more complex then one or more repeating groups similar to the root block can follow. Repeating groups can nest other repeating group structures. Finally, variable length strings and blobs come at the end of the message. Fields may also be optional. The XML schema describing the SBE presentation can be found here.
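As a sketch of the wire layout described above, the following writes a simplified message sequentially into a buffer: a header, one fixed-length root field, then a variable-length string at the end. The header field names and sizes here are illustrative, not the actual SBE header definition.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative only: a simplified message with a header, one fixed-length
// root field, and a variable-length string at the end, written sequentially
// with no backtracking.
public class LayoutSketch
{
    public static int encode(final ByteBuffer buffer, final int modelYear, final String description)
    {
        buffer.order(ByteOrder.LITTLE_ENDIAN);

        // Message header: identifies the type and version of the body to follow.
        buffer.putShort((short)4);   // blockLength: size of the root fields block
        buffer.putShort((short)1);   // templateId
        buffer.putShort((short)7);   // schemaId
        buffer.putShort((short)0);   // version

        // Root (fixed-length) fields at static offsets, like a C struct.
        buffer.putInt(modelYear);

        // Variable-length data comes last: length prefix then bytes.
        final byte[] bytes = description.getBytes(StandardCharsets.US_ASCII);
        buffer.put((byte)bytes.length);
        buffer.put(bytes);

        return buffer.position(); // total encoded length
    }

    public static void main(final String[] args)
    {
        final ByteBuffer buffer = ByteBuffer.allocate(64);
        final int length = encode(buffer, 2014, "Ford");
        System.out.println("encoded " + length + " bytes");
    }
}
```

Because the variable-length data trails the fixed-length block, both writer and reader move forward through memory only, which is what keeps the hardware prefetchers engaged.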

SbeTool and the Compiler

To use SBE it is first necessary to define a schema for your messages. SBE provides a language independent type system supporting integers, floating point numbers, characters, arrays, constants, enums, bitsets, composites, grouped structures that repeat, and variable length strings and blobs.

A message schema can be input into the SbeTool and compiled to produce stubs in a range of languages, or to generate binary metadata suitable for decoding messages on-the-fly.

java [-Doption=value] -jar sbe.jar <message-declarations-file.xml>

SbeTool and the compiler are written in Java. The tool can currently output stubs in Java, C++, and C#.

Programming with Stubs

A full example of messages defined in a schema with supporting code can be found here. The generated stubs follow a flyweight pattern with instances reused to avoid allocation. The stubs wrap a buffer at an offset and then read it sequentially and natively.
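The generated stubs themselves are beyond the scope of this post, but the flyweight pattern they follow can be sketched by hand: a reusable object wraps a buffer at an offset and accesses fields natively at static offsets, with no per-message allocation. The Car fields and offsets below are invented for illustration, not taken from a real generated stub.

```java
import java.nio.ByteBuffer;

// Hand-rolled sketch of the flyweight pattern the generated stubs follow.
// One instance is reused across messages; wrap() just repositions it.
public class CarFlyweight
{
    private ByteBuffer buffer;
    private int offset;

    public CarFlyweight wrap(final ByteBuffer buffer, final int offset)
    {
        this.buffer = buffer;
        this.offset = offset;
        return this;
    }

    // Fixed fields live at static offsets relative to the wrap point,
    // so access compiles down to a plain indexed load or store.
    public CarFlyweight modelYear(final int year)
    {
        buffer.putInt(offset, year);
        return this;
    }

    public int modelYear()
    {
        return buffer.getInt(offset);
    }

    public CarFlyweight available(final boolean flag)
    {
        buffer.put(offset + 4, (byte)(flag ? 1 : 0));
        return this;
    }

    public boolean available()
    {
        return buffer.get(offset + 4) == 1;
    }
}
```

In steady state the only objects alive are the flyweight and the buffer; encoding another message is just another wrap() call.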

The generated code in all languages gives performance similar to casting a C struct over the memory.

On-The-Fly Decoding

The compiler produces an intermediate representation (IR) for the input XML message schema. This IR can be serialised in the SBE binary format to be used for later on-the-fly decoding of messages that have been stored. It is also useful for tools, such as a network sniffer, that will not have been compiled with the stubs. A full example of the IR being used can be found here.

Direct Buffers

SBE, via Agrona, provides an abstraction to Java, with the MutableDirectBuffer class, to work with buffers that are byte[], heap or direct ByteBuffer buffers, and off heap memory addresses returned from Unsafe.allocateMemory(long) or JNI. In low-latency applications, messages are often encoded/decoded in memory mapped files via MappedByteBuffer and can thus be transferred to a network channel by the kernel, avoiding user space copies.
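The idea behind such a buffer abstraction can be sketched minimally as follows. This is not Agrona's actual MutableDirectBuffer API, just an illustration of the principle: one interface over multiple backing stores, so codec stubs do not care where the memory lives.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch of the idea behind a direct buffer abstraction (not
// Agrona's actual API): codec code targets one interface whether the
// memory is a byte[] or a heap/direct ByteBuffer.
public interface Buffer32
{
    int getInt(int index);

    void putInt(int index, int value);

    static Buffer32 wrap(final byte[] bytes)
    {
        return wrap(ByteBuffer.wrap(bytes));
    }

    static Buffer32 wrap(final ByteBuffer bb)
    {
        final ByteBuffer ordered = bb.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        return new Buffer32()
        {
            public int getInt(final int index)
            {
                return ordered.getInt(index);
            }

            public void putInt(final int index, final int value)
            {
                ordered.putInt(index, value);
            }
        };
    }
}
```

Absolute, index-based access (rather than relative position-based access) is what lets a flyweight read and write fields at static offsets without mutating shared buffer state.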

C++ and C# have built-in support for direct memory access and do not require such an abstraction as the Java version does. A DirectBuffer abstraction was added for C# to support Endianness and encapsulate the unsafe pointer access.

Message Extension and Versioning

SBE schemas carry a version number that allows for message extension. A message can be extended by adding fields at the end of a block. Fields cannot be removed or reordered for backwards compatibility.

Extension fields must be optional, otherwise a newer template reading an older message would not work. Templates carry metadata for min, max, null, time unit, character encoding, etc.; these are accessible via static (class-level) methods on the stubs.
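The versioning rule above can be sketched as follows: a stub compiled against a newer schema checks the acting version carried in the message header and reports a field's null value when the message predates that field. The field name, null value, and version numbers here are illustrative, not from a real generated stub.

```java
// Sketch of how a stub for schema version 1 can read a version-0 message:
// a field added in version 1 reports its null value when the acting
// version of the message predates it. Names and values are illustrative.
public class VersionedDecoder
{
    public static final int FUEL_CAPACITY_NULL_VALUE = Integer.MIN_VALUE;
    private static final int FUEL_CAPACITY_SINCE_VERSION = 1;

    private final int actingVersion;      // taken from the message header
    private final int fuelCapacityOnWire; // value at the field's offset, if present

    public VersionedDecoder(final int actingVersion, final int fuelCapacityOnWire)
    {
        this.actingVersion = actingVersion;
        this.fuelCapacityOnWire = fuelCapacityOnWire;
    }

    public int fuelCapacity()
    {
        // Field not present in older messages: report the null value instead.
        if (actingVersion < FUEL_CAPACITY_SINCE_VERSION)
        {
            return FUEL_CAPACITY_NULL_VALUE;
        }

        return fuelCapacityOnWire;
    }
}
```

This is why extension fields must be optional: the null value is the only honest answer a decoder can give for a field the sender never wrote.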

Byte Ordering and Alignment

The message schema allows for precise alignment of fields by specifying offsets. Fields are by default encoded in Little Endian form unless otherwise specified in a schema. For maximum performance native encoding with fields on word aligned boundaries should be used. The penalty for accessing non-aligned fields on some processors can be very significant. For alignment one must consider the framing protocol and buffer locations in memory.
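To make the byte ordering concrete, the short demo below encodes the same 32-bit value in both orders and shows how the bytes land on the wire.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Demonstrates the wire difference between little- and big-endian
// encodings of the same 32-bit value.
public class EndianDemo
{
    public static byte[] encode(final int value, final ByteOrder order)
    {
        final ByteBuffer buffer = ByteBuffer.allocate(4).order(order);
        buffer.putInt(value);
        return buffer.array();
    }

    public static void main(final String[] args)
    {
        // 0x0A0B0C0D little endian on the wire: 0D 0C 0B 0A
        //            big endian on the wire:    0A 0B 0C 0D
        for (final byte b : encode(0x0A0B0C0D, ByteOrder.LITTLE_ENDIAN))
        {
            System.out.printf("%02X ", b);
        }
    }
}
```

When the schema's byte order matches the host CPU's native order (little endian on x86), the codec can read and write fields without any byte swapping.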

Message Protocols

I often see people complain that a codec cannot support a particular presentation in a single message. However this is often possible to address with a protocol of messages. Protocols are a great way to split an interaction into its component parts; these parts are then often composable across many interactions between systems. For example, the IR implementation of schema metadata is more complex than can be supported by the structure of a single message. We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder which can be implemented as a threaded interpreter with much less branching than the typical switch based state machines.

Protocol design is an area that most developers don't seem to get an opportunity to learn. I feel this is a great loss. The fact that so many developers will call an "encoding" such as ASCII a "protocol" is very telling. The value of protocols is so obvious when one gets to work with a programmer like Todd who has spent his life successfully designing protocols.

Stub Performance

The stubs provide a significant performance advantage over the dynamic OTF decoding. For accessing primitive fields we believe the performance is reaching the limits of what is possible from a general purpose tool. The generated assembly code is very similar to what a compiler will generate for accessing a C struct, even from Java!

Regarding the general performance of the stubs, we have observed that C++ has a very marginal advantage over the Java which we believe is due to runtime inserted Safepoint checks. The C# version lags a little further behind due to its runtime not being as aggressive with inlining methods as the Java runtime. Stubs for all three languages are capable of encoding or decoding typical financial messages in tens of nanoseconds. This effectively makes the encoding and decoding of messages almost free for most applications relative to the rest of the application logic.

Feedback

This is the first version of SBE and we would welcome feedback. The reference implementation is constrained by the FIX community specification. It is possible to influence the specification but please don't expect pull requests to be accepted that significantly go against the specification. Support for Javascript, Python, Erlang, and other languages has been discussed and would be very welcome.

Update: 08-May-2014

Thanks to feedback from Kenton Varda, the creator of GPB, we were able to improve the benchmarks to get the best performance out of GPB. Below are the results for the changes to the Java benchmarks.

The C++ GPB examples on optimisation show approximately a doubling of throughput compared to initial results. It should be noted that you often have to do the opposite in Java with GPB compared to C++ to get performance improvements, such as allocate objects rather than reuse them.

40 comments:

Martin, thank you for the article. Could you talk a bit more about this "We encode IR by first sending a template message providing an overview, followed by a stream of messages, each encoding the tokens from the compiler IR. This allows for the design of a very fast OTF decoder which can be implemented as a threaded interrupter with much less branching than the typical switch based state machines." Especially interested in the "threaded interrupter vs Switch based state machine" bit.

Might this be "threaded interpreter", which is an alternative to a switch-based interpreter? I always liked Forth, and apparently it can be more CPU-cache friendly (see stuff at http://www.complang.tuwien.ac.at/projects/interpreters.html)

Rather than encode the IR tokens as a finger tree we encode them as a stream. This stream can then be fed into a parser that, even with a Java implementation, can be implemented without using a single big switch statement. Too much unpredictable branching can really hurt CPU throughput. Branching is OK provided it is mostly predictable based on past statistics. By using recursion in Java it is also possible to make the OTF decoder allocation free. Recursion in this case is safe because we only need to recurse into nested repeating groups.

Debug the following example to see the IR being used and the parser in action.

Thanks Martin. I looked at the project a bit more in detail and had a brief thread on the Cap'n Proto boards too. So even though you do mention it, I think a fair comparison would highlight the difference in features, especially compared to something like Cap'n Proto where a lot of the same principles are used. Two things especially stick out:

i) No bounds checking in the CPP code as far as I can tell. This means that you probably only support trusted sources. Seems like you could perform Heartbleed-like attacks if you accept messages from the internet. Maybe the responsibility for these checks lies somewhere else?

ii) The sequential access requirement is a killer for some projects. You mention this very clearly and I understand that this is the norm for trading data, but some applications just can't live with this constraint. For example, imagine I want to represent my objects using SBE in a replicated object database. One of the replicas gets a query for only a particular field (and this is unpredictable); I need to iterate every field just to satisfy that query. Further, the CPP bindings at least don't prevent you from shooting yourself in the foot. You could easily call car.available() before car.modelYear() and it won't complain.

Cap'n Proto is a good project. There are many others. We just picked GPB as a comparison because it is so commonly used, which makes it a good baseline for showing people the difference. I could have chosen ASN.1 but not so many people know it.

i) The bounds checking reaction is fascinating in how badly people misunderstand Heartbleed and the like. Any codec could be used to window over a buffer from the network. However, any externally sourced input should be validated; this is the crux of the problem.

I think a check similar to the Java and C# side should be added to help prevent people being silly, but this is not a security issue. If people need this protection for security then I'd not trust them with any other part of a secure app. If you take your thinking to its conclusion then char* is not allowed in C/C++.

ii) The sequential access is actually more flexible than I outlined. It is best to be totally sequential, but SBE can allow arbitrary access to any field within a given block. Think of C structures and how they can be laid over memory. Each block has a C-like structure over it. If arbitrary access is required to fields across blocks then maybe you should be considering another codec and accept the costs that implementing that feature requires.

Re (i) If I send someone a buffer saying it can be cast to struct foo {int length; char* data} and they blindly believe the length part and then send the data back to me when requested later - it is a problem. Of course it's their fault and they should have validated the data. In SBE's case, since a separate part of the program (networking code) is allocating the buffer (char*) and knows its length, it needs to be able to tell the decoding code (which is generated) not to exceed the bounds when returning data from the getters. The decoding logic needs to know the size of the buffer and ensure that it doesn't reach for something out of bounds. To your point about char*, it is a pain isn't it? C/C++ allow a lot of things including returning pointers to stack allocated data, doesn't mean it's a good idea.

Re (ii) I see, good to know. So you just prefer sequential access because of locality within a block, but don't require it. That seems pretty workable.

I'm not disagreeing that safety can be improved by bounds checking the access, and do believe it can be done efficiently in the C++ case without noticeable impact for this class of problem. I agree with a sensible level of bounds checking for writing robust and safe code. I also believe in native languages people should have a choice. For the record I think the default should be bounds checking on for SBE. However there is no guarantee that the code calling SBE will pass the correct length.

I just cannot accept that it is an automatic security issue. There are a lot of responsibilities that come with native programming. Heartbleed, which everyone keeps quoting because it is what is front of mind, requires a service to take inputs from the network and return a range of memory without validating the inputs. Taking a packet off the network and reading its contents is a different thing. This is not blindly returning data to the network, nor is it allowing an overrun on write that corrupts the stack. Unfortunately our biases lead us to always want to fight the last war rather than look at the big picture.

I think bounds checking in this case helps developers to not make stupid mistakes, like use an insufficiently large buffer when reading from the network, however it does not prevent them from creating security issues.

Martin, I am talking about the returning case and not the storing case. I am exaggerating here but this is the problem I am talking about:

// Pseudo code.
char* data = malloc(someSize);
read(data, someSize); // All is fine till now.
Decoder* decoder = passDataToSBE(data, someSize); // SBE determines that the data is of the form struct Foo { int n; char* data } where n is more than someSize.
// Now the client demands to read some byte (in a separate request) that it just wrote and the server depends on SBE for the bytes. Server does this:
write(decoder.getDataAtIndex(i), sizeOf(char));
// SBE returns any byte (as long as it is less than n?), doesn't check if it goes beyond the boundary of the buffer (of someSize) it was supplied.

As I said I think bounds checking should be the default for C++ and it will help with this sort of scenario. However, and this is a big however, if this is the sort of programming people are doing when responding to network based requests then way more problems are about to come your way. Absolutely no checking at a semantic level of input to output parameters is beyond dumb.

Just think how brain-dead it would be to create a method such as decoder.getDataAtIndex(i). The decoder could have checks, but equally the buffer capacity could be passed in wrong by the idiot who wrote such a method. This is the level of craziness that caused Heartbleed. I totally get your point about the value of bounds checking and agree. Do you get mine, that this level of discussion deflects from the real issues around designing secure code?

I totally get your point. My example was intentionally horrendous and I don't claim that code of that quality can be protected merely by SBE or other libraries including extensive bounds checking. The interface of someMethod(void* foo, int length) is error prone and, like you pointed out, someone could pass the wrong buffer capacity. For networking code in serious native projects, I end up using higher-level classes similar to Netty's ByteBuf/Java ByteBuffer and wrap read/write calls in them so as to minimize errors like passing the wrong buffer capacity. But that doesn't protect people completely either. Native code, like you said, is dangerous and even projects by experienced programmers need serious security audits, but my point (and you seem to agree) is that in this day and age bounds checking is a sensible default for most libraries. I do agree with you that bounds checking is a tree-level strategy, and we need to look at the forest to understand how to write secure code.

Hi. I just read your blog on SBE. I would like to know how this compares to those: https://github.com/eishay/jvm-serializers Since ProtoBuf isn't quite the "fastest contender out there", when it comes to Java serialisation.

The last time I looked these benchmarks used POJOs which generate a lot of garbage and thus do not fit with the low-latency goals of SBE. If you know of a good low-latency targeted benchmark for encoding I'd love to give it a try.

We know Protobuf is not the fastest. We just picked it for illustration because it is one of the most common. I know the likes of Kryo are much faster than Protobuf. The Kryo folk have tried techniques from my blog in the past to great success.

OK. I do not know of another "well-known" benchmark for Java serialisation. I think Kryo is, or used to be, usually thought to be generally "the fastest". I think one can serialize manually with it, so I suspect a benchmark against "manual Kryo" would be "interesting".

Anyway, I had another question. I was wondering if the benefits achieved by designing to get the best out of the CPU would "port" to JavaScript? In other words, would a JavaScript version also show (significantly) lower latency than "popular" JS serialisation APIs, or is it only interesting for "compatibility"?

Actually fast-serialization is faster than Kryo for many (most) test cases (depends on data) and is mostly compatible with the original JDK implementation. I am currently adding a raw offheap/byte[] interface, as the use of JDK streams is a performance killer. It's also possible to do zero-copy serialization then.

An advantage of serialization is low effort. Frequently there is no explicit data-to-message conversion, as one may serialize application data structures without doing any in-between transformation/processing. Vice versa, an application frequently works directly with received deserialized messages, with rarely a need for 'parsing'/transformation. It really shines for generic approaches (RMI-alike stuff) and when complex interlinked data structures are transmitted (reference restoration). Of course the receiving side will have to allocate deserialized objects, so it's probably not well suited for the ultra-low jitter/latency arena.

Regarding a port to JS: I'd expect the JS world to be "convenience first", so somewhat handcrafted approaches probably would not get too much love (at least from the front-end devs' perspective ;) ).

Thank you for your reply. I notice SBE makes use of the Java Unsafe package. I see all those comments that mention it could change from version to version. What do you think about this compatibility issue?

A significant number of projects now use Unsafe, plus the core java.util.concurrent classes. There are plans to find an alternative for Java 9. SBE can look to support that alternative at that time.

SBE is designed for low-latency and we do not intend to support Java 6. If someone wanted to build a low-latency application, then one of the first things to do is upgrade to Java 7 or 8 for the performance benefits it brings, and not stay on an old unsupported version.

Martin, Is it possible to access fields by offset without reading all fields in the message? It seems like the functionality is missing from the first look to a generated Java code. Thanks for considering a new feature (if it is missing)!

Martin, Would you be open to a patch offering the ability to have default values for fields? The reason to add them would be to introduce the concept of forward compatibility like with Avro. Any thoughts on supporting an official set of resolution rules like Avro's?

References:
- http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
- http://docs.confluent.io/1.0.1/avro.html#serialization-and-evolution
- http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution

well written, very interesting; too bad it does not reflect the Java nightmare the generator code is, as well as the horror of the code it produces for C++ (in terms of readability, performance, maintainability, bloat). My manager unfortunately is sold on this piece of shit, so I am stuck supporting it. Simple example: the so-called token builder, who on earth wrote that? It is impossible to troubleshoot. Don't believe me - try to find (under 5 minutes) the existing bug in the code where individual values for an enum lose their description attribute values while being read from an XML config.

1. Do I have to maintain separate message schema for Little Endian servers and Big Endian servers?

2. When I use SBE with Aeron, is it always required to run on a Little Endian server as it said on design assumptions? (https://github.com/real-logic/aeron/wiki/Protocol-Specification#design-assumptions)

2. Aeron and SBE can run on little or big endian CPUs and do the necessary conversions. Many existing network protocols assumed big endian but this is no longer the most common platform. Things evolve.