sjson-new and the prisoner of Azkaban

Within the sbt code base there are a few places where the persisted data is in the order of hundreds of megabytes that I suspect it becomes a performance bottleneck, especially on machines without an SSD drive.
Naturally, my first instinct was to start reading up on the encoding of Google Protocol Buffers to implement my own custom binary format.

microbenchmark using sbt-jmh

What I should've done first, is start benchmarking. Using @ktosopl (Konrad Malawski)'s sbt-jmh, setting up a microbenchmark is easy. All you have to do is pop that plugin into your build. and create a subproject that enables JmhPlugin.

This will let me compare different JSON backends under the same condition. Of course, there are many other parameters such as the hardware and the quality of the data that affects the performance metrics, but this should give me a ballpark idea.

custom binary format

Like I said benchmarking is where I should've started. Instead what I did was creating a new binary format that's inspired by Google Protocol Buffers and Apache Avro. See Binary Mode for the code.

Here is the binary message for 150: Int:

01 00 00 00 AC 02
----------- -----
tag

The first four bytes represent a tag. The first byte of the tag represents the wiretype, and reset of three bytes are used for the hash of field name. AC 02 is the ZigZag encoding of varint same as protobuf. This encoding uses fewer bytes for smaller values of integers.

Here's the binary message for "Hello":

07 00 00 00 05 48 65 6C 6C 6F
----------- -- --------------
tag len

The wiretype for String is 07, and the content is length-delimited UTF-8 String.

The first tag 01 96 44 87 is for a: Int. 01 is the wiretype for Int and 96 44 87 is derived from the murmurhash of "a".

02 is the ZigZag encoding of 1

The second tag 01 41 F9 E8 is for b: Int

04 is the ZigZag encoding of 2

The third tag 0A 00 00 00 is for the field name table

0C is the number of bytes (12) in this table, encoded as unsigned varint

Entry for the first tag. 01 61 is length-delimited String for "a"

Entry for the second tag. 01 62 is length-delimited String for "b"

The idea behind this is that I wanted the binary message to retain the flexibility of JSON object where a datatype can be evolved over time by adding new fields. At the same time, carrying around the field names for each entry seems wasteful, so here I'm using a fast hashing algorithm to shrink the field names into 24 bits. This is subject to hash collisions, but that's something I could detect during encoding.

When I pushed this pull request out, @xuwei-k (Yoshida-san)'s reaction was "why aren't you using MessagePack?"

MessagePack support

After implementing my own binary format, it didn't take too long to implement MessagePack support.
MessagePack packs a whole a lot more into the initial tag byte so things like null can be expressed in a single byte.

Note that I've used msgpack-java as the backend, so the performance might be different from the C++ implementation of msgpack.

gzipped Spray JSON

@fommil (Sam Halliday) had also told me that gzipped Spray JSON gets a better performance than binary formats. Now that we have the harness, we can compare head to head. Here's the benchmark result using Travis CI:

Both for saving and loading, the custom binary format is doing the worst in terms of time (234ms).
If you combine the saving and loading timing, overall plain Spray JSON using Jawn does the best (72ms), gzipped Spray JSON next (99ms), and then MessagePack (138ms).

In terms of the file size, Spray JSON is 1.4 MB, gzipped Spray JSON is 123 KB, MessgePack is 1.2 MB, and the custon binary message is 896 KB. Because the test data had a lot of repetition, it might have compressed too well, but still we get 1/10 of the file size for around 40% penalty in time. This is worth considering especially for machines that does not have SSD drives.

Scala JSON support

One of the motivations for me to write sjson-new to begin with was providing a transition path to use SLIP-28 Scala JSON AST.

The principal driver behind Scala JSON, Matthew de Detrich, has published a milestone for Scala JSON AST under "org.mdedetrich" %% "scala-json-ast" % "1.0.0-M1", so we can start trying it out. It doesn't include neither parser nor a pretty printer so I've brought Jawn facade in from Mathew's Jawn fork and a pretty printer from Spray. Without too much work, I was able to provide support for Scala JSON's "unsafe" AST.

The most tedious part is providing the JsonFormat for each datatype. But once that part is done, sjson-new can reuse the same protocol across different backends with no additional work. Here's the benchmark result:

This shows that Scala JSON using Jawn (63ms) is round tripping 12% faster than Spray JSON using Jawn (72ms). Similar trend continues for gzipped Scala JSON using Jawn (90ms) and gzipped Spray JSON using Jawn (99ms). The performance boost is likely coming from the use of Array in the "unsafe" AST compared to Vector in Spray JSON.

Edit: The above result was likely because of my bug. More recent result looks like this:

This shows that Scala JSON using Jawn (71ms) is round tripping 9% slower than Spray JSON using Jawn (65ms). Similar trend continues for gzipped Scala JSON using Jawn (86ms) and gzipped Spray JSON using Jawn (77ms).

The file sizes are identical, which should be no surprise since they are both based on the Spray JSON's compact printer.

emulating JNothing

There's still something we can learn from the binary protocols like Protocol Buffers:

For any non-repeated fields in proto3, or optional fields in proto2, the encoded message may or may not have a key-value pair with that tag number.

In proto3, repeated fields are packed by default. These function like repeated fields, but are encoded differently. A packed repeated field containing zero elements does not appear in the encoded message.

These techniques can be applied to JSON format as well. There are a lot of datatypes that has Option[A] fields, or some collection fields that default to None or empty value. Not including those empty fields into JSON object has several benefits.

First, this should allow some evolution of the schema because the reading side becomes more permissive. When JSON is missing a field, it will now fill in some default value. Second, this will make the size of JSON object smaller since we won't need to include field names for them.

Lift JSON/Json4s has a notion called JNothing that expresses a lack of value, but I think the modern thinking is to use Option[J] instead. This means changing JsonReader so it accepts Option[J] instead of J. Here's the modified JsonFormat for Int:

This covers the reading part. Next I added def addField[J](name: String, obj: A, builder: Builder[J]): Unit method to the JsonWriter. This allows the format to optionally omit the creation of a JSON field. Let's try using this:

As you can see, the JSON representation for Person("Bob", None) does not include the field for None value.

sjson-new 0.4.0

Since gzipping is faster, smaller, and likely more reliable than rolling out my own binary format, I've decided to not include that into 0.4.0. If you want to try the feature described in this post other than that, here's 0.4.0:

// To use sjson-new with Spray JSON
libraryDependencies +="com.eed3si9n"%%"sjson-new-spray"%"0.4.0"// To use sjson-new with Scala JSON
libraryDependencies +="com.eed3si9n"%%"sjson-new-scalajson"%"0.4.0"// To use sjson-new with MessagePack
libraryDependencies +="com.eed3si9n"%%"sjson-new-msgpack"%"0.4.0"