Tag: sajson

Several holidays ago, I got a bee in my bonnet and wrote a fast JSON parser whose parsed AST fits in a single contiguous block of memory. The code was small and simple and the performance was generally on-par with RapidJSON, so I stopped and moved on with my life.

Well, at the end of 2016, Rich Geldreich shared that he’d also written a high-performance JSON parser.

A while back I wrote a *really* fast JSON parser in C++ for fun, much faster than RapidJSON (in 2012 anyway). Is this useful tech to anyone?

I dropped his pjson into my benchmarking harness and discovered it was twice as fast as both RapidJSON and sajson! Coupled with some major JSON parsing performance problems in a project at work, I was nerd-sniped again.

I started reading his code but nothing looked particularly extraordinary. In fact, several things looked like they ought to be slower than sajson… Oh wait, what’s this?

How do we think about the performance of this loop? Remember that mainstream CPUs are out-of-order and can execute four (or more!) instructions in parallel. So an approximate mental model for reasoning about CPU performance is that they understand multiple instructions per cycle, then stick them all into an execution engine that can execute multiple instructions per cycle. Instructions will execute simultaneously if they’re independent of each other. If an instruction depends on the result of another, it must wait N cycles, where N is the first instruction’s latency.

So, assuming all branches are correctly predicted (branch predictors operate in the frontend and don’t count towards dependency chains), let’s do a rough estimate of the cost of the loop above:

increment p is the only instruction on the critical path – it carries a dependency across iterations of the loop. Is there enough work to satisfy the execution resources during the increment? Well, the comparisons are independent so they can all be done in parallel, and there are four of them, so we can probably keep the execution units busy. But it does mean that, at most, we can only check one byte per cycle. In reality, we need to issue the load and increment too, so we’re looking at a loop overhead of about 2-3 cycles per byte.

Now let’s look more closely at Rich’s code.

Replacing the comparisons with a lookup table increases comparison latency (3-4 cycles latency from L1) and increased comparison throughput (multiple loads can be issued per cycle).

So let’s spell out the instructions for Rich’s code (reordered for clarity):

Again, the critical path is only add p, 4, but we still need to issue the other instructions. The difference is that now all of the loads happen in parallel and the comparisons for 4 bytes happen in parallel too, rather than doing four comparisons per byte.

It’s still hard to say if this is a win on paper — Haswell can only issue 2 loads per cycle. But the loads can overlap with the comparisons from previous bytes). However, we still have to issue all of these instructions. So maybe we’re looking at something closer to 2 cycles per byte?

Empirically, at least on my Broadwell, replacing four comparisons with a LUT was definitely a win. 55cc213

But is unrolling the loop necessary? If I take out the unrolling but leave the LUT, clang gets slower but gcc stays the same. I checked – neither gcc nor clang do any automatic unrolling here. What’s going on? Branch predictor effects? Also, Intel’s Intel Architecture Code Analyzer tool says that the unrolled and non-unrolled loops are both frontend-bound and have the same loop throughput.

I’m not yet convinced that a tight dependency chain is why unrolling is a win. More investigation is required here. But the important thing to keep in mind is to pay attention to the critical path latency through a loop body as well as the number of independent operations that can soak up spare execution bandwidth.

Lead Bullets

After playing around with LUTs and unrolling, I started implementing a bunch of small optimizations that I’d long known were available but didn’t consider to be that important.

Well, as it often turns out in performance projects, a sequence of 2% gains adds up to significant wins over time! If you’ve read Ben Horowitz’s lead bullets story or how SQLite achieved a 50% speed up, this will sound familiar.

Here are the optimizations that mattered:

Moved the input and output pointers into locals instead of members, which helps VC++ and Clang understand that they can be placed in registers. (gcc was already performing that optimization.) 71078d34a07c77

Store the tag bits at the bottom of the element index instead of the top, which avoids a shift on 64-bit. e7f2351

Static Branch Prediction

I also spent a bit of time on static branch prediction. It’s a questionable optimization; in theory, you should just use profile-guided optimization (PGO) and/or rely on the CPU’s branch predictors, but in practice few people actually bother to set up PGO. Plus, even though the CPU will quickly learn which branches are taken, the compiler doesn’t know. Thus, by using static branch prediction hints, the compiler can line up the instructions so the hot path flows in a straight line and all the cold paths are off somewhere else, sometimes saving register motion in the hot path.

I can’t recommend spending a lot of time on annotating your branches, but it does show up as a small but measurable win in benchmarks, especially on smaller and simpler CPUs.

Things I Didn’t Do

Unlike Rich’s parser and RapidJSON, I chose not to optimize whitespace skipping. Why? Not worth the code size increase – the first thing someone who cares about JSON parsing performance does is minify the JSON. b05082b

I haven’t yet optimized number parsing. Both RapidJSON and Rich’s parser are measurably faster there, and it would be straightforward to apply the techniques. But the documents I regularly deal with are strings and objects and rarely contain numbers.

Benchmark Results

Desktop

Dell XPS 13

iPhone SE

Atom D2700

The charts aren’t very attractive, but if you look closely, you’ll notice a few things:

Parsing JSON on modern CPUs can be done at a rate of hundreds of megabytes per second.

gcc does a much better job with RapidJSON than either clang or MSVC.

JSON parsing benefits from x64 – it’s not a memory-bound or cache-bound problem, and the extra registers help a lot.

The iPhone SE is not much slower than my laptop’s Broadwell. :)

The Remainder of the Delta

As you can see in the charts above, sajson is often faster than RapidJSON, but still not as fast as Rich’s pjson. Here are the reasons why:

sajson does not require the input buffer to be null-terminated, meaning that every pointer increment requires a comparison with the end of the buffer (to detect eof) in addition to the byte comparison itself. I’ve thought about changing this policy (or even adding a compile-time option) but I really like the idea that I can take a buffer straight from a disk mmap or database and pass it straight to sajson without copying. On the other hand, I measured about a 10% performance boost from avoiding length checks.

sajson sorts object keys so that object lookup takes logarithmic rather than linear time. The other high-performance parsers have linear-time object lookup by key. This is an optimization that, while not necessary for most use cases, avoids any accidental worst-case quadratic-time usage.

sajson’s contiguous AST design requires, for every array, shifting N words in memory where N is the number of elements in the array. The alternative would be to use growable arrays in the AST (requiring that they get shifted as the array is realloc’d). Hard to say how much this matters.

Aside: The “You Can’t Beat the Compiler” Myth

There’s this persistent idea that compilers are smart and will magically turn your code into something that maps efficiently to the machine. That’s only approximately true. It’s really hard for compilers to prove the safety (or profitability) of certain transformations and, glancing through the produced code for sajson, I frequently noticed just plain dumb code generation. Like, instead of writing a constant into a known memory address, it would load a constant into a register, then jump to another location to OR it with another constant, and then jump somewhere else to write it to memory.

Also, just look at the charts – there are sometimes significant differences between the compilers on the same code. Compilers aren’t perfect and they appreciate all the help you can give!

Benchmarking

Measuring the effect of microoptimizations on modern computers can be tricky. With dynamic clock frequencies and all of today’s background tasks, the easiest way to get stable performance numbers is to take the fastest time from all the runs. Run your function a couple thousand times and record the minimum. Even tiny optimizations will show up this way.

I also had my JSON benchmarks report MB/s, which normalizes the differences between test files of different sizes. It also helps understand parser startup cost (when testing small files) and the differences in parse speed between files with lots of numbers, strings, huge objects, etc.

Swift Bindings

Dropbox (primarily @aeidelson) contributed high-quality Swift bindings for sajson. The challenge in writing these bindings was finding a way to efficiently expose the sajson parse tree to Swift. It turns out that constructing Swift arrays and objects is REALLY expensive; we once benchmarked 10 ms in sajson’s parser and 400 ms of Swift data structure construction.

Fortunately, Swift has decent APIs for unsafely manipulating pointers and memory, and with those we implemented the ability to access sajson AST nodes through a close-to-natural ValueReader type.

An iOS team at Dropbox replaced JSONSerialization with sajson and cut their initial load times by two thirds!

Summary

I used to think JSON parsing was not something you ever wanted in your application’s critical path. It’s certainly not the kind of algorithm that modern computers love (read byte, branch, read byte, branch). That said, this problem has been beaten to death. We now have multiple parsers that can parse data at hundreds of megabytes per second — around the same rate as SHA-256! If we relaxed some of the constraints on sajson, it could even go faster.

So how fast was Rich’s parser, after all? When measured in Clang and MSVC, quite a lot, actually. But when measured in GCC, RapidJSON, sajson, and pjson were (and remain) very similar. Many of the differences come down to naive, manually-inlined code, which we know the compiler can reliably convert to a fast sequence of instructions. It’s annoying to eschew abstractions and even duplicate logic across several locations, but, in the real world, people use real compilers. For critical infrastructure, it’s worth taking on some manual optimization so your users don’t have to care.

Update

Arseny Kapoulkine pointed out an awesome trick to me: replace the last byte of the buffer with 0, which lets you avoid comparing against the end of the buffer. When you come across a 0 byte, if it’s the last one, treat it as if it’s the byte that got replaced. This would be a material performance win in sajson _and_ avoid the need for specifying null-terminated buffers.

This post concludes my obsession with JSON parsing. In fact, the entire reason I wrote a JSON parser was to show these graphs. I wanted to see whether I could write a JSON parser faster than any other when run in Emscripten. As vjson is typically faster, I did not succeed unless I requalify my goal as writing the fastest-in-Emscripten JSON parser with a useful parse tree.

This benchmark’s code is on GitHub. I encourage you to reproduce my results and search for flaws.

Native vs. Emscripten

Two things are immediately clear from this graph. First, native code is still 5-10x faster than Emscripten/JavaScript. Second, yajl and jansson are dramatically slower than rapidjson, sajson, and vjson. Native yajl and jansson are even slower than Emscripten’d sajson. Henceforth I will not include them.

Looking closely at the browser results, a third conclusion is clear. Firefox runs Emscripten’d code much faster than Chrome.

Finally, sajson consistently performs better than rapidjson in my Emscripten tests. And vjson always parses the fastest. I believe this is because Emscripten and browser JITs punish larger code sizes.

The previous graph only shows parse rates by browser and parser for a single file. Next let’s look at parse rates by file.

Yep, native code consistently wins. At this point I want to dig into differences between the browsers, so I will show the same graph but without native code.

Firefox vs. Chrome

Not only is Firefox consistently faster than Chrome, it’s faster by a factor of 2-5x!

Finally, here the same graph but normalized against Firefox 18.

If I were a Firefox JS developer, I’d be feeling pretty proud right now, assuming this experiment holds up to scrutiny. Even so, these results match my experience with Bullet/Emscripten in Chrome: Chrome takes a very long time to stabilize its JIT, and I’ve even seen it fail to stabilize, constantly deopting and then reoptimizing code. In contrast, Firefox may take longer to JIT up front, but performance is smooth afterwards.

Further work is necessary to test the hypothesis that code size is the biggest predictor of Emscripten performance.

Preemptive answers to predicted questions follow:

Well, duh. You shouldn’t use an Emscripten’d JSON parser. You should use the browser’s built-in JSON.parse function.

This isn’t about parsing JSON. This is about seeing whether the web can compete with native code under common workloads. After all, Mozillians have been claiming JavaScript is or will be fast enough to replace native apps. If parsing JSON through a blessed browser API is the only way to do it quickly, then developers are disempowered. What if a new format comes along? Must developers wait on the browser vendors? The web has to be better than that.

Shouldn’t you have compiled the Emscripten code with -fno-exceptions?

Yes. Oops. That said, after generating the graphs, I did recompile with -fno-exceptions and it made no difference to the benchmark results in either Chrome or Firefox.

With the caveat that each parser provides different functionality and access to the resulting parse tree, I benchmarked sajson, rapidjson, vjson, YAJL, and Jansson. My methodology was simple: given large-ish real-world JSON files, parse them as many times as possible in one second. To include the cost of reading the parse tree in the benchmark, I then iterated over the entire document and summed the number of strings, numbers, object values, array elements, etc.

The documents are:

apache_builds.json: Data from the Apache Jenkins installation. Mostly it’s a array of three-element objects, with string keys and string values.

update-center.json: Also from Jenkins though I’m not sure where I found it.

apache_builds.json, github_events.json, and instruments.json are pretty-printed with a great deal of interelement whitespace.

Now for the results. The Y axis is parses per second. Thus, higher is faster.

Core 2 Duo E6850, Windows 7, Visual Studio 2010, x86

Core 2 Duo E6850, Windows 7, Visual Studio 2010, AMD64

Atom D2700, Ubuntu 12.04, clang 3.0, AMD64

Raspberry Pi

Conclusions

sajson compares favorably to rapidjson and vjson, all of which stomp the C-based YAJL and Jansson parsers. 64-bit builds are noticeably faster: presumably because the additional registers are helpful. Raspberry Pis are slow. :)

There’s not much opportunity to take advantage of a modern CPU’s superscalar execution pipeline when parsing JSON. Parsing JSON generally involves reading a character and branching. Searching for a string’s ending quotes could be implemented with vector instructions (see strchr in Agner Fog’s optimization manual or this tutorial) but the ROI doesn’t seem high to me.

Could we use multiple threads? Maybe, if we split the input text into chunks and had each thread discover the next { or [ and optimistically begin parsing future elements. Almost certainly not worth it

Either way, sajson is approximately done. I will post benchmarks soon.

The least trivial algorithm for building sajson’s parse tree is
allocating (or should I say, reserving?) the space in the parse tree
for an array’s or object’s element list without knowing the length in
advance.

Let’s consider an eleven-character JSON text. Imagine we’ve parsed
three characters: [[[. At this point we know two things:
1) we can fit the parse tree in eleven words and 2) there are at least
three arrays.

We don’t know the length of the arrays, so we cannot begin writing the
parse tree records yet.

The file could be [[[0,0,0]]] or [[[[[0]]]]] or [[[0,[0]]]] all of
which have quite different parse tree representations.

My first attempt involved parsing in two passes. The first pass
scanned the JSON text for arrays and objects and temporarily stored
their lengths into safe locations in the parse tree array. The
second pass used that information to correctly lay out the parse tree.

Parsing in two passes worked but had two major disadvantages. First, it was
slow. The scanning phase was simpler than parsing, but not THAT
simpler. Since parsing involves reading one byte and
branching on its value, parsing in two phases was effectively
half the speed. Second, the scanning phase duplicated a fair amount
of parsing code, making it harder to reason about and maintain.

Mike Ralston and I worked out a simpler approach at the cost
of two memory writes per array/object element record.

The gist is: given a parse tree array of size N, start one pointer at
the front and one at the back. Call the one at the front temp, for
temporary storage, and the one at the back out, for the actual parse
tree data.

When encountering the beginning of an array or object, remember the
current temp pointer.

When encountering a scalar element (numbers, strings, etc.), place its
payload in the parse tree at out and its location in temp.

When encountering the end of an array or object, compare the currenttemp pointer to its value when beginning the array or object. The
difference is the length of the array or object. Now that we know the
length and each element reference, we can move the element references
out of temp and into out.

It may help to work through a simple example:

[[0],0]

The JSON text is 7 characters. Thus we have 7 words of parse tree to
work with:

[ ][ ][ ][ ][ ][ ][ ]
^ ^
temp out

Encountering the first [, we store the current value of temp (on the C stack).

Encountering the second [, we store the current value of temp (on the
C stack.)

At this point, nothing has been written to the parse tree.

Then we see the first zero and place its payload at out and its
type+location reference at temp.

[<Integer>:6][ ][ ][ ][ ][ ][0]
^ ^
temp out

Encountering the first ], we calculate the array length, 1, and move
the references from temp to out. We write
the new array’s location to temp, giving:

[<Array>:4][ ][ ][ ][1][<Integer>:2][0]
^ ^
temp out

We were careful to adjust the element references so they remain
relative to the array record.

We then encounter another zero and place its payload in out and
location in temp:

[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0]
^
temp out

Closing the outer array, we calculate its length (2), again move
the references from temp to out, and write the final array record.

[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]
^
out

out now gives us the root of the parse tree.

Eliminating the Recursion

You may have noticed the previous implementation stores the start
address of each array or object on the C stack. This is liable to
overflow in the case of a JSON file consisting of N [s followed by N
]s for some large N. The JSON standard allows parsers to limit the
maximum depth they handle, but we can do better.

It’s possible to eliminate the recursion by storing the value oftemp into the temp side of the parse tree at the start of every
array or object. When reaching the end of an array or object, its
length is calculated, the record is written into out, and the
previous value of temp is restored. If the previous value of temp
is a special root marker, parsing is complete.

Does the parse tree, even during construction, have room for these
outer references?

First, let’s examine the same example but where we store a reference
to the outer ‘in-construction’ object in temp:

An easy conceptualization is that the final size of an array
record will be 1+N records, including its length. The temporary array
storage is also 1+N, where we don’t yet know its length but we do have
a reference to the enclosing array or object. Thus, we have room for
outer references in the parse tree.

Actual Code

The result
is an implementation whose parsing loop is almost entirely inlined,
and on architectures with a reasonable number of registers (even
AMD64), very little spills to the stack.

sajson is available
under the MIT license, but at the time of this writing, it is
primarily a proof of concept. The API is not stable and it does not
support string escapes. It also needs a security review to guarantee
that no malformed inputs crash.

Last week, I described a JSON parse tree data structure that, worst
case, requires N words for N characters of JSON text. I want to
explain the algorithm used to generate said parse tree, but first I
will describe the parse tree data structure in detail.
Simultaneously, I will show that the parse tree will fit in the worst
case.

Given that value types are stored in 3-bit tags, it’s intuitive that N
characters of JSON data requires N words in the parse tree. Let’s
consider the parse tree representation of each JSON type individually:

Strings

Strings are represented in the parse tree by two pointers: one to the
beginning of the string and one to the end. In the source text, these
correspond to the string’s quotes. The empty string, "",
is two characters and consumes two words in the parse tree.

struct String {
size_t begin;
size_t end;
};

Arrays

Arrays are represented by their length followed by a relative offset
to each element.

The smallest object, {}, is 2 characters but its
representation in the parse tree is a single word.

Now consider an object of one element: {"":""}. Including the key
string (but not the value string!), the object is five characters in
the input text. Its parse tree representation is four words: the
length plus three for its element.

Each additional element in the input text adds four characters (a
comma, a colon, and two quotes) but requires only three words in the
parse tree.

Numbers: Integers

That’s because, on 32-bit architectures, doubles are two words. If
single-digit numbers such as 0 consumed two words in the
parse tree, [0,0,0,0] would not fit.

Storing integers more compactly, we see that the smallest integers use
one character of input text and one word of parse tree structure.

It’s not worth the complexity, but the astute may notice that if we
limit integers to 29 bits, we don’t need to consume words in the parse
tree at all.

Numbers: Doubles

On 32-bit architectures, doubles are stored (unaligned) into the parse
tree.

struct Double {
size_t first_half;
size_t second_half;
};

On 64-bit architectures, a double consumes a single word.

The shortest JSON doubles are a single digit
followed by a decimal point followed by a single digit
(example: 0.0) or a single digit with a single-digit
exponent (example: 9e9). It’s clear that they fit into
the parse tree.

Over the holiday break, as mental exercise, I wrote a
single-allocation JSON parser, sajson. Why
single-allocation? To me, software that fits within a
precise resource budget, especially memory, is elegant. Most C or
C++ JSON parsers allocate memory per node and use hash tables to store
objects. Even if said parsers use efficient pool allocators or hash
table implementations, they miss the forest for the trees.

Dynamic memory allocation has disadvantages: fragmentation,
cache locality, and thread contention are the common arguments
against. But I focused on a different issue: what is the worst case
memory usage to parse, say, a 200 MB JSON file? With a JSON parser
that dynamically allocates, it’s challenging to prove the worst case
consumption.

Before we calculate the worst case memory consumption of a JSON
parser, let’s cover some basics.

Parsers convert input text, a stream of characters, into a data
structure or event stream suitable for reading or processing in some
way. In this instance, sajson is a non-streaming dom-style parser in
that it translates a complete buffer of characters into a contiguous
parse tree that supports both enumeration and random access.

JSON has seven data types. Three are unit types: null, true, and
false. Two are scalars: numbers and strings. Finally, arrays and
objects are composites: they contain references to other values. The
root element of a JSON document can only be an array or object.

sajson’s goal is to convert a stream of JSON text into a contiguous data
structure containing an enumerable and randomly-accessible parse tree.

My first attempt defined the parsed representation of each value as a type
enumeration followed by the type’s payload.

Using a full size_t to store a 3-bit type constant is rather wasteful.
(Remember there are seven JSON types.) Because sajson only targets
32-bit and 64-bit architectures, each array or object element offset
has three bits to spare and thus can include the element’s type. The
document needs one bit to determine the type of the root element.
(Remember the root element must be an array or an object.)

A further optimization exists: rather than storing
all numbers as IEEE 64-bit doubles, we can add an extra type tag:
<Integer>. Single-digit JSON numbers must be integers, and thus
consume less structural storage.

Let’s consider the same example above with tagged element references,
where <tag>:offset delimits the tag from the offset.

With the above changes, the parse tree size is cut in half! It now
fits in sizeof(size_t) * input_length.

Next time I’ll describe the challenges in building said parse tree
without a-priori knowledge of array length. Here’s a hint: imagine
you know the input text is 20 characters long. The first three
characters are "[[["…