Tag: c++

Just as all mainstream languages now have lambda functions, I predict sum types are the next construct to spread outward from the typed functional programming community to the mainstream. Sum types are very useful, and after living with them in Haskell for a while, I miss them deeply when using languages without them.

Fortunately, sum types seem to be catching on: both Rust and Swift have them, and it sounds like TypeScript’s developers are at least open to the idea.

I am writing this article because, while sum types are conceptually simple, most programmers I know don’t have hands-on experience with them and don’t have a good sense of their usefulness.

In this article, I’ll explain what sum types are, how they’re typically represented, and why they’re useful. I will also dispel some common misconceptions that cause people to argue sum types aren’t necessary.

What is a Sum Type?

Sum types can be explained in a couple of ways. First, I’ll compare them to product types, which are extremely familiar to all programmers. Then I’ll show how sum types look (unsafely) implemented in C.

Every language has product types – tuples and structs or records. They are called product types because they’re analogous to the cartesian products of sets. That is, int * float is the set of pairs of values (int, float). Each pair contains an int AND a float.

If product types correspond to AND, sum types correspond to OR. A sum type indicates a value that is either X or Y or Z or …

Let’s take a look at enumerations and unions and show how sum types are a safe generalization of the two. Sum types are a more general form of enumerations. Consider C enums:

enum Quality {
    LOW,
    MEDIUM,
    HIGH
};

A value of type Quality must be one of LOW, MEDIUM, or HIGH (excepting uninitialized data or unsafe casts — we are talking about C of course). C also has a union language feature:

union Event {
    struct ClickEvent ce;
    struct PaintEvent pe;
};

ClickEvent and PaintEvent share the same storage location in the union, so if you write to one, you ought not read from the other. Depending on the version of the C or C++ specification, the memory will either alias or your program will have undefined behavior. Either way, at any point in time, it’s only legal to read from one of the components of a union.

A sum type, sometimes called a discriminated union or tagged variant, is a combination of a tag (like an enum) and a payload per possibility (like a union).

In C, to implement a kind of sum type, you could write something like:
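A minimal sketch of such a hand-rolled tagged union, written in the C-like subset of C++. The field names type, click, and paint follow the surrounding text; the payload layouts and the describe helper are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Tag: records which member of the union is currently meaningful.
enum EventType { CLICK, PAINT };

struct ClickEvent { int x; int y; };
struct PaintEvent { unsigned color; };  // assumed payload: a packed RGB value

struct Event {
    EventType type;
    union {  // both payloads share the same storage
        ClickEvent click;
        PaintEvent paint;
    };
};

// Callers must remember to check `type` first; the compiler does not stop
// them from reading `paint` while `type` is CLICK.
std::string describe(const Event& event) {
    switch (event.type) {
    case CLICK:
        return "click at " + std::to_string(event.click.x) + "," +
               std::to_string(event.click.y);
    case PAINT:
        return "paint with color " + std::to_string(event.paint.color);
    }
    assert(false && "corrupt tag");
    return std::string();
}
```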

However, there is some risk here. Nothing prevents code from accessing .paint in the case that type is CLICK. At all times, every possible field in event is visible to the programmer.

A sum type is a safe formalization of this idea.

Sum Types are Safe

Languages like ML, Haskell, F#, Scala, Rust, Swift, and Ada provide direct support for sum types. I’m going to give examples in Haskell because the Haskell syntax is simple and clear. In Haskell, our Event type would look like this:

data Event = ClickEvent Int Int
           | PaintEvent Color

That syntax can be read as follows: there is a data type Event that contains two cases: it is either a ClickEvent containing two Ints or a PaintEvent containing a Color.

A value of type Event contains two parts: a small tag describing whether it’s a ClickEvent or PaintEvent followed by a payload that depends on the specific case. If it’s a ClickEvent, the payload is two integers. If it’s a PaintEvent, the payload is one color. The physical representation in memory would look something like [CLICK_EVENT_TAG][Int][Int] or [PAINT_EVENT_TAG][Color], much like our C code above. Some languages can even store the tag in the bottom bits of the pointer, which is even more efficient.

Now, to see what type of Event a value contains, and to read the event’s contents, you must pattern match on it.

Sum types, paired with pattern matching, provide nice safety guarantees. You cannot read x out of an event without first verifying that it’s a ClickEvent. You cannot read color without verifying it’s a PaintEvent. Moreover, the color value is only in scope when the event is known to be a PaintEvent.
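For readers more at home in C++, C++17’s std::variant gives a rough library-level approximation of the same idea. The sketch below is an illustration under assumptions, not a translation of the Haskell: the payload fields and the describe helper are invented here, and std::visit stands in for pattern matching.

```cpp
#include <cassert>
#include <string>
#include <variant>

struct ClickEvent { int x; int y; };
struct PaintEvent { unsigned color; };  // assumed payload: a packed RGB value

using Event = std::variant<ClickEvent, PaintEvent>;

// One overload per case. Each payload is only reachable inside the overload
// for its own case, and leaving a case out is a compile error in std::visit.
struct Describe {
    std::string operator()(const ClickEvent& click) const {
        return "click at " + std::to_string(click.x) + "," + std::to_string(click.y);
    }
    std::string operator()(const PaintEvent& paint) const {
        return "paint with color " + std::to_string(paint.color);
    }
};

std::string describe(const Event& event) {
    return std::visit(Describe{}, event);
}
```

Note that std::variant is weaker than a language-level sum type: std::get on the wrong alternative is still allowed to compile and only fails at run time.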

Sum Types are General

We’ve already discussed how sum types are more general than simple C-style enumerations. In fact, in a simple enumeration, since none of the options have payloads, the sum type can be represented as a single integer in memory. The following DayOfWeek type, for example, can be represented as efficiently as the corresponding C enum would.

Sum types can also be used to create nullable data types like C pointers or Java references. Consider F#’s option, or Rust’s Option, or Haskell’s Maybe:

data Maybe a = Nothing | Just a

(a is a generic type variable – that is, you can have a Maybe Int or Maybe Customer or Maybe (Maybe String)).

Appropriate use of Maybe comes naturally to programmers coming from Java or Python or Objective C — it’s just like using NULL or None or nil instead of an object reference except that the type signature of a data type or function indicates whether a value is optional or not.

When nullable references are replaced by explicit Maybe or Option, you no longer have to worry about NullPointerExceptions, NullReferenceExceptions, and the like. The type system enforces that required values exist and that optional values are safely pattern-matched before they can be dereferenced.

But then someone could accidentally call a different method on window outside of the if statement… reintroducing the problem. To help mitigate this possibility, C++ does allow introducing names inside of a conditional:

if (Window* window = get_focused_window()) {
    window->close_window();
}

Pattern matching on Maybe avoids this problem entirely. There’s no way to even call close_window unless an actual window is returned, and the variable w is never bound unless there is an actual focused window:

Thinking in Sum Types

Once you live with sum types for a while, they change the way you think. People coming from languages like Python or Java (myself included) to Haskell immediately gravitate towards tuples and Maybe since they’re familiar. But once you become accustomed to sum types, they subtly shift how you think about the shape of your data.

I’ll share a specific memorable example. At IMVU we built a Haskell URL library and we wanted to represent the "head" of a URL, which includes the optional scheme, host, username, password, and port. Everything before the path, really. This data structure has at least one important invariant: it is illegal to have a scheme with no host. But it is possible to have a host with no scheme in the case of protocol-relative URLs.

In hindsight, the structure is pretty ridiculous and hard to follow. But the idea is that, if the URL head is Nothing, then the URL is relative. If it’s Just Nothing, then the path is treated as absolute. If it’s Just (Just (Nothing, host)), then it’s a protocol-relative URL. Otherwise it’s a fully-qualified URL, and the head contains both a scheme and a host.

Now the cases are much clearer. And they have explicit names and appropriate payloads!
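To make the four cases concrete, here is a hypothetical C++17 sketch using std::variant. It is not the IMVU library’s actual type: the case names, the render_prefix helper, and the omission of username, password, and port are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <string>
#include <variant>

// One struct per case; each carries only the data that case can have.
struct Relative {};                                   // e.g. "images/a.png"
struct AbsolutePath {};                               // e.g. "/images/a.png"
struct ProtocolRelative { std::string host; };        // e.g. "//example.com/..."
struct FullyQualified { std::string scheme; std::string host; };

using UrlHead = std::variant<Relative, AbsolutePath, ProtocolRelative, FullyQualified>;

// Renders the text that precedes the path segments of a URL.
struct RenderPrefix {
    std::string operator()(const Relative&) const { return ""; }
    std::string operator()(const AbsolutePath&) const { return "/"; }
    std::string operator()(const ProtocolRelative& head) const {
        return "//" + head.host + "/";
    }
    std::string operator()(const FullyQualified& head) const {
        return head.scheme + "://" + head.host + "/";
    }
};

std::string render_prefix(const UrlHead& head) {
    return std::visit(RenderPrefix{}, head);
}
```

A scheme with no host simply cannot be constructed with these cases, so the invariant no longer needs to be checked by hand.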

Sum Types You Already Know

There are several sum types that every programmer has already deeply internalized. They’ve internalized them so deeply that they no longer think about the general concept. For example, "This variable either contains a valid reference to class T or it is null." We’ve already discussed optional types in depth.

Another common example is when functions can return failure conditions. A fallible function either returns a value or it returns some kind of error. In Haskell, this is usually represented with Either, where the error is on the Left. Similarly, Rust uses the Result type. It’s relatively rare, but I’ve seen Python functions that either return a value or an object that derives from Exception. In C++, functions that need to return more error information will usually return the error status by value and, in the case of success, copy the result into an out parameter.

Obviously languages with exceptions can throw an exception, but exceptions aren’t a general error-handling solution for the following reasons:

What if you temporarily want to store the result or error? Perhaps in a cache or promise or async job queue. Using sum types allows sidestepping the complicated issue of exception transferability.

Some languages either don’t have exceptions or limit where they can be thrown or caught.

If used frequently, exceptions generally have worse performance than simple return values.

I’m not saying exceptions are good or bad – just that they shouldn’t be used as an argument for why sum types aren’t important. :)

Another sum type many people are familiar with is values in dynamic languages like JavaScript. A JavaScript value is one of many things: either undefined or null or a boolean or a number or an object or… In Haskell, the JavaScript value type would be defined approximately as such:

Notice this type is recursive — Arrays and Objects can refer to other JSONValues.

Protocols, especially network protocols, are another situation where sum types frequently come up. Network packets will often contain a bitfield of some sort describing the type of packet, followed by a payload, just like discriminated unions. This same structure is also used to communicate over channels or queues between concurrent processes. Some example protocol definitions modeled as sum types:

Approximating Sum Types

If you’ve ever tried to implement a JSON AST in a language like C++ or Java or Go you will see that the lack of sum types makes safely and efficiently expressing the possibilities challenging. There are a couple ways this is typically handled. The first is with a record containing many optional values.

The implied invariant is that only one value is defined at a time. (And perhaps, in this example, JSON null is represented by all pointers in JSONValue being null.) The limitation here is that nothing stops someone from making a JSONValue where, say, both a and o are set. Is it an array? Or an object? The invariant is broken, so it’s ambiguous. This costs us some type safety. This approximation, by the way, is equivalent to Go’s errors-as-multiple-return-values idiom. Go functions return a result and an error, and it’s assumed (but not enforced) that only one is valid at a time.
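A sketch of what such a record might look like in C++, and of how easily the invariant breaks. The members a and o are named after the text above; b, n, s, and the two helper functions are assumptions added for illustration.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Every case is an optional (nullable) member; at most one should be set.
struct JSONValue {
    std::unique_ptr<bool> b;
    std::unique_ptr<double> n;
    std::unique_ptr<std::string> s;
    std::unique_ptr<std::vector<JSONValue>> a;
    std::unique_ptr<std::map<std::string, JSONValue>> o;
};

// Counts how many members are populated.
int fields_set(const JSONValue& v) {
    return (v.b ? 1 : 0) + (v.n ? 1 : 0) + (v.s ? 1 : 0) +
           (v.a ? 1 : 0) + (v.o ? 1 : 0);
}

// The invariant has to be checked at run time; the type does not enforce it.
bool is_unambiguous(const JSONValue& v) {
    return fields_set(v) <= 1;
}
```

Setting both a and o on the same value compiles without complaint, and only a run-time check like is_unambiguous can notice.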

Another approach to approximating sum types is using an interface and classes like the following Java:

To check the specific type of a JSONValue, you need a runtime type lookup, something like C++’s dynamic_cast or Go’s type switch.

This is how Go, and many C++ and Java JSON libraries, represent the AST. The reason this approach isn’t ideal is because there’s nothing stopping anyone from deriving new JSONValue classes and inserting them into JSON arrays or objects. This weakens some of the static guarantees: given a JSONValue, the compiler can’t be 100% sure that it’s only a boolean, number, null, string, array, or object, so it’s possible for the JSON AST to be invalid. Again, we lose type safety without sum types.

There is a third approach for implementing sum types in languages without direct support, but it involves a great deal of boilerplate. In the next section I’ll discuss how this can work.

The Expression Problem

Sometimes, when it comes up that a particular language doesn’t support sum types (as most mainstream languages don’t), people make the argument "You don’t need sum types, you can just use an interface for the type and a class for each constructor."

That argument sounds good at first, but I’ll explain generally that interfaces and sum types have different (and somewhat opposite) use cases.

As I mentioned in the previous section, it’s common in languages without sum types, such as Java and Go, to represent a JSON AST as follows:

As I also mentioned, this structure does not rigorously enforce that the ONLY thing in, say, a JSON array is a null, a boolean, a number, a string, an object, or another array. Some other random class could derive from JSONValue, even if it’s not a sensible JSON value. The JSON encoder wouldn’t know what to do with it. That is, interfaces and derived classes here are not as type safe as sum types, as they don’t enforce valid JSON.

With sum types, given a value of type JSONValue, the compiler and programmer know precisely which cases are possible. Thus, any code in the program can safely and completely enumerate the possibilities, and we can use JSONValue anywhere in the program without modifying the cases at all. But if we add a new case to JSONValue, then we potentially have to update all uses. That is, it is much easier to use the sum type in new situations than to modify the list of cases. (Imagine how much code you’d have to update if someone said "Oh, by the way, all pointers in this Java program can have a new state: they’re either null, valid, or lazy, in which case you have to force them." Remember that nullable references are a limited form of sum types. That would require a blood bath of code updates across all Java programs ever written.)

The opposite situation occurs with interfaces and derived classes. Given an interface, you don’t know what class implements it — code consuming an interface is limited to the API provided by the interface. This gives a different degree of freedom: it’s easy to add new cases (e.g. classes deriving from the interface) without updating existing code, but your uses are limited to the functionality exposed by the interface. To add new methods to the interface, all existing implementations must be updated.

This is the visitor pattern, which is another way to approximate sum types in languages without them. Visitor has the right maintainability characteristics (easy to add uses, hard to add cases), but it involves a great deal of boilerplate. It also requires two indirect function calls per pattern match, so it’s dramatically less efficient than a simple discriminated union would be. On the other hand, direct pattern matches of sum types can be as cheap as a tag check or two.
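For reference, a minimal C++ sketch of the visitor approach, restricted to two JSON cases to keep it short. All class names here are hypothetical, and the string encoder skips escaping.

```cpp
#include <cassert>
#include <string>
#include <utility>

struct JSONBool;
struct JSONString;

// One visit overload per case: adding a case means touching every visitor.
struct JSONVisitor {
    virtual ~JSONVisitor() {}
    virtual void visit(const JSONBool& value) = 0;
    virtual void visit(const JSONString& value) = 0;
};

struct JSONValue {
    virtual ~JSONValue() {}
    virtual void accept(JSONVisitor& visitor) const = 0;  // first indirect call
};

struct JSONBool : JSONValue {
    bool value;
    explicit JSONBool(bool v) : value(v) {}
    void accept(JSONVisitor& visitor) const override { visitor.visit(*this); }
};

struct JSONString : JSONValue {
    std::string value;
    explicit JSONString(std::string v) : value(std::move(v)) {}
    void accept(JSONVisitor& visitor) const override { visitor.visit(*this); }
};

// A "use" of the type: new uses are new visitor classes, no cases change.
struct Encoder : JSONVisitor {
    std::string output;
    void visit(const JSONBool& b) override { output = b.value ? "true" : "false"; }
    void visit(const JSONString& s) override { output = '"' + s.value + '"'; }
};

std::string encode(const JSONValue& value) {
    Encoder encoder;
    value.accept(encoder);  // accept() then calls visit(): second indirect call
    return encoder.output;
}
```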

Another reason visitor is not a good replacement for sum types in general is that the boilerplate is onerous enough that you won’t start "thinking in sum types". In languages with lightweight sum types, like Haskell and ML and Rust and Swift, it’s quite reasonable to use a sum type to reflect a lightweight bit of user interface state. For example, if you’re building a chat client, you may represent the current scroll state as:

This data type only has a distance from bottom when scrolled up, not pegged to the bottom. Building a visitor just for this use case is so much code that most people would sacrifice a bit of type safety and instead simply add two fields.

Another huge benefit of pattern matching sum types over the visitor pattern is that pattern matches can be nested or have wildcards. Consider a function that can either return a value or some error type. Haskell convention is that errors are on the Left and values are on the Right branch of an Either.

Named Variants? or Variants as Types?

Now I’d like to talk a little about how sum types are specified from a language design perspective.

Programming languages that implement sum types have to decide how the ‘tag’ of the sum type is represented in code. There are two main approaches languages take. Either the cases are given explicit names or each case is specified with a type.

Haskell, ML, Swift, and Rust all take the first approach. Each case in the type is given a name. This name is not a type – it’s more like a constant that describes which ‘case’ the sum type value currently holds. Haskell calls the names "data constructors" because they produce values of the sum type. From the Rust documentation:

Quit and ChangeColor are not types. They are values. Quit is a Message by itself, but ChangeColor is a function taking three ints and returning Message. Either way, the names Quit, ChangeColor, Move, and Write indicate which case a Message contains. These names are also used in pattern matches. Again, from the Rust documentation:

The other way to specify the cases of a sum type is to use types themselves. This is how C++’s boost.variant and D’s std.variant libraries work. An example will help clarify the difference. The above Rust code translated to C++ would be:

Types themselves are used to index into the variant. There are several problems with using types to specify the cases of sum types. First, it’s incompatible with nested pattern matches. In Haskell I could write something like:

You can see that, in C++, you can’t pattern match against MouseDown and LeftButton in the same match expression.

(Note: It might look like I could compare with == to simplify the code, but in this case I can’t because the pattern match extracts coordinates from the event. That is, the coordinates are a "wildcard match" and their value is irrelevant to whether that particular branch runs.)

Also, it’s so verbose! Most C++ programmers I know would give up some type safety in order to fit cleanly into C++ syntax, and end up with something like this:

Using types to index into variants is attractive – it doesn’t require adding any notion of data constructors to the language. Instead it uses existing language functionality to describe the variants. However, it doesn’t play well with type inference or pattern matching, especially when generics are involved. If you pattern match using type names, you must explicitly spell out each fully-qualified generic type, rather than letting type inference figure out what is what:

Compare to the following Haskell, where the error and success types are inferred and thus implicit:

result <- some_fallible_action
case result of
    Left e ->
        handleError e
    Right result ->
        handleSuccess result

There’s an even deeper problem with indexing variants by type: it becomes illegal to write variant<int, int>. How would you know if you’re referring to the first or second int? You might say "Well, don’t do that", but in generic programming that can be difficult or annoying to work around. Special-case limitations should be avoided in language design if possible – we’ve already learned how annoying void can be in generic programming.
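As a point of comparison, C++17’s std::variant does accept duplicate alternative types, but the cases can then only be told apart by position, which is effectively an unnamed tag. A small sketch, with the helper names first, second, and show invented for illustration:

```cpp
#include <cassert>
#include <string>
#include <variant>

using TwoInts = std::variant<int, int>;

// Construction has to say which position is meant.
TwoInts first(int value) { return TwoInts(std::in_place_index<0>, value); }
TwoInts second(int value) { return TwoInts(std::in_place_index<1>, value); }

std::string show(const TwoInts& v) {
    // std::get<int>(v) would not compile here: `int` names two alternatives.
    if (v.index() == 0) {
        return "first " + std::to_string(std::get<0>(v));
    }
    return "second " + std::to_string(std::get<1>(v));
}
```

The positions 0 and 1 carry the meaning that named cases would otherwise spell out.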

These are all solid reasons, from a language design perspective, to give each case in a sum type an explicit name. This could address many of the concerns raised with respect to adding sum types to the Go language. (See also this thread). The Go FAQ specifically calls out that sum types are not supported in Go because they interact confusingly with interfaces, but that problem is entirely sidestepped by named data constructors. (There are other reasons retrofitting sum types into Go at this point is challenging, but their interaction with interfaces is a red herring.)

It’s likely not a coincidence that languages with sum types and data constructors are the same ones with pervasive type inference.

Summary

I hope I’ve convinced you that sum types, especially when paired with pattern matching, are very useful. They’re easily one of my favorite features of Haskell, and I’m thrilled to see that new languages like Rust and Swift have them too. Given their utility and generality, I expect more and more languages to grow sum types in one form or another. I hope the language authors do some reading, explore the tradeoffs, and similarly come to the conclusion that Haskell, ML, Rust, and Swift got it right and should be copied, especially with respect to named cases rather than "union types". :)

To summarize, sum types:

provide a level of type safety not available otherwise.

have an efficient representation, more efficient than vtables or the visitor pattern.

give programmers an opportunity to clearly describe possibilities.

with pattern matching, provide excellent safety guarantees.

are an old idea, and are finally coming back into mainstream programming!

I don’t know why, but there’s something about the concept of sum types that makes them easy to dismiss, especially if you’ve spent your entire programming career without them. It takes experience living in a sum types world to truly internalize their value. I tried to use compelling, realistic examples to show their utility and I hope I succeeded. :)

Endnotes

Terminology

In this article, I’ve used the name sum type, but tagged variant, tagged union, or discriminated union are fine names too. The phrase sum type originates in type theory and is a denotational description. The other names are operational in that they describe the implementation strategy.

Terminology is important though. When Rust introduced sum types, they had to name them something. They happened to settle on enum, which is a bit confusing for people coming from languages where enums cannot carry payloads. There’s a corresponding argument that they should have been called union, but that’s confusing too, because sum types aren’t about sharing storage either. Sum types are a combination of the two, so neither keyword fits exactly. Personally, I’m partial to Haskell’s data keyword because it is used for both sum and product types, sidestepping the confusion entirely. :)

More Reading

If you’re convinced, or perhaps not yet, and you’d like to read more, some great articles have been written about the subject:

For what it’s worth, Ada has had a pretty close approximation of sum types for decades, but it did not spread to other mainstream languages. Ada’s implementation isn’t quite type safe, as accessing the wrong case results in a runtime error, but it’s probably close enough to safe in practice.

Much thanks goes to Mark Laws for providing valuable feedback and corrections. Of course, any errors are my own.

In this talk, you’ll learn where Embind fits in the overall space of solutions for connecting C++ and JavaScript, why generated code size is so important, how Embind works hard to keep code size small, and several of the C++11 techniques I learned for this project.

cppreference.com does a great job documenting the semantics of the various cast operators, so I won’t cover them here. Instead, let’s define some rules that will keep you and your teammates sane and prevent a class of bug-causing accidents.

Enable Warnings For Implicit Conversions

Enable your compiler’s warnings for implicit conversions of float -> int, int -> float, int -> char, and so on. On the hopefully-rare occasion you actually want said conversion, use a static_cast.

Also, consider restructuring your code so type knowledge is explicit. Sometimes, instead of a single list of base class pointers, it works out better to store a list per subtype. That is, instead of

std::vector<Material*> materials;

the following design might work a little more smoothly in practice:

std::vector<StaticMaterial*> staticMaterials;
std::vector<DynamicMaterial*> dynamicMaterials;

If you’re converting between floats and ints all the time, see if you can stick with one or the other. Some platforms severely penalize said conversions.

In general, frequent use of cast operators indicates the software’s design can be improved.

Use Weaker Casts

Use the most restrictive cast operator for the situation. Casts should be precise, specific, and should fail if the code is changed in a way that makes the conversion meaningless.

Prefer static_cast over reinterpret_cast

static_cast will give a compile-time error if attempting to convert C* to D* where neither C nor D derive from each other. reinterpret_cast and C-style casts would both allow said conversion.
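A small sketch of the difference, using hypothetical Shape, Circle, and Texture types. The two rejected or dangerous casts are left as comments so the block compiles.

```cpp
#include <cassert>

struct Shape { virtual ~Shape() {} };

struct Circle : Shape {
    int radius;
    explicit Circle(int r) : radius(r) {}
};

struct Texture { int id; };  // unrelated to Shape

int radius_of(Shape* shape) {
    // Allowed: Circle derives from Shape. The caller must know that `shape`
    // really points at a Circle; static_cast performs no run-time check.
    return static_cast<Circle*>(shape)->radius;

    // Texture* t = static_cast<Texture*>(shape);
    //   -> compile error: Shape and Texture are unrelated types.
    // Texture* t = reinterpret_cast<Texture*>(shape);
    //   -> compiles, but reading through `t` is undefined behavior.
}
```

If Circle later stops deriving from Shape, the static_cast stops compiling, which is exactly the failure you want from a refactor.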

Prefer reinterpret_cast over C-style Casts

reinterpret_cast does not allow casting away constness. You must use const_cast for that. C-style casts, again, let you do anything. Prefer the weaker cast operation.

Avoid const_cast

I’m not sure I’ve ever seen a justified use of const_cast. It’s almost always some kind of clever hack. Just get your constness right in the first place.

Avoid C-style Casts

I’m going to repeat myself. Don’t use C-style casts. They let anything by. When you refactor something and you find yourself debugging some insane issue that should have been a compiler error, you’ll wish you used a weaker cast operator.

Don’t Cast Function Pointers to void*

The web is an asynchronous world built on asynchronous APIs. Thus, typical web applications are full of callbacks. onload and onerror for XMLHttpRequest, the callback argument to setTimeout, and messages from Web Workers are common examples.

Using asynchronous APIs is relatively natural in JavaScript. JavaScript is garbage-collected and supports anonymous functions that close over their scope.

However, when writing Emscripten-compiled C++ to interact with asynchronous web APIs, callbacks are less natural. First, JavaScript must call into a C++ interface. Embind works well for this purpose. Then, the C++ must respond to the callback appropriately. This post is about the latter component: writing efficient and clean asynchronous code in C++11.

I won’t go into detail here about how that works, but imagine you have an interface for fetching URLs with XMLHttpRequest.

Imagine Response is a struct that embodies the HTTP response and onLoaded runs when the XMLHttpRequest ‘load’ event fires.

To fetch data from the network, you would instantiate an implementation of the XHRCallback interface and pass it into the XHR object. I’m not going to cover in this article how to connect these interfaces up to JavaScript, but instead we will look at various implementations of XHRCallback on the C++ side.

For the purposes of this example, let’s imagine we want to fetch some JSON, parse it, and store the result in a model.

Approach 1

A simple approach is to write an implementation of the interface that knows about the destination Model and updates it after parsing the body. Something like:

Ignoring the implementation of LambdaXHRCallback, the API’s a little cleaner to use. This approach requires backing the callback interface with an implementation that delegates to a std::function. The std::function can be bound to a local lambda, keeping the callback logic lexically near the code issuing the request.

From a clarity perspective, this is an improvement. However, because Emscripten requires that your customers download and parse the entire program during page load (in some browsers, parsing happens on every pageload!), code size is a huge deal. Even code in rarely-used code paths is worth paying attention to.

std::function, being implemented with its own abstract “implementation-knowledge-erasing” interface that is allocated upon initialization or assignment, tends to result in rather fat code. The default 16-byte backing storage in 32-bit libc++ doesn’t help either.

Can we achieve clear asynchronous code without paying the std::function penalty? Yes, in fact!

But… but… there are templates here, how is that any better than std::function? Well, first of all, now we only have one virtual call: the XHRCallback interface itself. Previously, we would have a virtual call into LambdaXHRCallback and then again through the std::function.

Second, in C++11, lambdas are syntax sugar for an anonymous class type with an operator(). Since the lambda’s immediately given to the LambdaXHRCallback template and stored directly as a member variable, in practice, the types are merged during link-time optimization.
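A minimal sketch of how such a template might look. The XHRCallback, onLoaded, Response, and LambdaXHRCallback names come from the text; the fields of Response and the makeXHRCallback helper are assumptions, and the XHR object that would own the callback is left out.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct Response {  // assumed shape of the HTTP response
    int status;
    std::string body;
};

struct XHRCallback {
    virtual ~XHRCallback() {}
    virtual void onLoaded(const Response& response) = 0;  // the only virtual call
};

// Stores the lambda's own anonymous class type directly as a member, so no
// std::function and no second layer of type erasure is involved.
template <typename F>
class LambdaXHRCallback : public XHRCallback {
public:
    explicit LambdaXHRCallback(F f) : f_(std::move(f)) {}
    void onLoaded(const Response& response) override { f_(response); }

private:
    F f_;
};

// Deduces F so that callers never have to name the lambda's type.
template <typename F>
std::unique_ptr<XHRCallback> makeXHRCallback(F f) {
    return std::unique_ptr<XHRCallback>(new LambdaXHRCallback<F>(std::move(f)));
}
```

The callback logic still sits lexically next to the request, for example makeXHRCallback([&](const Response& r) { /* parse r.body */ }).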

I ported a dozen or so network callbacks from std::function to the template lambda implementation and saw a 39 KB reduction in the size of the resulting minified JavaScript.

I won’t go so far as to recommend avoiding std::function in Emscripten projects, but I would suggest asking whether there are better ways to accomplish your goals.

For it being 2013 and as much as Herb Sutter has talked about C++11, it’s surprisingly hard to get an off-the-shelf C++11 development toolchain on Windows, at least as of today. By off-the-shelf I mean suitable for an engineering team to get up and running quickly. Of course I could perform unnatural acts and compile my own packages of whatever, but no thanks.

Cygwin runs gcc 4.5 which is too old for most C++11 features. Cygwin does provide a clang 3.1 package, but it uses the gcc 4.5 libstdc++ headers, lacking most of C++11’s standard library.

I could attempt to compile my own libcxx but libcxx is only known to work on Mac OS X.

In November, Microsoft released a Community Technology Preview increasing Visual Studio 2012’s C++11 support but it requires modifying your project to use the CTP toolchain. I’m using SCons and I have no idea how to convince it to use the CTP.

With the caveat that each parser provides different functionality and access to the resulting parse tree, I benchmarked sajson, rapidjson, vjson, YAJL, and Jansson. My methodology was simple: given large-ish real-world JSON files, parse them as many times as possible in one second. To include the cost of reading the parse tree in the benchmark, I then iterated over the entire document and summed the number of strings, numbers, object values, array elements, etc.
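The timing loop can be sketched roughly as below. This is not the actual benchmark harness: runs_within is a hypothetical helper, and the work passed in would be "parse the document, then walk the tree" for each library.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Calls `work` repeatedly until the time budget is spent and returns the
// number of completed calls. The work always runs at least once.
template <typename Work>
std::size_t runs_within(Work&& work, std::chrono::milliseconds budget) {
    const auto deadline = std::chrono::steady_clock::now() + budget;
    std::size_t runs = 0;
    do {
        work();
        ++runs;
    } while (std::chrono::steady_clock::now() < deadline);
    return runs;
}
```

With a budget of std::chrono::milliseconds(1000), the return value is the parses-per-second figure.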

The documents are:

apache_builds.json: Data from the Apache Jenkins installation. Mostly it’s an array of three-element objects, with string keys and string values.

update-center.json: Also from Jenkins though I’m not sure where I found it.

apache_builds.json, github_events.json, and instruments.json are pretty-printed with a great deal of interelement whitespace.

Now for the results. The Y axis is parses per second. Thus, higher is faster.

Core 2 Duo E6850, Windows 7, Visual Studio 2010, x86

Core 2 Duo E6850, Windows 7, Visual Studio 2010, AMD64

Atom D2700, Ubuntu 12.04, clang 3.0, AMD64

Raspberry Pi

Conclusions

sajson compares favorably to rapidjson and vjson, all of which stomp the C-based YAJL and Jansson parsers. 64-bit builds are noticeably faster: presumably because the additional registers are helpful. Raspberry Pis are slow. :)

The least trivial algorithm for building sajson’s parse tree is
allocating (or should I say, reserving?) the space in the parse tree
for an array’s or object’s element list without knowing the length in
advance.

Let’s consider an eleven-character JSON text. Imagine we’ve parsed
three characters: [[[. At this point we know two things:
1) we can fit the parse tree in eleven words and 2) there are at least
three arrays.

We don’t know the length of the arrays, so we cannot begin writing the
parse tree records yet.

The file could be [[[0,0,0]]] or [[[[[0]]]]] or [[[0,[0]]]] all of
which have quite different parse tree representations.

My first attempt involved parsing in two passes. The first pass
scanned the JSON text for arrays and objects and temporarily stored
their lengths into safe locations in the parse tree array. The
second pass used that information to correctly lay out the parse tree.

Parsing in two passes worked but had two major disadvantages. First, it was
slow. The scanning phase was simpler than parsing, but not THAT
much simpler. Since parsing involves reading one byte and
branching on its value, parsing in two phases was effectively
half the speed. Second, the scanning phase duplicated a fair amount
of parsing code, making it harder to reason about and maintain.

Mike Ralston and I worked out a simpler approach at the cost
of two memory writes per array/object element record.

The gist is: given a parse tree array of size N, start one pointer at
the front and one at the back. Call the one at the front temp, for
temporary storage, and the one at the back out, for the actual parse
tree data.

When encountering the beginning of an array or object, remember the
current temp pointer.

When encountering a scalar element (numbers, strings, etc.), place its
payload in the parse tree at out and its location in temp.

When encountering the end of an array or object, compare the current
temp pointer to its value when beginning the array or object. The
difference is the length of the array or object. Now that we know the
length and each element reference, we can move the element references
out of temp and into out.

It may help to work through a simple example:

[[0],0]

The JSON text is 7 characters. Thus we have 7 words of parse tree to
work with:

[ ][ ][ ][ ][ ][ ][ ]
 ^                 ^
temp              out

Encountering the first [, we store the current value of temp (on the C stack).

Encountering the second [, we again store the current value of temp (on the
C stack).

At this point, nothing has been written to the parse tree.

Then we see the first zero and place its payload at out and its
type+location reference at temp.

[<Integer>:6][ ][ ][ ][ ][ ][0]
              ^           ^
             temp        out

Encountering the first ], we calculate the array length, 1, and move
the references from temp to out. We write
the new array’s location to temp, giving:

[<Array>:4][ ][ ][ ][1][<Integer>:2][0]
            ^     ^
           temp  out

We were careful to adjust the element references so they remain
relative to the array record.

We then encounter another zero and place its payload in out and
location in temp:

[<Array>:4][<Integer>:3][ ][0][1][<Integer>:2][0]
                         ^
                     temp out

Closing the outer array, we calculate its length (2), again move
the references from temp to out, and write the final array record.

[2][<Array>:4][<Integer>:3][0][1][<Integer>:2][0]
 ^
out

out now gives us the root of the parse tree.
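
The walkthrough above can be condensed into a small sketch. This is not sajson's actual code: it handles only arrays and single-digit integers, assumes well-formed input without whitespace, and packs each reference as a location shifted left by three bits with the tag in the low bits, which is an assumption made for illustration. The names Tag, make_ref, ParseResult, Builder, and build_parse_tree are hypothetical as well. In this sketch, out_ is the index of the most recently written word.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical 3-bit tags; the values are chosen for illustration only.
enum Tag : std::size_t { TAG_INTEGER = 0, TAG_ARRAY = 1 };

// Assumed encoding: location in the high bits, tag in the low three bits.
constexpr std::size_t make_ref(Tag tag, std::size_t location) {
    return (location << 3) | tag;
}

struct ParseResult {
    std::vector<std::size_t> tree;  // one word per input character
    std::size_t root;               // reference (tag + location) of the root value
};

class Builder {
public:
    explicit Builder(const std::string& text)
        : text_(text), tree_(text.size(), 0), out_(text.size()) {}

    ParseResult run() {
        const std::size_t root = parse_value();
        return ParseResult{tree_, root};
    }

private:
    // Parses one value, writes its record at the back of the tree, and
    // returns a reference holding its tag and absolute location.
    std::size_t parse_value() {
        const char c = text_[pos_++];
        if (c != '[') {
            assert(c >= '0' && c <= '9');
            tree_[--out_] = static_cast<std::size_t>(c - '0');
            return make_ref(TAG_INTEGER, out_);
        }
        const std::size_t base = temp_;  // remembered on the C stack
        if (text_[pos_] == ']') {
            ++pos_;  // empty array
        } else {
            for (;;) {
                const std::size_t ref = parse_value();
                tree_[temp_++] = ref;  // park the element reference at the front
                const char next = text_[pos_++];
                if (next == ']') break;
                assert(next == ',');
            }
        }
        const std::size_t length = temp_ - base;
        const std::size_t record = out_ - length - 1;  // where the array record will start
        // Move the references from temp to out, last element first, making
        // each location relative to the array record.
        for (std::size_t i = length; i-- > 0;) {
            const std::size_t ref = tree_[base + i];
            tree_[--out_] = make_ref(static_cast<Tag>(ref & 7), (ref >> 3) - record);
        }
        tree_[--out_] = length;
        temp_ = base;  // release the temporary storage
        return make_ref(TAG_ARRAY, out_);
    }

    const std::string& text_;
    std::vector<std::size_t> tree_;
    std::size_t pos_ = 0;
    std::size_t temp_ = 0;
    std::size_t out_;
};

inline ParseResult build_parse_tree(const std::string& text) {
    return Builder(text).run();
}
```

Calling build_parse_tree("[[0],0]") fills the seven words the same way as the final diagram, with the root array record at index 0. The caller of parse_value decides whether to park the returned reference in temp, which is why the root never occupies a temp slot.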

Eliminating the Recursion

You may have noticed the previous implementation stores the start
address of each array or object on the C stack. This is liable to
overflow in the case of a JSON file consisting of N [s followed by N
]s for some large N. The JSON standard allows parsers to limit the
maximum depth they handle, but we can do better.

It’s possible to eliminate the recursion by storing the value of
temp into the temp side of the parse tree at the start of every
array or object. When reaching the end of an array or object, its
length is calculated, the record is written into out, and the
previous value of temp is restored. If the previous value of temp
is a special root marker, parsing is complete.

Does the parse tree, even during construction, have room for these
outer references?

First, consider the same example, but where we store a reference
to the outer ‘in-construction’ object in temp.

An easy conceptualization is that the final size of an array
record will be 1+N records, including its length. The temporary array
storage is also 1+N, where we don’t yet know its length but we do have
a reference to the enclosing array or object. Thus, we have room for
outer references in the parse tree.
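
A minimal sketch of the non-recursive variant follows, under the same simplifications as before: only arrays and single-digit integers, well-formed input without whitespace, and an assumed reference encoding with the tag in the low three bits. It is not sajson's implementation. ROOT_MARKER, the variable open, and the function name are hypothetical. Each [ writes a reference to the enclosing in-construction array into temp, and each ] reads it back before the slot can be overwritten.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical 3-bit tags and reference encoding, for illustration only.
enum Tag : std::size_t { TAG_INTEGER = 0, TAG_ARRAY = 1 };

constexpr std::size_t make_ref(Tag tag, std::size_t location) {
    return (location << 3) | tag;
}

// Hypothetical marker stored where the root has no enclosing array.
constexpr std::size_t ROOT_MARKER = static_cast<std::size_t>(-1);

struct ParseResult {
    std::vector<std::size_t> tree;
    std::size_t root;  // reference (tag + location) of the root value
};

inline ParseResult build_parse_tree_iterative(const std::string& text) {
    std::vector<std::size_t> tree(text.size(), 0);
    std::size_t temp = 0;
    std::size_t out = text.size();
    std::size_t open = ROOT_MARKER;  // temp slot of the innermost open array
    std::size_t root = ROOT_MARKER;

    for (std::size_t pos = 0; pos < text.size(); ++pos) {
        const char c = text[pos];
        std::size_t ref = 0;
        if (c == ',') {
            continue;
        } else if (c == '[') {
            tree[temp] = open;  // remember the enclosing array in temp
            open = temp++;
            continue;
        } else if (c == ']') {
            assert(open != ROOT_MARKER);
            // Read the saved reference first: the record written below may
            // land on this very slot.
            const std::size_t parent = tree[open];
            const std::size_t length = temp - open - 1;
            const std::size_t record = out - length - 1;
            for (std::size_t i = length; i-- > 0;) {
                const std::size_t r = tree[open + 1 + i];
                tree[--out] = make_ref(static_cast<Tag>(r & 7), (r >> 3) - record);
            }
            tree[--out] = length;
            temp = open;    // release this array's temp storage
            open = parent;  // restore the enclosing array
            ref = make_ref(TAG_ARRAY, out);
        } else {
            assert(c >= '0' && c <= '9');
            tree[--out] = static_cast<std::size_t>(c - '0');
            ref = make_ref(TAG_INTEGER, out);
        }
        if (open == ROOT_MARKER) {
            root = ref;  // nothing encloses this value, so parsing is complete
        } else {
            tree[temp++] = ref;
        }
    }
    return ParseResult{tree, root};
}
```

Because the nesting state lives in the parse tree rather than on the C stack, an input of many [s followed by as many ]s is handled by the same loop without deep recursion.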

Actual Code

The result is an implementation whose parsing loop is almost entirely
inlined, and on architectures with a reasonable number of registers
(even AMD64), very little spills to the stack.

sajson is available
under the MIT license, but at the time of this writing, it is
primarily a proof of concept. The API is not stable and it does not
support string escapes. It also needs a security review to guarantee
that no malformed inputs crash.

Last week, I described a JSON parse tree data structure that, worst
case, requires N words for N characters of JSON text. I want to
explain the algorithm used to generate said parse tree, but first I
will describe the parse tree data structure in detail.
Simultaneously, I will show that the parse tree will fit in the worst
case.

Given that value types are stored in 3-bit tags, it’s intuitive that N
characters of JSON data require N words in the parse tree. Let’s
consider the parse tree representation of each JSON type individually:

Strings

Strings are represented in the parse tree by two pointers: one to the
beginning of the string and one to the end. In the source text, these
correspond to the string’s quotes. The empty string, "",
is two characters and consumes two words in the parse tree.

struct String {
    size_t begin;  // where the string begins in the source text
    size_t end;    // where the string ends in the source text
};

Arrays

Arrays are represented by their length followed by a relative offset
to each element.

The smallest object, {}, is 2 characters but its
representation in the parse tree is a single word.

Now consider an object of one element: {"":""}. Including the key
string (but not the value string!), the object is five characters in
the input text. Its parse tree representation is four words: the
length plus three for its element.

Each additional element in the input text adds four characters (a
comma, a colon, and two quotes) but requires only three words in the
parse tree.
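
The counting argument for objects can be checked mechanically. The sketch below only restates the numbers given above for objects whose keys are all empty strings; the function names are made up for illustration.

```cpp
#include <cstddef>

// Characters an n-element object occupies in the input when every key is the
// empty string, not counting the characters of the values themselves.
constexpr std::size_t object_text_chars(std::size_t n) {
    return n == 0 ? 2                 // {}
                  : 5 + 4 * (n - 1);  // five for the first element, four per extra
}

// Words the object record occupies: its length plus three words per element.
constexpr std::size_t object_tree_words(std::size_t n) {
    return 1 + 3 * n;
}

static_assert(object_text_chars(0) == 2 && object_tree_words(0) == 1, "{}");
static_assert(object_text_chars(1) == 5 && object_tree_words(1) == 4, "one element");
static_assert(object_text_chars(2) == 9 && object_tree_words(2) == 7, "two elements");
```

For every n the word count stays below the character count, and the gap grows by one with each additional element.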

Numbers: Integers

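Integers get their own compact representation because, on 32-bit
architectures, doubles are two words. If single-digit numbers such as 0
consumed two words in the parse tree, [0,0,0,0] would not fit: it is
nine characters, but would need five words for the array record plus
eight for the numbers.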

Storing integers more compactly, we see that the smallest integers use
one character of input text and one word of parse tree structure.

It’s not worth the complexity, but the astute may notice that if we
limit integers to 29 bits, we don’t need to consume words in the parse
tree at all.
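
To make the 29-bit remark concrete: with a 3-bit tag in a 32-bit word, 29 bits remain, which is enough for integers from -2^28 up to 2^28 - 1. The sketch below shows what inlining such an integer into its own reference word could look like. sajson does not do this, as noted above; the tag value, the bit layout, and all names here are assumptions for illustration.

```cpp
#include <cstdint>

constexpr unsigned TAG_BITS = 3;
constexpr std::uint32_t TAG_MASK = (1u << TAG_BITS) - 1;
constexpr std::uint32_t TAG_INLINE_INTEGER = 1;  // hypothetical tag value

constexpr std::int32_t INLINE_MIN = -(1 << 28);
constexpr std::int32_t INLINE_MAX = (1 << 28) - 1;

// True if v is representable in the 29 bits left over after the tag.
constexpr bool fits_inline(std::int32_t v) {
    return v >= INLINE_MIN && v <= INLINE_MAX;
}

// Payload in the upper 29 bits, tag in the lower 3. Caller must ensure
// fits_inline(v), otherwise the top bits are lost.
constexpr std::uint32_t pack_inline_integer(std::int32_t v) {
    return (static_cast<std::uint32_t>(v) << TAG_BITS) | TAG_INLINE_INTEGER;
}

// Recovers the integer, sign-extending from bit 28 of the payload by hand
// so the result does not depend on how signed right shifts behave.
constexpr std::int32_t unpack_inline_integer(std::uint32_t word) {
    return ((word >> TAG_BITS) & (1u << 28))
        ? static_cast<std::int32_t>(word >> TAG_BITS) - (1 << 29)
        : static_cast<std::int32_t>(word >> TAG_BITS);
}
```

Integers outside that range would still need a separate payload word, which is part of the extra complexity mentioned above.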

Numbers: Doubles

On 32-bit architectures, doubles are stored (unaligned) into the parse
tree.

struct Double {
    size_t first_half;   // first word of the double's bytes
    size_t second_half;  // second word of the double's bytes
};

On 64-bit architectures, a double consumes a single word.
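
As a rough illustration of the 32-bit case, the sketch below stores a double across two consecutive 32-bit words at an arbitrary index, so the destination may not be 8-byte aligned. It models the parse tree as a vector of std::uint32_t and uses std::memcpy, which is one safe way to do an unaligned store; the function names are hypothetical and this is not taken from sajson.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == 2 * sizeof(std::uint32_t),
              "this sketch assumes 64-bit doubles");

// Copies the eight bytes of value into tree[index] and tree[index + 1].
inline void store_double(std::vector<std::uint32_t>& tree, std::size_t index,
                         double value) {
    assert(index + 2 <= tree.size());
    std::memcpy(&tree[index], &value, sizeof value);
}

// Reassembles the double from the two words starting at tree[index].
inline double load_double(const std::vector<std::uint32_t>& tree,
                          std::size_t index) {
    assert(index + 2 <= tree.size());
    double value;
    std::memcpy(&value, &tree[index], sizeof value);
    return value;
}
```

Going through memcpy rather than casting the word pointer to double* avoids both alignment faults and aliasing problems.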

The shortest JSON doubles are a single digit
followed by a decimal point followed by a single digit
(example: 0.0) or a single digit with a single-digit
exponent (example: 9e9). Either way, that is three characters of
input text for at most two words, so they fit into the parse tree.