What is oheap?

In the last post, I mentioned that oil's ASDL implementation can
serialize ASTs into a binary format I'm calling oheap.

Although it will evolve, I'll continue to mention it, so this post is a general
overview. At the end, I'll make note of limitations and future work.

Motivation

First, I should mention that the 1997 paper on ASDL discusses an
ASN.1-like binary encoding of ASTs. As far as I know, this format is
unused. oheap can be thought of as a modern incarnation of that idea.

I wanted oheap to solve a practical problem: integrating the osh parser in
Python and a shell runtime in C++. The simplest and most efficient way to do
this is by sharing the AST with an in-memory binary format.

The other, perhaps more conventional, way is to use the
Python-C API, which is what Python itself does. But this is
verbose and error-prone: in this post, I mentioned
that 123 lines of Python.asdl translates to ~8100 lines of C code using
the Python-C API.

In contrast, with oheap, 129 lines of osh.asdl translates to ~1100
lines of C++ code in osh.asdl.h. As we'll see below, most of these lines
define types and do no work at runtime.

The second motivation is further in the future: I want much of oil to be
written in oil. But it doesn't make sense to parse oil for every shell
script it runs. Compiling oil code to oheap will avoid this.

Thus oheap can be compared to the .pyc format in Python. However, oil
won't use the file system as a cache, because that scheme causes subtle
problems in production. Also, I don't believe that running a shell script
should litter your system with temp files (although surprisingly, almost all
shells use temp files to implement here docs).

Description

I call the format oheap because it's conceptually like the C heap — a
block of bytes representing integers, pointers, strings, arrays, and records.
However, it's a relocatable heap, which means you can store it in a file
and load it with a single read() call.
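To make the "single read" idea concrete, here is a minimal sketch of loading an image into one contiguous buffer. The helper name LoadImage and the use of std::ifstream are assumptions for illustration; they are not taken from oil's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read a whole oheap image into one contiguous buffer.
// Returns an empty vector if the file can't be opened or fully read.
std::vector<uint8_t> LoadImage(const std::string& path) {
  assert(!path.empty());  // precondition: caller supplies a file name

  // Open at the end so tellg() reports the file size.
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  if (!in) return {};
  std::streamsize size = in.tellg();
  if (size < 0) return {};
  in.seekg(0, std::ios::beg);

  std::vector<uint8_t> image(static_cast<std::size_t>(size));
  // The one read: no per-object decoding happens here.
  in.read(reinterpret_cast<char*>(image.data()), size);
  if (!in) return {};
  return image;
}
```

Because every pointer in the image is an offset rather than an address, the buffer can live anywhere in memory and nothing needs to be patched after loading.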

Understanding the following methods is a shortcut to understanding the format.
They're essentially the "runtime library" for oheap:

The Int() method takes an offset n from the beginning of the Obj instance
— i.e. the only bytes_ member — and treats that location as the
beginning of a little-endian three-byte integer.

The Ref() method takes a pointer base to the oheap image, an offset n
to be treated as an Int, and returns a reference to another Obj. In other
words, it looks up a pointer field on an Obj instance.
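The two methods described above can be sketched roughly as follows. This is a reconstruction from the prose, not the contents of osh.asdl.h. It assumes that a pointer value is a plain byte offset from base, and ObjAt is a hypothetical helper added for the example.

```cpp
#include <cassert>
#include <cstdint>

class Obj {
 public:
  // Decode a little-endian three-byte integer that starts n bytes into
  // this object, using only indexing, left shifts, and addition.
  inline int Int(int n) const {
    assert(n >= 0);
    const uint8_t* p = bytes_;
    return p[n] + (p[n + 1] << 8) + (p[n + 2] << 16);
  }

  // Read the Int at offset n and follow it as a pointer into the image.
  // Assumption: the stored value is a byte offset from base.
  inline const Obj& Ref(const uint8_t* base, int n) const {
    int offset = Int(n);
    return *reinterpret_cast<const Obj*>(base + offset);
  }

 private:
  uint8_t bytes_[1];  // first byte of the object; the rest follow in the image
};

// Hypothetical helper: view the bytes at a given offset as an Obj.
inline const Obj& ObjAt(const uint8_t* base, int offset) {
  return *reinterpret_cast<const Obj*>(base + offset);
}
```

With a six-byte image whose first three bytes hold the value 3, ObjAt(image, 0).Ref(image, 0) yields the object that starts at byte 3, and calling Int(0) on it decodes the next three bytes.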

Here's the analogy to the C heap:

ASDL types are represented as subtypes of Obj, which makes Obj& analogous
to void*.

A Ref() call is analogous to dereferencing a pointer. This means that
oheap structures are lazily decoded. We can place the image anywhere
in memory and immediately begin computation, without a decoding step.

The rest of the osh.asdl.h header file, generated from osh.asdl, uses
static_cast<> to give you a typed API over the heap. It's similar to a
subset of the protobuf API.
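As an illustration of what such a typed layer might look like, here is a sketch for two made-up product types. The schema, the class names pair_t and box_t, and the field offsets are all hypothetical. The sketch also assumes that a product type is laid out as consecutive three-byte fields with no header, and that pointers are byte offsets from base.

```cpp
#include <cassert>
#include <cstdint>

// Same untyped base as described in the text (reconstructed sketch).
class Obj {
 public:
  inline int Int(int n) const {
    assert(n >= 0);
    const uint8_t* p = bytes_;
    return p[n] + (p[n + 1] << 8) + (p[n + 2] << 16);
  }
  inline const Obj& Ref(const uint8_t* base, int n) const {
    return *reinterpret_cast<const Obj*>(base + Int(n));
  }

 private:
  uint8_t bytes_[1];
};

// Hypothetical schema:
//   pair = (int first, int second)
//   box  = (int id, pair p)
// The typed classes add no data members, only inline accessors that read
// fields at fixed offsets.
class pair_t : public Obj {
 public:
  inline int first() const { return Int(0); }
  inline int second() const { return Int(3); }
};

class box_t : public Obj {
 public:
  inline int id() const { return Int(0); }
  // Follow the pointer field, then give the result its static type.
  inline const pair_t& p(const uint8_t* base) const {
    return static_cast<const pair_t&>(Ref(base, 3));
  }
};
```

The cast costs nothing at runtime. It only tells the compiler which accessors are valid on the bytes that Ref() found.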

Also note that every method in the header is inline. The runtime library does
little real work: just array indexing, left shifts, and addition.

Encoding

ASDL has these logical types:

bool

int

string

product type

sum type

array of any other type

which can be represented with these respective physical storage types in
oheap's C-like model:

bool: Int 0 or 1

int: Int

string: Ref to a NUL-terminated character sequence

product type: adjacent Int and Ref (ints and pointers)

sum type: a tag byte; then adjacent Int and Ref

array: a length Int; then adjacent Int or Ref
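The mapping above can be sketched from the writer's side. The helpers below are hypothetical, as is the entry type they encode. They assume three-byte little-endian Ints, pointers stored as byte offsets from the start of the image, and array fields stored as a Ref to the array. Sum-type tags are left out to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Image = std::vector<uint8_t>;

// Append a three-byte little-endian Int.
void PutInt(Image* out, int v) {
  assert(v >= 0 && v < (1 << 24));  // must fit in three bytes
  out->push_back(static_cast<uint8_t>(v & 0xff));
  out->push_back(static_cast<uint8_t>((v >> 8) & 0xff));
  out->push_back(static_cast<uint8_t>((v >> 16) & 0xff));
}

// Append a NUL-terminated string; return its offset so a Ref can point to it.
int PutString(Image* out, const std::string& s) {
  int offset = static_cast<int>(out->size());
  out->insert(out->end(), s.begin(), s.end());
  out->push_back(0);
  return offset;
}

// Append an array of ints: a length Int, then the elements.
int PutIntArray(Image* out, const std::vector<int>& items) {
  int offset = static_cast<int>(out->size());
  PutInt(out, static_cast<int>(items.size()));
  for (int item : items) PutInt(out, item);
  return offset;
}

// Hypothetical product type: entry = (bool ok, string name, int* sizes).
// Children are written first so their offsets are known when the record
// itself is written.
int PutEntry(Image* out, bool ok, const std::string& name,
             const std::vector<int>& sizes) {
  int name_ref = PutString(out, name);
  int sizes_ref = PutIntArray(out, sizes);
  int offset = static_cast<int>(out->size());
  PutInt(out, ok ? 1 : 0);  // bool: Int 0 or 1
  PutInt(out, name_ref);    // string: Ref
  PutInt(out, sizes_ref);   // array: Ref to a length-prefixed run of Ints
  return offset;
}
```

Encoding the entry (true, "ls", [10, 20]) this way takes 21 bytes: 3 for the string, 9 for the array, and 9 for the record.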

oheap can use integers and pointers of any width, but in oil they're three
bytes wide. I figure that any shell script can be represented with an AST of
less than 2^24 = 16 Mi locations.

In addition to location independence, compression is another benefit of
representing pointers with small integers. For example, it takes oheap 16
bytes to represent a sum type with five fields: one byte for the tag, and
three bytes for each of five fields.

A native representation on a 32-bit machine would take 1 + 5*4 = 21
bytes. On a 64-bit machine, it would take 1 + 5*8 = 41 bytes. In the
latter case, oheap is over 60% smaller.
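The arithmetic can be checked with a few lines of code. The function names are made up for the example, and the native sizes ignore alignment padding, as the figures above do.

```cpp
#include <cassert>

// Size in bytes of a record with a one-byte tag and fixed-width fields.
constexpr int RecordSize(int num_fields, int field_width) {
  return 1 + num_fields * field_width;
}

constexpr int kOheapSize = RecordSize(5, 3);     // three-byte fields
constexpr int kNative32Size = RecordSize(5, 4);  // 32-bit words
constexpr int kNative64Size = RecordSize(5, 8);  // 64-bit words

static_assert(kOheapSize == 16, "1 + 5*3");
static_assert(kNative32Size == 21, "1 + 5*4");
static_assert(kNative64Size == 41, "1 + 5*8");

// Fraction of space saved by the smaller encoding relative to the larger.
double SavingsFraction(int small_size, int big_size) {
  assert(big_size > 0);
  return 1.0 - static_cast<double>(small_size) / big_size;
}
```

SavingsFraction(kOheapSize, kNative64Size) is 1 - 16/41, or about 0.61. Against the 32-bit layout the saving is 1 - 16/21, or about 0.24.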

Comparison to Other Serialization Frameworks

Together, ASDL and oheap have an architecture like
protocol buffers, with these components:

As mentioned, a key difference is that there's no decoding step. This
reminds me of capnproto, which is roughly a successor to protocol buffers
(having been developed by the author of "proto2"). capnproto avoids the
decoding step by using the in-memory format as the serialization format (with
some message-independent compression).

You could say that oheap is the opposite: we're using the serialization
format as the in-memory format. The format is designed to be both small and
efficiently decoded on the fly.

Calling Ref() is undoubtedly faster than parsing shell, so we've achieved
our goal of avoiding parsing. But it remains to be seen how appropriate
oheap is in other situations.

Limitations

Right now, oil uses oheap in an immutable fashion. Everything is packed
tightly together: there's no operation for appending to an array, for example.

I have ideas for implementing mutating operations, but it's also possible that
we only need this ML-like model of transformations on immutable trees.

Another important difference between oheap and protobuf or capnproto
is that there are no integer tags in the binary encoding to identify fields.
This makes the encoding smaller at the expense of it being fragile with
respect to schema changes. Tags can be added, but since we're not saving
oheap files or sending them across the network, they're not currently needed.