TinyBuf: Binary Serialization

After reading some of Google's Protobuf documentation, and working extensively with Apache Thrift, I decided I could learn most effectively by creating my own proof-of-concept binary serialization framework. I would work in a largely test-driven fashion, and attempt to encode every bug I encountered in a reproducible unit test.

My proof of concept was going to be written in Python, although ideally the code would be written clearly enough that later implementation in Node, Scala or any other language wouldn't be difficult.

I decided early on that I would severely limit the scope of the protocol. Although user-defined data structures and common idioms like maps, lists and optional values would be supported, no effort would be made (as there is in Protobuf) to stay robust in the face of outdated schema definitions.

I also decided that I'd focus on the interesting part – data serialization – and write an extremely simple parser for definition files: not a production-quality system, but a basic line-by-line parser with little flexibility in how definitions can be written.

Limitations aside, I did want to make this a streaming library. Python's extensive support for generators, and hence lazy evaluation, means that producing and consuming streams of binary data (e.g. over a socket) should be particularly easy to implement.

I also wanted to use a strictly test-driven style of development. Although this isn't my usual way of working for things like web applications, it made sense to write good unit tests for this low-level code.

Finally, I didn't want to write anything that relied on generating code, as Thrift and Protobuf do. Instead, I'd read in definition files and automatically create data types from those definitions.

The first thing to design was a variable-length integer type. As well as being useful in itself, it would form an important part of the type system for variable-length collections such as lists, maps and strings.

It was important to have a working "roundtrip" test. This would make sure that unsigned_int_to_bytes and its inverse, read_unsigned_int, would always compose together to give the identity function.

I also wanted to write a test for each component, making sure that the binary format was what I was expecting. I've used the LEB128 format to encode unsigned integers of arbitrary size into a stream of bytes. In short, the most significant bit of each byte is a "continuation bit" describing whether there are any more bytes in the number to read. The other 7 bits are interpreted as usual for a binary number.
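The pair of functions might look something like this – a sketch reconstructing LEB128 from the description above (the names unsigned_int_to_bytes and read_unsigned_int come from the roundtrip test; the bodies are my reconstruction):

```python
def unsigned_int_to_bytes(value):
    """Encode a non-negative integer as LEB128: seven data bits per
    byte, least-significant group first, with the high bit of each
    byte set while more bytes follow."""
    out = bytearray()
    while True:
        out.append((value & 0x7F) | (0x80 if value > 0x7F else 0))
        value >>= 7
        if not value:
            return bytes(out)


def read_unsigned_int(stream):
    """Decode a LEB128 integer from an iterator of byte values,
    consuming only the bytes that belong to this number."""
    result = shift = 0
    for byte in stream:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")
```

The component test then pins down concrete encodings – e.g. 300 encodes as b"\xac\x02" – while the roundtrip test checks that read_unsigned_int(iter(unsigned_int_to_bytes(n))) == n for a spread of values.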

Next up is serializing text. Luckily, in Python, encoding a string of Unicode text as UTF-8 bytes is just a call to the .encode() method. Similarly, .decode() will transform a UTF-8 bytes object back into a native Python string.

In this case I trust the (de)serialization performed by .encode() and .decode(), so I won't write unit tests around this feature – it's enough to know that the two functions read_text and text_to_bytes are each other's inverse.
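A sketch of the pair, assuming strings are length-prefixed with the varint type from earlier (the prefix is my assumption – it's what makes strings readable from a stream; minimal varint helpers are repeated so the sketch stands alone):

```python
def _varint(n):
    # Minimal LEB128 encoder, repeated here for self-containment.
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if not n:
            return bytes(out)


def _read_varint(stream):
    result = shift = 0
    for byte in stream:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


def text_to_bytes(text):
    # Varint byte count, then the UTF-8 encoding of the string.
    encoded = text.encode("utf-8")
    return _varint(len(encoded)) + encoded


def read_text(stream):
    # Read the length prefix, then exactly that many bytes.
    length = _read_varint(stream)
    return bytes(next(stream) for _ in range(length)).decode("utf-8")
```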

I'll also create a boolean type. In Python, booleans are already integers (bool is a subclass of int), so I'll just encode the value as if it were a variable-length integer. This might seem wasteful – a whole byte for one bit of information – but it's generally more useful to align data to the nearest byte, especially for network transport.
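Since a boolean always fits in seven data bits, its varint encoding is exactly one byte. A minimal sketch (the function names mirror the integer pair and are my assumption):

```python
def bool_to_bytes(value):
    # True encodes as the varint 1, False as the varint 0 --
    # a full byte each, trading space for byte alignment.
    return b"\x01" if value else b"\x00"


def read_bool(stream):
    # Any nonzero byte reads back as True.
    return next(stream) != 0
```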

At this point, I refactored the functions that translate between native Python data types and bytestreams into static methods on type classes. This gives a uniform interface for built-in types (int, string, etc.) that can also be honoured by any user-defined types.

To create structures involving these types, I'll define a simple Map type: an unordered collection of key-value pairs, where each key is a variable-length integer and each value is any other type. To decode a Map, a client needs a mapping between numeric keys and value types. The user will define this mapping in a file similar to those used with Protobuf; for now, this unit test creates the mapping programmatically.

Here, we're doing three things. Firstly, we create a Mapping. This defines the relationship between numeric keys (1, 2, and 3), key names ("name", "age" and "likes_chocolate"), and the types of values these keys will be associated with.

Secondly, we create a bytestream which matches the structure we'd like our Map to have. It starts off with the number of entries in the Map, then alternates between numeric keys and their values. The first numeric key is 1, and its value is the encoded string "Bede Kelly". I'm using the previously-tested to_bytes method for each of the inner types, as I think it makes the code clearer and reduces redundancy in the testing.

Thirdly, we create a Map type from the mapping, and try reading a value of that type from the bytestream. If it's equal to our ideal dictionary of data, our test passes!
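The three steps above can be sketched as follows. The field names and the "Bede Kelly" value come from the description; the Mapping/Map class shapes and the example age (22) are my assumptions, and the helper types are repeated minimally so the sketch stands alone:

```python
def _varint(n):
    # Minimal LEB128 encoder/decoder pair for self-containment.
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if not n:
            return bytes(out)

def _read_varint(stream):
    result = shift = 0
    for byte in stream:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")

class UnsignedInt:
    to_bytes = staticmethod(_varint)
    read = staticmethod(_read_varint)

class Text:
    @staticmethod
    def to_bytes(text):
        data = text.encode("utf-8")
        return _varint(len(data)) + data

    @staticmethod
    def read(stream):
        length = _read_varint(stream)
        return bytes(next(stream) for _ in range(length)).decode("utf-8")

class Boolean:
    @staticmethod
    def to_bytes(value):
        return b"\x01" if value else b"\x00"

    @staticmethod
    def read(stream):
        return next(stream) != 0

class Mapping:
    """Links numeric keys to (field name, value type) pairs."""
    def __init__(self, fields):
        self.fields = fields

class Map:
    def __init__(self, mapping):
        self.mapping = mapping

    def read(self, stream):
        # Entry count first, then alternating numeric keys and values.
        result = {}
        for _ in range(_read_varint(stream)):
            key = _read_varint(stream)
            name, value_type = self.mapping.fields[key]
            result[name] = value_type.read(stream)
        return result

# Step 1: the Mapping.
mapping = Mapping({1: ("name", Text),
                   2: ("age", UnsignedInt),
                   3: ("likes_chocolate", Boolean)})

# Step 2: a bytestream built from the already-tested to_bytes methods.
stream = iter(UnsignedInt.to_bytes(3)
              + UnsignedInt.to_bytes(1) + Text.to_bytes("Bede Kelly")
              + UnsignedInt.to_bytes(2) + UnsignedInt.to_bytes(22)
              + UnsignedInt.to_bytes(3) + Boolean.to_bytes(True))

# Step 3: decode and compare against the ideal dictionary.
decoded = Map(mapping).read(stream)
```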

Let's see how this works with the inverse: generating a stream of bytes from data according to a mapping.

Again, we're creating a Mapping to link the numeric keys, the string names and the value types. But this time, we're checking that our serialization works correctly: that the to_bytes method of our new Map type outputs the correct sequence of bytes.
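A sketch of the serialization direction, under the same assumptions as before (Mapping/Map shapes and the example values are mine; varint helpers repeated for self-containment):

```python
def _varint(n):
    # Minimal LEB128 encoder.
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if not n:
            return bytes(out)

class UnsignedInt:
    to_bytes = staticmethod(_varint)

class Text:
    @staticmethod
    def to_bytes(text):
        data = text.encode("utf-8")
        return _varint(len(data)) + data

class Boolean:
    @staticmethod
    def to_bytes(value):
        return b"\x01" if value else b"\x00"

class Mapping:
    def __init__(self, fields):
        self.fields = fields  # {numeric key: (name, type)}

class Map:
    def __init__(self, mapping):
        self.mapping = mapping

    def to_bytes(self, values):
        # Entry count, then each numeric key followed by its
        # encoded value, in the Mapping's field order.
        out = bytearray(_varint(len(self.mapping.fields)))
        for key, (name, value_type) in self.mapping.fields.items():
            out += _varint(key) + value_type.to_bytes(values[name])
        return bytes(out)

mapping = Mapping({1: ("name", Text),
                   2: ("age", UnsignedInt),
                   3: ("likes_chocolate", Boolean)})

encoded = Map(mapping).to_bytes(
    {"name": "Bede Kelly", "age": 22, "likes_chocolate": True})
```

The test then asserts that encoded matches the byte sequence built by hand: count 3, then key 1 and the length-prefixed string, key 2 and the varint 22, key 3 and the boolean byte.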

Finally, I'd like these Map types to complain when they aren't given all the values they need to satisfy their internal Mapping.
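That check might be sketched like this – the error name and method are my invention (the post only says the type should complain), and the mapping is a plain dict here for brevity:

```python
class MissingFieldError(KeyError):
    """Raised when a Map is asked to serialize incomplete data.
    (Hypothetical name: the post doesn't specify the exception.)"""


class Map:
    def __init__(self, fields):
        self.fields = fields  # {numeric key: (name, type)}

    def check_values(self, values):
        # Every field name in the Mapping must have a supplied value.
        required = {name for name, _ in self.fields.values()}
        missing = required - values.keys()
        if missing:
            raise MissingFieldError(
                "missing values for: " + ", ".join(sorted(missing)))


fields = {1: ("name", None), 2: ("age", None)}  # types elided
try:
    Map(fields).check_values({"name": "Bede Kelly"})
    complained = False
except MissingFieldError:
    complained = True
```

A complete dictionary passes the check silently; an incomplete one raises, so serialization never silently drops fields.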