Such functions aren't really doing parsing work, they're mostly there for debugging and for sanity checks.

Get some data

Now let's get some data:

class FooParser
def get(n)
fail! "Trying to read past end of file" if bytes_left < n
result = @data[@ofs, n]
@ofs += n
result
end
end

Most files are byte-oriented, and going through a helper function to get the next few bytes and manage offset and end of file will save us complexity elsewhere.

This kind of function is also a great place for cases where file format is unusual somehow. Vast majority of formats are byte-oriented and fit in memory just fine, but if your format in too big, or uses individual bits, or does anything else weird, you can generally encapsulate such logic in get method - as long as @ofs uniquely describes current location in the file.

Get integers

Most formats are full of numbers and strings, and it's useful to write get methods for all of them. Let's start with Integers as they're easiest.

I generally use concise byte count convention (so get_i4 means get signed integer with 4 bytes), but more verbose bit-based naming (like get_i32 or even get_int32) might be more to your liking.

For most formats there's standard #unpack for them, so their unpackers are very simple - usually the only difference is right unpack string based on big vs small endianness of your format:

Very rarely file format uses more unusual number like 3-byte or variable length integer. Their get_* methods will look different from format to format, but you should generally be able to encapsulate them.

Here's an examples - for 24-bit numbers we can just pad them with extra byte (0 for unsigned, either 0 or 255 for signed), and use proper unpack string for 32-bit numbers:

Usually you can either add some padding, or just read them byte at a time and do some math to get the right number.

Get floats

This is slightly more complicated for two reasons - first the most common format in files is single precision, but Ruby uses double precision. Unpacking floats generates questionable extra precision:

[0.1].pack("f").unpack("f")[0] #=> 0.10000000149011612

Especially if you plan to present the result to users, we often don't want that - we want the "nicest looking" double precision number which still decodes to the same single precision number.

Now I'm sure there's some more "precise" algorithm, but this prettification method worked well:

class Float
def pretty_single
result = round(6)
return result if [self].pack("f") == [result].pack("f")
self
end
end

What we do here is round it to 6 decimal places, check if it equals self when converted to single precision, and if so, return rounded result, otherwise return self. It works with infinities, NaN, -0 and all other weirdness.

Second problem is that Ruby unpack still lacks half-precision floats, which are extremely popular in GPU world, so any data destined to go to GPU often uses it - or just formats trying to save on space.

In such case, just get_u2 and do bit manipulation yourself.

Get strings

Strings are just as ubiquitous as numbers, but there's a few more formats, including:

It's not uncommon for a single file format to have more than one string encoding - in such case create multiple such methods, or make it accept some argument to switch between them.

Get individual data structures

The generic part of the parser is done, so now for any data structure in your format you need to write appropriate parsing method, returning some Hash, Array, or proper object depending on the structure:

For more concise code you can often shortcut this thank to ruby evaluating arguments left to right (like most languages nowadays, but there were dark times when undefined or backwards evaluation conventions existed too):

Complete the parser

And the last part is completing the parser. If the whole file consists of same data structures every time, just write some kind of get_everything following the same pattern and you're done.

Even then I'd recommend adding something like fail! "Parsing done, but #{bytes_left} bytes left" unless eof? sanity check just to be sure format is what we expected it to be - extra garbage bytes generally indicate that file is not quite in formats expected.

Of course things are rarely quite as simple. It's very common for file formats to have structure like:

header starting with some magic

header then includes some global data and counts for data structures to follow

after that individual data structures follow, based on data in the header

Example parser

To avoid complexity of real formats, let's try a fictional format with kittens and puppies:

magic marker "CUTE"

u4 number of kittens

u4 number of puppies

each kitten has star rating (float), number of lives left (u4) and photo url (str)

There are weirder data structures, like offsets relative to current location etc. You can generally keep all the offset mangling in helper functions and keep your parsing code high level.

Lookahead

This is more a debugging thing than something you'll need to do often in actual parsing, but sometimes in your pry session you'd like to see if what's following is a kitten without affecting current parser state. It's very easy:

Then in your pry session just try lookahead{ get_kitten } and it will give you a kitten (or throw an exception).

Limitations

Writing Ruby-based parsers following patterns described in the post have been far more convenient and flexible than anything else I tried, but there are limitations:

you need to have some clue about the format, or at least data structures at its beginning - there's some debugging helpers, but no autodetection, and the only "partial parsing" you can conveniently do is for beginning of the file - if you know some data structures in the middle of the file, it will still be hard to find them and partially decode them without knowing the rest of the format

it's inherently one-way - you'll need to write a separate encoder, even though some clever DSL might have given you 90% of it for free. I found that overhead of highly abstracting DSLs is generally much higher than just writing it twice, and it's relatively easy to test that they match by round trip testing - just grab some samples, and check if decoding followed by encoding gets you the sample back.

its performance is OK, but not amazing - we're working with very high level language, and doing byte manipulation with a lot of layers of abstraction above it - it's actually quite surprising how fast it works, but it's still nowhere near as fast as a parser written in very low level language

it's designed for full parsing in one go, not random access - this is mostly performance issue, but if you want to just read (let alone change) a single kitten in file with thousands, this pattern won't support you

if the format is crazy complex, you'll probably need a lot more sophisticated parser - this pattern is great for simple to moderately complex formats

Jonathan Rochkind: This kind of declarative code definitely looks nice, but I'd be surprised if it was able to handle more complex formats. It seems more applicable to situations where you're in control of formats, not when you need to parse preexisting format with all the quirks.

The patterns I show in this post is basically binary equivalent of recursive descent parser, which is generally the nicest way to write parsers for complex formats.

My software

Creative Commons

Unless otherwise expressly stated, all original material of whatever nature created by Tomasz Węgrzanowski and included in this blog, is licensed under a Creative Commons License. It is also licensed under GFDL (for Wikipedia compatibility).