1 Handling Binary Data with Haskell

Many programming problems call for the use of binary formats for compactness,
ease-of-use, compatibility or speed. This page quickly covers some common
libraries for handling binary data in Haskell.

1.1 Bytestrings

Everything else in this tutorial will be based on bytestrings. Normal Haskell
String types are linked lists of 32-bit characters. This has a number of useful
properties like coverage of the Unicode space and laziness; however, when it
comes to dealing with bytewise data, String involves a space-inflation of about
24x and a large reduction in speed.

Bytestrings are packed arrays of bytes or 8-bit chars. If you have experience
in C, their memory representation would be the same as a uint8_t[]—although bytestrings know their length and don't allow overflows, etc.

There are two major flavours of bytestrings: strict and lazy. Strict
bytestrings are exactly what you would expect—a linear array of bytes in
memory. Lazy bytestrings are a list of strict bytestrings; often this is called
a cord in other languages. When reading a lazy bytestring from a file, the data
will be read chunk by chunk and the file can be larger than the size of memory.
The default chunk size is currently 32K.
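To make the relationship between the two flavours concrete, here is a minimal sketch using the bytestring package. The names twoChunks, flattened and wrapped are purely illustrative.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- A lazy bytestring assembled from two strict chunks.
twoChunks :: BL.ByteString
twoChunks = BL.fromChunks [B.pack [1, 2, 3], B.pack [4, 5]]

-- Collapse every chunk into one contiguous strict bytestring.
-- This forces the whole value into memory.
flattened :: B.ByteString
flattened = BL.toStrict twoChunks

-- Wrap a strict bytestring as a lazy one.
wrapped :: BL.ByteString
wrapped = BL.fromStrict flattened
```

Equality on lazy bytestrings compares contents rather than chunk boundaries, so wrapped and twoChunks compare equal even though they are chunked differently.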

Within each flavour of bytestring come the Word8 and Char8 versions. These are
mostly an aid to the type system, since the elements of both are fundamentally
the same size.
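As a starting point for I/O, here is a minimal sketch of a program that copies standard input to standard output using bytestrings. The binding name contents is only illustrative.

```haskell
import qualified Data.ByteString as B

-- Copy standard input to standard output.
main :: IO ()
main = do
  contents <- B.getContents  -- reads all of stdin before continuing
  B.putStr contents
```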

Note that we are using strict bytestrings here. (It's quite common to import the
ByteString module under the names B or BS.)
Since the bytestrings are strict, the code will read the whole of stdin into
memory and then write it out. If the input was too large this would overflow
the available memory and fail.

Let's see the same program using lazy bytestrings. We are just changing the
imported ByteString module to be the lazy one and calling the exact same
functions from the new module:
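A sketch of that lazy variant, assuming the same illustrative binding name contents; only the import line differs from a strict version:

```haskell
import qualified Data.ByteString.Lazy as B

-- Copy standard input to standard output, chunk by chunk.
main :: IO ()
main = do
  contents <- B.getContents  -- returns at once; chunks are read on demand
  B.putStr contents
```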

This code, because of the lazy bytestrings, will cope with input of any size and
will start producing output before all the input has been read. You can think
of the code as setting up a pipeline, rather than executing in order as you
might expect. As putStr needs more data, it will cause the lazy bytestring
contents to read more until the end of the input is found.

You should review the documentation, which lists all the functions that operate
on ByteStrings. The documentation for the various types (lazy Word8, strict
Char8, ...) is very similar in each case. You generally find the same functions
in each, with the same names. Remember to import the modules as qualified and
give them different names.

1.1.2 The guts of ByteStrings

I'll just mention in passing that sometimes you need to do something which would
endanger the referential transparency of ByteStrings. Generally you only need
to do this when using the FFI to interface with C libraries. Should such a need
arise, you can have a look at the internal functions (Data.ByteString.Internal)
and the unsafe functions (Data.ByteString.Unsafe). Remember that the last set of
functions are called unsafe for a reason: misuse can crash your program!

1.2 Binary parsing

Once you have your data as a bytestring you'll be wanting to parse something
from it. Here you need to install the
binary package. You should read the instructions on
how to install a Cabal package if you haven't done so already.

The binary package has three major parts: the Get monad,
the Put monad and a general serialisation for Haskell types. The
latter is like the pickle module that you may know from Python—it
has its own serialisation format and I won't be covering it any more here.
However, if you just need to persist some Haskell data structures, it might be
exactly what you want; see the documentation for the Data.Binary module.
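For readers who only need that general serialisation, a minimal sketch follows. It uses encode and decode from Data.Binary; the Record type and the function names are hypothetical.

```haskell
import Data.Binary (decode, encode)
import qualified Data.ByteString.Lazy as BL

-- Any type with a Binary instance can be serialised. Int, String,
-- lists, Bool and tuples all have instances out of the box.
type Record = (Int, String, [Bool])

serialise :: Record -> BL.ByteString
serialise = encode

-- Note: decode throws an exception on malformed input.
deserialise :: BL.ByteString -> Record
deserialise = decode
```

The byte layout is the package's own format, so this suits round-tripping Haskell values rather than reading formats defined elsewhere.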

1.2.1 The Get monad

The Get monad is a state monad; it keeps some state and each action
updates that state. The state in this case is an offset into the bytestring
which is getting parsed. Get parses lazy bytestrings; this is how
packages like
tar
can parse files several gigabytes long in constant memory: they are using a
pipeline of lazy bytestrings. However, this also has a downside. When parsing a
lazy bytestring a parse failure (such as running off the end of the bytestring)
is signified by an exception. Exceptions can only be caught in the IO monad
and, because of laziness, might not be thrown exactly where you expect. If this
is a problem, you probably want a strict version of Get, which is
covered below.
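A minimal sketch of a Get parser over a lazy bytestring. The header layout (three big-endian 32-bit words) and the names deserialiseHeader and parseHeader are assumptions chosen for illustration.

```haskell
import Data.Binary.Get (Get, getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

-- Hypothetical header: three big-endian 32-bit words.
deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
  alen   <- getWord32be
  plen   <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

-- runGet throws an exception if the input is shorter than 12 bytes.
parseHeader :: BL.ByteString -> (Word32, Word32, Word32)
parseHeader = runGet deserialiseHeader
```

Each getWord32be advances the offset by four bytes, which is the state the monad is threading through.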

1.2.2 Strict Get monad

If you're parsing small messages then, firstly, your input isn't going to be a
lazy bytestring but a strict one. That's not really a problem because you can
easily convert between them. However, if you want to handle parse failures you
either have to write your parser very carefully, or you have to deal with the
fact that you can only catch exceptions in the IO monad.

If this is your dilemma, then you need a strict version of the Get monad. It's
almost exactly the same, but a parser of type Get a results in
(Either String a, ByteString) as the result of runGet. That type is a tuple
where the first value is either a string (an error string from the parse) or
the result, and the second value is the remaining bytestring when the parser
finished.

Let's update the first example with this strict version of Get. You'll have to
install the binary-strict package for it to work.
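A sketch of what such a program might look like, assuming the first example parsed three big-endian Word32 values. The names deserialiseHeader, fullResult and truncatedResult are illustrative, and the inputs are written inline rather than read from the shell.

```haskell
import Data.Binary.Strict.Get (Get, getWord32be, runGet)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word32)

type Header = (Word32, Word32, Word32)

deserialiseHeader :: Get Header
deserialiseHeader = do
  alen   <- getWord32be
  plen   <- getWord32be
  chksum <- getWord32be
  return (alen, plen, chksum)

-- Complete input: 12 bytes of header plus a trailing newline.
fullResult :: (Either String Header, B.ByteString)
fullResult = runGet deserialiseHeader (BC.pack "abcdefghijkl\n")
-- (Right (1633837924,1701209960,1768581996),"\n")

-- Truncated input: only 9 bytes, so the third getWord32be cannot succeed
-- and the first component is a Left carrying an error string.
truncatedResult :: (Either String Header, B.ByteString)
truncatedResult = runGet deserialiseHeader (BC.pack "abcdefgh\n")
```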

Now we can see that the parser was successful (we got a Right) and we
can see that our shell actually added an extra newline on the input (correctly)
and the parser didn't consume that, so it's also returned to us. Now we try it
with a truncated input:

This time we didn't get an exception, but a Left value, which can be
handled in pure code. The remaining bytestring is the same because our
truncated input is 9 bytes long: parsing the first two Word32s consumed 8
bytes and parsing the third failed, at which point we had the last byte still
in the input.

In your parser, you can also call fail, with an error string, which will result
in a Left value.

That's it; it's otherwise the same as the Get monad.

1.2.3 Incremental parsing

If you have to deal with a protocol which isn't length prefixed, or otherwise
chunkable, from the network then you are faced with the problem of knowing when
you have enough data to parse something semantically useful. You could run a
strict Get over what you have and catch the truncation result, but
that means that you're parsing the data multiple times etc.

Instead, you can use an incremental parser. There's an incremental version of
the Get monad in Data.Binary.Strict.IncrementalGet (you'll
need the binary-strict package).

You use it as normal, but rather than returning an Either value, you
get a Result. The documentation for Result in that module is worth reading.

It reflects the three outcomes of parsing possibly truncated data. Either the
data is invalid as is, or it's complete, or it's truncated. In the truncated
case you are given a function (called a continuation), to which you can pass
more data, when you get it, and continue the parse. The continuation, again,
returns a Result depending on the result of parsing the additional
data as well.
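The same three-outcome idea can be sketched with the binary package's own incremental interface. Note that this is a substitute for binary-strict's IncrementalGet, not that module itself: here the outcomes are the Decoder constructors Fail, Partial and Done, and more input is supplied with pushChunk. The parser header and the helper status are hypothetical.

```haskell
import Data.Binary.Get (Decoder (..), Get, getWord32be, pushChunk, runGetIncremental)
import qualified Data.ByteString as B
import Data.Word (Word32)

-- Hypothetical message: two big-endian 32-bit words.
header :: Get (Word32, Word32)
header = (,) <$> getWord32be <*> getWord32be

-- Summarise where a decoder currently stands.
status :: Show a => Decoder a -> String
status (Fail _ _ msg)  = "invalid: " ++ msg
status (Partial _)     = "truncated, waiting for more input"
status (Done rest _ x) =
  "complete: " ++ show x ++ ", " ++ show (B.length rest) ++ " byte(s) left over"

-- Feed the parser in two pieces, as data might arrive from a socket.
afterFirstChunk, afterSecondChunk :: Decoder (Word32, Word32)
afterFirstChunk  = runGetIncremental header `pushChunk` B.pack [0, 0, 0, 1, 0]
afterSecondChunk = afterFirstChunk `pushChunk` B.pack [0, 0, 2, 0xff]
```

Five bytes are not enough for two words, so the first decoder is still Partial; the second chunk completes the parse and leaves one unconsumed byte in Done.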

1.2.4 Bit twiddling

Even with all this monadic goodness, sometimes you just need to move some bits
around. That's perfectly possible in Haskell too. Just import
Data.Bits and use the following table.

Name         C operator   Haskell
AND          &            .&.
OR           |            .|.
XOR          ^            `xor`
NOT          ~            `complement`
Left shift   <<           `shiftL`
Right shift  >>           `shiftR`
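A short sketch of these operators at work on a Word8. The helper names (nibbles, fromNibbles, toggle, clear) are illustrative only.

```haskell
import Data.Bits (complement, shiftL, shiftR, xor, (.&.), (.|.))
import Data.Word (Word8)

-- Split a byte into its high and low 4-bit halves.
nibbles :: Word8 -> (Word8, Word8)
nibbles w = (w `shiftR` 4, w .&. 0x0F)

-- Rebuild a byte from a high and a low nibble.
fromNibbles :: Word8 -> Word8 -> Word8
fromNibbles hi lo = (hi `shiftL` 4) .|. (lo .&. 0x0F)

-- Flip only the bits selected by the mask.
toggle :: Word8 -> Word8 -> Word8
toggle mask w = w `xor` mask

-- Clear only the bits selected by the mask.
clear :: Word8 -> Word8 -> Word8
clear mask w = w .&. complement mask
```

For instance, nibbles 0xAB gives (0x0A, 0x0B), and fromNibbles 0x0A 0x0B gives 0xAB back.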

1.2.5 The BitGet monad

As an alternative to bit twiddling, you can also use the BitGet monad.
This is another state-like monad, like Get, but here the state
includes the current bit-offset in the input. This means that you can easily pull out
unaligned data. Sadly, haddock is currently breaking when trying to generate the
documentation for BitGet so I'll start with an example. Again, you'll
need the
binary-strict package installed.

Here's a description of the header of a DNS packet, direct from RFC 1035:
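The sketch below reproduces the header layout from RFC 1035 as a comment and parses it. It assumes binary-strict's BitGet exposes runBitGet, getBit, getAsWord8 and skip as used here; the Flags and Header types and the function names are hypothetical. A flags failure is returned as an inner Either rather than raised with fail.

```haskell
import qualified Data.Binary.Strict.BitGet as BG
import Data.Binary.Strict.Get (Get, getByteString, getWord16be, runGet)
import qualified Data.ByteString as B
import Data.Word (Word16, Word8)

--                                 1  1  1  1  1  1
--   0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
-- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
-- |                      ID                       |
-- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
-- |QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
-- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
-- |                    QDCOUNT                    |
-- |                    ANCOUNT                    |
-- |                    NSCOUNT                    |
-- |                    ARCOUNT                    |
-- +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

data Flags = Flags
  { isResponse, authoritative, truncated, recursionDesired, recursionAvailable :: Bool
  , opcode, rcode :: Word8
  } deriving (Eq, Show)

data Header = Header
  { ident :: Word16
  , flags :: Flags
  , qdcount, ancount, nscount, arcount :: Word16
  } deriving (Eq, Show)

-- Parse the two flag bytes bit by bit.
parseFlags :: B.ByteString -> Either String Flags
parseFlags bytes = BG.runBitGet bytes $ do
  qr <- BG.getBit
  op <- BG.getAsWord8 4
  aa <- BG.getBit
  tc <- BG.getBit
  rd <- BG.getBit
  ra <- BG.getBit
  BG.skip 3               -- Z, reserved
  rc <- BG.getAsWord8 4
  return (Flags qr aa tc rd ra op rc)

-- The outer Get handles the byte-aligned fields and hands the
-- two flag bytes to BitGet.
readHeader :: Get (Either String Header)
readHeader = do
  i         <- getWord16be
  flagBytes <- getByteString 2
  qd        <- getWord16be
  an        <- getWord16be
  ns        <- getWord16be
  ar        <- getWord16be
  let mkHeader f = Header i f qd an ns ar
  return (mkHeader <$> parseFlags flagBytes)
```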

Here you can see that only the second line (from the ASCII-art diagram) is
parsed using BitGet. An outer Get monad is used for everything else and the bit
fields are pulled out with getByteString. Again, BitGet is a strict monad and
returns an Either, but it doesn't return the remaining bytestring, just because
there's no obvious way to represent a bytestring of a fractional number of
bytes.

You can see the list of BitGet functions and their comments in the
source code.

1.3 Binary generation

In contrast to parsing binary data, you might want to generate it. This is the
job of the Put monad. Follow along with the
documentation
if you like.

The Put monad is another state-like monad, but the state is an offset
into a series of buffers where the generated data is placed. All the buffer
creation and handling is done for you, so you can just forget about it. It
results in a lazy bytestring (so you can generate outputs that are larger than memory).
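A minimal sketch of Put in use, writing a hypothetical header of three big-endian 32-bit words. The names serialiseHeader and headerBytes are illustrative.

```haskell
import Data.Binary.Put (Put, putWord32be, runPut)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word32)

-- Write three big-endian 32-bit words, one after another.
serialiseHeader :: Word32 -> Word32 -> Word32 -> Put
serialiseHeader alen plen chksum = do
  putWord32be alen
  putWord32be plen
  putWord32be chksum

-- runPut turns the description into a lazy bytestring.
headerBytes :: BL.ByteString
headerBytes = runPut (serialiseHeader 1633837924 1701209960 1768581996)
```

Here headerBytes is the twelve ASCII bytes "abcdefghijkl", since 1633837924 is 0x61626364 and so on.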

One limitation of Put, due to the nature of the Builder monoid
which it works with, is that you can't get the current offset into the output.
This can be an issue with some formats which require you to encode byte offsets
into the file. You have to calculate these byte offsets yourself.
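One way to work out such a value yourself is to render the piece whose size you need first and measure it. The sketch below does this for a length prefix; body and lengthPrefixed are hypothetical names, and the 32-bit big-endian prefix is an assumed format.

```haskell
import Data.Binary.Put (Put, putLazyByteString, putWord32be, putWord8, runPut)
import qualified Data.ByteString.Lazy as BL

-- Hypothetical record body.
body :: Put
body = mapM_ putWord8 [0xDE, 0xAD, 0xBE, 0xEF, 0x00]

-- Render the body first so its size is known, then emit size and body.
lengthPrefixed :: Put -> Put
lengthPrefixed p = do
  let rendered = runPut p
  putWord32be (fromIntegral (BL.length rendered))
  putLazyByteString rendered
```

The cost of this design is that the rendered body is fully evaluated to compute its length, so that part of the output is no longer streamed.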

1.4 Other useful packages

There are other packages which you should know about, but which are mostly
covered by their documentation: