Saturday, June 10, 2017

Translating a C++ parser to Haskell

Recently I translated Nix's derivation parser to Haskell and I thought this would make an instructive example for how C++ idioms map to Haskell idioms. This post targets people who understand Haskell's basic syntax but perhaps have difficulty translating imperative style to a functional style. I will also throw in some benchmarks at the end, too, comparing Haskell performance to C++.

Nix derivations

Nix uses "derivations" to store instructions for how to build something. The corresponding C++ type is called a Derivation, which is located here:

This is how people used to encode object-oriented programming before there was such a thing as object-oriented programming and this pattern is common in functional languages. The disadvantage is that this leads to an import-heavy programming style.

We can now translate this C++ to Haskell now that we've reduced the code to simple data types and functions on those types:

Note that Haskell type synonyms reverse the order of the types compared to C++. The new type that you define goes on the left and the body of the definition goes on the right. Haskell's order makes more sense to me, since I'm used to the same order when defining values like x = 5.

There are a few more changes that I'd like to make before we proceed to the parsing code:

First, Haskell's String type and default list type are inefficient for both performance and space utilization, so we will replace them with Text and Vector, respectively. The latter types are more compact and provide better performance:

The first thing that the C++ parses is the string "Derive(" followed by a list of DerivationOutputs. The code consolidates the first '[' character of the list with the string "Derive(" which is why the code actually matches "Derive([".

Derivation files store the outputs field of our Derivation type as a list of 4-tuples. The first field of each 4-tuple is a key in our Map and the remaining three fields are the corresponding value, which is marshalled into a DerivationOutput.

The C++ code interleaves the logic for parsing the list structure and parsing each element but our Haskell code will separate the two for clarity:

The code has placeholders for three utilities we haven't defined yet with the following types:

-- This is a utility function that transforms a parser of key-value pairs into a-- parser for a `Map`
mapOf
::Ord k
-- ^ This is a "constraint" and not a function argument. This constraint-- says that `k` can be any type as long as we can compare two values of-- type `k`=>Parser (k, v)
-- ^ This is the actual first function argument: a parser of key-value-- pairs. The type of the key (which we denote as `k`) can be any type as-- long as `k` is comparable (due to the `Ord k` constraint immediately-- preceding this). The type of the value (which we dnote as `v`) can be-- any type->Parser (Map k v)
-- ^ This is the function output: a parser of a `Map k v` (i.e. a map from-- keys of type `k` to values of type `v`)-- This is a utility which parses a string literal according to the EBNF rule-- named `string`string ::ParserText-- This is a utility which parses a string literal according to the EBNF rule-- named `path`filepath ::Parser FilePath

mapOf use a helper function named listOf which parses a list of values. This parser takes advantage of the handy sepBy utility (short for "separated by") provided by Haskell's attoparsec library. You can read the implementation of listBy as saying:

Match the string "["

Match 0 or more elements separated by commas

Match the string "]"

Then you can read the implementation of mapOf as saying:

Parse a list of keyValue pairs

Use Data.Map.fromList to transform that into the corresponding Map

We can now use mapOf and listOf to transform the next block of parsing code, too:

However, we won't naively translate that to Haskell because this is on our parser's critical path for performance. Haskell's attoparsec library only guarantees good performance if you use bulk parsing primitives when possible instead of character-at-a-time parsing loops.

Our Haskell string literal parser will be a loop, but each iteration of the loop will parse a string block instead of a single character:

The reason this works is because Haskell's return is not the same as return in C/C++/Java because the Haskell return does not exit from the surrounding subroutine. Indeed, there is no such thing as a "surrounding subroutine" in Haskell and that's a good thing!

In this context the return function is like a Parser that does not parse anything and returns any value that you want. More generally, return is used to denote a subroutine that does nothing and produces a value that can be stored just like any other command.

Our derivation file is 15,210 characters long, so that comes out to about 200 nanoseconds per character to parse. That's not bad, but could still use improvement. However, I stopped optimizing at this point because I did some experiments that showed that parsing was no longer the bottleneck for even a trivial program.

I compared the performance of an executable written in Haskell to the nix-store executable (written in C++) to see how fast each one could display the outputs of a list of derivations. I ran them on 169 derivations all beginning with the letter z in their hash:

$ ls -d /nix/store/z*.drv |wc -l
169

The nix-store command lets you do this with nix-store --query --outputs:

... so if you factor in that overhead then both tools process derivations at a rate of about 440 microseconds per file. Given that the Haskell executable is exactly as efficient as C++ I figured that there was no point further optimizing the code. The first draft is simple, clear and efficient enough.

Conclusion

Hopefully this helps people see that you can translate C++ parsing code to Haskell. The main difference is that Haskell parsing libraries provide some higher-level abstractions and Haskell programs tend to define loops via recursion instead of iteration.

The Haskell code is simpler than the C++ code and efficient, too! This is why I recommend Haskell to people who want want a high-level programming language without sacrificing performance.

I also released the above Haskell parser as part of the nix-derivation library in case people were interested in using this code. You can find the library on Hackage or on GitHub.

Thank you! I had fun playing around with this code. On my machine (MacBook Air), I found that I had to switch to `Data.Attoparsec.ByteString.Char8` to get performance on par with `nix-store --query`. It's probably safe as the .drv files here report to be ASCII.

Possibly, although I don't know for sure since I'm not familiar with available C++ parsing libraries. Specifically, I don't know what C++ options there are for building recursive descent backtracking parsers (which is the usual idiom for Haskell parsers).