Fast Haskell: Competing with C at parsing XML

In this post we’re going to look at parsing XML in Haskell, how it compares with an efficient C parser, and steps you can take in Haskell to build a fast library from the ground up. We’re going to get fairly detailed and get our hands dirty.

A new kid on the block

A few weeks ago Neil Mitchell posted a blog post about a new XML library that he’d written. The parser is written in C, and the API is written in Haskell which uses the C library. He writes that it’s very fast:

Hexml has been designed for speed. In the very limited benchmarks I’ve done it is typically just over 2x faster at parsing than Pugixml, where Pugixml is the gold standard for fast XML DOM parsers. In my uses it has turned XML parsing from a bottleneck to an irrelevance, so it works for me.

In order to achieve that speed, he cheats by not performing operations he doesn’t care about:

To gain that speed, Hexml cheats. Primarily it doesn’t do entity expansion, so &amp; remains as &amp; in the output. It also doesn’t handle CData sections (but that’s because I’m lazy) and comment locations are not remembered. It also doesn’t deal with most of the XML standard, ignoring the DOCTYPE stuff. [..] I only work on UTF8, which for the bits of UTF8 I care about, is the same as ASCII - I don’t need to do any character decoding.

Cheating is fine when you describe in detail how you cheat. That’s just changing the rules of the game!

But C has problems

This post caught my attention because it seemed to me a pity to use C. Whether you use Haskell, Python, or whatever, there are a few problems with dropping down to C from your high-level language:

The program is more likely to segfault. I’ll take an exception over a segfault any day!

The program opens itself up to possible exploitation due to lack of memory safety.

If people want to extend your software, they have to use C, and not your high-level language.

At the moment, sorry to say – I wouldn’t use this library to parse any arbitrary XML, since it could be considered hostile, and get me owned. Using American Fuzzy Lop, just after a few minutes, I’ve already found around ~30 unique crashes.

But C is really fast right? Like 100s of times faster than Haskell! It’s worth the risk.

But-but C is fast!

Let’s benchmark it. We’re going to parse a 4KB, a 31KB and a 211KB XML file.

The numbers 60 and 62 are < and >. In XML the only characters that matter are < and > (if you don’t care about entities). < and > can’t appear inside speech marks (attributes). They are the only important things to search for. Results:

File hexml xeno
4KB 6.395 μs 2.630 μs
42KB 37.55 μs 7.814 μs

So the baseline performance of walking across the file in jumps is quite fast! Why is it fast? Let’s look at that for a minute:

The ByteString data type is a safe wrapper around a vector of bytes. It’s underneath equivalent to char* in C.

With that in mind, the S.elemIndex function is implemented using the standard C function memchr(3). As we all know, memchr jumps across your file in large word boundaries or even using SIMD operations, meaning it’s bloody fast. But the elemIndex function itself is safe.

So we’re effectively doing a for(..) { s=memchr(s,..) } loop over the file.

Keep an eye on the allocations

Using the weigh package for memory allocation tracking, we can also look at allocations of our code right now:

We see that it’s constant. Okay, it varies by a few bytes, but it doesn’t increase linearly or anything. That’s good! One thing that stood out to me, is that didn’t we pay for allocation of the Maybe values. For a 1000x < and > characters, we should have 1000 allocations of Just/Nothing. Let’s go down that rabbit hole for a second.

Looking at the Core

Well, if you compile the source like this

stack ghc -- -O2 -ddump-simpl Xeno.hs

You’ll see a dump of the real Core code that is generated after the Haskell code is desugared, and before it’s compiled to machine code. At this stage you can already see optimizations based on inlining, common-sub-expression elimination, deforestation, and other things.

The output is rather large. Core is verbose, and fast code tends to be longer. Here is the output, but you don’t have to understand it. Just note that there’s no mention of Maybe, Just or Nothing in there. It skips that altogether. See here specifically. There is a call to memchr, then there is an eqAddr comparison with NULL, to see whether the memchr is done or not. But we’re still doing safety checks so that the resulting code is safe.

Inlining counts

The curious reader might have noticed that INLINE line in my first code sample.

{-# INLINE elemIndexFrom #-}

Without the INLINE, the whole function is twice as slow and has linear allocation.

Right at the top, we have findTagName, doing all the allocations. So I looked at the code, and found that the only possible thing that could be allocating, is S.drop. This function skips n elements at the start of a ByteString. It turns out that S.head (S.drop index0 str) was allocating an intermediate string, just to get the first character of that string. It wasn’t copying the whole string, but it was making a new pointer to it.

So I realised that I could just replace S.head (S.drop n s) with S.index s n:

This function performs at the same speed as process before it accepted any callback arguments. This means that the only overhead to SAX’ing will be the activities that the callback functions themselves do.

Specialization is for insects (and, as it happens, optimized programs)

One point of interest is that adding a SPECIALIZE pragma for the process function increases speed by roughly 1 μs. Specialization means that for a given function which is generic (type-class polymorphic), which means it will accept a dictionary argument at runtime for the particular instance, instead we will generate a separate piece of code that is specialized on that exact instance. Below is the Identity monad’s (i.e. just pure, does nothing) specialized type for process.

In the 4KB case it’s only 800 ns, but as we say in Britain, take care of the pennies and the pounds will look after themselves. The 240->285 difference isn’t big in practical terms, but when we’re playing the speed game, we pay attention to things like that.

Where we stand: Xeno vs Hexml

Currently the SAX interface in Zeno outperforms Hexml in space and time. Hurrah! We’re as fast as C!

It’s also worth noting that Haskell does this all safely. All the functions I’m using are standard ByteString functions which do bounds checking and throw an exception if so. We don’t accidentally access memory that we shouldn’t, and we don’t segfault. The server keeps running.

If you’re interested, if we switch to unsafe functions (unsafeTake, unsafeIndex from the Data.ByteString.Unsafe module), we get a notable speed increase:

Implementing a DOM parser for Xeno

All isn’t lost. Hexml isn’t a dumb parser that’s fast because it’s in C, it’s also a decent algorithm. Rather than allocating a tree, it allocates a big flat vector of nodes and attributes, which contain offsets into the original string. We can do that in Haskell too!

Here’s my design of a data structure contained in a vector. We want to store just integers in the vector. Integers that point to offsets in the original string. Here’s what I came up with.

We have three kinds of payloads. Elements, text and attributes:

1. 00# Type tag: element2. 00# Parent index (within this array)3. 01# Start of the tag name in the original string4. 01# Length of the tag name5. 05# End index of the tag (within this array)

1. 02# Type tag: attribute2. 01# Start of the key3. 05# Length of the key4. 06# Start of the value5. 03# Length of the value

That’s all the detail I’m going to go into. You can read the code if you want to know more. It’s not a highly optimized format. Once we have such a vector, it’s possible to define a DOM API on top of it which can let you navigate the tree as usual, which we’ll see later.

We’re going to use our SAX parser–the process function, and we’re going to implement a function that writes to a big array. This is a very imperative algorithm. Haskellers don’t like imperative algorithms much, but Haskell’s fine with them.

The function runs in the ST monad which lets us locally read and write to mutable variables and vectors, while staying pure on the outside.

I allocate an array of 1000 64-bit Ints (on 64-bit arch), I keep a variable of the current size, and the current parent (if any). The current parent variable lets us, upon seeing a </close> tag, assign the position in the vector of where the parent is closed.

Whenever we get an event and the array is too small, I grow the array by doubling its size. This strategy is copied from the Hexml package.

Finally, when we’re done, we get the mutable vector, “freeze” it (this means making an immutable version of it), and then return that copy. We use unsafeFreeze to re-use the array without copying, which includes a promise that we don’t use the mutable vector afterwards, which we don’t.

The DOM speed

Not bad! The DOM parser is only <2x slower than Hexml (except in the 31KB where it’s faster. shrug). Here is where I stopped optimizing and decided it was good enough. But we can review some of the decisions made along the way.

In the code we’re using unboxed mutable references for the current size and parent, the mutable references are provided by the mutable-containers package. See these two lines here:

sizeRef <- fmap asURef (newRef 0)
parentRef <- fmap asURef (newRef 0)

Originally, I had tried STRef’s, which are boxed. Boxed just means it’s a pointer to an integer instead of an actual integer. An unboxed Int is a proper machine register. Using an STRef, we get worse speeds:

File xeno-dom
4KB 12.18 μs
31KB 6.412 μs
211KB 631.1 μs

Which is a noticeable speed loss.

Another thing to take into consideration is the array type. I’m using the unboxed mutable vectors from the vector package. When using atomic types like Int, it can be a leg-up to use unboxed vectors. If I use the regular boxed vectors from Data.Vector, the speed regresses to:

Tada! We matched Hexml, in pure Haskell, using safe accessor functions. We provided a SAX API which is very fast, and a simple demonstration DOM parser with a familiar API which is also quite fast. We use reasonably little memory in doing so.

This package is an experiment for educational purposes, to show what Haskell can do and what it can’t, for a very specific domain problem. If you would like to use this package, consider adopting it and giving it a good home. I’m not looking for more packages to maintain.