Smoking fast Haskell code using GHC’s new LLVM codegen

In this post we’ll play with GHC’s new LLVM code generator backend, and see how much faster some Haskell programs are when compiled with LLVM instead of via GCC (the C backend).

For the kind of loops we get from stream fusion, the -fllvm backend produces much better code, up to 3x faster in some cases. There are pretty graphs, and some smoking hot new technology.

Overview

This week David Terei announced that his work on an LLVM code generator backend for the Glasgow Haskell Compiler was ready to try out. Initial reports from his undergraduate thesis suggested that the LLVM code generator was competitive with the current GHC native code generator, a bit slower than the C backend in general (which uses GCC for code generation), but, tantalisingly, should produce big speedups for particular Haskell programs — in particular, tight loops of the kind generated by the bytestring, vector, data parallel arrays or text libraries. David reported speedups of 25% over the previous best performance we’d got from GHC for data parallel code.

I was very keen to try it out on the vector library — a fast, fusible numerical arrays package (similar to NumPy), which generates some very tight loops. Under the C backend, GCC has been failing to spot that the code GHC generates is actually a loop, and this led to GCC optimizing the generated code pretty badly. The native code generator does OK, but doesn’t have a lot of the clever low-level optimizations we need for really good bare-metal performance.

The Vector Package

Vector is a Haskell library for working with arrays. It provides several array types (boxed, unboxed, C), with a rich interface similar to the list library, and some functions reminiscent of Data Parallel Haskell. There’s a tutorial on how to use it.

The interface is built entirely around stream fusion combinators — a general form of classic loop fusion made possible by purity. When you do multiple passes over the data (e.g. sum/map/fold/filter/…) the compiler will common up the loops, and discard intermediate arrays, making the code potentially very fast.
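To make the combinator style concrete, here’s a small sketch (the module and functions are the real vector API, but the particular pipeline is my own illustration): three logical passes that fusion compiles into a single loop, with no intermediate arrays allocated.

```haskell
import qualified Data.Vector.Unboxed as U

-- A fold over a filtered, mapped enumeration. Thanks to stream fusion
-- this whole pipeline becomes one loop over a counter: no intermediate
-- vectors are ever built.
result :: Int
result = U.sum                      -- fold the stream
       . U.map (* 2)                -- transform each element
       . U.filter even              -- keep only the evens
       $ U.enumFromTo 1 (100 :: Int)

main :: IO ()
main = print result                 -- prints 5100
```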

The loops that are generated tend to be very register heavy, do no heap allocation, and benefit from clever imperative loop optimizations. Unfortunately, the GCC backend to GHC doesn’t spot that these are actually loops, so doesn’t get to fire many optimizations.

The promise of the LLVM backend is that it will recognize the loops GHC generates from fused code. Let’s see how it performs.

To benchmark these programs, I’ll use the criterion and progression benchmarking libraries. (I had to build the darcs version of gtk2hs, and compile data-accessor-template with the -ftemplate_2_4 flag.)

Simple loops

To start off, let’s generate 1 billion ints, sum them, print the result. That should tell us if our loops are efficient:
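The benchmark is along these lines (a reconstruction — the exact module and style in the original may differ slightly):

```haskell
import qualified Data.Vector.Unboxed as U

-- Sum the first billion Ints. Stream fusion means the billion-element
-- vector is never materialised: enumFromTo and sum fuse into a single
-- counting loop with no heap allocation.
main :: IO ()
main = print (U.sum (U.enumFromTo 1 (1000000000 :: Int)))
```

Note this needs a 64-bit Int for the result to fit without overflow.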

This is the fastest Haskell we’ve ever generated for this little benchmark (at least without manual loop unrolling)!

The LLVM backend more than halved the running time for this simple loop. But remember: general benchmarks aren’t seeing this kind of speedup — LLVM really excels at tight numeric code.

Here’s the data presented in a slightly different form, with criterion and progression. The numbers are slightly different, since we won’t inline the length of the vector argument, and we’re wrapping the code in benchmarking wrappers. I wasn’t able to get -fvia-C programs to link under the HEAD, so we’ll exclude those from graphs, but report them in text form.

With the -fasm backend:

With the LLVM backend:

Or side-by-side with the progression package:

The -fasm backend under the progression tool ran in around 1s for each billion ints, while -fllvm was around 0.8s. Note that we get slightly different timings for the loops under each benchmarking tool, due to how the benchmark program and wrapper are optimized.

Zips

Zips are another good candidate, since they turn into nested loops. So, e.g.
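A sketch of the kind of zip benchmark meant here (the element count and the second vector are my own choices, not necessarily the original’s):

```haskell
import qualified Data.Vector.Unboxed as U

-- zipWith over two fused enumerations: both generators and the zip
-- collapse into one loop over a pair of counters, with no intermediate
-- vectors allocated.
main :: IO ()
main = print . U.sum
     $ U.zipWith (*) (U.enumFromTo 1 (100000000 :: Int))
                     (U.replicate 100000000 7)
```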

Conclusions

The LLVM backend seems to be holding up to what we hoped: it does a better (sometimes much better) job on tight loops. We get better code than GHC has ever produced before. It seems pretty robust — so far everything I’ve tried has worked.

David’s benchmarks indicate that with the current — first attempt — at an LLVM backend most large programs aren’t noticeably faster, but I think the promise we see in these small examples justifies spending more time working on the LLVM backend to GHC. It has much more potential than the GCC backend.

Currently we’re not experimenting with the LLVM optimization layer at all — I think there’s likely to be a lot of win just tweaking those settings (and exposing them to the Haskell programmer via GHC flags).

20 comments

Is there still no way of plotting criterion graphs together, or changing the scales so they match? Not a big deal, but it makes it harder to read (although I suppose they’re sharp enough that it’d end up as two spikes on a graph.)

Anyway, that looks fantastic. Is it a full backend? Should FFI still work?

Interestingly, from a quick scan through the thesis, it looks like David attributes a lot (or at least some) of the tight-loop performance advantage to not having pinned the STG registers except at function entry and exit.

Specifically, the bottom of page 42 and top of page 43 detail how he used a custom LLVM calling convention (this seems to be what the LLVM patch is for) to make sure the STG registers were always set at function entry and exit. This is sufficient to make the rest of the RTS happy.

What it means, though, is that the LLVM compiler is free to spill the registers in the code body. The bottom of page 53 and top of page 54 comment on how this can be critical to speeding up tight loops (specifically, DPH ones where he was seeing some impressive performance gains).

How is the code generation from LLVM to assembly set up?
I presume it’s running ‘opt’ somewhere in there?
The flags to the optimizer probably need some tweaking; for instance, you might want to unroll loops a bit for all of the loops in your examples, since it can make a big difference.

Mark: Criterion does have an option to plot the graphs for two benchmarks on the same axis ("--kde-same-axis", see http://chplib.wordpress.com/2009/10/21/benchmarking-stm-with-criterion/ for an example), but that only works if you are plotting several benchmarks at the same time. Don is running his benchmarks with two different programs (since they use different backends), so that feature of Criterion can’t be used.

Any chance of a follow up with some comparisons to C compiled with GCC? I’m more interested how well we’re competing at the moment than how much better we’re doing (C gives a nice “damn we’re good, look at us!” baseline for when we’re faster than it :)).