Ultimately we’ll make a program 4x faster on 4 cores by changing one line of code, using parallelism, and tuning the garbage collector.

Update: since I began writing this, GHC HQ (aka Simon, Simon and Satnam) have released “Runtime Support for Multicore Haskell”, which finally puts on paper a lot of information that was previously just rumour. As a result, I’ve rewritten this article from scratch to use GHC 6.11 (today’s snapshot), since it is so much faster and easier to use than 6.10.x.

The garbage collector can now use multiple threads in parallel. The new -gn RTS flag controls it, e.g. run your program with +RTS -g2 -RTS to use 2 threads. The -g option is implied by the usual -N option, so normally there will be no need to specify it separately, although occasionally it is useful to turn it off with -g1. Do let us know if you experience strange effects, especially an increase in GC time when using the parallel GC (use +RTS -s -RTS to measure GC time). See Section 5.14.3, “RTS options to control the garbage collector” for more details.

Interesting. Maybe this will have some impact on the shootout benchmarks.

Binary trees: single threaded

There’s one program that’s been bugging me for a while, where the garbage collector is a bottleneck: parallel binary-trees on the quad core Computer Language Benchmarks Game. This is a pretty straightforward program for testing out memory management of non-flat data types in a language runtime – and FP languages should do very well with their bump-and-allocate heaps. All you really have to do is allocate and traverse a bunch of binary trees. This kind of data:

data Tree = Nil | Node !Int !Tree !Tree

Note that the rules state we can’t use laziness to avoid making O(n) allocations at a time, so the benchmark will use a strict tree type – that’s fine – it only helps with a single core anyway. GHC will unbox those Int fields into the constructor too, with -funbox-strict-fields (should be implied by -O in my opinion). The benchmark itself is really quite easy to implement. Pattern matching makes allocating and wandering them trivial:
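A minimal sketch of what that can look like, assuming a simple labelling scheme for the nodes. The function names `make` and `check` and the exact arithmetic are my assumptions for illustration, not necessarily the code of the benchmark entry:

```haskell
-- Strict binary tree: the bangs force every field when a Node is built,
-- so a whole tree is allocated eagerly rather than as a chain of thunks.
data Tree = Nil | Node !Int !Tree !Tree

-- Build a complete tree of the given depth (hypothetical labelling:
-- the children of a node labelled i are labelled 2*i-1 and 2*i).
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (2 * i - 1) (d - 1)) (make (2 * i) (d - 1))

-- Walk the whole tree, folding the labels into a single checksum.
check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r
```

Both functions are plain structural recursion: one equation per constructor, and the strictness annotations do the rest.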

And of course we get no speedup from the extra cores on the system yet. We’re only using 1/4 of the machine’s processing resources. The implementation contains no parallelisation strategy for GHC to use.

Binary trees in parallel

Since Haskell (especially pure Haskell like this) is easy to parallelise, and in general GHC Haskell is pretty zippy on multicore :-) let’s see what we can do to make this faster by parallelisation. It turns out, teaching this program to use multicore is ridiculously easy. All we have to change is one line! Where previously we computed the depth of all the trees between minN and maxN sequentially,

Which yields a list of tree results sequentially, we instead step back, and compute the separate trees in parallel using parMap:

let vs = parMap rnf id $ depth minN maxN

From Control.Parallel.Strategies, parMap forks sparks for each (expensive) computation in the list, evaluating them in parallel to normal form. This technique uses sparks – lazy futures – to hint to the runtime that it might be a good idea to evaluate each subcomputation in parallel. When the runtime spots that there are spare threads, it’ll pick up the sparks, and run them. With +RTS -N4, those sparks (in this case, 9 of them) will get scheduled over 4 cores. You can find out more about this style of parallel programming in ch24 of Real World Haskell, in Algorithm + Strategy = Parallelism and now in the new GHC HQ runtime paper.
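To make the pattern concrete outside the benchmark, here is a small self-contained sketch. It assumes a later version of the parallel package, where (as far as I know) `rnf` is no longer a strategy and `rdeepseq` plays the "evaluate to normal form" role. The `expensive` function is a hypothetical stand-in for building and checking the trees of one depth:

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)
import Data.List (foldl')

-- Hypothetical stand-in for one expensive, independent piece of work.
expensive :: Int -> Int
expensive n = foldl' (+) 0 [1 .. n * 100000]

-- Sequential version: elements are computed one after another, on demand.
resultsSeq :: [Int] -> [Int]
resultsSeq = map expensive

-- Parallel version: one spark per list element, each one evaluated
-- fully (to normal form) by whichever core picks it up.
resultsPar :: [Int] -> [Int]
resultsPar = parMap rdeepseq expensive
```

The two functions return the same list; only the evaluation order differs. Without the threaded runtime and +RTS -N, the sparks are simply evaluated on one core.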

Running parallel binary trees

Now that we’ve modified the implementation to contain a parallel evaluation strategy, all we have to do is compile it against the threaded GHC runtime, and those sparks will be picked up by the scheduler, and dropped into real threads distributed across the cores. We can try it using 2/4 cores:
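As a quick sanity check before timing anything, it can help to confirm that the program really was linked against the threaded runtime and really did receive the -N setting. A minimal sketch, assuming a base library recent enough to export `getNumCapabilities` from Control.Concurrent (the helper name `describeRuntime` is mine):

```haskell
import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

-- Describe the runtime this program was linked against and started with.
describeRuntime :: IO String
describeRuntime = do
  -- Number of capabilities, i.e. the value given to +RTS -N.
  caps <- getNumCapabilities
  -- rtsSupportsBoundThreads is True only when linked with -threaded.
  let rts = if rtsSupportsBoundThreads then "threaded" else "non-threaded"
  pure (rts ++ " runtime, " ++ show caps ++ " capabilities")
```

Calling `describeRuntime >>= putStrLn` at the top of `main` makes it obvious when a run was accidentally single-core.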

So still 40s, at 239% cpu. So we made something hot. And you can see a similar result at N=20 on the current quad core shootout binary-trees entry. Jobs distributed across the cores, but not much better runtime. A little better than the single core entry, but only a little. And in the middle of the pack, and 2x slower than C!

Meanwhile, on the single core, it’s in 3rd place, ahead of C and C++. So what’s going on?

Listening to the garbage collector

We’ve parallelised this logically well, so I’m not prepared to abandon the top-level parMap strategy. Instead, let’s look deeper. One clue about what is going on is the cpu utilisation in the shootout program:

Those aren’t very good numbers – we’re using all the cores, but not very well. So the program’s doing something other than just number crunching. A good suspect is that there’s lots of GC traffic happening (after all, a lot of trees are being allocated!). We can confirm this hunch with +RTS -sstderr which prints lots of interesting statistics about what the program did:
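The same kind of numbers can also be read from inside the program. This is a minimal sketch, assuming a newer base library that provides `GHC.Stats` with `getRTSStats`, and assuming the program is started with statistics collection switched on (for example +RTS -T). The helper name `gcFraction` is mine:

```haskell
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Fraction of elapsed wall-clock time spent in the garbage collector so far,
-- or Nothing when the runtime is not collecting statistics.
gcFraction :: IO (Maybe Double)
gcFraction = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then pure Nothing
    else do
      stats <- getRTSStats
      let gc    = fromIntegral (gc_elapsed_ns stats) :: Double
          total = fromIntegral (elapsed_ns stats) :: Double
      pure (if total > 0 then Just (gc / total) else Nothing)
```

A value close to 1 would mean the program is mostly collecting garbage rather than computing, which is exactly the suspicion here.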

I should point out that some of the improvements mentioned in the “Runtime Support for Multicore Haskell” paper haven’t hit the HEAD yet, so you’re not seeing some of the benefit from the parallel GC. In particular, what you have is similar to PARGC1 (terminology from the paper). On the binary trees program it doesn’t make a huge difference, though: the biggest gains here are to be had by just using a bigger heap, and not copying all those trees around.

After 6.10.2 is out I’ll focus on testing and cleaning up those patches and get them in.