Use those extra cores and beat C today! (Parallel Haskell redux)

In an oddly-titled post earlier today I’d had too much coffee, and we looked at how compiled Haskell code smokes out interpreted code (for various reasons, mostly to do with being compiled and not interpreted). However, the real point of the article (other than to burn things with my flame thrower) was to start to explore the new parallelism annotations in Haskell.

So let’s continue that, with some more explorations of how far we can get with parallel annotations, and whether Haskell can compete with C, given enough cores. (And I’d just like to note that Spencer Janssen, of the xmonads, contributed most of the code and ideas for this article :)

We’ll stick to the naive Fibonacci implementation, but switch to a 4 core machine, and see how far we can scale up the Haskell code as we add cores, without resorting to manual parallel programming.

We can, however, give the compiler some hints about how best to parallelise the code, using the lovely `par` annotation (from Control.Parallel, originally from this paper). From the manual:

The expression (x `par` y) sparks the evaluation of x (to weak head normal form) and returns y. Sparks are queued for execution in FIFO order, but are not executed immediately. If the runtime detects that there is an idle CPU, then it may convert a spark into a real thread, and run the new thread on the idle CPU. In this way the available parallelism is spread amongst the real CPUs.

So let’s naively annotate this. Just split the tree into two parts, and run the first branch in parallel with the other, hoping that it finishes about the same time as the second, so there’s no waiting:
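As a rough illustration of that naive annotation, here is a minimal sketch (not necessarily the exact code behind the timings in this post). The name `parFib` is made up for the example, and it assumes the `parallel` package is installed so that `Control.Parallel` is available. It sparks one branch with `par`, uses `pseq` to evaluate the other branch first, and then adds the two.

```haskell
import Control.Parallel (par, pseq)

-- Naive Fibonacci with a spark created at every recursive call.
-- `parFib` is a hypothetical name used only for this sketch.
parFib :: Int -> Integer
parFib n
  | n < 2     = toInteger n
  | otherwise = l `par` (r `pseq` l + r)
      -- spark l for possible parallel evaluation,
      -- evaluate r here, then combine the two results
  where
    l = parFib (n - 1)
    r = parFib (n - 2)

main :: IO ()
main = mapM_ report [0 .. 25]
  where
    -- a small range, so the demo finishes quickly even unoptimised
    report n = putStrLn ("n=" ++ show n ++ " => " ++ show (parFib n))
```

To actually use more than one core, this would need to be built with something like `ghc -O2 -threaded` and run with `+RTS -N2` (recent GHC versions also want `-rtsopts` at compile time before they accept `-N`).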

Hmm, interesting! While the CPUs are getting utilised, we’re not making much progress towards our naive single core goal. What is going on?

The problem, of course, is that we’re wasting time registering thread sparks for very small expressions (anything under about N=35 or so). We should really not use `par` for those little jobs, since the cost of registering a thread spark outweighs the cost of just evaluating it here and now.

So what we can do is use the `par` version when N is larger, and drop back to straight line code for smaller jobs. That should do the trick.
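A minimal sketch of that idea follows, under a few assumptions: the names `seqFib` and `fibWith` are invented for the example, the cutoff is passed in as a parameter so it is easy to experiment with, and `Control.Parallel` comes from the `parallel` package. It is an illustration of the technique rather than the exact code that produced the timings here.

```haskell
import Control.Parallel (par, pseq)

-- Plain sequential Fibonacci, used below the cutoff where
-- creating a spark would cost more than the work itself.
seqFib :: Int -> Integer
seqFib n
  | n < 2     = toInteger n
  | otherwise = seqFib (n - 1) + seqFib (n - 2)

-- Fibonacci that only sparks for arguments at or above `cutoff`.
-- Both names are hypothetical; the text suggests a cutoff of about 35.
fibWith :: Int -> Int -> Integer
fibWith cutoff n
  | n < cutoff = seqFib n
  | otherwise  = l `par` (r `pseq` l + r)
  where
    l = fibWith cutoff (n - 1)
    r = fibWith cutoff (n - 2)

main :: IO ()
main = mapM_ report [20, 25, 30]
  where
    -- demo values are deliberately smaller than the N=35 from the text,
    -- so this finishes quickly even without optimisation
    demoCutoff = 25
    report n = putStrLn ("n=" ++ show n ++ " => " ++ show (fibWith demoCutoff n))
```

For a real run you would raise the cutoff towards 35 and the inputs into the 40s. The right threshold depends on the machine and the GHC version, so treat it as a knob to tune rather than a fixed constant.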

Ok. Cool, with two cores, and the `par` overhead, we’re actually beating one core now. How about 3?

./real-par +RTS -N3 75.03s user 0.82s system 262% cpu 28.854 total

Excellent. And how about the lot?

./real-par +RTS -N4 76.81s user 0.75s system 351% cpu 22.059 total

Haskell FTW! So that’s scaling up enough for now, and, considering the effort involved to parallelise it, I’m more than happy with that result.

This is, as far as I’m aware, the lightest weight parallelism mechanism in any mainstream language. And the magical thing is that we parallelised our code without ever worrying about synchronisation, communication, race conditions, deadlocks, livelocks, semaphores, mutexes…

The other interesting thing to think about: at what point do we beat the same algorithm in C, and how hard would it be to parallelise the algorithm in C with pthreads… I’m not going to attempt the latter, but we can check the former: