Boids Simulation: Part 5

After part 1, part 2, part 3 and part 4 of this guide, we now have fully functional boids. In this part of the guide I take a first look at performance.

Our boids are functionally complete. They are also concurrent: we have a thread per boid, plus one for the cell and one for the display. But will this translate into a parallel speedup? To find out, I needed more than the netbook on which I’m writing this post. Thanks to Fred Barnes, I have a login on an eight-core x86 machine (“Intel(R) Xeon(R) E5310 @ 1.60GHz”) called octopus.

So I darcs-pushed my code from part 4 across to octopus. I stripped out the display aspect and changed the program to run 2000 frames before exiting, forcing the evaluation of the boids’ positions each frame. That way I should be able to time how long the program takes to run on different numbers of cores. The appropriate GHC run-time option is “+RTS -Nx”, where x is 1 through 8 for the number of cores we want to use. I then graphed the results: the red line represents perfect speed-up (the one-core time divided by the number of cores), and the blue line is the actual time. Here’s my first result:

That huge spike on the right is obviously some sort of oddity in the run-time. I filed ticket #3518, but for now setting the heap with “-H400M” clears it up:

Anyway, the overall result of the graphs is bad. Our program is taking longer (in wall-clock time) the more cores we add. Rather than parallel speed-up, we have parallel slow-down. So what can we do about it?

Given that we are in Haskell, it is instructive to think about where and when the values in our program are actually being evaluated. Our channels are not strict, so values sent down channels can be unevaluated thunks.
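To make the point about thunks concrete, here is a minimal sketch using a plain MVar from base as a stand-in for a channel. This is an assumption for illustration only: CHP channels are synchronous and are not MVars, and the names sendLazy, sendStrict, expensive and throws are made up for this example. The "expensive" value raises an error when evaluated, which lets us see which side of the communication actually does the evaluation.

```haskell
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (ErrorCall, evaluate, try)

-- Lazy send: the MVar stores whatever thunk it is given, unevaluated.
sendLazy :: MVar a -> a -> IO ()
sendLazy = putMVar

-- Strict send: the sender evaluates the value (to weak head normal form,
-- which is the whole value for an Int) before handing it over.
sendStrict :: MVar a -> a -> IO ()
sendStrict var x = evaluate x >>= putMVar var

-- A value that blows up when evaluated, so we can see who evaluates it.
expensive :: Int
expensive = error "evaluated"

-- True if running the action raised an ErrorCall.
throws :: IO a -> IO Bool
throws act = do
  r <- try (act >> return ())
  return $ case (r :: Either ErrorCall ()) of
    Left _  -> True
    Right _ -> False

demo :: IO (Bool, Bool, Bool)
demo = do
  lazyVar <- newEmptyMVar
  sendFailed       <- throws (sendLazy lazyVar expensive)
  receiverFailed   <- throws (takeMVar lazyVar >>= evaluate)
  strictVar <- newEmptyMVar
  strictSendFailed <- throws (sendStrict strictVar expensive)
  return (sendFailed, receiverFailed, strictSendFailed)

main :: IO ()
main = demo >>= print   -- prints (False,True,True)
```

The lazy send succeeds without touching the value, and the work (here, the error) lands on whoever reads it. With the strict send, the sender pays for the evaluation, which is what we want each boid to do with its own velocity.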

Our intention in having a process per boid was to enable parallel speed-up. So a good step is to make sure that the boid evaluates its new velocity itself (rather than leaving it to the cell process). This is quite a simple matter: CHP provides a writeChannelStrict function that is just like writeChannel, but that uses the NFData class from Control.Parallel.Strategies to fully evaluate the value before sending it. So we add some instances to enable that:

    instance NFData BoidVel where
      rnf (BoidVel a b) = rnf a >| rnf b
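The instance above is written against the Control.Parallel.Strategies API of the time, where >| sequences two evaluations. As far as I know, newer library versions keep NFData in the separate deepseq package, where rnf returns () and the results are combined with seq instead. A minimal sketch of the equivalent instance, assuming BoidVel has two Float fields (the field types are a guess, not taken from the earlier parts):

```haskell
import Control.DeepSeq (NFData (..), force)
import Control.Exception (evaluate)

-- Assumed shape of the velocity type; the field types are a guess.
data BoidVel = BoidVel Float Float
  deriving (Show, Eq)

-- Fully evaluate both components: rnf returns (), so seq chains them.
instance NFData BoidVel where
  rnf (BoidVel a b) = rnf a `seq` rnf b

main :: IO ()
main = do
  -- force + evaluate reduce the value to normal form here, rather than
  -- leaving the arithmetic as thunks for some later consumer.
  v <- evaluate (force (BoidVel (1 + 2) (3 * 4)))
  print v
```

The idea is the same in both versions: the instance says how to walk the whole value, and a strict send uses it so that nothing unevaluated crosses the channel.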

Then in the boid code, we change writeChannel out cur to writeChannelStrict out cur. And that’s all that was needed to add the strictness. Now we can time it again:

That is a little better (if you compare the graphs closely), but we still have parallel slow-down rather than speed-up. But our boids should be able to get parallel speed-up by evaluating their new position in parallel. Let’s consider what is happening with the cell and the boids as a possible cause. Here’s the cell code again:

The cell process reads in the boid velocities sequentially, then sends out the new positions (trivially calculated from the velocities and old positions) to the drawing process, then sends out the neighbour information sequentially, then recurses. (What may happen is that each boid is sent a thunk that will calculate its neighbour information, so each boid will calculate neighbour information in parallel rather than the cell doing it and forming a bottleneck, which would be neat!) Hmmm, there are a few too many mentions of “sequentially” in that description! We missed an opportunity for concurrency, so let’s rectify that:

We just change our uses of mapM and zipWithM into map and zipWith, then pass the results of these (lists of monadic actions) to runParallel. Let’s see if that made a difference:
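The shape of that change can be sketched without CHP. Below, runParallelIO is a hypothetical stand-in for runParallel that works on plain IO actions using forkIO and MVars from base. It is not how CHP implements runParallel, and it ignores exceptions (if an action throws, the collecting thread will block). It only shows the move from "mapM runs each action in turn" to "map builds a list of actions, which are then all run at once".

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Hypothetical stand-in for CHP's runParallel, for plain IO actions:
-- fork every action, then collect the results in the original order.
runParallelIO :: [IO a] -> IO [a]
runParallelIO actions = do
  vars <- mapM fork actions
  mapM takeMVar vars
  where
    fork act = do
      var <- newEmptyMVar
      _ <- forkIO (act >>= putMVar var)
      return var

-- A trivial action standing in for one communication with a boid.
double :: Int -> IO Int
double x = return (x * 2)

main :: IO ()
main = do
  sequential <- mapM double [1, 2, 3]                 -- one after another
  parallel   <- runParallelIO (map double [1, 2, 3])  -- all forked at once
  print (sequential == parallel)
```

The results come back in the same order either way; what changes is that no action has to wait for the previous one to finish before it can start.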

Those two changes (the strict-send and the parallel communications) have finally delivered some parallel speed-up. It’s not as much as we might wish for, as it seems to tail off around 2.5x speed-up. I hope to investigate this further at some point, but I suspect that the ratio of communication to computation may be part of the problem.

Optimising for parallel performance is hard in any setting, and being in Haskell (which makes it hard enough to optimise for sequential execution) certainly makes life interesting. Perhaps I can wildly generalise this post to throw together some guidelines on optimising:

Try to work out where and when your values are actually being evaluated. In general, if the values are used to take a different monadic action, or if they are sent out of the program, they are forced. Otherwise they probably aren’t, and are being sent around as thunks. Find where you want values to be evaluated to get the most speed-up and try changing writeChannel to writeChannelStrict.

Vary the amount of parallelism. In this example, my sequential communications needed to be made parallel. Sometimes the opposite is true. Optimisation is not straightforward (alas).