Saturday, 16 January 2010

Naïve Parallelization: C++ vs Haskell

A member of the Haskell community recently published a blog article revisiting our ray tracer language comparison, claiming to address the question of how naïve parallelizations in these two languages compare. The objective was to make only minimal changes to the programs in order to parallelize them and then compare performance.

Our attempts to verify those results turned up a lot of interesting information. Firstly, the Haskell program that was supposedly naïvely parallelized was not the original but, in fact, a complete rewrite. This raises the question of whether or not the rewrite was specifically designed to be amenable to parallelization and, therefore, is not representative of naïve parallelization at all. The C++ used was the original with minimal changes to parallelize a single loop. Secondly, although the serial benchmark results covered a spectrum of inputs, the parallel results covered only a single case and retrospectively identified the optimal results without alluding to the fact that there had been no way to predict them. This raises the concern that the results may have been cherry picked. Finally, those results were gathered on a machine with only 4 real cores.

We have repeated the experiment by applying the minimal change to two of Lennart Augustsson's original Haskell programs, the first and fourth, and compiled them with Lennart's original compile line with -threaded added. Our C++ was parallelized by building the image in-place using a parallel loop and then outputting the image using serial loops, and was compiled with the command line given in our original article with an additional -fopenmp flag. These programs genuinely represent naïvely parallelized code. We shall also present all results for 9, 12 and 13 levels of spheres at 1,024×1,024 and our measurements were made on a machine with 8 real cores: two 2.1GHz quadcore AMD Opteron 2352 CPUs running 32-bit programs.

The following graph illustrates the results for 9 levels of spheres:

In this case, the naïvely parallelized C++ program is 2.5-3.5× faster than the naïvely parallelized Haskell 4 program which is, in turn, 1.8-2.7× faster than the Haskell 1 program on 1-7 cores. These results are consistent with the understanding that the Haskell 4 program is more heavily optimized than the Haskell 1 program. The C++ and Haskell 4 programs eagerly build the scene tree in memory whereas the Haskell 1 program constructs the scene lazily as it is required by the renderer. In this case, there are only 87,381 spheres in the scene so eager construction is fast and effective, taking a small fraction of the total running time and paying off thanks to faster intersection routines. Consequently, the C++ sees a healthy 5.4× speedup moving from 1 to 8 cores.

The results for the Haskell 4 program exhibit two interesting characteristics. Firstly, Haskell is very slow if all 8 cores are used. This is a bug known as the "last core slowdown" so Haskell programmers will expect to get optimal performance from n-1 cores. However, in this case the Haskell 4 program runs fastest with only 6 cores and adding another core actually degrades performance. This is an unexpected result and there was no way to predict that Haskell would see performance degraded when using more than 6 of the 8 cores.

The following graph illustrates the results for 12 levels of spheres:

In this case, the C++ program is 1.7-3.3× faster than either of the Haskell programs but the Haskell 1 program is actually faster than the more heavily optimized Haskell 4 program on 7 cores. The scene now contains 5,592,405 spheres and the time taken to eagerly construct the scene is substantially longer. Consequently, the C++ sees only a 2.3× speedup moving from 1 to 8 cores because its scene construction has not been parallelized.

The performance of the Haskell 4 program not only degrades beyond 5 cores but also exhibits a different profile to that of the Haskell 1 program, which sees improved performance right up to 7 cores. Again, this is an unexpected result and there was no way to predict which Haskell program would be faster or what the optimal number of cores would be. The fastest result for Haskell overall happened to be the Haskell 4 program on 5 cores on this machine.

The following graph illustrates the results for 13 levels of spheres:

In this case, we see the incredible result that the fastest program overall is the least optimized Haskell 1 running on 7 cores! This is another unexpected result but it is easy to explain: Haskell 1 is the only program where the scene is constructed in parallel. In contrast, The C++ and Haskell 4 programs waste most of their time in serial code that constructs a scene tree that consumes gigabytes of memory. Consequently, the C++ sees only a 40% performance improvement moving from 1 to 8 cores.

Although this last best-case result where naïvely-parallelized Haskell is 20% faster than naïvely-parallelized C++ is very impressive, there is no way to predict if or when such a result might occur. Consequently, the worst case results are equally important in practice and, in this case, we find that the same winning Haskell program is also up to 8.2× slower than C++ on 1-7 cores in another case: 9 levels of spheres on 2 cores. However, the Haskell program is much simpler than the C++ and its parallelization is statically proven to be free of concurrency bugs such as race conditions. These benefits are well worth this level of performance degradation in many applications so parallel Haskell would seem to have immediate practical applications.

8 comments:

This makes me realize how easy it is to be biased toward one thing or another, especially since I realize I give credibility to your post based only on the outcome: That Haskell won out and that I like that.

The revisions to the program were intended to improve sequential performance (as clearly explained in the post), so I don't think it's entirely fair to claim that it's not a naive parallelisation. He optimised sequential performance first, and then parallelised the result - as per normal best practice, as long as the sequential optimisations don't make it more difficult to parallelise later.

Will you be including the revised version in your language comparison?

You didn't mention which GHC version you used here - 6.12.1 I presume?

That is not true. This new "serial" Haskell code is slower than Lennart's original on 1 core but it scales better. Cherry picking results like that is certainly not "best practice", it is bad science.

I'm more than happy to compare this new Haskell program (that was obviously built for parallel and not sequential performance) to a similarly rewritten parallel C++ program when I find the time to do the rewrite.

On Reddit, Saynte simply says that he improved performance by 2-3% over Lennart's fourth Haskell implementation. He says nothing about Lennart's fifth Haskell implementation that was 24% faster than the fourth in my original benchmark. I have since shown that the fifth version does not scale when naively parallelized.

The fact that he dropped the fifth version for one that scales better and then backported only optimizations that preserve that scalability is a remarkable coincidence. Indeed, it is almost as if he had not naively parallelized the Haskell code at all but, in fact, went to great lengths to make it amenable to parallelization on the basis of extensive performance measurements before pretending that he had naively made a one-line change.