Tuesday, 26 January 2010

Naive parallelism with HLVM

The latest OCaml Journal article, "High-performance parallel programming with HLVM" (23rd January 2010), described a simple but effective parallelization of the HLVM implementation of the ray tracer. Comparing with similarly naive parallelizations of the fastest serial implementations in C++ and Haskell, we obtained the following results:

These results exhibit several interesting characteristics:

Even the naive parallelization in C++ is significantly faster than HLVM and Haskell.

C++ and HLVM scale well, with performance improving as more cores are used.

Despite having serial performance competitive with HLVM, the naively-parallelized Haskell scales poorly. In particular, Haskell failed to obtain a competitive speedup with up to 5 cores and its performance even degraded significantly beyond 5 cores, running 4.4× slower than C++ on 7 cores.

The efficiency of the parallelization can be quantified as the speed of the parallel version on multiple cores relative to its speed on a single core:

These results show that HLVM obtained the largest speedup: 6.3× faster on 8 cores. C++ obtained the next largest speedup: 5.2× faster on 8 cores. Haskell obtained the worst speedup: only 2.9× faster on 8 cores.
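The speedup measure used above can be written down as a short C++ sketch. The function names are ours, purely for illustration, and the per-core efficiency (speedup divided by core count) is a standard derived figure that we add here; it is not one of the quantities plotted in the article.

```cpp
#include <cassert>

// Speedup: how many times faster the multi-core run is than the
// single-core run of the same parallel program.
double speedup(double seconds_on_one_core, double seconds_on_n_cores) {
    assert(seconds_on_one_core > 0.0 && seconds_on_n_cores > 0.0);
    return seconds_on_one_core / seconds_on_n_cores;
}

// Per-core efficiency: speedup divided by the number of cores used.
// A value of 1.0 would mean perfect linear scaling.
double parallel_efficiency(double speedup_factor, int cores) {
    assert(cores > 0);
    return speedup_factor / cores;
}
```

Applied to the 8-core speedups quoted above, this gives a per-core efficiency of roughly 0.79 for HLVM (6.3/8), 0.65 for C++ (5.2/8) and 0.36 for Haskell (2.9/8).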

More sophisticated parallelizations could obtain better performance still. Currently, the parallel for loop over the rows of the image causes each thread to compete for an iteration and results in poor locality: if a thread computes one row then it is unlikely to compute the next. Work-stealing queues and recursive subdivision of the problem space would greatly improve locality and, therefore, performance. Also, the initial construction of the scene tree has some opportunity for parallelization and further performance improvements.
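To make the locality argument concrete, here is a minimal C++ sketch contrasting the two schemes. It is not the code benchmarked in this article: `RowFn`, `render_naive`, `render_divided` and the `grain` parameter are hypothetical names, and the recursive version simply spawns `std::async` tasks rather than using a real work-stealing scheduler, which would normally be supplied by a task-parallel runtime.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for tracing one image row.
using RowFn = std::function<void(int)>;

// Naive scheme: every thread competes for the next row through one shared
// counter, so consecutive rows usually end up on different threads.
void render_naive(int rows, int threads, const RowFn& render_row) {
    std::atomic<int> next_row{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int y = next_row.fetch_add(1); y < rows; y = next_row.fetch_add(1))
                render_row(y);
        });
    }
    for (auto& th : pool) th.join();
}

// Recursive subdivision: halve the row range [lo, hi) until a block has at
// most `grain` rows, then render that contiguous block on a single thread.
void render_divided(int lo, int hi, int grain, const RowFn& render_row) {
    assert(grain > 0);
    if (hi - lo <= grain) {
        for (int y = lo; y < hi; ++y) render_row(y);
        return;
    }
    int mid = lo + (hi - lo) / 2;
    // The lower half runs as a separate task; this thread takes the upper half.
    auto lower = std::async(std::launch::async, [=, &render_row] {
        render_divided(lo, mid, grain, render_row);
    });
    render_divided(mid, hi, grain, render_row);
    lower.get();  // wait for the lower half before returning
}
```

In the second version each leaf task touches a contiguous band of rows, which is the property that should improve locality; how much it helps in practice would need to be measured.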

Furthermore, there is plenty of room to optimize HLVM itself and some simple hacks can quantify the possibilities. Disabling bounds checking provides a substantial 20% performance improvement. Disabling the shadow stack (and, therefore, disabling GC) provides another 25% performance improvement. With these two changes, HLVM is only 24% slower than C++. In particular, HLVM's current design makes heavy reuse of its own generic routines even in performance-critical sections such as manipulating the shadow stack. Optimizing these routines by hand, for example by removing bounds checks from manipulations of the shadow stack, could realize some of this potential.
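For readers unfamiliar with the idea, a shadow stack is an explicit stack of references that the garbage collector can scan as roots. The following C++ sketch shows what a bounds-checked and an unchecked push look like. It is a hypothetical illustration only: HLVM generates its own code for this, and the class and method names below are assumptions, not HLVM's API.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical shadow stack: an explicit array of pointers to live
// references that a garbage collector could scan as roots.
class ShadowStack {
public:
    explicit ShadowStack(std::size_t capacity)
        : slots_(capacity, nullptr), depth_(0) {}

    // Checked push: pays for a bounds test on every call.
    void push_checked(void* root) {
        if (depth_ >= slots_.size())
            throw std::out_of_range("shadow stack overflow");
        slots_[depth_++] = root;
    }

    // Unchecked push: the caller must guarantee there is room. The assert
    // disappears when compiled with NDEBUG, leaving a bare store.
    void push_unchecked(void* root) {
        assert(depth_ < slots_.size());
        slots_[depth_++] = root;
    }

    // Drop `count` roots, as a function would on exit.
    void pop(std::size_t count) {
        assert(count <= depth_);
        depth_ -= count;
    }

    std::size_t depth() const { return depth_; }

private:
    std::vector<void*> slots_;
    std::size_t depth_;
};
```

Because pushes and pops happen on every function entry and exit that holds references, even a single removed comparison per push can add up, which is the kind of saving the paragraph above has in mind.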

Despite the fact that everything you say is accurate... it systematically looks like you are trying to make Haskell look bad... a newcomer coming to this blog not knowing the context will jump to the conclusion that Haskell sucks.

Perhaps the whole "naive" implementation/parallelization should be given a break and you should tune the C++/HLVM scenarios and pit them against Saynte's not-so-naive implementation.

Because, in the end, a company that will go through all the trouble to parallelize an algorithm will do it for performance and will do everything it can to get the best implementation and not the naive one.

I believe it is one of the features Haskell can rightfully claim as having a simple way to lazily build the scene.

Perhaps you should actually compare the actual lines of code in each implementation instead.

@David: These results prove that Haskell gives you a choice between naive parallelism that scales and performs poorly or difficult parallelism that scales well but still performs poorly. So is it not correct for a newcomer to jump to the conclusion that Haskell sucks?

Incidentally, I have repeated the experiment with a naive parallelization of Lennart's fourth version and the results are almost identical to those for the fifth: Haskell gives poor absolute performance and scales badly.

You say "go through all the trouble to parallelize an algorithm" but the C++ required only 8 different lines of code out of 143 to achieve good results. That is a comparatively tiny effort. Even Saynte's "one line change" was actually 3 lines different.

A future article will compare more extensive rewrites in other languages to Saynte's extensive rewrite in Haskell.

Finally, you say that "one of the features Haskell can rightfully claim as having a simple way to lazily build the scene" which is true but is it useful? Outside Haskell, laziness is one of the least used functional features and, in fact, I believe laziness is precisely the reason Haskell does so badly on this benchmark.

Your article from Jan 16th seemed to show a parallelized version of the Haskell 1 algorithm that scaled rather aggressively.

I just find it rather too common that I can read here and there that "Haskell does this, or F# yields that" while one would rather read "The Haskell -implementation- scales poorly."

Then again, one's got to admit that Haskell's (or should I say GHC's?) last-core bug is rather aggravating. One's got to question the usefulness of that "feature" (is it?) where it considers all the cores to be wholly dedicated to the process.

Also, I take issue with GHC's lack of support for a viable hashtable implementation. I find that one can stick hashtables into a whole lot of naive algorithm implementations for excellent results.

For example, I came up with the following (rather elegant, I think) non-brute-force algorithm to solve Project Euler's 11th problem, which seems to build and run (in LINQPad) in under 50ms (granted, brute-force C implementations solve the problem even faster).

@David: Excellent question. The compelling results for the first Haskell version that we published on the 16th of January were actually misleading. Further investigation revealed a subtlety we had missed in our first article: the superior scalability occurs only because the first version uses a different algorithm, specifically one with a symbolic bounding volume hard-coded. We have since ported the same algorithm to OCaml, where it produced similarly spectacular scalability and works with arbitrarily complicated scenes. However, if you crank up the resolution so that even the smallest spheres are visible (e.g. 9,512 or 10,1024 or 11,2048 or 12,4096) then that first Haskell version is not competitive when parallelized: it takes 81s on this benchmark whereas C++ takes only 12.6s.

Naively parallelized second to fifth Haskell versions using the parameters from this article all show the same result: Haskell stops scaling at 5 or 6 cores. That is an important result because it is contrary to all of the Haskell literature that portrays purely functional programming as a panacea for parallelism and the paradigm of the future in this multicore era.

The last-core slowdown is certainly a silly problem but I think there is more to it than the explanations I have seen. HLVM uses exactly the same kind of naive spinlock and it never suffers a last-core slowdown on Linux.

That's a very interesting Project Euler problem and a nice solution! I'll have to study it in detail...