n-bodies: implementing a parallel solution in Intel TBB

This is a computer translation of the original content. It is provided for general information only and should not be relied upon as complete or accurate.

Hrrrumph! I’m not sure I’m cut out for this blogging business, which I hear now is becoming passé anyway. Twitter has become all the rage. I’m too much a Luddite to even conceive of a meaningful technical discussion in blots of 140 characters or less. TwitterTech? Nah!

And I think I have a one-track mind-single threaded, wired so in the wetware! ;-) I get immersed in subjects that become consumptive. So it’s been many months since I’ve had a chance to reflect on my work or find an opportunity to talk about something that’s not customer confidential.

Back in April on the Intel Threading Building Blocks forumdreamthtr started a thread about trouble encountered trying to make a parallel version of the n-body gravitational dynamics problem. In the course of that dialog, Matteo Frigo over at Cilk Artsblogged a post about novel algorithm for managing the interaction process between the n bodies. At the time, I commented that I thought I could implement the same algorithm using the new parallel_invoke that had appeared in a recent open source release of TBB.

Little did I know how new that new feature was, or the hoops I would eventually surmount to deliver on that hunch. Things have finally started to come together, though, and coincident with the release of Intel Parallel Studio and the soon-to-be-unveiled Intel® Threading Building Blocks 2.2 release.

I want to thank Matteo for sharing the algorithm, which has revealed some stunning features in my experiments using TBB and Intel Parallel Studio to implement and tune it. Since dreamthtr started with a 2D cut of the problem and Matteo fully described his algorithm expressed in Cilk (naturally), I figured I’d take it a step deeper and go 3D with a ballistic stage, and turn it into a tutorial sequence where I try to show my design process integrated with the use of Intel Parallel Studio tools.

So hopefully this will be the first of a sequence of small narrations about the process, the problems and how I overcame them. It may be intermittent-I have a couple of weeks of vacation soon-but I think it’s an interesting story and one that I’ll enjoy writing about. I hope you’ll enjoy them as much as I will.

6 comments

Thanks for the details, Matteo. I'm back now from my vacation, and immediately overloaded with work, but hopefully I'll find time in the next day or so to continue my narrative about the problem and how the conventional "solutions" fare, as preface for showing your recursive approach cast in TBB colors. I'm certainly not going to try to exceed you in static analysis of the method in the real world but perhaps an experimentalist's approach can at least reveal the landscape of a work-stealing, private L1, private or shared L2, etc. environment.

Concerning the literature on the algorithm: the problem is isomorphic to the ``tournament'' problem, which
is the problem of playing each of n teams with every other team exactly once. (It is also the problem of all-to-all communication in MPI). This problem is well studied in combinatorics, and Charles Leiserson told me that there is a standard
construction for n=2^k in which you split the n teams into two halves, recursively play games within each half,
and then play games between the two halves in some round-robin order. Thus, my algorithm is a variant of this
known idea. (Other constructions exist. E.g., in round k, you match teams i,j such that i<j and i+j = k (mod n).
This yields a program with a couple of cases, depending upon whether i+j wraps around mod n.)

On the other hand, the algorithmic aspects of my technique are less well-known and probably novel. As you
have observed, the recursive algorithm is faster than the doubly nested loop. One property of the sequential recursive
algorithm is that it is ``cache oblivious'': it uses the cache optimally no matter how big your cache is (assuming a sane cache). The idea is that the recursion splits the problem into smaller subproblems. Once a subproblem of size
n fits into cache (i.e., n = Theta(C) where C is the cache size), you compute n^2 = Theta(C^2) matches in exchange
for C cache misses, i.e., you get C matches per cache miss on average. In contrast, the doubly nested loop
will essentially miss in cache for every iteration of the inner loop if n is much larger than C. So we trade off
recursion overhead but fewer cache misses for loop overhead with more cache misses and better prefetching behavior.

Understanding the cache behavior of the parallel recursive algorithm from my blog is much more complicated.
I gave it my best shot in the my paper ``The cache complexity of multithreaded cache oblivious algorithms'',
which analyzes similar recursions for other problems (stencil computations, LU decomposition, and LCS-style dynamic
programming). My analysis assumed completely private caches; an alternative analysis in a world of completely
shared caches was performed by Guy Blelloch of CMU and Phil Gibbons of Intel Research. As far as I know,
nobody knows how even this simple algorithm behaves in a real world that features work-stealing, private L1, private or shared L2, and shared L3 caches, and possibly hyperthreading (which is equivalent to sharing the L1 cache).

Well, it's a great example. I haven't built a viewer. I've thought of adding one at some point but my focus so far has been on stumbling into the obvious errors and seeing whether or not our new tools can catch them. So instead of being able to say, "that's not normal ballistic behavior" I have to notice, "oh look: here's a race condition." The coolest bit, and I'll reveal the secret for you, is that I got better performance than when I ran the racy, unsafe straight parallel-for. I haven't dug into the data yet but my hunch is that there is enough cache collision pressure with eight HW threads blindly banging that the delays imposed exceed the overhead of splitting and joining in the recursive algorithm.

Matteo, is there any literature on this algorithm, or is it really your invention? (With impending vacation, I have haven't tried to do a literature search.)

About a month ago we (Cilk Arts) ran a two-day class on parallel programming and we used this n-body example
as an exercise. The example was simulating a 2D universe with quasi-real
gravitational forces, integrating the motion of masses over time. We provided infrastucture to output a series of png files and a javascript
program to show the frames in rapid succession within a web browser, so that people could see an animation of their
particles wandering around. People liked it---it's a fun problem.

I hope my ramblings will provide you some value, though I will be pacing them as a tutorial since there seem to a lot of people just starting to learn about TBB and parallelism in general. This hopefully will be helpful for them but may be frustrating for an advanced user such as you. Perhaps I should sneak a peak at the end before showing all the effort it took to get there? Of course, if I do that, I may get distracted from my tutorial goal in responding to any commentary that ensues. All such concerns will have to wait a week, though (I'm on vacation next week).

In the meantime there may be a way you can get a copy of the code: next Friday in the San Francisco bay area, a couple of colleagues of mine will be running a pilot session for a new class on writing parallel code and n-bodies will be part of the distributed materials for the class. You can find out more details here: http://software.intel.com/en-us/blogs/2009/06/04/learn-parallelism-and-threading-opportunity-to-attend-pilot-class-for-free/