The Many Flavors of Parallelism

In my last blog, I described why parallel programming is hard. In the next few blogs, I’ll start to describe how we’re trying to make it easy (there’s tons of good work at Intel on this).

When I first started writing this entry, I intended to write about applications and the programming models that work for them. Instead, I found myself bogged down explaining the differences among parallel programming models. The problem with trying to characterize a particular application as dominated by one flavor of parallelism or another is that these flavors are themselves not orthogonal. It’s easy to confuse how parallelism is implemented with how it is expressed (i.e. how the programmer writes it down). Sometimes it is easier to express parallelism in one style but have it implemented in another. One way of looking at it: you want applications to drive how programmers express parallelism, and the underlying hardware to drive how it’s implemented.

A case in point is our research project Ct, which is nominally a data parallel programming model but is implemented as a bunch of fine-grained tasks and synchronizations (I’ll describe Ct more in a later blog). I often think of parallel algorithms as beautiful crystalline structures composed of little tasks, data flows, and synchronizations. Depending on which facet you view the algorithm through, it can look “data” parallel, “task” parallel, or like one of the many other qualified parallelisms.

Early on in the Ct project, though Ct is nominally a “data parallel” programming model, we took a fundamentally different approach. We decided to let our applications knowledge and experience drive our design, rather than letting architectural constraints or rigid adherence to one of the parallel models drive it. Using patterns of computation that we found in a range of applications (including game physics, graphics, search, data mining, Monte Carlo simulations, and image and video processing), we drove the design of Ct and ended up with something quite interesting. As you’ll see, even for simple examples, we consistently found it more confusing than not to try to pigeonhole an application into one category of parallelism or another.

As I mentioned in my last blog, one of the first things we looked at was video processing because of its extreme compute intensiveness and, to be honest, the (erroneous) perception that it was “low-hanging fruit”. For the most part, my assumption about the parallelism was correct: we were able to just break up each video frame into chunks for parallel threads to process (for 1080p HDTV resolution, there are 1920×1080 pixels’ worth of data to process in parallel!). What would you call this type of parallelism? Well, for one, the amount of parallelism is driven by the size of the data collection I’m processing (i.e. the video frame resolution), and I can define my processing in terms of what happens at each pixel of the video frame.

But, on the other hand, I ultimately will end up using threads to process this…so is it task parallel? Let’s work through a simple example:

A data parallel programming model is one in which you express parallelism semi-implicitly through operations on data collections (think of a vector, stream, set, array, matrix, tree, etc.). That is, any operation that you apply to the collection potentially executes in parallel. For example, if a programmer wishes to sum the elements of two vectors together, she simply writes an expression that adds the two collections, something like A = B + C.

Here’s some data-parallel pseudocode (for non-techies, this is meant to be a readable, though not necessarily executable, rendition of the code):

Vector B = {1 2 3 4 5 6 7 8};
Vector C = {8 7 6 5 4 3 2 1};
Vector A = B + C; // element-wise add the two vectors

One way this might be compiled for dual core is to split up the vectors into two parts and add the two parts on separate cores:

Vector Bpart1 = {1 2 3 4};
Vector Bpart2 = {5 6 7 8};
Vector Cpart1 = {8 7 6 5};
Vector Cpart2 = {4 3 2 1};
Vector Apart1;
Vector Apart2;

spawn { // create a task on one core
    Apart1 = Bpart1 + Cpart1; // element-wise add the two arrays
}

spawn { // create a task on the other core
    Apart2 = Bpart2 + Cpart2; // element-wise add the two arrays
}

// The result will be Apart1 = {1+8 2+7 3+6 4+5} = {9 9 9 9}
// and Apart2 = {5+4 6+3 7+2 8+1} = {9 9 9 9}

Hopefully, you’d get close to a 2x speedup if you implement this on something like OpenMP. (In practice, you’d need much longer vectors to amortize the threading overhead, but this is the basic idea.)
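To make the spawn pseudocode concrete, here is one hypothetical rendition in C++, with std::thread standing in for spawn; the function name parallel_add is invented for illustration, and this is a sketch rather than how a data parallel runtime would actually do it:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Element-wise add of b and c, split into two tasks as in the pseudocode:
// one thread handles the first half, another handles the second half.
std::vector<int> parallel_add(const std::vector<int>& b,
                              const std::vector<int>& c) {
    std::vector<int> a(b.size());
    std::size_t half = b.size() / 2;
    auto add_range = [&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i)
            a[i] = b[i] + c[i];  // element-wise add
    };
    std::thread t1(add_range, 0, half);         // task on one core
    std::thread t2(add_range, half, b.size());  // task on the other core
    t1.join();
    t2.join();
    return a;
}
```

Called with B = {1 2 3 4 5 6 7 8} and C = {8 7 6 5 4 3 2 1}, this produces the all-nines result shown in the comments above.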

This echoes some slides a colleague and I created to illustrate the value of working at the data parallel level, especially for less experienced parallel programmers. The main point here is that the size of the vector (among other things, like cache sizes and threading overhead) drives the choice of implementation. Here we want to add together all the elements of a vector. For small vectors, it might be faster to just sum the elements together sequentially.
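For illustration, a minimal sequential version might look like the following (sum_sequential is an invented name):

```cpp
#include <vector>

// Plain sequential sum: for short vectors, this simple loop likely beats
// anything that pays vectorization or threading overhead up front.
int sum_sequential(const std::vector<int>& v) {
    int total = 0;
    for (int x : v)
        total += x;  // one element at a time
    return total;
}
```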

For larger vectors, it becomes profitable to use a vector ISA (like SSE) to accelerate the computation.
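A sketch of what the SSE-accelerated sum could look like, using SSE2 integer intrinsics (this assumes an x86 target; sum_sse is an invented name):

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics
#include <vector>

// SSE2 sum: accumulate four 32-bit partial sums per instruction, then
// combine the four lanes and handle any leftover elements in scalar code.
int sum_sse(const std::vector<int>& v) {
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4)  // 4 adds per instruction
        acc = _mm_add_epi32(
            acc, _mm_loadu_si128(reinterpret_cast<const __m128i*>(&v[i])));
    alignas(16) int lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < v.size(); ++i)  // scalar tail
        total += v[i];
    return total;
}
```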

For even larger vectors, we can use a combination of threading and a vector ISA to accelerate the computation.
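A sketch of the combined version: split the vector across threads, with each thread’s tight inner loop left as a candidate for the compiler’s SSE code generation (sum_threaded and the two-thread default are illustrative assumptions):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Threads plus (compiler-generated) vector code: each thread sums its own
// contiguous chunk, and the per-thread partial sums are combined at the end.
long long sum_threaded(const std::vector<int>& v, unsigned nthreads = 2) {
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> pool;
    std::size_t chunk = (v.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = std::min(v.size(), t * chunk);
        std::size_t hi = std::min(v.size(), lo + chunk);
        pool.emplace_back([&, t, lo, hi] {
            long long s = 0;
            for (std::size_t i = lo; i < hi; ++i)
                s += v[i];  // tight loop, vectorizable by the compiler
            partial[t] = s;
        });
    }
    for (auto& th : pool)
        th.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```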

The punch line is this: we can use tasks or threads to implement this, but without knowing the size of the vector, we’d have to implement it three different ways (maybe more). The data parallel approach hides this from the programmer by making the choice dynamically. All the programmer needs to write is something like: result = sum(aVector).
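To make that concrete, here is a hypothetical sketch of the dispatch a data parallel runtime might perform behind result = sum(aVector); the size threshold is invented for illustration, not measured:

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical runtime dispatch: pick an implementation based on the size
// of the input, so the programmer only ever writes sum(aVector).
long long sum(const std::vector<int>& v) {
    if (v.size() < 1024)  // small: plain sequential loop
        return std::accumulate(v.begin(), v.end(), 0LL);
    // Large: two threads, each summing half (a stand-in for the
    // threads-plus-vector-ISA version).
    long long s1 = 0, s2 = 0;
    std::size_t half = v.size() / 2;
    std::thread t([&] {
        s1 = std::accumulate(v.begin(), v.begin() + half, 0LL);
    });
    s2 = std::accumulate(v.begin() + half, v.end(), 0LL);
    t.join();
    return s1 + s2;
}
```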

Bear in mind that data parallelism doesn’t work for all algorithms. Typically, though, you can implement less general models (e.g. data parallel) on top of more general ones (e.g. task parallel), as I’ve just shown. For example, you can implement a streaming model on a data parallel model, which can be implemented on a nested data parallel model, which can be implemented on a task parallel model, and so on. This is generally how things fit together (picture the models as concentric rings, with the most general models on the outside):

The increasing flexibility in the outer rings of this picture comes with some pitfalls (like data races). An important research agenda of ours is to eliminate data races for future tera-scale architectures…or more generally, create deterministic (roughly: predictable regardless of the core count) programming models that encompass as much of this picture as possible.

Notes on my last blog: I was thrilled by the responses I got. There were very good questions and observations there, more than I had time to respond to. I want to take this opportunity to point people to some of the great work from our software tools division at Intel on the problems I outlined. In particular, take a look at Threading Building Blocks, which was just open-sourced!

6 Responses to The Many Flavors of Parallelism

Hi Anwar,
This is a very interesting topic – great post! I’ve always wondered how various parallel programming models compare in terms of their expressive power, and which models are best suited for a particular class of applications. Hopefully you will post more about this topic in the future (I love the ring diagram). Btw, do you know of any formal/academic work that attempts to answer these questions?
Also, just to be clear, in the vector sum example, for the medium size case, are we breaking a 32-element vector into 2 16-element vectors and adding them to get a partial sum?

Anwar,
Great post. My company supports multiple OSes/chip architectures (we’re in the simulation business, so lots of parallel work). Will Ct support non-Intel hardware? I understand if Intel won’t do this, but if Ct is open-sourced, others can pick up the ball.

Anwar,
Very informative post. From what I understand about how you are using data parallelism in this article, a question comes to mind: isn’t this very SIMD in nature and so not prone to taking advantage of many threads? In your example, it makes sense that a few threads may join into the computation on the vector, but what about systems with hundreds of cores? The programming model would seem to break down at that point and not offer much more than what is available today.

Thanks for the comments, folks!
JRH: I don’t know of any such work offhand, but I’m sure there have been attempts. The tricky part is that all these models are really equally expressive; it’s just that there’s a lot of pain (and performance lost) associated with moving an application from the outer rings to the inner rings. Anecdotally, we can probably look at particular cases, like moving an application from a CPU to a GPU.
Jon: We probably won’t port Ct to non-Intel hardware, but per the response above, this might be an interesting exercise. We’ve architected the implementation so that the architecture independent pieces are pretty cleanly separated from the dependent pieces (so somebody else can do this). A lot of effort goes into this latter part. Even trying to guarantee forward-scaling across IA is non-trivial.
Jim: It is SIMD in nature, but, in some cases, there is plenty of data to break up into hundreds of threads. This is more true in traditional high performance computing (HPC) segments, but even in things like games, the collections are trending to be large (10s to 100s of thousands of rigid bodies in the future). The other trick here is that independent data parallel computations can run in parallel. In fact, and I’ll say more about this at some point in the future, we designed Ct to be thread safe (for use with legacy threading APIs) and we’ve added deterministic tasking extensions. This allows the runtime to schedule several simultaneous data parallel computations.

Luc Bougé, in “The Data Parallel Programming Model: A Semantic Perspective” (Lecture Notes in Computer Science, vol. 1132, pp. 4–26, 1996), argues that “the data parallel model of HPF or C* can be seen as a structured version of the control-parallel model of CSP.” He concludes that “data parallelism and control parallelism are not mutually exclusive. … The former is a high-level model and the latter is a low-level model.” In that sense, I guess one can say the data parallel model is a subset of the control parallel model.