The Many Flavors of Data Parallelism

Data parallel programming models have been “in vogue” lately because of their prevalence in GPGPU programming. As I alluded to in my previous blog, there are other reasons we should be looking at data parallelism….but not all of these models are created equal. So, let’s take a tour of data parallel programming and touch on what “nested” and “flat” data parallel models are and what kernel and vector styles are. (If I’m not careful here, you’ll draw the conclusion that Excel comprises a data parallel programming model:)) I’ll use some Ct “syntax” in places, but I’ll try to explain what it means as I go. Warning: There is (mostly correct) code below, but I’ll try to explain what’s going on if you aren’t a programmer.

From the data parallel point of view, there are a number of ways I can implement an algorithm. For an image or video processing algorithm, I can specify what happens at each pixel…or I can treat the image as an object upon which I compose multiple operators. Here’s an example of what I mean using a simple color-space conversion:

for (row = 0; row < 1080; row++) {

for (col = 0; col < 1920; col ++) {

R[i][j] = (c1Y[i][j] +c2U[i][j] + c3V[i][j])

G[i][j] = (c4Y[i][j] +c5U[i][j] + c6V[i][j])

B[i][j] = (c7Y[i][j] +c8U[i][j] + c9V[i][j])

}

}

(For non-techies, what’s going on here is that we’re computing some new colors in the RGB color space, like your display uses, from the YUV color space, like a lot of video streams use. This code uses two loops to run over the rows and columns of the image.)

One data parallel style (which I’ll call the “kernel” style) would be to simply define a function that performs each line of this:

int8 colorkernel(int8 Y, int8 U, int8 V, int c1, int c2, int c3) {

Return (c1Y + c2U + c3V);

}

R = ApplyKernel(colorkernel, Y, U, V, c1, c2, c3);

G = ApplyKernel(colorkernel, Y, U, V, c4, c5, c6);

B = ApplyKernel(colorkernel, Y, U, V, c7, c8, c9);

This is the equivalent of saying “this is the program I want to run in parallel at each element.” (It’s pretty similar to shader programs for GPUs.)

This is the equivalent of saying that “this is the program defined as a series of parallel operations over the entire image.”

Which is better, kernel or vector flavors of data parallelism? Well, it depends. You shouldn’t draw a conclusion from this simple example, because I can show you examples of algorithms that look cleaner and are easier to write in one but not the other for both styles!

Recall from my last blog that one part of video compression that was tricky was motion estimation. Because the algorithms used often depend on using local or previously computed results, you end up with what we call a “wavefront” pattern, illustrated below. The arrows denote what we call “dependences”, meaning that you cannot execute the head and tail in parallel.

Without going into details now, it turns out that the kernel style (well, in particular, the Ct kernel style) is much better than the vector style (which Ct also supports). But, some might look at this and not see data parallelism, but rather task parallelism, or more precisely, data-flow. Generically, data-flow is a model where you can define a set of tasks that successively operated on data in some well-defined order. Typically, you’d end up with a graph (like the wavefront picture above) that illustrated how the data flowed through the tasks. But, it is yet another model of parallelism. (We use data-flow aggressively in the implementation of Ct.)

The next application we looked at was (sparse) linear algebra, because of its use in game and high-end physics applications. (I’ve since been surprised to find sparse linear algebra pop up in unexpected places…) Sparse matrices are the basic object you manipulate in these algorithms. It is so called because, unlike a dense matrix, you only store the values you care about. Typically, this means that a lot of the matrix entries are zeros and we only want to store and process the non-zeros. The picture below gives a small example, where the colored values are the only ones we want to store.

Using this type of representation is important because of the enormous memory footprint and calculation savings. For example, in a simulated scene where you have 1000 rigid object, you can represent constraints or relations between the objects as a 1,000,000 (1000×1000) element adjacency matrix. However, most of the object may have no relationship to each other and may not even touch, meaning that the number of populated entries in this matrix is quite small (perhaps 3 per object, meaning a sparse representation would require 3000 entries versus 1,000,000 entries!) The problem with sparse matrices is that there is no single standard representation and each form comes with its own parallelization challenges.

Compressing a dense matrix into sparse form makes the computation “irregular”. What do I mean by this? We usually refer to simpler processing patterns as regular, such as simply walking over the element of an array or image in sequence. So, by irregular in this instance, I mean that the elements we touch might jump around unpredictably. Also, we might find these “touches” also modify data, meaning that we have to coordinate these accesses when implementing them in parallel. Without going into detail about what this code means, there’s how the Ct version of a sparse matrix vector product might look (don’t worry about the details for now).

expv = v.distribute(cols);

product = Aexpv;

product = product.applyNesting(rind, ctSparse);

return product.reduceSum();

Only a few lines of code, but it’s somewhat opaque unless you’re familiar with data parallel programming. However, if you looked at what a task dependence graph for this computation using the matrix above, it’s pretty complicated. (In layman’s terms: This is how the individual pieces of work are organized and communicate with each other. For example, if Task A points to Task B, then Task A has to execute to completion before Task B executes):

This gets much worse when we look at non-toy examples. Now consider that this tangled spaghetti of (fine-grained) tasks and synchronizations can be derived by a runtime (Ct, in this case) from that few lines of code above!

Flat (or Vanilla;)) makes it quite difficult to express this type of parallelism for these sparse matrices, which essentially look like vectors of vectors of varying length. Imagine how much more difficult it might be for other types of “collections”, including trees, sets, etc. This is the trick of “nested data parallelism”, something that emerged 20 years ago, but, surprisingly, remains relatively undiscovered. Nested data parallelism generalizes the notions of data parallelism to apply to these kinds of irregular data structures.

It turns out that you can do sparse matrix computations a couple of ways, but there is an elegant symmetry to how we can use data parallel constructs to cover a range of different matrix representations. Basically, regardless of the underlying matrix type (dense, various flavors of sparse) the Ct code structure looks about the same. If you’re interested, we talk about this in more detail in our Ct whitepaper here.

We’ve since looked at sorting algorithms, computational finance, collision detection (also for physics), and many more. While data parallel models are certainly not the only models required for these cases, we’ve been surprised at how far we can take them.