Jim Dempsey

Dr. Dobb's Bloggers

How to: Write a parallel_for Loop (Matrix Multiply)

June 21, 2010

A poster on the ISN forums provided a link to an MSDN article titled How to: Write a parallel_for Loop, which shows how to use Visual Studio 2010's Concurrency::parallel_for to compute the product of two matrices. The example illustrates how to use parallel_for in a nested loop.

The typical C++ serial method to compute the matrix multiplication of a square matrix might look as follows:
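The MSDN sample code is not reproduced here. As a stand-in, a minimal serial version might look like the following (the function and type names are mine, not the article's, and a vector-of-vectors stands in for the sample's raw arrays):

```cpp
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Minimal serial square-matrix multiply: result = m1 * m2.
// All three matrices are size x size.
void multiply_serial(const Matrix& m1, const Matrix& m2,
                     Matrix& result, int size)
{
    for (int i = 0; i < size; i++)
    {
        for (int j = 0; j < size; j++)
        {
            double temp = 0.0;
            for (int k = 0; k < size; k++)
            {
                temp += m1[i][k] * m2[k][j];
            }
            result[i][j] = temp;
        }
    }
}
```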

The MSDN parallel version makes use of C++0x lambda functions and the Concurrency template for the parallel_for. The conversion is elegantly simple, essentially rewriting two lines of code: the outermost for statement and that loop's closing brace.
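The MSDN code itself is not reproduced here either. Purely as a sketch of the shape of the change, here is a portable equivalent in which the outer for over i becomes a parallel loop; with the PPL, that outer loop would instead read `Concurrency::parallel_for(0, size, [&](int i) { ... });`. The version below uses std::thread only so the sketch compiles without VS 2010:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Same body as the serial multiply; only the outer loop over i is
// parallelized. Rows are split into contiguous chunks, one chunk
// per hardware thread. Each thread writes disjoint rows of result,
// so no synchronization is needed inside the loop.
void multiply_parallel(const Matrix& m1, const Matrix& m2,
                       Matrix& result, int size)
{
    int nthreads = (int)std::thread::hardware_concurrency();
    if (nthreads <= 0) nthreads = 4;   // fallback when unknown

    auto rows = [&](int begin, int end)
    {
        for (int i = begin; i < end; i++)
            for (int j = 0; j < size; j++)
            {
                double temp = 0.0;
                for (int k = 0; k < size; k++)
                    temp += m1[i][k] * m2[k][j];
                result[i][j] = temp;
            }
    };

    int chunk = std::max(1, (size + nthreads - 1) / nthreads);
    std::vector<std::thread> pool;
    for (int begin = 0; begin < size; begin += chunk)
        pool.emplace_back(rows, begin, std::min(begin + chunk, size));
    for (auto& t : pool)
        t.join();
}
```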

The reported performance boost on a four-processor system for 750x750 matrix multiplication was:

Serial: 3853 (ticks)
Parallel: 1311 (ticks)

Approximately a 2.94x speed-up.

This is not the ideal scaling of a 4x speed-up on four processors. But considering that there must be some setup overhead, 2.94x is not too bad on a matrix this small.

Although this article illustrates a parallel_for, a new parallel programmer might be misled into assuming that the example also illustrates how to write a good parallel matrix multiplication function. After all, it was published as an MSDN article, an authoritative source.

Let's see how we can do better at parallelization of the Matrix Multiplication.

First off, I do not have Visual Studio 2010, so I cannot use the sample program as-is.
I do have QuickThread (I am the author of this parallel programming toolkit; see www.quickthreadprogramming.com), so I adapted the code to use QuickThread. Adaptation is relatively easy: change the headers and account for minor differences in syntax.

In the QuickThread dialect, the parallel_for passes the half-open range, as opposed to the single index passed by the parallel_for of the VS 2010 concurrency collection. QuickThread chose to pass the half-open range because, knowing the range, the programmer and/or compiler can optimize the code within the loop.
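QuickThread's actual signatures are not shown in this post, so the following wrapper is hypothetical (all names are mine); it merely illustrates the half-open-range calling convention, in which the body receives [begin, end) rather than one index:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical parallel_for in the half-open-range style: the body
// is invoked as body(begin, end) for each chunk, so the compiler can
// vectorize or unroll across the whole chunk rather than a single
// iteration. Not QuickThread's real API -- an illustration only.
template <typename Body>
void parallel_for_range(int first, int last, Body body)
{
    int nthreads = (int)std::thread::hardware_concurrency();
    if (nthreads <= 0) nthreads = 4;
    int chunk = std::max(1, (last - first + nthreads - 1) / nthreads);

    std::vector<std::thread> pool;
    for (int begin = first; begin < last; begin += chunk)
        pool.emplace_back(body, begin, std::min(begin + chunk, last));
    for (auto& t : pool)
        t.join();
}
```

A matrix-multiply caller would then write something like `parallel_for_range(0, size, [&](int begin, int end) { for (int i = begin; i < end; i++) /* row i */; });`, keeping the per-chunk loop visible to the optimizer.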

Using the QuickThread modified code on a four-core (2 by AMD Opteron 270 dual core) system the tick counts line up as:

Serial: 11011
Parallel: 3417

For a scaling ratio of 3.22x

The processor models were not listed for the Visual Studio test, so it is hard to tell why QuickThread yields a 3.22x improvement on four cores while VS yields a 2.94x improvement on four processors (cores?).

Setting aside the issue of which threading toolkit is better, let's look at the parallelization of the matrix multiplication from the standpoint of improving performance.

The first thing any programmer should know is: improve the serial algorithm first, and then address the problem through parallelization (save the SSE intrinsic functions for last).

If you look at the innermost loop of the serial function, you find:

for (int k = 0; k < size; k++)
{
    temp += m1[i][k] * m2[k][j];
}

While the k index in m1[i][k] accesses memory sequentially, the k index in m2[k][j] does not. This means the compiler is unable to vectorize the loop. And, potentially worse, stepping down the rows will at certain points cause cache evictions.

To vectorize the matrix multiplication (and minimize cache evictions) you would have to transpose the matrix m2 (into m2t) and then swap the indexes in the inner loop. To transpose the matrix m2 you will have to allocate memory and then write out the transposition. This may sound absurd: you add the overhead to

perform an allocation (actually size+1 allocations)
run the transposition loops
write the equivalent of the result output an additional time

You might assume that this must be slower than the simple matrix multiplication. You would be wrong to assume this. The rewritten serial loop looks like:
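The rewritten loop itself is not reproduced above. A sketch of the transpose-then-multiply idea (with a simplified allocation, using a vector-of-vectors rather than the article's raw arrays) might look like:

```cpp
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Serial multiply against a transposed copy of m2: after the
// transpose, both operands in the inner loop are walked with unit
// stride in k, so the loop is vectorizable and cache-friendly.
void multiply_serial_transposed(const Matrix& m1, const Matrix& m2,
                                Matrix& result, int size)
{
    // One-time cost: allocate m2t and write the transposition.
    Matrix m2t(size, std::vector<double>(size));
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            m2t[j][i] = m2[i][j];

    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
        {
            double temp = 0.0;
            for (int k = 0; k < size; k++)
                temp += m1[i][k] * m2t[j][k];  // both unit stride in k
            result[i][j] = temp;
        }
}
```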

In a real implementation you might want to consider preserving the buffers allocated on the first call. On subsequent calls, check whether the prior allocation is sufficient for the current call; if so, bypass the allocation, and if not, delete the buffers and allocate new ones. This further optimization detail is left to readers. Also note that if the m2 matrix is going to be used many times (e.g. in a filter system), you can perform the transposition once. These are implementation strategies to keep in mind when taking these suggestions into your own code.
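One possible shape for that buffer-reuse strategy (my own sketch, not the author's code) is a grow-only cached buffer:

```cpp
#include <cstddef>
#include <vector>

// Keep the transposition buffer across calls; reallocate only when
// the current call needs more space than the cached buffer holds.
// (Not thread-safe as written -- a real version would guard the
// cache with a mutex or make it per-thread.)
std::vector<double>& transpose_buffer(std::size_t needed)
{
    static std::vector<double> cache;
    if (cache.size() < needed)
        cache.resize(needed);   // grow only; otherwise keep as-is
    return cache;
}
```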

Let's see how the serial with transposition technique stacks up against the serial without transposition as run on the 2x dual-core AMD Opteron system:

While this system has 16x the number of hardware threads, it has only 8x the number of floating-point execution units. In that respect, scaling appears almost linear.

I have purposely stayed away from using the SSE intrinsic functions. I suggest you not use the SSE intrinsic functions until the last phase of optimization; doing so earlier complicates your parallelization efforts.

As good as the performance is (approximately 55x on dual Xeon 5570), is there room for additional improvement before adding SSE intrinsic functions?

You will have to wait for my reply, as my next article will address this subject.

I wish to thank one of my users for letting me have access to his Dell R610 for testing purposes. I have yet to configure the test program to run on Intel's Parallel Universe system. I may need to do this for my next article (unless a reader is willing to run a test program for me).
