Six-Step Path to Parallel Programming

It's no secret that learning parallel programming is no small task, but the payoff can be enormous. To help ease the move to parallel, here is a six-step plan for optimizing your code, for everything from small quad-core chips up to the massively parallel Xeon Phi coprocessor. The plan is outlined in Jim Jeffers and James Reinders' book on high-performance programming. Jeff Cogswell explores these steps and offers some additional suggestions.

While experimenting is one way to learn parallel programming, it's best to make sure you first fully understand what is going on under the hood so that you can get the most out of what you're doing. Otherwise, you could end up with code that looks right but runs slower than serial code. It's also always a good idea to consider the advice of the experts before trying to reinvent the wheel.

That’s where Intel’s Jim Jeffers and James Reinders come in. In their excellent book “Intel Xeon Phi Coprocessor High Performance Programming,” the authors provide an insider look at how to get the most out of parallel programming, for C++ as well as Fortran compiling. (And as I’ve mentioned in previous blogs, a lot of the information in the book applies not just to Xeon Phi Coprocessor programming, but general multi-core programming as well.)

In the book, Jeffers and Reinders provide a methodology for programming with vectorization. I’ve been covering vectorization and Cilk Plus programming lately in reference to scientific programming, particularly with linear algebra and matrix operations. Let’s review what these experts have to say about a proper vectorization methodology.

Six-Step Methodology

The methodology they present consists of six steps using Intel Parallel Studio. While I encourage you to get the book and read the actual text (starting on page 110), I want to offer my own additional thoughts and notes. Here are the steps, with my thoughts:

1. Measure baseline release build performance.

The authors point out the importance of measuring a release build. This is vital, because a debug build disables optimizations and adds extra checks that slow the program down, giving you an inaccurate picture of real performance, both before and after tuning. On the other hand, the release build's optimizer can itself get in the way: as the authors point out, it may optimize away code that would otherwise be a good candidate for vectorization. That doesn't mean you should turn optimization off; you want it on. It means you should first establish the baseline performance of the optimized release build, so you have a fair comparison against the final, vectorized product.

2. Use Intel VTune Amplifier to locate hotspots

This is rather self-explanatory: use the tools for performance profiling. But note that the authors recommend using the full hotspot analysis, not the "Lightweight Hotspots" analysis.

3. Determine loop candidates

The authors suggest using the compiler's vectorization report to determine whether any of your hotspot loops failed to auto-vectorize. Whatever the reason a loop didn't auto-vectorize, this is the time to determine whether it could be recoded slightly to allow vectorization. Remember, not everything is automatic, and there are times you might want to rethink an algorithm to look for opportunities to make it parallel. That brings us to the next step.

4. Use the Guided Auto-Parallelization (GAP)

This part is also self-explanatory; run the tool and see what it says, making note of its suggestions.

5. Follow the suggestions from the GAP

This is the step the previous ones were leading up to: consider recoding the parts the tool flags. But remember, the tool is only an automatic tool and doesn't have the full analytical abilities of the human brain. Its advice can go wrong in two ways: you might recode the algorithm so that it runs in parallel and faster but no longer does exactly what it's supposed to, or there might be a valid way to recode it that you simply get wrong.

The authors bring up an interesting point, which I’ll quote here directly from the book, page 112:

“One way to ensure that the loop has no dependencies that may be affected is to consider if executing the loop in backwards order would change the results.”

The reason for this seemingly bizarre but sound advice is that the iterations of a parallel loop must be able to run independently of one another. If each iteration relies on the results of the previous one, then the loop as written can't run in parallel, and running it in reverse clearly would not produce the same results. But that's not to say it can't be done; perhaps you need to rethink the loop. Here's a simple example: suppose each iteration relies on a sum calculated in the previous iterations. In such cases you may be able to divide the algorithm into blocks so that each block builds a partial sum, and then combine the partial sums at the end. This is the concept of a reducer, and reducers can be coded to run in parallel. So don't give up! And remember, there are large libraries at your disposal, including reducers in both Cilk Plus and the Threading Building Blocks library.

6. Rinse and Repeat

Then you repeat the steps from the start until you reach the performance you're looking for.

Conclusion

Parallel programming—both multicore and vectorized—is no small task, but the rewards can be huge, especially if you’re taking your code to a many-core chip like the Xeon Phi coprocessor. But even if you’re just sticking to a run-of-the-mill quad-core processor, you can still make big strides in maximizing your performance. Use the tools, and keep studying, keep learning, and soon you’ll be mastering parallel programming. (And seriously—get the Jeffers and Reinders book.)