Tuning a C++ MPI Code with VAMPIR: Part II

[ Part II of III. Thanks to Jim Long of ARSC for this series of articles. ]

In part I, we described a port of the UAF Institute of Arctic Biology's Terrestrial Ecosystem Model (TEM) to the Cray T3E and a Linux cluster, and examined performance using VAMPIR. In this article, we explore an optimization to the communication algorithm and discuss performance on ARSC's IBM SP3.

As shown in part I, VAMPIR images suggested that TEM might be tuned by:

  1. overlapping computation on the master and slaves, and

  2. having the slaves begin computing as soon as they receive new data.

The relevant abstracted code section from the original implementation is:
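
[ The listing below is a minimal sketch of the pattern, with illustrative names (annual_exchange, numslaves, ncells); it is not the actual TEM source. ]

  // Master reads one year of climate data and hands a chunk to each slave;
  // the MPI_Barrier after every send keeps master and slaves in lock step.
  #include <mpi.h>
  #include <vector>

  void annual_exchange(int rank, int numslaves, int ncells)
  {
      std::vector<double> climate(ncells);
      MPI_Status status;

      if (rank == 0) {                                   // master
          for (int i = 1; i <= numslaves; i++) {
              // ... read this year's climate data for slave i into 'climate' ...
              MPI_Send(&climate[0], ncells, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
              MPI_Barrier(MPI_COMM_WORLD);               // everyone stops here
          }
      } else {                                           // slaves
          for (int i = 1; i <= numslaves; i++) {
              if (i == rank)
                  MPI_Recv(&climate[0], ncells, MPI_DOUBLE, 0, 0,
                           MPI_COMM_WORLD, &status);
              MPI_Barrier(MPI_COMM_WORLD);               // wait at every other slave's turn too
          }
          // ... compute TEM for this slave's grid cells and handle its output ...
      }
  }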

The MPI_Barrier calls serve to synchronize the two loops so as not to overload MPI buffering. Overloading the buffers could be a real problem because the code above sits inside a loop that can read files for hundreds of years into the future.

The barriers also mimic the situation that would exist in a synchronous coupling with a climate model, i.e., when there is no new climate data for the master to read until the slaves have computed and sent their data to the climate model. In a synchronous coupling, the master must wait until a new climate is computed.

In a sensitivity analysis for an uncoupled TEM, however, the climate might well be prescribed (as it is now), and the master can read the next year's data and have it ready for the slaves when they need it. This addresses issue 1, above.

The fact that no slave can begin computation until all slaves receive their data was recognized in the original implementation, but was left unchanged since it mimics the worst case scenario that would exist in a global run with many slaves trying to read/write their data at the same time. Worst case simulation is not necessary, however, when a sensitivity analysis is desired for only Arctic latitudes. This addresses issue 2.

Thus, it was safe to tune the code by simply removing the barrier calls. This eliminates the "for" loop in the "else" clause. The first in the series of MPI_Sends was replaced with an MPI_Ssend. MPI_Ssend is a synchronous send that guarantees that the send will not return until the destination begins to receive the message. This effectively implements a barrier between the master and one slave only, when that slave begins to receive, instead of having to stop at an explicit barrier when each slave is receiving. A slave may now begin computation as soon as it receives its data. The tuned code looks like:
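
[ Again a minimal sketch with the same illustrative names, not the actual TEM listing. ]

  // No barriers: the synchronous send to the first slave keeps the master
  // from racing ahead, and each slave computes as soon as its own data arrives.
  #include <mpi.h>
  #include <vector>

  void annual_exchange_tuned(int rank, int numslaves, int ncells)
  {
      std::vector<double> climate(ncells);
      MPI_Status status;

      if (rank == 0) {                                   // master
          for (int i = 1; i <= numslaves; i++) {
              // ... read this year's climate data for slave i into 'climate' ...
              if (i == 1)
                  MPI_Ssend(&climate[0], ncells, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
              else
                  MPI_Send(&climate[0], ncells, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
          }
      } else {                                           // slaves
          MPI_Recv(&climate[0], ncells, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
          // ... compute TEM for this slave's grid cells as soon as the data arrives ...
      }
  }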

The general lesson here is to avoid global barriers if at all possible.

Figure 1

Figure 1 gives two VAMPIR images, comparing old vs new communication patterns for the T3E during equal timeslices of the TEM transient portion. The T3E showed a roughly 10% reduction in time for the transient portion of the run, which shows up as a reduction in the amount of time spent in (red) MPI calls for the slaves in the VAMPIR output. (In all of these VAMPIR images, green, which shows time spent doing computation, is good, while red, which shows necessary, but unproductive, time in the communication library, is bad.)

Figure 2

Figure 2 shows the communication pattern on the ARSC Linux cluster, using Ethernet, where an impressive 40% reduction in time during the transient portion is realized.

Of all the platforms tested, MPI latency and bandwidth are worst on the cluster over Ethernet, so it is no surprise that the benefit from tuning the communication algorithm is most dramatic there.

Figure 3

Figure 3 shows additional results from the ARSC Linux cluster, but this time using the Myrinet network. This comparison shows about a 15% reduction in time during the transient portion.

Figure 4

Figure 4 is the promised look at results on ARSC's IBM SP3 (Icehawk) for an equal timeslice of the transient portion.

The original code ran in a blazing 9:55 (9 minutes, 55 seconds) total, while the tuned code ran in 8:31. The two transient portions ran in 3:50 and 2:25 respectively, a roughly 35% improvement in the tuned version for transient performance.

Since the compute time per time step is so low on the SP3, the MPI portion was a large percentage of the action, and hence a reduction in MPI time yields a large percentage improvement. The IBM SP3 is essentially cluster technology: 4 CPUs per shared-memory node, with nodes interconnected by a high-speed switch. Each CPU has an 8 MB L2 cache, so this code gets the combined benefit of a large cache and high-performance CPUs.

In the next (and final) installment in this series, we address the question raised in part I. The problem is naturally parallel, so why doesn't it scale better? Is the tuned code more scalable?

Ahhhh... We knew that! Function and subroutine calls inhibit vectorization. Recompile with inlining to eliminate the function call. As described in Quick-Tip #207, if the function were defined in a separate source file, we'd use "-Oinlinefrom=<FNM>". In this case, use -Oinline4:
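
For example (the other flags follow the invocation shown below):

  f90 -O3,inline4 -rm -o trap.serial trap.serial.f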

We got a 3-fold improvement, but 75 MFLOPS is still disappointing. Recompile again with "negative messages" to get guidance from the compiler:

f90 -O3,negmsgs,inline4 -rm -o trap.serial trap.serial.f

And the listing file shows:

f90-1204 f90: INLINE File = trap.serial.f, Line = 31
The call to F was inlined.
f90-6209 f90: VECTOR File = trap.serial.f, Line = 32
A loop starting at line 32 was partially vectorized.
f90-6511 f90: TASKING File = trap.serial.f, Line = 32
A loop starting at line 32 was not tasked because a recurrence was
found on "SIDE" between lines 35 and 37.

OF COURSE! There's a dependency in this loop. The value of "side" must be computed before "integral". This is probably inhibiting vectorization as well as parallelization.
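
To see the recurrence, picture a loop of roughly this shape (illustrative only, not the actual trap.serial.f source):

      oldside = f(a)
      do i = 1, n
        side = f(a + i*h)                          ! right side of trapezoid i
        integral = integral + 0.5*h*(oldside + side)
        oldside = side                             ! carried into the next trip: the recurrence
      enddo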

It was clever to reuse the value of "side" for two adjacent trapezoids, but let's go back to the simplest coding of trapezoidal integration, and see what happens. Replacing the loop with this:
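
One plausible "simplest" form evaluates f at both edges of every trapezoid, so nothing is carried from one iteration to the next (again illustrative, not the actual replacement):

      integral = 0.0
      do i = 1, n
        xl = a + (i-1)*h                           ! left edge of trapezoid i
        xr = a + i*h                               ! right edge
        integral = integral + 0.5*h*(f(xl) + f(xr))
      enddo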

Next Newsletter

For you Santas out there, your cards from North Pole are on the way.

We're taking Dec 28th off, and will produce the next newsletter on Jan 4. Also, we're updating our technical reading list and plan to print it in the next issue. If you'd like to recommend a book, let us know.

A safe and happy holiday to everyone!

Quick-Tip Q & A

A:[[ As I migrate my code between Crays, IBMs, and SGIs, I assume
  [[ I can just stick with the default optimization levels. Is this a
  [[ good assumption?

  Nope. Okay on Crays and IBMs, but on SGIs, default optimization is NO
  optimization. Try -O2 on the SGIs for starters. Also, see the Quiz
  answer, above.

  If you're going into production, the compiler is your friend. It can
  really pay to analyze your code.

Q: What are your "New Year's 'Computing' Resolutions" ???

   For example, "I resolve to learn Python, change all my
   passwords, and ???"

   (Anonymity will be preserved when we list these in the Jan 4th
   issue.)
