ARSC HPC Users' Newsletter 240, March 5, 2002

Contents

MPI Send/Recv Performance
More Memory for SP Jobs: -bmaxdata
Specify a wall_clock_limit
Quick-Tip Q & A

The two articles on Cray's updated Message Passing Toolkit (MPT 1.4.0.4), which appeared in our last issue, didn't consider performance.

Given the multitude of MPI-enabled systems available these days, it seemed like a good time to resurrect a program we've used for simple timings of MPI sends/recvs. It was contributed by Alan Wallcraft, and made its first appearance in issue #66 of the T3D Newsletter.

The program sends messages of different sizes around a ring of processors and returns the average time for a single send/recv. The program was modified to use larger messages and 8-byte words, and to report bandwidth as well as time. A sketch of the code appears below.
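(The listing below is a minimal sketch of the ring timer, not Wallcraft's original program; the message sizes, repetition count, and the use of MPI_Sendrecv_replace are illustrative choices.)

! ringtime.f90 -- minimal sketch of a ring send/recv timer
! (an illustrative stand-in, not Wallcraft's original program).
program ringtime
implicit none
include 'mpif.h'
integer, parameter :: nsizes = 5, nreps = 100
integer :: sizes(nsizes) = (/ 1, 10, 100, 1000, 10000 /)
real(kind=8), allocatable :: buf(:)
real(kind=8) :: t0, usec, mbs
integer :: ierr, rank, nprocs, next, prev, i, k, n
integer :: status(MPI_STATUS_SIZE)

call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
next = mod(rank + 1, nprocs)            ! neighbor to send to
prev = mod(rank - 1 + nprocs, nprocs)   ! neighbor to receive from

do k = 1, nsizes
   n = sizes(k)
   allocate(buf(n))
   buf = real(rank, 8)
   call MPI_Barrier(MPI_COMM_WORLD, ierr)
   t0 = MPI_Wtime()
   do i = 1, nreps
      ! Pass the message one step around the ring.  Sendrecv_replace
      ! cannot deadlock, regardless of how MPI_Send is buffered.
      call MPI_Sendrecv_replace(buf, n, MPI_REAL8, next, 0, &
                                prev, 0, MPI_COMM_WORLD, status, ierr)
   end do
   if (rank == 0) then
      usec = 1.0d6 * (MPI_Wtime() - t0) / nreps  ! avg time per send/recv
      mbs  = 8.0d0 * n / usec                    ! 8*n bytes per usec = mbytes/sec
      write (*, '(i10, 2f12.2)') n, usec, mbs
   end if
   deallocate(buf)
end do

call MPI_Finalize(ierr)
end program ringtime

On the SP this could be built with mpxlf90 and run under poe; other systems have their own MPI compiler wrappers and launchers.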

Runs were done on the T3E, IBM SP, SV1ex, and a small Linux cluster here at ARSC. Runs were made with 4 and 16 processors on each system, and other system-specific parameters were varied as well to make this more interesting.

The tables of results include:

"mb/s" -- bandwidth in Mbytes/sec, for all message sizes.

"(usec)" -- latency: the absolute time in microseconds, for small messages.

"Size 8-byte words" -- number of REAL*8 words per message.

Cray T3E-900

Observations:

MPT 1.4.0.4 gives a major improvement in the performance of MPI sends/recvs. Also note that on the T3E, there's no added cost in using 16 rather than 4 processors. Transfer rates across the torus tend to be very uniform.

IBM SP

Observations:

You should always specify the US network and MP_SHARED_MEMORY=yes. The performance penalty for doing otherwise is clear from the second and third columns.
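With POE, that usually means something like the following (MP_EUILIB and the adapter name in the loadleveler network statement are assumptions that vary with site and software level, so check your local documentation):

export MP_EUILIB=us
export MP_SHARED_MEMORY=yes

or, in a loadleveler script:

# @ network.MPI = css0,shared,US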

The non-uniform communication rates expected on a machine built from shared-memory nodes show up in the comparison between the 4- and 16-CPU runs, and between the 4x4 and 16x1 runs. The fastest transfers, for this particular program, are intranode. The very best bandwidth observed was in the 1x4 case, with MPI traffic using the node's shared memory (see the first column). From the 4x4 and 16x1 cases (last two columns), it's clear that there's a reward for keeping tasks co-resident on the nodes.

Other programs, data types, and communication patterns will show different results, as will runs on IBM's most recent switch and network technologies (Colony or Federation switches, for instance).

Cray SV1ex

It might surprise a traditional PVP user, but we have an increasing number of users with an MPI component in their SV1ex jobs. For example, NCAR's Climate System Model (CSM) has several individual components (land, atmosphere, ice, etc.) which are individually multi-tasked and vectorized, but which are coupled using MPI.

All of the MPI messages are "passed" using shared memory.

ARSC's SV1ex can be considered a single 32-processor shared memory node, and if we had more nodes, MPI could be used between them. Clusters of SMPs are going to be around for a while, so this is likely to remain a portable approach.

Note that many aspects of MPI performance on a given architecture are not measured by this particular code. It does one send/recv at a time, with no contention from multiple pairs communicating simultaneously. Collective operations, and other algorithms built on point-to-point communication, can create competition for resources such as switches, routers, and buffers, and can produce quite different numbers.

More Memory for SP Jobs: -bmaxdata

On icehawk, if your program needs more than 256 MB of memory, you need to tell the loader. The loader option:

-bmaxdata:<size_in_bytes>

passed on the compile line, will do it. For instance:

mpxlf90 -bmaxdata:375000000 -o prog prog.f

or

xlc -bmaxdata:375000000 -o prog prog.c

will request 375 MB.
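To verify the value recorded in an existing executable, AIX's dump command can print the auxiliary header (the exact output format varies by AIX release, so consult man dump):

dump -o prog

The maxdata value appears in the optional header section of the output.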

Even though nodes on icehawk have 2 GB of memory, we advise caution in using more than about 1.5 GB: the OS and MPI buffers need some, too. If you're running four MPI tasks on a single node, they allocate their memory individually, so you might stay below about 1.5 GB / 4, or 375 MB, per task. OpenMP threads, on the other hand, share memory. Thus, if you're using multilevel parallel programming, with MPI between nodes and OpenMP within nodes, you can specify the full 1.5 GB per task.
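For example, a hybrid MPI/OpenMP executable might be linked with the full amount. The _r compiler driver and -qsmp=omp shown here are the usual AIX conventions for threaded code, but take the line as a sketch rather than a tested recipe:

mpxlf90_r -qsmp=omp -bmaxdata:1500000000 -o prog prog.f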

Here's more on "maxdata," from "man ld":

Options (-bOptions)
The following values are possible for the Options variable of the -b
flag. You can list more than one option after the -b flag, separating
them with a single blank.
[...]
D: Number or maxdata:Number Sets the maximum size (in bytes)
allowed for the user data area (or user heap) when the executable
program is run. This value is saved in the auxiliary header and
used by the system loader to set the soft data ulimit. The default
value is 0.

Specify a wall_clock_limit

In your loadleveler script, a line like:

# @ wall_clock_limit = 2:00:00

requests two hours. Overestimate: if you request less time than your job needs, the system will cut it off. You may be able to refine the request by timing a series of runs.

When you omit the wall_clock_limit specification, the system must assume your job needs the maximum time available to the class. Given loadleveler's backfill algorithm, shorter requests are more likely to start sooner, so an unnecessarily long default probably isn't what you want.
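Loadleveler also accepts a hard,soft pair of limits, which gives you a safety margin while keeping the request modest. The syntax below is standard loadleveler, though you should confirm it against your site's man pages:

# @ wall_clock_limit = 2:00:00,1:55:00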

Quick-Tip Q & A

Q: I resolve to stop using sed, awk, cut, split, complicated egrep
commands, etc..., in favor of perl.
My goal is to simplify, and learn just one way to do everything.
Can you help me get started? I'd appreciate a couple examples
--with explanations--of using perl on the command line or in short
scripts, to accomplish common unix tasks.
