ARSC T3E Users' Newsletter 165, April 05, 1999

VAMPIR Images of Parallel IO

VAMPIR is a performance analysis tool for parallel MPI programs. It displays more than just the message passing taking
place: user-selected regions of code activity can also be inspected.

In this article, we use VAMPIR and last week's example code for MPI_GATHERV to illustrate the potential benefit of doing IO in
parallel.

Given last week's test code, the IO activity can be plotted by
adding calls to three routines from the VAMPIR API. The first call defines a new symbol for IO activity, the second and third mark
the start and end of IO processing. Here's the modified IO section of code:
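A minimal sketch of such instrumentation follows, assuming the Pallas VAMPIRtrace Fortran bindings (VTSYMDEF, VTBEGIN, VTEND); the symbol number, label strings, and variable names here are illustrative, not from the original code, so check the VAMPIR documentation for the exact interface:

```fortran
! Sketch only: assumes the VAMPIRtrace Fortran bindings
! VTSYMDEF/VTBEGIN/VTEND; symbol number and labels are arbitrary.
integer, parameter :: IOSYM = 42
integer :: ierr

! 1. Define a new symbol for IO activity (once, during setup).
call VTSYMDEF (IOSYM, 'Output', 'IO', ierr)

! ... gather the data onto the root(s) with MPI_GATHERV ...

! 2. Mark the start of the IO section.
call VTBEGIN (IOSYM, ierr)

write (iounit) idata(1:nvals)

! 3. Mark the end of the IO section.
call VTEND (IOSYM, ierr)
```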

Comparing the single-processor IO versus parallel IO versions is now easy. For the following examples a problem size of
5,000,000 integers was used.

In the single-processor IO version, all the data is gathered onto one root processor and written out there. The IO requires
over three seconds. Here's the VAMPIR "global timeline" plot of this method:

In the parallel IO version, the data is gathered and written out on four different roots. The IO now takes just over a second,
as this plot shows:
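The four-root layout can be arranged in several ways. One is sketched below, under our own assumptions (the subgroup count NROOTS, buffer names, and unit numbers are illustrative, not last week's code): MPI_COMM_WORLD is split into four subcommunicators, the data is gathered within each, and each subgroup root writes its own file.

```fortran
! Sketch only: split the world into NROOTS subgroups; each does
! its own MPI_GATHERV and its root writes a separate file.
integer, parameter :: NROOTS = 4
integer :: myrank, color, subcomm, subrank, ierr

call MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
color = mod (myrank, NROOTS)
call MPI_COMM_SPLIT (MPI_COMM_WORLD, color, myrank, subcomm, ierr)
call MPI_COMM_RANK (subcomm, subrank, ierr)

! Gather each subgroup's pieces onto its own root (subrank 0) ...
call MPI_GATHERV (sbuf, scount, MPI_INTEGER,          &
                  rbuf, rcounts, displs, MPI_INTEGER, &
                  0, subcomm, ierr)

! ... so the four roots perform their IO concurrently.
if (subrank .eq. 0) write (10+color) rbuf
```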

Two caveats are in order. First, since the files are on the same filesystem, there is a point of diminishing returns as
additional processors are enlisted to perform IO. This effect was found to set in at about the eighth processor, as shown in
newsletter #157.

Second, because this approach dedicates one root to each gathered variable, the parallelism is limited to the number of
variables.

Despite these caveats, this simple form of parallel IO clearly pays off, most of all in codes that must write many sets of
data during a run. Such simple multi-processor IO is used to great effect in some codes currently running at ARSC.

(For details, examples, and help on VAMPIR, see ARSC's tutorial at:

http://www.arsc.edu/support/howtos/usingvampir.html)

cache_bypass Accelerates Co-Array Fortran Programs

The CACHE_BYPASS directive specifies that local memory references in a loop should be passed through E registers. This can
speed co-array data transfers.

An example. Given these declarations:

real :: c(1000000)[*], a(1000000)[*]

Here are two ways to move n elements from array "c" on image 1 to array "a" on image 2. They use explicit DO loops:

CAF "get" loop version:

if (THIS_IMAGE () .eq. 2) then
  do i=1,n
    a(i) = c(i)[1]
  enddo
endif

CAF "put" loop version:

if (THIS_IMAGE () .eq. 1) then
  do i=1,n
    a(i)[2] = c(i)
  enddo
endif

We found through testing that the above CAF methods using explicit DO loops benefit tremendously from the addition of the
"cache_bypass" directive.

Here are two more ways to move n elements from array "c" on image 1 to array "a" on image 2. They use Fortran 90 array
syntax:

CAF "get" array syntax:

if (THIS_IMAGE () .eq. 2) then
  a(1:n) = c(1:n)[1]
endif

CAF "put" array syntax:

if (THIS_IMAGE () .eq. 1) then
  a(1:n)[2] = c(1:n)
endif

Neither array syntax version benefited from cache_bypass. Also, both array syntax versions were slower than their explicit
loop counterparts when cache_bypass was used with the latter. This is similar to the finding, given in newsletter #127, that
array syntax is slower for processor-to-local-memory transfers.

Finally, here are the SHMEM equivalents of the same transfer:

SHMEM_GET:

if (SHMEM_MY_PE () .eq. 1) then
  call shmem_get (a, c, n, 0)
endif

SHMEM_PUT:

if (SHMEM_MY_PE () .eq. 0) then
  call shmem_put (a, c, n, 1)
endif

SHMEM bypasses the cache by definition and, as expected, achieves the fastest transfer rates. However, the CAF get with an
explicit DO loop and cache_bypass is nearly as fast as SHMEM.

For reference, there's documentation on cache_bypass in the Cray on-line docs. (ARSC users, read "news documents" on any ARSC
system to see how to log onto the doc server.) It's in: "CF90(TM) Commands and Directives Reference Manual," section 3.4.3.

Book Reviews: MPI and PVM News

The cover of the first edition of 'MPI, The Complete Reference' was a somewhat calming blue. The new two-volume
set, however, a second edition of the first book and a volume covering many of the MPI2 extensions, comes in alarming orange and yellow
respectively. (Another good text on MPI, 'Using MPI,' was also blue. This often caused confusion when helping users over
the telephone: "Have you got an MPI book?" "Yes, which one?" "The blue one.") So, what else has changed apart from the color?

Volume 1, The Complete Reference.

The MPI2 meetings revised MPI and added some new features to the core of the language. All calls and examples have
been updated to comply with MPI2, and a C++ binding has been added alongside C and Fortran. The layout is also improved, with a cleaner
distinction between the MPI function argument lists and the text. The conclusions section still covers many important parallel
processing issues and should be compulsory reading for all programmers who intend serious work with MPI or any message passing
system. As with any frequently referenced book, an updated edition is always welcome. (The first edition of the MPI book
is frequently referenced during daily work here. Editor's note: perhaps the bright colors will make the books easier to find when they
have been borrowed.)

Volume 2, MPI Extensions.

The second volume covers the majority of the new features discussed and added at the MPI2 meetings, including the following
topics.

Both MPI with threads and mixed-language programming are now covered.

One-sided communications have been added; these will be of particular interest to anybody who has worked with shmem on Cray
systems.
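To give the flavor, here is a hedged sketch of a one-sided transfer; the names, counts, and window arguments are our own, so see the MPI2 specification for the exact argument lists:

```fortran
! Sketch of MPI2 one-sided communication, shmem_put-like in spirit.
! Window size and displacement unit here are illustrative.
call MPI_WIN_CREATE (a, winsize, 8, MPI_INFO_NULL, &
                     MPI_COMM_WORLD, win, ierr)
call MPI_WIN_FENCE (0, win, ierr)

! Rank 0 writes n reals directly into a on rank 1; no receive needed.
if (myrank .eq. 0) then
  call MPI_PUT (c, n, MPI_REAL, 1, 0, n, MPI_REAL, win, ierr)
endif

call MPI_WIN_FENCE (0, win, ierr)
```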

Process creation and management have been added, building on one of the most popular features of PVM that was not carried across to
MPI in the first specification.

There is now a set of parallel IO routines which provide a portable mechanism to express the parallelism of data access
and, hopefully, reduce the effort needed by programmers. One aim was to reduce the effort needed to write general code, or to tune
code to each different filesystem configuration encountered.
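A hedged sketch of the style (the file name, counts, and offsets are our own; consult the MPI2 IO chapter for the exact interfaces): each process writes its own slice of a shared file at an offset computed from its rank.

```fortran
! Sketch of MPI2 parallel IO: every rank writes its own slice
! of one shared file. Offsets and the file name are illustrative.
call MPI_FILE_OPEN (MPI_COMM_WORLD, 'results.dat',     &
                    MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                    MPI_INFO_NULL, fh, ierr)

! Each rank's slice starts at rank * n integers (4 bytes each).
offset = myrank * n * 4
call MPI_FILE_WRITE_AT (fh, offset, idata, n, MPI_INTEGER, &
                        status, ierr)

call MPI_FILE_CLOSE (fh, ierr)
```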

An interface to F90 has been added to try to smooth the combined use of new F90 language features and MPI; it should be read
before starting an F90 and MPI programming project.

Overall, both volumes provide an essential reference and day-to-day survival guide for any programmer developing MPI
programs.

From the basic programmer trying to determine whether there is a function which performs a needed collective operation, to advanced
programmers trying to get different programs to work together on heterogeneous networks, these books cover the important issues.
One of the most useful features is the "advice to users" and "advice to implementors" notes for each MPI function. The former helps
the user know what the committee had in mind when the function was considered necessary; the implementors' advice is also useful
for the programmer, since it gives some idea of how the MPI function should work.

On the PVM side, there are tutorials, news of the latest releases, conference announcements, a summary of project activity, and many
useful code examples and libraries to use in your own programs.

Of particular note are the Harness and CUMULVS projects. Harness is a next-generation PVM aimed at large heterogeneous networks
of systems, looking at better control and fault tolerance for large scientific applications. CUMULVS eases adding
visualization and steering to PVM and MPI programs.

EPCC Survey on HPC

We received an announcement of the following:

>
> EPCC is co-ordinating a project for the European Commission to
> determine which of the facilities and services provided by HPC centres
> are most relevant to their users.
>
> This short questionnaire will take you around 10 minutes to complete.
>
> By participating, you will ensure that your views and opinions are
> taken into account in determining the future provision of services at
> HPC centres like EPCC. The closing date for responses is 30 April
> 1999.
>

Math Trio Named Nation's Best in Modeling Contest

[ This just arrived... Congrats to the UAF Math Department and the
"trio"! ]

April 2, 1999

Fairbanks, Alaska - With long pony-tailed hair, pierced body parts and Teva sandals, University of Alaska Fairbanks seniors
Gregg Christopher, Orion Lawlor and Jason Tedor could pass for MTV musicians, not award-winning mathematicians. But don't let their
appearance fool you. When it comes to math muscle, these guys are simply the best in the nation. This Alaskan trio of brainiacs just
won top honors in the 1999 Mathematical Contest in Modeling, one of the most grueling competitions in the country, and earned
bragging rights over teams from powerhouse schools like Harvard and Yale.

Winning the competition is nothing new to Nanook mathematicians. UAF has ranked in the top two percent of all schools
participating a record number of six times. "No other school in the universe can match this record," said Clif Lando, UAF
mathematical sciences department head.

The MCM, held each February for college undergraduates, is designed to improve problem-solving and writing skills in a team
setting. Students have 89 hours to come up with the solution to real-world problems involving natural sciences and mathematics.
More than 400 universities from around the world compete.

This year's triumphant triumvirate have all competed on modeling teams in the past few years, so they knew what challenges
they faced. Their coach, assistant math professor Chris Hartman, could empathize with the team's anticipation of the event: he
was on the winning 1990 modeling team. Hartman, who got his bachelor's degree from UAF in 1991, went on to get a Ph.D. with a
focus in graph theory from the University of Illinois. He now holds a joint appointment at UAF's Arctic Region Supercomputing
Center and the Department of Mathematical Sciences.

At precisely midnight Thursday, the UAF team tore open an envelope containing this year's problems. Teams across the nation
were synchronized to open their envelopes at the same time, with the same 89-hour deadline. UAF decided to tackle the problem
of how to model people evacuating from rooms during an emergency, based on how many occupants were in the room.

The trio modeled two scenarios for the problem. One was a mathematical tree using fractions to show how quickly people move
through parts of a room. The second was a simulated room designed on a computer with people represented as red discs. The discs
were programmed with many human foibles: they shoved and bounced off one another, navigated around furniture, and were
indecisive about where to exit.

In 89 hours, the team researched everything from fire safety codes to psychological profiles of World Trade Center bombing
escapees. They measured the dimensions of several campus facilities (Schaible Auditorium, the Olympic-sized pool and gym at the
Patty Center, the Wood Center Ballroom) to use as parameters for their models. Then, they created a computer program to crunch
numbers and visually display the models. As the competition ran down to its last critical hours, the team wrote their paper
explaining their techniques: a whopping 90-page mathematical modeling manifesto.

"After the competition last year, I was so tired I couldn't even lift a nacho chip to eat at the Pub," said team member Gregg
Christopher, from Anchorage. "At least this year I was able to eat my nachos afterwards." Teammate Orion Lawlor, from Glenallen,
didn't worry about food. He slept for 21 consecutive hours after the competition. And Jason Tedor, the third member, slept on
and off for two days after their winning paper was postmarked and in the mail.

"We've all been through this competition hell multiple times and ask ourselves, why do it again?" said Tedor, who hails from
North Pole. "The answer is simple. We did it for the challenge."

Quick-Tip Q & A

A:{{ Sourdough Sam sells moose for $10 each, reindeer for $3 each, and
ducks for $0.50 each. He floats his raft down the Yukon River to
the Chilkoot Pass Trading Post one spring morning and sells
exactly 100 animals for exactly $100, selling at least one of each
species. How many of each did he sell? }}
The contest is now officially closed and we'll be sending awesome
prizes to seven winners next week. Here's my favorite reply:
> He sold 0 Moose, 20 Reindeer (Totaling $60), and 80 ducks (Totaling
> $40).
I responded that, nope, there had to be at least one of each critter,
and then received a second reply:
> Ok,
>
> How about 5 moose ($50), 1 reindeer ($3), and 94 ducks ($47).
>
>
> program farm
>   do i=0,10
>     r1 = 100 - (10 * i)
>     r2 = 100 - i
>     ducks = (3*r2 - r1) / 2.5
>     deer = r2 - ducks
>     print *, i, deer, ducks
>   enddo
> end
>
Guy lamented that this wasn't a parallel program.
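For the record, the reply's answer can be checked, and shown unique, with a little algebra:

```latex
% m moose, r reindeer, d ducks:
m + r + d = 100, \qquad 10m + 3r + \tfrac{1}{2}d = 100 .
% Eliminating d = 100 - m - r and clearing fractions:
19m + 5r = 100 \;\Rightarrow\; 5 \mid m
  \;\Rightarrow\; m = 5,\; r = 1,\; d = 94 .
```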
Q: What simple change can I make to improve my code's performance on
the T3E?

[ Answers, questions, and tips graciously accepted. ]

The University of Alaska Fairbanks is an affirmative action/equal
opportunity employer and educational institution and is a part of the University
of Alaska system.
Arctic Region Supercomputing Center (ARSC) |PO Box 756020, Fairbanks, AK 99775 | voice: 907-450-8602 | fax: 907-450-8601 | Supporting high performance computational research in science and engineering with emphasis on high latitudes and the arctic.