Iceberg Status Update

ARSC is installing a large IBM p655+/p690+ cluster, "iceberg," with IBM's new Federation Switch technology for the server interconnect. The switch was successfully installed early this month and is now being tested by ARSC staff.

A schedule for pioneer user access will be announced later.

A "Universal" High Performance Code?

[ Thanks to Jeff McAllister for this article and code. ]

Much is said about the benefits of one architecture over another. However, from the standpoint of writing code, extensive optimization for a specific machine is often not the best use of effort.

I hoped to find some concepts which could endure beyond a product lifecycle. I set out to write a simple, portable, distributed-memory multiprocessor code free of architecture-specific optimizations yet able to achieve near-peak performance. For inspiration I started with Guy Robinson's hard-to-beat "gflop" code from HPC Newsletter 213.

What I ended up with is an MPI midpoint-rule integral solver; its main loop is included below. The function this version integrates is y=.5x+1 from 0 to 2000. The result should be very close to 1002000, regardless of the number of processors used. (Integrals make nice performance tests because they can generate a lot of work, each step can be done independently, and results are easily predictable.)
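Here's a minimal sketch of how the full program can be organized around that loop (a reconstruction for illustration, not the original integral_mpi.f90; the total step count, the double-precision sums, and the way the range is split across processors are all assumptions):

program integral_mpi
  implicit none
  include 'mpif.h'
  integer :: ierr, rank, nprocs, i, nsteps
  real(kind=8) :: a, b, a1, interval, x1, x2, xmid, y, sum1, total

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  a = 0.0d0                           ! integrate y = .5x + 1 from 0 to 2000
  b = 2000.0d0
  nsteps = 10000000 / nprocs          ! steps per processor (assumed total)
  interval = (b - a) / dble(nsteps * nprocs)
  a1 = a + rank * nsteps * interval   ! left edge of this rank's subrange
  sum1 = 0.0d0

  ! midpoint rule over this rank's subintervals (i runs to nsteps-1 here
  ! so each midpoint is counted exactly once)
  do i = 0, nsteps - 1
     x1 = (i * interval) + a1
     x2 = ((i + 1) * interval) + a1
     xmid = (x1 + x2) * 0.5d0
     y = (xmid * 0.5d0) + 1.0d0      ! the function being integrated
     sum1 = sum1 + (y * interval)
  end do

  ! combine the partial sums; rank 0 prints the result
  call MPI_Reduce(sum1, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'integral = ', total

  call MPI_Finalize(ierr)
end program integral_mpi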

As I was looking for operations/second, I manually counted 12 adds/multiplies per loop iteration, including incrementing the counter. Compiler optimizations like unrolling will affect the actual instruction count, so the compute rates this method generates should be considered estimates. However, the results are usually similar to the actual CPU counter results reported by the various vendor tools.

Consistent high performance is elusive. The code is quite portable and should run wherever Fortran 90 and MPI are available. I/O and memory access are not bottlenecks in this code. On vector machines it vectorizes perfectly. On cache machines it has about as much locality as you can get. The number of memory locations needed is so low that the variables should remain in CPU registers without ever touching even L1 cache. Even so, peak is still far away on some platforms.

Compiler options may help. On the Power4 machines, for example, this code runs almost twice as fast when compiled with -O4 as with -O3. The -O5 option is not best in this case. And on the T3E, performance improved from 207 to 678 MFLOPS with: "ftn -Oscalar3,aggress,bl,pipeline3,split2,unroll2 integral_mpi.f90". (My default on all the other platforms is -O3.) More time with the compiler options may lead to similar improvement, especially in the cases farthest from peak.
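On the IBMs, for example, the whole build is a one-liner (assuming the POE compiler script mpxlf90; substitute your own source file name):

% mpxlf90 -O4 integral_mpi.f90 -o integral_mpi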

However, I'm not convinced it will be so easy. While originally developed for our Cray X1, this code's performance showed a definite spike when moved to the IBM systems. (For other codes, the reverse could just as easily be true.) Probably the main reason this program does so well on the Power4 chips has to do with a lucky match between the architecture and the algorithm. As this is a midpoint integral solver, the main kernel has a lot of multiplies and adds in succession:

do i=0,nsteps
   x1=(i*interval)+a1        ! 1 multiply, 1 add
   x2=((i+1)*interval)+a1    ! 1 add, 1 multiply, 1 add
   xmid=(x1+x2)*.5           ! 1 add, 1 multiply
   y=(xmid*.5)+1.0           ! 1 multiply, 1 add
   sum1=sum1+(y*interval)    ! 1 multiply, 1 add
end do                       ! plus 1 add per iteration for the counter

Multiply-add (FMA) just happens to be a single hardware operation on the Power4. When the compiler can represent code with this instruction, two operations occur in one cycle.
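For instance, two statements in the kernel above already have the a*b+c shape that can map onto a single FMA each (an illustrative pairing; the compiler's actual instruction selection may differ):

y=(xmid*.5)+1.0           ! one fma: a multiply and an add in one instruction
sum1=sum1+(y*interval)    ! another fma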

Clearly, some algorithms are a better fundamental match to some architectures than others. Guy Robinson's original "gflop" still gets closer to peak on the Cray vector systems, though this code performs better on the IBMs (and, presumably, on a wider variety of MPP and vector systems). As another argument in this code's favor, it could be more easily rewritten to match the special features of any platform, possibly by just changing the function it integrates.

The search for a universal strategy to demonstrate and achieve high performance continues. Fortunately, just as there is a wide variety of codes, there is a wide variety of machines to run them.

Basic X1 Optimization Tools: pat_hwpc, loopmarks, cray_pat

Once your code is running correctly on the X1, you may want to assess its performance to determine whether it needs to be sped up.

The three basic X1 performance analysis tools are:

pat_hwpc

compiler loopmark listing

cray_pat

pat_hwpc

This tool reports data from the X1 hardware performance counters. The data is collected over the entire run of the program, and thus can't point you to a specific subroutine or loop which might need attention. It has no effect on the performance of your code, and running it requires no recompilation or relinking. To use it, preface your execution command with "pat_hwpc," as follows:

% pat_hwpc ./a.out

or for an MPI job:

% pat_hwpc mpirun -np 8 ./a.out

This tool works in most cases, is trivial to use, and provides invaluable data.

A pat_hwpc report from an actual user application (140k lines of source) showed that 98% of the operations were vector (this is good!), the vector length was 44 (okay), the computational intensity was 4.5 flops per load (good), and performance was 1.4 GFLOPS (okay). This is acceptable, but if the user were planning multiple long production runs we'd want to dig deeper for possible improvement.

Compiler loopmark listing

The compiler will annotate a copy of your source showing, loop by loop, which optimizations were applied. With the Cray Fortran compiler, adding "-rm" to the compile line produces this listing in a ".lst" file. At the end of the ".lst" file you'll find messages explaining why each loop was or wasn't vectorized or streamed (for instance, there was a dependency on variable "X", a non-vectorizable function call, etc.). Loopmark listing is now available for C programs, too (via "-h list=m").
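The loopmarks themselves appear in the left margin of the listing, something like this (an illustrative sketch, not output from a real compile; "M" marks a multistreamed loop, "V" a vectorized one):

  MV--<    do i = 1,n
  MV         a(i) = b(i) + s*c(i)
  MV-->    end do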

cray_pat

Loopmark listing (above) is only really useful if you know where the code spends its time. (A loop which accounts for 0.1% of the time can be ignored, even if it doesn't vectorize.)

Profiling your code helps focus your optimization efforts. cray_pat is like Unix prof or gprof, and will show the percentage of time spent in each subroutine or function (or loop, if needed).

Here's how to get a basic profile. First, compile your code as usual, with all desired optimizations. Then:

Step 1:

"Instrument" the executable file for profiling.

The exact object files used when the file was linked must be available in their original locations because "instrumenting" the code automatically relinks it as well. The "pat_build" tool performs the task. In this example, a.out is a pre-existing executable file, and a.out.inst will be generated:

% pat_build a.out a.out.inst

Step 2:

Run the instrumented binary exactly as you'd run the original. This will produce an experiment file (with the suffix .xf) containing output statistics for the run.

% ./a.out.inst

Step 3:

Generate a human-readable report from the .xf file using a second tool, "pat_report." E.g., given the .xf file produced in step 2:

% pat_report <xf_file>

The report includes a table showing the percentage of time spent in each subroutine. Given this table, you know which loopmark listing file to examine first... (in this case, that which contains the subroutine "count_pair_position").

If you need help with any of these tools, contact ARSC consulting (consult@arsc.edu). Also see our "getting started" document for the X1:

http://www.arsc.edu/support/howtos/usingx1.html

Quick-Tip Q & A

A:[[ I'm finally appreciating the benefits of the "find" command, but
[[ here's a problem.
[[
[[ When I use grep from a find command, grep doesn't tell me the names
[[ of the files! Sure I've got hits, but what good is it if I can't
[[ tell what files they're in?
[[
[[ % find . -name "*.f" -exec grep -i flush6 {} \;
[[ include(flush6)
[[ !!dvo!! include(flush6)
[[ !!dvo!! include(flush6)
[[ include(flush6)
[[
[[ Any suggestions?
#
# Many thanks to nine (yes, 9) responders. There was duplication, so here
# are 5 responses which cover the range of answers.
#
#
# John Skinner
#
You have to add an extra filename for grep. /dev/null works best:
% find . -name "*.f" -exec grep -i "program rir" /dev/null {} \;
This is needed because grep won't list the filename of a match when
given only one file on the command line or when a wildcard like *.f only
expands to one filename. Since find's -exec option runs grep on only one
filename at a time, grep never gets two or more files on its command
line. Add /dev/null to get 2 files each time grep is run, with one of
them guaranteed to NEVER match.
You can also "turn around" the find/grep,
% grep -i "program rir" `find . -name "*.f"`
but check this out when *.f winds up being only one filename:
% ls *.f
r.f
What the heck! Where's my filename, with either method??
% grep -i "program rir" `find . -name "*.f"`
program rir
% find . -name "*.f" -print | xargs grep -i "program rir"
program rir
Again, add an extra filename for grep:
% grep -i "program rir" `find . -name "*.f"` /dev/null
./r.f: program rir
% find . -name "*.f" -print | xargs grep -i "program rir" /dev/null
./r.f: program rir
#
# Brad Chamberlain
#
The key is to find the flag on your grep command that prints the filename,
since find will call grep on each file one by one. On my desktop
systems (linux-based), it's --with-filename, so I use:
find . -name "*.txt" -exec grep --with-filename ZPL {} \;
#
# Daniel Kidger
#
Many versions of grep (e.g., GNU) have a -H option. This prefixes the
output with the filename. The -n option of grep is handy too - it shows
the line number in the file. Also, I generally prefer to use 'xargs'
rather than the slightly clumsy '-exec' option of find. (xargs's -l
option feeds one line at a time to whatever command follows.)
Hence
$ find . -name "*.f" | xargs -l grep -inH getarg
./danmung.f:91: call getarg(1,file_in)
./danmung.f:92: call getarg(2,file_out)
./danfe.f:529:! .. cf use of GETARG, & if NARG = 0.
(Note, in years gone by 'find' often needed a '-print' option in the
above.)
#
# Jed Brown
#
You are probably looking for the -H option for grep (most versions).
Otherwise, you can use:
% grep -i flush6 `find . -name "*.f"`
since usually, grep prints the name of the file if it receives several
arguments on the command line. If this does not work or if it exceeds
the maximum number of command line arguments, you can always do
something like:
% echo 'a=$1; shift; for f in $*; do grep $a $f | sed "s|^|$f:\t|"; done' > mygrep
% chmod a+x mygrep && find . -name "*.f" -exec ./mygrep "-i flush6" {} \;
#
# Kurt Carlson
#
In ksh syntax:
find . -name "*.f" -print |
while read F; do
  grep -i flush6 $F >/dev/null; if [ 0 = $? ]; then echo "# $F"; fi
done
Q: Are data written from a Fortran "implied do" incompatible with a
regular "read"? If so, is there a way to make them compatible,
without rewriting the code?

I just want to read data elements one item at a time from a
previously written file. Here's a test program which attempts
to show the problem:
iceflyer 56% cat unformatted_io.f
program unformatted_io
  implicit none
  integer, parameter :: SZ=10000, NF=111
  real, dimension (SZ) :: z
  real :: z_item, zsum
  integer :: k

  zsum = 0.0
  do k=1,SZ
    call random_number (z(k))
    zsum = zsum + z(k)
  enddo
  print*,"SUM BEFORE: ", zsum

  open(NF,file='test.out',form='unformatted',status='new')
  write(NF) (z(k),k=1,SZ)
  close (NF)

  zsum=0.0
  print*,"SUM DURING: ", zsum

  open(NF,file='test.out',form='unformatted',status='old')
  do k=1,SZ
    read(NF) z_item
    zsum = zsum + z_item
  enddo
  close (NF)
  print*,"SUM AFTER: ", zsum
end
iceflyer 57% xlf90 unformatted_io.f -o unformatted_io
** unformatted_io   === End of Compilation 1 ===
1501-510  Compilation successful for file unformatted_io.f.
iceflyer 58% ./unformatted_io
 SUM BEFORE:  5018.278320
 SUM DURING:  0.0000000000E+00
1525-001 The READ statement on the file test.out cannot be completed
because the end of the file was reached. The program will stop.
iceflyer 59%
