ARSC T3E Users' Newsletter 116, March 21, 1997

T3E Overview

This article quickly hits on many aspects of the T3E and its user environment. In coming issues, we will delve into the details, but the simple question today is: What are the first things you will notice about the ARSC T3E?

Standalone system

First, the T3E is a standalone system: there is no front-end system such as the YMP. So what handles compiling and the other tasks that the front end used to perform? To answer this, a brief description of the structure of the T3E is needed.

OS, Command, and Application Processors

The T3E consists of a collection of processors connected by a torus, in a similar manner to the T3D. On the T3D, however, all of these processors were available for user tasks. On the T3E, the processors are divided into three classes: OS processors, which perform low-level support for the operating system; command processors, which support users' interactive sessions; and application processors, which run the parallel tasks users submit to the system.

grmview, mppview, xmppview

A new utility, grmview, gives a detailed view of how the processors are distributed among these three classes and of their current usage. The number of processors in each group will change to reflect the load on the system. mppview replaces mppinfo for describing processor usage, and a new graphical interface, xmppview, gives a 3D view of the system and various aspects of processor usage.

No more powers of two

Users are no longer restricted to processor counts that are powers of two: any number of processors can be used. Users now have greater freedom to request the most suitable number of processors for the current job, e.g. the minimum number of processors that provides sufficient memory. Removing this constraint should result in greater throughput and improved turnaround overall.
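As a rough illustration of sizing a job by memory rather than by a power of two, the sketch below computes the smallest processor count that holds a given data set and compares it with the next power of two. The per-processor memory figure, the problem size, and the function names are assumptions made for the example, not ARSC configuration values.

```python
import math

def min_pes_for_memory(total_mbytes, mbytes_per_pe):
    """Smallest number of PEs whose combined memory holds total_mbytes."""
    if total_mbytes <= 0 or mbytes_per_pe <= 0:
        raise ValueError("memory sizes must be positive")
    return math.ceil(total_mbytes / mbytes_per_pe)

def next_power_of_two(n):
    """Smallest power of two >= n (the count a power-of-two limit would force)."""
    return 1 << (n - 1).bit_length()

# Hypothetical numbers: a 1500 MByte problem with 128 MBytes usable per PE.
needed = min_pes_for_memory(1500, 128)
print(needed, next_power_of_two(needed))
```

With these made-up figures the job needs 12 processors, where a power-of-two restriction would have forced a request for 16.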

Fortran 90

The Fortran 77 compiler has been replaced with a Fortran 90 compiler, which should accept all current f77 codes. Users can test f77 code under the f90 compiler on denali.

Performance

Performance has been improved. The observed bandwidth between processors has increased from 120 MByte/sec to 320 MByte/sec for SHMEM, and from 15 MByte/sec to 80 MByte/sec for MPI. Early experience shows that code brought over from the T3D runs 2-3 times faster without any modification.
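To put the quoted bandwidth figures side by side, here is a small sketch that computes the T3E-to-T3D ratio for each library. It uses only the four numbers given above; the table and function names are hypothetical.

```python
# Observed bandwidths quoted in this article, in MByte/sec: (T3D, T3E).
BANDWIDTH = {"SHMEM": (120, 320), "MPI": (15, 80)}

def improvement(library):
    """Ratio of T3E to T3D observed bandwidth for the given library."""
    t3d, t3e = BANDWIDTH[library]
    return t3e / t3d

for name in BANDWIDTH:
    print(f"{name}: {improvement(name):.1f}x")
```

The ratios come out at roughly 2.7 for SHMEM and 5.3 for MPI. These are bandwidth ratios only; the speedup of a whole application also depends on how much of its time is spent communicating.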

SHMEM

With the SHMEM library, SHMEM_PUT and SHMEM_GET now give similar bandwidths, so users can choose whichever best matches the data transfer activity. Special hardware now ensures the caches are consistent, so calls to SHMEM_UDC_FLUSH and SHMEM_SET_CACHE are no longer necessary: the cache coherency functions are no-ops on the T3E. However, users are encouraged to keep these calls in their code so that it remains portable between the T3D and T3E. Improved routing in the torus, designed to avoid contention and hotspots, means that data sent via SHMEM_PUT may not arrive in the same order in which the SHMEM_PUT calls were made. This is of particular importance for users who determine whether a transfer is complete by reading the value of the last data item in the transfer. SHMEM_FENCE can be used to enforce the ordering of transfers, or one may test the entire message with SHMEM_WAIT.
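The ordering hazard can be pictured with a toy model. The sketch below makes no SHMEM calls: a Python list stands in for the receiver's memory, and the arrival order is a hypothetical one chosen by hand. It reports which items are still missing at the moment the last item of the transfer lands, which is exactly when a receiver polling only that last item would stop waiting.

```python
def missing_when_last_arrives(n_items, arrival_order):
    """Return indices not yet delivered at the moment the last item lands.

    arrival_order lists item indices in the order the network delivers them.
    """
    delivered = [False] * n_items
    last = n_items - 1
    for idx in arrival_order:
        delivered[idx] = True
        if idx == last:
            # A receiver polling only the last item would stop waiting here.
            return [i for i, done in enumerate(delivered) if not done]
    raise ValueError("last item never arrived")

in_order = missing_when_last_arrives(4, [0, 1, 2, 3])
reordered = missing_when_last_arrives(4, [0, 1, 3, 2])
print(in_order, reordered)
```

With in-order delivery nothing is missing when the last item arrives. With the reordered delivery, item 2 is still outstanding, so code that trusts the last item alone would read stale data.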

Malleable executables, mpprun

Users can compile for a fixed number of processors and simply type the name of the program at the command line to execute it. Malleable executables, for which the number of processors is not specified at compile time, conceptually replace the T3D's plastic executables. To run these, use the command:

  mpprun -n NPES a.out [ARGS]

where you would have used the T3D command:

  [mppexec] a.out [ARGS] -npes NPES
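For scripts that launch jobs, the T3E command form can be assembled mechanically. The helper below is a hypothetical convenience, not part of any Cray tool; it uses only the "mpprun -n NPES a.out [ARGS]" syntax given above and just builds the string, without executing anything.

```python
def mpprun_command(npes, executable, args=()):
    """Build the T3E launch line 'mpprun -n NPES a.out [ARGS]' as a string."""
    if npes < 1:
        raise ValueError("npes must be at least 1")
    return " ".join(["mpprun", "-n", str(npes), executable, *args])

# Hypothetical job: 12 processors, one input-file argument.
print(mpprun_command(12, "a.out", ["input.dat"]))
```

Note that the processor count comes before the executable name, whereas the T3D form placed -npes after the program's arguments.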

Timings on Multiple Cray Platforms (revisited)

Newsletter #99 presented a Fortran subroutine that used the "gethmc" system call to obtain the system clock frequency and that worked across Cray platforms. The programmer could bracket some code with a pair of calls to "irtc" to determine the number of clock ticks spent in that code and then, using the clock speed, compute the elapsed time in seconds, the Mflop/s rate of that code, etc.
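The arithmetic behind that approach is simple enough to sketch. The functions below convert a pair of tick counts into seconds and then into an Mflop/s rate. They are written in Python purely for illustration; the function names, the tick counts, and the 300 MHz clock rate are assumptions for the example, and on a real system the clock rate would come from the system itself.

```python
def elapsed_seconds(start_ticks, end_ticks, clock_hz):
    """Convert a pair of tick counts into elapsed seconds."""
    return (end_ticks - start_ticks) / clock_hz

def mflops(flop_count, seconds):
    """Millions of floating-point operations per second."""
    return flop_count / seconds / 1.0e6

# Hypothetical values: 600 million ticks at an assumed 300 MHz clock,
# during which the bracketed code performed 1e9 floating-point operations.
secs = elapsed_seconds(1_000, 600_001_000, 300_000_000)
print(secs, mflops(1_000_000_000, secs))
```

With these made-up figures the bracketed code took 2.0 seconds and ran at 500.0 Mflop/s.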

"gethmc" has been replaced in the PE 2.0 libraries with a POSIX-compliant routine, PXFSYSCONF. Here's an f90 module which can now be used, across platforms, to get the clock speed (it is taken from the "man" page for PXFSYSCONF and modified):

Quick-Tip Q & A

A: {{ How would you condense every occurrence of multiple blank lines in a
file into a single blank line? }}
# Here's a perl script that will do it (except that multiple
# blanks at the top will come out as two). There are certainly
# other ways to do this, using sed, for instance, but why mess
# with sed when you've got perl? (Yes, the T3E is getting perl.)
########
#!/usr/local/bin/perl
while (<STDIN>) {            # Load all input lines into one string
    $f .= $_;
}
$f =~ s/\n\n[\n]*/\n\n/g;    # Replace all groups of two or more
                             # consecutive newlines with exactly two.
print $f;                    # Output result.
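
For comparison, the same substitution can be sketched in Python. This is an illustrative alternative rather than part of the original answer; the function name is hypothetical, and the sketch shares the Perl script's behavior of leaving two newlines where a run of blank lines opens the file.

```python
import re

def squeeze_blank_lines(text):
    """Replace each run of two or more newlines with exactly two."""
    return re.sub(r"\n{2,}", "\n\n", text)

# Three blank lines between a and b, two between c and d.
print(repr(squeeze_blank_lines("a\n\n\n\nb\nc\n\n\nd\n")))
```

In the sample string each run of blank lines collapses to a single blank line, and the single newline between b and c is left alone.
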
Q: In Cray's Programming Environment 2.0, how can you tell what versions
of libraries, compilers, etc... will be used as the current default?
