This tutorial is intended for users of Livermore Computing's Sequoia
BlueGene/Q systems. It begins with a brief history leading up to the BG/Q
architecture. Configuration information for the LC's BG/Q systems
is presented, followed by detailed information on the BG/Q hardware architecture,
including the PowerPC A2 processor, quad FPU, compute, I/O, login and service nodes,
midplanes, racks and the 5D Torus network. Topics relating to the
software development environment are covered, followed by detailed
usage information for BG/Q compilers, MPI, OpenMP and Pthreads. Math libraries,
environment variables, transactional memory, speculative execution, system configuration information, and
specifics on running both batch and interactive jobs are presented.
The tutorial concludes with a discussion on BG/Q debugging and
performance analysis tools.

Level/Prerequisites: Intended for those who are new to developing
parallel programs in the IBM BG/Q environment. A basic understanding of parallel programming in C or Fortran is required. Familiarity with MPI and OpenMP is desirable. The material covered by EC3501 - Introduction to Livermore Computing Resources would also be useful.

Evolution of IBM's Blue Gene Architectures

This section provides a brief history of the IBM Blue Gene architecture.

1993-1998: QCDSP Predecessor:

The QCDSP (Quantum Chromodynamics on Digital Signal Processors)
project, led by Columbia University and four other collaborating institutions,
designed and built systems that incorporated features later seen in the
Blue Gene architecture.

At the SC2002 conference, Energy Secretary Spencer Abraham announced
that the Department of Energy
(DOE) had awarded IBM a $290 million contract to build the two fastest
supercomputers in the world with a combined peak speed of 460 trillion
calculations per second (teraflops).

The first system would be ASC Purple, with a peak speed of 100 teraflops and
over 12,000 processors.

The second system would be Blue Gene/L, with a peak speed of 360 teraflops
and 130,000 processors.

First production Blue Gene/L system ranks #1 on the
Top500 List seven
consecutive times.

November 2004: IBM Rochester beta Blue Gene system achieves 71 Tflop Linpack
for #1 position on the Top500 List.
The 16-rack, 16,384-node system was later moved to
Lawrence Livermore National Laboratory for installation.

June 2005: 32-rack, 32,768 node LLNL system achieves 137 Tflop Linpack for #1
position on the Top500 List again. The press release is available
HERE.

February 2009: The Department of Energy awards IBM a contract to build
a 20 petaflop, next generation Blue Gene system at LLNL. The new
architecture is called Blue Gene/Q, a follow-on to the BG/P
architecture.

Although similar to BG/P, there are significant differences and
improvements.

June 2012: Sequoia, the first production BG/Q system, debuted on the
Top500 List at #1 with a
16.3 petaflop Linpack and 20.1 petaflop peak. Three other BG/Q systems
also debuted in the top 10 positions on this list.

The BG/Q software and development environment is similar in many ways
to LC's other production clusters. Topics such as batch systems,
file systems, debuggers and tools are only summarized here.
These topics are covered in more
detail in the Introduction to LC Resources tutorial and other
sources linked to below.

One major difference between a BG/Q system and non-Blue Gene
clusters is the difference between the environment where users
develop their codes (login nodes) and where the codes actually run
(compute and I/O nodes).

The diagram below summarizes the software environment for BG/Q
different node types.

Login Nodes:

Users must login to one of the designated front-end login nodes.
Use of the generic cluster login name, such as seq
or rzuseq, will automatically round-robin users across the
different login nodes.

After logging into the cluster login, you can see the list of available
login nodes by using the command:

NOTE: system calls should not be confused with library calls.
For example, statvfs is a library call and is
not shown in the table below. It does call the
statfs system call, which is supported, and shown
in the table below. Another example is gethostname
which is a library call and does not appear in this list.

Some users have reported problems with X11 applications such as firefox, emacs,
xemacs, evince, stat-view, userinfo, and gtk-related binaries crashing when run
on the BG/Q frontend nodes.

This is due to an incompatibility between the X-windows software on the frontend
node and the user's desktop system.

For desktops running Microsoft Windows, the solution is to install X-Win32 2011
or later. LLNL users may contact desktop support at 4-HELP (4-4357) or
http://frontrange.llnl.gov
and request installation of X-Win32 2011 or later.

If your desktop is not administered at LLNL, you or your system administrator
can contact the LC Hotline (
lc-hotline@llnl.gov, 925-422-4531) if more information is needed about
installing these patches.

For Mac desktops, LC has not had any reports of X11 incompatibilities.

LC has not tested other X-windows emulators on Microsoft Windows desktops.

Web Browsers, PDF Viewers:

The firefox web browser is available on the BG/Q frontend
nodes.

The elinks text-based web browser is also available.

PDF viewer: use evince. The popular acroread is
not available on these nodes.

Man Pages:

Linux man pages (and SLURM man pages) are stored in
/usr/share/man.

If you have problems bringing up the man page for a Linux command, check
your MANPATH variable for /usr/share/man.

BG/Q compiler man pages are installed with the compiler software under
/opt/ibmcmp. They should be in your path - as shown below:

Compilations are performed on the BG/Q front-end login nodes, which
are an entirely different architecture (IBM Power7), requiring cross-compilation.
See Login Nodes for details.

At LC, the default BG/Q compiler commands (see tables further down)
are actually wrapped as driver scripts under
/usr/local/bin. These scripts do the following:

Check for common cross-compilation errors - mainly compile line options
that include things on the front-end nodes under
/usr/include, /usr/local/include, /usr/X11,
/usr/lib, /lib, /usr/local/lib, etc.

Call the "real" compile scripts located under
/bgsys/drivers/ppcfloor/comm/xl/ and
/bgsys/drivers/ppcfloor/comm/gcc/ which parse the
compile line options and select the MPI libraries, includes, etc.

These in turn invoke the BG/Q serial compilers to perform the actual
compilation/linking.

Important:
If you are doing serial compiling on the Power7 LAC front-end nodes, be
sure to use the BG/Q compiler command and not the Power7 front-end command.
Commands are shown in the tables below.

IBM compilers include many options - too numerous to be covered here.
For a full discussion, consult the IBM compiler documentation.
A summary of some useful BG/Q options is listed in the table below.

To find out which options are in effect, including defaults, compile
with the -qlistopt flag and review the
*.lst report.

Transformation report showing how the code was optimized when using -qhot or -qsmp

Source code listing

Cross-reference listing

-p -pg

Generate profiling support code. -p is required for use with
the prof utility and -pg is required for use with the
gprof utility.

-qmaxmem=#KB

Specifies how much memory (in kilobytes) the compiler can use for optimizations. If the compiler complains, try using -1 (unlimited). Default: 8192

-qsimd=auto

Converts certain operations that are performed in a loop on successive elements of an array into vector instructions. These instructions calculate several results at one time, which is faster than calculating each result sequentially. This is the default.

-qsmp=omp

Turn on directives for OpenMP compilation. Note: implies -O2
and -qhot unless qualified as -qsmp=omp:noopt.
Use a thread-safe compiler - its name ends with "_r".

-qsmp=speculative
-qsmp=omp:speculative

Turn on directives for thread-level speculative execution compilation. If using with OpenMP, include omp:. Use a thread-safe compiler - its name ends with "_r".

-qstrict
-qnostrict

Turns on/off aggressive optimizations which have the potential to alter the
semantics of a user's program. See documentation for many suboptions.
Default: -qstrict

-qtm

Turn on directives for transactional memory compilation. Use a thread-safe compiler - its name ends with "_r".

-v -V

Display verbose information about the compilation

-qversion

Display compiler version information

-w

Suppress informational, language-level, and warning messages.

GNU Compilers:

BG/Q versions of the GNU C/C++ and Fortran compilers are available:

version 4.4.7 (default)

version 4.7.2 (use bggcc-4.7.2)

There is also an LLVM/Clang build provided by Argonne; see the system news
for details.

These compilers do not generate optimized code specific to BG/Q, such as
support for the quad FPUs. Therefore, the IBM compilers tend to offer better
performance.

Cross-compilation on the Large Application Compile (LAC) Power7 login nodes is
required:

The GNU compilers in /usr/bin
are for the front end, not for BG/Q compute nodes, so DON'T use these

Instead, use the compiler invocation commands specific to BG/Q - see the
Compiler Commands table below

Actual compiler locations:

Serial C/C++: different locations - see table below

Serial Fortran: different locations - see table below

MPI compile scripts:
/bgsys/drivers/ppcfloor/comm/gcc/bin

The GNU compilers are thread safe

There are over 1,200 GNU compiler options - see the documentation for
details.

The usual GNU binutils such as ar, ranlib, ld, cpp, etc.
found in /usr/bin/ are for the front end nodes,
NOT the BG/Q compute nodes.

The BG/Q versions of most binutil commands and other utilities are located in
/bgsys/drivers/ppcfloor/gnu-linux/bin/ and/or
/bgsys/drivers/ppcfloor/gnu-linux/powerpc64-bgq-linux/bin/:

Command      In .../gnu-linux/bin/              In .../powerpc64-bgq-linux/bin/
----------   --------------------------------   -------------------------------
addr2line    powerpc64-bgq-linux-addr2line
ar           powerpc64-bgq-linux-ar             ar
as           powerpc64-bgq-linux-as             as
cpp          powerpc64-bgq-linux-cpp
embedspu     powerpc64-bgq-linux-embedspu
gccbug       powerpc64-bgq-linux-gccbug
gcov         powerpc64-bgq-linux-gcov
gdb          powerpc64-bgq-linux-gdb
gdbtui       powerpc64-bgq-linux-gdbtui
gprof        powerpc64-bgq-linux-gprof
ld           powerpc64-bgq-linux-ld             ld
ld.bfd       powerpc64-bgq-linux-ld.bfd         ld.bfd
nm           powerpc64-bgq-linux-nm             nm
objcopy      powerpc64-bgq-linux-objcopy        objcopy
objdump      powerpc64-bgq-linux-objdump        objdump
ranlib       powerpc64-bgq-linux-ranlib         ranlib
readelf      powerpc64-bgq-linux-readelf
size         powerpc64-bgq-linux-size
strings      powerpc64-bgq-linux-strings        strings
strip        powerpc64-bgq-linux-strip          strip

(The "..." abbreviates the /bgsys/drivers/ppcfloor/gnu-linux path shown above.)

Optimization:

Default is no optimization

Optimizations may cause the compiler to relax conformance to the IEEE
Floating-Point Standard.

The compiler uses a default amount of memory to perform optimizations.
If it thinks it needs more memory to do a better job, you may get a
warning message about setting MAXMEM to a higher value. If you specify
-qmaxmem=-1 the compiler is free to use as much memory as
it needs for its optimization efforts.

As mentioned previously, the BG/Q compute nodes are
an entirely different architecture and OS than the
Power7 Large Application Compile (LAC) front-end nodes.

If you use the LAC nodes, cross-compilation is required, which can present
some problems, often hard to diagnose. For example:

Users can unknowingly link libraries and include files from
the front-end nodes into their BG/Q executables

The configure utility, and other makefile generating tools,
create and run small test programs as part of their build process.
These programs get executed on the front-end node instead of a
backend compute node, generating incorrect information about
the build requirements.

LC has developed several mechanisms to help prevent these problems. In
most cases, these mechanisms are set to defaults appropriate to the
majority of developers.

Automatic detection of invalid path arguments during compilation:

LLNL_CHECK_COMPILE_LINE environment variable is set to
"Error" by default, which causes an error message to be displayed and the
compilation to abort.

Supported/unsupported system calls: The BG/Q compute node kernel does not
support all system calls, such as fork(), system(),
usleep(). Making such calls results in a runtime error. See the
system calls discussion for details.

Third party libraries: BG/Q builds of LC provided libraries are in
/usr/local/tools. For example:

The default versions of IBM and GNU compiler scripts use an MPI library
built with error checking and assertions enabled, plus "commthreads" support.
Commthreads are extra system threads used by the low-level PAMI messaging layer
to allow MPI messages to make asynchronous progress, with a potential for
better performance. Also called "fine-grain locking" in the IBM documentation.

As an alternative, users can "experiment" with using different versions of
the MPI library, for possible performance improvement.

Selecting which alternative version to build with can be done by setting the
COMMLIB environment variable to gcc.legacy, xl.legacy,
xl.ndebug or xl.legacy.ndebug.

To select the version that LC has found to be the fastest in most cases
(xl.legacy.ndebug), users can simply append -fastmpi to their
usual compiler command. For example:
mpicc-fastmpi.

Location of build scripts under

/bgsys/drivers/ppcfloor/comm/

Description

gcc.legacy/bin/

Built with the GNU Compiler. Coarse-grained MPICH locking (no commthreads). Error checking and assertions enabled. May provide slightly better latency in single-thread codes, such as those that do not call MPI_Init_thread( ... MPI_THREAD_MULTIPLE ...). Use one of the default MPI libraries for initial application porting work.

xl.legacy/bin/

MPICH built with the XL compilers and PAMI compiled with the GNU compilers. Coarse-grained MPICH locking (no commthreads). Error checking and assertions enabled. May provide a performance improvement over the gcc.legacy libraries for single-threaded applications.

xl.ndebug/bin/

MPICH built with the XL compilers and PAMI compiled with the GNU compilers. Fine-grained MPICH locking (commthreads). Error checking and assertions are
disabled. May provide a substantial performance improvement when an application functions satisfactorily. Do not use this library version for initial porting and application development.

xl.legacy.ndebug/bin/

MPICH built with the XL compilers and PAMI compiled with the GNU compilers. Coarse-grained MPICH locking (no commthreads). Error checking and assertions are disabled. May provide a substantial performance improvement when an application functions satisfactorily. Do not use this library version for porting and application development. May also provide a performance improvement over the xl.ndebug library version for single-threaded applications.

Additional information can be found on these LC internal (requires
authentication) sources:

IBM's BG/Q MPI provides all levels of MPI thread support specified by the
standard:

MPI_THREAD_SINGLE - Level 0: Only one thread will execute.

MPI_THREAD_FUNNELED - Level 1:
The process may be multi-threaded, but only
the main thread will make MPI calls - all MPI calls are funneled
to the main thread.

MPI_THREAD_SERIALIZED - Level 2:
The process may be multi-threaded, and
multiple threads may make MPI calls, but only one at a time. That is,
calls are not made concurrently from two distinct threads as all MPI calls
are serialized.

MPI_THREAD_MULTIPLE - Level 3:
The process may be multi-threaded, and multiple threads may make MPI calls
concurrently, with no restrictions.
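
For example, a minimal sketch (standard MPI calls; compile with one of the
thread-safe commands such as mpixlc_r) that requests the highest support
level and checks what was actually provided:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request the highest level of thread support, then check what we got */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("Requested MPI_THREAD_MULTIPLE, got level %d\n", provided);

    MPI_Finalize();
    return 0;
}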

Deterministic: Packets from a sender to a receiver go along the same
path. Requires less logic but may create network "hotspots" when routes
overlap.

Adaptive: Different packets from the same sender to the same receiver
can travel along different paths. The exact route is determined at run time,
depending on the current load. Generates a more balanced network load but
introduces a latency penalty.

Collective communications route deterministically, but point-to-point messages
route according to one of the four BG/Q message protocols (immediate, short,
eager, or rendezvous), based upon user message size.

Several environment variables can be used to "tune" message passing behavior.
Some examples are shown below.

setenv PAMID_EAGER 5000
setenv PAMID_RZV 5000

Both are equivalent. Sets the cutoff point for switching from eager to rendezvous protocol at 5000 bytes. Overrides the default of 2049 bytes. Note: use one or the other, but not both. If you use both, the setting for PAMID_EAGER overrides PAMID_RZV and you are not informed.

setenv PAMID_SHORT 25

Sets the cutoff point for switching from immediate to short protocol at 25 bytes. Overrides the default (and upper limit) of 113 bytes.

setenv PAMID_VERBOSE 1

Allows you to view the short and eager limit settings. Sent to stdout when your job runs.

The majority of parallel programs follow the Single Program Multiple Data (SPMD)
model. All parallel tasks run the same executable, but are free to use different
data.

Multiple Program Multiple Data (MPMD) jobs are different: each MPI task can run
a different executable. Likewise, sets of tasks can run one executable, and
other sets of tasks can run a different executable.

All tasks of the job share MPI_COMM_WORLD and can therefore communicate
with each other and exchange data over the 5D torus.

To enable MPMD support, first create a "mapping file" using the syntax:

Inter-node: Tasks work across the network using MPI message passing
communications

Compiling:

IBM XL: Be sure to use a thread-safe compiler - its name ends with
_r (underscore "r") and
the -qsmp=omp flag. For example:

mpixlc_r -qsmp=omp myprog.c
mpixlf90_r -qsmp=omp myprog.f

GNU: use the -fopenmp flag. For example:

mpicc -fopenmp myprog.c
mpif90 -fopenmp myprog.f

Specifying the number of threads:

By default, all four hardware threads on all 16 user cores will be
used to spawn 64 OpenMP threads in a parallel region.
This may/may not be what you want, and can easily be changed in the
usual OpenMP ways, as illustrated after this list:

Set the OMP_NUM_THREADS environment variable before you run your job

Use the omp_set_num_threads routine in your code

Use the NUM_THREADS clause with the OMP PARALLEL directive in your code
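
A minimal sketch combining the last two approaches (standard OpenMP API;
thread counts are illustrative):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Equivalent to setting OMP_NUM_THREADS before the run */
    omp_set_num_threads(32);

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("region 1: %d threads\n", omp_get_num_threads());
    }

    /* The num_threads clause overrides the setting for one region only */
    #pragma omp parallel num_threads(16)
    {
        if (omp_get_thread_num() == 0)
            printf("region 2: %d threads\n", omp_get_num_threads());
    }
    return 0;
}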

Thread limit:

For the IBM compiled codes, attempting to create more than 64 threads will
result in an error message similar to below, and the maximum of 64 threads
will be used.

Note that 64 threads is a hard limit despite the OMP_THREAD_LIMIT setting.

1587-124 SMP runtime library warning. The number of OpenMP threads exceeds the thread limit 64. The result of the program may be altered if the OMP_THREAD_LIMIT is not increased.

For GNU compiled codes, attempting to create more than 64 threads will
result in an error message similar to below, and the maximum of 64 threads
will be used.

GOMP runtime library warning. An attempt was made to increase the number of OpenMP threads past the thread limit 64.
The thread limit was set to 64

Thread stack size:

The default stack size is only 4 MB per thread

Stack overflow results in job termination, with error message(s) that may
or may not point to the problem. For example:

As of March 2014, the valid stack size range is 131,072 to 2,147,483,647
bytes (or the equivalent thereof).
Setting OMP_STACKSIZE outside this range will (usually) result in a run-time
error message and job termination.
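
For example, to request 16 MB per thread before launching (csh syntax,
matching the other examples in this tutorial; the size suffix follows the
standard OMP_STACKSIZE format):

setenv OMP_STACKSIZE 16M
srun -N64 -n1024 a.out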

Thread affinity:

In the BG/Q threading model, threads have absolute affinity to the hardware
threads they are associated with. This means they are "bound" to cores.

The POSIX threads API is supported by the IBM and GNU C/C++ compilers.
The API is not defined for Fortran.

Compiling:

IBM: be sure to use a thread-safe compiler - its name ends with
_r (underscore "r").
No special compiler flag required.

GNU: compile with the -pthread flag
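
A minimal sketch of a Pthreads program (standard POSIX API) that could be
built with either tool chain, e.g. mpixlc_r mythreads.c or
mpicc -pthread mythreads.c:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread prints its ID and exits */
void *work(void *arg)
{
    printf("Hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, work, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}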

Thread limit:

The IBM documentation states that each hardware thread can support 5
software POSIX threads, for a maximum of 320 (64*5) threads per node.
This maximum includes the process(es) responsible for spawning the Pthreads.

Attempting to create more threads than supported results in program
termination and an error message similar to:

System Configuration Details

First Things First:

Before you attempt to run your parallel application, it is important
to know a few details about the way the system is configured.
This is especially true at LC where every system is configured differently
and where things change frequently.

Several information sources and simple configuration commands
are available for understanding LC's systems - discussed below.

Provide the most timely status information for system maintenance,
problems, and system changes/updates

ocf-status and scf-status cover all machines on the OCF / SCF

Additionally, each machine has its own status list - for example:
seq-status@llnl.gov
rzuseq-status@llnl.gov
vulcan-status@llnl.gov

The LC Hotline initially adds people to the list, but just in case you
find you aren't on a particular list (or want to get off), just use
the usual majordomo commands in an email sent to
Majordomo@lists.llnl.gov.

% news job.lim.vulcan
================================ job.lim.vulcan ================================
SUMMARY OF INTERACTIVE AND BATCH JOB LIMITS ON Vulcan
----------------------------------------------------------------------
HARDWARE:
Vulcan has 24,578 total nodes (2 login nodes and 24,576 compute nodes).
Compute nodes have one socket with a 16-core IBM PowerPC A2(1.6 GHz)
and 16 GB memory/node.
SCHEDULING:
Vulcan batch jobs are scheduled using SLURM (srun), and pdebug jobs
are scheduled using SLURM (srun).
SUMMARY OF JOB LIMITS ON Vulcan
Max Max
Pool Nodes/job wall time
-------------------------- --------- ---------
| pdebug(*) | 1K | 1 hrs |
-------------------------- --------- ---------
| psmall | 1K | 12 hrs |
-------------------------- --------- ---------
| pbatch(**) | 8K | 12 hrs |
-------------------------- --------- ---------
| pall (***) | 24K | N/A |
-------------------------- --------- ---------
(*) Please limit the use of pdebug to 1k nodes on a PER USER
basis, not a PER JOB basis, to allow other users access.
Using more than the posted limit PER USER can result in job
removal without notice. Please be a good neighbor, and be
considerate of others utilizing the pdebug partition.
(**) pbatch jobs must be larger than 1K (1024) nodes. Jobs asking
for up to 1K nodes should be run in psmall or pdebug.
(***) pall pool is only used for approved full system DAT runs.
Please see the news item "dat_vulcan" for more information.
2/4/14: - pdebug partition limits changed to move more nodes to psmall
and discourage production runs in pdebug.
For documentation on building and running on Vulcan, please see the
files in /usr/local/doc or visit
https://lc.llnl.gov/confluence/display/bgq/
or
https://computing.llnl.gov/tutorials/bgq/
If you have questions about this, please contact the LC Hotline at
925-422-4531 or lc-hotline@llnl.gov.
================================ job.lim.vulcan ================================

How much memory does each MPI task require? Each node has 16 GB of memory,
which is divided evenly (minus system overhead) among the MPI
tasks. If any MPI task exceeds its portion, problems will occur.
See the Memory section for further discussion.

The number of MPI tasks per node should be a power of two:
1, 2, 4, 8, 16, 32, or 64. If not, problems can occur, as
discussed in the Memory Considerations section later.

Are your MPI tasks spawning OpenMP threads? If so, you may have up to
64 threads (4 per core) per node. The number of OpenMP
threads is usually evenly divided between the number of MPI tasks.
See the OpenMP and Pthreads section for
further discussion.

Are your MPI tasks spawning POSIX threads? If so, you may have up to
320 threads (5 per hardware thread) per node. The number of threads
includes the task(s) that spawned them.
See the OpenMP and Pthreads section for
further discussion.

Valid Block Sizes:

For jobs using more than 512 nodes: the total number of nodes should be a
multiple of 512 node midplanes.

Furthermore, this multiple of midplanes must map to a 4D torus of midplanes
where 4 is the maximum run in any dimension.

For example:

Requesting 8K nodes is 16 midplanes, which can be formed by
several different 4D midplane configurations:

1x1x4x4 1x2x2x4 1x4x2x2 1x4x1x4 ....

However, requesting 2.5K nodes is 5 midplanes, which cannot be
configured into a 4D midplane torus with a maximum of 4 in any dimension.

A complete list of all possible 4D midplane configurations and valid job
sizes up to the maximum size of Sequoia is available HERE.

The list of valid block sizes, up to the maximum size of Sequoia, is shown
below.

1K     1.5K   2K     3K     4K     4.5K
6K     8K     9K     12K    13.5K  16K
18K    24K    27K    32K    36K    40.5K
48K    54K    64K    72K    96K

In most cases, if you request an invalid block size of nodes, the scheduler
will "round up" to the next valid sized block. In some cases, it will
simply reject the job. So, it is best to stick with valid block sizes.

For jobs less than 512 nodes: these jobs will be put into
a sub-block which may or may not exactly match the requested number of nodes
(there may be extra unused nodes allocated). Valid sub-block sizes are a power
of 2: 1, 2, 4, 8, 16, 32, 64, 128, 256, or 512 nodes.

Although you can run up to 64 MPI tasks per node, this is not always a good
idea. Node memory is equally divided between all tasks, resulting in each
task having only 256 MB of memory. Additionally, there are increased
overheads within the MPI library, particularly at higher task counts.

Setting the environment variable BG_MAPCOMMONHEAP=1 is
recommended for obtaining the optimal memory allocation per MPI task. Details
are discussed in the Memory Constraints section.

The best performance is achieved when you are able to keep at least two of
the hardware threads on each core busy. The preferred way of doing this is
with hybrid programming using MPI with OpenMP/Pthreads.

The srun Command

At LC, jobs are launched with SLURM's srun command. This differs from
standard IBM BG/Q systems, which use IBM's runjob command.

The srun command is used for both interactive and
batch jobs:

Interactive jobs use srun from the command line
prompt

Batch jobs use srun from within their job control
script

Examples:

srun -n1024 a.out

Run the executable a.out using 1024 tasks with 16 tasks per node (default). The
default of 16 tasks per node corresponds with 16 cores per node. In this case,
64 nodes would be required.

srun -N1024 a.out

Run the executable a.out using 1024 nodes, one task per node (default)

srun -N64 -n1024 a.out

Run the executable a.out using 64 nodes and 1024 tasks (16 tasks/node)

srun -N64 -n4096 --overcommit a.out
srun -N64 -n4096 -O a.out

Run the executable a.out using 64 nodes and 4096 tasks (64 tasks/node). Note
that the --overcommit or -O flag is required if the number
of tasks exceeds the number of cores on a node (16).

srun -N64 --ntasks-per-node=8 a.out

Run the executable a.out using 64 nodes and 8 tasks/node. Note
that the tasks per node must be a power of 2 up to a maximum of 64
(using --overcommit).

srun -N32 -o out.file -e error.file a.out arg1 arg2

Run the executable a.out using 32 nodes, one task per node. Specify the name of
stdout file, name of stderr file, and pass two arguments to a.out.

The srun command has many options and associated
environment variables. For details, consult the man page available
HERE.

The default when using -n alone is to place one
task per core, up to sixteen per node, using as few nodes as possible.

The default when using -N alone is to place
one task per node. If you want more than one task per node, you will need to
use -N in combination with -n
or with --ntasks-per-node, as
shown in the table above.

If you want more than one task per core (16), you will need to specify the
--overcommit or
-O (uppercase letter "O") option. They are equivalent.

The maximum number of tasks per node is 64, which corresponds to the total
number of hardware threads per node.

The minimum job size allocation is one node. This is different from previous
Blue Gene systems, which limited job size allocations to specific block sizes
(64, 128, 512, etc.).

When specifying the number of nodes with the -N option,
the K or k abbreviation can be
used to denote 1024 nodes. For example:

srun -N 1k

Use 1024 nodes

srun -N 2k

Use 2048 nodes

srun -N 16K

Use 16,384 nodes

Other Usage Notes:

Your executable and any arguments it takes must appear at the end of the
srun line. You cannot follow your executable name
with more srun arguments. For example:

Correct:

srun -N 128 -o myout.txt a.out myarg1 myarg2

Incorrect:

srun -N 128 -o myout.txt a.out myarg1 myarg2 --ntasks-per-node=8

Long-form (double-dash) options which take an argument may need an "=" sign
between the option and the argument
(e.g. srun --ntasks-per-node=16).

The smallest allocatable partition with a torus topology is 512 nodes. Job sizes
smaller than 512 nodes will be configured in a mesh topology.

stdin: either from the command line or file redirect, is read only
by MPI task 0. If other tasks need to read stdin, then task 0 should send it to
them via message passing.

Batch Jobs

Only SLURM:

Unlike LC's Linux clusters, Blue Gene systems do not use the
Moab Workload Manager. Instead, they use only SLURM as the native
resource manager. Moab commands are simply wrappers for native
SLURM commands.

This section provides only a quick summary of batch usage.
For the full details see the following documents:

Uncertainty Quantification (UQ) Jobs

Overview:

Uncertainty Quantification (UQ) jobs consist of multiple (often many)
executions of a single code across a range of input parameters. The goal of
a UQ run is to produce a set of output values for every set of input parameters.

UQ jobs are often smaller in size. If the total number of nodes used is 512 or
less, a valid "sub-block" size will be used, which will be a power of 2. Valid
sub-block sizes are:

1 2 4 8 16
32 64 128 256 512

If a size other than one of these is specified, the actual size used will be
rounded up. For example, asking for 20 nodes will actually obtain a sub-block
of 32 nodes.

Optimal sizes will match a valid sub-block size shown above.

Running Jobs

CNK Environment Variables

There are over 30 documented environment variables that affect the
runtime characteristics of the Compute Node Kernel (CNK). There are
also some newer ones that don't appear in the documentation (yet).

For hybrid MPI-threaded codes, the table below shows the recommended combination
of MPI processes with threads, to fully utilize all 64 BG/Q hardware threads.
Note that the number of threads, as with MPI tasks, should be a power of two:

Most efficient memory mapping for MPI processes

Best load and memory balancing for threads within processes (typically)

Best performance? Your mileage may vary...

MPI Tasks   Threads   Total Threads
---------   -------   -------------
    1          64           64
    2          32           64
    4          16           64
    8           8           64
   16           4           64
   32           2           64
   64           1           64
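
For example, one of the combinations above could be launched like this (csh
syntax; the node count is illustrative):

setenv OMP_NUM_THREADS 16
srun -N64 --ntasks-per-node=4 a.out

Each node then runs 4 MPI tasks with 16 OpenMP threads apiece, keeping all
64 hardware threads busy.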

Several factors can significantly affect the mapping of memory to processes:

A non-power of two number of tasks

Amount of shared memory needed by the processes

Size of the application's text segment

Number of Tasks Not a Power of Two: The table below demonstrates what
happens (memory wasted) when the number of processes is not a power of two:

Uses a tiny test program with virtually no text or memory requirements

Uses BG_SHAREDMEMSIZE of 64 MB (discussed later)

Rows marked (*) - those with a non-power-of-two task count - indicate caveats or areas of interest

Memory Mapping Efficiency Based on Number of Processes

MPI Tasks   Approx. Memory/Task (MB)   Total Memory Used (MB)   % Memory Available to Job (of 16,384 MB)
---------   ------------------------   ----------------------   ----------------------------------------
    1               16263                     16263                      99
    2                8123                     16255                      99
    3 (*)            4049                     12148                      74
    4                4057                     16229                      99
    6 (*)            2018                     12106                      74
    8                2022                     16179                      99
   12 (*)            1002                     12027                      73
   16                1004                     16067                      98
   32                 499                     15968                      97
   33 (*)             244                      8052                      49
   64                 244                     15616                      95

Shared Memory Requirements and Available Task Memory: Shared memory space
and program text segment requirements may/may not
significantly impact the amount of memory available to processes:

Kernel uses 16 MB, subtracted from the memory map of one process - the one
with the lowest rank on the node.

Default shared memory space is 32 MB (formerly was 64 MB).
Required for MPI/PAMI libraries.
This may/may not be subtracted from the memory of one process.

Shared memory space can be controlled with the BG_SHAREDMEMSIZE environment
variable. Specify the number of MB needed if the default is inadequate -
a runtime error will alert you.
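
For example, to request 64 MB of shared memory space before launching (csh
syntax; the value is in MB, as noted above):

setenv BG_SHAREDMEMSIZE 64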

Large applications will generate large text segment space requirements. This
may/may not be subtracted from the memory of one process.

If the shared memory / text segment space is not
subtracted from a user process, it will be carved out of global
memory.

The impacts can be significant, depending upon the
number of processes and the shared memory and program text requirements.

For most MPI programs, which assume all tasks are identical and have the
same memory requirements, the overall effect is that the task with the least
memory determines the memory limit for all tasks.

A BG/Q kernel algorithm determines the exact rules, which are subject to change.

Alternative to the default memory mapping algorithm:

Setting the environment variable BG_MAPCOMMONHEAP=1 will
more evenly and effectively allocate the heap between MPI tasks.
Note: As of May 2013 LC has made this the default on all BG/Q
systems.

One caveat is that one process may overwrite another process' heap due to
relaxed memory protections.

Another caveat is that dynamically linked codes may fail unexpectedly

For Transactional Memory and Speculative Execution: need to set
BG_MAPCOMMONHEAP=0

The table below partially demonstrates the relationship between a job's
shared memory requirements and the memory available per task:

Uses a tiny test program with virtually no text or memory requirements

Prior to committing the results, the transactional memory system checks to
see if the shared data has been modified since the atomic operation was
started:

If it hasn't, the commit makes the update and the thread can carry on
with its work.

If the shared value has changed, the transaction is aborted, and the
process/thread's work is rolled back. Typically when this happens, the
program will simply retry the operation.

Rules: A transactional memory system must hold the following
properties across the entire execution of a concurrent program:

Atomicity: All speculative memory updates of a transaction are either
committed or discarded as a unit.

Consistency: The memory operations of a transaction take place in order.
Transactions are committed one transaction at a time.

Isolation: Memory updates are not visible outside of a transaction until
the transaction commits data.

Speculative Execution (SE):

Performing work before it is known whether the results are
correct or needed.

Wrong or unneeded results are discarded

Commonly used in modern pipelined processors, such as for reducing the costs
of conditional branches. Instructions in both paths of the branch
are executed ahead of time, and the results from the untaken branch are
discarded later.

Can be used in thread-based parallelism also. Work is broken into chunks
that are executed by threads concurrently without locking.
Data dependencies are detected and work is rolled back. Results are
guaranteed as if all chunks were executed serially by a single thread.
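
As a hedged illustration only (the exact directive syntax should be verified
in the IBM XL documentation), a loop might be marked for thread-level
speculation roughly like this, compiled with -qsmp=speculative and a
thread-safe (_r) compiler:

#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N];
    int i;

    /* Chunks of iterations run speculatively in parallel; conflicting
       updates are detected and rolled back by the hardware/runtime. */
    #pragma speculative for
    for (i = 0; i < N; i++)
        a[i] = a[i] * 2.0 + 1.0;

    printf("a[0] = %f\n", a[0]);
    return 0;
}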

Motivation:

Cores/threads on modern processors are increasing and expected to continue
this trend into the foreseeable future.

Primary motivation in such an environment is to reduce wall clock time by
keeping idle processes/threads busy, increasing concurrency.

Another important motivation is to reduce the complexity and caveats
associated with locks. Simplifies the developer's task.

Using BG/Q Transactional Memory (TM):

BG/Q transactional memory is supported at both the hardware and software
level.

Hardware: the 32 MB L2 cache is "multiversioned". Data in cache
has a version tag, and the cache can store multiple versions of the same
data.

Software: the programmer defines atomic regions, the compiler generates code
telling the processor to begin a transaction, do the work, and then to
commit the work.

If other threads have modified the data, thereby creating multiple
versions, the cache rejects the transaction and the software must try
again. If other versions weren't created, the data is committed.

Use an IBM BG/Q thread-safe compiler - it will have
an _r (underscore "r") suffix.
See the Compilers section for details.

Use the -qtm compiler flag to turn on TM directives

For OpenMP, you'll also need the -qsmp=omp compiler
flag

OpenMP examples:

C/C++

bgxlc_r -qtm -qsmp=omp mycode.c
mpixlc_r -qtm -qsmp=omp mycode.c

Fortran

bgxlf90_r -qtm -qsmp=omp mycode.f
mpixlf90_r -qtm -qsmp=omp mycode.f
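
A hedged sketch of what an atomic region might look like using the XL
tm_atomic directive (verify the directive syntax against the IBM compiler
documentation); build with one of the command lines above:

#include <stdio.h>

#define N 100000

int main(void)
{
    long counter = 0;
    int i;

    #pragma omp parallel for shared(counter)
    for (i = 0; i < N; i++) {
        /* The update below is committed or rolled back as a unit */
        #pragma tm_atomic
        {
            counter++;
        }
    }

    printf("counter = %ld\n", counter);
    return 0;
}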

Run Your Code:

Important: Set the environment variable BG_MAPCOMMONHEAP
to "0" (zero)

Nothing else special is required here, unless you want to capture/report
TM statistics, in which case you need to set an environment variable
or two - covered next.

Report TM Statistics: (Optional)

Use the tm_print_stats() or
tm_print_all_stats() routine in your source code
to generate a text log file of TM statistics.

C/C++

tm_print_stats()
tm_print_all_stats()

Fortran

call tm_print_stats()
call tm_print_all_stats()

Then, before you run your code, set the TM_REPORT_LOG
environment variable to one of the following case-insensitive values:

SUMMARY - The statistics log file is generated only at the end of the program.

FUNC - The statistics log file is generated and updated at each call to the
tm_print_stats built-in function.

ALL - The statistics log file is generated and updated at each call to the
tm_print_stats built-in function and at the end of the program.

VERBOSE - The statistics log file is generated and updated at each call to
the tm_print_stats built-in function and at the end of the program. The
generated report file also includes the addresses of memory access conflicts
during the speculation.

For example:

setenv TM_REPORT_LOG SUMMARY
setenv TM_REPORT_LOG summary

Run your code, and then look for the TM transaction log file(s)
upon completion. One file per process, named as
tm_report.log.pid where
pid is the process ID or MPI rank number.
An example is provided:

Miscellaneous: There are several other TM features available to users, listed
below. See the IBM Compiler documentation for details -
in particular, the "Compiler Reference" and "Programming Guide" manuals.

Important: Set the environment variable BG_MAPCOMMONHEAP
to "0" (zero)

Nothing else special is required here, unless you want to capture/report
SE statistics, in which case you need to set an environment variable
or two - covered next.

Report SE Statistics: (Optional)

Use the se_print_stats() routine in your source code
to generate a text log file of SE statistics.

C/C++

se_print_stats()

Fortran

call se_print_stats()

Then, before you run your code, set the SE_REPORT_LOG
environment variable to one of the following case-insensitive values:

SUMMARY - The statistics log file is generated only at the end of the program.

FUNC - The statistics log file is generated and updated at each call to the
se_print_stats built-in function.

ALL - The statistics log file is generated and updated at each call to the
se_print_stats built-in function and at the end of the program.

VERBOSE - The statistics log file is generated and updated at each call to
the se_print_stats built-in function and at the end of the program. The
generated report file also includes the addresses of memory access conflicts
during the speculation.

For example:

setenv SE_REPORT_LOG SUMMARY
setenv SE_REPORT_LOG summary

Run your code, and then look for the SE transaction log file(s)
upon completion. One file per process, named as
se_report.log.pid where
pid is the process ID or MPI rank number.
An example is provided:

One of SE's strongest selling points is that it can save the programmer
effort and mistakes in preventing race conditions. Data conflicts between
threads are detected and "fixed".
The contrived example below demonstrates how SE can detect data conflicts
in hardware, and prevent wrong results by serializing and/or rolling back thread
execution as necessary.

The programmer can do the same thing with OpenMP using an atomic or critical
directive to protect shared data.

In real codes, race conditions can be difficult to detect by the programmer.

There is of course, a performance penalty associated with "fixing" data
conflicts, either way. The overhead is application dependent.

Miscellaneous: There are several other SE features available to users, listed
below. See the IBM Compiler documentation for details -
in particular, the "Compiler Reference" and "Programming Guide" manuals.

To take advantage of QPX SIMD instructions, you need to use the higher
optimization levels such as -O3, -O4, -O5.

Using -qhot is also recommended

Note: -O4 and -O5
will add interprocedural analysis, which can be beneficial, but it
increases compile/link times.

Math intrinsics recommendation: use -qhot=novector
to tell the compiler not to
gather math intrinsics into separate vector math calls. When using SIMD
instructions, it is generally more advantageous to let those operations
intermix with the other floating-point operations.
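
For example, a compile line combining these recommendations (flags as
described in this section; the file name is illustrative):

mpixlc_r -O3 -qhot=novector -qsimd=auto -qreport -qlist myprog.c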

The -qreport option will produce a
*.lst text listing file showing which loops are
SIMDized and if not, why.

Using the -qlist option will provide a listing
of object instructions, including any QPX instructions. For example:

qvlfsx
qvlfdx
qvlfcsx
qvlfiwax

qvlpcldx
qvfmr
qvfneg
qvfabs

qvstfdx
qvfadd
qvfadds
qvsub

qvfmul
qvfmuls
qvfrsqrte
qvfmadd

An annotated example report is provided:

There are several compiler pragmas/directives that can be used to assist
the compilers with SIMDization. See the IBM Compiler
documentation for details.

The smallest 5D torus is a midplane of 512 nodes. Single midplane jobs
are always configured as a 5D torus.

Jobs smaller than a midplane (sub-block) will always be configured as a mesh.

At LC, a job larger than 512-nodes is permitted to be configured as
either a torus or a mesh:

The performance difference between torus and mesh is usually minimal, and
some if not all of the dimensions will be configured as a torus anyway.

Fragmentation is an important concern on BG/Q systems. Users don't like to
see a job requesting N nodes not running when more than N nodes are idle.
Requiring torus in all dimensions increases fragmentation significantly, so
LC elects not to do it.

You can require a torus configuration yourself for jobs 512 nodes and larger by
using the srun --conn-type=T,T,T,T option. For example:

srun --conn-type=T,T,T,T -N 4K -n 65536 myjob

You can query the block configuration of BG/Q jobs, as shown by the examples
below.

The personality of a BG/Q node is the static data given to every compute node
and I/O node at boot time by the Control System.

Contains information that is specific to the node, with respect to the
block that is being booted. For example, the node coordinates on the torus
network.

Consists of a set of C language structures

Can be useful if the application programmer wants to determine,
at run time, where the tasks of the application are running.

It can also be used to tune certain aspects of the application at run time. For
example, it can be used to determine which set of tasks shares the same I/O node
and then optimize the network traffic from the compute nodes to that I/O node.
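
A hedged sketch of querying the personality from C follows; the header and
field names are stated here as the author understands the BG/Q SPI - verify
them against the headers shipped under /bgsys/drivers/ppcfloor (compile with
-I/bgsys/drivers/ppcfloor):

#include <stdio.h>
#include <spi/include/kernel/location.h>
#include <spi/include/kernel/process.h>

int main(void)
{
    Personality_t pers;

    /* Fill in this node's personality, then print its torus coordinates */
    Kernel_GetPersonality(&pers, sizeof(pers));
    printf("Torus coords: A=%d B=%d C=%d D=%d E=%d\n",
           pers.Network_Config.Acoord, pers.Network_Config.Bcoord,
           pers.Network_Config.Ccoord, pers.Network_Config.Dcoord,
           pers.Network_Config.Ecoord);
    return 0;
}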

ESSL is recommended as the best tuned math library for BG/Q. It is built upon
IBM's optimized BLAS routines.

ESSL routines often employ SIMD instructions and/or multi-threading,
depending upon which library you use and which routines (covered in the
ESSL Guide and Reference manual)

The ESSL subroutines follow standard Fortran calling conventions and must run in
the Fortran run-time environment. When ESSL subroutines are called from a
program in a language other than Fortran, such as C or C++, the Fortran
conventions must be used.
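
As a hedged sketch (the exact C prototypes are defined in essl.h; see the
ESSL Guide and Reference), a call to ESSL's DGEMM from C might look like
this, noting the column-major, Fortran-style conventions:

#include <essl.h>

int main(void)
{
    /* 2x2 column-major matrices */
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {5.0, 6.0, 7.0, 8.0};
    double c[4] = {0.0, 0.0, 0.0, 0.0};

    /* C = 1.0*A*B + 0.0*C; link with -lesslbg and a thread-safe (_r) compiler */
    dgemm("N", "N", 2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2);
    return 0;
}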

Location:

At LC, the ESSL libraries are installed in
/usr/local/tools/essl

Don't use the versions in /usr/lib as these are for the
login nodes.

Libraries:

libesslbg.a - single-threaded routines

libesslsmpbg.a - multi-threaded routines. Also include
the -qsmp compile flag.

Important: both libraries require linking with a thread-safe (_r) version of
the compiler - for example, mpixlf90_r

See the IBM ESSL Documentation for more information. Also
available in the installation directory under
/usr/local/tools/essl.

The compute node to I/O node ratio is 128:1. However, only a subset
of each 128 compute nodes is directly connected to I/O nodes. These are
called bridge compute nodes.

The ratio of bridge compute nodes to I/O nodes is 2:1, which makes the ratio
of compute nodes to bridge compute nodes 64:1

Some applications may benefit from explicitly specifying which MPI tasks
perform I/O. To do this, you must first determine the mappings between MPI
ranks, compute nodes, bridge compute nodes and I/O nodes.

Sample C code and output (sorted by MPI rank) is provided:

Sorting on the I/O node field shows which MPI ranks associate with specific
I/O nodes and compute bridge nodes. Example:

You can then create a BG/Q mapping file that aligns MPI ranks with I/O
nodes and/or bridge compute nodes. See the Partitions,
Mapping and Personality section for details on how to do this.

Additional usage information will be added here as it becomes available.

HPSS Archival Storage

HPSS Storage:

High Performance Storage System (HPSS) archival storage is available on
both the OCF and SCF.

Provides "virtually unlimited" tape archive storage in the petabyte
range. Both capacity and performance are continually increasing to
keep up with the ever increasing user demand.

Also able to be accessed from Tri-lab and other remote sites. Note
that for remote access to OCF storage, VPN is required.

Storing dual-copy files in HPSS archival storage:
For mission critical files, it is possible to store two copies at once
using FTP, HSI, HTAR or NFT. Technical Bulletin 435 discusses
how to accomplish this. See the "Technical Bulletin" section of
computing.llnl.gov.

Quotas:

As of November 2012, LC has implemented an HPSS quota mechanism to help
address sustainable HPSS growth on both the OCF and SCF.

Based on a user's annual (fiscal year) growth in HPSS file space:

OCF yearly growth quota: 82 TB

SCF yearly growth quota: 156 TB

For details, see Technical Bulletin 459 in the
"Technical Bulletin" section of
computing.llnl.gov.

Find the lines at the very end of a
BG/Q core file
between the +++STACK and ---STACK tokens (example below).
Extract only the addresses under the "Saved Link Reg" column into another
file, and replace the first eight zeroes with 0x, as shown
in the example below.

Then run the coreprocessor.pl tool, where -c is the directory containing
your core files and -b is the name/path of your executable.
For non-GUI mode see: coreprocessor.pl -h

Usage is fairly simple and straight-forward

Select the preferred Group Mode, in this case, "Ungrouped
w/Traceback". Other options are
shown HERE.

Select an item/routine in the corefile list

Select a core file from the Common nodes pane

The file/line which failed is displayed (if compiled with -g)

The selected corefile then appears in the bottom pane

For jobs with many core files, the most practical Group Mode is "Stack Traceback
(condensed)". This mode groups all similar core files into a single stack
trace. The number of corefiles sharing the same stack trace is displayed next to
each routine. Example available HERE.

The Stack Trace Analysis Tool (STAT), discussed in more detail in the
STAT debugging section, can also be used to debug BG/Q
lightweight core files.

Usage:

Use the core_stack_merge command to merge the
lightweight core files produced by a crashed application into STAT
.dot format files. For example:

core_stack_merge -x myapplication -c core.*

Two output files will be produced, named
myapplication.dot and
myapplication_line.dot.

Then use the stat-view command on your
_line.dot file to view the call graph prefix tree.
For example:

stat-view myapplication_line.dot

The application's call graph tree represents the global state of the crashed
program. A simple example is provided here:

Note: you can also use your .dot file with the
stat-view command, except it is missing the line
number information.

If your job is hung, and it doesn't use a built-in signal handler to catch
SIGSEGV signals, you can force it to terminate and dump core files by using
the kill_job command to send a SIGSEGV signal
to it. For example:

/bgsys/drivers/ppcfloor/hlcs/bin/kill_job --id bg_jobid -s SIGSEGV

where bg_jobid is the BG/Q jobid - not the SLURM jobid.

How to determine the BG/Q jobid:

Include --runjob-opts="--verbose INFO" as an option
to your srun command when you start the job.

Otherwise, you will need to contact the LC Hotline and request that a BG/Q
system admin use a DB2 query to get the BG/Q jobid.

Login to a front-end login node and make sure that your Xwindows
environment is setup properly. You can verify this by launching a
simple X application like xclock or
xterm

Issue the mxterm command with your specific parameters.
Note that the #tasks argument is ignored, but you still need to enter a
dummy value. If you're not sure of the syntax, just enter the
mxterm command without arguments and hit return.
A usage summary will display - available HERE.

mxterm will then automatically generate and submit a batch script
for you (in the background).

You will be provided with the usual job id# which you can then use to
monitor your job in the queue.

Once TotalView's opening windows appear, you can then begin
your debug session.

Attaching to an already running/hung batch job:

You must first find where your job's srun process
is running. It will be on one of the front-end nodes - but most likely
NOT the front-end node you are logged into.
Two easy ways to do this are shown below:

STAT

Primarily intended to attach to a hung job, and quickly identify where the
job is hung.

The output from STAT consists of 2D spatial and 3D spatial-temporal graphs.
These graphs encode calling behavior of the application processes in the form
of a prefix tree. Example of a STAT 2D spatial graph shown on right.

Graph nodes are labeled by function names. The directed edges show the calling
sequence from caller to callee, and are labeled by the set of tasks that follow
that call path. Nodes that are visited by the same set of tasks are assigned the
same color.

STAT is also capable of gathering stack traces with more fine-grained
information, such as the program counter or the source file and line number of
each frame.

A GUI is provided for viewing and analyzing the STAT output graphs

Location:

/usr/local/bin/stat-gui - GUI

/usr/local/bin/stat-cl - command line

/usr/local/bin/stat-view - viewer for DOT format
output files

/usr/local/tools/stat - install directory,
documentation

Using the STAT GUI for parallel jobs:

Assuming that you have a running job that is hung, and that you are logged
into a BG/Q front-end "lac" node, use the stat-gui
command to start the STAT GUI.

After it appears, it will display your srun processes on the node
you are logged into. By default, it selects the parent srun
process. Click the "Attach" button if this is correct.

If you don't see any srun processes, that means they are running
on the "other" lac login node. Just type the other login node's name in the
STAT GUI's "Search Remote Host" box, as shown in the above example.

After a few moments, a graph depicting the state of your job will appear,
allowing you to determine where your job is hung. Example:

Additional functionality for STAT can be found by consulting the
"More information" links below.

What's Available?

The following performance analysis tools are available on LC's BG/Q platforms. These tools cover the full range of performance tuning: tracing, profiling, MPI, threads, and hardware event counters. Each is discussed in more detail in following sections.

An open source performance analysis tool framework that includes the most common performance analysis steps in one integrated tool. Comprehensive performance analysis for sequential, multithreaded, and MPI applications.

Valgrind is a suite of simulation-based debugging and profiling tools.
The Memcheck tool detects a comprehensive set of memory errors, including
reads and writes of unallocated or freed memory and memory leaks.

The percentage of CPU time taken by that procedure and all procedures
it calls (the calling tree).

A breakdown of time used by the procedure and its descendents.

The number of times the procedure was called.

The direct descendents of each procedure.

Example:

Location: /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gprof. The one in /usr/bin is for the
front-end nodes.

Using gprof

Compile your program with the -pg option. If your
compilation includes the -c option (to produce a
*.o file), then you will
need to include the -pg during the link/load also.

Run the program. When it completes you should have a file called
gmon.out which contains runtime statistics. If you
are running a parallel program you will have multiple files
differentiated by the process id which created them, such as
gmon.out.0 gmon.out.1 gmon.out.2, etc.

For serial users, view the profile statistics with gprof by
typing gprof at the shell prompt in the
same directory that you ran the program. By default, gprof will
look for a file called gmon.out and display the
statistics contained in it.

For parallel users, view the profile statistics with gprof by
typing gprof followed by the name of your
executable and the gmon.out.X files you wish to
view. You may view any single file or any combination.
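
Putting the steps together, a session might look like the following (file
names are illustrative; the gprof path is the BG/Q version noted above):

mpixlc_r -pg myprog.c -o myprog
srun -N1 -n4 myprog
/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gprof myprog gmon.out.0 gmon.out.1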

HPCToolkit

Overview:

HPCToolkit is an integrated suite of tools for measurement and analysis of
program performance on computers ranging from multicore desktop systems to the
largest supercomputers.

Uses low overhead statistical sampling of timers and hardware performance
counters to collect accurate measurements of a program's work, resource
consumption, and inefficiency and attributes them to the full calling context in
which they occur.

Works with C/C++ and Fortran, applications that are either statically or
dynamically linked.

hpcprof: overlays call path profiles and traces with program structure computed by hpcstruct and correlates the result with source code. hpcprof/mpi handles thousands of profiles from a parallel execution by performing this correlation in parallel. hpcprof and hpcprof/mpi generate a performance database that can be explored using the hpcviewer and hpctraceviewer user interfaces.

hpcviewer: a graphical user interface that interactively presents performance data in three complementary code-centric views (top-down, bottom-up, and flat), as well as a graphical view that enables one to assess performance variability across threads and processes. hpcviewer is designed to facilitate rapid top-down analysis using derived metrics that highlight scalability losses and inefficiency rather than focusing exclusively on program hot spots.

hpctraceviewer: a graphical user interface that presents a hierarchical, time-centric view of a program execution. The tool can rapidly render graphical views of trace lines for thousands of processors for an execution tens of minutes long, even on a laptop. hpctraceviewer's hierarchical graphical presentation is quite different from that of other tools - it renders execution traces at multiple levels of abstraction by showing activity over time at different call stack depths.

Location: /usr/global/tools/hpctoolkit/bgqos_0

Using HPCToolkit:

Due to its multi-component and sophisticated nature, usage instructions
for HPCToolkit are beyond the scope of this document. A few hints are
provided below.

Be sure to use the LC dotkit package for HPCToolkit. The command
use -l will list all available packages.
Find the one of interest and then load it - for example:
use hpctoolkit

Instrumented/selective profiling and source line "ticks" profiling: See
the mpitrace documentation for instructions.

Running:

A number of environment variables control how profiling and tracing is
performed. Some of these are described in the table below. Please
consult the mpitrace documentation for additional details not covered
here.

Environment Variable

Description

Default

PROFILE_BY_CALL_SITE

Set to yes to obtain the call site for every MPI function call.
Requires compiling with the -g flag

no

TRACE_SEND_PATTERN

Set to yes to collect information about the number of hops for
point-to-point communication on the torus network.

no

SAVE_ALL_TASKS

Set to yes to produce an output file for every MPI rank. By
default, output files are only produced for MPI rank 0, and the ranks having
the minimum, median, and maximum times in MPI.

no

SAVE_LIST

Specify a list of MPI ranks that will produce an output file. By
default, output files are only produced for MPI rank 0, and the ranks having
the minimum, median, and maximum times in MPI. Example:
setenv SAVE_ALL_TASKS 0,32,64,128,256,512

unset

TRACEBACK_LEVEL

In cases where there are deeply nested layers on top of MPI, you
may want to profile higher up the call chain. This can be done by
setting this environment variable to an integer value above zero
indicating how many levels above the MPI calls profiling should take
place.

0

TRACE_DIR

Specify the directory where output files should be written.

Working directory

HPM_GROUP

Set to an integer value indicating which predefined hardware counter group
to use. Hardware counter groups are listed in the file:
/usr/local/tools/mpitrace/CounterGroups

0

HPM_PROFILE

Set to "yes" to turn on HPM profiling. Executable needs to have been linked with an HPM library.

unset

HPM_SCOPE

Set to process or thread to aggregate
hardware counter statistics at the process or thread level. See documentation
for explanation.

node

TRACE_ALL_TASKS

For jobs that have more than 256 tasks, setting this to yes
will cause all tasks to be traced. Can cause problems for large, long running
jobs (too much data).

no

TRACE_ALL_EVENTS

Set to yes to trace all MPI events. This is used if you
don't explicitly instrument your source code with trace start/stop routine
calls yourself.

no

TRACE_MAX_RANK

Specifies the maximum task rank that should be profiled. Can
be used to override the default of 255 (256 tasks).

255

SWAP_BYTES

The event trace file is binary, and therefore, it is sensitive to byte
order. The trace files are written in little endian format by default.
Setting this environment variable to "yes" will produce a big endian
binary trace output file.

no

Output:

MPI profiling: The default is to produce plain text files of MPI data for
MPI rank 0, and the ranks that had the minimum, median, and maximum times in
MPI. Files are named mpi_profile.#.rank where # is a unique
number for each job. The file for MPI rank 0 also contains a summary of
data from all other MPI ranks.

HPM profiling: similar to MPI profiling, except the files are named
hpm_process_summary.#.rank

MPI tracing: A single binary trace data file called
events.trc is produced. Intended to be
viewed with the traceview GUI utility.

The number of profiling output files produced, and the data they contain,
can be modified by setting the environment variables described above.

Examples:
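
A minimal illustration (the application must be linked with the mpitrace
library; the job launch command and settings shown here are illustrative):

setenv PROFILE_BY_CALL_SITE yes
setenv TRACE_ALL_EVENTS yes
srun -n 64 ./myprog

The mpi_profile.* files (and, with tracing enabled, events.trc) then appear
in the working directory.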

Tracing caveats:

Tracing large, long-running executables can generate a huge output
file, even to the point of being unusable.

memP

Overview:

memP is a lightweight, parallel heap profiling library. Its primary feature
is to identify the heap allocation that causes an MPI task to reach its
memory in use high water mark (HWM).

Two types of memP reports:

Summary Report: Generated from within MPI_Finalize, this report
describes the memory HWM of each task over the run of the
application. This can be used to determine which task allocates the
most memory and how this compares to the memory of other tasks.

Task Report: Based on specific criteria, a report can be generated
for each task that provides a snapshot of the heap memory currently in use,
including the amount allocated at specific call sites.

Location: /usr/local/tools/memP

Using memP:

Load the memP dotkit package with the command
use memp

Compile with the recommended BG/Q flags and link your application with
the required libraries:
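
The exact compile/link line is installation-specific; a hypothetical
illustration (the library path is assumed from the Location above, and -g is
needed for source-level call site information):

mpixlc -g -O2 -o myprog myprog.c -L/usr/local/tools/memP/lib -lmemP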

After compiling, run your application as usual. You can verify
that memP is working by the header and trailer output it sends to
stdout, and by the creation of a single output file (see "Output"
below).

Output:

After your application completes, memP will write its output file
to the current directory. The output file name will have the
format myprog.N.XXXXX.memP,
where N = #MPI tasks and XXXXX = collector task process id.

PAPI

Overview:

PAPI (Performance Application Programming Interface) provides a standard,
portable API for accessing hardware performance counters. Originally, the
API focused on CPU events, but the more recent PAPI-C (PAPI Component) API
includes other machine components such as network interface cards, power
monitors and I/O units.

Both C and Fortran calling interfaces are provided.

On BG/Q, PAPI interfaces to a subset of IBM's BGPM (Blue Gene
Performance Monitoring) API. BGPM includes over 400 events grouped into 5
categories, which map to the hardware unit where they are counted:

Processor Unit

L2 Unit

I/O Unit

Network Unit

CNK (compute node kernel) Unit

Location: /usr/local/tools/papi

Using PAPI:

Using PAPI in an application typically requires a few simple steps:
include the event definitions, initialize the PAPI library, set up event
counters, and link with the PAPI library.
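
A minimal C sketch of those steps, using PAPI's high-level counter interface
(the preset events shown may or may not be available on a given platform):

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>                     /* event definitions */

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };  /* total cycles, total instructions */
    long long counts[2];
    double s = 0.0;
    int i;

    /* Initialize the PAPI library */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        exit(1);
    }

    /* Set up and start the event counters */
    if (PAPI_start_counters(events, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        exit(1);
    }

    /* Work to be measured */
    for (i = 0; i < 1000000; i++)
        s += i * 0.5;

    /* Stop counting and read the results */
    if (PAPI_stop_counters(counts, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop_counters failed\n");
        exit(1);
    }
    printf("cycles=%lld instructions=%lld (s=%g)\n", counts[0], counts[1], s);
    return 0;
}

A possible compile/link line, with include and library paths assumed from the
Location above: mpixlc -o papi_test papi_test.c
-I/usr/local/tools/papi/include -L/usr/local/tools/papi/lib -lpapi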

TAU

Overview:

The TAU (Tuning and Analysis Utilities) Performance System is an integrated,
portable profiling and tracing toolkit for performance analysis of parallel
programs written in Fortran, C, C++, Java, and Python. It comes from the
Performance Research Lab at the University of Oregon.

Profiling: shows how much time was spent in each routine

Tracing: when and where events take place

TAU instrumentation is used to accomplish both profiling and tracing. Three
different methods are available; see the TAU documentation for details.

Decide what you want to instrument by selecting the
appropriate TAU stub makefile. These are named according to
the metrics they record, and they are located in the
bgq/lib subdirectory of your TAU installation.
For example:
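
The installation path and makefile name below are illustrative (stub
makefiles follow TAU's Makefile.tau-<options> naming convention, and
tau_cc.sh is TAU's compiler wrapper script):

setenv TAU_MAKEFILE /usr/global/tools/tau/<version>/bgq/lib/Makefile.tau-mpi-pdt
tau_cc.sh -o myprog myprog.c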

VampirTrace / Vampir

Overview:

VampirTrace is an open source performance analysis tool set and
library used to instrument, trace and profile parallel applications.
Developed at TU Dresden, in collaboration with the KOJAK project at JSC/FZ
Jülich.

Supports applications using:

MPI

OpenMP

Pthreads

GPU accelerators

Trace events can include:

Application's routine/function calls

MPI calls

User defined events

PAPI performance counters

I/O

Memory allocations

Instrumentation options include:

Fully automatic - performed via compiler wrappers

Manual using the VampirTrace API

Fully automatic using the TAU instrumentor

Runtime binary instrumentation using Dyninst

Vampir is a proprietary trace visualizer developed by the Center for
Information Services and High Performance Computing (ZIH) at TU Dresden.
It is used to graphically display the Open Trace Format (OTF) output
produced by VampirTrace.

Location:

VampirTrace: /usr/global/tools/vampirtrace/bgqos_0

Quickstart for basic usage at LC:

Load the VampirTrace environment:
use vampirtrace-bgq

Compile / link your code using one of the VampirTrace compiler
wrappers: vtCC, vtc++, vtcc, vtcxx, vtf77 or vtf90. You need to tell
the wrapper which native compiler you prefer. For example:

vtcc -vt:cc mpixlc -o hello mpi_hello.c

Set desired environment variables - there are many choices. For
example, to do both profiling and tracing, and to prefix the output
files with the name of the code:

setenv VT_MODE STAT:TRACE
setenv VT_FILE_PREFIX hello

Run the executable

View the output using the Vampir GUI:

use vampir
vampir myfile.otf

NOTE: As of December 2012, Vampir is only installed on the following LC
systems: cab, edge, hera, sierra, rzmerl, rzzeus.

Output:

Profile data is written to a plain text file named
a.prof.txt by default. Use the
VT_FILE_PREFIX environment variable to name it something
different. For example:
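
Here myprog is a hypothetical prefix:

setenv VT_FILE_PREFIX myprog

The profile is then written to myprog.prof.txt instead of a.prof.txt.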

Valgrind

NOTE: Valgrind is not currently available on the LC BG/Q systems. Usage
information will be added here when/if it becomes available.

The Valgrind tool suite provides a number of debugging and profiling tools that
help you make your programs faster and more correct.

The Valgrind distribution currently includes the following tools:

Memcheck: a memory error detector. It helps you make your programs, particularly those written in C and C++, more correct.

Cachegrind: a cache and branch-prediction profiler. It helps you make your programs run faster.

Callgrind: a call-graph generating cache profiler. It has some overlap with Cachegrind, but also gathers some information that Cachegrind does not.

Helgrind: a thread error detector. It helps you make your multi-threaded programs more correct.

DRD: also a thread error detector. It is similar to Helgrind but uses different analysis techniques and so may find different problems.

Massif: a heap profiler. It helps you make your programs use less memory.

DHAT: a different kind of heap profiler. It helps you understand issues of block lifetimes, block utilisation, and layout inefficiencies.

SGcheck: an experimental tool that can detect overruns of stack and global arrays. Its functionality is complementary to that of Memcheck: SGcheck finds problems that Memcheck can't, and vice versa.

BBV: an experimental SimPoint basic block vector generator. It is useful to people doing computer architecture research and development.

Valgrind is also an instrumentation framework for building dynamic analysis
tools - you can also use it to build your own tools.
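
On systems where Valgrind is available, a typical Memcheck invocation looks
like this (myprog is a hypothetical executable; see the Valgrind manual for
the full set of options):

valgrind --tool=memcheck --leak-check=yes ./myprog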

Documentation:

PDF: /usr/local/docs/rzuseq.basics.pdf.
PDF files may be viewed using evince.

html: /usr/local/docs/BGQ. In addition to firefox, the text-based
elinks browser is also available. Note: if X-Windows
applications such as firefox or [x]emacs crash, you may need to update the
X-server on your desktop.

On-site users can visit the LC help desk consultants in Building 453,
Room 1103. Note that this is a Q-clearance area.

Phone:

(925) 422-4531 - Main number

422-4532 - Direct phone line for technical consulting help

422-4533 - Direct phone line for support help (accounts,
passwords, forms, etc)

Email:

Technical Help:
OCF: lc-hotline@llnl.gov
SCF: lc-hotline@pop.llnl.gov

Support:
OCF: lc-support@llnl.gov
SCF: lc-support@pop.llnl.gov

Help - BG/Q Specific:

Sequoia Users Meeting: third Thursday each month. Held in B451 White Room from
3:00-4:00pm. Web conference and phone dial-in numbers available - contact the LC
Hotline.

"BG/Q Virtual Water Cooler" telecon every Thursday (except 3rd) from 3:00-
4:00pm. Intended to be an open user forum discussion regarding the
Seq/Vulcan/rzuseq systems. Available for consulting with domain experts on
topics such as porting codes, system status, jobs scheduling, file systems,
etc.

Photos/Graphics: Permission to use IBM photos/graphics
has been obtained by the author and is on file.
Other photos/graphics have been created by the author,
created by other LLNL employees, obtained from non-copyrighted sources,
or used with the permission of authors from other presentations
and web pages.

This completes the tutorial.

Please complete the online evaluation form - unless you are doing the exercise,
in which case please complete it at the end of the exercise.