OpenMP support

For OpenMP applications, use the thread-safe versions of the compilers by appending "_r" to the compiler name (e.g. bgxlC_r or mpixlf77_r) and add the options -qsmp=omp -qnosave.
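For example, a compile line for an OpenMP Fortran code could look like the following sketch (the file and program names are placeholders):

```shell
# Thread-safe compiler with OpenMP enabled; "prog.f" and "prog" are placeholders
mpixlf77_r -qsmp=omp -qnosave -o prog prog.f
```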

Support for Shared libraries and dynamic executables

Blue Gene/P offers the possibility to create shared libraries and dynamic executables. In general, however, shared libraries are not recommended on JUGENE, because loading them can delay the startup of a dynamically linked application considerably, especially on large partitions (8 racks or more). Therefore, please use shared libraries *only* if there is no other possibility.
See here for further information on how to generate shared libraries and create dynamic executables in C and Fortran.

Compiler Options (IBM XL compiler)

Default option (Warning)

The default for compilation is -qarch=450d, i.e. the Double Hummer (450d) FPU is used for the calculation if nothing else is specified.
This may be suboptimal. Experience with various applications has shown that the Double Hummer needs certain preconditions to work optimally, so using 450d does not always yield an optimum and can sometimes even increase the calculation time.
It is therefore recommended to test every application with both options (-qarch=450 and -qarch=450d) and to compare the results and the calculation times. The option -qtune should always be specified as -qtune=450.

Compiler Options for Optimization

In the following we provide hints and recommendations on how to tune applications on JUGENE by choosing optimal compiler flags and customizing the runtime environment. In order to detect performance bottlenecks connected to algorithmic or communication problems, we recommend analyzing your application in detail with performance analysis tools such as Scalasca on JUGENE.

We recommend starting with -O2 -qarch=450 -qtune=450 and increasing the level of optimization stepwise according to the following table. Always check that the numerical results are still correct when using more aggressive optimization flags. For OpenMP codes, please add -qsmp=omp -qthreaded.
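As a starting point, a full compile line could look like the following sketch (the program and source file names are placeholders):

```shell
# Basic optimization level; "myprog" and "myprog.c" are placeholders
mpixlc_r -O2 -qarch=450 -qtune=450 -o myprog myprog.c

# For an OpenMP code, additionally add -qsmp=omp -qthreaded:
#   mpixlc_r -O2 -qarch=450 -qtune=450 -qsmp=omp -qthreaded -o myprog myprog.c
```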

Optimization level                    Description
-O2 -qarch=450 -qtune=450             Basic optimization
-O3 -qstrict -qarch=450 -qtune=450    More aggressive optimizations are performed, no impact on accuracy
-O3 -qhot -qarch=450 -qtune=450       Aggressive optimization that may impact the accuracy (high-order transformations of loops)

Once you have determined the optimal flags you can switch to -qarch=450d which activates the "Double-Hummer" and see whether this improves the performance further.

Additional compiler options

Inlining of functions: In order to avoid the performance overhead caused by frequently called functions, the XL compilers offer the possibility to inline functions. With the option -qipa=inline, which is automatically set at optimization level -O4 or higher, the compiler will choose appropriate functions to inline. Furthermore, the user can explicitly specify functions for inlining in the following way:

-qipa=inline=func1,func2 or -Q+func1:func2

Both specifications are equivalent, the compiler will attempt to inline the functions func1 and func2.

Use of ESSL routines: with the option -qessl the compiler attempts to replace some intrinsic Fortran 90 procedures by ESSL routines where it is safe to do so.

Diagnostics and reporting: In order to verify and/or understand what kind of optimization is actually performed by the compiler you can use the following flags:

-qreport

For each source file <name> the compiler will generate a file <name>.lst containing pseudo code and a description of the code optimizations that were performed.

-qxflag=diagnostic

This flag causes the compiler to print information about code optimizations to stdout at compile time.
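For instance, a compile line requesting both the optimization listing and the diagnostic output might look like this (the source file name is a placeholder):

```shell
# Produces solver.lst with pseudo code and a summary of the optimizations performed;
# -qxflag=diagnostic additionally prints optimization information to stdout at compile time.
mpixlf90_r -O3 -qhot -qarch=450 -qtune=450 -qreport -qxflag=diagnostic -c solver.f90
```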

Choosing connection, shape and mapping

The choice of network (connection), extension of a partition in certain dimensions (shape) and order in which MPI tasks are distributed across the nodes and cores of a particular partition (mapping) can have a big influence on the performance of applications.

Connection

The Blue Gene/P offers two networks which can be used by applications for the communication on and between compute nodes. The TORUS network interconnects all compute nodes and has a topology of a three-dimensional torus, i.e. each node has six nearest neighbors. The bandwidth is 425 MB/s in each direction resulting in a total bidirectional bandwidth of 5.1 GB/s per node. The MESH network is a global collective tree network with 850 MB/s of bandwidth per link. It interconnects all compute and I/O nodes.

The default network is the MESH network. You can change it using the LoadLeveler keyword #@bg_connection specifying either MESH, PREFER_TORUS or TORUS:
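A minimal job-script fragment selecting the torus network could look like the following sketch (only the relevant LoadLeveler keywords are shown):

```shell
# LoadLeveler job script fragment: request the TORUS network
#@job_type = bluegene
#@bg_connection = TORUS
```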

Example 1 shows part of a job script for using the torus network. The choice of network can have a big influence on the performance of your application. If in doubt, choose TORUS. For partitions with fewer than 512 nodes only MESH can be used.

Shape

The extension of a partition in X,Y and Z direction is called the shape X×Y×Z of a partition. The shape can be specified in units of midplanes using the LoadLeveler keyword #@bg_shape, where 1 midplane contains 512 nodes (=2048 cores).
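A job-script fragment requesting a 1×1×2 shape might look like the following sketch (only the shape-related LoadLeveler keywords are shown):

```shell
# LoadLeveler job script fragment: 1 midplane in X and Y, 2 midplanes in Z
#@bg_shape = 1x1x2
#@bg_rotate = FALSE
```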

This job script reserves a partition with 1 midplane in X and Y, and 2 midplanes in Z direction, using in total 1024 nodes (4096 cores). The keyword #@bg_rotate tells LoadLeveler whether the job can run on any partition of the correct size (TRUE, the default) or whether you want exactly the specified shape (FALSE). In the first case the next free partition with 1024 nodes will be used, regardless of its shape (e.g. a partition of shape 2×1×1 could be used). The optimal shape for an application depends on the communication pattern of the code. For example, if the application uses a communicator of dimensions 8×8×32, a shape like 1×1×2 might perform better than, for example, 2×1×1. For further information, please see also "Mapping" below.

If you use partition sizes of 1 midplane or less on JUGENE, the partitions will have the following dimensions X, Y and Z in units of nodes (32 nodes is the smallest partition size you can reserve on JUGENE).

Number of nodes    X    Y    Z
32                 4    4    2
64                 4    4    4
128                4    4    8
256                8    4    8
512                8    8    8

Mapping

General

The default mapping on JUGENE is in XYZT order. Here, X, Y and Z are the torus coordinates of the compute nodes within a partition and T is the core coordinate within a node (T=0,1,2,3); each core is thus uniquely identified by these four coordinates. When mpirun launches a parallel application it distributes the MPI tasks in such a way that the first coordinate of the mapping is increased first. Therefore, with XYZT mapping the first MPI task is executed on the core with coordinates <0,0,0,0>, the second on core <1,0,0,0>, the third on core <2,0,0,0>, and so on. Since in general adjacent tasks are not executed on adjacent cores, this might not be the optimal mapping for all applications. You can change the mapping using the -mapfile option of the mpirun command.

mpirun -mapfile mapping -exe <myproc>

"mapping" can either be any permutation of X, Y, Z and T or a file which contains explicit instructions on how to map the MPI tasks to the cores. We recommend testing at least the option -mapfile TXYZ.

Explicit mapfiles

An explicit mapping of tasks to cores can be specified using a mapfile. Each line of such a file contains four integers which represent coordinates of the four-dimensional torus in the order X, Y, Z and T. The tasks are mapped to the specified coordinates in ascending order (see example below).

The optimal mapping depends on the communicator used by the application and the shape of the partition used for the run. Therefore, it is not possible to choose an optimal mapping without detailed knowledge about the application. In order to describe the strategy to follow we discuss an example case.

Suppose we have an application which uses a three-dimensional communicator of dimensions 16×16×8 and which should run on one midplane in VN mode (i.e. a four-dimensional torus of shape X×Y×Z×T=8×8×8×4). The goal is to factorize the shape dimensions by choosing a corresponding mapping in such a way that it fits the communicator of the application. In this case (X×T/2)×(Y×T/2)×Z=16×16×8 would be optimal. The following mapfile (only shown in parts) realizes the described mapping (everything after "#" is only a comment):
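A sketch of what the beginning of such a mapfile could look like, assuming the first communicator dimension varies fastest and the core coordinate is split as t = (i mod 2) + 2*(j mod 2) for communicator coordinates (i,j,k) (the task-to-coordinate assignment shown is only illustrative, not the definitive mapping):

```shell
0 0 0 0   # task 0  -> communicator coordinates (0,0,0)
0 0 0 1   # task 1  -> (1,0,0)
1 0 0 0   # task 2  -> (2,0,0)
1 0 0 1   # task 3  -> (3,0,0)
...
7 0 0 1   # task 15 -> (15,0,0)
0 0 0 2   # task 16 -> (0,1,0)
0 0 0 3   # task 17 -> (1,1,0)
...
```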

When using an explicit mapfile you must set #@bg_rotate = FALSE in your job script.

Environment variables

Here we list some environment variables which can be used in order to tune the performance of applications. A more comprehensive list of available Blue Gene/P MPI environment variables can be found in the Blue Gene/P Application Development Redbook.
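These variables can be passed to all MPI tasks via the -env option of mpirun, for example (the variable value and the executable name are placeholders):

```shell
# Pass an environment variable to the application via mpirun
mpirun -env "DCMF_EAGER=200000" -exe myprog
```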

BGLMPIO_COMM

This environment variable defines how data is exchanged on collective reads and writes. Possible values are 0 (use MPI_Alltoallv) and 1 (use MPI_Isend/MPI_Irecv). The default is 0. When using a large number of tasks with a memory-demanding application the performance of the MPI_Alltoallv might scale worse than the point-to-point version for the I/O or the application might even crash with a "signal 6". Setting this variable to 1 might help in this case and improve the performance (see also BGLMPIO_TUNEBLOCKING=0).

BGLMPIO_TUNEBLOCKING

The variable can be used to tune how aggregate file domains are calculated. Possible values are 0 (Evenly calculate file domains across aggregators, also use MPI_Isend/MPI_Irecv to exchange domain information) and 1 (Align file domains with the underlying file system's block size, also use MPI_Alltoallv to exchange domain information). The default is 1. This variable can be used together with the BGLMPIO_COMM variable (see above), for example BGLMPIO_COMM=1 and BGLMPIO_TUNEBLOCKING=0 to avoid MPI_Alltoallv.

DCMF_ALLTOALL_PREMALLOC

The ALLTOALL MPI protocols (ALLTOALL, ALLTOALLV AND ALLTOALLW) require 6 arrays each of size COMM_SIZE to be setup before communication begins. If your application does not use ALLTOALL or needs as much memory as possible you can turn off pre-allocating these arrays by setting the variable to “N”. The default setting is “Y”, i.e. allowing the pre-allocation.

DCMF_DMA_VERBOSE

Use this variable to control the output of information associated with the Direct Memory Access (DMA) messaging device. Possible values are 0 (no DMA information output) and 1 (DMA information output). The default is 0. If the job encounters events that do not cause the application to fail but might be useful for application developers (for example RAS Events) this information will be displayed by setting DCMF_DMA_VERBOSE=1. When using the option -verbose 2 with the mpirun command this variable is automatically set to 1.

DCMF_EAGER=message size (bytes)

This environment variable sets the message size (in bytes) above which the MPI rendezvous protocol is used (for further details about the MPI protocols available, please see the Blue Gene/P Application Development Redbook). The default size is 1200 bytes. The MPI rendezvous protocol is optimized for maximum bandwidth. However, there is an initial handshake between the communication partners, which increases the latency. In case your application uses many short messages you might want to decrease the message size (even down to 0). On the other hand if your application can be mapped well to the TORUS network (see "Choosing shape and mapping of partitions") and uses mainly large messages increasing the limit might lead to a better performance (e.g. DCMF_EAGER=200000).

DCMF_INTERRUPT

If set to 1 the interrupt driven communication is turned on. This can be beneficial to some applications in order to overlap communication with computation and is required if you are using Global Arrays or ARMCI. The default setting is 0.

DCMF_RECFIFO=buffer size (bytes)

Packets that arrive off the network are placed into a reception buffer. The environment variable DCMF_RECFIFO can be used to set the size of this buffer; the default size is 8 MB per process. If a process is busy and does not call MPI_Wait often enough, this buffer can become full, and the movement of further packets is stopped until there is free space in the buffer again. This can slow down the application; in this case the buffer size should be increased (see also RAS Events for further information).

DCMF_RGETFIFO=buffer size (bytes)

When a remote get packet arrives off the network, it contains a descriptor describing data that is to be retrieved and sent back to the node that originated the packet. The DMA injects that descriptor into a remote get injection buffer and then processes it by sending the data back to the originating node. Remote gets are commonly used during point-to-point communication when large data buffers are involved (typically larger than 1200 bytes). When a large number of remote get packets are received by a node, the injection buffer may become full of descriptors. DCMF_RGETFIFO can be used to set the size of this buffer; the default size is 32 KB (see also RAS Events for further information).

DCMF_SSM

Setting this variable to 1 turns on sender-side matching and can speed up point-to-point messaging in well-behaved applications. The default setting is 0.

Hints for MPI usage on JUGENE

Asynchronous (non-blocking) MPI I/O

Non-blocking MPI I/O is not supported on the Blue Gene/P architecture because the operating system does not support asynchronous file I/O. The usage of the corresponding MPI routines (e.g. MPI_File_iwrite) will not cause an error; instead, blocking file I/O will be performed.

Data representation for MPI I/O

The MPI implementation on the Blue Gene/P is based on MPICH2, which supports only the "native" data representation for MPI_File_set_view. The usage of "internal" and "external32" will result in the error MPI_ERR_UNSUPPORTED_DATAREP.

MPI extensions for Blue Gene/P

IBM provides extensions to MPICH2 in order to ease the use of the BG/P hardware. For further information and available extensions, please see here.

Number of communicators

The maximum number of communicators that can be generated is 8189. This is limited by the MPI implementation on the Blue Gene/P.

Overlapping Communication and Computation

In order to enable the capability to overlap communication and computation, the environment variable DCMF_INTERRUPT has to be set to 1.

DCMF_INTERRUPT (also accepted as DCMF_INTERRUPTS) turns on interrupt-driven communication, which can be beneficial for some applications. Possible values:

0 - interrupt-driven communication is not used.

1 - interrupt-driven communication is used.

The default is 0.

Unfortunately, the only way to determine if a given application would benefit from interrupt mode is to run the application twice and measure the performance.

Further Information

Level of verbosity using mpirun

We recommend using the -verbose 2 option of the mpirun command even if your job finishes with exit code 0. Events might occur during the run of your application which are not critical (i.e. the application does not crash) but which can provide hints on how to set certain parameters in order to obtain better performance (for example RAS Events). Such events are reported only at verbosity level 2 or higher.