
For a lot of workloads, simply using collective I/O provides a big performance boost. Sometimes, though, it’s necessary to tune collective I/O a bit. The hint “cb_nodes” provides a way to select how many MPI processes will become aggregators. On Blue Gene, though, the story is a little more complicated.

We’ll start with Blue Gene /L and /P, even though those machines are now obsolete. The concepts on the older machines still apply, if in a slightly different form. The 163840 cores on the Intrepid BlueGene/P system are configured in a hierarchy. To improve the scalability of the BlueGene architecture, dedicated “I/O nodes” (ION) act as system call proxies between the compute nodes and the storage nodes. On Intrepid, we call the collection of an ION and its compute nodes a “pset”. Each Intrepid pset contains one ION and 64 4-core compute nodes.
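The numbers above fix the size of the hierarchy. As a quick check, here is a minimal C sketch that derives the number of psets (and so I/O nodes) from the core count and the pset shape given in the text. The function name `pset_count` is hypothetical, and rounding up for a partially filled pset is an assumption, not something the machine configuration is known to do.

```c
#include <assert.h>

/* Number of psets (one I/O node each) needed to hold total_cores,
 * given the pset shape. Hypothetical helper for illustration only. */
long pset_count(long total_cores, long nodes_per_pset, long cores_per_node)
{
    assert(total_cores >= 0 && nodes_per_pset > 0 && cores_per_node > 0);
    long cores_per_pset = nodes_per_pset * cores_per_node;
    /* Assumption: a partially filled pset still counts as one pset. */
    return (total_cores + cores_per_pset - 1) / cores_per_pset;
}
```

With the Intrepid figures, `pset_count(163840, 64, 4)` works out to 640 psets, each with its own ION.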

The MPI standard defines ‘collective’ routines. Unlike the ‘independent’ routines, all processes in a given MPI communicator call the routine together. The MPI implementation, with the knowledge of which tasks participate in a call, can then perform significant optimizations. These collective routines provide tremendous performance benefits for both networking and I/O.

The BlueGene MPI-IO library, based on ROMIO, makes some adjustments to the ROMIO collective buffering optimization. First, data accesses are aligned to file system block boundaries. Such an alignment reduces lock contention in the write case and can yield big performance improvements.
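To make the alignment idea concrete, here is a simplified C sketch that splits a byte range among aggregators and moves each interior cut down to a block boundary, so no two aggregators touch the same file system block. This is not ROMIO's actual code: the function names, the even split, and the choice to round down are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round an offset down to the start of its file system block. */
int64_t align_down(int64_t offset, int64_t block_size)
{
    assert(offset >= 0 && block_size > 0);
    return offset - (offset % block_size);
}

/* Boundary number idx (0..naggr) of the file domains covering [start, end).
 * Aggregator i would own [boundary(i), boundary(i + 1)).
 * Interior boundaries are block aligned; the two ends are left as given. */
int64_t domain_boundary(int64_t start, int64_t end, int naggr, int idx,
                        int64_t block_size)
{
    assert(start >= 0 && end >= start && naggr > 0);
    assert(idx >= 0 && idx <= naggr);
    if (idx == 0)
        return start;
    if (idx == naggr)
        return end;
    /* Even share per aggregator, rounded up. */
    int64_t share = (end - start + naggr - 1) / naggr;
    int64_t cut = align_down(start + (int64_t)idx * share, block_size);
    /* Clamp so very small ranges still give ordered boundaries. */
    if (cut < start)
        cut = start;
    if (cut > end)
        cut = end;
    return cut;
}
```

For a range of bytes 30 to 1000 split four ways with a (made-up) block size of 100, the interior cuts land at 200, 500, and 700, so each block is written by exactly one aggregator.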

Second, and perhaps most importantly from a scalability perspective, the “I/O aggregators” selected for the I/O phase of two-phase I/O are a small subset of the total number of processes. On BlueGene, the MPI-IO hint “bgl_nodes_pset” sets how many: for each pset allocated to a job, that many nodes will be designated as aggregators. The default for a job running in “virtual node” mode works out to one aggregator for every 32 MPI processes. Furthermore, these aggregators are distributed over the topology of the application so that no node has more than one aggregator and no pset contains more than “bgl_nodes_pset” aggregators.
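The default ratio is easy to turn into a count. The C sketch below does the arithmetic for one pset, assuming four MPI processes per node in virtual node mode on the 64-node Intrepid pset described earlier. The function name is hypothetical, and the floor of one aggregator is an assumption; the cap of one aggregator per node comes from the text.

```c
#include <assert.h>

/* Aggregators picked in one pset when one aggregator is chosen for every
 * procs_per_aggregator MPI processes. Illustration only, not ROMIO code. */
int aggregators_per_pset(int nodes_per_pset, int procs_per_node,
                         int procs_per_aggregator)
{
    assert(nodes_per_pset > 0 && procs_per_node > 0);
    assert(procs_per_aggregator > 0);
    int procs = nodes_per_pset * procs_per_node;
    int naggr = procs / procs_per_aggregator;
    if (naggr < 1)
        naggr = 1;               /* assumption: keep at least one */
    if (naggr > nodes_per_pset)
        naggr = nodes_per_pset;  /* no node hosts more than one aggregator */
    return naggr;
}
```

With 64 nodes, 4 processes per node, and one aggregator per 32 processes, this gives 8 aggregators per pset, far fewer than the 256 processes in that pset.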

On Mira (Blue Gene /Q) the story is a bit more complicated. I/O nodes are no longer statically assigned to compute nodes. Rather, there is a pool of I/O nodes. When a job is launched, some portion of those I/O nodes gets assigned to the compute nodes.

On Mira, a set of 128 compute nodes (known as a pset) has one I/O node acting as an I/O proxy. For every I/O node there are two network links of 2 GB/s toward two distinct compute nodes acting as bridges. Therefore, for every 128-node partition, there are nb = 1 × 2 = 2 bridges. The I/O traffic from compute nodes passes through these bridge nodes on the way to the I/O node. The I/O nodes are connected to the storage servers through quad data rate (QDR) InfiniBand links. On BG/Q the programmer can set the number of aggregators per pset, na_pset (the hint on BG/Q has been renamed to “bg_nodes_pset”). One can determine the total number of aggregators of an application, na, knowing na_pset, n, and nb with the following equation:

Figure: Computing the number of aggregators on Blue Gene is… not straightforward
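The sketch below does not reproduce the aggregator equation. It only works through the counts stated in the Mira description: one I/O node per 128 compute nodes, two bridge links per I/O node, and 2 GB/s per link. The function names are hypothetical, and rounding up for partitions that are not a multiple of 128 nodes is an assumption.

```c
#include <assert.h>

#define NODES_PER_PSET   128  /* compute nodes sharing one I/O node */
#define BRIDGES_PER_ION    2  /* links from each I/O node to bridge nodes */
#define LINK_GB_PER_S      2  /* bandwidth of one bridge link, in GB/s */

/* I/O nodes serving a partition of n compute nodes (rounded up). */
int io_node_count(int n)
{
    assert(n > 0);
    return (n + NODES_PER_PSET - 1) / NODES_PER_PSET;
}

/* Bridge nodes in that partition under the two-links-per-I/O-node layout. */
int bridge_count(int n)
{
    return io_node_count(n) * BRIDGES_PER_ION;
}

/* Combined bandwidth of all bridge links in the partition, in GB/s. */
int bridge_bandwidth_gb_per_s(int n)
{
    return bridge_count(n) * LINK_GB_PER_S;
}
```

A 128-node partition therefore has 1 I/O node, 2 bridges, and 4 GB/s of bridge-link bandwidth; a 512-node partition has 4, 8, and 16 GB/s respectively.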

The number of bridge nodes is hardware dependent. For the Argonne machines, Mira’s nb is always 1, but on Vesta, it’s 4 and on Cetus it is 8.

Sophisticated applications wishing to do their own I/O subsetting should be aware of these default parameters and optimizations. In some cases, applications will try to subset to a small number of nodes and find greatly reduced I/O performance.

The MPI standard defines non-blocking communication. It also defines non-blocking (independent) I/O. When it comes to collective I/O, the choices are blocking I/O or the little-developed and little-used “split collectives”.

The HDF Group pushed to add true non-blocking collective I/O to the MPI standard. MPI-3.1 finally incorporates this feature. The use cases are motivated by things the HDF5 library would like to do in a portable manner at scale:

Modifying metadata of a dataset: Each process has a cache of metadata, so updates are done collectively, which ensures everyone’s cache stays consistent between memory and file. When evicting members from this cache, HDF5 could issue a non-blocking collective I/O request for these typically tiny elements, then go do other work.

Backgrounding data operations: HDF5 knows a bit about the structure of data on disk due to its file format. It also knows a bit about the data a user will want to operate on. A sufficiently clever HDF5 library could issue non-blocking collective I/O either to read ahead in anticipation of what a user will need, or to maintain a write-back cache.

Sangmin Seo implemented the non-blocking collective I/O routines for ROMIO. Implementers might find it interesting that he used the extended generalized requests we added to MPICH way back in 2007.

This feature is available in mpich-master and in the last few pre-releases. If you try out non-blocking collective I/O, let us know how it worked.