Using Derived Data Types with MPI

Most programs written for distributed memory, parallel computers, including Beowulf clusters, utilize the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM) programming interfaces to exchange data or messages among processes. In the past, this column has presented many of the fundamentals of message passing and has shown a number of programming examples using both MPI and PVM. Last month's column focused on the master/slave model of parallelism using MPI, and introduced the MPI_Probe() routine. This month, let's discuss another advanced feature of MPI: how to use derived data types.

Data Is More than Elementary

Up to now, all of the parallel code examples you’ve seen in this column have exchanged (as messages) contiguous data from simple homogeneous data types such as integers or doubles. But the elementary data types built into MPI, listed in Table One, inadequately represent the diverse kinds of data structures used in most parallel codes. While it’s true that data can be copied into (and back out of) contiguous and homogeneous buffers of these fundamental data types for message passing, additional coding effort is required, additional memory is needed on both the sending and receiving nodes, and additional CPU time is consumed by the sending and receiving processes.

Table One: Elementary MPI data types for C and FORTRAN

C                      FORTRAN

MPI_CHAR               MPI_INTEGER
MPI_SHORT              MPI_REAL
MPI_INT                MPI_DOUBLE_PRECISION
MPI_LONG               MPI_COMPLEX
MPI_UNSIGNED_CHAR      MPI_DOUBLE_COMPLEX
MPI_UNSIGNED_SHORT     MPI_LOGICAL
MPI_UNSIGNED           MPI_CHARACTER
MPI_UNSIGNED_LONG      MPI_BYTE
MPI_FLOAT              MPI_PACKED
MPI_DOUBLE
MPI_LONG_DOUBLE
MPI_BYTE
MPI_PACKED

Fortunately, the designers of MPI recognized the need to exchange non-contiguous “slabs” of data as well as structures of mixed types. Using the elementary data types as building blocks, programmers can define their own derived data types using a series of MPI functions.

Contiguous Types

The simplest type definition function, MPI_Type_contiguous(), constructs a new data type defined as some specified number of contiguous elements of an existing data type. An example is shown in Listing One. In the listing, a number of parameters (12 in this case) are read or set by the first process and subsequently broadcast to all other MPI processes.

As is required for any MPI program written in C, the MPI header file, mpi.h, must be included; MPI must be initialized using MPI_Init() prior to calling any other MPI routines; and MPI must be terminated using MPI_Finalize() before the program exits. The rank of each process is obtained using MPI_Comm_rank(), and the number of processes participating in program execution is found using MPI_Comm_size(). The MPI_COMM_WORLD communicator is pre-defined: it’s the handle representing all processes that were started when the parallel program was executed.

Next, the first process (having a rank of zero) calls get_model_params() to obtain parameters for the model. Here, the parameters are set to obvious values to verify successful exchange.

The call to MPI_Type_contiguous() establishes a new, derived data type called ParameterType. This type consists of twelve elements of type MPI_DOUBLE. (Notice that ParameterType was declared at the top of main() as an MPI_Datatype type.) After the new type has been defined, it must be committed before it can be used. This is accomplished by calling MPI_Type_commit() with a pointer to the new type. Then MPI_Bcast() is called using the new derived data type to send a copy of all the parameters to every MPI process. Once the derived data type is no longer needed (as shown in the example), it may be destroyed by calling MPI_Type_free() with a pointer to the derived type.

The remainder of the code causes each process to print out the received parameter values, calls MPI_Finalize(), and finally exits.

This simple use of a derived type demonstrates the usefulness of MPI_Type_contiguous(). However, the parameter broadcast in this example could have been accomplished by simply broadcasting twelve MPI_DOUBLEs, replacing the calls to MPI_Type_contiguous(), MPI_Type_commit(), MPI_Bcast(), and MPI_Type_free() with the following single call:

MPI_Bcast(params, NUM_PARAMS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

Using MPI_Bcast(), the declaration of ParameterType isn’t needed. Nevertheless, MPI_Type_contiguous() is useful for defining concatenated types from either existing fundamental data types or programmer-defined derived types.

Non-Contiguous Regular Types

A more general type constructor, MPI_Type_vector(), can be used to reference equally-spaced, fixed-size blocks of data. Each block (or “slab” of data) is simply a concatenation of the old data type (for example, a block of MPI_DOUBLEs), and the spacing between blocks is a multiple of the extent of the old data type (for instance, some number of skipped MPI_DOUBLEs). MPI_Type_vector() is called as follows:

int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype);

An example using MPI_Type_vector() is shown in Listing Two. A double precision array of length 1024 called data is declared. The first process (having a rank of zero) sets values in this array and subsequently sends parts of the array to a second process by first defining a new derived data type using MPI_Type_vector(). The call to MPI_Type_vector() defines a new type called Data16Type (declared as an MPI_Datatype at the top of main()), consisting of 64 blocks. Each block contains a single element of type MPI_DOUBLE, and blocks are separated by 16 MPI_DOUBLE elements. This separation is called the stride.

Next, the new type is committed by calling MPI_Type_commit(), the data are sent from process zero to process one, and the type is freed by calling MPI_Type_free(). The second process receives 64 double precision values which are stored contiguously at the beginning of its data array. Printing the first 64 values demonstrates this behavior.

Notice that only the first process defined, committed, and freed the derived type; the second process merely needed to receive 64 MPI_DOUBLE values. Had the second process defined and used the Data16Type, the 64 single-element blocks would have been distributed in its data array in the same fashion as they had originated in process zero. Thus derived data types may be used for sending, for receiving, or for both sending and receiving.

Non-Contiguous Irregular Types

Still more general is the MPI_Type_indexed() constructor. Like MPI_Type_vector(), MPI_Type_indexed() allows for replication of an old type in non-contiguous blocks; however, block sizes and strides may vary. It’s called as follows:

int MPI_Type_indexed(int count, int *array_of_blocklengths, int *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype);

Listing Three contains an example using MPI_Type_indexed() to send the upper triangular section of a square matrix from the first process to the second. The 8×8 matrix is stored in a one-dimensional, double precision array named data. Process zero fills each element of the array with its element number, and then uses a new derived type referring only to the upper triangular section of the matrix to send that portion of the matrix to the other process, process one.

After MPI is initialized, a loop steps over each row in the matrix, building up a list of block lengths and displacements. The block length for each row is the number of columns minus the row number, and the starting displacement for each block is the row number times the number of columns, plus the row number.

Next, both processes use the computed block lengths and displacements to define a new type called BlockIndexType. This derived type consists of eight blocks (one for each row) of double precision types with eight different block lengths and displacements. The type is committed with MPI_Type_commit().

Next, process zero stores element number values in its data array and uses the new type in MPI_Send() to send the matrix section to process one. Meanwhile, process one initializes its data array to 0.0 and then calls MPI_Recv() with the new derived type (BlockIndexType) to receive the upper triangular elements. Since the second process received the data with the derived type, the matrix elements are stored at their original locations within the full data array. Hence, when process one prints its array, the upper triangular elements hold their original element numbers while the lower triangular elements remain 0.0.

Both MPI_Type_vector() and MPI_Type_indexed() have alternative forms: MPI_Type_hvector() and MPI_Type_hindexed(), respectively. The “h” stands for heterogeneous, and the difference is that the stride (for vector) and the displacements (for indexed) are specified in terms of bytes instead of multiples of the old type extent.

Non-Contiguous Mixed Types

The most general type constructor is MPI_Type_struct(). Like MPI_Type_indexed(), it uses varying block lengths and displacements; however, it also allows each block to consist of replications of different data types. MPI_Type_struct() is called as follows:

int MPI_Type_struct(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype);

An example using MPI_Type_struct() to exchange an array of C structures containing different data types is shown in Listing Four. In that example, a C structure is defined containing information about land cells from a map. Each land cell has two integer spatial coordinates (coord[2]), a double precision elevation value (elevation), a one-byte land cover type (landcover), and four double precision seasonal albedos (albedos[4]). (An albedo is the fraction of incident electromagnetic radiation reflected by a surface.) A 10×10 array of _cell structures called cells is declared outside of main(). It’s filled by process zero and sent all at once to process one.

Both processes must establish the new derived type by describing the exact makeup of the C structure to MPI. An array of fundamental MPI types called type[] is declared just inside main(), containing MPI_INT, MPI_DOUBLE, MPI_CHAR, and MPI_DOUBLE. The blocklen[] array has the same dimension as type[] and contains the number of elements of each old type, in the same order. The array of displacements, called disp[4], is declared as type MPI_Aint because these displacements are actual integer addresses instead of some number of old types to skip. The new type will be called Cells.

After MPI is initialized, the displacements of the structure components must be found on each process. Multiple calls to the MPI_Address() function are used to store these displacement values in the disp array. MPI_Address() takes two parameters: a pointer to a starting location and a pointer to an MPI_Aint address where the actual address will be stored. MPI_Address() is called once for each structure component and the addresses are returned in disp. Since relative addresses are desired (i.e., address distance into the structure), the base address is subtracted from each displacement.

Next, MPI_Type_struct() is called. The 4 signifies that the structure contains four component old types; the blocklen array describes how many contiguous elements of each old type it contains; the disp array lists the relative starting address displacements for each old type; the type array lists each of the four old types; and the pointer to Cells provides the handle for the new type. The new type is then committed with MPI_Type_commit().

Process zero fills its cells structure with coordinates and bogus elevation values and then calls MPI_Send(), using the new Cells derived data type to send all 100 cells to process one. Process one receives the cells into its cells structure by calling MPI_Recv() using the new Cells type. It then prints out coordinate and elevation values to verify that the exchange worked. Both processes free the new type with MPI_Type_free() before calling MPI_Finalize() and terminating.

Better Than Bytes

C programmers may be tempted to use the address operator (&) instead of calling MPI_Address() to obtain displacements. While this may often work correctly, ANSI C does not require that the value of a pointer be the actual absolute address of the object to which it points. Moreover, on machines with a segmented address space, referencing may not have a unique definition. The use of MPI_Address() to refer to C variables guarantees portability to all architectures.

At this point, some of you may be asking why it’s necessary to describe all the details of structures to MPI. Why not just force it to blindly send the right number of bytes to the correct recipient? It turns out this usually works when all nodes have the same architecture, but on heterogeneous clusters it may not. If MPI doesn’t know what’s in the “black box,” it can’t provide byte order (or endian) translation among different architectures. Instead, by constructing derived data types using the built-in fundamental types, we can take advantage of all the features MPI offers for message passing.

If this foray into MPI derived data types has left you hungering for more, try combining some of these functions to build more complex types. For instance, try adding to the code in Listing Four so that process zero sends only the upper triangular portion of land cells to process one by defining a new type using the Cells type.

And be sure to tune in next month for more examples of programming with MPI.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.