One-Sided Communications with MPI-2

Traditional interprocess communication requires cooperation and synchronization between sender and receiver. MPI-2's new remote memory access features allow one process to update or interrogate the memory of another, hence the name one-sided communication. Here's a hands-on guide.

While Parallel Virtual Machine (PVM) and other application programming interfaces are still available and in widespread use, the Message Passing Interface (MPI) has become the preferred programming interface for data exchange in most new parallel scientific applications. The MPI-2 standard was released in 1997, and implementations of the standard are beginning to become widely available.

Both MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) and LAM/MPI (http://www.lam-mpi.org/) — the two most popular MPI implementations for Linux clusters — comply fully with the MPI-1.2 standard, which is described in the MPI-2 standards document. Moreover, both implementations now support subsets of the new MPI-2 features. LAM/MPI includes support in its regular 7.0.x versions, whereas a completely new implementation of MPICH, called MPICH2, provides early support for many MPI-2 features.

Installation and testing instructions for both of these implementations (LAM/MPI 7.0.6 and MPICH2 beta 0.96p2) and a short MPI-2 program demonstrating one-sided communication were included in the August 2004 “Extreme Linux” column. In the September “Extreme Linux” column, MPI-2’s new process creation and management features were discussed and demonstrated using a manager/worker program example.

This month, let’s look at MPI-2’s new remote memory access (RMA) capabilities, usually called one-sided communications because only one process of a communicating pair needs to issue a communication call to achieve the transfer. This scheme can simplify programming in cases where the memory locations that must be updated or interrogated are known on only one side of a communicating pair of processes.

One-Sided Communication

Traditionally, the parameters for interprocess communications had to be known by both processes, and both had to issue matching send/receive calls. The new RMA mechanism in MPI-2 obviates the need to both communicate parameters prior to the real data communication and to poll periodically for data exchange requests. Additionally, RMA can be used to simplify or eliminate time-consuming global communications.

Message passing is used in parallel programs to communicate data from one process to another and to synchronize the memory of the receiver with that of the sender. In MPI-2, these two functions are provided separately. The communications functions are MPI_Put() for remote writes, MPI_Get() for remote reads, and MPI_Accumulate() for remote updates. In addition, many different synchronization calls are available, and they operate much like a weakly coherent memory system. As a result, ordering of memory accesses must be enforced by your code.

The terminology used in the MPI-2 specification is best for describing the relationship among processes. The origin is the process that performs the communications call, and the target is the process whose memory is accessed. So, in a put operation, the source is the origin and the destination is the target, while in a get operation the source is the target and the destination is the origin. This terminology is used throughout the discussion that follows.

Memory Allocation

The MPI-2 RMA functions were designed to take advantage of fast communications offered by shared memory and special put/get operations available in hardware on some architectures.

Moreover, RMA operations may be faster on some systems when accessing specially allocated memory, like blocks of shared memory on a symmetric multiprocessing (SMP) system. A mechanism for allocating and freeing such special memory is provided in MPI-2. The routines are called MPI_Alloc_mem() and MPI_Free_mem(), respectively.

If no special memory is used, then these functions simply invoke malloc() and free(). Otherwise, MPI_Alloc_mem() should allocate a shared memory segment.
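As a quick illustration (a minimal sketch, not taken from the article’s listings), a buffer sized to the number of processes can be allocated and released like this:

```c
#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, *a;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Request message-passing-friendly memory; an implementation
       with no special memory simply falls back to malloc(). */
    MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &a);

    /* ... use a as a communication or RMA buffer ... */

    MPI_Free_mem(a);
    MPI_Finalize();
    return 0;
}
```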

Nevertheless, it should always be safe to use these MPI-2 functions to manipulate memory that may be used for message passing. Such memory may provide an advantage and like chicken soup, it can’t hurt! The examples below demonstrate the use of these calls in C.

Windows into the Mind

To access the memory of another process, that memory must be exposed by the process that owns it. Through a collective operation, a window of memory can be made accessible to remote processes. Such windows are created by calling MPI_Win_create(), which returns an opaque object (a pointer to an object of type MPI_Win) that represents the group of processes in the specified communicator and the attributes of each window as specified in the call. The returned window object can be used subsequently to perform RMA operations.
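In C, the binding for this call is:

```c
int MPI_Win_create(void *base, MPI_Aint size, int disp_unit,
                   MPI_Info info, MPI_Comm comm, MPI_Win *win)
```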

The arguments to the call are the initial address of the window (base), the size of the window in bytes (size), the local unit size for displacements into the window in bytes (disp_unit), an info object (info), the group communicator (comm), and a pointer for the new window object returned by the call (win).

Processes in the comm group can specify completely different target windows, sizes, displacement units, and info arguments, but all processes in the comm group must make the call. A process can expose no memory by specifying a size of zero. The same area of memory may appear in multiple windows associated with different window objects, but simultaneous communications to different overlapping windows may produce undesired results.

Windows can be used to provide synchronization in a variety of ways, and they are used in one-sided communication operations. Attributes for a window are cached and can be retrieved with one of the following calls:
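The standard routine for retrieving cached window attributes is MPI_Win_get_attr(), called with one of the predefined attribute keys:

```c
int MPI_Win_get_attr(MPI_Win win, int win_keyval,
                     void *attribute_val, int *flag)
```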

When called with MPI_WIN_BASE, a pointer to the start of the window win is returned in base. When called with MPI_WIN_SIZE, the size of the window in bytes is returned in size. When called with MPI_WIN_DISP_UNIT, the size of the displacement unit of the window is returned in disp_unit.

The group of processes attached to the window can be retrieved by calling:

int MPI_Win_get_group(MPI_Win win, MPI_Group *group)

This call returns in group a copy of the group of the communicator used to create the window win.

RMA communications occur within an access epoch for a window. An access epoch begins with an RMA synchronization call on the window, proceeds with one or more RMA communications calls, and completes with another synchronization call on the window. The simplest synchronization method is provided by the MPI_Win_fence() collective synchronization call. It’s the “Swiss Army knife of MPI-2 synchronization.”

int MPI_Win_fence(int assert, MPI_Win win)

This call forces all RMA operations on win originating at a given process and initiated before the fence call to complete at that process before the fence call returns. The call starts an RMA access epoch if it is followed by one or more RMA communications calls and another fence call. Likewise, the call completes an RMA access epoch if it is preceded by another fence call with one or more RMA communications calls in between.

Similarly, MPI_Win_fence() starts an RMA exposure epoch if it is followed by one or more RMA accesses and another fence call. It completes an RMA exposure epoch if it is preceded by another fence call and the local window was the target of one or more RMA accesses in between the two calls. In most cases a fence call results in a barrier synchronization since a call to MPI_Win_fence() returns only after all other processes in the group enter the matching call.

The assert argument may be set to different pre-defined values to optimize the fence call and provide the desired behavior from the synchronization. Not all implementations may support assertions, but zero is always a valid value. Zero indicates a general case, but assertions described in the MPI-2 standard for MPI_Win_fence() include:

* MPI_MODE_NOSTORE. The local window wasn’t updated by local stores (or local get or receive calls) since the last synchronization.

* MPI_MODE_NOPUT. The local window will not be updated by put or accumulate calls after the fence call until the ensuing (fence) synchronization.

* MPI_MODE_NOPRECEDE. The fence does not complete any sequence of locally issued RMA calls.

* MPI_MODE_NOSUCCEED. The fence does not start any sequence of locally issued RMA calls.

If either MPI_MODE_NOPRECEDE or MPI_MODE_NOSUCCEED is given by any process in the window group, then it must be given by all processes in that group. These assertions may be combined using the bitwise OR operator (|), as shown in the examples below.
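For example, a fence that opens an epoch might combine two assertions like this (a sketch; win is a previously created window object):

```c
/* Open an epoch: no earlier RMA calls need completing, and the
   local window will not be the target of put or accumulate calls
   before the next fence. */
MPI_Win_fence(MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE, win);
```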

After they’re created, windows may be subsequently destroyed by calling MPI_Win_free(), passing it a pointer to the window object. This is also a collective operation, and no process returns until all processes in the comm group of the window have called this free routine. Therefore, all RMA operations must be completed and all locks removed prior to calling MPI_Win_free().

Taking What You Want

The MPI_Get() communications routine can be used to grab data from the memory of a remote process that has exposed it by including it in a window created for this purpose. Calls to this routine, as well as to the MPI_Put() and MPI_Accumulate() communications routines, are nonblocking. This means that they initiate the transfer, but the transfer need not complete before the calls return — hence the requirement for an independent synchronization mechanism like that provided by MPI_Win_fence(). Affected buffers shouldn’t be modified until the synchronization call returns.

Listing One contains an example C program that uses MPI_Get() to extract values from remote process arrays (a) and store them into local arrays (b).

As in most MPI programs, MPI is initialized with a call to MPI_Init(), each process obtains its rank in the global MPI_COMM_WORLD communicator by calling MPI_Comm_rank(), the size of the communicator is obtained by calling MPI_Comm_size(), and the processor name (that is, the node hostname) is obtained by calling MPI_Get_processor_name(). Here the version of MPI is also obtained by calling MPI_Get_version(). All this information is subsequently printed by each process.

Next, two integer arrays, a and b, are allocated using MPI_Alloc_mem(). This memory allocation routine is called because these memory segments will be used for communications later in the program. Both arrays are allocated to be the size of the number of MPI processes participating in the computation (size). Then MPI_Win_create() is called to expose the a array to all processes (MPI_COMM_WORLD). The window handle is returned in win, which is used for subsequent communication.

The a array is then filled with unique values for each process and printed. Prior to the communication, the MPI_Win_fence() synchronization routine is called for the win window to initiate an exposure epoch for this window. In this call, the MPI_MODE_NOPUT and MPI_MODE_NOPRECEDE assertions are made so that MPI knows that no remote processes will store data locally until the next synchronization call and that no previous RMA communications have taken place that need to be synchronized.

Next, MPI_Get() is called size times to retrieve the rankth value from the a array on each process. These values are then stored in the b array. Since MPI_Get() is non-blocking, these communications may overlap. This is not a problem since the values are stored in different locations in the b array. In fact, parallel communications may improve performance. Notice that each process also retrieves a value from its own a array. It talks to itself. This is completely valid, and it simplifies programming.

The arguments to MPI_Get() are the address of the destination buffer (&b[i]), the number of entries in the origin buffer (1), the datatype for the origin buffer (MPI_INT), the rank of the target (the loop index i), the displacement from the beginning of the window to the start of the target buffer (rank), the number of entries in the target buffer (1), the datatype for target buffer (MPI_INT), and the window object used for communication (win).

After all of the get requests have been posted, the MPI_Win_fence() routine is called again to end the exposure epoch on window win. The MPI_MODE_NOSUCCEED assertion is made so that MPI knows no RMA communications follow this fence call. This MPI_Win_fence() call forces synchronization; the communications all complete and the buffers are updated prior to returning from the call. Then each process prints out the values it obtained in its b array. Finally, MPI_Win_free() is called to destroy the window exposing the a array, MPI_Free_mem() is called to free the memory allocated for the a and b arrays, and MPI_Finalize() is called to stop MPI.
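The heart of Listing One, as described above, can be sketched as follows (a sketch only; the names a, b, rank, size, and win follow the discussion):

```c
/* Expose the a array to all processes. */
MPI_Win_create(a, size * sizeof(int), sizeof(int),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);

/* Begin the epoch: no puts will target this window, and no
   earlier RMA calls need completing. */
MPI_Win_fence(MPI_MODE_NOPUT | MPI_MODE_NOPRECEDE, win);

/* Fetch the rank-th element of a from every process,
   including this one. */
for (i = 0; i < size; i++)
    MPI_Get(&b[i], 1, MPI_INT, i, rank, 1, MPI_INT, win);

/* End the epoch: no further RMA calls follow; b is now valid. */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
```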

The output from compiling and running this program is contained in Output One. MPICH2 is used to compile and run this program and the remaining programs below. The code is compiled with mpicc and run with mpiexec, using five processes. The “Hello world!” lines are printed by every process followed by the values of the elements of the a array. After each process has taken values from all the other processes (including itself), the values are printed in order. As shown in Output One, each process has successfully obtained values from all other processes in the group.

Listing Two is very similar to Listing One, but instead of getting data from remote processes, this new program deposits data into arrays on remote processes. In this program, the b target array is exposed instead of the source array, a. MPI_Win_create() is called with the address of b, and a pointer to the window object win is returned. The a array is then filled as before.

Next, MPI_Win_fence() is called to initiate an access epoch. The MPI_MODE_NOPRECEDE assertion is made to notify MPI that no previous RMA communications occurred on this window. Then MPI_Put() is called to remotely store every element in a. Each value is stuffed into the rankth position in the remote process’ exposed b array. Like the previous example, each process also puts data into its own b array in this loop. MPI_Win_fence() is then called to end the access epoch and synchronize buffers. The MPI_MODE_NOSTORE and MPI_MODE_NOSUCCEED assertions are made because no local stores were performed since the previous fence call and no RMA communications calls follow this fence call.

The arguments to MPI_Put() are the address of the origin buffer (&a[i]), the number of entries in the origin buffer (1), the data type of the origin buffer (MPI_INT), the rank of the target (the loop index i), the displacement from the beginning of the window to the target buffer (rank), the number of entries in the target buffer (1), the data type of the target buffer (MPI_INT), and the window object used for communication (win).
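Putting the pieces together, the core of Listing Two might look like this (a sketch; variable names follow the discussion):

```c
/* Begin the epoch: no earlier RMA calls need completing. */
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);

/* Deposit each element of a into the rank-th slot of the exposed
   b array on every process, including this one. */
for (i = 0; i < size; i++)
    MPI_Put(&a[i], 1, MPI_INT, i, rank, 1, MPI_INT, win);

/* End the epoch: no local stores occurred since the last fence,
   and no RMA calls follow. */
MPI_Win_fence(MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED, win);
```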

The received values in the b array are printed by each process. Then the window win is destroyed by calling MPI_Win_ free(), the a and b arrays are freed by calling MPI_Free_ mem(), and MPI is stopped by calling MPI_Finalize().

The output from compiling and running this program with MPICH2 is shown in Output Two. This output closely matches that of Output One since the arrays are constructed in the same way, although the first program performs get operations while the second performs put operations.

The program in Listing Three demonstrates the third MPI-2 one-sided communication mechanism, MPI_Accumulate(). This program exposes the b array with MPI_Win_create(), but fills both a and b with the same values prior to synchronization and communication. That way the b array already has some values that can be updated by remote processes.

Listing Three: mpi2_sum2.c, a third example of one-sided communication

After synchronizing with MPI_Win_fence() using the MPI_MODE_NOPRECEDE assertion, the MPI_Accumulate() communications routine is called. It takes the same form as MPI_Put(), except that it also includes an argument for the desired reduction operation. Here, MPI_SUM is used, but any operation valid for a call to the traditional MPI_Reduce() routine may be used. A new operation, MPI_REPLACE, is defined and has the same effect as calling MPI_Put().

The arguments to MPI_Accumulate() are the address of the origin buffer (&a[i]), the number of entries in the origin buffer (1), the data type of the origin buffer (MPI_INT), the rank of the target (loop index i), the displacement from the beginning of the window to the target buffer (rank), the number of entries in the target buffer (1), the datatype of the target buffer (MPI_INT), the predefined reduction operation (MPI_SUM), and the window object used for communication (win).
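The communication phase of Listing Three can be sketched as follows (a sketch; the closing assertion is an assumption, since the text does not state it):

```c
/* Begin the epoch: no earlier RMA calls need completing. */
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);

/* Add each element of a into the rank-th slot of the remote b
   array; concurrent MPI_SUM updates to the window are combined
   atomically per element by MPI. */
for (i = 0; i < size; i++)
    MPI_Accumulate(&a[i], 1, MPI_INT, i, rank, 1, MPI_INT,
                   MPI_SUM, win);

/* End the epoch (assertion assumed). */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
```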

The output from compiling and running Listing Three with MPICH2 is shown in Output Three. A quick check of the code shows that the sums were performed correctly across remote processes. Notice that each process also updated its own b array remotely by calling MPI_Accumulate().

In addition to the generalized MPI_Win_fence() routine, other methods are available in MPI-2 for synchronization. Using the MPI_Win_start(), MPI_Win_complete(), MPI_Win_post(), and MPI_Win_wait() calls, the scope of synchronization can be restricted to only a pair of communicating processes (which may be more efficient when communication occurs with a small number of neighboring processes). Alternatively, shared and exclusive lock methods are provided, using the MPI_Win_lock() and MPI_Win_unlock() routines, which emulate a shared memory machine model using MPI.

The former method, officially called general active target synchronization, is used for active target communication where both members of a communicating pair of processes are involved in the communication. The latter method, employing locks, is used for passive target communication, in which only one process of a communicating pair is involved in the communication.

Listing Four shows a program that uses these synchronization calls to explicitly retrieve a “secret” number from a neighboring process in a ring. Group handles are established on each process for the source and destination processes for the active target communication using MPI_Group_incl() after obtaining the group handle for the MPI_COMM_WORLD communicator using MPI_Comm_group(). Space is allocated for the local secret number and the remote secret number, and the local secret number is set to the remote rank.

Next, MPI_Win_post() is called to start an exposure epoch with the next higher ranked process in a ring. This exposes local_secret to the next rank up. MPI_Win_start() is then called to start an access epoch with the next lower ranked process in a ring. MPI_Get() is called to obtain the secret of the next lower ranked process. This value is stored in remote_secret. Then MPI_Win_complete() is called to end the access epoch, and MPI_Win_wait() is called to end the exposure epoch. While this may appear more tedious than using a fence call, in some circumstances, this active target synchronization is more efficient and allows for sophisticated message passing and synchronization schemes.
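The sequence just described might be sketched like this (a sketch; names such as local_secret and remote_secret follow the discussion, and both are assumed to point at memory from MPI_Alloc_mem()):

```c
int up = (rank + 1) % size;          /* next higher rank in the ring */
int down = (rank + size - 1) % size; /* next lower rank in the ring */
MPI_Group world_group, post_group, start_group;

MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_incl(world_group, 1, &up, &post_group);
MPI_Group_incl(world_group, 1, &down, &start_group);

*local_secret = up;                  /* the "secret" this rank exposes */
MPI_Win_post(post_group, 0, win);    /* exposure epoch: open to rank up */
MPI_Win_start(start_group, 0, win);  /* access epoch: target rank down */
MPI_Get(remote_secret, 1, MPI_INT, down, 0, 1, MPI_INT, win);
MPI_Win_complete(win);               /* end the access epoch */
MPI_Win_wait(win);                   /* end the exposure epoch */
```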

Output Four shows the results of compiling and running this program. Each process stores the rank of the next higher process in local_secret, and as expected, each process retrieves its own rank from the next lower process in a ring.

These examples provide a very basic introduction to one-sided communications with MPI-2. These RMA methods can make programming many parallel algorithms easier, and can provide efficient methods for sharing data and synchronizing on a wide variety of hardware. Very complex synchronization schemes can be constructed using general active target synchronization routines available in MPI-2.

Now is a good time to experiment with these mechanisms since they’re becoming available in MPI implementations used on Linux clusters.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov. You can download the code from this article from http://www.linux-mag.com/downloads/2004-11/extreme.