Buffering
A message-passing program depends on the buffering of messages if
its communication graph has a cycle.
The communication graph is a directed graph
in which the nodes represent MPI communication calls
and the edges represent
dependencies between these calls: a directed edge uv indicates
that operation v might not be able to complete before operation u is started.
Calls may be dependent because they have to be executed in
succession by the same process, or because they are matching
send and receive calls.
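As a concrete example, consider a cyclic shift in which every process sends to its clockwise neighbor using a standard-mode send and then receives from its anticlockwise neighbor. A minimal sketch in C (the function name and the use of MPI_FLOAT are assumptions; buf1, buf2, clock, and anticlock follow the names used in this discussion):

```c
#include <mpi.h>

/* Sketch of a cyclic shift: every process first sends to its clockwise
   neighbor with a standard-mode send, then receives from its
   anticlockwise neighbor. The resulting cycle of dependencies makes
   the code rely on system buffering. */
void cyclic_shift(float *buf1, float *buf2, int count, MPI_Comm comm)
{
    int rank, size, clock, anticlock;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    clock     = (rank + 1) % size;        /* destination: next process */
    anticlock = (rank + size - 1) % size; /* source: previous process  */

    /* Standard-mode send: may block until the matching receive starts. */
    MPI_Send(buf1, count, MPI_FLOAT, clock, 0, comm);
    MPI_Recv(buf2, count, MPI_FLOAT, anticlock, 0, comm, &status);
}
```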

The execution of the code results in the
dependency graph illustrated in Figure ,
for the case of a three-process group.

Figure: Cycle in communication graph for cyclic shift.

The arrow from each send to the following receive
executed by the same process reflects the program dependency
within each process: the receive call cannot be executed until
the previous send call has completed. The double arrow between
each send and the matching receive reflects their mutual
dependency: Obviously, the receive cannot
complete unless the matching send was invoked. Conversely,
since a standard mode send is used, it may be the case that
the send blocks until a matching receive occurs.

The dependency graph has a cycle.
This code will only work if the system provides sufficient
buffering, in
which case the send operation will complete locally, the call to
MPI_Send() will return,
and the matching call to MPI_Recv()
will be
performed. In the absence of sufficient
buffering MPI does not specify an
outcome, but for most implementations deadlock will occur, i.e., the
call to MPI_Send() will never return: each process will
wait for the next process on the ring to execute a matching
receive.
Thus, the behavior of this code will differ from system to
system, or on the same system when the message size
(count) is changed.

There are a number of ways in which a shift operation can be performed
portably using MPI:

- alternate send and receive calls (works only if there is more than one process);

- use a blocking send in buffered mode;

- use a nonblocking send and/or receive;

- use a call to MPI_Sendrecv().

If at least one process in a shift
operation calls the receive routine before
the send routine, and at least one process calls the send routine
before the receive routine, then at least one communication can proceed,
and, eventually,
the shift will complete successfully.
One of the most efficient ways of doing this is to alternate the send and
receive calls so that all processes
with even rank send first and then
receive, and all processes with odd rank receive first and then send.
Thus, the following code is portable provided there is more than one
process, i.e., clock and anticlock are different:
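A sketch of this reordering, under the same assumptions as before (C bindings, MPI_FLOAT data, and the neighbor variables clock and anticlock named in the text):

```c
#include <mpi.h>

/* Sketch: break the cycle by alternating the call order.
   Even ranks send first; odd ranks receive first. */
void ordered_shift(float *buf1, float *buf2, int count, MPI_Comm comm)
{
    int rank, size, clock, anticlock;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    clock     = (rank + 1) % size;
    anticlock = (rank + size - 1) % size;

    if (rank % 2 == 0) {
        MPI_Send(buf1, count, MPI_FLOAT, clock, 0, comm);
        MPI_Recv(buf2, count, MPI_FLOAT, anticlock, 0, comm, &status);
    } else {
        MPI_Recv(buf2, count, MPI_FLOAT, anticlock, 0, comm, &status);
        MPI_Send(buf1, count, MPI_FLOAT, clock, 0, comm);
    }
}
```

With more than one process at least one odd rank exists, so at least one process posts its receive first and the chain of dependencies is broken.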

The resulting communication graph is illustrated in
Figure . This graph is acyclic.

Figure: Cycle in communication graph is broken by reordering
send and receive.

If there is only one process, then clearly blocking send and receive
routines cannot be used, since the send must be called before the
receive and so cannot complete in the absence of buffering.

We now consider methods for
performing shift operations that work even if
there is only one process involved.
A blocking send in buffered mode
can be used to perform a shift operation.
In this case the application program passes a buffer to the MPI
communication system, and
MPI can use this to buffer messages. If the buffer
provided is large enough, then the shift will complete successfully.
The following
code shows how to use
buffered mode to create a portable shift operation.
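A sketch of such a buffered-mode shift (the buffer sizing via MPI_Pack_size() plus MPI_BSEND_OVERHEAD is the usual recipe; the function and variable names are assumptions):

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch: buffered-mode shift. The user attaches a buffer large
   enough for one outgoing message, so MPI_Bsend() can always
   complete locally. */
void buffered_shift(float *buf1, float *buf2, int count, MPI_Comm comm)
{
    int rank, size, clock, anticlock, bsize;
    void *buffer;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    clock     = (rank + 1) % size;
    anticlock = (rank + size - 1) % size;

    /* Size the buffer for one message plus the per-message overhead. */
    MPI_Pack_size(count, MPI_FLOAT, comm, &bsize);
    bsize += MPI_BSEND_OVERHEAD;
    buffer = malloc(bsize);
    MPI_Buffer_attach(buffer, bsize);

    MPI_Bsend(buf1, count, MPI_FLOAT, clock, 0, comm);
    MPI_Recv(buf2, count, MPI_FLOAT, anticlock, 0, comm, &status);

    /* Detach blocks until the buffered message has been delivered. */
    MPI_Buffer_detach(&buffer, &bsize);
    free(buffer);
}
```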

MPI guarantees that the buffer supplied by a call to
MPI_Buffer_attach()
will be used if it is needed to buffer the message.
(In an implementation
of MPI that provides sufficient buffering,
the user-supplied buffer may be
ignored.)
Each buffered send operation can complete locally, so
that a deadlock will not occur. The acyclic communication graph for
this modified code is shown in Figure .
Each receive depends on the matching send, but the
send no longer depends on the matching receive.

Figure: Cycle in communication graph is broken by using
buffered sends.

Another approach is to use nonblocking communication.
One can use a
nonblocking send, a nonblocking receive, or both.
If a nonblocking send is
used, the call to MPI_Isend() initiates the send
operation and then returns.
The call to MPI_Recv() can then be made,
and the communication completes
successfully.
After the call to MPI_Isend(), the data in buf1 must
not be changed until one is
certain that the data have been sent or copied
by the system. MPI provides the routines MPI_Wait() and
MPI_Test()
to check on this. Thus, the following code is portable:
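A sketch of the nonblocking-send version (names other than MPI_Isend(), MPI_Recv(), MPI_Wait(), and buf1/buf2 are assumptions):

```c
#include <mpi.h>

/* Sketch: shift using a nonblocking send. MPI_Isend() initiates the
   send and returns immediately, so the blocking receive can be
   posted without waiting for the send to complete. */
void isend_shift(float *buf1, float *buf2, int count, MPI_Comm comm)
{
    int rank, size, clock, anticlock;
    MPI_Request request;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    clock     = (rank + 1) % size;
    anticlock = (rank + size - 1) % size;

    MPI_Isend(buf1, count, MPI_FLOAT, clock, 0, comm, &request);
    MPI_Recv(buf2, count, MPI_FLOAT, anticlock, 0, comm, &status);

    /* buf1 must not be modified before the send is known to be done. */
    MPI_Wait(&request, &status);
}
```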

Figure: Cycle in communication graph is broken by using
nonblocking sends.

Each receive operation depends on the matching send,
and each wait depends on the matching communication; the send
does not depend on the matching receive, as a nonblocking send
call will return even if no matching receive is posted.

(Posted nonblocking communications do consume resources: MPI
has to keep track of such posted communications. But the
amount of resources consumed is proportional to the number of
posted communications, not to the total size of the pending messages. Good MPI implementations will support a large number of
pending nonblocking communications, so that this will not cause
portability problems.)

An alternative approach is to perform a nonblocking receive first to
initiate (or ``post'') the receive,
and then to perform a blocking send in
standard mode.
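A sketch of this variant, under the same naming assumptions as the earlier examples:

```c
#include <mpi.h>

/* Sketch: post the receive first with MPI_Irecv(), then do a
   blocking standard-mode send. Incoming data goes straight into
   buf2, so no buffering is required. */
void irecv_shift(float *buf1, float *buf2, int count, MPI_Comm comm)
{
    int rank, size, clock, anticlock;
    MPI_Request request;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    clock     = (rank + 1) % size;
    anticlock = (rank + size - 1) % size;

    MPI_Irecv(buf2, count, MPI_FLOAT, anticlock, 0, comm, &request);
    MPI_Send(buf1, count, MPI_FLOAT, clock, 0, comm);

    /* Block until the data has actually arrived in buf2. */
    MPI_Wait(&request, &status);
}
```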

The call to MPI_Irecv()
indicates to MPI that incoming data should be
stored in buf2; thus, no buffering is required. The call to
MPI_Wait()
is needed to block until the data has actually been received
into buf2. This alternative code will often result in
improved performance,
since sends complete faster in many implementations
when the matching receive is already posted.

Finally, a portable shift
operation can be implemented using the routine
MPI_Sendrecv(),
which was explicitly designed to send to one process
while receiving from another in a safe and portable way. In this
case only a single call is required;
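A sketch of the single-call version (function name is an assumption; the MPI_Sendrecv() argument order is send buffer first, then receive buffer):

```c
#include <mpi.h>

/* Sketch: MPI_Sendrecv() sends to one process while receiving from
   another in a single operation, so the library itself avoids the
   cyclic dependency. */
void sendrecv_shift(float *buf1, float *buf2, int count, MPI_Comm comm)
{
    int rank, size, clock, anticlock;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    clock     = (rank + 1) % size;
    anticlock = (rank + size - 1) % size;

    MPI_Sendrecv(buf1, count, MPI_FLOAT, clock, 0,
                 buf2, count, MPI_FLOAT, anticlock, 0,
                 comm, &status);
}
```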