I've been looking at a "fast path" for sends and receives. This is like
the sendi function, which attempts to send "immediately", without
creating a bulky PML send request (which would be needed if, say, the
send had to be progressed over multiple user MPI calls). One can do
something similar on the receive side, and I have a workspace in which
each BTL has the option of defining a "recvi" (receive immediate)
function. The speedups I see in the prototype are gratifying: np=2
pingpong latencies are down 30%-2x, and they stay flat as np is
increased. (OMPI, straight out of the box, sees pingpong latencies
climb as np climbs due to the costs of polling.)

I'd like MPI_Sendrecv to see the same performance benefits, but the MPI
layer performs an MPI_Sendrecv as an Irecv/Send/Wait sequence. The
Irecv necessarily involves a receive request, so the Send might be
fast, but you lose most of the benefit of the fast path. I think the
right way to do a fast Sendrecv is an immediate send (if you can)
followed by an immediate receive.

It seems to me there are two approaches here:

*) Teach the MPI layer about "fast path" sends and receives (sendi and
recvi).
*) Teach the PML layer about "Sendrecv". That is, have MPI_Sendrecv
call something like mca_pml_ob1_sendrecv(). (This is the approach I'd
prefer.)