I think that sounds like a rational path forward. Another, longer-term
option would be to move from the FIFOs to a linked list (which can even
be atomic), which is what MPICH does with nemesis. In that case, there's
never a queue to get backed up (although the receive queue for
collectives is still a problem). It would also solve the problem of
returning a fragment when the FIFO has no space, as there's always space
in a linked list.
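
To make the linked-list idea concrete, here is a rough sketch of a
multi-producer, single-consumer queue built on C11 atomics. It is only in
the spirit of that approach, not MPICH's or Open MPI's actual code: the
names (node_t, queue_t, enqueue, dequeue) are made up for illustration, it
assumes a single consumer per queue, and it uses plain pointers where a
real shared-memory implementation would need offsets valid in every
process.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical fragment node; "frag" stands in for a real descriptor. */
typedef struct node {
    _Atomic(struct node *) next;
    int frag;
} node_t;

typedef struct {
    _Atomic(node_t *) head;  /* read by the single consumer */
    _Atomic(node_t *) tail;  /* swapped atomically by producers */
} queue_t;

static void queue_init(queue_t *q) {
    atomic_init(&q->head, (node_t *)NULL);
    atomic_init(&q->tail, (node_t *)NULL);
}

/* Never fails: there is no capacity limit to run into. */
static void enqueue(queue_t *q, node_t *n) {
    assert(n != NULL);
    atomic_store(&n->next, (node_t *)NULL);
    node_t *prev = atomic_exchange(&q->tail, n);
    if (prev == NULL) {
        atomic_store(&q->head, n);      /* queue was empty */
    } else {
        atomic_store(&prev->next, n);   /* link behind the old tail */
    }
}

/* Single consumer only. Returns NULL if the queue looks empty. */
static node_t *dequeue(queue_t *q) {
    node_t *n = atomic_load(&q->head);
    if (n == NULL) {
        return NULL;
    }
    node_t *next = atomic_load(&n->next);
    if (next == NULL) {
        /* n looks like the last node: try to reset the queue to empty. */
        atomic_store(&q->head, (node_t *)NULL);
        node_t *expected = n;
        if (atomic_compare_exchange_strong(&q->tail, &expected,
                                           (node_t *)NULL)) {
            return n;
        }
        /* A producer already swapped the tail; wait for it to link. */
        while ((next = atomic_load(&n->next)) == NULL) {
        }
    }
    atomic_store(&q->head, next);
    return n;
}
```

The point of the sketch is that enqueue has no failure path, so neither a
send nor a returned fragment can be refused for lack of room.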

Brian

On Tue, 23 Jun 2009, Eugene Loh wrote:

> The sm BTL used to have two mechanisms for dealing with congested FIFOs. One
> was to grow the FIFOs. Another was to queue pending sends locally (on the
> sender's side). I think the grow-FIFO mechanism was typically invoked and
> the pending-send mechanism used only under extreme circumstances (no more
> memory).
>
> With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs. The
> code added complexity and there seemed to be no need to have two mechanisms
> to deal with congested FIFOs. In ticket 1944, however, we see that repeated
> collectives can produce hangs, and this seems to be due to the pending-send
> code not adequately dealing with congested FIFOs.
>
> Today, when a process tries to write to a remote FIFO and fails, it queues
> the write as a pending send. The only condition under which it retries
> pending sends is when it gets a fragment back from a remote process.
>
> I think the logic must have been that the FIFO got congested because we
> issued too many sends. Getting a fragment back indicates that the remote
> process has made progress digesting those sends. In ticket 1944, we see that
> a FIFO can also get congested from too many returning fragments. Further,
> with shared FIFOs, a FIFO could become congested due to the activity of a
> third-party process.
>
> In sum, getting a fragment back from a remote process is a poor indicator
> that it's time to retry pending sends.
>
> Maybe the real way to know when to retry pending sends is just to check if
> there's room on the FIFO.
>
> So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by checking if
> there are pending sends. If so, it'll retry them before performing the
> requested write. This should also help preserve ordering a little better.
> I'm guessing this will not hurt our message latency in any meaningful way,
> but I'll check this out.
>
> Meanwhile, I wanted to check in with y'all for any guidance you might have.
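
For reference, the retry-before-write behavior proposed in the quoted
message could look roughly like the sketch below. This is not the real
MCA_BTL_SM_FIFO_WRITE macro or the sm BTL's data structures: the types
(fifo_t, pending_t), the function names, the fixed sizes, and the use of
an int as a "fragment" are all assumptions made to keep the example small
and single-process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SLOTS  4    /* hypothetical fixed FIFO size */
#define MAX_PENDING 64   /* hypothetical bound on the local pending list */

typedef struct { int slots[FIFO_SLOTS]; size_t head, tail, count; } fifo_t;
typedef struct { int items[MAX_PENDING]; size_t head, count; } pending_t;

/* Write one fragment if there is room; report failure otherwise. */
static bool fifo_try_write(fifo_t *f, int frag) {
    if (f->count == FIFO_SLOTS) {
        return false;
    }
    f->slots[f->tail] = frag;
    f->tail = (f->tail + 1) % FIFO_SLOTS;
    f->count++;
    return true;
}

/* Receiver side: pop the oldest fragment, if any. */
static bool fifo_read(fifo_t *f, int *frag) {
    if (f->count == 0) {
        return false;
    }
    *frag = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return true;
}

static void pending_push(pending_t *p, int frag) {
    assert(p->count < MAX_PENDING);
    p->items[(p->head + p->count) % MAX_PENDING] = frag;
    p->count++;
}

/* Move pending sends into the FIFO, oldest first, while there is room. */
static void retry_pending(fifo_t *f, pending_t *p) {
    while (p->count > 0 && fifo_try_write(f, p->items[p->head])) {
        p->head = (p->head + 1) % MAX_PENDING;
        p->count--;
    }
}

/* Retry pending sends first, then attempt the requested write.
 * Returns true if frag went straight into the FIFO, false if queued. */
static bool fifo_write(fifo_t *f, pending_t *p, int frag) {
    retry_pending(f, p);
    if (p->count == 0 && fifo_try_write(f, frag)) {
        return true;
    }
    pending_push(p, frag);  /* stays behind older pending sends */
    return false;
}
```

In this sketch the trigger for a retry is simply "the sender is about to
write again", not the return of a fragment, and a new fragment never
jumps ahead of older pending ones, which is where the ordering benefit
comes from.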