> On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
>
>>> Ah, this is the key. If I have one process (out of many) fail the create_cq() function, I get a segv during finalize. I'll dig.
>>
>> Is there an assumption that if process A claims to be able to communicate with process B, then process B can also communicate with process A? It almost sounds like the code needs to do an allreduce on the bitmask returned by the btls.
>
> Actually, this is exactly the case (I just dug into the code and verified this).
>
> In this case, we're already well beyond the point where we synchronized and decided who can connect to whom. I.e., the modex is already done -- the openib BTL in process X has decided that it is available and has advertised its RDMACM CPC and OOB CPC contact info.
>
> But then later in process X during the openib BTL add_procs, something fails. So the openib clears the connect bits and transparently fails over to TCP. No problem.
>
> The problem is the other peers who think that they can still connect to process X via the openib BTL.
>
> 1. In this case, the openib BTL was not finalized, so there was a stub still there listening on the RDMACM CPC. When another process tried to connect to X's RDMACM CPC port, Bad Things happened (because it was only half setup) and we segv'ed.
>
> Obviously, this should be fixed. "Fixed" in this case probably means closing down the RDMACM CPC listening port. But then that leads to another form of Badness.

I wonder how this is possible. If process X fails to connect to Y, how can Y succeed in connecting to X? Please enlighten me.

>
> 2. If the openib BTL cleanly shuts down and is *not* still listening on its modex-advertised RDMACM CPC contact port, then if some other process tries to contact process X via the modex info, it'll fail. This will then be judged to be a fatal error. Failover in the BML will simply have delayed the job abort until someone tries to contact X via the openib BTL.

Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one and the connection fails, then the PML will automatically try the next available BTL, so it will eventually fail over to TCP (if available).
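
To make the behavior I have in mind concrete, here is a rough sketch in plain C. It is not the actual PML/BML code: transport_t, connect_with_failover, the CONN_* codes and the two fake connect functions are all names I made up for illustration. It only assumes that a failed connection attempt (for example after a timeout) is reported back to the caller instead of aborting the job.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; these are not Open MPI constants. */
enum { CONN_OK = 0, CONN_TIMEOUT = 1 };

/* Hypothetical transport entry: a name plus a connect callback. */
typedef struct {
    const char *name;
    int (*connect)(int peer);
} transport_t;

/* Try each transport in priority order. Return the index of the
 * first one that connects to 'peer', or -1 if none of them do. */
int connect_with_failover(const transport_t *list, size_t n, int peer)
{
    for (size_t i = 0; i < n; ++i) {
        if (list[i].connect(peer) == CONN_OK) {
            return (int)i;
        }
    }
    return -1;
}

/* Stand-ins for real transports: the first always times out,
 * the second always succeeds. */
int fake_openib_connect(int peer) { (void)peer; return CONN_TIMEOUT; }
int fake_tcp_connect(int peer)    { (void)peer; return CONN_OK; }
```

With the list { openib, tcp }, the call returns index 1 (TCP) once the first attempt reports a timeout. With only the first entry it returns -1, which is the case that would have to be treated as fatal.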

>
> I think that the majority of this discussion about the BML failure (or not) behavior assumed that *all* processes had the same failure (at least: *I* assumed this). But if only *some* of the processes fail a given BTL add_procs, we have a problem because we're beyond the point of deciding who can connect to whom. Shutting down a single BTL module at that point will create an inconsistency of the distributed data.

We did assume that the errors are at least symmetric, i.e. if A fails to connect to B, then B will also fail when trying to connect to A. However, if other BTLs are available, the connection is supposed to move smoothly over to one of them. As an example, in the MX BTL, if two nodes have MX support but do not share the same mapper, add_procs will silently fail.
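
For what it's worth, the symmetry assumption can be written down as a small check. This is only a sketch of Ashley's bitmask idea, simulated inside a single process: NPROCS, the reach[] array and symmetrize() are hypothetical names. In a real job each process would hold only its own mask, and the masks would have to be exchanged with a collective first.

```c
#include <assert.h>
#include <stdint.h>

#define NPROCS 4

/* reach[i] has bit j set if process i believes it can reach process j
 * over a given BTL. On return, sym[i] keeps bit j only if process j
 * also believes it can reach process i. */
void symmetrize(const uint32_t reach[NPROCS], uint32_t sym[NPROCS])
{
    for (int i = 0; i < NPROCS; ++i) {
        sym[i] = 0;
        for (int j = 0; j < NPROCS; ++j) {
            uint32_t i_to_j = (reach[i] >> j) & 1u;
            uint32_t j_to_i = (reach[j] >> i) & 1u;
            if (i_to_j && j_to_i) {
                sym[i] |= (1u << j);
            }
        }
    }
}
```

If process 2 clears all of its bits after a failed add_procs while the others still advertise it, the symmetrized masks drop process 2 everywhere, so no peer would try to reach it over that BTL.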