Hmm. I'm not sure the BML is the right place to do this. The BML
doesn't know anything about the internals of the BTLs; it's just a
dispatch / multiplexer.
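
To make the layering concern concrete, here is a toy Python model (not
Open MPI code; `Transport` and `Multiplexer` are hypothetical stand-ins
for a BTL and the BML). It tries to show that the multiplexer only sees
an opaque send interface, so when a peer moves it has no address of its
own to update:

```python
class Transport:
    """Hypothetical stand-in for a BTL: peer addresses stay private to it."""

    def __init__(self, name, addresses):
        self.name = name
        self._addresses = dict(addresses)  # peer -> address, internal detail

    def send(self, peer, payload):
        if peer not in self._addresses:
            raise ConnectionError(f"{self.name}: cannot reach peer {peer}")
        return f"{self.name} -> {self._addresses[peer]}: {payload}"


class Multiplexer:
    """Hypothetical stand-in for the BML: picks a transport, nothing more."""

    def __init__(self, transports):
        self._transports = list(transports)

    def send(self, peer, payload):
        errors = []
        for transport in self._transports:
            try:
                return transport.send(peer, payload)
            except ConnectionError as exc:
                errors.append(str(exc))
        # All it can do is report the failure upward; it has no address to fix.
        raise ConnectionError("; ".join(errors))
```

In this model, recovering from a migrated peer would mean updating
`_addresses` inside each transport, which is exactly the knowledge the
dispatcher lacks.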

Unfortunately, few of us are in a good place to respond at the moment
-- SC is next week and we're all hosed trying to get ready for that...

On Nov 13, 2008, at 1:07 PM, Leonardo Fialho wrote:

> Ralph,
>
> Very good document.
>
> About the MPI layer (in case of a fault), my idea is to give the BML
> the ability to handle the BTL errors that occur when a process dies
> (and has probably been migrated) by discovering its new location. I
> think this is possible because the HNP requests the restart for
> the orted daemon, so it knows the new location of the faulty process.
>
> Leonardo
>
> Ralph Castain wrote:
>> If you look at the Dec meeting wiki, you will see that we are
>> moving quickly to a modex-less launch anyway. It won't be the
>> default because it requires pre-discovery of the cluster's network
>> resources (for which we will provide a tool or method), but it will
>> help resolve some of these problems.
>>
>> Outside of that, I will have to leave it to the FT folks to figure
>> out how to resolve modex situations. We have the ability to support
>> multiple modex models (and already do), but I don't know if you can
>> do what you describe or not - I'm not sure how the MPI layer will
>> handle that situation.
>>
>> Ralph
>>
>> On Nov 13, 2008, at 6:22 AM, Leonardo Fialho wrote:
>>
>>> Jeff,
>>>
>>> I agree with your viewpoint, principally about the "reachability".
>>> But...
>>>
>>> From the FT viewpoint, sometimes (or in some FT architectures) we
>>> want to recover an application process on a node different from
>>> the original one. In this case a new modex would have to be
>>> performed. That's fine for coordinated C/R; for uncoordinated C/R,
>>> on the other hand, it's not a good choice, I think. Once again,
>>> the tradeoffs...
>>>
>>> A possible solution is to perform n-1 modex exchanges involving the
>>> recovered process and each of the other processes... Is that
>>> better than an allgather modex? I don't know. I think not. And what
>>> is the impact of an allgather modex while the MPI thread is
>>> delivering messages? The answers to these questions could suggest
>>> that uncoordinated C/R is not possible in Open MPI.
>>>
>>> Leonardo Fialho
>>>
>>>
>>> Jeff Squyres wrote:
>>>> On Nov 7, 2008, at 10:18 AM, Leonardo Fialho wrote:
>>>>
>>>>> I understand that a process needs to have the contact information
>>>>> to send MPI messages to other processes, and the modex permits that.
>>>>> My question is: why not perform the contact exchange only when it
>>>>> is necessary?
>>>>>
>>>>> For example: in an M/W application, the workers do not need
>>>>> more information than the master's contact info.
>>>>>
>>>>> I think that it reduces the startup time, but increases the cost
>>>>> of the *first* communication between two peers.
>>>>
>>>>
>>>> FWIW, this is actually a pretty complex topic. There are many,
>>>> many tradeoffs in terms of what performance you want vs. what
>>>> functionality you want. This subject has been discussed for
>>>> many, many hours by the OMPI developers. :-)
>>>>
>>>> The modex is performed during MPI_INIT; the v1.3 series' modex is
>>>> quite a bit more efficient than the v1.2 series' modex. The
>>>> modex information comprises several things, some of which are
>>>> either the contact info or "reachability" info of BTL modules.
>>>> For the openib BTL, for example, port subnet IDs and MTUs are
>>>> passed in the modex, but LIDs don't need to be passed (in some
>>>> cases) until two processes actually try to reach each other. We
>>>> use the reachability information to determine whether a given BTL
>>>> module *could* be used to connect to a remote peer. For example,
>>>> if we get to the end of MPI_INIT and find a peer that cannot be
>>>> reached, we abort (after hours of debate, we decided it was
>>>> better to abort right away when there was a peer that could not
>>>> be reached rather than abort only on attempted first contact
>>>> because it could be a simple network/configuration error that
>>>> should be detected immediately, rather than erroring out
>>>> [potentially] long into a multi-hour run).
>>>>
>>>> We have been discussing a "modex-less" startup for quite a while;
>>>> this is actually one of the topics on the agenda for an
>>>> engineering meeting that we're having in December. Modex-less is
>>>> quite important for scalability to many thousands of processes,
>>>> but other tradeoffs may be necessary to make this work (read:
>>>> we've talked about modex-less for forever; we're finally likely
>>>> to do it in the near future because of some upcoming very very
>>>> large scale machines at US DOE labs).
>>>>
>>>> Does that make sense?
>>>>
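
A rough sketch of the abort-at-init policy described above, as a toy
Python model rather than Open MPI code. It assumes each rank publishes a
set of subnet IDs as its reachability info; `unreachable_pairs` and
`init_check` are hypothetical names:

```python
def unreachable_pairs(subnets_by_rank):
    """Return rank pairs that share no subnet, using only modex-style data."""
    ranks = sorted(subnets_by_rank)
    return [
        (a, b)
        for i, a in enumerate(ranks)
        for b in ranks[i + 1:]
        if not subnets_by_rank[a] & subnets_by_rank[b]
    ]


def init_check(subnets_by_rank):
    """Fail at startup rather than on first contact deep into a run."""
    bad = unreachable_pairs(subnets_by_rank)
    if bad:
        raise RuntimeError(f"unreachable peers at init: {bad}")
```

Note that the check needs everyone's reachability info up front, which
is the part an on-demand exchange would give up.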
>>>
>>>
>>> --
>>> Leonardo Fialho
>>> Computer Architecture and Operating Systems Department - CAOS
>>> Universidad Autonoma de Barcelona - UAB
>>> ETSE, Edifcio Q, QC/3088
>>> http://www.caos.uab.es
>>> Phone: +34-93-581-2888
>>> Fax: +34-93-581-2478
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel