This is better than nothing, but really not very helpful for looking at the
specific issues that can arise with this, unless these systems have several
parallel networks, with tests that will generate a lot of parallel network
traffic, and be able to self check for out-of-order received - i.e. this
needs to be encoded into the payload for verification purposes. There are
some out-of-order scenarios that need to be generated and checked. I think
that George may have a system that will be good for this sort of testing.

Rich

On 12/12/07 3:20 PM, "Gleb Natapov" <glebn_at_[hidden]> wrote:

> On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote:
>> Gleb --
>>
>> How about making a tarball with this patch in it that can be thrown at
>> everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere)
> I don't have access to www.open-mpi.org, but I can send you the patch.
> I can send you a tarball too, but I prefer to not abuse email.
>
>>
>>
>> On Dec 11, 2007, at 4:14 PM, Richard Graham wrote:
>>
>>> I will re-iterate my concern. The code that is there now is mostly
>>> nine
>>> years old (with some mods made when it was brought over to Open
>>> MPI). It
>>> took about 2 months of testing on systems with 5-13 way network
>>> parallelism
>>> to track down all KNOWN race conditions. This code is at the center
>>> of MPI
>>> correctness, so I am VERY concerned about changing it w/o some very
>>> strong
>>> reasons. Not apposed, just very cautious.
>>>
>>> Rich
>>>
>>>
>>> On 12/11/07 11:47 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>
>>>> On Tue, Dec 11, 2007 at 08:36:42AM -0800, Andrew Friedley wrote:
>>>>> Possibly, though I have results from a benchmark I've written
>>>>> indicating
>>>>> the reordering happens at the sender. I believe I found it was
>>>>> due to
>>>>> the QP striping trick I use to get more bandwidth -- if you back
>>>>> down to
>>>>> one QP (there's a define in the code you can change), the reordering
>>>>> rate drops.
>>>> Ah, OK. My assumption was just from looking into code, so I may be
>>>> wrong.
>>>>
>>>>>
>>>>> Also I do not make any recursive calls to progress -- at least not
>>>>> directly in the BTL; I can't speak for the upper layers. The
>>>>> reason I
>>>>> do many completions at once is that it is a big help in turning
>>>>> around
>>>>> receive buffers, making it harder to run out of buffers and drop
>>>>> frags.
>>>>> I want to say there was some performance benefit as well but I
>>>>> can't
>>>>> say for sure.
>>>> Currently upper layers of Open MPI may call BTL progress function
>>>> recursively. I hope this will change some day.
>>>>
>>>>>
>>>>> Andrew
>>>>>
>>>>> Gleb Natapov wrote:
>>>>>> On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote:
>>>>>>> Try UD, frags are reordered at a very high rate so should be a
>>>>>>> good test.
>>>>>> Good Idea I'll try this. BTW I thing the reason for such a high
>>>>>> rate of
>>>>>> reordering in UD is that it polls for MCA_BTL_UD_NUM_WC completions
>>>>>> (500) and process them one by one and if progress function is
>>>>>> called
>>>>>> recursively next 500 completion will be reordered versus previous
>>>>>> completions (reordering happens on a receiver, not sender).
>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> Richard Graham wrote:
>>>>>>>> Gleb,
>>>>>>>> I would suggest that before this is checked in this be tested
>>>>>>>> on a
>>>>>>>> system
>>>>>>>> that has N-way network parallelism, where N is as large as you
>>>>>>>> can find.
>>>>>>>> This is a key bit of code for MPI correctness, and out-of-order
>>>>>>>> operations
>>>>>>>> will break it, so you want to maximize the chance for such
>>>>>>>> operations.
>>>>>>>>
>>>>>>>> Rich
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/11/07 10:54 AM, "Gleb Natapov" <glebn_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I did a rewrite of matching code in OB1. I made it much
>>>>>>>>> simpler and 2
>>>>>>>>> times smaller (which is good, less code - less bugs). I also
>>>>>>>>> got rid
>>>>>>>>> of huge macros - very helpful if you need to debug something.
>>>>>>>>> There
>>>>>>>>> is no performance degradation, actually I even see very small
>>>>>>>>> performance
>>>>>>>>> improvement. I ran MTT with this patch and the result is the
>>>>>>>>> same as on
>>>>>>>>> trunk. I would like to commit this to the trunk. The patch is
>>>>>>>>> attached
>>>>>>>>> for everybody to try.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Gleb.
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> devel_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel>>>>>>
>>>>>> --
>>>>>> Gleb.
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel>>>>
>>>> --
>>>> Gleb.
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel>
> --
> Gleb.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel