Using MX_CSUM should _not_ make a difference by itself. But it
requires the debug library, which may alter the timing enough to
avoid a race (in MX, OMPI, or the application).
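
For example, a typical invocation looks something like this (assuming
MX is installed under /opt/mx, and using mpirun's -x flag to export
environment variables to the launched processes; the binary name is
just a placeholder):

  $ export LD_LIBRARY_PATH=/opt/mx/lib/debug:$LD_LIBRARY_PATH
  $ mpirun -x LD_LIBRARY_PATH -x MX_CSUM=1 -np 4 ./verifycontent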

Correct: if you use the MTL, then all messages are handled by MX
(internode, shared memory, and self).
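
For example, to request the CM PML and the MX MTL explicitly (naming
the MTL is optional when MX is the only one available, but it makes
the test unambiguous; binary name again a placeholder):

  $ mpirun -mca pml cm -mca mtl mx -np 4 ./verifycontent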

Scott

On Jul 3, 2009, at 7:41 AM, 8mj6tc902_at_[hidden] wrote:

> Scott,
>
> Thanks for your advice! Good to know about the checksum debug
> functionality! Strangely enough, running with either "MX_CSUM=1" or
> "-mca pml cm" allows Murasaki to work normally, and makes the test
> case I attached in my previous mail work. Very suspicious, but at
> least it does give me a working solution (however, if I understand
> OpenMPI correctly, I shouldn't be able to use the CM PML over a
> network where some nodes have MX and some don't, correct?).
>
> Scott Atchley atchley-at-myri.com wrote:
>> Hi Kris,
>>
>> I have not run your code yet, but I will try to this weekend.
>>
>> You can have MX checksum its messages if you set MX_CSUM=1 and use
>> the MX debug library (e.g., set LD_LIBRARY_PATH to
>> /opt/mx/lib/debug).
>>
>> Do you have the problem if you use the MX MTL? To test it, modify
>> your mpirun as follows:
>>
>> $ mpirun -mca pml cm ...
>>
>> and do not specify any BTL info.
>>
>> Scott
>>
>> On Jul 2, 2009, at 6:05 PM, 8mj6tc902_at_[hidden] wrote:
>>
>>> Hi. I've now spent many, many hours tracking down a bug that was
>>> causing my program to die, as though either its memory were getting
>>> corrupted or messages were getting clobbered while going through
>>> the network; I couldn't tell which. I really wish the checksum flag
>>> on btl_mx_flags were working. But anyway, I think I've managed to
>>> recreate the core of the problem in a small-ish test case, which
>>> I've attached (verifycontent.cc). For me it usually segfaults at
>>> MPI_Issend after sending about 60-90 messages, using OpenMPI 1.3.2
>>> with Myricom's mx-1.2.9 drivers on Linux with gcc 4.3.2. Disabling
>>> the mx btl (mpirun -mca btl ^mx) makes it work (likewise for my own
>>> larger project, Murasaki). The MPI_Ssend-based version
>>> (verifycontent-ssend.cc) also works with no problem over MX. So I
>>> suspect the issue lies in OpenMPI 1.3.2's handling of MPI_Issend
>>> over MX, but it's also possible I've horribly misunderstood
>>> something fundamental about MPI and it's just my fault; if that's
>>> the case, please let me know (but both this test case and Murasaki
>>> work over MPICH-MX, so OpenMPI is definitely doing something
>>> different).
>>>
>>> Here's a brief description of verifycontent.cc to make reading it
>>> easier (a sketch of the core pattern follows below):
>>> * Given -np=N, half the nodes send and half receive some number of
>>> messages (reps).
>>> * Each message consists of buflen (5000) chars, set to some value
>>> based on the sending node's rank and the sequence number of the
>>> message.
>>> * The receiving node starts an irecv for each sending node and
>>> tests each request until a message arrives.
>>> * The receiver then checks the contents of the message to make sure
>>> it matches what was supposed to be in there (this is where my real
>>> project, Murasaki, actually fails; I can't seem to replicate that,
>>> however).
>>> * The senders meanwhile keep sending messages, dequeuing them when
>>> their requests test as completed.
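>>>
>>> For reference, a minimal sketch of that pattern (a condensed
>>> illustration, not the attached file: it assumes an even process
>>> count, pairs each sender with one receiver, and picks reps
>>> arbitrarily):
>>>
>>> #include <mpi.h>
>>> #include <cstdio>
>>> #include <cstring>
>>> #include <list>
>>> #include <utility>
>>> #include <vector>
>>>
>>> static const int buflen = 5000; // chars per message
>>> static const int reps   = 100;  // messages per sender (arbitrary)
>>>
>>> int main(int argc, char **argv) {
>>>   MPI_Init(&argc, &argv);
>>>   int rank, size;
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>   int senders = size / 2; // first half sends, second half receives
>>>
>>>   if (rank < senders) {
>>>     // Sender: stream MPI_Issends, dequeuing each request (and
>>>     // freeing its buffer) once it tests as complete.
>>>     std::list<std::pair<MPI_Request, char*> > inflight;
>>>     for (int seq = 0; seq < reps; ++seq) {
>>>       char *buf = new char[buflen];
>>>       // Message content derived from sender rank and sequence.
>>>       memset(buf, (char)((rank + seq) % 127), buflen);
>>>       MPI_Request req;
>>>       MPI_Issend(buf, buflen, MPI_CHAR, senders + rank, seq,
>>>                  MPI_COMM_WORLD, &req);
>>>       inflight.push_back(std::make_pair(req, buf));
>>>       for (std::list<std::pair<MPI_Request, char*> >::iterator
>>>            it = inflight.begin(); it != inflight.end();) {
>>>         int done = 0;
>>>         MPI_Test(&it->first, &done, MPI_STATUS_IGNORE);
>>>         if (done) { delete [] it->second; it = inflight.erase(it); }
>>>         else ++it;
>>>       }
>>>     }
>>>     // Drain the sends that are still outstanding.
>>>     while (!inflight.empty()) {
>>>       MPI_Wait(&inflight.front().first, MPI_STATUS_IGNORE);
>>>       delete [] inflight.front().second;
>>>       inflight.pop_front();
>>>     }
>>>   } else {
>>>     // Receiver: post an irecv, test until it completes, then
>>>     // verify the payload byte by byte.
>>>     int src = rank - senders;
>>>     std::vector<char> buf(buflen);
>>>     for (int seq = 0; seq < reps; ++seq) {
>>>       MPI_Request req;
>>>       MPI_Irecv(&buf[0], buflen, MPI_CHAR, src, seq,
>>>                 MPI_COMM_WORLD, &req);
>>>       int done = 0;
>>>       while (!done) MPI_Test(&req, &done, MPI_STATUS_IGNORE);
>>>       char expect = (char)((src + seq) % 127);
>>>       for (int i = 0; i < buflen; ++i) {
>>>         if (buf[i] != expect) {
>>>           fprintf(stderr, "rank %d: bad byte %d in msg %d\n",
>>>                   rank, i, seq);
>>>           MPI_Abort(MPI_COMM_WORLD, 1);
>>>         }
>>>       }
>>>     }
>>>   }
>>>   MPI_Finalize();
>>>   return 0;
>>> }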
>>>
>>> Testing the current subversion trunk version, 1.4a1r21594: it seems
>>> to pass my test case, but it also tends to show errors like
>>> "mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)"
>>> on startup, and Murasaki still fails (messages turn into zeros
>>> about 132KB in), so something still isn't right...
>>>
>>> If anyone has any ideas about this test case failing, or my larger
>>> issue of messages turning into zeros after 132KB (though sadly
>>> sometimes it isn't at 132KB, but straight from 0KB, which is very
>>> confusing) while on MX, I'd greatly appreciate it. Even a simple
>>> confirmation of "Yes, MPI_Issend/Irecv with MX has issues in 1.3.2"
>>> would help my sanity.
>>> --
>>> Kris Popendorf
>>>
>>> Keio University
>>> http://murasaki................... <- (Probably too cumbersome to
>>> expect most people to test, but if you feel daring, try putting in
>>> some Human/Mouse chromosomes over MX)
>
> --
> --Kris
>
> [A dream that comes true can't really be called a dream.]