Ashley Pittman wrote:
> On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:
>
> Thank you. I think you missed the top three lines of the output but
> that doesn't matter.
>
>
>> main() at ?:?
>> PMPI_Comm_dup() at pcomm_dup.c:62
>> ompi_comm_dup() at communicator/comm.c:661
>> -----------------
>> [0,2] (2 processes)
>> -----------------
>> ompi_comm_nextcid() at communicator/comm_cid.c:264
>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>> ompi_request_default_wait_all() at request/req_wait.c:262
>> opal_condition_wait() at ../opal/threads/condition.h:99
>> -----------------
>> [1,3] (2 processes)
>> -----------------
>> ompi_comm_nextcid() at communicator/comm_cid.c:245
>> ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
>> ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
>> ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
>> ompi_request_default_wait_all() at request/req_wait.c:262
>> opal_condition_wait() at ../opal/threads/condition.h:99
>>
>
> Lines 264 and 245 of comm_cid.c are both in a for loop which calls
> allreduce() twice per iteration until a certain condition is met. As
> such, it's hard to tell from this trace whether processes [0,2] are
> "ahead" or [1,3] are "behind". Either way, the allreduce() should not
> deadlock like that, so from the trace it's as likely to be a bug in
> the allreduce as it is in ompi_comm_nextcid().
>
> I assume all four processes are actually in the same call to comm_dup;
> re-compiling your program with -g and re-running padb would confirm
> this, as it would show the line numbers.
>
Yes, they are all in the second call to comm_dup.
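
To make the loop Ashley describes more concrete, here is a rough single-process sketch of a "two allreduces per iteration" agreement loop. It is only an illustration under my own assumptions, not the actual code in comm_cid.c: the names allreduce(), lowest_free() and next_cid() are hypothetical, the collective is simulated with a plain Python list, and I am assuming the first allreduce takes a MAX over proposed ids and the second a MIN over per-rank "is it free here?" flags.

```python
# Hypothetical sketch: simulates N ranks in one process, no real MPI involved.

def allreduce(values, op):
    """Stand-in for a collective: every rank receives op() over all contributions."""
    result = op(values)
    return [result] * len(values)

def lowest_free(used, start):
    """Lowest id >= start that this rank has not already used."""
    cid = start
    while cid in used:
        cid += 1
    return cid

def next_cid(used_per_rank, start=0):
    """Agree on an id that is free on every rank; returns (id, iterations)."""
    rounds = 0
    while True:
        rounds += 1
        # First allreduce of the iteration (assumed MAX): pick a common candidate.
        proposals = [lowest_free(used, start) for used in used_per_rank]
        candidate = allreduce(proposals, max)[0]
        # Second allreduce of the iteration (assumed MIN): is it free everywhere?
        flags = [int(candidate not in used) for used in used_per_rank]
        agreed = allreduce(flags, min)[0]
        if agreed:
            for used in used_per_rank:
                used.add(candidate)
            return candidate, rounds
        # Some rank rejected the candidate, so retry above it.
        start = candidate + 1

if __name__ == "__main__":
    # Four ranks whose locally used ids differ, so one retry is needed.
    ranks = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2}, {0, 1, 3}]
    print(next_cid(ranks))
```

In a loop of this shape, two ranks stopped at different source lines could be in different allreduce calls of the same iteration or in different iterations altogether, which is why the trace alone does not say which pair is "ahead".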