On 5/4/2012 8:26 AM, Rolf vandeVaart wrote:
>
>>> 2. If that works, then you can also run with a debug switch to see
>>> what connections are being made by MPI.
>> You can see the connections being made in the attached log:
>>
>> [archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2] address
>> 138.23.141.162 on port 2001
> Yes, I missed that. So, can we simplify the problem? Can you run with np=2 and one process on each node?
> Also, maybe you can send the ifconfig output from each node. We sometimes see this type of hanging when
> a node has two different interfaces on the same subnet.
>
> Assuming there are multiple interfaces, can you experiment with the runtime flags outlined here?
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> Maybe by restricting to specific interfaces you can figure out which network is the problem.
>
Another cause of TCP hangs, if you are on Linux, is a configured virbr0
interface (the libvirt virtual bridge). The TCP BTL will incorrectly
assume that it can use the virbr0 interface to communicate with other
nodes. You either need to disable virbr0 on each node or exclude it
from being used by the TCP BTL.
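
For the exclusion route, here is a rough sketch of how the command line
could be put together. It assumes the bridge interfaces show up under
/sys/class/net with names starting with "virbr", and ./my_mpi_app is
only a placeholder for your own program. The script prints the mpirun
command instead of running it, so you can check it first. Note that lo
stays in the list: setting btl_tcp_if_exclude replaces the default
value, which is what normally keeps loopback out.

```shell
#!/bin/sh
# Start the exclude list with loopback, then add any virbr* interfaces
# found on this node (assumption: they are listed in /sys/class/net).
exclude="lo"
for path in /sys/class/net/virbr*; do
    [ -e "$path" ] || continue   # glob matched nothing: no virbr interfaces
    exclude="$exclude,$(basename "$path")"
done

# ./my_mpi_app is a hypothetical application name; substitute your own.
cmd="mpirun --mca btl_tcp_if_exclude $exclude -np 2 ./my_mpi_app"
echo "$cmd"
```

The opposite approach is to list only the interfaces you do want with
btl_tcp_if_include (for example eth0, if that is what your nodes call
their real network). Use one parameter or the other, not both.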