Are you sure you have the same OpenMPI version installed in
/usr/local/openmpi on *all* nodes?

The fact that the programs run on xserve01 alone, but hang
when you use xserve01 and xserve02 together, suggests
some inconsistency in the runtime environment,
which may come from different OpenMPI versions.

You can check this, say, by logging in to each node and doing
/usr/local/openmpi/bin/ompi_info and comparing the output.
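
If logging in to each node by hand is tedious, something like the
sketch below might do the comparison for you. It is only a guess at
your setup: the host names are the ones from your mpirun commands, it
assumes passwordless ssh between the nodes, and it assumes ompi_info
prints its usual "Open MPI: x.y.z" line. The function names
ompi_version and check_versions are made up for this example.

```shell
# Hosts and install path taken from the mpirun commands in this thread
# (assumptions; adjust to your cluster).
HOSTS="xserve01 xserve02"
OMPI_INFO=/usr/local/openmpi/bin/ompi_info

# Read ompi_info output on stdin and print only the version string.
# Assumes a line of the form "Open MPI: x.y.z" is present.
ompi_version() {
    sed -n 's/^ *Open MPI: *//p' | head -n 1
}

# Print "host: version" for every host in HOSTS.
# Needs passwordless ssh; prints "unknown" if nothing could be read.
check_versions() {
    for h in $HOSTS; do
        v=$(ssh -n "$h" "$OMPI_INFO" 2>/dev/null | ompi_version)
        echo "$h: ${v:-unknown}"
    done
}
```

After sourcing this, running check_versions should print one line per
host. If the versions differ, or one host reports "unknown", that
would be the first thing to fix.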

Anyway, this is just a guess.

Gus Correa

Jody Klymak wrote:
> Hello,
>
>
> On Aug 11, 2009, at 8:15 AM, Ralph Castain wrote:
>
>> You can turn off those mca params I gave you as you are now past that
>> point. I know there are others that can help debug that TCP btl error,
>> but they can help you there.
>
> Just to eliminate the mitgcm from the debugging, I compiled
> example/hello_c.c and ran it as:
>
> /usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01
> hello_c >& hello_c4_1host.txt
>
> There is no ostensible problem. If I run as:
>
> /usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host
> xserve01,xserve02 hello_c >& hello_c4_2host.txt
>
> The process says Hello, but hangs at the end, and needs to be killed
> with ^C.
>
> I then modified connectivity_c to include a printf as MPI is
> initialized, and hardwired verbose=1. This completes, and appears to
> work fine:
>
> /usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host xserve01
> connectivity_c >& connectivity_c8_1host.txt
>
> However, again, two hosts sours the mix:
>
> /usr/local/openmpi/bin/mpirun --debug-daemons -n 8 -host
> xserve01,xserve02 connectivity_c >& connectivity_c8_2host.txt
>
> This hangs, and after waiting a minute or so we see that ranks 0-4 on
> xserve01 cannot contact rank 5 (presumably on xserve02).
>
> It seems that I have something wrong in my tcp setup, but communication
> between these servers worked yesterday using 1.1.5, and ping etc all
> work fine, so something else is up. Some sort of port permissions?
>
> The most glaring error I see in these is:
>
> [xserve02.local:43625] [[28627,0],2] orte:daemon:send_relay - recipient
> list is empty!
>
> I see reference in the archives to a similar error where "contacts.txt"
> could not be found. I've had trouble with 10.5.7 with temporary
> directories, so maybe that is the issue?
>
> Thanks Jody
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/