Hello,
I'm a new Torque user and am setting up a test cluster. We use several MPI
applications, so I am trying to use the MPI TM interface (lam-7.1.3). I've
been using torque-2.3.2 so far.
SETUP:
LAN
|
| public network
|
HEADNODE
|
| private network
|
CLUSTER
The headnode has both a public and a private address, and the hostname is set
to the public name (requirement).
The public name is neptune, the private name is agcnms2.
On the headnode a pbs_mom is running as well.
PROBLEM:
I need to start an MPI application with the first node (n0) on the headnode,
as that is the one with the diskpack and some applications collect all data on
the first node.
When I start a job with the script:
<file>
#PBS -N mpi_hello
#PBS -l nodes=agcnms2:ppn=4+3:ppn=4
#
cd $PBS_O_WORKDIR
lamboot -d -ssi boot tm
mpiexec ./hello_mpi
lamhalt
</file>
Then booting LAM fails. The reason is that the private name agcnms2 is
translated to the public name neptune, so a boot is attempted over two
networks, which obviously fails.
Digging through the code and stracing showed that when Torque was collecting
its hosts for the LAM environment, it contacted all relevant MOMs, probably
to update the resource availability (I don't know exactly). Part of the
info returned by the MOM is the uname string. However, that contains the
machine's hostname and not necessarily the name of the MOM. Usually these are
the same, but not in my case.
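To check whether a given node is affected, one can compare what uname reports with the name the MOM is supposed to answer to. The following is only a minimal sketch and not Torque code: the function name and the `expected_mom_name` parameter are made up for illustration, and the field is called `nodename` here because that is the name POSIX gives it in `struct utsname`.

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>

/*
 * Return 1 if the kernel's node name (as reported by uname) equals the
 * name the MOM is expected to use, 0 if it differs, -1 if uname fails.
 * expected_mom_name is a hypothetical parameter for this sketch, for
 * example the private name of the headnode.
 */
int uname_matches_mom_name(const char *expected_mom_name)
{
    struct utsname n;

    assert(expected_mom_name != NULL);

    if (uname(&n) == -1)
        return -1;

    /* Exact comparison: a short name and a fully qualified name differ. */
    return strcmp(n.nodename, expected_mom_name) == 0 ? 1 : 0;
}
```

On a headnode set up as described, calling this with "agcnms2" would be expected to return 0, since the kernel hostname is the public name.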
In $torquesrc/src/resmom/mom_main.c:785 and further I've changed:
-----------------
sprintf(ret_string,"%s %s %s %s %s",
        n.sysname,
        n.hostname,
        n.release,
        n.version,
        n.machine);
----- TO -----
sprintf(ret_string,"%s %s %s %s %s",
        n.sysname,
        mom_short_name,
        n.release,
        n.version,
        n.machine);
----------------
Now the reported name follows the -H flag of pbs_mom, and MPI TM boots
properly. As far as I can see this hack does not influence any other behavior,
but I can't be sure of that as I'm just starting out.
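For anyone who wants to try the same idea outside the Torque tree, the change can be sketched as a standalone function. This is an illustration under assumptions, not the actual mom_main.c code: `format_uname` and its parameters are hypothetical, the fallback to the kernel node name when no override is given is an addition of this sketch, and `snprintf` is used instead of `sprintf` so that a buffer that is too small is reported rather than overrun.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Build the "sysname name release version machine" string.
 * If mom_name is non-NULL and non-empty, it replaces the kernel node
 * name; otherwise the kernel node name is used unchanged.
 * Returns 0 on success, -1 if uname fails or the buffer is too small.
 * (Hypothetical helper, not part of Torque.)
 */
int format_uname(char *buf, size_t len, const char *mom_name)
{
    struct utsname n;
    const char *name;
    int written;

    assert(buf != NULL && len > 0);

    if (uname(&n) == -1)
        return -1;

    /* Prefer the configured MOM name, fall back to the kernel's name. */
    name = (mom_name != NULL && mom_name[0] != '\0') ? mom_name : n.nodename;

    written = snprintf(buf, len, "%s %s %s %s %s",
                       n.sysname, name, n.release, n.version, n.machine);

    /* snprintf returns the length it wanted to write; detect truncation. */
    return (written < 0 || (size_t)written >= len) ? -1 : 0;
}
```

Called with "agcnms2" as the third argument, the second field of the resulting string is the private name regardless of the machine's hostname, which is the effect of the patch above.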
Best regards,
Sander
--
ARGOSS: Atmospheric, marine & coastal information, systems and consultancy.
P.O. Box 61
8325 ZH Vollenhove
The Netherlands
Tel: +31 (0)527-242299
Fax: +31 (0)527-242016
E-mail: hulst at argoss.nl
Web: www.argoss.nl