Dear Torque users,
We have previously discussed a problem starting LAM-MPI parallel jobs
with torque-1.2.0p6 in this thread:
http://www.supercluster.org/pipermail/torqueusers/2005-September/002079.html
If you use Torque on Redhat Enterprise Linux 4, Fedora Core 4 or any
other system using gcc 3.4 (or later), you should know about a
problem caused by a new feature in gcc 3.4, as well as the solution
to this problem:
We found that the Torque build process has a problem with
gcc 3.4.3, namely that a "make install" will cause a second,
superfluous recompilation of everything. If you're building
an RPM, this causes subtle problems in the resulting RPMs
because some hardcoded paths may be incorrect. This was the
problem that made LAM-MPI booting fail because pbs_mom
could not find the pbs_demux executable (see the above thread).
The quick summary:
------------------
1. With Torque up to and including 1.2.0p6, a workaround is to
configure Torque with an additional CFLAGS option
-fno-working-directory, if your system uses gcc 3.4 or newer.
2. Torque 1.2.0p7 (current snapshot and later) has a patch in
buildutils/makedepend-sh which is the permanent solution,
so the -fno-working-directory workaround is not needed here.
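As a rough sketch of the workaround in item 1, the flag can be added
only when the compiler is gcc 3.4 or newer. The helper names
needs_workaround and extra_cflags, and the base flags "-g -O2", are
assumptions made for illustration and are not part of Torque:

```shell
# Sketch only: helper names and base flags are assumptions, not part of Torque.
BASE_CFLAGS="-g -O2"   # assumed default flags

# Succeeds if the given gcc version string (e.g. "3.4.3") is 3.4 or newer.
needs_workaround() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 4 ]; }
}

# Prints the CFLAGS to use for the given gcc version.
extra_cflags() {
    if needs_workaround "$1"; then
        echo "$BASE_CFLAGS -fno-working-directory"
    else
        echo "$BASE_CFLAGS"
    fi
}

extra_cflags 3.4.3
extra_cflags 3.3.6

# Typical use (commented out so that the sketch has no side effects):
#   CFLAGS="$(extra_cflags "$(gcc -dumpversion)")" ./configure
```

Passing CFLAGS in the environment of ./configure is one common way to
hand the flag to the build. If your RPM spec file sets CFLAGS itself,
the flag would have to be added there instead.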
Additional details:
-------------------
The gcc 3.4 man-page describes a new feature:
-fworking-directory
Enable generation of linemarkers in the preprocessor output that
will let the compiler know the current working directory at the
time of preprocessing. When this option is enabled, the
preprocessor will emit, after the initial linemarker, a second
linemarker with the current working directory followed by two
slashes.
...
Because this new feature is enabled by default, Torque's
buildutils/makedepend-sh script adds to the Makefile a dependency of
every .o file on the current working directory, and thus on that
directory's timestamp, whenever the -g flag is in CFLAGS (the
default). Look for the following pattern in the Makefile:
# DO NOT DELETE THIS LINE -- makedepend-sh depends on it
accounting.o: ./accounting.c
accounting.o: /scratch/Torque/torque-1.2.0p6/src/server//
The line ending in "//" refers to the current working directory.
Because a directory's timestamp changes whenever files are created
or removed in it, the .o files practically always look out of date,
so they are all rebuilt every time you run "make" in any directory,
including when you run "make install".
When building RPMs this is a real problem, because "make install"
installs all files into a temporary location, and the recompilation
triggered at that point picks up paths under it. The pbs_mom binary
will then contain an incorrect hardcoded path to pbs_demux and
pbs_rcp, for example
/var/tmp/torque-1.2.0p6-buildroot/usr/sbin/pbs_demux
(check this with "strings /usr/sbin/pbs_mom | grep pbs_demux").
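The check on the hardcoded paths can be wrapped in a small filter. This
is only a sketch: find_buildroot_paths is a hypothetical helper, and it
assumes that the temporary build location has "buildroot" in its name,
as in the example path here. The sample input stands in for real
"strings" output:

```shell
# find_buildroot_paths is a hypothetical helper. It reads "strings" output
# on stdin and prints pbs_demux/pbs_rcp paths containing "buildroot" (an
# assumed naming convention for the RPM build root; adjust for your site).
find_buildroot_paths() {
    grep -E 'pbs_(demux|rcp)' | grep 'buildroot'
}

# Sample input standing in for: strings /usr/sbin/pbs_mom
bad=$(printf '%s\n' \
    '/var/tmp/torque-1.2.0p6-buildroot/usr/sbin/pbs_demux' \
    '/usr/sbin/pbs_rcp' | find_buildroot_paths)
echo "$bad"
```

On an installed system the equivalent would be
"strings /usr/sbin/pbs_mom | find_buildroot_paths". Any output points
to a path that was baked in during the RPM build.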
In this scenario all parallel jobs using the "tm" boot interface
will fail because pbs_mom cannot start the pbs_demux process.
A simple test is to run "pbsdsh hostname" within a multi-node PBS
batch job. If pbsdsh gives error messages, you may have the above
problem, and other environments that use the "tm" interface, such as
LAM-MPI, will fail as well.
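Independently of the pbsdsh test, a build tree can be checked for the
directory dependency described earlier by searching the generated
Makefiles for dependency lines ending in "//". A minimal sketch, using
a temporary file in place of a real Makefile; count_cwd_deps is a
hypothetical helper name:

```shell
# count_cwd_deps is a hypothetical helper: it prints the number of .o
# dependency lines in the given Makefile that end in "//".
count_cwd_deps() {
    grep -c '\.o:.*//$' "$1"
}

# Temporary stand-in for a Makefile processed by makedepend-sh.
demo=$(mktemp)
cat > "$demo" <<'EOF'
# DO NOT DELETE THIS LINE -- makedepend-sh depends on it
accounting.o: ./accounting.c
accounting.o: /scratch/Torque/torque-1.2.0p6/src/server//
EOF

affected=$(count_cwd_deps "$demo")
echo "$affected"
rm -f "$demo"
```

A non-zero count for a Makefile in your build tree suggests that the
objects in that directory will be rebuilt on every "make".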
If you want to patch your current Torque installation, here's
the diff (now in the CVS for 1.2.0p7) as provided by Garrick:
--- buildutils/makedepend-sh_orig 2005-09-18 10:04:34.000000000 -0700
+++ buildutils/makedepend-sh 2005-09-18 10:04:05.000000000 -0700
@@ -575,6 +575,7 @@
eval $CPP $arg_cc $d/$s $errout | \
sed -n -e "s;^\# [0-9][0-9 ]*\"\(.*\)\";$f: \1;p" | \
+ grep -v "$PWD//\$" | \
grep -v "$s\$" | grep -v command | grep -v built-in | \
sed -e 's;\([^ :]*: [^ ]*\).*;\1;' \
>> $TMP
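To illustrate what the added "grep -v" line does, here is a simplified
copy of the pipeline applied to sample preprocessor output. The sample
linemarkers, the stdio.h header path and the variable cwd (standing in
for $PWD) are assumptions made for this sketch, and the other grep
filters of makedepend-sh are left out:

```shell
f=accounting.o
cwd=/scratch/Torque/torque-1.2.0p6/src/server   # stands in for $PWD

# Sample linemarkers as gcc 3.4 might emit them (illustrative only).
cpp_output='# 1 "./accounting.c"
# 1 "/scratch/Torque/torque-1.2.0p6/src/server//"
# 1 "/usr/include/stdio.h" 1 3 4'

deps=$(printf '%s\n' "$cpp_output" |
    # turn each linemarker into a "target: file" dependency line
    sed -n -e "s;^# [0-9][0-9 ]*\"\(.*\)\";$f: \1;p" |
    # the patch: drop the working-directory entry ending in "//"
    grep -v "$cwd//\$" |
    # strip trailing linemarker flags
    sed -e 's;\([^ :]*: [^ ]*\).*;\1;')
echo "$deps"
```

Without the "grep -v" stage the working-directory line would pass
through and end up as a dependency in the Makefile.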
Many thanks go to Garrick Staples (USC) for much ping-pong debugging
and for coming up with the patch as well as the -fno-working-directory
workaround.
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark