Hi everyone,
Currently we're using torque 2.5.11 and would like to migrate to 4.x
pretty soon. However, some testing with 4.0.2 has shown that programs
linked against a version of OpenMPI (1.4.x) that was compiled with torque
2.5 won't run across more than one node. My guess is that the task
manager API has changed between 2.5 and 4.0.
Certainly, best practices would suggest recompiling all libraries that
depend on torque when the torque version changes. However, a significant
number of our users would be very unhappy having to re-test and possibly
recompile their codes against a recompiled OpenMPI. I think that in some
cases they are even required to use identical libraries across a whole
suite of runs to guarantee consistency. This makes it a little tough to
ever change the resource manager.
So, getting around to my questions: am I understanding the dependency
between torque, the task manager, and OpenMPI correctly?
And if so, is it really going to be necessary to recompile OpenMPI? What
do you all do in this situation? Is it a bad idea to run torque (on a big
cluster, ~1400 nodes and >10000 jobs/day) without using the task manager?
Any commentary or pointers to relevant documentation would be appreciated!
Pete Ruprecht