On 2/20/09 10:23 AM, "Prentice Bisbal" <prentice at ias.edu> wrote:
Bogdan Costescu wrote:
> On Fri, 20 Feb 2009, Glen Beane wrote:
>> I looked into SGE a long time ago, but I found the MPI support
>> terrible when compared to TORQUE/PBS Pro
>
> Indeed and AFAIK is still in a similar state today. There was talk for a
> long time on the SGE devel list for a TM API to be added, but it seems
> like this is not considered a high priority feature. I've not only
> looked but actually used SGE for about 1 year (IIRC, about 5 years ago)
> during which I had to spend time fixing the interactions with LAM/MPI
> and many of the parallel applications that were used on that cluster -
> and finally gave up. On the plus side, during the time that SGE was
> used, I have never seen a process left behind from a job and the
> queueing system itself seemed very stable - something that I could not
> say for the OpenPBS/Torque that I've also tested at that time.
>
You need to take a fresh look at SGE and Open MPI. Open MPI seems to be
the new de facto standard MPI library, and you can compile it to be
fully integrated with both SGE and Torque. I just set up a cluster using
SGE and Open MPI (built with the --with-sge option), and there's no
need to tinker with SGE's MPI startup wrapper scripts like in the
past. Everything just works: SGE and Open MPI communicate directly with
each other, and SGE has complete control over ALL the MPI processes.
A couple of years ago I did set up SGE with MPICH, and had to tinker with
SGE's startup scripts to get everything to work correctly. It was not that
difficult.
I could be wrong, but I think at that time, to use Torque you needed to
compile a separate mpiexec program developed by a third party to get
"tight integration" between MPI and Torque.
It depended on the MPI implementation (and it still does). If it supported
TM, then no third-party job launcher was necessary. For things like MPICH 1.x
that did not have TM support, OSC's mpiexec job launcher provided tight
Torque/PBS integration. LAM/MPI and Open MPI have had TM support for a long
time.
--
Glen L. Beane
Software Engineer
The Jackson Laboratory
Phone (207) 288-6153