Garrick Staples wrote:
> On Wed, Apr 19, 2006 at 09:34:59AM -0700, Austin Godber alleged:
> > Within the last few weeks I have been seeing my AMD cluster (using
> > torque and maui) pause and then eventually restart. Looking more
> > closely it appears that torque briefly dies (qstat can't talk to the
> > server) then starts running again. This seems to cause maui to hang for
> > 15-30 minutes (showq and pals don't work). Then miraculously maui
> > starts scheduling again and all is well.
> >
> > For what it's worth, I think I can force it to happen by doing an
> > interactive qsub like this:
> > qsub -I -v DISPLAY=desktop.host.com:0.0 -q x86_64
> > although I am not certain, as testing it is fairly disruptive. But it
> > definitely happens under other circumstances.
> >
> >
> > I have attached torque and maui logs. Maui is maui-3.2.6p13 and torque
> > is torque-2.0.0p0. I did not disable rpp.
>
> Things like this mostly happen with slow responses from MOMs. The
> situation is much improved in later versions of torque. Be sure you
> have poll_jobs enabled, and try to reproduce with the current versions
> of torque and maui (don't forget to build new maui after new torque is
> installed.)

Very slow responses from MOMs (or pbs_server) cause timeouts in the
communication. The quick fix is to set a higher timeout value in Maui,
for example:
RMCFG[base] TIMEOUT=90
(if you are using the 'base' name in your RMCFG configuration), but Garrick's
solution is much better as soon as you can do the upgrade.
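
For context, a rough sketch of how that line might sit in maui.cfg. The
TYPE=PBS attribute is my assumption about a typical Torque setup, not
something taken from the attached logs, and as far as I know TIMEOUT is
given in seconds:

```
# maui.cfg (fragment, illustrative only)
# 'base' must match the resource manager name already used in your config.
# TYPE=PBS is assumed here because the resource manager is Torque.
RMCFG[base] TYPE=PBS TIMEOUT=90
```

Maui will most likely need a restart before the new value takes effect.
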
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
National Supercomputer Centre in Linkoping, Sweden
http://www.nsc.liu.se