<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi Marvin,<div><br></div><div>I couldn't do that, as it's a heavily loaded production system, so there are jobs coming to the WNs all the time. I would need a solution that affects only the stale jobs and not the entire worker node.</div><div><br></div><div>Cheers,</div><div>Paco.</div><div><br><div><div>On Apr 15, 2010, at 3:12 PM, Marvin Novaglobal wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">Hi Paco,<div>&nbsp;&nbsp; &nbsp;You're right. It is always safer to set the node to the offline state before clearing all the stale jobs. In my case, though, I just make sure there is no job registered on the execution node on the server side, and then I clear all the stale jobs.</div>
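<div>A minimal sketch of that offline-first sequence as a shell function. The pbsnodes -o/-c and momctl flags are the standard Torque ones; the function wrapper and the $wn argument are just for illustration:</div>

```shell
#!/bin/sh
# Sketch of the offline-first approach: drain the node, clear the stale
# jobs, then put the node back online. Assumes standard Torque CLI flags.
drain_and_clear() {
    wn="$1"                    # worker-node hostname
    pbsnodes -o "$wn" &&       # mark the node offline so no new jobs land on it
    momctl -h "$wn" -c all &&  # clear all stale jobs on the mom
    pbsnodes -c "$wn"          # clear the offline state again
}
```

<div>This still takes the node out of service for the duration of the clear, which is exactly the constraint on a loaded production system.</div>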
<div><br></div><div><br></div><div>Regards,</div><div>Marvin</div><div><br><br><div class="gmail_quote">On Wed, Apr 14, 2010 at 5:03 PM, Paco Bernabé <span dir="ltr">&lt;<a href="mailto:fbernabe@nikhef.nl">fbernabe@nikhef.nl</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div style="word-wrap:break-word">Hi Marvin,<div><br></div><div>Thanks for your reply; this actually works, but in order to execute 'momctl -h $wn -c all' I have to set the node to 'offline' in advance, so that no new jobs come into the node. Do you know of possible reasons for the jobs to get stuck when the status is EXITED? Is there anything relevant that I could find in the log files? Is there any other strategy that doesn't require setting the node to offline?</div>
<div><br></div><div>Thanks,</div><div>Paco.</div><div><div></div><div class="h5"><div><br></div><div><br></div><div><div><div>On Apr 14, 2010, at 8:01 AM, Marvin Novaglobal wrote:</div><br><blockquote type="cite">Hi,<div>
&nbsp;&nbsp; &nbsp;Perhaps you can use 'pbsnodes $wn' and grep to check whether there is a registered job running on the current compute node. Then, use 'momctl -c ALL' to clear all the stale jobs if there is no running job registered on the pbs_server side. Optionally, you can recycle the pbs_mom as well. So far, this has served us well.</div>
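<div>One way that check could be sketched, with the server-side test split into its own function. The "jobs =" line that pbsnodes prints for a busy node is an assumption about the output format, which can differ between Torque versions:</div>

```shell
#!/bin/sh
# Sketch: clear stale jobs on a node only when pbs_server has no jobs
# registered there. The "jobs =" pbsnodes output line is an assumption.

# Succeeds if the `pbsnodes $wn` output on stdin lists registered jobs.
has_registered_jobs() {
    grep -q '^[[:space:]]*jobs ='
}

clear_if_idle() {
    wn="$1"
    if pbsnodes "$wn" | has_registered_jobs; then
        echo "jobs still registered on $wn; not clearing" >&2
        return 1
    fi
    momctl -h "$wn" -c all    # no registered jobs: clear the stale ones
}
```

<div>The window between the pbsnodes check and the momctl clear is the reason offlining the node first is the safer variant.</div>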

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word">Hi,<div><br></div><div><div><div>I run 'momctl -d2 -h $wn' daily (via cron) in order to detect jobs that have got stuck. The server runs Torque/Maui 2.3.8 under CentOS 5.4. I wrote a small script that detects the jobs that wouldn't be cleared automatically by the Torque server and clears them with 'momctl -h $wn -c $job_id'. So far I've seen that these kinds of jobs have 'state' set to either PREOBIT or EXITED. In the first case (first example below) a SIGKILL signal is eventually sent by the Torque server; the script detects this after running 'tracejob -n 30 -q $job_id' and clears the job via momctl. In the second case (2nd and 3rd examples below) I've tried several times to clear the jobs via momctl, without success.</div>
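<div>The detection part of such a script might look roughly like the sketch below. The "job[&lt;id&gt;] ... state=&lt;STATE&gt;" line shape is an assumption about what 'momctl -d2' prints, so the pattern would need adjusting to the real output:</div>

```shell
#!/bin/sh
# Sketch of the stale-job detection: feed `momctl -d2 -h $wn` output in
# on stdin and print the ids of jobs stuck in PREOBIT or EXITED state.
# The "job[<id>] ... state=<STATE>" line format is an assumed format.
stuck_jobs() {
    awk -F'[][]' '/^job\[/ && /state=(PREOBIT|EXITED)/ { print $2 }'
}

# Each id this prints would then be checked with
#   tracejob -n 30 -q $job_id
# and cleared with
#   momctl -h $wn -c $job_id
```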

<div><br></div><div>After talking to some colleagues, one solution would be to stop the mom, remove the related files inside /var/spool/pbs/ and in /tmp/jobdir, and then start the mom again; but it would be great to find a better solution, as the system is in production. By the way, none of these jobs are in the queue anymore, so I cannot use qdel.</div>
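<div>That last-resort cleanup could be scripted roughly as below; the exact file names under /var/spool/pbs/ (and the use of the service wrapper) depend on the Torque build and OS setup, so treat those paths as assumptions:</div>

```shell
#!/bin/sh
# Rough sketch of the manual cleanup: stop the mom, remove the job's
# leftover files, start the mom again. Run on the worker node itself.
# The mom_priv/jobs file layout is an assumption about the spool tree.
purge_stale_job() {
    job="$1"                                      # e.g. 12345.server
    service pbs_mom stop
    rm -f /var/spool/pbs/mom_priv/jobs/"$job".*   # leftover job/script files
    rm -rf /tmp/jobdir/"$job"                     # job scratch directory
    service pbs_mom start
}
```

<div>Restarting the mom briefly interrupts monitoring of all jobs on the node, which is why this only makes sense as a last resort on a production system.</div>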

<div><br></div><div>So my questions are:</div><div><br></div><div>&nbsp;&nbsp; &nbsp;1.- Is there any alternative strategy for clearing the jobs, besides momctl and restarting the mom?</div><div>&nbsp;&nbsp; &nbsp;2.- Are there other examples/cases where jobs get stuck? If so, what is the strategy to clear them?</div>