[torqueusers] CONTRIBUTION: pestat utility version 2.0

Hi Tom,
I'm glad you like the pestat utility. It is actually a
great tool for diagnosing the kind of problem that you describe.

I have seen such NONE* zombie jobs many times. They may appear
briefly while a job is finishing up, but they should never persist
for more than a few seconds. Maybe you have used "qdel -p"
to kill a hung job? This may very well leave zombie jobs
in the pbs_mom processes on the sister nodes, while pbs_server
thinks that these jobs are long gone.

Anyway, the solution is to make sure the node doesn't run any jobs
(you may want to offline the node and wait until its jobs finish).
Log in to the node and stop pbs_mom (service pbs_mom stop),
go to /var/spool/torque/mom_jobs/ and you will quite likely
find some job status files and directories belonging to those
zombie job ids. Do an "rm -rf" of those zombie job files and
then start pbs_mom again; its memory of the jobs has now
been wiped clean. In all the cases that I have seen,
this clears away the zombie jobs reliably (we currently run Torque
version 2.1.8).
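
The cleanup described above could be scripted roughly as follows. This is
only a sketch: purge_zombie_jobs is a hypothetical helper name, and it
assumes that every file and directory belonging to a job starts with
"<jobid>." in the spool directory, so check the directory listing by hand
before trusting the pattern.

```shell
# Hypothetical helper: delete leftover pbs_mom state for the given job ids.
# Assumption: entries for a job are named "<jobid>.<something>".
purge_zombie_jobs() {
    spool_dir=$1
    shift
    for jobid in "$@"; do
        for entry in "$spool_dir/$jobid".*; do
            [ -e "$entry" ] || continue   # glob matched nothing
            echo "removing $entry"
            rm -rf -- "$entry"
        done
    done
}

# Intended use on the affected node, as root, once it runs no real jobs:
#   service pbs_mom stop
#   purge_zombie_jobs /var/spool/torque/mom_jobs 423 578
#   service pbs_mom start
```

The trailing dot in the pattern keeps job 423 from also matching, say, job
4230. Offlining the node beforehand is usually done with "pbsnodes -o",
assuming your Torque version provides that option.
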
Best regards,
Ole
> I copied and installed pestat - Works great on RHEL4.
>
> Two questions tho.
>
> ./pestat -f
> Listing only nodes that are flagged by *
> node    state  load    pmem  ncpu  mem   resi  usrs   tasks  jobids/users
> node07  free   0.61*   3946  4     5866  1229  3/2    0
> node08  free   0.00    3946  4     5866  203   2/2    0*     423 NONE* 578 NONE*
> node12  free   0.99    3946  4     5866  324   3/3    1*     418 NONE* 422 NONE* 1034 rsnxgp
> node13  free   0.75    3946  4     5866  163   2/2    1*     417 NONE* 924 NONE* 1023 rsnxgp
> node55  free   1.20*   3946  4     5866  308   2/2    0*     420 NONE* 421 NONE*
> node56  free   1.02*   3946  4     5866  225   3/2    0*     416 NONE* 419 NONE* 425 NONE* 577 NONE* 579 NONE*
> silvio  free   0.66*   8112  4     8017  3686  37/9*  0
>
> On node08 - is it saying that the mom logs think that there is a job 423
> and 578 but there really is not (load=0.0)?
>
> How can I kill or clean those records?