First, update to torque-1.1.0p2 :)
NCSA has been stress testing all the snapshots up to yesterday's release
on a 900-node cluster. The last snapshot has been stable.
All jobs, or just the ones running on the nodes that died?
There are two methods to clear a job:
less painful:
find the mom node (the first node that qstat -n lists)
stop the pbs_mom process on that node
go to /var/spool/pbs/mom_priv/jobs (or wherever you've installed pbs/torque) and
remove all entries for the jobid (all files and dirs)
restart the mom
qdel the job.
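The file-removal step of the less painful method could be scripted roughly as below. This is only a sketch: the helper name clear_mom_job is made up for illustration, and it assumes that entries in the jobs directory are named after the job's sequence number followed by a dot (for example 1234.master.JB), which you should confirm on your own install before deleting anything.

```shell
#!/bin/sh
# Sketch only: remove every entry for one job from a mom's jobs directory.
# Assumptions (not from the post): the helper name, and that entries are
# named "<sequence number>.<something>", e.g. 1234.master.JB.

clear_mom_job() {
    jobid=$1
    # Default path is the one given in the post; adjust to your install.
    jobs_dir=${2:-/var/spool/pbs/mom_priv/jobs}

    if [ -z "$jobid" ] || [ ! -d "$jobs_dir" ]; then
        echo "usage: clear_mom_job JOBID [JOBS_DIR]" >&2
        return 1
    fi

    # "1234.master" -> "1234"
    seq=${jobid%%.*}
    case $seq in
        ''|*[!0-9]*)
            echo "job id must start with a number: $jobid" >&2
            return 1
            ;;
    esac

    # Matching on "1234." keeps job 12345 from being touched by accident.
    for entry in "$jobs_dir"/"$seq".*; do
        [ -e "$entry" ] || continue    # the glob matched nothing
        rm -rf -- "$entry"
        echo "removed $entry"
    done
}
```

Run it on the mom node only after pbs_mom has been stopped, e.g. `clear_mom_job 1234.master`, then restart the mom and qdel the job as described above.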
more painful:
go to your torque server
stop the pbs_server process
cd /var/spool/pbs/server_priv/jobs
remove all entries for the offending job
restart the server.
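For the server-side method, a slightly more cautious variant is to move the job's entries aside rather than delete them, so they can be put back if the wrong job was picked. This is a sketch, not part of torque: the helper name quarantine_server_job and the holding directory /tmp/pbs_removed_jobs are assumptions for illustration, as is the naming of entries by sequence number followed by a dot.

```shell
#!/bin/sh
# Sketch only: move a job's entries out of the server's jobs directory.
# Assumptions (not from the post): the helper name, the holding directory,
# and that entries are named "<sequence number>.<something>".

quarantine_server_job() {
    jobid=$1
    # Default path is the one given in the post; adjust to your install.
    jobs_dir=${2:-/var/spool/pbs/server_priv/jobs}
    hold_dir=${3:-/tmp/pbs_removed_jobs}

    if [ -z "$jobid" ] || [ ! -d "$jobs_dir" ]; then
        echo "usage: quarantine_server_job JOBID [JOBS_DIR] [HOLD_DIR]" >&2
        return 1
    fi

    # "1234.master" -> "1234"
    seq=${jobid%%.*}
    case $seq in
        ''|*[!0-9]*)
            echo "job id must start with a number: $jobid" >&2
            return 1
            ;;
    esac

    mkdir -p "$hold_dir" || return 1

    moved=0
    for entry in "$jobs_dir"/"$seq".*; do
        [ -e "$entry" ] || continue    # the glob matched nothing
        mv -- "$entry" "$hold_dir"/ && moved=$((moved + 1))
    done
    echo "moved $moved entries for job $seq to $hold_dir"
}
```

Run it on the torque server only while pbs_server is stopped, e.g. `quarantine_server_job 1234.master`, then restart the server. Once the queue looks right, the holding directory can be deleted.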
Hope this helps.
On Wed, 13 Oct 2004 Stewart.Samuels at aventis.com wrote:
> We have a 90 compute node, dual master node beowulf cluster executing torque-1.0.1p6 and maui-3.2.6p6. The problem we are seeing is that when a compute node fails, jobs get hosed in the queue and cannot be deleted either by the user or root administrator. Has anyone else seen this problem? If so, how did you clear the job(s)? The only way I have been able to clear the job(s) is to rebuild a new server database by performing the command "pbs_server -t create" and then restarting the whole system normally subsequently with a "/sbin/services pbs_server reboot" scenario.
>
> We are running Redhat's EL Advanced Server 3.0 Update 2 on the cluster.
>
> Any help in removing the jobs without rebuilding the database would be greatly appreciated.
>
> Thanks.
>
> Stewart Samuels
> Technical Advisor
> Global Unix Engineering Services
> 1041 Route 202-206
> Bridgewater, NJ 08807
>
> (908) 231-4762
> Stewart.Samuels at Aventis.com
--
---
Daniel LaPine, System Engineer
National Center for Supercomputing Applications (NCSA)
email: dlapine at ncsa.uiuc.edu
phone: 217-244-9294