Hi,
I've been running Torque 2.3.0 on a very small cluster for a while
with no problems. But I just added an 8-core server to the cluster
with np=4, and I'm now often hitting a strange condition where some jobs
that run on this new 8-core machine never leave the running
state, even long after their scripts have actually exited. Below is
the tracejob output from the offending node, with pbs_mom's loglevel=5,
for a job that never finishes. It looks like a race condition
in which the mom believes it has already sent an obit, but the obit
never actually goes out, so the job stays stuck in that state.
04/16/2008 12:47:27 M evaluating limits for job
04/16/2008 12:47:27 M about to fork child which will become job
04/16/2008 12:47:27 M phase 2 of job launch successfully completed
04/16/2008 12:47:27 M task/session info loaded
04/16/2008 12:47:27 M job successfully started
04/16/2008 12:47:27 M job 1100315.lnxmaster reported successful start on 1 node(s)
04/16/2008 12:47:27 M encoding "send flagged" attr: session_id
04/16/2008 12:47:28 M job is in non-exiting substate 42, no obit sent at this time
04/16/2008 12:47:28 M sending signal 9 to task
04/16/2008 12:47:28 M scan_for_terminated: job 1100315.lnxmaster task 1 terminated, sid=1190
04/16/2008 12:47:28 M job was terminated
04/16/2008 12:47:28 M master task has exited - sending kill job request to all sisters
04/16/2008 12:47:28 M task is dead
04/16/2008 12:47:28 M sending preobit jobstat
04/16/2008 12:47:28 M performing job clean-up
04/16/2008 12:47:34 M job is in non-exiting substate 58, no obit sent at this time
04/16/2008 12:47:34 M job is in non-exiting substate 58, no obit sent at this time
(this last message keeps repeating for about 10 more seconds, and then
the log stops entirely)
Thanks for any help you can offer.
Ari