We have several multi-processor jobs that will not start. showq -b
shows them as deferred; later, showq -i shows them as idle; then they
are deferred again, and the cycle repeats. checkjob -v shows messages
similar to these:
Message[0] 9 nodes unavailable to start reserved job after 63 seconds
(reserved node vmp089 is in state 'Running' - check node)
Message[1] 9 nodes unavailable to start reserved job after 63 seconds
(reserved node vmp090 is in state 'Running' - check node)
Message[2] 9 nodes unavailable to start reserved job after 63 seconds
(reserved node vmp066 is in state 'Running' - check node)
Message[3] 10 nodes unavailable to start reserved job after 63 seconds
(reserved node vmp069 is in state 'Running' - check node)
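
For anyone comparing against their own output, the nodes named in messages like these can be collected with a short script. This is only a sketch: the regular expression is modeled on the four messages quoted above, the function names are made up for illustration, and real checkjob output may be formatted differently.

```python
import re

# Pattern modeled on the messages quoted above (an assumption, not a
# documented Moab format). \s+ lets it match across wrapped lines.
MESSAGE_RE = re.compile(
    r"Message\[(\d+)\]\s+(\d+)\s+nodes\s+unavailable\s+to\s+start\s+"
    r"reserved\s+job\s+after\s+(\d+)\s+seconds\s+"
    r"\(reserved\s+node\s+(\S+)\s+is\s+in\s+state\s+'([^']+)'"
)


def parse_checkjob_messages(text):
    """Return one dict per 'nodes unavailable' message found in text."""
    return [
        {
            "index": int(m.group(1)),
            "unavailable": int(m.group(2)),
            "seconds": int(m.group(3)),
            "node": m.group(4),
            "state": m.group(5),
        }
        for m in MESSAGE_RE.finditer(text)
    ]


def blocking_nodes(text):
    """Return the sorted, de-duplicated names of the reserved nodes."""
    return sorted({msg["node"] for msg in parse_checkjob_messages(text)})


# The messages from this post, wrapped as they appear above.
SAMPLE = """\
Message[0] 9 nodes unavailable to start reserved job after 63 seconds
(reserved node vmp089 is in state 'Running' - check node)
Message[1] 9 nodes unavailable to start reserved job after 63 seconds
(reserved node vmp090 is in state 'Running' - check node)
Message[2] 9 nodes unavailable to start reserved job after 63 seconds
(reserved node vmp066 is in state 'Running' - check node)
Message[3] 10 nodes unavailable to start reserved job after 63 seconds
(reserved node vmp069 is in state 'Running' - check node)
"""

if __name__ == "__main__":
    for node in blocking_nodes(SAMPLE):
        print(node)
```

The resulting node list can then be checked one node at a time with whatever node-inspection commands the site normally uses.
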
I haven't found anything revealing in the log files, but I am not
exactly sure what to look for. The nodes named in the messages have
jobs running on them, but they still have free processors.
We use TORQUE 2.3.6 and Moab 5.3.2 (revision 12709).
I would appreciate any suggestions.
Cheers--
Charles
---
Charles Johnson
Advanced Computing Center for Research and Education
Vanderbilt University
Office: 615-343-2776
Cell: 615-478-8799