Hey all -
I am using the job-array dependency functionality and I have found what I think is a repeatable bug in torque-3.0.0.
I routinely submit a finalization job that depends on hundreds of jobs which are grouped into several job arrays. In certain
circumstances, depending on the run states of the job arrays, the finalization job is started BEFORE all the jobs it depends on have finished.
I am using the qsub format "-W depend=afterokarray:1[]:2[]" which is working fine except for the following case:
If the finalization job depends on 2 job arrays finishing, and array#1 is partially running (say 5 out of 10 are R, the other 5 are Q)
while array#2 finishes completely, then at that moment the finalization job is released from H, only to be reset to H since array#1
has not finished yet.
Here is the server log showing the state transitions:
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Clearing HOLD_s due to dependencies
10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:26;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:26;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
10/12/2011 10:58:26;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies
Is this expected? This tiny transition is messing our pipeline up since we don't support checkpointing and
the job state gets all screwy from that point onward.
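
In case it helps anyone check their own server logs for the same thing, here is a rough Python sketch that flags a "Clearing HOLD_s" entry followed by a later "Setting HOLD_s" entry for the same job. It is not part of torque. It assumes the semicolon-separated log layout shown above (timestamp;code;daemon;type;job id;message), and the function name find_hold_flaps is made up for illustration.

```python
def find_hold_flaps(lines):
    """Return (job_id, cleared_at, reset_at) for each Clearing -> Setting HOLD_s pair.

    Assumes each log line looks like:
    timestamp;code;daemon;type;job_id;message
    """
    last_clear = {}  # job_id -> timestamp of the most recent unmatched "Clearing"
    flaps = []
    for line in lines:
        parts = line.strip().split(";", 5)
        if len(parts) != 6:
            continue  # skip lines that do not match the assumed layout
        timestamp, _code, _daemon, kind, job_id, message = parts
        if kind != "Job":
            continue
        if message.startswith("Clearing HOLD_s"):
            last_clear[job_id] = timestamp
        elif message.startswith("Setting HOLD_s") and job_id in last_clear:
            # The hold was released and then put back: the transition described above.
            flaps.append((job_id, last_clear.pop(job_id), timestamp))
    return flaps


# Three consecutive entries taken from the log excerpt above.
sample = [
    "10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies",
    "10/12/2011 10:58:11;0008;PBS_Server;Job;15320.madrid.local;Clearing HOLD_s due to dependencies",
    "10/12/2011 10:58:12;0008;PBS_Server;Job;15320.madrid.local;Setting HOLD_s due to dependencies",
]

for job_id, cleared_at, reset_at in find_hold_flaps(sample):
    print(f"{job_id}: hold cleared at {cleared_at}, set again at {reset_at}")
```

Against a real log, something like find_hold_flaps(open(path)) should work, where path is wherever your pbs_server writes its logs. Only the first "Setting" after each "Clearing" is reported, so the repeated HOLD_s lines do not produce duplicates.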
thx -
Fred