Hi,
I'm trying to write a health check script for my nodes. When the health
check fails, I want it to reject the job, set the node offline, and make
the job immediately eligible to run again somewhere else. To get a grip
on what happens when the epilogue script exits with different values, I
have this in my epilogue script (a Perl script):
exit (4);
I have also specified a fake resource associated with my test nodes.
I then submit my job:
qsub -l feature=fakeresource testjob.sh
The job gets rejected correctly and sent back to the queue, but it is now
deferred. When I release the hold on it, it runs again, but this time it
ignores my fakeresource request and runs on the next available node.
How do I get this to return a job to the queue without a hold on it
and get my resource request to stick?
Actually, any pointers to some sample prologue/epilogue scripts or
more information about how they work would be appreciated.
Thanks,
jbh
John Hanks
Utah State University