>>> E.g. you see a system disk going bad, but the user
>>> will lose all their output unless the job runs for
>>> 4 more weeks...
until fairly recently (sometime this year), we didn't constrain
job length at all. we now have a 1-week limit - generally
argued on the basis that longer jobs are expected to checkpoint.
we also provide blcr (checkpoint/restart) for serial/threaded jobs.
I have mixed feelings about this. the purpose of organizations
providing HPC is to _enable_, not obstruct. in some cases, this
could mean working with a group to find an alternative better than,
for instance, not checkpointing a resource-intensive job.
our node/power failure rates are pretty low - not enough to justify
a 1-week limit. but to be honest, the main motive is probably to
increase cluster churn - essentially improving scheduler fairness.
> It's not inevitable that the policy be that 3 month jobs are allowed.
if a length limit is to be justified on probability-of-failure,
it should scale ~ 1/nnodes (a job spanning more nodes is proportionally
more likely to hit a failure); if justified on failure cost, 1/ncpus.
unfortunately, the other extreme would be a sort of "invisible hand"
where users experimentally derive the failure rate from their own
rate of failed jobs ;(
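the 1/nnodes scaling is easy to sketch: assuming independent node
failures at a constant per-node rate, a job on n nodes survives t days
with probability exp(-n*t/MTBF), so holding the acceptable failure
probability fixed makes the longest sensible runtime scale as 1/n.
the MTBF and survival target below are made-up illustration numbers,
not anyone's real failure data:

```python
import math

def survival_prob(n_nodes, days, mtbf_days_per_node=1000.0):
    """P(no node in the job fails during the run), assuming
    independent exponential failures per node."""
    rate = n_nodes / mtbf_days_per_node  # combined failure rate of the job
    return math.exp(-rate * days)

def max_runtime(n_nodes, target_survival=0.95, mtbf_days_per_node=1000.0):
    """longest run keeping survival probability >= target;
    note the 1/n_nodes scaling."""
    return -math.log(target_survival) * mtbf_days_per_node / n_nodes

# doubling the node count halves the defensible length limit:
assert abs(max_runtime(64) - 2 * max_runtime(128)) < 1e-9
```

under these toy numbers a 64-node job stays under 5% failure risk for
less than a day, while a single-node job can run for weeks - which is
roughly why a single flat limit for all job shapes is hard to justify
on failure grounds alone.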
personally, I think facilities should permit longer jobs, though perhaps
only after discussing the risks and alternatives. an economic approach
might reward checkpointing with a fairshare bonus - merely rewarding
short jobs seems wrong-headed.