> You could imagine jobs which checkpoint often, and automatically restart
> themselves from a checkpoint if a machine fails like this.
I find that apps (custom or commercial) normally need some help to restart:
some need to be pointed at the checkpoint to start with, others need to
be told it's a restart rather than from-scratch, etc. and I expect that
anything but a dedicated, single-person cluster will also be running a
scheduler, which means that the app would, upon starting, need to queue
a restart of itself as a dependency. we actually have one group that does
this, but their main script already contains multiple iterations of GROMACS
(apparently to force re-load-balance). their code contains some intelligence
about catching crashed processes, finding where to pick up, etc. (this group
also tends to be one that asks interesting questions about, for instance,
when attributes propagate across NFS vs Lustre filesystems. I had never
looked very closely, but various NFS clients have quite different behavior,
including some oldish versions that will cache stale attrs *indefinitely*.)
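
as a rough illustration of the "finding where to pick up" part: a minimal sketch, not that group's actual code. the `ckpt_<step>.dat` naming scheme and the `--fresh` / `--restart-from` flags are made up for the example; a real app will have its own conventions for both.

```python
import re
from pathlib import Path
from typing import List, Optional

# hypothetical naming scheme: ckpt_<step>.dat, e.g. ckpt_000500.dat
CKPT_RE = re.compile(r"^ckpt_(\d+)\.dat$")


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the non-empty checkpoint with the highest step number, or None."""
    if not ckpt_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for p in ckpt_dir.iterdir():
        m = CKPT_RE.match(p.name)
        # skip zero-length files: a crash mid-write can leave an empty
        # checkpoint (a truncated but non-empty one needs a real validity check)
        if m and p.is_file() and p.stat().st_size > 0:
            step = int(m.group(1))
            if step > best_step:
                best_step, best_path = step, p
    return best_path


def build_command(app: str, ckpt_dir: Path) -> List[str]:
    """Build the app's argv: restart if a checkpoint exists, else from scratch."""
    ckpt = latest_checkpoint(ckpt_dir)
    if ckpt is None:
        return [app, "--fresh"]                 # hypothetical flag
    return [app, "--restart-from", str(ckpt)]   # hypothetical flag
```

the job script would run whatever `build_command` returns, so the same script works for the first submission and for every requeued restart.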
anyway, we strongly encourage checkpointing, and usually say that you should
checkpoint as frequently as you can without inducing a significant IO
overhead. our main clusters have Lustre filesystems that can sustain several
GB/s, so I usually rule-of-thumb it as "a couple times a day, and more
often with higher node-count". fortunately our node failure rate is fairly
low, so we don't push very hard. it's easy to imagine a large-scale job
needing to checkpoint ~hourly, though: if your spontaneous node failure rate
is 1/node-year, then a 365-node job sees about one failure per day, and that's
not a very big job...
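
the arithmetic above is easy to script. a sketch, assuming node failures are independent so the job's failure rate is just nodes times the per-node rate. the checkpoint interval uses Young's classic approximation, sqrt(2 * checkpoint cost * MTBF), which is a textbook rule of thumb I'm swapping in here, not something we measure; the 5-minute checkpoint write time is an assumed number.

```python
import math

HOURS_PER_YEAR = 365 * 24


def job_mtbf_hours(nodes: int, failures_per_node_year: float) -> float:
    """Mean time between failures for the whole job, in hours.

    Assumes independent node failures, so rates simply add up.
    """
    return HOURS_PER_YEAR / (nodes * failures_per_node_year)


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's approximation for the checkpoint interval, in hours."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    checkpoint_cost = 5 / 60  # assumed: a checkpoint takes 5 minutes to write
    for nodes in (64, 365, 2048):
        mtbf = job_mtbf_hours(nodes, 1.0)  # 1 failure per node-year
        interval = young_interval_hours(checkpoint_cost, mtbf)
        print(f"{nodes:5d} nodes: job MTBF {mtbf:7.1f} h, "
              f"checkpoint every ~{interval:.1f} h")
```

for the 365-node case this gives a job MTBF of 24 hours and, with the assumed 5-minute checkpoint, an interval of about 2 hours; bigger jobs or cheaper checkpoints push the interval down from there.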
> My philosophy though would be to leave a machine down till the cause of
> the crash is established.
absolutely. this is not an obvious principle to some people, though:
it depends on whether your model of failures involves luck or causation ;)
and having decent tools (IPMI SEL for finding uncorrectable ECC errors,
overheating, etc; console logging for panics) is what lets you rule out
bad juju...