Occasionally, when I have to notify a user of a necessary shutdown, I find
myself looking into sad eyes. "Duhh. Another 200 Teracycles lost."
Our users often run large number-crunching jobs, such as models in N dimensions.
Mostly, these jobs cannot be continued once interrupted (although some users
try to implement restartability in their code).
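
To illustrate what "restartability in their code" tends to look like today,
here is a minimal sketch of application-level checkpointing. Everything in it
is an assumption made for illustration: the job_state layout, the function
names and the toy computation are hypothetical, not taken from any user's
actual program. The state is written to a temporary file and renamed into
place, so an interrupted save does not destroy the previous checkpoint.

```c
#include <stdio.h>

/* Hypothetical job state: everything needed to resume the computation. */
struct job_state {
    unsigned long step;  /* next iteration to run */
    double        sum;   /* partial result so far */
};

/* Write the state to "<path>.tmp", then rename it over <path>. */
static int save_state(const char *path, const struct job_state *st)
{
    char tmp[1024];
    FILE *fp;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    if ((fp = fopen(tmp, "wb")) == NULL)
        return -1;
    if (fwrite(st, sizeof *st, 1, fp) != 1) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path);
}

/* Returns 0 and fills *st if a complete checkpoint exists, -1 otherwise. */
static int load_state(const char *path, struct job_state *st)
{
    struct job_state buf;
    FILE *fp = fopen(path, "rb");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fread(&buf, sizeof buf, 1, fp) == 1);
    fclose(fp);
    if (!ok)
        return -1;
    *st = buf;
    return 0;
}

/* Toy job: sum the integers 0..total-1, running at most `budget` steps
 * per invocation and resuming from the checkpoint file if one exists. */
static struct job_state run_job(const char *path, unsigned long total,
                                unsigned long budget)
{
    struct job_state st = { 0, 0.0 };
    unsigned long done = 0;

    (void)load_state(path, &st);  /* no checkpoint: start from scratch */
    while (st.step < total && done < budget) {
        st.sum += (double)st.step;
        st.step++;
        done++;
    }
    (void)save_state(path, &st);
    return st;
}
```

Every program has to reinvent this by hand, and it only covers state the
author remembered to save, which is the burden I would like to take off
the users.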
My proposal is to add checkpointing capabilities to NetBSD. I thought about
implementing checkpointing as a new system call 'chkpoint' (to my knowledge
IRIX 6.2 has implemented something similar; I don't know details, though). A
process designed to be restartable would then simply install a signal handler
which would invoke 'chkpoint' when SIGXCPU (or a similar signal) is delivered.
Maybe even a new signal could be added: SIGCHKP.
From discussions with Ignatios Souvatzis and Christoph Badura I am aware
of the inherent problems of checkpointing: open files (vnode -> filename
problem), initialization of devices (initialization history lost), shared
memory segments and so on.
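
To make the open-files problem concrete: the kernel holds a vnode, not a
name, so a checkpointer working in userland has to remember the name itself.
The following is a minimal sketch of one possible workaround, assuming the
program opens its files through a wrapper. All names here (tracked_open,
record_offsets, restore_files, the table sizes) are hypothetical, and the
sketch ignores the hard cases: a file that was renamed, deleted or modified
in the meantime, and the need to restore the original descriptor number.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_TRACKED  64
#define MAX_PATH_LEN 256

struct tracked_file {
    int   in_use;
    int   fd;
    int   flags;               /* flags passed to open() */
    off_t offset;              /* file offset at checkpoint time */
    char  path[MAX_PATH_LEN];  /* name the file was opened under */
};

static struct tracked_file tracked[MAX_TRACKED];

/* open() wrapper that remembers the path and flags for each descriptor. */
static int tracked_open(const char *path, int flags, mode_t mode)
{
    int i, fd;

    if (strlen(path) >= MAX_PATH_LEN)
        return -1;
    for (i = 0; i < MAX_TRACKED && tracked[i].in_use; i++)
        ;
    if (i == MAX_TRACKED)
        return -1;
    if ((fd = open(path, flags, mode)) < 0)
        return -1;
    tracked[i].in_use = 1;
    tracked[i].fd = fd;
    tracked[i].flags = flags;
    tracked[i].offset = 0;
    strcpy(tracked[i].path, path);
    return fd;
}

/* At checkpoint time: record the current offset of every tracked file. */
static void record_offsets(void)
{
    int i;

    for (i = 0; i < MAX_TRACKED; i++)
        if (tracked[i].in_use)
            tracked[i].offset = lseek(tracked[i].fd, 0, SEEK_CUR);
}

/* At restart: reopen every tracked file by name and seek back to the
 * recorded offset. Returns the number of files restored, or -1. */
static int restore_files(void)
{
    int i, fd, n = 0;

    for (i = 0; i < MAX_TRACKED; i++) {
        if (!tracked[i].in_use)
            continue;
        /* Never create or truncate the file again on restart. */
        fd = open(tracked[i].path,
                  tracked[i].flags & ~(O_CREAT | O_TRUNC | O_EXCL));
        if (fd < 0)
            return -1;
        if (lseek(fd, tracked[i].offset, SEEK_SET) == (off_t)-1) {
            close(fd);
            return -1;
        }
        tracked[i].fd = fd;
        n++;
    }
    return n;
}
```

Device state and shared memory segments have no comparably simple
bookkeeping, which is why they remain on the problem list.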
Most of the checkpointing can probably be done in userland, similar to the
Condor batch system (http://www.cs.wisc.edu/condor/checkpointing.html)
implementation or the 'save_world' routines by Bennet Yee
(ftp://Play.Trust.CS.CMU.Edu/usr0/ftpguest/pub/save_world.tar.Z).
(Hmm. No new syscall? Shouldn't I have submitted this proposal to tech-kern
in the first place?)
One additional advantage of checkpointing would be that processes could be
migrated from one system to another (Condor uses that).
I'm not a kernel hacker, and therefore I will stop here and leave further
discussions to the experts. I hope that enough of you find checkpointing
appealing, so that it will eventually be implemented.
--
Dr. Alexandre Wennmacher
Institut fuer Geophysik und Meteorologie    wennmach@geo.Uni-Koeln.DE
Universitaet zu Koeln                       phone +49 221 470 - 3387
D-50923 Koeln                               fax   +49 221 470 - 5198