> TIGRANA@dstiuk.ccmail.compuserve.com writes ("Linux kernel and disaster recovery."):
> > What if the Linux server itself crashes? If it is under UPS and
> > there was some clever kernel module that would be able to somehow
> > save the state of all (or specific) running processes and write to a
> > separate disk partition and then after reboot to be able to restore
> > the "memory dump" from the partition into memory thus revitalising
> > all those running processes that would be very nice.
>
> The term you are looking for is "checkpointing". You take a snapshot
> of a process every so often and when the machine is rebooted after a
> crash, you can restore the process to the state of the snapshot.
> Alternatively, you can stop a process and start it again later
> (perhaps on a different machine).
>
> > Of course, I understand that the network sockets will be lost but it
> > is fine because with the scheme described above one simply
> > reattaches to the sessions using the UNIX domain socket and resumes
> > it.
[SNIPPED]

If kernel driver code were written so that it, too, was checkpointed, then it would be possible to restore a running kernel and all the user tasks. VAX/VMS has had this capability since Version 6.0.

Basically, the boot process commences as usual, getting all the hardware interfaces up and running, then the drivers' buffers and internal software state machines are "overlaid" with the previously saved image. Then the rest of the kernel is overlaid with the saved memory image. Then a "return" is made from the previous checkpoint trap and the machine runs as it was running before.

VAX/VMS uses this for a "fast boot" option. The snapshot is taken when the system is up with all its normal "System" tasks running.

Then when you "fast boot", the machine will be quickly restored to this saved state.

__BUT__ The problem is that you don't want to restore the machine to its exact saved state at the moment it crashed. It will immediately crash again! Most crashes are the result of the CPU executing garbage, either because of a hardware error or a programming error. You need to restore the state of the machine from before it started executing garbage, and you don't know when that was.

In the days when it took 30 to 40 minutes to boot a VAX, it was useful to save the state of a perfectly-running machine so that it could be quickly re-booted within a minute or two.

Now we can boot the most complex machine in 30 seconds or so. It really doesn't make much sense to save this "freshly booted" state. Instead, servers and database engines should be written to recover quickly by doing internal checkpointing at regular intervals. If you pull the plug and reboot the machine, these programs should "know" how to recover completely in a very short period of time. The problems cited about sockets and file descriptors are not problems at all. The checkpointing routines save all the information necessary to reestablish logical connections, including any security considerations. That is just part of the complete solution and is application specific.
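
To make "internal checkpointing at regular intervals" concrete, here is a minimal sketch in Python. It is not taken from any particular server or database engine. The function names, the JSON file format and the toy "sum a list" workload are all my own assumptions, chosen only to show the pattern: write the checkpoint to a temporary file, force it to disk, and rename it over the old one so that a crash never leaves a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically replace the checkpoint at `path` with `state` (a JSON-able dict)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())   # data must be on disk before the rename
        os.replace(tmp_path, path)   # old or new checkpoint, never a mixture
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def process_items(items, path, every=100):
    """Hypothetical workload: sum `items`, checkpointing every `every` items."""
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    for i in range(state["next_index"], len(items)):
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the plug is pulled in the middle of process_items(), the next run picks up from the last checkpoint and redoes at most `every` items. The position and the running total are saved together, so the redone work is never counted twice. A real server would also record whatever it needs to reestablish its logical connections, which, as said above, is application specific.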

Checkpointing of database engines generally forces a designer to produce a superior product because more discipline must be used during code development. When a programmer has to think about how to unroll and redo something that could be terminated at any instant, the result is usually "lean and mean" code.
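
As one illustration of the "unroll and redo" discipline, here is a small redo-log sketch, again in Python. The record layout (a CRC-32 followed by JSON, one record per line) and the account/delta fields are hypothetical, made up for this example and not taken from any real database. The point is that recovery replays every intact record and drops a record that was cut off in mid-write.

```python
import json
import os
import zlib


def append_record(log_path, record):
    """Append one record as '<crc32 hex> <json>' and force it to disk."""
    payload = json.dumps(record, sort_keys=True)
    line = "%08x %s\n" % (zlib.crc32(payload.encode()), payload)
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()
        os.fsync(log.fileno())


def recover(log_path):
    """Rebuild balances by replaying intact records; stop at the first torn one."""
    balances = {}
    try:
        log = open(log_path)
    except FileNotFoundError:
        return balances          # nothing logged yet
    with log:
        for line in log:
            crc, _, payload = line.rstrip("\n").partition(" ")
            try:
                intact = int(crc, 16) == zlib.crc32(payload.encode())
            except ValueError:
                intact = False
            if not intact:
                break            # write was interrupted: discard it and stop
            rec = json.loads(payload)
            account = rec["account"]
            balances[account] = balances.get(account, 0) + rec["delta"]
    return balances
```

An update that never made it completely into the log simply did not happen, and everything before it is redone on restart. Having to decide that up front is the kind of discipline meant above.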