This is our recent round of checkpoint/restart patches. It can checkpoint and restart interactive sessions of 'screen' across a kernel reboot. Please consider applying to -mm.

Patches 1-17 are clean-ups and preparations for c/r:
* 1,2,3,4 and 9,10: cleanups, also useful for c/r.
* 5,6: fix freezer control group
* 7,8: extend freezer control group for c/r.
* 11-17: clone_with_pid

Patch 32 implements a deferqueue - a mechanism for a process to defer work for some later time (unlike a workqueue, it is designed for the work to execute in the context of the same/original process).
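To illustrate the deferqueue idea, here is a minimal userspace sketch, not the code from patch 32. All names (struct dq, dq_add, dq_run, dq_example_bump) are hypothetical and chosen only for this example. The point it shows is that deferred work is queued now and later runs in the context of whoever calls dq_run(), rather than in a separate worker thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is NOT the interface from patch 32. */
typedef void (*dq_func_t)(void *data);

struct dq_entry {
	struct dq_entry *next;
	dq_func_t func;
	void *data;
};

struct dq {
	struct dq_entry *head;
	struct dq_entry **tail;	/* address of the last 'next' link (FIFO append) */
};

void dq_init(struct dq *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* Queue func(data) for later. Returns 0 on success, -1 if out of memory. */
int dq_add(struct dq *q, dq_func_t func, void *data)
{
	struct dq_entry *e;

	assert(func != NULL);
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->next = NULL;
	e->func = func;
	e->data = data;
	*q->tail = e;
	q->tail = &e->next;
	return 0;
}

/*
 * Run every queued entry in FIFO order, in the caller's own context.
 * The list is detached first, so work queued by a callback stays on
 * the queue for a later dq_run() call. Returns the number of entries run.
 */
int dq_run(struct dq *q)
{
	struct dq_entry *e = q->head;
	int n = 0;

	dq_init(q);
	while (e) {
		struct dq_entry *next = e->next;

		e->func(e->data);
		free(e);
		e = next;
		n++;
	}
	return n;
}

/* Example callback for the sketch: increments the int that data points to. */
void dq_example_bump(void *data)
{
	*(int *)data += 1;
}
```

A caller would dq_init() a queue, dq_add() work while it is in the middle of some operation, and call dq_run() once it reaches a point where the deferred work is safe to perform.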

Thanks,

Oren.

----

Application checkpoint/restart (c/r) is the ability to save the state of a running application so that it can later resume its execution from the time at which it was checkpointed, on the same or a different machine.

This version adds many new features, including support for unix domain sockets, fifos, pseudo-terminals, and signals (see the detailed changelog below).

With these in place, it can now checkpoint and restart not only batch jobs, but also interactive programs using 'screen'. For example, users can checkpoint a 'screen' session with multiple shells, upgrade their kernel, reboot, and restart their interactive 'screen' session from before!

This patchset was compiled and tested against v2.6.31. For more information, check out Documentation/checkpoint/*.txt

Q: How useful is this code as it stands in real-world usage?
A: The application can be single- or multi-process, and single- or multi-threaded. It handles open files (regular files/directories on most file systems, pipes, fifos, af_unix sockets, /dev/{null,zero,random,urandom} and pseudo-terminals). It supports shared memory and sysv IPC (except undo of semaphores). It's suitable for many types of batch jobs as well as some interactive jobs. (Note: it is assumed that the fs view is available at restart).

Q: What can it checkpoint and restart?
A: A (single threaded) process can checkpoint itself, aka "self" checkpoint, if it calls the new system calls. Otherwise, for an "external" checkpoint, the caller must first freeze the target processes. One can either checkpoint an entire container (and we make best effort to ensure that the result is self-contained), or merely a subtree of a process hierarchy.
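For the "external" checkpoint case, freezing is done through the freezer control group by writing FROZEN (or THAWED to resume) to the group's freezer.state file. The following is a minimal sketch of that step only. The helper name freezer_set_state and the idea of passing the cgroup directory as an argument are assumptions made for this example; where the freezer hierarchy is mounted depends on the system.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 'state' ("FROZEN" or "THAWED") to
 * <cgroup_dir>/freezer.state. cgroup_dir is wherever the freezer
 * cgroup holding the target tasks is mounted (an assumption here).
 * Returns 0 on success, -1 on any error.
 */
int freezer_set_state(const char *cgroup_dir, const char *state)
{
	char path[4096];
	size_t len;
	ssize_t w;
	int fd, n, rc;

	assert(cgroup_dir != NULL && state != NULL);

	n = snprintf(path, sizeof(path), "%s/freezer.state", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path did not fit in the buffer */

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = strlen(state);
	w = write(fd, state, len);
	rc = (w == (ssize_t)len) ? 0 : -1;

	if (close(fd) < 0)
		rc = -1;
	return rc;
}
```

Freezing is not instantaneous: the group may report FREEZING for a while, so a real caller would read freezer.state back and wait until it says FROZEN before taking the checkpoint.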

Q: What about namespaces?
A: Currently, UTS and IPC namespaces are restored. They demonstrate how namespaces are handled. More to come.

Q: What additional work needs to be done to it?
A: Fill in the gory details following the examples so far. Current WIP includes inet sockets, event-poll, and early work on inotify, mount namespace and mount-points, pseudo file systems, and x86_64 support.

Q: How can I try it?
A: Use it for simple batch jobs (pipes, too), or an interactive 'screen' session, in a whole container or just a subtree of tasks:

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen, life span of the object hash, references to objects in the hash

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only private anonymous or file-mapped VMAs; syscalls ignore pid/crid argument and act on current process.

--

At the containers mini-conference before OLS, the consensus among all the stakeholders was that doing checkpoint/restart in the kernel as much as possible was the best approach. With this approach, the kernel will export a relatively opaque 'blob' of data to userspace which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was that a userspace application would be responsible for collecting all of this data. We were also planning on adding lots of new, little kernel interfaces for all of the things that needed checkpointing. This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel structures such as vmas and mm_structs. It will also contain copies of the actual memory that the process uses. Any changes in this blob's format between kernel revisions can be handled by an in-userspace conversion program.

This is a similar approach to virtually all of the commercial checkpoint/restart products out there, as well as the research project Zap.

These patches basically serialize internal kernel state and write it out to a file descriptor. The checkpoint and restore are done with two new system calls: sys_checkpoint and sys_restart.
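As a rough userspace analogy for "serialize state and write it out to a file descriptor", the sketch below frames each piece of state as a small header followed by a payload and reads it back the same way. The record layout, the magic value, and every name in it (struct img_hdr, img_write_record, img_read_record) are made up for illustration. It does not reflect the actual image format or the arguments of sys_checkpoint/sys_restart.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record layout; the real format is defined by the patches. */
#define IMG_MAGIC 0x43525431u	/* arbitrary value chosen for this sketch */

struct img_hdr {
	uint32_t magic;
	uint32_t type;	/* hypothetical record type */
	uint32_t len;	/* payload length in bytes */
};

/* Write exactly n bytes, retrying on short writes and EINTR. */
int write_all(int fd, const void *buf, size_t n)
{
	const char *p = buf;

	while (n > 0) {
		ssize_t w = write(fd, p, n);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += w;
		n -= (size_t)w;
	}
	return 0;
}

/* Read exactly n bytes; early EOF counts as an error. */
int read_all(int fd, void *buf, size_t n)
{
	char *p = buf;

	while (n > 0) {
		ssize_t r = read(fd, p, n);

		if (r < 0 && errno == EINTR)
			continue;
		if (r <= 0)
			return -1;
		p += r;
		n -= (size_t)r;
	}
	return 0;
}

/* Emit one record: header in host byte order, then the payload. */
int img_write_record(int fd, uint32_t type, const void *payload, uint32_t len)
{
	struct img_hdr h = { IMG_MAGIC, type, len };

	assert(payload != NULL || len == 0);
	if (write_all(fd, &h, sizeof(h)) < 0)
		return -1;
	return write_all(fd, payload, len);
}

/* Read one record into payload (capacity cap). Returns its length or -1. */
int img_read_record(int fd, uint32_t *type, void *payload, uint32_t cap)
{
	struct img_hdr h;

	if (read_all(fd, &h, sizeof(h)) < 0)
		return -1;
	if (h.magic != IMG_MAGIC || h.len > cap)
		return -1;	/* not a record we understand, or too large */
	if (read_all(fd, payload, h.len) < 0)
		return -1;
	*type = h.type;
	return (int)h.len;
}
```

Because the writer only needs a file descriptor, the same stream can go to a regular file, a pipe, or a socket. A type/length header of this kind is also one way an in-userspace conversion tool could step through an image record by record.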

In this incarnation, they can only checkpoint and restore a single task. The task's address space may consist of only private, simple vma's - anonymous or file-mapped. The open files may consist of only simple files and directories.

--