--
Why do we want it? It allows containers to be moved between physical
machines' kernels in the same way that VMware can move VMs between
physical machines' hypervisors. There are currently at least two
out-of-tree implementations of this in the commercial world (IBM's
Metacluster and Parallels' OpenVZ/Virtuozzo) and several in the academic
world like Zap.

Why do we need it in mainline now? Because we already have plenty of
out-of-tree ones, and want to know what an in-tree one will be like. :)
What *I* want right now is the extra review and scrutiny that comes with
a mainline submission to make sure we're not going in a direction
contrary to the community.

This only supports pretty simple apps. But, I trust Ingo when he says:

>> > > Generally, if something works for simple apps already (in a robust,
>> > > compatible and supportable way) and users find it "very cool", then
>> > > support for more complex apps is not far in the future. but if you
>> > > want to support more complex apps straight away, it takes forever and
>> > > gets ugly.

We're *certainly* going to be changing the ABI (which is the format of
the checkpoint). I'd like to follow the model that we used for
ext4-dev, which is to make it very clear that this is a development-only
feature for now. Perhaps we do that by making the interface only
available through debugfs or something similar for now. Or by reserving
the syscall numbers but requiring some runtime switch to be thrown before
they can be used. I'm open to suggestions here.
--

--
Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Handle multiple namespaces in a container (e.g. save the filesystem
  namespace's state with the file descriptors)
- Security (without CAP_SYS_ADMIN, restoring files may fail)

Changelog:

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.
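
The v11 note about cr_scan_fds() retrying from scratch describes a
common pattern: size a buffer, scan into it, and if the table turns out
to be larger than the buffer, discard the partial result and start over
with more room. The following is a userspace sketch of that pattern
only. The names scan_fds_retry, fill_fn and demo_fill, and the growth
policy, are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical callback: stores up to 'cap' descriptors in 'out' and
 * returns the total number that exist (which may exceed 'cap'). */
typedef size_t (*fill_fn)(int *out, size_t cap, void *ctx);

/* Collect all descriptors, restarting from scratch with a larger buffer
 * whenever the current one proves too small. On success returns the
 * count and stores a malloc'ed array in *fds; returns -ENOMEM if an
 * allocation fails. */
long scan_fds_retry(fill_fn fill, void *ctx, int **fds)
{
    size_t cap = 4;             /* deliberately small starting size */

    assert(fill != NULL && fds != NULL);
    for (;;) {
        int *buf = malloc(cap * sizeof(*buf));
        if (buf == NULL)
            return -ENOMEM;

        size_t total = fill(buf, cap, ctx);
        if (total <= cap) {     /* everything fit: done */
            *fds = buf;
            return (long)total;
        }

        free(buf);              /* discard the partial result and retry */
        cap = total > cap * 2 ? total : cap * 2;
    }
}

/* Demonstration source: 'ctx' points at a count n, and the
 * "descriptors" are simply 0..n-1. */
size_t demo_fill(int *out, size_t cap, void *ctx)
{
    size_t n = *(size_t *)ctx;

    for (size_t i = 0; i < n && i < cap; i++)
        out[i] = (int)i;
    return n;
}
```

Restarting rather than resuming keeps the result consistent if the
table changes between attempts, at the cost of repeating work.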

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach. With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data. We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing. This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs. It will also contain
copies of the actual memory that the process uses. Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.
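
Since the blob's format may change between kernel revisions, a
userspace tool needs some way to tell whether an image must be
converted before restore. One plausible way is a small versioned header
at the start of the image. The sketch below assumes such a header;
struct ckpt_hdr, CKPT_MAGIC, CKPT_VERSION and ckpt_classify are all
hypothetical and do not reflect the layout these patches actually use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CKPT_MAGIC   0x54504b43u  /* hypothetical magic value */
#define CKPT_VERSION 2u           /* hypothetical "current" format version */

/* Hypothetical fixed-size header at the start of a checkpoint image. */
struct ckpt_hdr {
    uint32_t magic;
    uint32_t version;
};

enum ckpt_action { CKPT_OK, CKPT_CONVERT, CKPT_REJECT };

/* Decide what a userspace tool should do with an image before handing
 * it to the kernel: pass it through, convert it, or refuse it. */
enum ckpt_action ckpt_classify(const void *image, size_t len)
{
    struct ckpt_hdr hdr;

    assert(image != NULL);
    if (len < sizeof(hdr))
        return CKPT_REJECT;           /* too short to hold a header */
    memcpy(&hdr, image, sizeof(hdr)); /* copy out to avoid unaligned reads */

    if (hdr.magic != CKPT_MAGIC)
        return CKPT_REJECT;           /* not a checkpoint image */
    if (hdr.version == CKPT_VERSION)
        return CKPT_OK;
    if (hdr.version < CKPT_VERSION)
        return CKPT_CONVERT;          /* older format: convert in userspace */
    return CKPT_REJECT;               /* newer than this tool understands */
}
```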

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor. The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.
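
Writing serialized state to a file descriptor usually means framing
each object as a small header followed by its payload, and coping with
short writes. The userspace sketch below illustrates that idea only.
struct rec_hdr, write_all and write_record are hypothetical names, and
the record layout is an assumption rather than the format the patches
emit. It does not call sys_checkpoint or sys_restart.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record header: a type tag and the payload length. */
struct rec_hdr {
    uint32_t type;
    uint32_t len;
};

/* Write exactly 'len' bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is left as set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Emit one record: the header first, then the payload bytes. */
int write_record(int fd, uint32_t type, const void *data, uint32_t len)
{
    struct rec_hdr hdr = { type, len };

    assert(data != NULL || len == 0);
    if (write_all(fd, &hdr, sizeof(hdr)) < 0)
        return -1;
    return write_all(fd, data, len);
}
```

Because the writer only needs a descriptor, the same code works whether
the destination is a regular file, a pipe or a socket.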

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple VMAs - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--