Ptrace SEIZE, used to grab pages from task's VM into a pipe (with vmsplice)

The last step deserves a more detailed explanation. In order to drain memory from a task, we first generate a bitmap of the pages that need to be dumped (using the smaps, map_files and pagemap cache filled from proc). Next, we create a set of pipes to put the pages into. Then we infect the process with parasite code, which, in turn, gets the pipes and vmsplices the required pages into them. Finally, we splice the pages from the pipes into image files.
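The pipe part of this sequence can be tried out in a single process. The following is only a sketch, not CRIU's code: it assumes Linux, calls vmsplice() and splice() through ctypes because Python has no vmsplice() wrapper, and the names drain_pages, wanted_pages and image_path are made up for the illustration. In CRIU the vmsplice() side runs inside the parasite in the target task, while here one process plays both roles.

```python
import ctypes
import mmap
import os
import tempfile

PAGE_SIZE = mmap.PAGESIZE


class IoVec(ctypes.Structure):
    """Mirror of struct iovec from <sys/uio.h>."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


# Both system calls are reached through libc (an assumption of this sketch).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.vmsplice.argtypes = [ctypes.c_int, ctypes.POINTER(IoVec),
                           ctypes.c_size_t, ctypes.c_uint]
_libc.vmsplice.restype = ctypes.c_ssize_t
_libc.splice.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_int,
                         ctypes.c_void_p, ctypes.c_size_t, ctypes.c_uint]
_libc.splice.restype = ctypes.c_ssize_t


def _check(ret):
    """Turn a -1 return value from libc into an OSError."""
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def drain_pages(region, wanted_pages, image_path):
    """vmsplice the listed pages of region into a pipe, then splice them to a file.

    wanted_pages is a hypothetical stand-in for the bitmap of pages to dump.
    Returns the number of bytes written to image_path.
    """
    base = ctypes.addressof(ctypes.c_char.from_buffer(region))
    rfd, wfd = os.pipe()
    out_fd = os.open(image_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    total = 0
    try:
        for index in wanted_pages:
            # Step 1: reference one page of our memory from the pipe.
            iov = IoVec(base + index * PAGE_SIZE, PAGE_SIZE)
            if _check(_libc.vmsplice(wfd, ctypes.byref(iov), 1, 0)) != PAGE_SIZE:
                raise OSError("short vmsplice")
            # Step 2: move the page from the pipe into the image file.
            remaining = PAGE_SIZE
            while remaining:
                moved = _check(_libc.splice(rfd, None, out_fd, None, remaining, 0))
                if moved == 0:
                    raise OSError("pipe closed early")
                remaining -= moved
            total += PAGE_SIZE
    finally:
        for fd in (rfd, wfd, out_fd):
            os.close(fd)
    return total


if __name__ == "__main__":
    area = mmap.mmap(-1, 4 * PAGE_SIZE,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    area[0:PAGE_SIZE] = b"A" * PAGE_SIZE
    with tempfile.TemporaryDirectory() as tmp:
        written = drain_pages(area, [0], os.path.join(tmp, "pages.img"))
        print("bytes written:", written)
```

Draining one page at a time keeps the sketch short. A real dumper would batch many iovecs per vmsplice() call and size the pipes accordingly.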

Anonymous private mappings might have pages shared between tasks until they get COW-ed. To restore this, CRIU pre-restores those pages before forking the child processes and mremap-s them in the final stage.

Shared anonymous areas are implemented in the kernel by supporting a pseudo file on a hidden tmpfs mount. So on restore we just determine who will create the shared area and who will attach to it (see the postulates). Then the creator mmap-s the region and the others open the /proc/pid/map_files/ link. However, on recent kernels we use the memfd system call, which does a similar thing but also works with user namespaces. Briefly: the creator creates the memfd, and all the others get one via the /proc/pid/fd link, which is not as strict as map_files.

Having said that, the restore of memory is done in the following steps:

Open images and read in VMAs

Open all the mm.img files, read the mappings in, resolve shared memory segments and check whether mapped files need special care.

Fork and pre-mmap

Each task pre-mmaps its private anonymous areas and populates them with pages (from the pagemap/pages images). Then the task forks a child, which does the same. It is done this way so that COW-ed areas actually share the pages they should: the pages become shared on fork(), and currently this is the only way to make the Linux kernel do this.

Open file mappings

Soon after fork we check which VMA-s are MAP_FILE ones and request the files engine to open them.
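The effect the fork and pre-mmap step relies on can be seen in miniature. This is a sketch assuming a Unix-like system with fork(). The function premap_and_fork is a hypothetical name and stands in for a whole restore stage: the parent populates a private anonymous area before forking, the child inherits the contents without copying them itself, and a later write in the child stays private to it.

```python
import mmap
import os


def premap_and_fork(payload):
    """Populate a private anonymous area, fork, and report what each side sees.

    Returns (bytes the child saw right after fork, bytes the parent sees
    after the child has overwritten its own copy).
    """
    area = mmap.mmap(-1, mmap.PAGESIZE,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    area[:len(payload)] = payload          # "pre-restore" the page contents

    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the page is inherited from the parent via fork().
        try:
            inherited = area[:len(payload)]
            area[:len(payload)] = b"x" * len(payload)   # COW: child-only change
            os.write(wfd, inherited)
        finally:
            os._exit(0)

    os.close(wfd)
    seen_by_child = os.read(rfd, len(payload))
    os.close(rfd)
    os.waitpid(pid, 0)
    parent_view = area[:len(payload)]
    area.close()
    return seen_by_child, parent_view


if __name__ == "__main__":
    print(premap_and_fork(b"criu"))
```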

For things such as remote dumps, stackable images and incremental dumps, CRIU supports more sophisticated memory C/R policies than the plain "dump all -- restore all" one. There are several CLI knobs that can be used.

dump action

pre-dump action

--track-mem option

--prev-images-dir option

--leave-running option

--page-server option

Let's see what all of this means.

First of all, the pre-dump action always turns on the --track-mem and --leave-running options, even if they are not specified on the command line. Next, the pre-dump action dumps only the memory, while the dump action dumps all the state, including open files, sockets and so on. Having said that, let's see all the possible combinations and what they result in.
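The rule in the paragraph above can be written down as a small model. This is not CRIU's option parser. MemPolicy and effective_policy are names invented for the illustration, and only the behaviour stated above is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemPolicy:
    """What a run ends up doing (hypothetical model, not a CRIU structure)."""
    dumps_full_state: bool   # files, sockets, etc. in addition to memory
    track_mem: bool          # memory changes tracking is turned on
    leave_running: bool      # tasks keep running after the dump


def effective_policy(action, track_mem=False, leave_running=False):
    """Resolve the action and the two flags into the behaviour described above."""
    if action == "pre-dump":
        # pre-dump forces both options on and dumps memory only.
        return MemPolicy(dumps_full_state=False, track_mem=True,
                         leave_running=True)
    if action == "dump":
        return MemPolicy(dumps_full_state=True, track_mem=track_mem,
                         leave_running=leave_running)
    raise ValueError(f"unknown action: {action!r}")


if __name__ == "__main__":
    print(effective_policy("pre-dump"))
    print(effective_policy("dump", leave_running=True))
```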

dump

Without any options, dump everything and kill the dumped tasks.

dump --track-mem

Dump everything, turn on memory changes tracking, and kill the tasks after this. As you might have noticed, this is a pretty useless combination of options!

dump --leave-running

Dump everything, and leave the tasks running after dump.

dump --track-mem --leave-running

Same as above, but turn on memory changes tracking.

dump --track-mem --leave-running --prev-images-dir <path>

Same as above, but during the dump also check whether each page is already present in the parent images and, if so, skip dumping it this time.

pre-dump

Only dump memory, turn on memory changes tracking and leave the tasks running.
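To show how these combinations chain together, here is a sketch that only builds command lines for a few pre-dump rounds followed by one dump, without running anything. Several things are assumptions of the sketch rather than statements from this text: the helper name incremental_dump_commands, the one-directory-per-round layout, the -t (task tree PID) and -D (images directory) options, and passing the previous directory as a path relative to the current images directory.

```python
import os


def incremental_dump_commands(pid, base_dir, pre_dump_rounds):
    """Return argv lists: pre_dump_rounds pre-dumps, then one dump.

    Each round writes into its own directory and, from the second round on,
    points --prev-images-dir at the previous round's directory.
    """
    commands = []
    prev_dir = None
    for round_no in range(pre_dump_rounds + 1):
        is_final = round_no == pre_dump_rounds
        images_dir = os.path.join(base_dir, str(round_no))
        argv = ["criu", "dump" if is_final else "pre-dump",
                "-t", str(pid), "-D", images_dir]
        if is_final:
            # pre-dump implies these two; dump needs them spelled out.
            argv += ["--track-mem", "--leave-running"]
        if prev_dir is not None:
            # Assumption: the parent path is given relative to images_dir.
            argv += ["--prev-images-dir", os.path.relpath(prev_dir, images_dir)]
        commands.append(argv)
        prev_dir = images_dir
    return commands


if __name__ == "__main__":
    for argv in incremental_dump_commands(1234, "/tmp/img", 2):
        print(" ".join(argv))
```

Dropping --leave-running from the last command would give a final dump that kills the tasks, at the cost of deviating from the exact combination listed above.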