On Monday 11 August 2008, Dave Hansen wrote:
> Thanks for all of the very interesting comments about the ABI.
>
> Considering that we're still *really* early in getting this concept
> merged up into mainline, what do you all think we should do now?

I think the two most important aspects here need to be security and simplicity. If you have to choose between the two, it probably makes sense to put security first, because loading untrusted data into the kernel puts you at a significant risk to start with. If you can show a restart interface that lets regular users restart their tasks in a way anyone can verify to be secure, that will be a good indication that you're on the right track.

The other problem that you really need to solve is interface stability. What you are creating is a binary representation of many kernel internal data structures, so, under our usual rules, you have to make sure that you remain forward and backward compatible. Simply saying that you need to run an identical kernel when restarting from a checkpoint is not enough IMHO.

Some more words on specific interfaces that we have discussed:

The single-file-descriptor approach has the big advantage of keeping the complexity in one place (the kernel). To be consistent with other kernel interfaces, I would make the kernel hand out a file descriptor, not let the user open a file and pass that into the kernel as you do now.

A new file system is a good idea for many complex interfaces that make their way into the kernel, but I don't think it will help in this case.

For checkpointing a single task, or even a task with its children, a different interface I could imagine would be to have a new file in procfs per pid that you can read as a pipe giving out the same data that you currently save in the checkpoint file descriptor. It does mean that you won't be able to pass flags down easily (you could write to the pipe before you start reading, but that's not too nice).
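
To make the procfs idea a bit more concrete, here is a rough user-space sketch of what saving a checkpoint could look like if such a file existed. The per-pid file name (something like /proc/<pid>/checkpoint) is purely an assumption for illustration, as are the function names; no such file exists today. The code only uses plain open/read/write, which is the point: a pipe-like file needs no new syscall on the reading side.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd until EOF.
 * Returns the number of bytes copied, or -1 on error (errno is set). */
ssize_t copy_stream(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        if (n == 0)
            return total;               /* EOF: the stream is complete */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry the read */
            return -1;
        }
        /* write() may be short, so loop until the whole chunk is out */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}

/* Read the checkpoint stream from src_path and store it in dst_path.
 * src_path stands in for the hypothetical per-pid procfs file. */
ssize_t save_checkpoint(const char *src_path, const char *dst_path)
{
    assert(src_path != NULL && dst_path != NULL);

    int in_fd = open(src_path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    int out_fd = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out_fd < 0) {
        int saved = errno;
        close(in_fd);
        errno = saved;
        return -1;
    }

    ssize_t n = copy_stream(in_fd, out_fd);
    int saved = errno;
    close(in_fd);
    if (close(out_fd) < 0 && n >= 0) {  /* a failed close can lose data */
        n = -1;
        saved = errno;
    }
    errno = saved;
    return n;
}
```

Note that nothing here can pass flags to the kernel, which is exactly the limitation mentioned above.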

On the restart side, I think the most consistent interface would be a new binfmt_chkpt implementation that you can use to execve a checkpoint, just like you execute an ELF file today. The binfmt can be a module (unlike a syscall), so an administrator that is afraid of the security implications can just disable it by not loading the module. In an execve model, the parent process can set up anything related to credentials as well as it's allowed to and then let the kernel do the rest.
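
From user space, the execve model could look roughly like the sketch below. It assumes a binfmt_chkpt handler that recognizes the checkpoint image, which does not exist yet; the function name and the exit codes 126/127 are my own choices for illustration. Without the handler loaded, execve() on an image would simply fail, which is the "disable it by not loading the module" behaviour. Supplementary groups are not handled here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Fork, set up credentials in the child, then execve() the image.
 * Returns the child's exit status, or -1 if fork/wait failed or the
 * child was killed by a signal. */
int exec_image_as(const char *image_path, char *const argv[],
                  uid_t uid, gid_t gid)
{
    assert(image_path != NULL && argv != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: set the group first, because after setuid() we may
         * no longer be allowed to change it. */
        if (setgid(gid) != 0 || setuid(uid) != 0)
            _exit(126);
        /* With a (hypothetical) binfmt_chkpt loaded, the kernel would
         * rebuild the task from the image at this point. */
        execve(image_path, argv, environ);
        _exit(127);                     /* only reached if execve() failed */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent only does what it is already permitted to do (setgid/setuid within its own rights), so the restart path adds no new way to gain credentials.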