Commit Message

From: Andrei Vagin <avagin@virtuozzo.com>
It is a hybrid of process_vm_readv() and vmsplice().
vmsplice can map memory from a current address space into a pipe.
process_vm_readv can read memory of another process.
A new system call can map memory of another process into a pipe.
ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
unsigned long nr_segs, unsigned int flags)
All arguments are identical with vmsplice except pid which specifies a
target process.
Currently if we want to dump a process memory to a file or to a socket,
we can use process_vm_readv() + write(), but it works slow, because data
are copied into a temporary user-space buffer.
A second way is to use vmsplice() + splice(). It is more effective,
because data are not copied into a temporary buffer, but here is another
problem. vmsplice works with the currect address space, so it can be
used only if we inject our code into a target process.
The second way suffers from a few other issues:
* a process has to be stopped to run a parasite code
* a number of pipes is limited, so it may be impossible to dump all
memory in one iteration, and we have to stop process and inject our
code a few times.
* pages in pipes are unreclaimable, so it isn't good to hold a lot of
memory in pipes.
The introduced syscall allows to use a second way without injecting any
code into a target process.
My experiments shows that process_vmsplice() + splice() works two time
faster than process_vm_readv() + write().
It is particularly useful for iterative migration. In those cases we enable
a memory tracker, and then we are dumping a process memory while a process
continues to work. On the first iteration we are dumping all memory, and
then we are dumpung only memory that was modified from the previous
iteration. After a few pre-dump operations, a process is stopped and
dumped finally. The pre-dump operations allow to significantly decrease a
process downtime, when a process is migrated to another host.
The primary disadvantage of the process_vm_readv() + write() approach for
the iterative migration usecase is the need to stop the processes with, e.g.
ptrace, and inject parasite code. And, since the the number of pipes and
their size is limited, we either have to limit the dump size to about 1.5GB
or to keep the target processes stopped. The process_vmsplice system call
allows splicing memory without the parasite code and even without stopping
target processes.
This system call maybe usefult for debuggers. For example, a debugger can
use it to generate a core file, it can splice memory of a process into a
pipe and then splice it from the pipe to a file. This method works much
faster than using PTRACE_PEEK* commands.
Debuggers may also use process_vmsplice() to observe memory changes in a
process page. process_vmsplice() attaches a real process page to a pipe, so
we can splice it once and observe how it is being changed over time.
The process_vmsplice syscall might be of interest for users of
process_vm_readv(), in case if they read memory to send it to somewhere
else.
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
---
fs/splice.c | 205 ++++++++++++++++++++++++++++++++++++++
include/linux/compat.h | 3 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 2 +
5 files changed, 218 insertions(+), 1 deletion(-)