
Apologies for the terrible pun in the title – I just couldn’t resist 🙂

I was hard at work on my current project the other day, a user-mode Linux server daemon, when I realized that I would need to both copy incoming data to disk and forward it to another daemon via a socket. That caused me a moment’s consternation, since I was using splice() to move incoming data from a socket to a file without an intermediate copy in a user-mode buffer – but then I remembered tee(), a companion call to splice().

Where splice() moves data directly from a socket (or file) to a pipe (or vice versa), tee() copies data from one pipe to another, leaving the original data intact in the source pipe. You can then use splice() again to move the data from tee()’s destination pipe to another file descriptor.

It was the work of a few minutes to code up a quick sample app to test this. Since it’s short, and there seems to be a dearth of tee()/splice() examples, here it is in its entirety:

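Below is a minimal sketch of such a program (the argument handling, file names, and error checks are illustrative): splice() pulls a chunk from the source into the first pipe, tee() duplicates it into the second, and two more splice() calls drain the pipes to the two destinations.

/* tee_copy: copy number_of_bytes from a source to two destinations
 * using tee()/splice(). A sketch – argument handling and error
 * checking are illustrative, not production-grade. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 5) {
        fprintf(stderr,
                "usage: %s <source> <dest1> <dest2> <number_of_bytes>\n",
                argv[0]);
        return 1;
    }

    int source_fd = open(argv[1], O_RDONLY);
    int dest1_fd  = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int dest2_fd  = open(argv[3], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    long remaining = atol(argv[4]);
    int pipe1[2], pipe2[2];

    if (source_fd < 0 || dest1_fd < 0 || dest2_fd < 0 ||
        pipe(pipe1) < 0 || pipe(pipe2) < 0) {
        perror("setup");
        return 1;
    }

    while (remaining > 0) {
        /* Move a chunk from the source into pipe1 – no user-space
         * buffer involved. */
        ssize_t n = splice(source_fd, NULL, pipe1[1], NULL,
                           remaining, SPLICE_F_MOVE);
        if (n <= 0)
            break;

        /* Duplicate pipe1's contents into pipe2; the data remains
         * readable in pipe1. */
        ssize_t duplicated = tee(pipe1[0], pipe2[1], n, 0);
        if (duplicated < 0)
            break;

        /* Drain both pipes to their destinations. As noted below, a
         * 'real' implementation must loop here – these calls aren't
         * guaranteed to move every byte in one hit. */
        splice(pipe1[0], NULL, dest1_fd, NULL, n, SPLICE_F_MOVE);
        splice(pipe2[0], NULL, dest2_fd, NULL, duplicated, SPLICE_F_MOVE);

        remaining -= n;
    }

    return 0;
}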
The example doesn’t need much explanation; I added the ‘number_of_bytes’ parameter so that you can copy a limited amount of data from an infinite source such as /dev/zero or /dev/urandom. Note that a ‘real’ implementation needs a bit more code, since it’s not safe to assume that tee() and splice() move all of the requested bytes in one hit – but handling that here would only obscure the example 🙂

After my recent excursion to Kernelspace, I’m back in Userland working on a server process that copies data back and forth between a file and a socket. The traditional way to do this is to copy data from the source file descriptor to a buffer, then from the buffer to the destination file descriptor – like this:

// do_read and do_write are simple wrappers on the read() and
// write() functions that keep reading/writing the file descriptor
// until all the data is processed.
do_read(source_fd, buffer, len);
do_write(destination_fd, buffer, len);

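For completeness, the wrappers might look something like this (a sketch – the real versions would also report errors):

#include <unistd.h>

static void do_read(int fd, char *buffer, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, buffer + done, len - done);
        if (n <= 0)      /* error or EOF */
            break;
        done += n;
    }
}

static void do_write(int fd, const char *buffer, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buffer + done, len - done);
        if (n < 0)
            break;
        done += n;
    }
}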
While this is very simple and straightforward, it is somewhat inefficient – we are copying data from the kernel buffer for source_fd into a buffer located in user space, then immediately copying it from that buffer to the kernel buffers for destination_fd. We aren’t examining or altering the data in any way – buffer is just a bit bucket we use to get data from a socket to a file or vice versa. While working on this code, a colleague clued me in to a better way of doing this – zero-copy.

As its name implies, zero-copy allows us to operate on data without copying it, or, at least, by minimizing the amount of copying going on. Zero Copy I: User-Mode Perspective describes the technique, with some nice diagrams and a description of the sendfile() system call.

Now, as the man page states, there's a limitation here: "Presently (Linux 2.6.9 [and, in fact, as of this writing in June 2010]): in_fd must correspond to a file which supports mmap()-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket." So we can only use sendfile() for reading data from our file and sending it to the socket.
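In code, the file-to-socket direction collapses to something like this (a sketch – socket_fd and file_fd stand in for the daemon's real descriptors):

#include <sys/sendfile.h>
#include <sys/types.h>

/* Send len bytes from a file to a socket; the data never visits a
 * user-space buffer. */
static int send_file_to_socket(int socket_fd, int file_fd, size_t len)
{
    off_t offset = 0;

    while (offset < (off_t)len) {
        /* sendfile() advances offset by the number of bytes sent. */
        ssize_t sent = sendfile(socket_fd, file_fd, &offset,
                                len - (size_t)offset);
        if (sent < 0)
            return -1;
        if (sent == 0)   /* premature EOF on the file */
            break;
    }
    return offset == (off_t)len ? 0 : -1;
}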

It turns out that sendfile() significantly outperforms read()/write() – I was seeing about 8% higher throughput in a fairly informal read test. Great stuff, but our write operations are still bouncing unnecessarily through userland. After some googling around, I came across splice(), which turns out to be the primitive underlying sendfile(). An lkml thread back in 2006 carries a detailed explanation of splice() from Linus himself, but the basic gist is that splice() lets you move data between kernel buffers (via a pipe) with no copy to userland. It’s a more primitive (and therefore more flexible) system call than sendfile(), and it requires a bit of wrapping to be useful. Here’s my first attempt at writing data from a socket to a file:

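In outline, it went something like this (a sketch – the error handling and length bookkeeping are trimmed down here):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Splice the socket into a pipe, then splice the pipe out to the
 * file – the data moves between kernel buffers, never through a
 * user-space buffer. */
static int splice_socket_to_file(int socket_fd, int file_fd, size_t len)
{
    int pipe_fds[2];

    if (pipe(pipe_fds) < 0)
        return -1;

    while (len > 0) {
        /* Ask for all of the remaining data on the socket... */
        ssize_t bytes = splice(socket_fd, NULL, pipe_fds[1], NULL,
                               len, SPLICE_F_MOVE);
        if (bytes <= 0)
            break;

        /* ...then push it from the pipe out to the file. */
        if (splice(pipe_fds[0], NULL, file_fd, NULL,
                   bytes, SPLICE_F_MOVE) < 0)
            break;

        len -= bytes;
    }

    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return len == 0 ? 0 : -1;
}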
This almost worked on my system, and it may work fine on yours, but there is a bug in kernel 2.6.31 that makes the first splice() call hang when you ask for all of the data on the socket. The Samba guys worked around this by simply limiting the data read from the socket to 16k. Modifying our first splice() call in the same way fixes the issue:

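Only the first splice() call changes – cap each request at 16k instead of asking for everything at once (the macro name here is mine):

#define MAX_SPLICE_CHUNK (16 * 1024)

        /* Inside the loop above: never request more than 16k from
         * the socket in a single splice() call. */
        size_t chunk = len < MAX_SPLICE_CHUNK ? len : MAX_SPLICE_CHUNK;
        ssize_t bytes = splice(socket_fd, NULL, pipe_fds[1], NULL,
                               chunk, SPLICE_F_MOVE);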
I haven’t benchmarked the ‘write’ speed yet. On reads, splice() performed just a little slower than sendfile() – I attribute the gap to the additional user/kernel context switching – but still significantly faster than read()/write().