Zero-Copy in Linux with sendfile() and splice()

After my recent excursion to Kernelspace, I’m back in Userland working on a server process that copies data back and forth between a file and a socket. The traditional way to do this is to copy data from the source file descriptor to a buffer, then from the buffer to the destination file descriptor – like this:

// do_read and do_write are simple wrappers on the read() and
// write() functions that keep reading/writing the file descriptor
// until all the data is processed.
do_read(source_fd, buffer, len);
do_write(destination_fd, buffer, len);
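For completeness, here's a minimal sketch of what such wrappers might look like (illustrative only, not the exact code from my server; error handling is kept deliberately simple):

#include <unistd.h>
#include <errno.h>

// Keep calling read() until len bytes have been read, or until
// EOF/error. Returns the number of bytes actually read, or -1.
static ssize_t do_read(int fd, char *buffer, size_t len)
{
    size_t total = 0;
    while (total < len) {
        ssize_t n = read(fd, buffer + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;   // interrupted - just retry
            return -1;      // real error
        }
        if (n == 0)
            break;          // EOF
        total += n;
    }
    return total;
}

// Keep calling write() until len bytes have been written, or error.
static ssize_t do_write(int fd, const char *buffer, size_t len)
{
    size_t total = 0;
    while (total < len) {
        ssize_t n = write(fd, buffer + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        total += n;
    }
    return total;
}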

While this is very simple and straightforward, it is somewhat inefficient – we are copying data from the kernel buffer for source_fd into a buffer located in user space, then immediately copying it from that buffer to the kernel buffers for destination_fd. We aren’t examining or altering the data in any way – buffer is just a bit bucket we use to get data from a socket to a file or vice versa. While working on this code, a colleague clued me in to a better way of doing this – zero-copy.

As its name implies, zero-copy allows us to operate on data without copying it, or, at least, by minimizing the amount of copying going on. Zero Copy I: User-Mode Perspective describes the technique, with some nice diagrams and a description of the sendfile() system call.

Now, as the man page states, there's a limitation here: "Presently (Linux 2.6.9 [and, in fact, as of this writing in June 2010]): in_fd, must correspond to a file which supports mmap()-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket." So we can only use sendfile() for reading data from our file and sending it to the socket.
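In sketch form, the file-to-socket path with sendfile() looks something like this (file_fd and socket_fd are placeholder names, assumed to be open already; error handling is minimal):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

// Push an entire file out over a socket without the data ever
// touching a userland buffer.
static int send_file(int file_fd, int socket_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    off_t offset = 0;
    size_t remaining = st.st_size;

    while (remaining > 0) {
        // sendfile() may send less than we asked for; it updates
        // offset to reflect how far into the file we've gotten.
        ssize_t sent = sendfile(socket_fd, file_fd, &offset, remaining);
        if (sent <= 0)
            return -1;
        remaining -= sent;
    }
    return 0;
}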

It turns out that sendfile() significantly outperforms read()/write() - I was seeing about 8% higher throughput on a fairly informal read test. Great stuff, but our write operations are still bouncing unnecessarily through userland. After some googling around, I came across splice(), which turns out to be the primitive underlying sendfile(). An lkml thread back in 2006 carries a detailed explanation of splice() from Linus himself, but the basic gist is that splice() allows you to move data between kernel buffers (via a pipe) with no copy to userland. It's a more primitive (and therefore flexible) system call than sendfile(), and requires a bit of wrapping to be useful - here's my first attempt to write data from a socket to a file:
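The shape of it is something like this (a simplified sketch; socket_fd and file_fd are assumed to be open already):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Move len bytes from a socket to a file via splice(), bouncing the
// data through a pipe so it never has to be copied into userland.
// Note: the first splice() here asks for all of the remaining data
// at once - see below for why that turns out to be a problem.
static ssize_t splice_socket_to_file(int socket_fd, int file_fd, size_t len)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    size_t total = 0;
    while (total < len) {
        // socket -> pipe (stays in kernel space)
        ssize_t in = splice(socket_fd, NULL, pipefd[1], NULL,
                            len - total, SPLICE_F_MOVE);
        if (in <= 0)
            break;

        // pipe -> file (also stays in kernel space)
        ssize_t done = 0;
        while (done < in) {
            ssize_t out = splice(pipefd[0], NULL, file_fd, NULL,
                                 in - done, SPLICE_F_MOVE);
            if (out <= 0) {
                close(pipefd[0]);
                close(pipefd[1]);
                return -1;
            }
            done += out;
        }
        total += done;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}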

This almost worked on my system, and it may work fine on yours, but there is a bug in kernel 2.6.31 that makes the first splice() call hang when you ask for all of the data on the socket. The Samba guys worked around this by simply limiting the data read from the socket to 16k. Modifying our first splice call similarly fixes the issue:
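In the sketch above, that just means capping how much we ask for in the first splice() call:

// Work around the hang by never asking the socket for more than 16k
// in a single splice() call (the same cap the Samba workaround uses).
#define SPLICE_SOCKET_MAX (16 * 1024)

        size_t chunk = len - total;
        if (chunk > SPLICE_SOCKET_MAX)
            chunk = SPLICE_SOCKET_MAX;

        // socket -> pipe, at most 16k at a time
        ssize_t in = splice(socket_fd, NULL, pipefd[1], NULL,
                            chunk, SPLICE_F_MOVE);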

I haven't benchmarked the 'write' speed yet, but on reads splice() performed just a little slower than sendfile() (which I attribute to the additional user/kernel context switching), but still significantly faster than read()/write().

Hi Mukesh – that’s interesting. Not sure why you’d see a perf degradation like that – what sort of numbers are you seeing? I’m not sure it’s LRO – that’s been around for a while. Perhaps VSFTP is tuned for the traditional read/write pattern?

Thanks for the info. I looked into the details of the threads you shared and ported the patch from one of them so the splice call can accept a buffer size greater than 16384. I was able to make it operational. But to my surprise, the performance went down further, to ~17000 Kbytes/sec 🙁

Wow – I don’t understand that at all – why would fixing that bug result in a performance degradation? How many “splice Write” messages do you see in the kernel log? You should be able to figure out what size of chunk is being transferred – it should be more than 16k! Of course, the other thing you could do is to comment out that perror call to see if it is adding significant overhead.

I don’t think the read should have SPLICE_F_NONBLOCK, but you could try it and see. Unfortunately I don’t have a system set up to test any of this, so you’re pretty much on your own.

BTW – a good tip for posting source is to use https://gist.github.com/ – paste your source there and just post the URL here.

I had a similar experience. Performance of splice() is much worse than buffered read/write if the splice size is small. After using the pipe size (64K) as the splice size, it is better than read/write. However, it is still far from the zero-copy performance I expected. LRO seems to be supported by only some 10G NICs because of patent issues. I only need OS-based zero-copy on basic Gb/100Mb hardware. (I am using it with an embedded Linux.)

Hi Ron – you may be encountering the same issue that I did – if you read carefully to the end of the blog post, you’ll see that I saw the first splice() call hang when I asked for all of the data on the socket. The solution is to modify the first splice call to limit the amount of data you read. I forked and modified your server-side gist: https://gist.github.com/3165080

I can think of two possible explanations: either offset is somehow being corrupted, so it's seeking to some crazy offset in the file while writing and creating a sparse file, or the loop is somehow broken and it's running round the loop too many times.

What does the content of the file look like, compared to the input file?

As far as I know, you can post as much as you like in a gist. If I were you, though, I’d add some more printf calls to see what is happening – how much it’s reading in each call to splice, what the offset is, that sort of thing. I don’t actually have a Linux box handy to run anything on right now :-/

I guess the main issue is on the server side. When the server creates the output file, the file size (in bytes) is unknown. The client should send the "count" information before the server starts writing to the output file.

Ron – I can’t think what’s happening. You could send the file size, but you shouldn’t need to – the server should just read data from the socket until the client closes it. Good luck debugging tomorrow!

Thanks for writing this. Do you know how this compares to mmapping a file and supplying the resulting buffer to read() calls? In theory it should achieve something similar: the socket data goes straight into the shared mapping, and we avoid the read/write call pair.
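For comparison, a rough sketch of that mmap-based approach (this assumes the incoming file size is known up front, which, as discussed above, it often isn't):

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Receive file_size bytes from a socket directly into a file-backed
// mapping: the only copy is read()'s copy from the socket buffer
// into the page cache pages backing the file.
static int recv_into_mapping(int socket_fd, int file_fd, size_t file_size)
{
    // Grow the file to its final size so the mapping is fully backed.
    if (ftruncate(file_fd, file_size) < 0)
        return -1;

    char *map = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, file_fd, 0);
    if (map == MAP_FAILED)
        return -1;

    size_t total = 0;
    while (total < file_size) {
        // read() the socket straight into the mapped file pages.
        ssize_t n = read(socket_fd, map + total, file_size - total);
        if (n <= 0)
            break;
        total += n;
    }

    munmap(map, file_size);
    return (total == file_size) ? 0 : -1;
}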