Table of Contents

Overview

Thanks to the efforts of a number of people, zero copy sockets and NFS
patches are available for FreeBSD-current at the URL listed below. Please
note that the patches below are out of date, and are only for people who
want to patch against older versions of -current. The code was checked
into FreeBSD-current on June 25th, 2002, so the latest zero copy code can
now be found in the FreeBSD-current tree.

Header splitting firmware for Alteon's Tigon II boards (written by
Ken Merry <ken@FreeBSD.ORG>), based on version 12.4.11 of their firmware.
This is used in combination with the zero copy receive code to guarantee
that the payload of a TCP or UDP packet is placed into a page-aligned
buffer.

Alteon firmware debugging ioctls and supporting routines for the Tigon
driver (also written by Ken Merry). This will help anyone who is doing
firmware development under FreeBSD for the Tigon boards.

At the moment the NFS patches aren't a part of this patchset. The original
zero copy NFS patches, which eliminated a copy from the kernel to userland,
used the vfs_ioopt code that Matt Dillon says is broken.

Drew Gallatin wrote a different set of NFS speedups that eliminate a copy
from a struct uio to a struct mbuf. I have the patches, but right now I'm
concentrating on getting the sockets code done before I try to speed up the
NFS code.

The Alteon firmware header splitting and debugging code was written for
Pluto Technologies (www.plutotech.com), which kindly agreed to let me
release the code.

Current Status

The zero copy sockets code was checked into the FreeBSD-current tree on
June 25th, 2002.

Many thanks to all the people who tested the code and provided feedback
over the years.

So if you want the latest version of the code, CVSup FreeBSD-current.

The following changes went into the tree between the June 23rd, 2002
snapshot and the commit to -current on June 25th, 2002:

Added MAKEDEV glue for the ti(4) device nodes.

There is a link to the final set of patches below. The June 25th, 2002
patchset is what was applied to -current, minus a missing $FreeBSD$ in
ti_fw2.h that I hand-edited in at the last minute.

The following fixes went into the June 23rd, 2002 snapshot:

Added a zero_copy(9) man page that describes the general
characteristics of the zero copy send and receive code, and
what an application author should do to take advantage of the
code.

Updated the ti(4) man page to include information on the ioctl
interface and the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS
kernel options.

ti(4) driver cleanup: cleaned up some unused code, commented
out some stray diagnostic printfs, and added a comment
describing the transmit flow control problem for posterity.

Added a new jumbo(9) man page that describes the jumbo
allocator.

This snapshot hasn't yet gone through the normal set of regression tests,
though hopefully it will in the next day or so.

Barring any complaints, I'm planning on checking the zero copy code into
the tree on Tuesday evening.

The following fixes went into the June 20th, 2002 snapshot:

Use SLIST_FIRST() macros to access the first entry in a SLIST
in uipc_jumbo.c. I didn't fix all of these when Alfred
pointed out the problem. Pointed out by Bosko Milekic.

Remove superfluous TI_LOCK()/TI_UNLOCK() calls in the zero
copy version of ti_newbuf_jumbo(). We already have the lock
in all the places ti_newbuf_jumbo() is called. Prompted by
Bosko Milekic.

In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling
malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc
fails.

The ti(4) driver and the wi(4) driver, at least, call this with
a mutex held. This causes witness warnings for 'ifconfig -a'
with a wi(4) or ti(4) board in the system. (I've only verified
for ti(4)).

The following fixes went into the June 18th, 2002 snapshot:

Take mutex locking out of ti_attach(); it isn't really needed.
As long as we can assume that probes of successive ti(4)
instances happen sequentially, we'll be safe in doing this.
Thanks to John Baldwin for pointing out the solution to that
problem.

Added a new routine, vm_object_allocate_wait(). This is a
variant of vm_object_allocate() that allows the user to
specify whether the uma_zalloc() call inside
vm_object_allocate_wait() is called with M_WAITOK or M_NOWAIT.
This eliminates a WITNESS warning caused when jumbo_vm_init()
calls vm_object_allocate() with the jumbo lock held, and
potentially gives other callers the option of eliminating the
mandatory wait on the uma_zalloc() call.

With those fixes, plus several fixes that have gone into -current over the
past week or so, the zero copy sockets code runs without any WITNESS
warnings at all.

The following fixes went into the June 9th, 2002 snapshot:

Thanks to Alfred Perlstein for pointing out these problems.

fix a race in the vm object allocation in jumbo_vm_init()

use a sysinit to initialize the jumbo_mutex, since there is
really no other way to avoid a race between checking the mutex to
see if it has been initialized and actually initializing it.

use SLIST_FIRST instead of directly accessing the first element
in the inuse list.

don't call malloc(9) with M_WAITOK while holding a mutex.

Please note that the code will currently spew out TONS of warnings if you
have WITNESS enabled. I am working on fixing these, so if you don't want
to help debug that stuff, make sure you disable WITNESS for now.

The following fixes went into the May 17th, 2002 snapshot:

I have ported the code to -current as of May 17th, 2002. This
includes some fixes for changes Alan Cox made to the vfs_ioopt code
in kern_subr.c.

The following fixes went into the May 4th, 2002 snapshot:

I have ported the code to -current as of May 3rd, 2002.

The jumbo code now has mutex protection, so it should be ready
when the layers above it are moved out from under Giant.

Zero copy send and receive can now be turned on and off on the
fly via the sysctl variables kern.ipc.zero_copy.send and
kern.ipc.zero_copy.receive, respectively. This allows easy tests
of performance with zero copy turned on and off.

The zero copy NFS code is gone; see above for a note on why it
was removed and future plans for a different zero copy NFS codeset.

The following fixes went into the November 29th, 2000 snapshot:

No changes went into this snapshot, other than merging
with -current.

The following fixes went into the November 20th, 2000 snapshot:

The fix to the "localhost panic" problem has been revamped.
We now use a new external mbuf type, EXT_DISPOSABLE, to indicate
that the external mbuf payload may be page-flipped or otherwise
discarded. Instead of attempting to page flip any pages that
meet the size and alignment criteria, we now only page flip
external mbufs marked as disposable. (Thanks to Drew Gallatin for
suggesting this approach.)

The decision process on when to use vm_uiomove() versus
vm_pgmoveco() in uiomoveco() has been revamped somewhat. We no
longer panic in any case. Anything that isn't handled by
vm_pgmoveco() (according to the page flip criteria described above)
is passed to vm_uiomove().

uiomoveco() has been reorganized somewhat, with some of the
functionality split out into a subfunction.

The following fixes went into the November 14th, 2000 snapshot:

The "localhost panic" problem has hopefully been fixed. The
fix was to avoid page-flipping pages with a wire count greater than 0.
I believe this is the right fix, but I would welcome feedback from
someone more familiar with the VM system.

The new external mbuf code has been integrated.

The following fixes went into the November 2nd, 2000 snapshot:

Robert Picco's zero copy send code has been removed. It was
never fixed to eliminate a data corruption problem, and it is
likely that Drew Gallatin's code will make it into -current
instead.

Bring the major number used in the ti(4) driver in line with
the one we have reserved in sys/conf/majors.

Make sure calls to ti_hdr_split() are only made inside #ifdef
TI_JUMBO_HDRSPLIT.

Convert the non-stock portions of the ti(4) driver from spls to
mutexes.

Get rid of an extra make_dev(), and make sure the one in
ti_attach() comes before we return.

The following fixes went into the September 5th, 2000 snapshot:

Merged in the new mbuf reference counting code from -current.

Fixed a bug in writev(2) and sendmsg(2) handling noticed by
Alan Cox <alc@FreeBSD.ORG>. We weren't incrementing the iov pointer
in the uio structure, as uiomove() does.

Fixed another bug in the zero copy code, noticed by
Alan Cox <alc@FreeBSD.ORG>. Moved the initialization of cow_send in
sosend() (in uipc_socket.c) into the inner while loop.

The following fixes went into the August 4th, 2000 snapshot:

Support has been merged in from -current for Alteon and Netgear
1000baseT boards. Initial tests with Alteon boards indicate that
their performance is identical to the 1000baseSX model.

The zero copy send support code has been renamed and moved to a
new file, src/sys/kern/uipc_cow.c.

Drew Gallatin has made some performance enhancements in the
ti(4) driver that decrease receive-side CPU utilization and
increase performance somewhat. (CPU utilization changes are hard
to quantify, but are probably in the 10-20% range. TCP performance
on my test Pentium II 350's increased from about 746Mbps to about
763Mbps, as measured by netperf.)

Drew Gallatin submitted a fix to the IP fragmenting code to
make sure that outgoing fragments are 8-byte aligned.

Incorporated Bill Paul's fixes to Alteon's 12.4.11 firmware
that hopefully include most of the true bugfixes to their 12.4.13
firmware. Alteon's 12.4.13 firmware, when used with 1000BaseT
boards, doesn't seem to autonegotiate anything other than 1000BaseT,
and also doesn't like to be forced to a speed other than 1000Mbps.
Alteon hasn't yet responded to queries about the problems with
version 12.4.13 of their firmware, so we're using version 12.4.11
with some selected fixes from 12.4.13. This seems to properly
negotiate at all supported speeds and duplex settings with 1000baseT
boards.

Header splitting is now restricted to Tigon 2 boards only. We
only have firmware source for the Tigon 2, which is why header
splitting is only supported on those chips.

Drew Gallatin's NFS read header splitting code is now included
in the firmware. This can dramatically improve NFS read performance.

There is a new references section in the FAQ that includes
pointers to some relevant papers and proposals.

The following fixes went into the July 8th, 2000 snapshot:

There was a potential panic caused by a bug in the driver side
of the header splitting code. The bug only popped up with
non-split packets that were long enough to fill up an mbuf. This
generally meant IP fragments with a non-zero fragment offset,
usually generated by NFS reads. Essentially the length of the
initial receive buffer in the mbuf chain was overstated by two
bytes, which caused the next mbuf pointer in the next contiguous
mbuf to get partially overwritten. That could cause a panic in some
situations. Thanks to Drew Gallatin for tracking this one down.

We now do header splitting on IP fragments with a fragment
offset greater than 0. Thanks to Justin Gibbs for the idea.

The Tigon driver now loads and unloads cleanly. Thanks to Drew
Gallatin for getting this working.

Outgoing IP fragments are now generated in page-multiple chunks
if the outgoing interface's MTU is greater than a page in size.
This helps receive-side NFS bandwidth significantly, since page
flipping techniques can be used. Thanks to Drew Gallatin for this
performance enhancement.
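
The two fragment sizing rules mentioned in these changelogs (8-byte
aligned fragments, and page-multiple chunks when the MTU is larger than a
page) come down to a little arithmetic. The following is only a sketch of
that arithmetic, not the code that went into the IP fragmenting path; the
function name and its parameters are hypothetical, and it assumes the
caller supplies the IP header length (20 bytes without options).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the kernel's code: choose the payload length
 * used for each non-final IP fragment.
 *
 * mtu       - outgoing interface MTU in bytes
 * ip_hlen   - IP header length in bytes (20 without options)
 * page_size - VM page size in bytes (4096 on i386, 8192 on Alpha)
 */
static size_t
frag_payload_len(size_t mtu, size_t ip_hlen, size_t page_size)
{
	size_t room;

	assert(mtu > ip_hlen && page_size > 0);
	room = mtu - ip_hlen;

	/* If at least one whole page fits, send a whole number of pages. */
	if (room >= page_size)
		return (room - room % page_size);

	/*
	 * Otherwise round down to a multiple of 8, because the IP
	 * fragment offset field counts 8-byte units.
	 */
	return (room & ~(size_t)7);
}
```

With a 9000 byte MTU and 4K pages this yields 8192 byte fragments, so
each fragment's payload can land in two whole pages on the receiver. With
a 1500 byte MTU it yields the usual 1480 bytes.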

A couple of things have been added to the benchmarks section of this web
page, below:

Drew Gallatin has achieved 986Mbps throughput over gigabit
ethernet with the patches below.

The patches above are based on -current from early in the day on
June 13th 2000, i.e. before Peter's config changes.

Frequently Asked Questions:

Known Problems.

What is "zero copy"?

How does zero copy work?

What hardware does it work with?

Configuration and performance tuning.

Benchmarks.

References.

Possible future directions.

Known Problems:

There are no known problems, although bug reports and feedback are welcome.

What is "zero copy"?

Zero copy is a misnomer, or an accurate description, depending on how you
look at things.

In the normal case, with network I/O, buffers are copied from the user
process into the kernel on the send side, and from the kernel into the user
process on the receiving side.

That is the copy that is being eliminated in this case. The DMA or copy
from the kernel into the NIC, or from the NIC into the kernel is not the
copy that is being eliminated. In fact, you can't eliminate that copy
without taking packet processing out of the kernel altogether, because the
kernel has to see the packet headers in order to determine what to do with
the payload.

Memory copies from userland into the kernel are one of the largest
bottlenecks in network performance on a BSD system, so eliminating them can
greatly increase network throughput, and decrease system load when CPU or
memory bandwidth isn't the limiting factor.

How does zero copy work?

The send side and receive side zero copy code work in different ways:

The send side code takes pages that the userland program writes to a
socket, puts a COW (Copy On Write) mapping on each page, and stuffs it
into an mbuf. The data the user program writes must be page sized and start
on a page boundary in order for it to be run through the zero copy send
code.

If the userland program doesn't write to the page before it has been sent
out on the wire and the mbuf freed (and therefore the COW mapping revoked),
the page will not be copied. For TCP, the mbuf isn't freed until the packet
is acknowledged by the receiver.

So send side zero copy is only better than the standard case, where
userland buffers are copied into kernel buffers, if the userland program
doesn't immediately reuse the buffer.
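
From the application's point of view, the send side rules above amount to
"write whole pages from a page-aligned buffer, then leave the buffer
alone". Here is a minimal userland sketch under that reading. The helper
names are hypothetical, and the eligibility check is a deliberately strict
interpretation (whole multiples of a page only), not a copy of the
kernel's test.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return the VM page size of the running system. */
static size_t
page_size(void)
{
	long ps = sysconf(_SC_PAGESIZE);

	assert(ps > 0);
	return ((size_t)ps);
}

/*
 * Hypothetical check: would this write be a candidate for the COW path
 * as described in the text? The buffer must start on a page boundary
 * and the length must be a non-zero whole number of pages.
 */
static int
zc_send_eligible(const void *buf, size_t len)
{
	size_t ps = page_size();

	return (len > 0 && len % ps == 0 && (uintptr_t)buf % ps == 0);
}

/*
 * Allocate a zeroed, page-aligned buffer of npages pages.
 * Returns NULL on failure; release the buffer with free().
 */
static void *
alloc_page_buf(size_t npages)
{
	void *buf = NULL;
	size_t ps = page_size();

	if (npages == 0 || posix_memalign(&buf, ps, npages * ps) != 0)
		return (NULL);
	memset(buf, 0, npages * ps);
	return (buf);
}
```

An application would fill a buffer from alloc_page_buf(), pass it to
write(2) or send(2), and then avoid touching it until the data has gone
out. Writing to it earlier is still safe, but it triggers the copy that
the COW mapping was meant to avoid.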

Receive side zero copy works in a slightly different manner, and depends in
part on the capabilities of the network card in question.

One requirement for zero copy receive to work is that the chunks of data
passed up the network and socket layers have to be at least page sized, and
be aligned on page boundaries. This pretty much means that the card has
to have an MTU of at least 4K, or 8K in the case of the Alpha. Gigabit
Ethernet cards using Jumbo Frames (9000 byte MTU) fall into this category.
More on that below.
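
As a rough illustration of the MTU requirement, the sketch below counts
how many whole pages of payload fit in one frame once the protocol
headers are set aside. The function is hypothetical, and the 40 byte
header figure used in the examples assumes IP and TCP headers without
options.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of whole pages of payload that fit in a
 * single frame after hdr_len bytes of IP and TCP/UDP headers.
 */
static size_t
pages_per_frame(size_t mtu, size_t hdr_len, size_t page_size)
{
	assert(page_size > 0);
	if (mtu <= hdr_len)
		return (0);
	return ((mtu - hdr_len) / page_size);
}
```

A 9000 byte MTU leaves 8960 bytes of payload, which is two 4K pages or
one 8K Alpha page. A standard 1500 byte MTU never holds a full page, so
there is nothing to flip.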

Another requirement for zero copy receive to work is that the NIC driver
needs to allocate receive side pages from a "disposable" pool. This means
allocating memory apart from the normal mbuf memory, and attaching it as an
external buffer to the mbuf.

It also helps if the NIC can receive packets into multiple buffers, and if
the NIC can separate the ethernet, IP, and TCP or UDP headers from the
payload. The idea is to get the packet payload into one or more page-sized,
page-aligned buffers.

The NIC driver receives data into these buffers allocated from a
disposable pool. The mbuf with these buffers attached is then passed up
the network stack where the headers are removed. Finally it reaches the
socket layer, and waits for the user to read it. Once the user reads the
data, the kernel page is then substituted for the user's page, and the
user's page is then recycled. This is otherwise known as "page flipping".

The page flip can only occur if both the userland buffer and kernel buffer
are page aligned, and if there is at least a page worth of data in the
source and destination. Otherwise the data will be copied out using
copyout() in the normal manner.
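
The flip-or-copy decision can be modeled in a few lines. This is a
simplified, hypothetical model and not the kernel's uiomoveco(): it only
looks at the alignment and length criteria described above, and ignores
the requirement that the kernel buffer come from a disposable pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How a receive of some length would be satisfied. */
struct flip_plan {
	size_t flipped;	/* bytes moved by substituting whole pages */
	size_t copied;	/* bytes moved by an ordinary copy */
};

/*
 * Hypothetical model: kaddr is the kernel source address, uaddr the
 * user destination address, len the number of bytes to move.
 */
static struct flip_plan
plan_receive(uintptr_t kaddr, uintptr_t uaddr, size_t len, size_t page_size)
{
	struct flip_plan p = { 0, 0 };

	assert(page_size > 0);

	/* Flip only whole pages, and only if both sides are aligned. */
	if (kaddr % page_size == 0 && uaddr % page_size == 0)
		p.flipped = len - len % page_size;

	/* Whatever is left over is copied out in the normal manner. */
	p.copied = len - p.flipped;
	return (p);
}
```

For example, with 4K pages an aligned 8292 byte read would flip two pages
and copy the trailing 100 bytes, while the same read into a buffer that
is 16 bytes off a page boundary would be copied in full.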

What hardware does it work with?

The send side zero copy code should work with most any network adapter.

The receive side code, however, requires an adapter with an MTU that is at
least a page size, due to the alignment restrictions for page substitution
(or "page flipping").

The Alteon firmware debugging code requires an Alteon Tigon II board. If
you want the patches to the userland tools and Tigon firmware to debug it
and make it compile under FreeBSD, contact ken@FreeBSD.ORG.

Configuration and performance tuning.

There are a number of options that need to be turned on for various things
to work:

I would also recommend turning off WITNESS, as well as SMP, if you want to
get a good idea of the performance impact of this code.

To get the maximum performance out of the code, here are some suggestions
on various sysctl and other parameters. These assume you've got an
Alteon-based board, so if you're using something else, you may want to
experiment and find the optimum values for some of them:

A send window of 512K seems to work well with 1MB Tigon boards, and a
send window of 256K seems to work well with 512K Tigon boards. Again,
you may want to experiment to find the best settings for your hardware.
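
If you would rather experiment per socket than system-wide, one option is
to request the send buffer size from the application with setsockopt(2).
This sketch assumes that the "send window" above corresponds to the
socket send buffer (SO_SNDBUF); the helper name is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Request a send buffer of `bytes` bytes on one socket.
 * Returns 0 on success, or -1 with errno set. The request can fail or
 * be clamped if it exceeds the system's socket buffer limit.
 */
static int
set_send_window(int sock, int bytes)
{
	assert(bytes > 0);
	return (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes,
	    sizeof(bytes)));
}
```

For a 1MB Tigon board that would be set_send_window(s, 512 * 1024), and
set_send_window(s, 256 * 1024) for a 512K board. On FreeBSD a request
larger than kern.ipc.maxsockbuf allows is rejected, so that limit may need
raising first.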

In particular, this paper, entitled "End-System Optimizations for
High-Speed TCP", by Jeff Chase, Andrew Gallatin and Ken Yocum, includes
some performance graphs for Drew's zero copy code (which is available in
the diffs referenced above), and a good overview of a number of
optimizations that can be used to increase TCP performance:

Possible future directions.

Send side zero copy:

One of the obvious problems with the current send side approach is that it
only works if the userland application doesn't immediately reuse the
buffer.

In the case of many system applications, though, the application will reuse
the buffer immediately, and therefore performance will be no better than
the standard case. Many common applications (like ftp) have been written
with the current system buffer usage in mind, so they function like this:

That makes sense if the kernel is only going to copy the data, but it
doesn't in the zero copy case.
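
A loop of the kind described, with one buffer that is refilled as soon as
write(2) returns, might look like the following. It is a hypothetical
sketch for illustration and is not taken from ftp or any other program.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical copy loop: read from in_fd and write to out_fd through a
 * single buffer. Returns the number of bytes copied, or -1 on error.
 */
static ssize_t
copy_reusing_buffer(int in_fd, int out_fd)
{
	char buf[8192];
	ssize_t nread, nwritten, off, total = 0;

	assert(in_fd >= 0 && out_fd >= 0);
	while ((nread = read(in_fd, buf, sizeof(buf))) > 0) {
		/* Handle short writes before moving on. */
		for (off = 0; off < nread; off += nwritten) {
			nwritten = write(out_fd, buf + off,
			    (size_t)(nread - off));
			if (nwritten < 0)
				return (-1);
		}
		total += nread;
		/* The next read() overwrites buf right away. */
	}
	return (nread < 0 ? -1 : total);
}
```

With a COW mapping on the buffer's pages, the read() at the top of the
next iteration would modify pages that may still be waiting to go out on
the wire, forcing the very copy that zero copy send is trying to avoid. A
buffer declared like this one is also not guaranteed to be page aligned.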

Another problem with the current send side approach is that it requires
page sized and page aligned data in order to apply the COW mapping. Not
all data sets fit this requirement.

One way to address both of the above problems is to implement an alternate
zero copy send scheme that uses async I/O. With async I/O semantics, it
will be clear to the userland program that the buffer in question is not to
be used until it is returned from the kernel.

So with that approach, you eliminate the need to map the data
copy-on-write, and therefore also eliminate the need for the data to be
page sized and page aligned.

Receive side zero copy:

The main issue with the current receive side zero copy code is the size and
alignment restrictions.

One way to get around the restriction would be to do operations similar
to a page flip on buffers that are smaller than a page.

Another way to get around the restriction is to have the receiving client
pass buffers into the kernel (perhaps with an async I/O type interface) and
have the NIC DMA the data directly into the buffers the user has supplied.

One proposal for doing this is called RDMA. There is a problem statement
here that makes a good case for the need for an RDMA framework for TCP:

There used to be a draft standard for RDMA extensions to TCP floating
around, but I haven't been able to locate it lately.

Essentially RDMA allows for the sender and receiver to negotiate
destination buffer locations on the receiver. The sender then includes the
buffer locations in a TCP header option, and the NIC can then extract the
destination location for the payload and DMA it to the appropriate place.

One drawback to this approach is that it requires support for RDMA on both
ends of the connection.