netmap - the fast packet I/O framework

netmap is a framework for high speed packet I/O.
Together with its companion VALE software switch,
it is implemented as a single kernel module and available
for FreeBSD, Linux and now also Windows (OSX still missing,
unfortunately).
netmap supports access to
network cards (NICs), to the host stack, to virtual ports (the "VALE" switch),
and to "netmap pipes".
It can easily reach line rate on 10G NICs (14.88 Mpps),
over 30 Mpps on 40G NICs (limited by the NIC's hardware),
over 20 Mpps on VALE ports, and over 100 Mpps on
netmap pipes. There is netmap support for QEMU, libpcap (hence,
all libpcap applications can use it), the bhyve hypervisor,
the Click modular router, and a number of other applications.
You can find more information in our papers,
and in the Usenix ATC'12 presentation.

netmap/VALE can be used to build extremely fast traffic
generators, monitors, software switches and network middleboxes,
to interconnect virtual machines or processes, and to do performance
testing of high speed networking apps without the need for expensive hardware.
We have full support for libpcap, so most pcap clients can use netmap
with no modifications.

netmap, VALE and netmap pipes are implemented
as a single, non-intrusive kernel module. Native netmap
support is available for several NICs through slightly modified
drivers; for all other NICs, we provide an emulated mode
on top of standard drivers.
netmap/VALE are part of standard FreeBSD distributions,
and available in source format for Linux too.

Netmap is a standard component of FreeBSD 9 and above.
For the most up-to-date version of the code, or for support for
other operating systems (Linux and Windows), see our
source repository at https://github.com/luigirizzo/netmap .
Additional resources (including demo
OS images for FreeBSD and Linux) are also available.

netmap uses a select()-able file descriptor to support blocking I/O,
which makes it extremely easy to port applications using,
say, raw sockets or libpcap to the netmap API.
netmap achieves extremely high speeds (up to 14.88 Mpps with a single
interface using just one core at 900 MHz); similarly, VALE can
switch up to 20 Mpps per core on a virtual port.
Other frameworks (e.g. DPDK, DNA) achieve similar speeds but lack
the ease of use and portability.
On top of netmap we are building features and applications to
replace parts of the existing network stack.

netmap is a very efficient framework for line-rate raw packet
I/O from user space, capable of sustaining 14.88 Mpps on an
ordinary PC and OS. Netmap integrates some known ideas
into a novel, robust and easy to use framework that is available
on FreeBSD and Linux without the need for special hardware or
proprietary software.
With netmap, it takes as little as 60-65 clock cycles to
move one packet between the user program and the wire.
As an example, a single core running at 900 MHz
can generate the 14.8 Mpps that saturate a 10 GigE interface.
This is a 10-20x improvement over the use of a standard device driver.
The rest of this page gives a high level description of the
project.

netmap uses some well known performance-boosting techniques,
such as memory-mapping the card's packet buffers,
I/O batching, and
modeling the send and receive queues as circular buffers to match
what the hardware implements.
Unlike other systems, applications using netmap cannot
crash the OS, because they run in user space
and have no direct access to critical resources (device registers, kernel
memory pointers, etc.). The programming model is extremely
simple (circular rings of fixed size buffers), and
applications use only standard system calls: non-blocking ioctl() to
synchronize with the hardware, and poll()-able file descriptors
to wait for packet receptions or transmissions on individual queues.
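As a rough sketch of this model (simplified and illustrative only; the
authoritative definitions live in net/netmap.h and net/netmap_user.h, and
newer netmap versions replace the avail counter with head/cur/tail pointers),
the shared data structures look more or less like this:

    /* Simplified view of the shared data structures (illustrative only;
     * see <net/netmap.h> for the real definitions, which differ across
     * netmap API versions). */
    struct netmap_slot {            /* one descriptor in a ring */
            uint32_t buf_idx;       /* index of the fixed-size packet buffer */
            uint16_t len;           /* length of the packet in that buffer */
            uint16_t flags;         /* e.g. "buffer has been replaced" */
    };

    struct netmap_ring {            /* one "shadow" TX or RX ring */
            uint32_t num_slots;     /* ring size, fixed at open time */
            uint32_t cur;           /* next slot for userspace to use */
            uint32_t avail;         /* slots currently owned by userspace */
            uint16_t nr_buf_size;   /* size of each preallocated buffer */
            struct netmap_slot slot[0]; /* the descriptors themselves */
    };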

netmap can generate traffic at line rate (14.88 Mpps) on a
10 GigE link with just a single core running at 900 MHz.
This corresponds to about 60-65 clock cycles per packet, and scales
well with cores and clock frequency (with 4 cores, line rate
is achieved at less than 450 MHz).
Similar rates are reached on the receive side.
In the graph below, the two top curves (green and red) indicate
the performance of netmap on FreeBSD with 1 and 4 cores,
respectively (Intel 82599 10Gbit card). The blue curve is
the fastest available packet generator on Linux (pktgen,
works entirely in the kernel), while the purple curve on the bottom
shows the performance of a user-space generator on FreeBSD
using UDP sockets.

netmap scales well to multicore systems: individual
file descriptors can be associated with different cards,
or with queues of a multi-queue card, and can move packets between
queues without the need to synchronize with each other.

netmap implements a special device, /dev/netmap, which
is the gateway to switch one or more network cards to netmap
mode, where the card's datapath is disconnected from the operating
system.
open("/dev/netmap") returns a file descriptor that can be
used with ioctl(fd, NIOCREG, ...)
to switch an interface to
netmap mode. A subsequent mmap() exports to userspace
a replica of the TX and RX rings of the card, and the actual
packet buffers. Each "shadow" ring indicates
the number of available buffers, the current read or write index,
and the address and length of each buffer (buffers have fixed size and
are preallocated by the kernel).
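
A minimal open/register/mmap sequence looks roughly like the sketch below.
Names follow recent netmap headers (net/netmap_user.h), where the registration
ioctl is spelled NIOCREGIF and takes a struct nmreq; the interface name "ix0"
is just an example, and error checking is omitted.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    struct nmreq req;
    struct netmap_if *nifp;
    void *mem;
    int fd;

    fd = open("/dev/netmap", O_RDWR);
    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;                      /* API version check */
    strncpy(req.nr_name, "ix0", sizeof(req.nr_name)); /* card to switch */
    ioctl(fd, NIOCREGIF, &req);                       /* put ix0 in netmap mode */

    /* map the shared region containing all rings and packet buffers */
    mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    nifp = NETMAP_IF(mem, req.nr_offset);             /* interface descriptor */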

Two ioctl() calls synchronize the state of the rings between kernel
and userspace: ioctl(fd, NIOCRXSYNC) tells
the kernel which buffers have been read by userspace, and informs
userspace of any newly received packets. On the TX side,
ioctl(fd, NIOCTXSYNC) tells the kernel about new packets to
transmit, and reports to userspace how many free slots are available.
The file descriptor returned by open() can be used to
poll() one or all queues of a card, so that blocking
operation can be integrated seamlessly in existing programs.
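
Putting the pieces together, a receive loop built on poll() and the shadow
rings could look like the following sketch (continuing from the registration
sketch above; consume_packet() is a hypothetical application routine, and the
cur/avail fields and NETMAP_RING_NEXT macro belong to the classic API, which
newer versions replace with head/tail pointers and nm_ring_next()):

    #include <poll.h>

    struct netmap_ring *rxring = NETMAP_RXRING(nifp, 0); /* first RX queue */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
            poll(&pfd, 1, -1);        /* wait for packets; also syncs ring state */
            while (rxring->avail > 0) {
                    struct netmap_slot *slot = &rxring->slot[rxring->cur];
                    char *buf = NETMAP_BUF(rxring, slot->buf_idx);

                    consume_packet(buf, slot->len);   /* application-defined */
                    rxring->cur = NETMAP_RING_NEXT(rxring, rxring->cur);
                    rxring->avail--;
            }
            /* the next poll() (or an explicit ioctl(fd, NIOCRXSYNC)) tells
             * the kernel that these slots can be reused */
    }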

Receiving a packet is as simple as reading from the buffer in
the mmapped region; eventually, ioctl(fd, NIOCRXSYNC)
is used to release one or
more buffers at once. Writing to the network requires filling
one or more buffers with data, setting the lengths, and
finally invoking ioctl(fd, NIOCTXSYNC) to issue the
appropriate commands to the card.
The memory mapped region contains all rings and buffers of all
cards in netmap mode, so it is trivial to implement packet
forwarding between interfaces.
Zero-copy operation is also possible, by simply
writing the address of the received buffer into a slot
of the transmit ring.
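
A zero-copy forwarding step then amounts to swapping buffer indices between
an RX slot of one card and a TX slot of another, roughly as below (same
caveats as the previous sketches; txring would be obtained with
NETMAP_TXRING() on the second card's descriptor):

    /* move one packet from rxring (card A) to txring (card B) without copying */
    struct netmap_slot *rs = &rxring->slot[rxring->cur];
    struct netmap_slot *ts = &txring->slot[txring->cur];
    uint32_t tmp = ts->buf_idx;

    ts->buf_idx = rs->buf_idx;      /* hand the received buffer to the TX ring */
    ts->len = rs->len;
    ts->flags |= NS_BUF_CHANGED;    /* tell the kernel the buffer has changed */
    rs->buf_idx = tmp;              /* recycle the old TX buffer on the RX ring */
    rs->flags |= NS_BUF_CHANGED;

    rxring->cur = NETMAP_RING_NEXT(rxring, rxring->cur);
    rxring->avail--;
    txring->cur = NETMAP_RING_NEXT(txring, txring->cur);
    txring->avail--;
    /* a later ioctl(fd, NIOCTXSYNC) or poll() pushes the packet out on card B */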

In addition to the "hardware" rings, each card
in netmap mode exposes two additional rings that
connect to the host stack. Packets coming from the stack are
put in an RX ring where they can be processed in the same way
as those coming from the network. Similarly, packets written
to the additional TX ring are passed up to the host stack when
the ioctl(fd, NIOCTXSYNC) is invoked. Zero-copy bridging
between the host stack and the card is then possible
in the same way as between two cards. In terms of performance,
using the card in netmap mode and bridging in software
is often more efficient than
using standard mode, because the driver uses simpler and
faster code paths.
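
The host rings are selected at registration time. With the classic API this
is done through the nr_ringid field of struct nmreq, as in the sketch below
(newer versions express the same request with nr_flags = NR_REG_SW; the
interface name is again just an example):

    /* a separate descriptor bound only to the host-stack rings of ix0 */
    int hfd = open("/dev/netmap", O_RDWR);
    struct nmreq hreq;

    memset(&hreq, 0, sizeof(hreq));
    hreq.nr_version = NETMAP_API;
    strncpy(hreq.nr_name, "ix0", sizeof(hreq.nr_name));
    hreq.nr_ringid = NETMAP_SW_RING;    /* host rings only */
    ioctl(hfd, NIOCREGIF, &hreq);
    /* after mmap()ing as before, the host TX/RX rings are used exactly like
     * hardware rings, so the zero-copy buffer swap shown above also bridges
     * between the host stack and the NIC */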

Programs using netmap do not need any special library or knowledge
of the inner details of the network controller. Not only is the ring and buffer
format independent of the card itself, but any
operation that requires programming the card is done entirely
within the kernel.

When talking about performance it is important to understand
what the relevant metrics are. I won't go into a long discussion
here; please have a look at the papers
for a more detailed discussion and up to date numbers.
In short:

if you only care about "system" overheads (i.e. the time
to move a packet between the wire and the application), then the
packet generator and receiver should be as simple as possible
(maybe not even touching the data). The program pkt-gen that you
find in the distribution implements exactly that: outgoing
packets are prepared once so they can be sent without being regenerated
every time, and incoming packets are counted but not read;

if you have a more complex application, it might need to
send and receive at the same time, look at the payload, etc.
There are two "bridge" applications in the distribution that
transparently pass traffic between interfaces. The one called
bridge uses the native netmap API and can do zero-copy
forwarding; another one, called testpcap, uses a wrapper
library that implements pcap calls on top of netmap. Among
other things, it copies the packet on the outgoing
link, and it reads a timestamp (in the kernel) at each syscall,
because certain pcap clients expect packets to be timestamped.
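
To give an idea of what "pcap calls on top of netmap" means in practice, a
completely ordinary libpcap program such as the sketch below runs unmodified
over the netmap-backed pcap library (the interface name "ix0" is just an
example):

    #include <pcap/pcap.h>
    #include <stdio.h>

    /* standard libpcap capture loop; nothing netmap-specific here */
    static void handler(u_char *user, const struct pcap_pkthdr *h,
                        const u_char *bytes)
    {
            (void)user; (void)bytes;
            printf("got a packet of %u bytes\n", h->caplen);
    }

    int main(void)
    {
            char errbuf[PCAP_ERRBUF_SIZE];
            pcap_t *p = pcap_open_live("ix0", 65535, 1, 100, errbuf);

            if (p == NULL) {
                    fprintf(stderr, "pcap_open_live: %s\n", errbuf);
                    return 1;
            }
            pcap_loop(p, -1, handler, NULL);
            pcap_close(p);
            return 0;
    }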

Transmit and receive speed is shown in the previous
section, and is relatively uninteresting as we go at line
rate even with a severely underclocked CPU.

More interesting is what happens when you touch the data.

netmap can forward packets at line rate (14.88 Mpps) at 1.7 GHz without
touching data, and slightly slower with full data copies.
As a comparison, native packet forwarding using the in-kernel bridge
does about 700 Kpps on the same hardware. The comparison is a
bit unfair, because our bridge and testpcap do not do address lookups;
however, we have some real forwarding code (a modified version of
openvswitch) that does almost 3 Mpps using netmap.