Overview

This paper describes how to achieve lossless gigabit packet capture to
disk with unmodified Linux on ordinary/modest PC hardware, and how to
capture packets remotely on a campus network (without connecting a
capture box to the remote network).

My software, which does both, is freely available and is called Gulp
(visualize drinking quickly from the network firehose).

By publishing this paper, I hope to efficiently share my code, methods
and insight with others interested in doing this, and to shed light on
limitations in the Linux code base which hopefully can be fixed so Gulp
is no longer needed.

Background

At the University of Washington, we have a large network with many
hundreds of subnets and close to 120,000 IP devices on our campus
network. Sometimes it is necessary to look at network traffic to
diagnose problems. Recently, I began a project to allow us to capture
subnet-level traffic remotely (without having to physically connect
remotely) to make life easier for our Security and Network Operations
groups and to help diagnose problems more efficiently.

Our Cisco 7600 routers have the ability to create a limited number
of "Encapsulated Remote SPAN ports" (ERSPAN ports) which are similar to
mirrored switch ports except the router "GRE" encapsulates the packets
and sends them to an arbitrary IP address. (GRE is in quotes because
the Cisco header, at 50 bytes, is larger than the standard GRE header,
so Linux and/or unmodified tcpdump cannot correctly decapsulate it.)

Because the router will send the "GRE" encapsulated packets without any
established state or confirmation on the receiver (as if sending UDP), I
don't need to establish a tunnel on Linux to receive the packets. I
initially wrote a tiny (30-line) proof-of-concept decapsulator in C
which could postprocess a tcpdump capture, stripping the extra
encapsulation so that ordinary tools could read the result.

My initial measurements indicated that the percentage of dropped
packets and CPU overhead of writing through the conversion program and
then to disk were not significantly higher than writing directly to disk
so I thought this was a reasonable plan. On my old desktop workstation
(a 3.2GHz P4 Dell Optiplex 270 with a slow 32-bit PCI bus and a built-in
10/100/1000 Intel 82540EM NIC) running Fedora Core 6 Linux (2.6.19
kernel, ethtool -G eth0 rx 4096), I could capture
and save close to 180Mb/s of iperf traffic with about
1% packet loss, so it seemed worth pursuing. Partly to facilitate this
and partly for unrelated reasons, I bought a newer/faster office PC.

What Did and Didn't Work

To my surprise, my new office PC (a Dell Precision 690 with 2.66 GHz
quad-core Xeon x5355, PCI-Express-based Intel Pro-1000-PT NIC, faster
RAM and SATA disks) running the same (Fedora Core 6) OS, initially
dropped more packets than my old P4 system did, even though each of the
four CPU cores does about 70% more work than my old P4 (according to my
benchmarks). I spent a long time trying to tune the OS by changing
various parameters in /proc and
/sys, trying to tune the e1000 NIC driver's tunable
parameters and fiddling with scheduling priority and processor affinity
(for processes, daemons and interrupts). Although the number of
combinations and permutations of things to change was high, I gradually
made enough progress that I continued down this path for far too long
before discovering the right path.

Two things puzzled me:
"xosview"
(a system load visualization tool) always showed plenty of idle
resources when packets were dropped, and writing packets to disk seemed
to have a disproportionate impact on packet loss, especially when the
system buffer cache was full.

It eventually occurred to me to try to decouple disk writing from packet
reading. I tried piping the output of the capturing tcpdump program into an
old (circa 1990) tape
buffering program (written by Lee McLoughlin) which ran as two processes
with a small shared-memory ring buffer. Remarkably, piping the output through
McLoughlin's buffer program caused tcpdump to drop fewer packets. Piping
through "dd" with any write size and/or buffer size or through "cat" did not
provide any improvement. My best guess as to why McLoughlin's buffer helped is
that even though the select(2) system call says writes to disk never block,
they effectively do. When the writes block, tcpdump can't read packets from
the kernel quickly enough to prevent the NIC's buffer from overflowing.

A quick look at the code in McLoughlin's buffer program convinced me I
would do better starting from scratch so I wrote a simple multi-threaded
ring-buffer program (which became Gulp). For both simplicity and efficiency
under load, I designed it to be completely lock-free. The multi-threaded ring
buffer worked remarkably well and considerably increased the rate at which I
could capture without loss but, at higher packet rates, it still dropped
packets--especially while writing to disk.

I emailed Luca Deri, the author
of Linux's PF_RING NIC driver,
and he (correctly) suggested that it would be easy to
incorporate the packet capture into the ring buffer program itself
(which I did). This ultimately was a good idea but initially didn't
seem to help much. Eventually I figured out why: the Linux scheduler
sometimes scheduled both my reader and writer threads on the same
CPU/core which caused them to run alternately instead of simultaneously.
When they ran alternately, the packet reader was again starved of CPU
cycles and packet loss occurred. The solution was simply to explicitly
assign the reader and writer threads to different CPU/cores and to
increase the scheduling priority of the packet reading thread. These
two changes improved performance so dramatically that dropping any
packets on a gigabit capture, written entirely to disk, is now a rare
occurrence and many of the system performance tuning hacks I resorted to
earlier have been backed out. (I now suspect they mostly helped by
indirectly influencing process scheduling and cpu affinity--something I
now control directly--however on systems with more than
two CPU cores, the
inter-core-benchmark I developed may still be helpful to determine which
cores work most efficiently together).

Performance of Our Production System

Our (pilot) production system for gigabit remote packet capture is a
Dell PowerEdge model 860 with a single Intel Core2Duo CPU (x3070) at
2.66 GHz (hyperthreading disabled) running RedHat Enterprise Linux 5
(RHEL5 2.6.18 kernel). It has 2GB RAM, two WD2500JS 250GB SATA drives
in a striped ext2 logical volume (essentially software RAID 0 using LVM)
and an Intel Pro1000 PT network interface (NIC) for packet capture.
(The builtin BCM5721 Broadcom NICs are unable to capture the slightly
jumbo frames required for Cisco ERSPAN--they may work for non-jumbo
packet capture but I haven't tested them. The Intel NIC does consume a
PCI-e slot but costs only about $40.)

A 2-minute capture of as much
iperf data as I can
generate into a 1Gb ERSPAN port (before the ERSPAN link saturates and
the router starts dropping packets) results in a nearly 14GB pcap file
usually with no packets dropped by Linux. The packet rate for that
traffic is about 96k pps avg. The router port sending the ERSPAN
traffic was nearly saturated (900+Mb/s) and the sum of the average iperf
throughputs was 818-897Mb/s (but unlike ethernet, I believe iperf
counts only payload bits and reports them in 1024^2 "millions", so this
translates to 857-940Mb/s in the decimal millions ethernet uses, not
counting packet headers). Telling iperf to use smaller packets, I was
able to capture all packets at 170k pps avg, but with the hardware at my
disposal I could only saturate about 2/3 of the gigabit network using
iperf and small packets.

A subsequent test using a "SmartBits" packet generator to roughly
84% saturate the net with 300-byte packets indicates I can capture and
write to disk 330k pps without dropping any packets. Interestingly the
failure mode at higher packet rates is that there is insufficient CPU
capacity left to empty Gulp's ring buffer as fast as it fills. Gulp did
not start dropping packets until its ring buffer eventually filled.
This demonstrates that Linux can be very
successful at capturing packets at high speed and delivering them to
user processes as long as the reading process can read them from the
kernel fast enough that the NIC-driver's relatively small ring
buffer does not overflow. At very high packet rates, even though the
e1000 NIC driver does interrupt aggregation,
xosview indicated that much of
the CPU was consumed with "hard" and "soft" interrupt processing.

In summary, I believe that as long as the average packet size is 300
bytes or more, our system should be able to capture and write to disk
every packet it receives from a gigabit ethernet. The larger the average
packet size, the more CPU headroom is available and the more certain
lossless capture becomes.

I should mention that I have been using Shawn Ostermann's
"tcptrace"
program to confirm that when tcpdump or Gulp reports that the kernel
dropped no packets, this is indeed true. Likewise, when the tools
report the kernel dropped some packets, tcptrace agrees. This means I have
complete confidence in my claims above for capturing iperf data without
loss. Although the SmartBits did not generate TCP traffic, it offered
counts of how many packets it sent which agree with what was captured.

Suggestions for improvements to the Linux code base

Normally if one is interested in capturing only a subset of the
traffic on an interface, the pcap library can filter out the uninteresting
packets in the kernel (as early as possible) to avoid the overhead of
copying them into userspace and then discarding them.

Unfortunately, because neither the kernel's GRE tunnel mechanism nor the
pcap code seems to be capable of decapsulating GRE packets with a
non-standard header length (50 bytes in this case) and then applying normal
pcap filters to what remains, I can do no in-kernel filtering on the contents
of the ERSPAN packets--they must all be copied to userspace, decapsulated
and then filtered again by tcpdump (wireshark or equivalent) as per
examples #4-6 above.

Extending either the pcap code or the GRE tunnel mechanism to filter
these packets in the kernel would make capturing a subset of the traffic
more efficient. I have not measured the overhead of
"ip tunnel", but I presume doing this in the pcap code would be simplest
and most efficient.

Perhaps select(2) should not always claim that a descriptor for an open
disk file will not block for write(2); alternatively, perhaps writes can
be made fast enough that they agree with select(2) and effectively don't
block.

Future Work

To my surprise, I learned after completing this work that Luca
Deri's PF_RING patch is NOT already incorporated in the standard Linux kernel
(as I mistakenly thought) and the packet "ring buffer" that "ethtool"
adjusts is something different. Though this misunderstanding is
somewhat embarrassing to me, it seems likely that the benefits of Gulp
and PF_RING will be cumulative, and since my next obvious goal is 10Gb,
I look forward to confirming that.