Intel Pro/1000 Small Packet Performance

This is a summary of what I've learned about the Intel Pro/1000
gigabit ethernet card's performance with small (60 byte) packets. The
board can receive 680,000 packets/second, and send at 500,000 or
840,000 p/s, depending on model. However, the Intel Linux drivers can
drive it at only about half those rates.

Small-packet performance matters if you wish to build routers, or
router-like boxes such as NATs. In such situations the average packet
size is likely to be about 200 bytes. Many gigabit ethernet board
designs (and much of the marketing) seem to focus instead on 1500- or
9000-byte packets.

Test Configuration

The test results here actually involve two versions of the Pro/1000.
The receiver is model PWLA8490, which has a 33 MHz, 64-bit PCI
interface. The sender is model PWLA8490SX, often called the "Pro/1000
F Server Adapter"; it has a 66 MHz, 64-bit PCI interface. You probably
want to buy the PWLA8490SX. It seems to be able to send almost twice
as fast as the PWLA8490, but doesn't receive any faster.

The test machines are PCs with SuperMicro 370DL3 motherboards, 800 MHz
Pentium III CPUs, a 133 MHz front-side bus, and 256 MB of PC133 memory.
This motherboard has the ServerWorks ServerSet LE chipset and 64-bit
PCI slots. The machines have two CPUs but are running a Linux kernel
with SMP support turned off.

The machines are running Linux 2.2.16. The networking code, however,
is the Click software
router toolkit. Depending on the precise configuration, Click replaces
some or all of the Linux kernel networking code. The point of using
Click is that it can send and receive packets much faster than any
user-level program, because it runs in the kernel. The send software
for these experiments sends UDP packets at a controlled rate. The
receive software just counts and discards packets. Each packet is 60
bytes long, including the 14-byte ethernet header.

The Pro/1000 driver is based on Intel's version 2.5.11, which is
available on the web.

The two machines involved are directly connected with a fiber cable.
Link-level flow control is disabled for all the tests.

Transmit Performance

The unmodified Intel driver can send up to 260,000 packets/second with
the PWLA8490, and 340,000 p/s with the PWLA8490SX. The limiting
factor seems to be that the board uses "delayed" transmit-complete
interrupts: it probably interrupts only once per fixed delay period.
Since the transmit queue holds only 80 packets, the board can send at
most 80 packets per delay period. The details are hard to pin down,
since Intel doesn't make board documentation available.

After modifying the driver to explicitly ask the board for a transmit-
complete interrupt every 60 packets, the PWLA8490 is able to send
523,000 p/s, and the PWLA8490SX about 840,000 p/s. The actual fix was
to leave the E1000_TXD_CMD_IDE bit turned off in every 60th transmit
descriptor, so that descriptor's interrupt is not delayed.
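
The descriptor-level change can be sketched as follows. The bit names
follow Intel's driver source, but the numeric values here and the helper
function itself are illustrative, not the actual patch:

```c
#include <stdint.h>

/* Transmit-descriptor command bits, as named in Intel's driver.
 * The numeric values below are assumptions for this sketch. */
#define E1000_TXD_CMD_EOP  0x01000000u  /* end of packet */
#define E1000_TXD_CMD_RS   0x08000000u  /* report status */
#define E1000_TXD_CMD_IDE  0x80000000u  /* delay this descriptor's interrupt */

#define TX_INT_EVERY 60                 /* force an interrupt every 60 packets */

/* Return the command bits for the nth transmitted packet (n from 0).
 * Leaving IDE off in every 60th descriptor makes the board interrupt
 * promptly for that packet instead of waiting for the delay timer,
 * so transmit descriptors are reclaimed often enough to keep the
 * 80-entry queue from draining dry. */
uint32_t tx_desc_cmd(unsigned long n)
{
    uint32_t cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_RS | E1000_TXD_CMD_IDE;
    if (n % TX_INT_EVERY == TX_INT_EVERY - 1)
        cmd &= ~E1000_TXD_CMD_IDE;      /* ask for an immediate interrupt */
    return cmd;
}
```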

Receive Performance

This graph shows the number of packets per second delivered to the
receiving software as a function of the rate at which packets are sent
to a PWLA8490 card:

The Original line corresponds to the unmodified Intel driver. It can
receive about 300,000 p/s. At higher input rates it seems to
experience interrupt livelock -- the card interrupts for every
received packet, and the cost of the interrupt handling prevents the
CPU from performing any other processing.

The driver source includes code to ask the card to delay interrupts,
but that code isn't turned on. The relevant variable is
e1000_rxint_delay. It appears to be the number of microseconds
between interrupts. The receive DMA queue length is set by
MAX_RFD, so the maximum receive rate should be about
MAX_RFD packets per delay period. Unfortunately these
parameters probably have to be tuned for each specific workload.
The delay period should be long enough that the CPU can completely
process MAX_RFD packets per delay, including user-level
processing if appropriate. If the delay is too low, the CPU
will experience livelock and get no work done. If the delay is
too high, the card will discard packets even though the CPU
is idle.
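
The tuning constraint above comes down to simple arithmetic. This
sketch is my own back-of-the-envelope model, not driver code:

```c
/* Upper bound on the sustainable receive rate implied above: the card
 * interrupts at most once per delay period, and each interrupt can
 * hand the host at most one ring's worth (MAX_RFD) of packets. */
unsigned long max_rx_rate(unsigned long ring_size, unsigned long delay_usec)
{
    return ring_size * 1000000UL / delay_usec;
}

/* Per-packet CPU budget in nanoseconds: the delay period divided by
 * the number of packets that must be drained within it. */
unsigned long per_packet_budget_ns(unsigned long ring_size,
                                   unsigned long delay_usec)
{
    return delay_usec * 1000UL / ring_size;
}
```

With MAX_RFD at 80 packets and a 128-microsecond delay, the bound is
625,000 p/s and the budget is 1.6 microseconds per packet.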

I found that leaving MAX_RFD at 80 packets and setting the
receive interrupt delay to 128 (apparently microseconds, the same
value as the transmit delay) worked well. That allows about 1.6
microseconds of processing time per packet, which is enough for my
receive software to count and discard a packet. The resulting
behavior is shown by the Tuned line in the graph
above. Note that the receive rate goes up to 450,000 p/s, but then
descends. I wasn't able to find MAX_RFD and delay values
that prevented the decline. This is too bad -- part of the point of
delayed interrupts is to prevent livelock, but it doesn't seem to
work.

The Polling line in the graph describes a setup in which the card
doesn't interrupt at all. Instead, the Click software polls the card
for new packets, fully processes them, and only then polls for more
packets. This prevents livelock as well as avoiding interrupt
overhead, so the driver can receive (and process) 680,000 p/s even
when overloaded with input.
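
The polling discipline can be illustrated with a small simulation. The
ring and the helper functions here are hypothetical stand-ins for the
driver's DMA ring and Click's hooks, which the text does not spell out:

```c
#include <stddef.h>

#define RING_SIZE 64
#define BATCH 8

/* Simulated receive ring standing in for the NIC's DMA ring. */
static int ring[RING_SIZE];
static size_t ring_head, ring_tail;

/* Pretend the NIC DMA'd a packet into the ring. */
int ring_put(int pkt)
{
    size_t next = (ring_tail + 1) % RING_SIZE;
    if (next == ring_head)
        return -1;                      /* ring full: the card would drop */
    ring[ring_tail] = pkt;
    ring_tail = next;
    return 0;
}

/* Poll up to `max` packets out of the ring; returns how many. */
static size_t rx_poll(int *pkts, size_t max)
{
    size_t n = 0;
    while (n < max && ring_head != ring_tail) {
        pkts[n++] = ring[ring_head];
        ring_head = (ring_head + 1) % RING_SIZE;
    }
    return n;
}

unsigned long packets_processed;
static void process_packet(int pkt) { (void)pkt; packets_processed++; }

/* The discipline described above: take a batch, fully process it, and
 * only then poll for more.  With no interrupts there is no livelock:
 * under overload the CPU still spends its time processing packets. */
void rx_poll_loop(void)
{
    int batch[BATCH];
    size_t n;
    while ((n = rx_poll(batch, BATCH)) > 0)
        for (size_t i = 0; i < n; i++)
            process_packet(batch[i]);
}
```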

Conclusions

The Pro/1000 hardware can receive 680,000 packets/second, and send
500,000 or 840,000 p/s, depending on model. On the one hand, these
numbers are far from saturating a gigabit link (about 1.4 million
60-byte packets/second). On the other hand, they are much better than
what I've seen from the two other gigabit cards I've used. The Alteon
Tigon-II seems to be
limited to sending about 100,000 packets per second, possibly due to
the firmware taking about 5 microseconds to load each DMA descriptor
(see page 70, section 4.4, of the Host/NIC Software Interface
Definition). I'm able to send 250,000 p/s with the SysKonnect SK-9843.
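
The "about 1.4 million" line-rate figure follows from per-frame
overhead on the wire. A quick check, using the standard Ethernet
overheads (my arithmetic, not a number from the source):

```c
/* Packets per second at gigabit line rate for a given frame size.
 * On the wire each frame also carries a 4-byte CRC, an 8-byte
 * preamble, and a 12-byte minimum interframe gap. */
unsigned long gige_pps(unsigned long frame_bytes)
{
    unsigned long wire_bits = (frame_bytes + 4 + 8 + 12) * 8;
    return 1000000000UL / wire_bits;
}
```

For 60-byte frames this gives just under 1.5 million packets/second,
the same ballpark as the rough figure quoted above.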

It's too bad the Intel Linux driver can only achieve about half the
board's potential. It's also disappointing that the delayed interrupt
mechanism seems to require manual tuning, and that it doesn't prevent
livelock.

My modified version of the Intel 2.5.11 driver is also available on
the web. My modifications support Click's polling and simplify the
code to help get rid of some locking and increase concurrency. I could
easily have introduced bugs, so don't use my driver if you can't
tolerate problems.

Note that since I don't have a manual for the Intel board, I may be
misunderstanding its behavior. And it could easily be the case that
the Intel, Alteon, and SysKonnect hardware could perform better than
I've suggested here with better drivers or with a different test
strategy. So take my results and explanations with a grain of salt.