Improving the Network Interfaces for Gigabit Ethernet
in Clusters of PCs by Protocol Speculation
Christian Kurmann, Michel Muller, Felix Rauch and Thomas M. Stricker
Laboratory for Computer Systems
Swiss Federal Institute of Technology (ETH)
CH-8092 Zurich, Switzerland
{kurmann,rauch,tomstr}@inf.ethz.ch
http://www.inf.ethz.ch/
CS Technical Report #339
Modern massively parallel computers are built from commodity
processors and memories that are used in workstations and PCs. A
similar trend towards commodity components is visible for the
interconnects that connect multiple nodes in clusters of PCs. Only a
few years ago the market was dominated by highly specialized
supercomputer interconnects (e.g. in a Cray T3D). Today's networking
solutions are still proprietary, but they do connect to standard buses
(e.g. as a PCI card). In the future the networking solutions of the
Internet (e.g. Gigabit Ethernet) will offer Gigabit speeds at lower
costs. Commodity platforms offer good compute performance, but they
cannot yet fully utilize the potential of Gigabit/s communication
technology, at least as long as commodity networks like Ethernet with
standard protocols like TCP/IP are used. While the speed of Ethernet
has grown from 10 to 1000 Mbit/s, the functionality and the
architectural support in the network interfaces have not kept up, and
the driver software becomes a limiting factor.
Network speeds are catching up rapidly to the streaming speed of
main memory. Therefore a true zero-copy network interface architecture
is required to sustain the raw network speed in applications. Many
common Gigabit Ethernet network protocol stacks are called zero-copy
at the OS level, but upon a closer look they are not really zero-copy
down to the hardware level, since there remains a last copy in the
driver for the fragmentation/defragmentation of the transferred
network packets that are smaller than a page.
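
To see why this last copy arises, it helps to work through the sizes
involved. The following sketch assumes values that are not stated
above: a 4096-byte memory page, the standard 1500-byte Ethernet MTU,
and a 20-byte IPv4 header without options. Transport headers are
ignored, and the function name is purely illustrative.

```python
def fragment_payload_sizes(datagram_payload, mtu=1500, ip_header=20):
    """Split an IP payload into per-fragment payload sizes.

    Every fragment except the last carries a multiple of 8 bytes,
    because IPv4 expresses fragment offsets in 8-byte units.
    """
    per_fragment = (mtu - ip_header) // 8 * 8
    sizes = []
    remaining = datagram_payload
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


PAGE_SIZE = 4096  # assumed page size, not taken from the report
sizes = fragment_payload_sizes(PAGE_SIZE)
```

Under these assumptions one page travels as three frames carrying
1480, 1480 and 1136 bytes of payload. No single received frame fills a
page, so the pieces have to be merged somewhere before the page can be
handed to the application.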
Defragmenting all the packets of various communication protocols
correctly in hardware remains an extremely complex task, resulting in
a large amount of additional circuitry to be incorporated into
existing commodity hardware. Therefore we consider the different route
of studying and implementing a speculative defragmentation technique
that can eliminate the last defragmenting copy operation from
zero-copy TCP/IP stacks on existing hardware. The speculative
technique shows even greater potential for improved efficiency once
the present network interfaces are enhanced by a few protocol matching
registers with a simple control path to the DMA engines.
The payload of fragmented packets is separated from the headers and
stored into a memory page that can be mapped directly to its final
destination in user memory. The checks for correctness and compliance
with the protocol are deferred until, after several packets, an
interrupt for protocol processing is taken. The success of a
speculative approach suggests that a modest hardware addition to a
current Gigabit Ethernet adapter design (e.g. the Hamachi chip) is
sufficient to provide a high speed data path for zero-copy bulk
transfers.
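
The receive path described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not
the driver itself: the Fragment record, its field names, the 4096-byte
page size and the fallback routine are all hypothetical. Payloads are
written back to back into a page as they arrive, headers are set
aside, and the protocol check runs once at the end, standing in for
the deferred interrupt.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size


@dataclass(frozen=True)
class Fragment:
    datagram_id: int  # hypothetical stand-in for the IP identification field
    offset: int       # byte offset of this payload within the datagram
    payload: bytes


def fallback_reassemble(fragments, datagram_id):
    """Regular path: copy each matching payload to the offset its header names."""
    buf = bytearray(PAGE_SIZE)
    for frag in fragments:
        end = frag.offset + len(frag.payload)
        if frag.datagram_id == datagram_id and end <= PAGE_SIZE:
            buf[frag.offset:end] = frag.payload
    return bytes(buf)


def speculative_receive(fragments):
    """Return (page, hit), where hit tells whether the speculation held."""
    page = bytearray(PAGE_SIZE)
    headers = []
    cursor = 0
    # Speculative phase: store payloads consecutively, without inspecting
    # the headers, as a DMA engine would.
    for frag in fragments:
        end = min(cursor + len(frag.payload), PAGE_SIZE)
        page[cursor:end] = frag.payload[:end - cursor]
        headers.append((frag.datagram_id, frag.offset, len(frag.payload)))
        cursor = end
    # Deferred check: same datagram, in order, exactly one page in total.
    expected = 0
    hit = bool(headers)
    for ident, offset, length in headers:
        if ident != headers[0][0] or offset != expected:
            hit = False
            break
        expected += length
    hit = hit and expected == PAGE_SIZE
    if hit:
        return bytes(page), True  # page is usable as is, no copy needed
    if not headers:
        return bytes(page), False
    # Misprediction: fall back to copying for the first datagram seen.
    return fallback_reassemble(fragments, headers[0][0]), False


data = bytes(range(256)) * 16  # 4096 bytes of test data
frags = [
    Fragment(7, 0, data[:1480]),
    Fragment(7, 1480, data[1480:2960]),
    Fragment(7, 2960, data[2960:]),
]
page_ok, hit_ok = speculative_receive(frags)
page_bad, hit_bad = speculative_receive([frags[1], frags[0], frags[2]])
```

In the good case the page filled during the speculative phase is
already in its final layout. In the reordered case the deferred check
fails and the data is recovered by the copying fallback, so the result
stays correct and only the speed advantage is lost.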
For an evaluation of our ideas we integrated a network interface
driver with speculative defragmenting into existing zero-copy protocol
stacks with page remapping, fbufs or user/kernel shared
memory. Performance measurements indicate that we can improve
performance over the standard Linux 2.2 TCP/IP by a factor of 1.5 to 2
for uninterrupted burst transfers. Based on those implementations we
can present real measurement data on how simple protocol matching
hardware could improve the performance of bulk transfers with a
commodity Gigabit Ethernet interface.
As with any hardware solution using speculative techniques, a fairly
accurate prediction of the good case (i.e. that the incoming packets
of a sequence are consecutive) is required. We show success rates of
uninterrupted bulk transfers for a database and a scientific
computation code on a cluster of PCs. The hit rate can be greatly
improved with a simple matching mechanism in the network interface
that separates packets suitable for zero-copy processing from
other packets to be handled by a regular protocol stack.
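
One way to picture such a matching mechanism is as a register holding
a few header values, with unset fields acting as wildcards. The sketch
below is an illustration only: the register layout, the field names
and the chosen header fields are assumptions and do not describe any
existing adapter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PacketHeader:
    protocol: int      # e.g. 6 for TCP, 17 for UDP
    src: str           # source address
    is_fragment: bool  # True if the packet is an IP fragment


@dataclass(frozen=True)
class MatchRegister:
    # None means "do not care" for that field (hypothetical layout).
    protocol: Optional[int] = None
    src: Optional[str] = None
    is_fragment: Optional[bool] = None

    def matches(self, hdr):
        pairs = (
            (self.protocol, hdr.protocol),
            (self.src, hdr.src),
            (self.is_fragment, hdr.is_fragment),
        )
        return all(want is None or want == got for want, got in pairs)


def split_by_match(headers, reg):
    """Separate packets for the zero-copy path from those for the regular stack."""
    zero_copy, regular = [], []
    for hdr in headers:
        (zero_copy if reg.matches(hdr) else regular).append(hdr)
    return zero_copy, regular


reg = MatchRegister(protocol=6, src="10.0.0.2", is_fragment=True)
incoming = [
    PacketHeader(6, "10.0.0.2", True),
    PacketHeader(17, "10.0.0.9", False),  # unrelated traffic
    PacketHeader(6, "10.0.0.2", True),
]
zero_copy, regular = split_by_match(incoming, reg)
```

With the unrelated packet diverted to the regular stack, the two
remaining fragments reach the speculative path as an uninterrupted
sequence, which is the situation the prediction relies on.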