Cluster Networking: The Dark Side of IP over Ethernet

The cruel truth about IP datagrams and other things you may have forgotten (or never learned).

In the last column we learned some things that every Cluster Engineer should know about Ethernet and the Internet Protocol (IP). The former specification, recall, is defined by IEEE documents that are "open" but not freely re-publishable; the latter by fully open RFCs that you can read yourself for free and from which I can actually cut and paste while describing them. The article contained a synopsis of information from RFC 791 (IP) and referenced RFC 792 (ICMP) and RFC 894 (IP over Ethernet).

Of course when you read that article (studied it, really) you noticed that the smallest packet that can deliver a single byte of actual data via IP over Ethernet is exactly 64 bytes, the smallest permitted Ethernet packet size. Of these 64 bytes, 18 bytes are Ethernet header and CRC, 20 bytes are IP header, one byte is data, and the rest (25 bytes, about 40% of the packet) is "padding" (although in practice it will generally be at least partially used for higher level headers, e.g. the TCP headers discussed below).

It is worthwhile to spend a moment meditating upon this cruel truth. The ratio of 63 bytes of mandatory envelope (40% of which might well be "blank paper") to one byte of message is one reason that IP over Ethernet is a poor choice for cluster designs and applications that are expected to send lots of small messages, and we haven't even gotten to the TCP layer yet (which uses some of the wasted padding for its own header but doesn't alter the 63:N ratio of overhead to message for sending small messages with N bytes).
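
The arithmetic behind this ratio can be sketched in a few lines of Python. This is only an illustration of the byte counts quoted above: the function name `frame_budget` and its `upper_header` parameter are hypothetical, and the 18 byte Ethernet figure is taken as a 14 byte header plus a 4 byte CRC.

```python
ETH_OVERHEAD = 18  # Ethernet header plus CRC, as counted in the text
IP_HEADER = 20     # IP header without options
MIN_FRAME = 64     # smallest permitted Ethernet packet


def frame_budget(data_bytes, upper_header=0):
    """Return (packet size, padding bytes, overhead bytes) for one packet.

    upper_header is the size of any higher level header (hypothetical
    parameter for this sketch, e.g. 20 for a TCP header without options).
    """
    unpadded = ETH_OVERHEAD + IP_HEADER + upper_header + data_bytes
    frame = max(unpadded, MIN_FRAME)   # short packets are padded up to 64
    padding = frame - unpadded
    overhead = frame - data_bytes      # everything that is not message
    return frame, padding, overhead


if __name__ == "__main__":
    for n in (1, 8, 26, 100):
        frame, padding, overhead = frame_budget(n)
        print(f"{n:4d} data bytes -> {frame:4d} byte packet, "
              f"{padding:2d} bytes padding, overhead {overhead}:{n}")
```

For a single data byte, `frame_budget(1)` gives a 64 byte packet with 25 bytes of padding and 63 bytes of overhead. Passing `upper_header=20` eats into the padding (5 bytes left) but leaves the 63:1 ratio unchanged, which is the point made above about TCP.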

While we are considering cruel truths, we also should recall that (counted or not) there are 8 more bytes of metaphorical bell-ringing time required to raise the carrier and grab the line to send any packet at all, and that the probability of collisions goes way up if our metaphorical room full of individuals with messages for one another have to shout their messages one word at a time. These observations have led to the development of quite a number of alternative network protocols that transmit packets in rings, use much smaller headers, or do more with dedicated hardware. In future columns we will probably get around to looking at some of these efficient but expensive networks.

Fortunately (or not) on many systems the ratio of header size to data size turns out to be nearly irrelevant because other elements, many of them hardware based, determine the irreducible packet latency (the absolute minimum time between Ethernet packets of minimum size). For example, between two systems on my home network (switched 100BaseT) the transmission latency for a 1 byte message (64 byte packet plus 8 bytes worth of preamble) is around 50 microseconds according to NPtcp. (NetPIPE was discussed in the Right Stuff column in the very first issue of ClusterWorld, December 2003, but we'll come back to it in this column fairly soon, as it is a critical network benchmarking tool.)

Achieving 50 microseconds, much of it fixed by the switch and hardware independent of the protocol stack, is actually quite good, as these things go, for a minimum length TCP/IP packet on Ethernet containing a single byte of actual data. For a packet size roughly twice the minimum the latency is only roughly 68 microseconds, considerably less than twice 50. This is good news and bad news. The good news is that the bandwidth grows rapidly with packet size, as it costs only 18 more microseconds to send some 80 times as much data (we'll learn below just how to compute the TCP data capacity of a 128 byte Ethernet packet), and that this slow growth in latency continues until one approaches packet sizes that saturate the medium.

The bad news is that this is really quite poor in absolute terms. The interface can send at most 20,000 minimum size packets per second, or around 20 KBytes/second, on a network that can carry 10 MBytes/second for large packet sizes. 50 microseconds translates to 50,000 to 150,000 instructions on modern CPUs. In many cluster applications the actual computation will block, wasting all these cycles, while waiting for communications to complete.
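
These figures follow directly from the 50 microsecond latency, and the arithmetic can be checked with a minimal sketch. The function names are hypothetical, and the cycle estimate assumes a 1 to 3 GHz clock retiring roughly one instruction per cycle, which is an assumption of this sketch rather than a measurement.

```python
def packets_per_second(latency_us):
    """Upper bound on back-to-back packets, given per-packet latency in microseconds."""
    return 1_000_000 / latency_us


def payload_rate(latency_us, data_bytes):
    """Bytes of actual message delivered per second at that packet rate."""
    return packets_per_second(latency_us) * data_bytes


def cycles_per_packet(latency_us, clock_mhz):
    """CPU cycles that elapse while one packet is in flight."""
    return latency_us * clock_mhz


if __name__ == "__main__":
    latency_us = 50  # minimum-packet latency quoted in the text
    print("packets/second:", packets_per_second(latency_us))
    print("message bytes/second:", payload_rate(latency_us, 1))
    for clock_mhz in (1000, 3000):  # assumed clock speeds, 1 GHz and 3 GHz
        print(clock_mhz, "MHz ->", cycles_per_packet(latency_us, clock_mhz), "cycles")
```

At one data byte per packet, 20,000 packets per second is 20,000 bytes per second of actual message, which is where the 20 KBytes/second figure comes from.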

Nor is this all of the dark side. Last month we learned that IP is lovely, but the protocol by itself has many warts. It is not very reliable. A variety of things can cause a network to drop occasional packets. IP has no way of positively identifying that a packet has been dropped, even if it is in the middle of an important message, and has no way of requesting that the dropped packet be retransmitted. IP has no way of dealing with the vagaries of networks with complicated routes, where packets that are part of a single (fragmented) datagram can arrive out of order. IP can send datagrams of at most 64 kilobytes in length, but real messages might be much longer.

Applications require "connections", and their connections need to be multiplex-able -- several applications on one host need to be able to exchange data with several applications on another host "at the same time" -- but IP is connectionless. It just drops a packet onto a wire in the untested expectation that it will be received, is totally unaware of higher level applications, and does not support the notion of a persistent "connection" between applications running on different ends of a network.

We cannot do much about the latency issue for IP on Ethernet as most of the problem is beyond our control -- in the hardware itself or equally inaccessible in the kernel. However, we can do quite a lot to achieve application connectivity, flexibility, and reliability. To get there, we need to add one or more layers in the ISO/OSI stack. This requirement leads us to learn about the Transmission Control Protocol (TCP) and its unreliable (but faster) cousin, User Datagram Protocol (UDP). Let's look at the latter first.

The User Datagram Protocol

By now the idea should be familiar. Ethernet is very low overhead but not routable or robust and does not support the notion of connections (persistent or otherwise) between applications as opposed to kernels. IP adds routability via a header tailored to that purpose, but remains less than robust and doesn't grok connections between applications.

There are two ways to get reliability and connections. One is to invent a big new header with support for both. This idea adds a certain amount of overhead to every connection, whether or not every feature is being used or is necessary. For some applications, missing the occasional packet may not matter compared to getting the packets one does get as efficiently as possible. The other idea is to add connections only (an abstraction we will assume is required for any pair of applications to talk over a network) and let those applications deal with reliability to the extent that they feel appropriate.

This is the basis of UDP, defined in RFC 768. The UDP header (copied verbatim from this RFC) is very, very simple and shown in Figure One:

     0      7 8     15 16    23 24    31
    +--------+--------+--------+--------+
    |     Source      |   Destination   |
    |      Port       |      Port       |
    +--------+--------+--------+--------+
    |                 |                 |
    |     Length      |    Checksum     |
    +--------+--------+--------+--------+
    |
    |          data octets ...
    +---------------- ...

Figure One: The UDP header format, from RFC 768.

This design is simplicity itself. UDP introduces a new abstraction, that of the port. A port is presumed to be associated with an application running on either end of the connection. Yes, UDP is described as a "connectionless protocol", but this means only that UDP connections are not persistent or verified, not that the notion of the port fails to define a connection between two applications.

We won't say much about ports here. You can read /etc/services if you want to see what "well known ports" are known on your machine(s), or you can read any of a long string of RFCs that define or modify this list. There are also port numbers that are reserved from being assigned in this way, so that applications can find free port numbers on both ends and dynamically create connections without blocking a well known port in the meantime.

Beyond source and destination ports (which should be thought of as being concatenated with the IP address, so that e.g. port 80 on host 192.168.1.129 identifies the application listening on port 80 of that host) the header contains the length of the UDP message, including header, and yet another checksum. Following this is the data, which for the purposes of the checksum is treated as if it were padded with a zero byte when necessary so that it contains an even number of bytes. The UDP header is 8 bytes long, so our minimum packet now becomes 18 bytes of Ethernet header and CRC encapsulating 20 bytes of IP header encapsulating 8 bytes of UDP header (46 bytes total header) encapsulating a data message, padded as required (at the Ethernet level) so that the minimum packet length is 64 bytes and the maximum is 1518 (for a standard MTU of 1500)!
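
To make the 8 byte layout concrete, here is a minimal sketch that packs and unpacks the four 16 bit UDP header fields with Python's standard `struct` module. The helper names and the source port 40000 are hypothetical, and the checksum is simply left at zero, which RFC 768 defines as "no checksum computed". Real applications do not build this header themselves; the kernel does it when data is written to a UDP socket.

```python
import struct

# Four unsigned 16 bit fields in network byte order:
# source port, destination port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port, dst_port, payload, checksum=0):
    """Prefix payload with a UDP header; length counts header plus data."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_header(datagram):
    """Return the four header fields of a UDP datagram as a dict."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return {"src_port": src_port, "dst_port": dst_port,
            "length": length, "checksum": checksum}


if __name__ == "__main__":
    datagram = build_udp_datagram(40000, 80, b"x")  # one byte of data
    print(len(datagram), "bytes:", datagram.hex())
    print(parse_udp_header(datagram))
```

A one byte message therefore becomes a 9 byte UDP datagram, which IP wraps in its 20 byte header and Ethernet then pads out to the 64 byte minimum.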

Corrupted packets can be detected and dropped at the UDP level, at the IP header level, and at the Ethernet level. It is up to the applications sending and receiving data to ensure that message streams arrive in the right order, that dropped packets or corrupted messages are retransmitted, and so forth. Such problems are "unlikely" to occur on local area networks where the packets cannot take multiple routes to their destination and where Ethernet itself ensures a fairly reliable delivery of packets on good hardware, so UDP is often used for local, non-persistent connections where sender and receiver are "on the same wire". It is also used for applications that want to achieve reliability at the absolutely lowest cost and think that they can beat TCP. Some applications in this category include parallel computing messaging libraries (e.g. PVM) and core network services such as NFS.
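
As an illustration of what "up to the applications" means in practice, the following sketch shows one common approach: the sender prefixes each message with a sequence number, and the receiver uses it to restore order and to notice gaps. The 4 byte sequence prefix, the `tag` helper, and the `Reassembler` class are all hypothetical choices for this example, not part of UDP or of any particular library, and actually requesting retransmission of the missing messages is left out.

```python
import struct

SEQ = struct.Struct("!I")  # hypothetical 4 byte sequence number prefix


def tag(seq, payload):
    """Sender side: prefix a message with its sequence number."""
    return SEQ.pack(seq) + payload


class Reassembler:
    """Receiver side: deliver payloads in order and report gaps."""

    def __init__(self):
        self.next_seq = 0   # next sequence number owed to the application
        self.pending = {}   # messages that arrived ahead of their turn

    def receive(self, message):
        """Accept one message; return payloads now deliverable in order."""
        (seq,) = SEQ.unpack_from(message)
        if seq >= self.next_seq:            # ignore late duplicates
            self.pending[seq] = message[SEQ.size:]
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

    def missing(self):
        """Sequence numbers known to be outstanding (candidates for a resend)."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


if __name__ == "__main__":
    r = Reassembler()
    print(r.receive(tag(0, b"a")))  # in order, delivered at once
    print(r.receive(tag(2, b"c")))  # early, held back
    print(r.missing())              # message 1 has not been seen
    print(r.receive(tag(1, b"b")))  # fills the gap, releases both
```

Even this small amount of bookkeeping adds header bytes and state to every exchange, which is worth keeping in mind when weighing a home-grown scheme against TCP.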

TCP isn't that easy to beat, though. While it does have (perhaps) more controls than many connections need, as we observed above latency scales weakly with packet length and the TCP code tends to be pretty well optimized (having been kicked around performance-wise for a rather long time). TCP does have some annoying features associated with its notion of persistent connections (which can be both blessing and curse). As I write this column, vendors are appearing that promise to move the TCP stack out of the kernel altogether and into the network interface. This feature will lower the latency cost of TCP and (perhaps more importantly) reduce the CPU burden of doing the actual work associated with reliable (re)transmission of data on an existing connection.