Monday, May 30, 2011

Reflections on Fast, User-Level Networking

A couple of weeks ago at HotOS, one of the most controversial papers (from Stanford) was entitled "It's Time for Low Latency." The basic premise of the paper is that clusters are stuck using expensive, high-latency network interfaces (generally TCP/IP over some flavor of Ethernet), but it should now be possible to achieve sub-10-microsecond round-trip-times for RPCs. Of course, a tremendous amount of research looked at low-latency, high-bandwidth cluster networking in the mid-1990's, including Active Messages, the Virtual Interface Architecture, and U-Net (which I was involved with as an undergrad at Cornell). A bunch of commercial products were available in this space, including Myrinet (still the best, IMHO) and InfiniBand.

Not much of this work has really taken off in commercial datacenters. John Ousterhout and Steve Rumble argue that this is because the commercial need for low latency networking hasn't been there until now. Indeed, when we were working on this in the 90's, the applications we envisioned were primarily numerical and scientific computing: big matrix multiplies, that kind of thing.

When Inktomi and Google started demonstrating Web search as the "killer app" for clusters, they managed to get away with relatively high-latency, but cheap, Ethernet-based solutions. For these applications, the cluster interconnect was not the bottleneck. Rumble's paper argues that emerging cloud applications are motivating the need for fast intermachine RPC. I'm not entirely convinced of this, but John and Steve and I had a few good conversations about this at HotOS and I've been reflecting on the lessons learned from the "fast interconnect" heyday of the 90's...

Microbenchmarks are evil: There is a risk in focusing on microbenchmarks when working on cluster networking. The standard "ping-pong" latency measurement and bulk transfer throughput measurements rarely reflect the kind of traffic patterns seen in real workloads. Getting something to work on two unloaded machines connected back-to-back says little about whether it will work at a large scale with a complex traffic mix and unexpected load. You often find that real world performance comes nowhere near the ideal two-machine case. For that matter, even "macrobenchmarks" like the infamous NOW-Sort work be misleading, especially when measurements are taken under ideal conditions. Obtaining robust performance under uncertain conditions seems a lot more important than optimizing for the best case that you will never see in real life.

Usability matters: I'm convinced that one of the reasons that U-Net, Active Messages, and VIA failed to take off is that they were notoriously hard to program to. Some systems, like Fast Sockets, layer a conventional sockets API on top, but often suffered large performance losses as a result, in part because the interface couldn't be tailored for specific traffic patterns. And even "sockets-like" layers often did not work exactly like sockets, being different enough that you couldn't just recompile your application to use them. A common example is not being entirely threadsafe, or not working with mechanisms such as select() and poll(). When you are running a large software stack that depends on sockets, it is not easy to rip out the guts with something that is not fully backwards compatible.

Commodity beats fast: If history has proven anything, there's only so much that systems designers are willing to pay -- in terms of complexity or cost -- for performance. The vast majority of real-world systems are based on some flavor of the UNIX process model, BSD filesystem, relational database, and TCP/IP over Ethernet. These technologies are all commodity and can be found in many (mostly compatible) variants, both commercial and open source; few companies are willing to invest time and money to tailor their design for some funky single-vendor user-level networking solution that might disappear one day.

2 comments:

Ethernet, as a wire protocol, has benefited from wide adoption and been essentially commoditized. It also provides a nimble deployment model for upgrading to faster (i.e. 1G to 10G) signaling technologies. The *programming model* is what ultimately determines "usability" and this is where cluster networks are not willing to spend the effort unless there is substantial payoff. Sockets Direct Protocol (SDP) offered some initial promise, but in practice has not been used. Most high-performance clusters use MPI (even over ethernet which is still the most common network protocol on the list of the Top500 civilian computers (www.top500.org). I am not arguing for MPI as a programming model, but I am suggesting there is a big gap between TCP/IP (i.e. sockets) and MPI or user-level fast messaging.

FWIW, the financial community is *very* latency sensitive in their applications for trading and microseconds in the network make a difference of order execution, and potentially millions of dollars.

Google+ Badge

About Me

Matt Welsh is a software engineer at Google, where he works on mobile web performance. He was previously a professor of Computer Science at Harvard University. His research interests include distributed systems and networks.