Things have been quiet on the blog, but not because there has not been a lot to say. In fact, so much has been happening that I have not had the idle cycles to write about it. However, I do want to highlight some of the interesting things that have taken place over the past few months.

There has been significant recent interest in large-scale data processing. Many would snicker that this is far from a new problem; indeed, the database community has been pioneering this space for decades. However, I believe there has been an uptick in commercial interest in this space, for example to index and analyze the wealth of information available on the Internet or to process the multiple billions of requests per day made to popular web services. MapReduce and open source tools like Hadoop have significantly intensified the debate over the right way to perform large-scale data processing (see my earlier post on this topic).

This trend, together with my group’s recent focus on data center networking (and a healthy dose of naivete), led us to go after the world record in data sorting. The team recently set the records for both Gray Sort (fastest time to sort 100 TB of data) and Minute Sort (most data sorted in one minute) in the “Indy” category. See the sort benchmark page for details. This has been one of the most gratifying projects I have ever been involved with. The work was of course really interesting, but the best part was seeing the team (Alex Rasmussen, Radhika Niranjan Mysore, Harsha V. Madhyastha, Alexander Pucher, Michael Conley, and George Porter) go after a really challenging problem. While some members of the team would disagree, it was also at least interesting to set the records with just minutes to spare before the 2010 deadline.

Our focus in this work was not so much to set the record (though we are happy to have done so) but to achieve high levels of efficiency while operating at scale. Recently, setting the sort record has largely been a test of how many computing resources an organization could throw at the problem, often sacrificing per-server efficiency. For example, Yahoo’s record for Gray Sort used an impressive 3452 servers to sort 100 TB of data in less than 3 hours. However, per-server throughput worked out to less than 3 MB/s, a factor of 30 less bandwidth than is available from even a single disk. Large-scale data sorting involves carefully balancing all per-server resources (CPU, memory capacity, disk capacity, disk I/O, and network I/O), all while maintaining overall system scale. We wanted to determine the limits of a scalable and efficient data processing system. Given current commodity server capacity, is it feasible to run at 30 MB/s or 300 MB/s per server? That is, could we reduce the required number of machines for sorting 100 TB of data by a factor of 10 or even 100?

The interesting thing about large-scale data sorting is that it exercises all aspects of the computer system.

CPU: sorting the data requires an O(n log n) computation. While not the most compute-intensive application, CPU requirements nonetheless cannot be ignored.

Disk bandwidth: earlier work shows that an external memory sort (the case where the data set is larger than aggregate physical memory) requires at least two reads and two writes of the data. One of the banes of system efficiency is the orders-of-magnitude difference in I/O performance between sequential and random disk I/O. A key requirement for high-performance sort is ensuring that disks perform sequential I/O (either read or write) near continuously.

Disk capacity: sorting 100 TB of data requires at least 200 TB of storage, or 300 TB if the input data cannot be erased. While not an enormous amount of data by modern standards, simply storing this much data is an interesting systems challenge.

Memory capacity: certainly in our architecture, and perhaps fundamentally, ensuring streaming I/O while simultaneously limiting the number of disk operations to 2 reads and 2 writes per tuple requires a substantial amount of memory and careful memory management to buffer data in preparation for large, contiguous writes to disk.

Network bandwidth: in a parallel sort system, data must be shuffled in an all-to-all manner across all servers. Saturating available per-server CPU and storage capacity requires significant network bandwidth, approaching 10 Gb/s of sustained network throughput per server in our configuration.
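To make the all-to-all shuffle concrete, here is a minimal sketch of how each server might route a tuple to the server responsible for its key range. The function name, the 48-server count, and the use of a 4-byte key prefix are illustrative assumptions, not our actual implementation:

```python
# Sketch of shuffle routing: each tuple goes to the server owning its key
# range, so every server sends to every other server. Assumes keys are
# uniformly distributed, as in the sort benchmark.

NUM_SERVERS = 48

def destination(key: bytes, num_servers: int = NUM_SERVERS) -> int:
    """Map a sort key to the server owning its key range."""
    # Interpret the first 4 bytes as an integer and scale into [0, num_servers).
    prefix = int.from_bytes(key[:4].ljust(4, b"\0"), "big")
    return prefix * num_servers >> 32

# The lowest key routes to server 0 and the highest to server 47:
assert destination(b"\x00\x00\x00\x00") == 0
assert destination(b"\xff\xff\xff\xff") == NUM_SERVERS - 1
```

Because every server holds keys destined for every other server, a balanced run pushes nearly all of the read bandwidth (800 MB/s, i.e. 6.4 Gb/s in our configuration) across the network during the first phase.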

Managing the interaction of these disparate resources, along with parallelism both within a single server and across a cluster of machines, was far more challenging than we anticipated. Our goal was to use commodity servers to break the sort record while focusing on high efficiency. We constructed a cluster with dual-socket, four-core Intel processors, initially 12GB RAM (later upgraded to 24GB RAM once we realized we could not maintain sequential I/O with just 12GB RAM/server), 2x10GE NIC (only one port active for the experiment), and 16 500GB drives. The number of hard drives per server was key to delivering high levels of performance. Each of our drives could sustain approximately 100 MB/s of sequential read or write throughput. We knew that, in the optimal case (see this paper), we would read and write the data twice in two discrete phases separated by a barrier. So, if we managed everything perfectly, in the first phase we would read data from 8 drives at an aggregate rate of 800 MB/s (8*100 MB/s) while simultaneously writing it out to the remaining 8 disks at an identical rate. In the second phase, we would similarly read the data at 800 MB/s while writing the fully sorted data out at 800 MB/s. Once again, in the best case, we would average 400 MB/s of sorting per server.

Interestingly, the continuing chasm between CPU performance and disk I/O (even in the streaming case) means that building a “balanced” data-intensive processing cluster requires a large number of drives per server to maintain overall system balance. While 16 disks per server seems large, one conclusion of our work is that servers dedicated to large-scale data processing should likely have even more disks. At the same time, significant work needs to be done in the operating system and disk controllers to harness the I/O bandwidth available from such large disk arrays in a scalable fashion.

Our initial goal was to break the record with just 30 servers. This would correspond to 720 GB/min assuming 400 MB/s/server, allowing us to sort 100 TB of data in ~139 minutes. We did not quite get there (yet); our record-setting runs were on a 48-server configuration. For our “certified” record-setting run, we ran at 582 GB/min on 48 servers, or 200 MB/s/server. This corresponds to 50% of the maximum efficiency/capacity of our underlying hardware. Since the certified experiments, we have further tuned our code to sort at ~780 GB/min aggregate, or 267 MB/s/server. These newest runs correspond to ~67% efficiency. Now obsessed with squeezing the last ounce of efficiency from the system, we continue to target >90% efficiency, or more than 1 TB/min of sorting on 48 machines.
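The efficiency figures above follow from simple arithmetic; this sketch reproduces them using only the numbers quoted in the text:

```python
# Reproducing the efficiency arithmetic quoted above.
SERVERS = 48
DRIVE_MBPS = 100                     # sequential throughput per drive
phase_rate = 8 * DRIVE_MBPS          # 8 drives reading (8 writing): 800 MB/s
ideal_per_server = phase_rate / 2    # data passes through twice: 400 MB/s sorted

ideal_gb_per_min = SERVERS * ideal_per_server * 60 / 1000  # 1152 GB/min
certified_eff = 582 / ideal_gb_per_min   # the 50% figure above
tuned_eff = 780 / ideal_gb_per_min       # the ~67% figure above
```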

While beyond the scope of this post, it has been very interesting just how much we had to do for even this level of performance. In no particular order:

We had to revise, redesign, and fine-tune both our architecture and implementation multiple times. There is no single right architecture: the right technique varies with evolving hardware capabilities and balance.

We had to experiment with multiple file systems and file system configurations before settling on ext4.

We were bitten multiple times by the performance and caching behavior of our hardware RAID controllers.

While our job overall is not CPU bound, thread scheduling and core contention became a significant issue. In the end, we had to come up with our own custom core allocation, bypassing the Linux kernel’s own approach. One interesting requirement was avoiding the core that, by default, performed most of the in-kernel system call work.
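Linux exposes the affinity controls this kind of allocation builds on. The sketch below is purely illustrative (the decision to avoid core 0 is an assumption for illustration; the actual core to avoid is system-dependent), not our actual allocator:

```python
# Illustrative sketch of explicit core allocation on Linux (not our actual
# allocator). We keep worker processes off core 0, which on many systems
# performs a disproportionate share of in-kernel work.
import os

ALL_CORES = os.sched_getaffinity(0)     # cores available to this process
WORKER_CORES = sorted(ALL_CORES - {0})  # reserve core 0 for the kernel

def pin_to_core(core: int) -> None:
    """Restrict the calling process to a single core (Linux-only)."""
    os.sched_setaffinity(0, {core})
```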

Performing all-to-all communication at near 10 Gb/s, even among 48 hosts on a single switch, is an unsolved challenge to the best of our knowledge. We had to resort to brittle and arcane socket configuration to sustain even ~5Gb/s.

We had to run with virtual memory disabled because the operating system’s memory management behaved in unexpected ways close to capacity. Of course, with virtual memory disabled, we had to tolerate kernel panics if we were not careful about memory allocation.

In the end, simultaneously addressing these challenges turned out to be a lot of fun, especially with a great group of people working on the project. Large-scale sort exercises many aspects of the operating system, the network protocol stack, and distributed systems. It is far from trivial, but it is also simple enough to (mostly) keep in your head at once. In addition to improving the efficiency of our system, we are also working to generalize our infrastructure to arbitrary MapReduce-style computation. Fundamentally, we are interested to determine how much efficiency and scale we can maintain in a general-purpose data processing infrastructure.

The great thing about working with great students is following their successes throughout their careers.

I just learned that Haifeng Yu was promoted to Associate Professor with tenure at National University of Singapore (NUS). Haifeng was my first graduating PhD student and now the first to receive tenure. Overall, Haifeng has always been drawn to deep, challenging problems and his work is always insightful. He did his graduate work on consistency models for replicated and distributed systems. Since then, he has worked on network security, system availability, and, most recently, coding for wireless systems. His recent SIGCOMM 2010 paper on flexible coding schemes will be (in my biased opinion) of great interest. In fact, it won the best paper award at the conference.

Recently, there has been a lot of handwringing in the systems community about the work that we can do in the age of mega-scale data centers and cloud computing. The worry is that the really interesting systems today consist of tens of thousands of machines interconnected both within data centers and across the wide area. Further, appropriate system architectures are heavily dependent on the workloads imposed by millions of users on particular software architectures. The concern is that we in academia cannot perform good research because we have access neither to systems of the appropriate scale nor to the application workloads that would inform appropriate system architectures.

The concern further goes that systems research is increasingly being co-opted by industry, with many (sometimes most) of the papers in top systems and networking conferences being written by our colleagues in industry.

One of my colleagues hypothesized that perhaps the void in the systems community was partially caused by the void in “big funding” that was historically available to the academic systems community from DARPA. Starting in about 2000, DARPA moved to more focused funding of efforts likely to have direct impact in the near term. Though it appears this policy is changing under new DARPA leadership, the effects in the academic community have yet to be felt.

My feeling is that all this worry is entirely misplaced. I will outline some of the opportunities that go along with the challenges that we currently face in academic research.

First, for me, this may in fact be another golden age in systems research, borne out of tremendous opportunity to address a whole new scale of problems collaboratively between industry and academia. Personally, I find interactions with my colleagues in industry to be a terrific source of concrete problems to work on. For example, our recent work on data center networking could never have happened without detailed understanding of the real problems faced in large-scale network deployments. While we had to carry out a significant systems building effort as part of the work, we did not need to build a 10,000-node network to carry out interesting work in this space. Even the terrific work coming out of Microsoft Research on related efforts such as VL2, DCell, and BCube typically employs relatively modest-sized system implementations as proofs of concept.

A related approach is to draw inspiration from a famous baseball quote by Willie Keeler, “I keep my eyes clear and I hit ’em where they ain’t.” The analog in systems research is to focus on topics that may not currently be addressed by industry. For example, while there has been tremendous interest and effort in building systems that scale seemingly arbitrarily, there has been relatively little focus on per-node efficiency. So a recent focus of my group has been on building scalable systems that do not necessarily sacrifice efficiency. More on this in a subsequent post.

The last, and perhaps best, strategy is to actively seek out collaborations with industry to increase overall impact on both sides. One of the best papers I read in the set of submissions to SIGCOMM 2010 was on DCTCP, a variant of TCP targeting the data center. This work was a collaboration between Microsoft Research and Stanford with the protocol deployed live on a cluster consisting of thousands of machines. The best paper award from IMC 2009 was on a system called WhyHigh, a system for diagnosing performance problems in Google’s Content Distribution Network. This was a multi-way collaboration between Google, UC San Diego, University of Washington, and Stony Brook. Such examples of fruitful collaborations abound. Companies like Akamai and AT&T are famous for multiple very successful academic collaborations with actual impact on business operations. I have personally benefitted from insights and collaborations with HP Labs on topics such as virtualization and system performance debugging.

I think the big thing to note is that industry and academia have long lived in a symbiotic relationship. When I was a PhD student at Berkeley, many of the must-read systems papers came out of industry: the Alto, Grapevine, RPC, NFS, Firefly, Logic of Authentication, Pilot, etc., just as systems such as GFS, MapReduce, Dynamo, PNUTS, and Dryad are heavily influencing academic research today. At the same time, GFS likely could not have happened without the lineage of academic file systems research, from AFS, Coda, LFS, and Zebra to xFS. Similarly, Dynamo would not have been as straightforward if it had not been informed by Chord, Pastry, Tapestry, CAN, and all the peer to peer systems that came afterward. The novel consistency model in PNUTS that enables its scalability was informed by decades of research in strong and weak data consistency models.

Sometimes things go entirely full circle multiple times between industry and academia. IBM’s seminal work on virtual machines in the 1960’s lay dormant for a few decades before inspiring some of the top academic work of the 1990’s, SimOS and DISCO. This work in turn led to the founding of VMWare, perhaps one of the most influential companies to directly come out of the systems community. And of course, VMWare has helped define part of the research agenda for the systems community in the past decade, through academic efforts like Xen. Interestingly, academic work on Xen led to a second high-profile company, XenSource.

This is all to say that I believe that the symbiotic relationship between industry and academia in systems and networking will continue. We in academia do not need a 100,000-node data center to do good research, especially by focusing on direct collaboration with industry where it makes sense and otherwise on topics that may not be being directly addressed by industry. And the fact that there are so many great systems and networking papers from industry in top conferences should only serve as inspiration, both to define important areas for further research and to set the bar higher for the quality of our own work in academia.

Finally, and only partially in jest, all the fundamental work in industrial research is perhaps further affirmation of the important role that academia plays, since many of the people carrying out the work were MS and PhD students in academia not so long ago.

This year, I had the pleasure of serving on the SIGCOMM 2010 program committee. I may write more about the experience later, but the short version is that I really enjoyed reading the papers and was particularly impressed by the deep discussions at the two-day program committee last month. K.K. Ramakrishnan and Geoff Voelker did a terrific job as co-chairs and I believe their efforts are well reflected in a very strong program.

The conference will be held in New Delhi this year and the organizing committee has been fortunate to secure some generous support for travel grants. This year, grants will be available not just for students, but also for post docs and junior faculty. The deadline for application has been extended to June 12, 2010. Full details are available here. On behalf of the SIGCOMM organizing committee, I encourage everyone interested to apply.

If you do attend SIGCOMM, let me also put in a plug for the VISA workshop. This is the second workshop on Virtualized Infrastructure Systems and Architecture, building on the successful program we had last year. I was the co-program chair this year with Guru Parulkar and Cedric Westphal. Virtualization remains an important topic and VISA is playing an important role for discussion of important problems across systems and networks.

The amount of interest in data centers and data center networking continues to grow. For the past decade plus, the most savvy Internet companies have been focusing on infrastructure. Essentially, planetary scale services such as search, social networking, and e-commerce require a tremendous amount of computation and storage. When operating at the scale of tens of thousands of computers and petabytes of storage, small gains in efficiency can result in millions of dollars of annual savings. On the other extreme, efficient access to tremendous amounts of computation can enable companies to deliver more valuable content. For example, Amazon is famous for tailoring web page contents to individual customers based on both their history and potentially the history of similar users. Doing so while maintaining interactive response times (typically responding in less than 300 ms) requires fast, parallel access to data potentially spread across hundreds or even thousands of computers. In an earlier post, I described the Facebook architecture and its reliance on clustering for delivering social networking content.

Over the last few years, academia has become increasingly interested in data centers and cloud computing. One reason is the opportunity for impact: the entire computing industry is clearly undergoing another paradigm shift, and five years from now the way we build out computing and storage infrastructure will be radically different. Another allure of the data center is that it is possible to do “clean slate” research and deployment. One frustration of the networking research community has been the inability to deploy novel architectures and protocols because of the need to be backward compatible and friendly to legacy systems. Check out this paper for an excellent discussion. In the data center, it is at least possible to deploy entirely new architectures without the need to be compatible with every protocol developed over the years.

Of course, there are difficulties with performing data center research as well. One is having access to the necessary infrastructure to perform research at scale. With companies deploying data centers at the scale of tens of thousands of computers, it is difficult for most universities and even research labs to have access to the necessary infrastructure. In our own experience, we have found that it is possible to consider problems of scale even with a relatively modest number of machines. Research infrastructures such as Emulab and OpenCirrus are open compute platforms that provide a significant amount of computing infrastructure to the research community.

Another challenge is the lack of software infrastructure for performing data center research, particularly in networking. Eucalyptus provides an EC2-compatible environment for cloud computing. However, there is a relative void of available research software for research in networking. Rebuilding every aspect of the protocol stack before performing research in fundamental algorithms and protocols is a challenge.

To partially address this shortcoming, we are releasing an alpha version of our PortLand protocol. This work was published in SIGCOMM 2009 and targets delivering a unified Layer 2 environment for easier management and support for basic functionality such as virtual machine migration. I discussed our work on PortLand in an earlier post here and some of the issues of Layer 2 versus Layer 3 deployment here.

The page for downloading PortLand is now up. It reflects the hard work of two graduate students in my group, Sambit Das and Malveeka Tewari, who took our research code and ported it to HP ProCurve switches running OpenFlow. The same codebase runs on NetFPGA switches as well. We hope the community can confirm that the same code runs on a variety of other OpenFlow-enabled switches. Our goal is for PortLand to be a piece of the puzzle for a software environment for performing research in data center networking. We encourage you to try it out and give us feedback. In the meantime, Sambit and Malveeka are hard at work adding Hedera functionality for flow scheduling to our next code release.

Later this month, we will be presenting our work on Hedera at NSDI 2010. The goal of the work is to improve data center network performance under a range of dynamically shifting communication patterns. Below I will present a quick overview of the work starting with some of the motivation for it.

The promise of adaptively choosing the right path for a packet based on dynamically changing network performance conditions is at least as old as the ARPANET. The original goal was to track the current levels of congestion on all available paths between a source and destination and to then forward individual packets along the path likely to deliver the best performance on a packet by packet basis.

In the ARPANET, researchers attempted to achieve this functionality by distributing queue lengths and capacity as part of the routing protocol. Each router would then have a view of not just network connectivity but also the performance available on individual paths. Forwarding entries for each destination would then be calculated based not just on the shortest number of hops but also on dynamically changing performance measures. Unfortunately, this approach suffered from stability issues. Distribution of current queue lengths as part of the routing protocol was too coarse grained and too deterministic. As a result, traffic would oscillate: all senders would simultaneously chase the path that had the best performance in the previous measurement epoch, leaving other paths idle. The previously best performing path would in turn often become overwhelmed by this herd effect. The next measurement cycle would reveal the resulting congestion and lead to yet another oscillation.
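The herd effect is easy to reproduce in a toy model. In this purely illustrative sketch, every sender picks whichever of two paths reported the lowest load in the previous epoch:

```python
# Toy model of the routing oscillation described above: two equal paths,
# and in each epoch all traffic chases the path that looked least loaded
# in the previous epoch.

def simulate(epochs: int = 6):
    load = [10, 0]     # initial queue lengths on paths 0 and 1
    history = []
    for _ in range(epochs):
        best = load.index(min(load))  # everyone picks the same "best" path
        load = [0, 0]
        load[best] = 10               # ...which promptly becomes congested
        history.append(best)
    return history
```

The chosen path simply alternates each epoch while the other path sits idle, which is exactly the oscillation the ARPANET experiments observed.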

As a result of this early experience, inter-domain routing protocols such as BGP settled on hop count as the metric for packet delivery, eschewing performance goals in favor of simple connectivity. Intra-domain routing protocols such as OSPF also initially opted for simplicity, again aiming for the shortest path between a source and destination. Administrators could however set weights for individual links as a way to make particular paths more or less appealing by default.

More recently, administrators have performed coarse-grained traffic engineering among available paths using MPLS. With the rise of ISPs and the cost of operating hundreds or thousands of expensive long-haul links and customers with strict performance requirements, it became important to make better use of network resources within each domain/ISP. Traffic engineering (TE) extensions to OSPF allowed bundles of flows from the same ingress to egress points in the network to follow the same path, leveraging the long-term relative stability in traffic between various points of presence in a long-haul network. For example, the amount of traffic from Los Angeles to Chicago aggregated over many customers might demonstrate stability modulo diurnal variations. OSPF-TE allowed network operators to balance aggregations of flows among available paths to smooth the load across available links in a wide-area network. Rebalancing of forwarding preferences could be done at a coarse granularity, perhaps with human oversight, given the relative stability in aggregate traffic characteristics.

Our recent focus has been on the data center and in that environment, the network displays much more bursty communication patterns with rapid shifts in load from one portion of the network to another. At the same time, data center networks only achieve scalability through topologies that inherently provide multiple paths between every source and destination. Leveraging coarse-grained stability on the order of days is not an option for performing traffic engineering in the data center. And yet, attempting to send each packet along the best available path also seems like a non-starter from both a scalability perspective and a TCP compatibility perspective. On the second point, TCP does not behave well when packets may potentially be delivered out of order as the common case.

The state of the art in load balancing in data centers is the Equal Cost Multipath (ECMP) extension to OSPF. Here, each switch tracks the set of available next hops to a particular destination. For each arriving packet, it extracts a potentially configurable set of headers (e.g., source and destination IP address, source and destination port, etc.) with the goal of deterministically identifying all of the packets that belong to the same logical higher-level flow. The switch then applies a hash function to the concatenated flow identifier to assign the flow to one of the available output ports.
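A minimal sketch of this flow hashing might look as follows. Real switches use hardware hash functions (often CRC-based); hashlib here is just for illustration, and the function name is an assumption:

```python
# Sketch of ECMP next-hop selection: hash the 5-tuple flow identifier so
# that every packet of a flow deterministically takes the same path.
import hashlib

def ecmp_port(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: int, num_ports: int) -> int:
    flow_id = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha1(flow_id).digest()
    return int.from_bytes(digest[:4], "big") % num_ports

# Every packet of the same flow hashes to the same uplink:
p1 = ecmp_port("10.0.0.1", "10.0.1.2", 5000, 80, 6, 4)
p2 = ecmp_port("10.0.0.1", "10.0.1.2", 5000, 80, 6, 4)
assert p1 == p2
```

Note that the hash ignores flow sizes entirely, which is precisely why two long-lived flows can collide on one uplink while another sits idle.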

ECMP has the effect of load balancing flows among the available paths. It can perform well under certain conditions, for example when flows are mostly of uniform, small size and when hosts communicate with one another with uniform probability. However, long-term hash collisions can leave certain links oversubscribed while others remain idle. In production networks, network administrators are sometimes left to manually tweak the ECMP hash function to achieve good performance for a particular communication pattern (though of course, the appropriate hash function depends on globally shifting communication patterns).

In our work, we have found that ECMP can underutilize network bandwidth by a factor of 2-4 for moderately sized networks. The worst-case overhead grows with network size.

Our work on Hedera shows how to improve network performance with small communication overhead to maintain overall network scalability. The key idea, detailed in the paper, is to leverage a central network fabric manager that tracks the behavior of large flows. By default, new flows that are initiated are considered small and scheduled using a technique similar to ECMP. However, once a flow grows beyond a certain threshold, the fabric manager attempts to schedule it in light of the behavior of all other large flows in the network. The fabric manager communicates with individual switches in the topology to track resource utilization using OpenFlow. This ensures that in the future our approach can be backward compatible with a range of commercially available switches.
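The flow-tracking logic described above can be sketched roughly as follows. The threshold value and class names are illustrative assumptions, not Hedera's actual constants or interfaces:

```python
# Sketch of the fabric manager's large-flow detection: flows start out
# ECMP-scheduled; once a flow's byte count crosses a threshold, it is
# handed to the explicit flow placer.

LARGE_FLOW_BYTES = 100 * 1024 * 1024  # illustrative threshold, e.g. 100 MB

class FabricManager:
    def __init__(self):
        self.flow_bytes = {}    # flow id -> bytes observed so far
        self.scheduled = set()  # flows placed explicitly

    def observe(self, flow_id, nbytes):
        """Record bytes for a flow; return True once it is explicitly placed."""
        total = self.flow_bytes.get(flow_id, 0) + nbytes
        self.flow_bytes[flow_id] = total
        if total >= LARGE_FLOW_BYTES:
            self.scheduled.add(flow_id)  # hand off to the flow placer
        return flow_id in self.scheduled

fm = FabricManager()
assert not fm.observe("f1", 1024)          # small: still ECMP-scheduled
assert fm.observe("f1", LARGE_FLOW_BYTES)  # large: now explicitly placed
```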

An important consideration in our work is the ability to estimate the inherent demand of a TCP flow independent of its measured bandwidth. That is, the fabric manager cannot schedule flows based simply on the bandwidth each flow is currently observed to consume: this value can be off by a large factor from what a flow would ideally achieve, because of poor previous scheduling decisions. Hence, we designed an algorithm to estimate the best-case bandwidth that would be available to a TCP flow assuming a perfect scheduler. This demand estimate, rather than any observed performance characteristic, is the input to our scheduling algorithm.
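To give a flavor of the idea, the toy version below treats a flow's inherent demand as its fair share at both the sending and receiving NIC (normalized to unit capacity). This is far simpler than our actual iterative estimator and is for illustration only:

```python
# Toy demand estimation: a flow's demand is its fair share at the sender,
# capped by fair sharing at the receiver, independent of whatever
# bandwidth the flow currently happens to get.
from collections import Counter

def estimate_demands(flows):
    """flows: list of distinct (src, dst) pairs -> {flow: demand in [0, 1]}."""
    by_src = Counter(src for src, _ in flows)
    # Each sender splits its NIC evenly among its outgoing flows...
    demand = {f: 1.0 / by_src[f[0]] for f in flows}
    # ...then each receiver caps its total incoming demand at capacity 1.
    inbound = Counter()
    for (src, dst), d in demand.items():
        inbound[dst] += d
    return {f: d / max(1.0, inbound[f[1]]) for f, d in demand.items()}

# Two senders converging on one receiver each get half its capacity:
demands = estimate_demands([("A", "C"), ("B", "C")])
```

Even this crude version captures the key point: the estimate depends only on the communication pattern, not on bandwidth measurements polluted by earlier placement decisions.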

The final piece of the puzzle is an efficient scheduling algorithm for placing large flows in the network. One important consideration is the length of the control loop: that is, how quickly we can measure the behavior of existing flows and respond with a new placement. If network communication patterns shift more rapidly than we are able to observe and react, we will be left, in effect, continuously reacting to no longer meaningful network conditions. We currently are able to measure and react at the granularity of approximately one second, driven by some limitations in our switch hardware. As part of future work, we hope to drive the control loop down to approximately 100 ms. It will likely take some hardware support, perhaps using an FPGA, to go much below 100 ms.

Overall, we have found that Hedera can deliver near-optimal network utilization for a range of communication patterns, with significant improvements relative to ECMP. It remains an open question whether the network scheduling problem we need to solve is NP-hard or not. But our current algorithms are reasonably efficient with acceptable performance under the conditions we have experimented with thus far.

We hope that the relative simplicity of our architecture along with its backward compatibility with existing switch hardware will enable more dynamic scheduling of data center network fabrics with higher levels of delivered performance and faster reaction to any network failures.

I just read an interesting article “Has Amazon EC2 Become Oversubscribed?” The article describes one company’s relatively large-scale usage of EC2 over a three-year period. Apparently, over the first 18 months, EC2 small instances provided sufficient performance for their largely I/O bound network servers. Recently however, they have increasingly run into problems with “noisy neighbors”. They speculate that they happen to be co-located with other virtual machines that are using so much CPU that the small instances are unable to get their fair share of the CPU. They recently moved to large instances to avoid the noisy neighbor problem.

More recently however, they have been finding unacceptable local network performance even on large instances, with ping times ranging from hundreds of milliseconds to even seconds in some cases. This increase in ping time is likely due to overload on the end hosts, with the hypervisor unable to keep up with even the network load imposed by pings. (The issue is highly unlikely to be in the network because switches deployed in the data center do not have that kind of buffering.)

The conclusion from the article, along with the associated comments, is that Amazon has not sufficiently provisioned EC2 and that is the cause of the overload.

While this is purely speculation on my part, I believe that underprovisioning of the cloud compute infrastructure is unlikely to be the sole cause of the problem. Amazon has very explicit descriptions of the amount of computing power associated with each type of computing instance. And it is fairly straightforward to set the hypervisor to allocate a fixed share of the CPU to individual virtual machine instances. I am assuming that Amazon has set its CPU schedulers to reserve the appropriate portion of each machine’s physical CPU (and memory) for each VM.

For CPU-bound VMs, the hypervisor scheduler is quite efficient at allocating resources according to administrator-specified levels. However, the Achilles heel of scheduling in VM environments is I/O. In particular, the hypervisor typically has no way to account for the work performed on behalf of individual VMs, either in the hypervisor itself or (likely the bigger culprit) in the driver domains responsible for network and disk I/O. Hence, if a particular instance performs significant network communication (either externally or to other EC2 hosts), the corresponding system calls will first go into the local kernel. The kernel likely has a virtual device driver for the disk or the NIC. However, for protection, the virtual device driver cannot have access to the actual physical device. Hence, the kernel driver must transfer control to the hypervisor, which in turn transfers control to a device driver running in a separate domain.

The work done in the driver domain on behalf of a particular VM is difficult to account for. In fact, this work is typically not “billed” back to the originating domain. So, a virtual machine can effectively mount a denial of service attack (whether malicious or not) on other co-located VMs simply by performing significant I/O. With colleagues at HP Labs, we wrote a couple of papers investigating this exact issue a few years back.

As mentioned above however, without having access to the actual workloads on EC2, it is impossible to know whether the hypervisor scheduler is really the culprit. It will be interesting to see whether Amazon has a response.