Archive for the 'data center' Category

Things have been quiet on the blog, but not because there has not been a lot to say. In fact, so much has been happening that I have not had the idle cycles to write about it. However, I do want to highlight some of the interesting things that have taken place over the past few months.

There has been significant recent interest in large-scale data processing. Many would snicker that this is far from a new problem; indeed, the database community has been pioneering in this space for decades. However, I believe there has been an uptick in commercial interest in this space, for example to index and analyze the wealth of information available on the Internet or to process the multiple billions of requests per day made to popular web services. MapReduce and open source tools like Hadoop have significantly intensified the debate over the right way to perform large-scale data processing (see my earlier post on this topic).

Observing this recent trend along with my group’s recent focus on data center networking (along with a healthy dose of naivete) led us to go after the world record in data sorting. The team recently set the records for both Gray Sort (fastest time to sort 100 TB of data) and Minute Sort (most data sorted in one minute) in the “Indy” category. See the sort benchmark page for details. This has been one of the most gratifying projects I have ever been involved with. The work was of course really interesting but the best part was seeing the team (Alex Rasmussen, Radhika Niranjan Mysore, Harsha V. Madhyastha, Alexander Pucher, Michael Conley, and George Porter) go after a really challenging problem. While some members of the team would disagree, it was also at least interesting to set the records with just minutes to spare before the 2010 deadline.

Our focus in this work was not so much to set the record (though we are happy to have done so) but to go after high levels of efficiency while operating at scale. Recently, setting the sort record has largely been a test of how much computing resources an organization could throw at the problem, often sacrificing per-server efficiency. For example, Yahoo’s record for Gray Sort used an impressive 3452 servers to sort 100 TB of data in less than 3 hours. However, per-server throughput worked out to less than 3 MB/s, a factor of 30 less bandwidth than is available even from a single disk. Large-scale data sorting involves carefully balancing all per-server resources (CPU, memory capacity, disk capacity, disk I/O, and network I/O), all while maintaining overall system scale. We wanted to determine the limits of a scalable and efficient data processing system. Given current commodity server capacity, is it feasible to run at 30 MB/s or 300 MB/s per server? That is, could we reduce the required number of machines for sorting 100 TB of data by a factor of 10 or even 100?

The interesting thing about large-scale data sorting is that it exercises all aspects of a computer system.

CPU: sorting requires performing O(n log n) comparisons on the data. While not the most compute-intensive application, the CPU requirements nonetheless cannot be ignored.

Disk bandwidth: earlier work proves that an external memory sort (the case where the data set is larger than aggregate physical memory) requires at least two reads and two writes of the data. One of the banes of system efficiency is the orders-of-magnitude difference in I/O performance for sequential versus random disk I/O. A key requirement for high-performance sort is ensuring that disks perform sequential I/O (either read or write) near continuously.

Disk capacity: sorting 100 TB of data requires at least 200 TB of storage, or 300 TB if the input data cannot be erased. While not an enormous amount of data by modern standards, simply storing this much data is an interesting systems challenge.

Memory capacity: certainly in our architecture, and perhaps fundamentally, ensuring streaming I/O while simultaneously limiting the number of disk operations to 2 reads and 2 writes per tuple requires a substantial amount of memory and careful memory management to buffer data in preparation for large, contiguous writes to disk.

Network bandwidth: in a parallel sort system, data must be shuffled in an all-to-all manner across all servers. Saturating available per-server CPU and storage capacity requires significant network bandwidth, approaching 10 Gb/s of sustained network throughput per server in our configuration.
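The two-pass structure implied by the lower bound above can be sketched as a minimal external sort (a toy in-memory version; a real implementation streams runs to and from disk):

```python
import heapq

def external_sort(records, memory_limit):
    """Two-pass external sort sketch: every record is read and written
    exactly twice (once per phase), matching the lower bound noted above."""
    # Phase 1: read sequential chunks that fit in memory, sort each,
    # and emit them as sorted runs (all I/O stays sequential).
    runs = []
    for i in range(0, len(records), memory_limit):
        runs.append(sorted(records[i:i + memory_limit]))
    # Barrier between phases, then Phase 2: stream-merge the sorted
    # runs, touching every record once more on the way out.
    return list(heapq.merge(*runs))
```

The `memory_limit` here stands in for the per-server RAM budget that, as noted above, has to buffer data for large contiguous writes.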

Managing the interaction of these disparate resources along with parallelism both within a single server and across a cluster of machines was far more challenging than we anticipated. Our goal was to use commodity servers to break the sort record while focusing on high efficiency. We constructed a cluster with dual-socket, four-core Intel processors, initially 12GB RAM (later upgraded to 24GB RAM once we realized we could not maintain sequential I/O with just 12GB RAM/server), 2x10GE NICs (only one port active for the experiment), and 16 500GB drives. The number of hard drives per server was key to delivering high levels of performance. Each of our drives could sustain approximately 100 MB/s of sequential read or write throughput. We knew that, in the optimal case (see this paper), we would read and write the data twice in two discrete phases separated by a barrier. So, if we managed everything perfectly, in the first phase, we would read data from 8 drives at an aggregate rate of 800 MB/s (8*100 MB/s) while simultaneously writing it out to the remaining 8 disks at an identical rate. In the second phase, we would similarly read the data at 800 MB/s while writing the fully-sorted data out at 800 MB/s. Once again, in the best case, we would average 400 MB/s of sorting per server.
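The per-server arithmetic works out as follows (a quick sanity check on the numbers above):

```python
DRIVES = 16
DRIVE_MBPS = 100  # sequential read or write throughput per drive

# In each phase, half the drives read while the other half write.
phase_read_mbps = (DRIVES // 2) * DRIVE_MBPS   # 800 MB/s in
phase_write_mbps = (DRIVES // 2) * DRIVE_MBPS  # 800 MB/s out

# Two phases over the full data set: each byte is read twice and
# written twice, so the effective sorting rate is half the phase rate.
sort_rate_mbps = phase_read_mbps / 2           # 400 MB/s of sorting

print(phase_read_mbps, sort_rate_mbps)
```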

Interestingly, the continuing chasm between CPU performance and disk I/O (even in the streaming case) means that building a “balanced” data-intensive processing cluster requires a large number of drives per server to maintain overall system balance. While 16 disks per server seems large, one conclusion of our work is that servers dedicated to large-scale data processing should likely have even more disks. At the same time, significant work needs to be done in the operating system and disk controllers to harness the I/O bandwidth available from such large disk arrays in a scalable fashion.

Our initial goal was to break the record with just 30 servers. This would correspond to 720 GB/min assuming 400 MB/s/server, allowing us to sort 100 TB of data in ~138 minutes. We did not quite get there (yet); our record-setting runs were on a 48-server configuration. For our “certified” record-setting run, we ran at 582 GB/min on 48 servers, or 200 MB/s/server. This corresponds to 50% of the maximum efficiency/capacity of our underlying hardware. Since the certified experiments, we have further tuned our code to sort at ~780 GB/min aggregate or 267 MB/s/server. These newest runs correspond to ~67% efficiency. Now obsessed with squeezing the last ounce of efficiency from the system, we continue to target >90% efficiency or more than 1 TB/min of sorting on 48 machines.
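Converting between the aggregate and per-server figures is straightforward (decimal units assumed, 1 GB = 1000 MB; the post's quoted figures may use slightly different conventions):

```python
def per_server_mbps(gb_per_min, servers):
    # Aggregate GB/min -> MB/s per server.
    return gb_per_min * 1000 / 60 / servers

PEAK = 400  # MB/s/server best case from the hardware analysis above

certified = per_server_mbps(582, 48)  # ~202 MB/s, ~50% of peak
tuned = per_server_mbps(780, 48)      # ~271 MB/s, roughly 2/3 of peak

print(round(certified), round(certified / PEAK, 2))
```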

While beyond the scope of this post, it has been very interesting just how much we had to do for even this level of performance. In no particular order:

We had to revise, redesign, and fine tune both our architecture and implementation multiple times. There is no one right architecture because the right technique varies with evolving hardware capabilities and balance.

We had to experiment with multiple file systems and file system configurations before settling on ext4.

We were bitten multiple times by the performance and caching behavior of our hardware RAID controllers.

While our job overall is not CPU bound, thread scheduling and core contention became a significant issue. In the end, we had to come up with our own custom core allocation bypassing the Linux kernel’s own approach. One interesting requirement was avoiding the core that by default performed most of the in-kernel system call work.

Performing all-to-all communication at near 10 Gb/s, even among 48 hosts on a single switch, is an unsolved challenge to the best of our knowledge. We had to resort to brittle and arcane socket configuration to sustain even ~5Gb/s.

We had to run with virtual memory disabled because the operating system’s memory management behaved in unexpected ways close to capacity. Of course, with virtual memory disabled, we had to tolerate kernel panics if we were not careful about memory allocation.

In the end, simultaneously addressing these challenges turned out to be a lot of fun, especially with a great group of people working on the project. Large-scale sort exercises many aspects of the operating system, the network protocol stack, and distributed systems. It is far from trivial, but it is also simple enough to (mostly) keep in your head at once. In addition to improving the efficiency of our system, we are also working to generalize our infrastructure to arbitrary MapReduce-style computation. Fundamentally, we are interested to determine how much efficiency and scale we can maintain in a general-purpose data processing infrastructure.

The amount of interest in data centers and data center networking continues to grow. For the past decade plus, the most savvy Internet companies have been focusing on infrastructure. Essentially, planetary scale services such as search, social networking, and e-commerce require a tremendous amount of computation and storage. When operating at the scale of tens of thousands of computers and petabytes of storage, small gains in efficiency can result in millions of dollars of annual savings. On the other extreme, efficient access to tremendous amounts of computation can enable companies to deliver more valuable content. For example, Amazon is famous for tailoring web page contents to individual customers based on both their history and potentially the history of similar users. Doing so while maintaining interactive response times (typically responding in less than 300 ms) requires fast, parallel access to data potentially spread across hundreds or even thousands of computers. In an earlier post, I described the Facebook architecture and its reliance on clustering for delivering social networking content.

Over the last few years, academia has become increasingly interested in data centers and cloud computing. One reason is the opportunity for impact; the entire computing industry is clearly undergoing another paradigm shift, and five years from now the way we build out computing and storage infrastructures will be radically different. Another allure of the data center is the fact that it is possible to do “clean slate” research and deployment. One frustration of the networking research community has been the inability to deploy novel architectures and protocols because of the need to be backward compatible and friendly to legacy systems. Check out this paper for an excellent discussion. In the data center, it is at least possible to deploy entirely new architectures without the need to be compatible with every protocol developed over the years.

Of course, there are difficulties with performing data center research as well. One is having access to the necessary infrastructure to perform research at scale. With companies deploying data centers at the scale of tens of thousands of computers, it is difficult for most universities and even research labs to have access to the necessary infrastructure. In our own experience, we have found that it is possible to consider problems of scale even with a relatively modest number of machines. Research infrastructures such as Emulab and OpenCirrus are open compute platforms that provide a significant amount of computing infrastructure to the research community.

Another challenge is the lack of software infrastructure for performing data center research, particularly in networking. Eucalyptus provides an EC2-compatible environment for cloud computing. However, there is a relative void of available software for networking research. Rebuilding every aspect of the protocol stack before performing research in fundamental algorithms and protocols is a challenge.

To partially address this shortcoming, we are releasing an alpha version of our PortLand protocol. This work was published in SIGCOMM 2009 and targets delivering a unified Layer 2 environment for easier management and support for basic functionality such as virtual machine migration. I discussed our work on PortLand in an earlier post here and some of the issues of Layer 2 versus Layer 3 deployment here.

The page for downloading PortLand is now up. It reflects the hard work of two graduate students in my group, Sambit Das and Malveeka Tewari, who took our research code and ported it to HP ProCurve switches running OpenFlow. The same codebase runs on NetFPGA switches as well. We hope the community can confirm that the same code runs on a variety of other OpenFlow-enabled switches. Our goal is for PortLand to be a piece of the puzzle for a software environment for performing research in data center networking. We encourage you to try it out and give us feedback. In the meantime, Sambit and Malveeka are hard at work adding Hedera functionality for flow scheduling to our next code release.

Later this month, we will be presenting our work on Hedera at NSDI 2010. The goal of the work is to improve data center network performance under a range of dynamically shifting communication patterns. Below I will present a quick overview of the work starting with some of the motivation for it.

The promise of adaptively choosing the right path for a packet based on dynamically changing network performance conditions is at least as old as the ARPANET. The original goal was to track the current levels of congestion on all available paths between a source and destination and to then forward individual packets along the path likely to deliver the best performance on a packet by packet basis.

In the ARPANET, researchers attempted to achieve this functionality by distributing queue lengths and capacity as part of the routing protocol. Each router would then have a view of not just network connectivity but also the performance available on individual paths. Forwarding entries for each destination would then be calculated not just based on shortest number of hops but also dynamically changing performance measures. Unfortunately, this approach suffered from stability issues. Distribution of current queue lengths as part of the routing protocol was too coarse grained and deterministic. Hence, packets would oscillate from all simultaneously chasing the path that had the best performance in the previous measurement epoch, leaving other paths idle. The previously best performing path would in turn often become overwhelmed as part of this herd effect. The next measurement cycle would reveal the resulting congestion and lead to yet another oscillation.
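The herd effect is easy to reproduce in a toy simulation (hypothetical numbers): every source sees the same stale measurement from the previous epoch and picks the path that looked best then, so the load flips between paths instead of splitting across them.

```python
def simulate(epochs=6, flows=10):
    load = {"A": 10, "B": 0}  # initial queue lengths on two paths
    history = []
    for _ in range(epochs):
        # All sources chase the path that was least loaded last
        # epoch -- the herd effect described above.
        best = min(load, key=load.get)
        load = {p: (flows if p == best else 0) for p in load}
        history.append(best)
    return history

print(simulate())  # the chosen path oscillates: B, A, B, A, ...
```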

As a result of this early experience, inter-domain routing protocols such as BGP settled on hop count as the metric for packet delivery, eschewing performance goals in favor of simple connectivity. Intra-domain routing protocols such as OSPF also initially opted for simplicity, again aiming for the shortest path between a source and destination. Administrators could however set weights for individual links as a way to make particular paths more or less appealing by default.

More recently administrators perform coarse-grained traffic engineering among available paths using MPLS. With the rise of ISPs and the cost of operating hundreds or thousands of expensive long-haul links and customers with strict performance requirements, it became important to make better use of network resources within each domain/ISP. Traffic engineering (TE) extensions to OSPF allowed for bundles of flows from the same ingress to egress points in the network to follow the same path, leveraging the long-term relative stability in traffic between various points of presence in a long-haul network. For example, the amount of traffic from Los Angeles to Chicago aggregated over many customers might demonstrate stability modulo diurnal variations. OSPF-TE allowed network operators to balance aggregations of flows among available paths to smooth the load across available links in a wide-area network. Rebalancing of forwarding preferences could be done on a coarse granularity, perhaps with human oversight, given the relative stability in aggregate traffic characteristics.

Our recent focus has been on the data center and in that environment, the network displays much more bursty communication patterns with rapid shifts in load from one portion of the network to another. At the same time, data center networks only achieve scalability through topologies that inherently provide multiple paths between every source and destination. Leveraging coarse-grained stability on the order of days is not an option for performing traffic engineering in the data center. And yet, attempting to send each packet along the best available path also seems like a non-starter from both a scalability perspective and a TCP compatibility perspective. On the second point, TCP does not behave well when packets may potentially be delivered out of order as the common case.

The state of the art in load balancing in the data center is the Equal Cost Multipath (ECMP) extension to OSPF. Here, each switch tracks the set of available next hops to a particular destination. For each arriving packet, it extracts a (potentially configurable) set of headers (e.g., source and destination IP address, source and destination port, etc.) with the goal of deterministically identifying all of the packets that belong to the same logical higher-level flow. The switch then applies a hash function to the concatenated flow identifier to assign the flow to one of the available output ports.
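The per-flow hashing just described can be sketched as follows (a generic hash function for illustration; real switches implement their own in hardware):

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    # Concatenate the header fields so that every packet of a flow
    # produces the same identifier, then hash it onto an output port.
    flow_id = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha1(flow_id).digest()[:4], "big")
    return next_hops[digest % len(next_hops)]
```

Because the mapping is deterministic, a flow never reorders across paths; but two large flows can hash onto the same link while others sit idle, which is exactly the underutilization discussed next.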

ECMP has the effect of load balancing flows among the available paths. It can perform well under certain conditions, for example when flows are mostly of uniform, small size and when hosts communicate with one another with uniform probability. However, long-term hash collisions can leave certain links oversubscribed while others remain idle. In production networks, network administrators are sometimes left to manually tweak the ECMP hash function to achieve good performance for a particular communication pattern (though of course, the appropriate hash function depends on globally shifting communication patterns).

In our work, we have found that ECMP can underutilize network bandwidth by a factor of 2-4 for moderately sized networks. The worst-case overhead grows with network size.

Our work on Hedera shows how to improve network performance with small communication overhead to maintain overall network scalability. The key idea, detailed in the paper, is to leverage a central network fabric manager that tracks the behavior of large flows. By default, new flows that are initiated are considered small and scheduled using a technique similar to ECMP. However, once a flow grows beyond a certain threshold, the fabric manager attempts to schedule it in light of the behavior of all other large flows in the network. The fabric manager communicates with individual switches in the topology to track resource utilization using OpenFlow. This ensures that in the future our approach can be backward compatible with a range of commercially available switches.

An important consideration in our work is the ability to estimate the inherent demand of a TCP flow independent of its measured bandwidth. That is, the fabric manager cannot schedule flows based on observed per-flow bandwidth: this bandwidth can be off from what a flow would ideally achieve by a large factor because of poor previous scheduling decisions. Hence, we designed an algorithm to estimate the best-case bandwidth that would be available to a TCP flow assuming the presence of a perfect scheduler. This demand estimate, rather than any observed performance characteristic, is the input to our scheduling algorithm.
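As a rough illustration of demand estimation, here is a simplified sketch (assuming unit-capacity NICs; the actual estimator in the paper iterates sender and receiver passes to a fixed point, whereas this sketch does a single sender split followed by a receiver cap):

```python
from collections import defaultdict

def estimate_demands(flows, capacity=1.0):
    """flows: list of (src, dst) pairs. Returns the estimated fair-share
    demand of each flow under an idealized scheduler (simplified)."""
    by_src, by_dst = defaultdict(list), defaultdict(list)
    for i, (src, dst) in enumerate(flows):
        by_src[src].append(i)
        by_dst[dst].append(i)
    # Sender-limited fair share: each NIC splits evenly among its flows.
    demand = {i: capacity / len(by_src[src])
              for i, (src, _dst) in enumerate(flows)}
    # Receiver cap: scale down flows into an oversubscribed destination.
    for idxs in by_dst.values():
        total = sum(demand[i] for i in idxs)
        if total > capacity:
            for i in idxs:
                demand[i] *= capacity / total
    return [demand[i] for i in range(len(flows))]
```

For example, with flows A→X, A→Y, and B→X, host A splits its NIC between two flows while receiver X must be shared, so the three estimates differ even before any flow has been measured.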

The final piece of the puzzle is an efficient scheduling algorithm for placing large flows in the network. One important consideration is the length of the control loop. That is, how quickly can we measure the behavior of existing flows and respond with a new placement of flows. If network communication patterns are shifting more rapidly than we are able to observe and react, we will be left, in effect, continuously reacting to no longer meaningful network conditions. We currently are able to measure and react at the granularity of approximately one second, but this is driven by some limitations in our switch hardware. As part of future work, we hope to drive the overhead down to approximately 100ms. It will likely take some hardware support, perhaps using an FPGA, to go much below 100 ms.

Overall, we have found that Hedera can deliver near-optimal network utilization for a range of communication patterns, with significant improvements relative to ECMP. It remains an open question whether the network scheduling problem we need to solve is NP-hard or not. But our current algorithms are reasonably efficient with acceptable performance under the conditions we have experimented with thus far.

We hope that the relative simplicity of our architecture along with its backward compatibility with existing switch hardware will enable more dynamic scheduling of data center network fabrics with higher levels of delivered performance and faster reaction to any network failures.

Recently, we were fortunate to host Jeff Rothschild, the Vice President of Technology at Facebook, for a visit for the CNS lecture series. Jeff’s talk, “High Performance at Massive Scale: Lessons Learned at Facebook” was highly detailed, providing real insights into the Facebook architecture. Jeff spoke to a packed house of faculty, staff, and students interested in the technology and research challenges associated with running an Internet service at scale. The talk is archived here as part of the CNS lecture series. I encourage you to check it out; below are my notes on the presentation.

Site Statistics:

Facebook is the #2 property on the Internet as measured by the time users spend on the site.

Over 200 billion monthly page views.

>3.9 trillion feed actions processed per day.

Over 15,000 websites use Facebook content

In 2004, the shape of the curve plotting user population as a function of time showed exponential growth to 2M users. Five years later, they have stayed on the same exponential curve with >300M users.

Facebook is a global site, with 70% of users outside of the US.

Today, there are 1.3B people in the world who have quality Internet connectivity, so there is at least another factor of 4 growth that Facebook is going after. Jeff presented statistics for the number of users each engineer supports at a variety of high-profile Internet companies: 1.1M for Facebook, 190,000 for Google, 94,000 for Amazon, and 75,000 for Microsoft.

Photo sharing on Facebook:

Facebook stores 20 billion photos in 4 resolutions

2-3 billion new photos uploaded every month

Originally provisioned photo storage for 6 months, but blew through available storage in 1.5 weeks.

Facebook serves 600k photos/second; serving them is more difficult than storing them.

Scaling photos, first the easy way:

Upload tier: handles uploads, scales the images, and stores them on the NFS tier

Serving tier: Images are served from NFS via HTTP

NFS Storage tier built from commercial products

Filesystems aren’t really good at supporting large numbers of files

Scaling photos, 2nd generation:

Cachr: cache the high volume smaller images to offload the main storage systems.

Only 300M images in 3 resolutions

Distribute these through a CDN to reduce network latency.

Cache them in memory.

Scaling photos, 3rd Generation System: Haystack

How many I/Os do you need to serve an image? Originally, 10 I/Os at Facebook because of the complex directory structure.

Optimizations got it down to 2-4 I/Os per file served

Facebook built a better version called Haystack by merging multiple files into a single large file. In the common case, serving a photo now requires 1 I/O operation. Haystack is available as open source.
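The core idea can be sketched as a toy in-memory version (the real Haystack keeps the merged file and its index on disk, but the single-lookup read path is the same):

```python
class Haystack:
    """Toy append-only photo store: one large buffer plus an in-memory
    index of (offset, length), so a read is a single seek and read."""

    def __init__(self):
        self.blob = bytearray()
        self.index = {}  # photo_id -> (offset, length)

    def put(self, photo_id, data):
        # Sequential append; no per-photo file or directory metadata.
        self.index[photo_id] = (len(self.blob), len(data))
        self.blob.extend(data)

    def get(self, photo_id):
        # One index lookup, one contiguous read: the single "I/O".
        off, length = self.index[photo_id]
        return bytes(self.blob[off:off + length])
```

Merging millions of small files into one large file is what eliminates the repeated directory and inode lookups that cost the original system up to 10 I/Os per photo.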

Facebook architecture consists of:

Load balancers at the front end distribute requests to web servers, which retrieve the actual content from a large memcached layer because of the latency requirements for individual requests.

Presentation Layer employs PHP

Simple to learn: small set of expressions and statements

Simple to write: loose typing and universal “array”

Simple to read

But this comes at a cost:

High CPU and memory consumption.

C++ Interoperability Challenging.

PHP does not encourage good programming in the large (at 3M lines of code it is a significant organizational challenge).

Initialization cost of each page scales with size of code base

Thus Facebook engineers undertook implementing optimizations to PHP:

Lazy loading

Cache priming

More efficient locking semantics for variable cache

Memcache client extension

Asynchronous event-handling

Back-end services that require the performance are implemented in C++. Services philosophy:

Create a service iff required.

Real overhead for deployment, maintenance, separate code base.

Another failure point.

Create a common framework and toolset that will allow for easier creation of services: Thrift (open source).

A number of things break at scale. One example: syslog.

Became impossible to push large amounts of data through the logging infrastructure.

Overall, Facebook currently runs approximately 30k servers, with the bulk of them acting as web servers.

The Facebook Web Server, running PHP, is responsible for retrieving all of the data required to compose the web page. The data itself is stored authoritatively in a large cluster of MySQL servers. However, to hit performance targets, most of the data is also stored in memory across an array of memcached servers. For traditional websites, each user interacts with his or her own data. And for most web sites, only 1-2% of registered users concurrently access the site at any given time. Thus, the site only needs to cache 1-2% of all data in RAM. However, data at Facebook is deeply interconnected; each user is interested in the state of hundreds of other users. Hence, even with only 1-2% of the user population at any given time, virtually all data must still be available in RAM.

Memcache

Data partitioning was easy when Facebook was a college web site: simply partition data at the level of individual colleges. After considering a variety of data clustering algorithms, the engineers found that there was very little win for the additional complexity of clustering. So at Facebook, user data is randomly partitioned across individual databases and machines in the cluster. Hence, each user access requires retrieving data corresponding to user state spread across hundreds of machines. Intra-cluster network performance is therefore critical to site performance. Facebook employs memcache to store the vast majority of user data in memory spread across thousands of machines in the cluster. In essence, nodes maintain a distributed hash table to determine the machine responsible for a particular user's data. Hot data from MySQL is stored in the cache. The cache supports get/set/incr/decr and multiget/multiset operations.
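The key-to-server mapping described above can be sketched as follows (plain modulo hashing for illustration; production deployments typically use consistent hashing so that adding a server does not remap every key):

```python
import hashlib

SERVERS = ["mc0", "mc1", "mc2", "mc3"]  # hypothetical cache hosts

def server_for(key):
    # Hash the key onto one of the memcache servers.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return SERVERS[h % len(SERVERS)]

def multiget(keys):
    # Group keys by owning server so each server receives one batched
    # request -- the parallel fan-out behind composing a single page.
    batches = {}
    for k in keys:
        batches.setdefault(server_for(k), []).append(k)
    return batches
```

This fan-out to hundreds of servers per page request is also what sets up the incast problem described in the next section.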

Initially, the architecture needed to support 15-20k requests/sec/machine, but that number has scaled to approximately 250k requests/sec/machine today. Servers have gotten faster to keep up to some extent, but Facebook engineers also had to perform some fundamental re-engineering of memcached to improve its performance. System performance improved from 50k requests/sec/machine to 150k, to 200k, and to 250k by adding multithreading, polling device drivers, stats locking, and batched packet handling respectively. In aggregate, memcache at Facebook processes 120M requests/sec.

Incast

One networking challenge with memcached was so-called Network Incast. A front-end web server would collect responses from hundreds of memcache machines in parallel to compose an individual HTTP response. All responses would come back within the same approximately 40 microsecond window. Hence, while overall network utilization was low at Facebook, even at short time scales, there were significant, correlated packet losses at very fine timescales. These microbursts overflowed the limited packet buffering in commodity switches (see my earlier post for more discussion on this issue).

To deal with the significant slowdown that resulted from synchronized loss in relatively small TCP windows, Facebook built a custom congestion-aware UDP-based transport that manages congestion across multiple requests rather than within a single connection. This optimization allowed Facebook to avoid, for example, the 200 ms timeouts associated with the loss of an entire window’s worth of data in TCP.

Authoritative Storage

Authoritative Facebook data is stored in a pool of MySQL servers. The overall experience with MySQL has been very positive at Facebook, with thousands of MySQL servers in multiple datacenters. It is simple, fast, and reliable. Facebook currently has 8,000 server-years of runtime experience without data loss or corruption.

Facebook has learned a number of lessons about data management:

Shared architecture should be avoided; there are no joins in the code.

Storing dynamically changing data in a central database should be avoided.

Similarly, heavily-referenced static data should not be stored in a central database.

There are a number of challenges with MySQL as well, including:

Logical migration of data is very difficult.

Create a large number of logical databases and load balance them over a varying number of physical nodes.

Easier to scale CPU on web tier than on the DB tier.

Data driven schemas make for happy programmers and difficult operations.

Given its global user population, Facebook eventually had to move to replicating its content across multiple data centers. Facebook now runs two large data centers, one on the West coast of the US and one on the East coast. However, this introduces the age-old problem of data consistency. Facebook adopts a primary/slave replication scheme where the West coast MySQL replicas are the authoritative stores for data. All updates are applied to these master replicas and asynchronously replicated to the slaves on the East coast. However, without synchronous updates, consecutive requests to the same data item from the same user can return inconsistent or stale results.

The approach taken at Facebook is to set a cookie on user update requests that redirects all subsequent requests from that user to the West coast master for a configurable time period, ensuring that read operations do not return inconsistent results. This approach is described in more detail on the Facebook blog.
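The redirect logic amounts to just a few lines (a sketch with hypothetical names; the lag window and replica names are assumptions, not Facebook's actual values):

```python
import time

REPLICATION_LAG = 20.0  # seconds reads stay pinned to the master (assumed)

def cookie_after_write(now=None):
    """Cookie value to set whenever the user performs a write."""
    return str(now if now is not None else time.time())

def choose_replica(cookies, now=None):
    """Route a read: recent writers go to the master, everyone else
    to the nearby replica, which may lag slightly behind."""
    now = now if now is not None else time.time()
    last_write = float(cookies.get("last_write", 0))
    if now - last_write < REPLICATION_LAG:
        return "west-coast-master"   # read-your-writes consistency
    return "east-coast-replica"
```

The trade-off is that a brief window of reads pays cross-country latency in exchange for never showing a user stale versions of their own updates.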

Jeff also relayed an interesting philosophy from Mark Zuckerberg: “Work fast and don’t be afraid to break things.” The idea is to avoid spending the entire year working cautiously, delivering rock-solid code but not much of it. A corollary: if you take the entire site down, it’s not the end of your career.

Craig Labovitz made a very interesting presentation at the recent NANOG meeting on the most recent measurements from Arbor’s ATLAS Internet observatory. ATLAS takes real-time Internet traffic measurements from 110+ ISPs, with real-time visibility into more than 14 Tbps of Internet traffic. One of the things that makes working in and around Internet research so interesting (and gratifying) is that the set of problems is constantly changing because the way that we use the Internet and the requirements of the applications that we run on the Internet are constantly evolving. The rate of evolution has thus far been so rapid that we constantly seem to be hitting new tipping points in the set of “burning” problems that we need to address.

Craig, currently Chief Scientist at Arbor Networks, has long been at the forefront of identifying important architectural challenges in the Internet. His modus operandi has been to conduct measurement studies at a scale far beyond what might have been considered feasible at any particular point in time. His paper on Delayed Internet Routing Convergence from SIGCOMM 2000 is a classic, among the first to demonstrate the problems with wide-area Internet routing using a 2-year study of simulated failure and repair events injected from a “dummy” ISP, leveraging the many peering relationships that MERIT enjoyed with Tier-1 ISPs. The paper showed that Internet routing, previously thought to be robust to failure, would often take minutes to converge after a failure event as a result of shortcomings of BGP and the way that ISPs typically configured their border routers. This paper spawned a whole cottage industry of research into improved inter-domain routing protocols.

This presentation had three high level findings on Internet traffic:

Consolidation of Content Contributors: 50% of Internet traffic now originates from just 150 Autonomous Systems (down from thousands just two years ago). More and more content is being aggregated through big players and content distribution networks. As a group, CDNs account for approximately 10% of Internet traffic.

Consolidation of Applications: The browser is increasingly running applications. HTTP and Flash are the predominant protocols for application delivery. One of the most interesting findings from the presentation is that P2P traffic as a category is declining fairly rapidly. As a result of efforts by ISPs and others to rate-limit P2P traffic, in a strict “classifiable” sense (by port number), P2P traffic accounts for less than 1% of Internet traffic in 2009. However, the actual number is likely closer to 18% when accounting for various obfuscation techniques. Still, this is down significantly from estimates just a few years ago that 40-50% of Internet traffic consisted of P2P downloads. Today, with a number of sites providing both paid and advertiser-supported audio and video content, the fraction of users turning to P2P for their content is declining rapidly. Instead, streaming of audio and video over Flash/HTTP is one of the fastest growing application segments on the Internet.
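The gap between the sub-1% "classifiable" figure and the ~18% estimate comes down to how port-based classification works. The toy classifier below illustrates the undercount; the port numbers, flow mix, and labels are all made up for the example:

```python
# Toy illustration of why port-based classification undercounts P2P
# traffic: a classifier keyed on well-known ports misses P2P flows
# that use randomized ports or hide behind HTTP/HTTPS ports.
# All port numbers and the flow mix here are illustrative.

WELL_KNOWN_PORTS = {80: "http", 443: "https", 6881: "p2p"}

def classify_by_port(dst_port):
    """Classify a flow purely by its destination port."""
    return WELL_KNOWN_PORTS.get(dst_port, "unclassified")

# (destination port, ground-truth application) for a handful of flows
flows = [
    (80, "http"),     # ordinary web traffic
    (6881, "p2p"),    # BitTorrent on its default port
    (51413, "p2p"),   # BitTorrent on a randomized port
    (443, "p2p"),     # P2P obfuscated behind the HTTPS port
]

detected_p2p = sum(1 for port, truth in flows
                   if truth == "p2p" and classify_by_port(port) == "p2p")
actual_p2p = sum(1 for _, truth in flows if truth == "p2p")
print(detected_p2p, actual_p2p)  # port classifier sees only 1 of 3 P2P flows
```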

Evolution of Internet Core: Increasingly, content is being delivered directly from providers to consumers without going through traditional ISPs. Anecdotally, content providers such as Google, Microsoft, Yahoo!, etc. are peering directly with thousands of Autonomous Systems so that web content from these companies to consumers skips any intermediary tier-X ISPs in going from source to destination.
When ranking ASes by the total amount of data either originated or transited, Google ranked third and Comcast sixth in 2009, meaning that for the first time, a non-ISP ranked in the top 10. Google accounts for 6% of Internet traffic, driven largely by YouTube videos.

Measurements are valuable in providing insight into what is happening in the network but also suggest interesting future directions. I outline a few of the potential implications below:

Internet routing: with content providers taking on an ever larger presence in the Internet topology, one important question is the resiliency of the Internet routing infrastructure. In the past, domains that wished to remain resilient to individual link and router failures would “multi-home” by connecting to two or more ISPs. Content providers such as Google would similarly receive transit from multiple ISPs, typically at multiple points in the network. However, with an increasing fraction of Internet content and “critical” services provided by an ever-smaller number of Internet sites, and with these content providers directly peering with end customers rather than going through ISPs, there is the potential for reduced fault tolerance for the network as a whole. While it is now possible for clients to receive better quality of service with direct connections to content providers, a single failure, or perhaps a small number of correlated failures, can potentially have much more impact on the resiliency of network services.

CDN architecture: The above trend can be even more worrisome if the cloud computing vision becomes reality and content providers begin to run on a small number of infrastructure providers. Companies such as Google and Amazon are already operating their own content distribution networks to some extent and clearly they and others will be significant players in future cloud hosting services. It will be interesting to consider the architectural challenges of a combined CDN and cloud hosting infrastructure.

Video is king: with an increasing fraction of Internet traffic devoted to video, there is significant opportunity in improved video and audio codecs, caching, and perhaps the adaptation of peer-to-peer protocols for fixed infrastructure settings.

Back when I took my graduate course on computer networks (from the tremendous Domenico Ferrari at UC Berkeley), the material was still taught strictly based on the seven-layer OSI protocol stack. Essentially, our textbook had one chapter for each of the seven layers. The running joke about the OSI model is that no one understands exactly what layers 5 (the session layer) and 6 (the presentation layer) were all about. In networking, we spend lots of time talking about layers 1, 2, 3, 4, and 7, but almost none about layers 5 and 6. Recently, people have even started talking about layer 0, e.g., material scientists that create some of the physical substrates that support high levels of bandwidth on optical networks, and layer 8, the higher-level meaning that might be extracted from collections of applications and data, e.g., the Semantic Web.

What I have found interesting as of late, however, is that the line between two of the better-defined layers, layer 2 (the data link layer) and layer 3 (the network, or internetwork, layer), has become increasingly blurred. In fact, I would argue that much of the functionality traditionally relegated to one of these layers is now duplicated in the other. In the past, layer 2 was about getting data to/from hosts on the same physical network, while layer 3 was about getting data among hosts on different physical networks. Presumably, delivering data between hosts on, for instance, the same LAN segment should allow for simplifying assumptions relative to delivering data between networks.

However, technology forces have pushed us to a point where everything is about “inter-networking”. A single physical LAN in isolation is just not interesting. One would think that this would mean that layer 2 protocols would become increasingly marginalized and less important, and that all the action should be at layer 3, since that is where inter-networking happens.

However, just the opposite is in fact happening. Just about all traditional layer 3/inter-networking functionality is migrating to layer 2 protocols. So if one were to squint just a little bit, functionality at layer 2 and layer 3 is virtually indistinguishable and often duplicated. Just as interesting perhaps is that layer 2 may in fact be the place where inter-networking takes place by default, at least within the campus, the enterprise, and the data center. It would be too radical (for now) for me to make claims about it extending to the Internet as a whole, though a number of projects, including the 100×100 effort, have considered this very position.

Here, I will consider some of the reasons why inter-networking is migrating to layer 2. There are at least two major forces at work here.

The first issue goes back to the original design of the Internet and its protocol suite. The designers of the Internet made a crucial, and at the time entirely justified, design decision/optimization. They used a host’s IP address to encode both its globally unique address and its hierarchical position in the global network. That is, a host’s 32-bit IP address would be both the guaranteed unique handle for all potential senders and the basis for scalable routing/forwarding in Internet routers. I recently heard a talk from Vint Cerf where he said that this was the one decision that he most wishes he could revisit.

This design point was perfectly reasonable, and in fact a very nice optimization, as long as Internet hosts never, or at least very rarely, changed locations in the network. As soon as hosts could move from physical network to physical network with some frequency, then conflating host location with host identity introduces a number of challenges. And of course today, we have exactly this situation with WiFi, smart phones, and virtual machine migration. The problem stems from the fact that scalable Internet routing relies on hierarchically encoding IP addresses. All hosts on the same LAN share the same prefix in their IP address; all hosts in the same organization share the same (typically shorter) prefix; etc.

When a host moves from one layer 2 domain (previously one physical network) to another layer 2 domain, it must change its IP address (or use fairly clumsy forwarding schemes originally developed to support IP mobility with home agents, etc.). Changing a host’s IP address breaks all outstanding TCP connections to that host and of course invalidates all network state that remote hosts were maintaining regarding a supposedly globally unique name. Of course, it is worth noting that when the Internet protocols were being designed in the 70’s, an optimization targeting the case where host mobility was considered to be rare was entirely justified and even very clever!
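The prefix hierarchy and the cost of moving can be illustrated with Python's standard `ipaddress` module. The prefixes and addresses below are made up for the example:

```python
import ipaddress

# A small sketch of hierarchical IP addressing: all hosts on a LAN
# share one prefix, the whole organization shares a shorter one, and
# scalable routing depends on this structure. Prefixes are illustrative.

lan = ipaddress.ip_network("10.1.2.0/24")  # one LAN's prefix
org = ipaddress.ip_network("10.1.0.0/16")  # the enclosing organization

host = ipaddress.ip_address("10.1.2.77")
print(host in lan, host in org)  # True True: one /16 entry covers the org

# If the host moves to a different layer 2 domain, it must take an
# address from the new domain's prefix; its old address is no longer
# valid there, which is what breaks outstanding TCP connections.
moved = ipaddress.ip_address("172.16.9.77")
print(moved in lan)  # False: the old LAN's prefix does not cover it
```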

The second major force pushing inter-networking functionality into layer 2 is the relative difficulty of managing large layer-3 networks. Essentially, because of the hierarchy imposed on the IP address name space, layer 3 devices in enterprise settings must be configured with the subnet number corresponding to the prefix they are responsible for. Similarly, end hosts must be configured, through DHCP, to receive an IP address corresponding to the first-hop switch they connect to.

It is for these reasons that network designers and administrators became interested in managing multiple physical networks as a single layer 2 domain, even going back to some of the original work on layer 2 bridging and spanning tree protocols. In an extended LAN, any host could be assigned any IP address and could maintain that IP address as it moved from switch to switch. For instance, consider a campus WiFi network. Technically, each WiFi base station forms its own distinct physical network. If each base station were managed as a separate LAN, then hosts moving from one base station to another would need to be assigned a new IP address corresponding to the new subnet. Similarly, with the advent of virtualization in the enterprise and data center, it is no longer necessary for a host to physically migrate from one network to another. For load balancing, planned upgrades, and thermal management, it is desirable to migrate virtual machines from one physical host to another. Once again, migrating a virtual machine should not necessitate resetting the machine’s globally unique name.

Managing multiple physical networks as a single layer 2 domain introduces its own challenges, however. Forwarding across LANs at layer 2 relies on a single spanning tree, which may result in sub-optimal routes and, worse, admits only a single path between each source and destination.

A number of support protocols, such as ARP, require broadcasting to the entire layer 2 domain, potentially limiting overall scalability.

Aggregation of forwarding entries becomes difficult or impossible because flat MAC addresses increase the amount of state in forwarding tables. An earlier post discusses the memory limitations in modern switch hardware that make this issue a significant challenge.
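The aggregation gap is easy to quantify with a back-of-the-envelope sketch; the subnet size below is illustrative:

```python
import ipaddress

# Why flat MAC addresses defeat aggregation: an IP router can cover
# every host behind a subnet with a single prefix entry, while a
# layer 2 switch needs one forwarding entry per MAC address.
# The example subnet is illustrative.

subnet = ipaddress.ip_network("10.4.0.0/16")
num_hosts = subnet.num_addresses - 2  # exclude network/broadcast addresses

ip_table_entries = 1           # one prefix entry covers the whole subnet
mac_table_entries = num_hosts  # flat addresses: one entry per host

print(ip_table_entries, mac_table_entries)  # 1 vs 65534
```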

Forwarding loops can persist indefinitely, since layer 2 headers have no TTL or hop count field to ensure that looping packets are eventually discarded. This is especially problematic for broadcast packets.
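A toy simulation makes the difference concrete. The hop counts are arbitrary; the point is only that the IP packet terminates on its own while the Ethernet frame is bounded only by an artificial cap:

```python
# Toy simulation of the loop problem: an IP packet's TTL is decremented
# at each hop and the packet is dropped when it reaches zero, while a
# layer 2 frame has no such field and circulates until something
# external intervenes (here, an artificial cap on the simulation).

def hops_until_dropped_ip(ttl=64):
    """Forward around a loop, decrementing TTL each hop; returns
    the number of hops before the packet is discarded."""
    hops = 0
    while ttl > 0:
        ttl -= 1
        hops += 1
    return hops

def hops_until_stopped_ethernet(cap=1000):
    """No TTL in the Ethernet header: only the simulation cap stops
    the frame; in a real network it would loop until a topology
    change or manual intervention."""
    hops = 0
    while hops < cap:
        hops += 1
    return hops

print(hops_until_dropped_ip())       # 64: dropped by the network itself
print(hops_until_stopped_ethernet()) # 1000: bounded only by the cap
```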

In a subsequent post, I will discuss some of the techniques being explored to address these challenges.

Last week, my PhD student Nathan Farrington presented our paper “Data Center Switch Architecture in the Age of Merchant Silicon” at Hot Interconnects. My group has been thinking about the concept of scale out networking. Today, we roughly understand how to build incrementally scalable computation infrastructures with clusters of commodity PC’s. We similarly understand how to incrementally deploy storage in clusters through systems such as GFS or HDFS. Higher-level software enables the computation and storage to be incrementally built out, achieving so-called “scale out” functionality. Adding a number of CPUs and disks should result in a proportional increase in overall processing power and storage capacity.

However, achieving the same functionality for the network remains a challenge. Adding a few high-bandwidth switches to a large topology may not increase the aggregate bandwidth available to applications running on the infrastructure. In fact, ill-advised placement of new switches under Ethernet's original spanning tree protocols can actually reduce available bandwidth.

Of course, the ability to seamlessly harness additional CPUs and storage in some large-scale infrastructure did not become available overnight. Significant monitoring and protocol work went into achieving such functionality. So, one goal of our work is to consider the protocol, software, and hardware requirements of scale-out networking. Essentially, how can developers of large-scale network infrastructures independently add both ports and bandwidth to their topology?

Along one dimension, the network should expand to accommodate more hosts by adding ports. The bandwidth available in the global switching infrastructure should then be re-apportioned to the available ports. This allocation may be influenced by higher-level administrator policy, and importantly not necessarily on a link-by-link, port-by-port, or even path-by-path basis. Rather, the allocation may take place across the applications and services running on the infrastructure. And, of course, the mapping of applications to port sets may change dynamically.

Along a second dimension, the aggregate network bandwidth should be expandable by simply plugging in additional hardware. This bandwidth should then correspond to increased available network performance across the network fabric, again subject to administrator policy.

Thus, I may have a network with 1000 ports of 10 Gigabit/sec Ethernet. The network fabric may support 1 Terabit/sec of aggregate bandwidth, making an average of 1 Gigabit/sec of bandwidth available to each port. This would result in an oversubscription ratio of 10, which may be appropriate depending on the communication requirements of applications running on the framework. Given this network, I should be able to expand the number of ports to 2000 while maintaining aggregate bandwidth in the switching fabric at 1 Terabit/sec, increasing the oversubscription ratio to 20. Similarly, I might increase the aggregate bandwidth in the fabric to 2 Terabits/sec while maintaining the port count at 1000, decreasing the oversubscription ratio to 5.
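The oversubscription arithmetic in these scenarios reduces to one ratio, sketched below (the function name is mine, not from the paper):

```python
# Oversubscription: the ratio of aggregate demand at the edge ports
# to the bandwidth the switching fabric can actually carry.

def oversubscription(ports, port_gbps, fabric_gbps):
    """Aggregate port bandwidth divided by fabric bandwidth."""
    return (ports * port_gbps) / fabric_gbps

print(oversubscription(1000, 10, 1000))  # 10.0: 1000 x 10 Gb/s over a 1 Tb/s fabric
print(oversubscription(2000, 10, 1000))  # 20.0: double the ports, same fabric
print(oversubscription(1000, 10, 2000))  # 5.0: same ports, double the fabric
```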

Our paper considered the hardware requirements of such an architecture. At a high level, we designed a modular two-level network architecture around available “merchant silicon.” The first level, so-called pod switches, are large-scale fully functional Ethernet switches with between 100-1000 ports given current technology design points. The pod switches are built from some number of merchant-silicon chips available economically from any number of manufacturers (including Fulcrum, Broadcom, Gnodal, etc.). Fabric cards containing the merchant silicon control the amount of available aggregate bandwidth (and hence oversubscription ratio) in a pod. The second level of the architecture, the core switching array, similarly leverages the same merchant silicon in modular fabric cards to vary the amount of oversubscription available for global, or inter-pod, communication.

The system scales out the number of ports with additional pods (and line cards within a pod) and adds bandwidth to both pods and the network as a whole with modular line cards.

The work also considers the physical cabling challenges associated with any large-scale network infrastructure. Essentially, transporting lots of bandwidth (e.g., potentially petabits/sec) across a room takes a lot of power and a lot of cables, especially if using traditional copper cable. However, technology trends in optics are changing this side of the equation. More on this in a separate post.

The availability of commodity, feature-rich switches will, I believe, change the face of networking in the same way that commodity processors changed the face of networked services and high-performance computing (back in the mid-90’s, the NOW project at UC Berkeley explored the use of clusters of commodity PC’s to address both domains). Today, the highest performance compute systems are typically built from commodity x86 processors. This was not necessarily true 10 years ago, and certainly not 20. In the same way, the highest performance network fabrics moving forward will be built around commodity Ethernet switch-on-a-chip silicon.