Archive for November, 2009

Recently, we were fortunate to host Jeff Rothschild, the Vice President of Technology at Facebook, for a visit for the CNS lecture series. Jeff’s talk, “High Performance at Massive Scale: Lessons Learned at Facebook,” was highly detailed, providing real insights into the Facebook architecture. Jeff spoke to a packed house of faculty, staff, and students interested in the technology and research challenges associated with running an Internet service at scale. The talk is archived here as part of the CNS lecture series. I encourage you to check it out; below are my notes on the presentation.

Site Statistics:

Facebook is the #2 property on the Internet as measured by the time users spend on the site.

Over 200 billion monthly page views.

>3.9 trillion feed actions processed per day.

Over 15,000 websites use Facebook content.

In 2004, the shape of the curve plotting user population as a function of time showed exponential growth to 2M users. 5 years later they have stayed on the same exponential curve with >300M users.

Facebook is a global site, with 70% of users outside of the US.

Today, there are 1.3B people in the world who have quality Internet connectivity, so there is at least another factor of 4 growth that Facebook is going after. Jeff presented statistics for the number of users that each engineer supports at a variety of high-profile Internet companies: 1.1M for Facebook, 190,000 for Google, 94,000 for Amazon, and 75,000 for Microsoft.

Photo sharing on Facebook:

Facebook stores 20 billion photos in 4 resolutions

2-3 billion new photos uploaded every month

Originally provisioned photo storage for 6 months, but blew through available storage in 1.5 weeks.

Facebook serves 600k photos/second; serving them is more difficult than storing them.

Scaling photos, first the easy way:

Upload tier: handles uploads, scales the images, stores them on the NFS tier

Serving tier: Images are served from NFS via HTTP

NFS Storage tier built from commercial products

Filesystems aren’t really good at supporting large numbers of files

Scaling photos, 2nd generation:

Cachr: cache the high volume smaller images to offload the main storage systems.

Only 300M images in 3 resolutions

Distribute these through a CDN to reduce network latency.

Cache them in memory.

Scaling photos, 3rd Generation System: Haystack

How many I/Os do you need to serve an image? Originally, 10 I/Os at Facebook because of the complex directory structure.

Optimizations got it down to 2-4 I/Os per file served

Facebook built a better version called Haystack by merging multiple files into a single large file. In the common case, serving a photo now requires 1 I/O operation. Haystack is available as open source.
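The core Haystack idea described above can be sketched in a few lines. This is an illustrative toy, not Facebook’s actual on-disk format: photos are appended to one large volume file, and an in-memory index maps photo id to (offset, size), so a read is a single seek-and-read.

```python
import os

class HaystackStore:
    """Toy sketch of a Haystack-style store: one big file, in-RAM index."""

    def __init__(self, path):
        self.path = path
        self.index = {}                 # photo_id -> (offset, size), kept in RAM
        open(path, "wb").close()        # start with an empty volume file

    def put(self, photo_id, data):
        offset = os.path.getsize(self.path)   # append at the current end
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        # One seek + one read: the single I/O operation the talk describes.
        offset, size = self.index[photo_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)
```

Because the index lives entirely in memory, locating a photo costs no disk I/O; only the final read touches the disk.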

Facebook architecture consists of:

Load balancers at the front end distribute requests to web servers, which retrieve the actual content from a large memcached layer because of the latency requirements for individual requests.

Presentation Layer employs PHP

Simple to learn: small set of expressions and statements

Simple to write: loose typing and universal “array”

Simple to read

But this comes at a cost:

High CPU and memory consumption.

C++ interoperability is challenging.

PHP does not encourage good programming in the large (at 3M lines of code it is a significant organizational challenge).

Initialization cost of each page scales with size of code base

Thus Facebook engineers undertook implementing optimizations to PHP:

Lazy loading

Cache priming

More efficient locking semantics for variable cache

Memcache client extension

Asynchronous event-handling

Back-end services that require the performance are implemented in C++. Services Philosophy:

Create a service iff required.

Real overhead for deployment, maintenance, separate code base.

Another failure point.

Create a common framework and toolset that will allow for easier creation of services: Thrift (open source).

A number of things break at scale. One example: syslog

Became impossible to push large amounts of data through the logging infrastructure.

Overall, Facebook currently runs approximately 30k servers, with the bulk of them acting as web servers.

The Facebook Web Server, running PHP, is responsible for retrieving all of the data required to compose the web page. The data itself is stored authoritatively in a large cluster of MySQL servers. However, to hit performance targets, most of the data is also stored in memory across an array of memcached servers. For traditional websites, each user interacts with his or her own data. And for most web sites, only 1-2% of registered users concurrently access the site at any given time. Thus, the site only needs to cache 1-2% of all data in RAM. However, data at Facebook is deeply interconnected; each user is interested in the state of hundreds of other users. Hence, even with only 1-2% of the user population at any given time, virtually all data must still be available in RAM.
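The read path described above — memcached in front of authoritative MySQL — is the classic cache-aside pattern. A minimal sketch, with a stand-in dictionary cache and database rather than real memcached and MySQL clients:

```python
class DictCache:
    """Stand-in for a memcached client: get returns None on a miss."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

def get_user_data(cache, db, user_id):
    """Cache-aside read: try the cache, fall back to the database, then prime."""
    key = f"user:{user_id}"
    value = cache.get(key)
    if value is None:          # cache miss: go to the authoritative store
        value = db[user_id]    # stand-in for a MySQL query
        cache.set(key, value)  # prime the cache for subsequent readers
    return value
```

After the first miss, subsequent reads are served from RAM, which is why nearly all hot data must fit in the memcached tier.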

Memcache

Data partitioning was easy when Facebook was a college web site: simply partition data at the level of individual colleges. After considering a variety of data clustering algorithms, the engineers found that there was very little win for the additional complexity of clustering. So at Facebook, user data is randomly partitioned across individual databases and machines in the cluster. Hence, each user access requires retrieving data corresponding to user state spread across hundreds of machines. Intra-cluster network performance is hence critical to site performance. Facebook employs memcache to store the vast majority of user data in memory spread across thousands of machines in the cluster. In essence, nodes maintain a distributed hash table to determine the machine responsible for a particular user’s data. Hot data from MySQL is stored in the cache. The cache supports get/set/incr/decr and multiget/multiset operations.
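One simple way a client might implement the key-to-machine mapping and batch keys for a multiget is sketched below. The hashing scheme and server names are illustrative assumptions, not Facebook’s actual implementation:

```python
import hashlib

# Hypothetical server pool; real deployments have thousands of nodes.
SERVERS = ["mc001", "mc002", "mc003", "mc004"]

def server_for(key, servers=SERVERS):
    """Deterministically map a key to one cache server via a hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

def multiget_plan(keys):
    """Group keys by target server so each server gets one batched request."""
    plan = {}
    for key in keys:
        plan.setdefault(server_for(key), []).append(key)
    return plan
```

Batching by destination is what makes multiget cheap: composing one page may need data for hundreds of friends, but the client issues at most one request per cache server.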

Initially, the architecture needed to support 15-20k requests/sec/machine, but that number has scaled to approximately 250k requests/sec/machine today. Servers have gotten faster, which accounts for some of the gain, but Facebook engineers also had to perform some fundamental re-engineering of memcached to improve its performance. System performance improved from 50k requests/sec/machine to 150k, to 200k, to 250k by adding multithreading, polling device drivers, stats locking, and batched packet handling, respectively. In aggregate, memcache at Facebook processes 120M requests/sec.

Incast

One networking challenge with memcached was so-called Network Incast. A front-end web server would collect responses from hundreds of memcache machines in parallel to compose an individual HTTP response. All responses would come back within the same approximately 40 microsecond window. Hence, while overall network utilization was low at Facebook, even at short time scales, there were significant, correlated packet losses at very fine timescales. These microbursts overflowed the limited packet buffering in commodity switches (see my earlier post for more discussion on this issue).

To deal with the significant slowdown that resulted from synchronized loss in relatively small TCP windows, Facebook built a custom congestion-aware UDP-based transport that manages congestion across multiple requests rather than within a single connection. This optimization allowed Facebook to avoid, for example, the 200 ms timeouts associated with the loss of an entire window’s worth of data in TCP.
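The essential idea — one congestion window shared across all outstanding requests instead of per-connection control — can be illustrated with a toy scheduler. This is my own sketch of the concept, not Facebook’s protocol:

```python
from collections import deque

class SharedWindow:
    """Cap in-flight requests across *all* cache servers with one shared window,
    so a burst of parallel fetches cannot overflow shallow switch buffers."""

    def __init__(self, max_inflight):
        self.max_inflight = max_inflight
        self.inflight = 0
        self.queue = deque()

    def submit(self, request):
        if self.inflight < self.max_inflight:
            self.inflight += 1
            return request           # would be sent on the wire immediately
        self.queue.append(request)   # held back to avoid a microburst
        return None

    def on_response(self):
        self.inflight -= 1
        if self.queue:
            self.inflight += 1
            return self.queue.popleft()  # release the next queued request
        return None
```

Pacing requests this way trades a little added latency on queued fetches for avoiding correlated losses — and the 200 ms timeouts — that synchronized responses would otherwise trigger.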

Authoritative Storage

Authoritative Facebook data is stored in a pool of MySQL servers. The overall experience with MySQL has been very positive at Facebook, with thousands of MySQL servers in multiple datacenters. It is simple, fast, and reliable. Facebook currently has 8,000 server-years of runtime experience without data loss or corruption.

Facebook has learned a number of lessons about data management:

Shared architecture should be avoided; there are no joins in the code.

Storing dynamically changing data in a central database should be avoided.

Similarly, heavily-referenced static data should not be stored in a central database.

There are a number of challenges with MySQL as well, including:

Logical migration of data is very difficult.

Create a large number of logical databases and load balance them over a varying number of physical nodes.

Easier to scale CPU on web tier than on the DB tier.

Data driven schemas make for happy programmers and difficult operations.
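The logical-database indirection in the list above is worth a sketch: by fixing a large number of logical shards up front and mapping shards to hosts through a table, rebalancing means moving whole shards rather than re-hashing every row. The shard count and host names here are hypothetical:

```python
N_LOGICAL = 1024  # fixed number of logical databases (illustrative choice)

def logical_db(user_id):
    """A user's logical shard never changes."""
    return user_id % N_LOGICAL

def physical_host(user_id, shard_map):
    """shard_map: logical db id -> hostname; edited when rebalancing load."""
    return shard_map[logical_db(user_id)]
```

Because only `shard_map` changes during a rebalance, data can migrate one logical database at a time while application code keeps using the same stable shard ids.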

Given its global user population, Facebook eventually had to move to replicating its content across multiple data centers. Facebook now runs two large data centers, one on the West coast of the US and one on the East coast. However, this introduces the age-old problem of data consistency. Facebook adopts a primary/slave replication scheme where the West coast MySQL replicas are the authoritative stores for data. All updates are applied to these master replicas and asynchronously replicated to the slaves on the East coast. However, without synchronous updates, consecutive requests to the same data item from the same user can return inconsistent or stale results.

The approach taken at Facebook is to set a cookie on user update requests that redirects all subsequent requests from that user to the West coast master for some configurable time period, ensuring that read operations do not return inconsistent results. This approach is described in more detail on the Facebook blog.
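The cookie scheme amounts to per-user read-your-writes stickiness. A minimal sketch, where the 20-second window and replica names are invented for illustration (the real window is configurable):

```python
import time

STICKY_SECONDS = 20  # hypothetical value; Facebook's window is configurable

def on_write(cookies, now=None):
    """After an update, stamp the user with a read-from-master deadline."""
    now = time.time() if now is None else now
    cookies["read_master_until"] = now + STICKY_SECONDS

def choose_replica(cookies, now=None):
    """Route reads to the master until the stickiness window expires."""
    now = time.time() if now is None else now
    if cookies.get("read_master_until", 0) > now:
        return "master-west"   # guarantees the user sees their own write
    return "replica-east"      # replication lag is acceptable for stale reads
```

The window only needs to exceed the typical replication lag; once it expires, the user’s reads return to the nearby replica.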

Jeff also relayed an interesting philosophy from Mark Zuckerberg: “Work fast and don’t be afraid to break things.” Overall, the idea is to avoid working cautiously the entire year, delivering rock-solid code, but not much of it. A corollary: if you take the entire site down, it’s not the end of your career.

Harsha Madhyastha‘s paper “Moving Beyond End-to-End Path Information to Optimize CDN Performance” won the best paper award at IMC 2009. The paper presents measurements from Google’s production CDN showing that redirecting clients to the nearest CDN node does not necessarily result in the lowest latency. Harsha and his colleagues built a tool called WhyHigh, in production use at Google, that uses a series of active measurements to diagnose the cause of inflated latency for the relatively large number of clients that experience poor latency to individual CDN nodes. Definitely a worthwhile read.

Craig Labovitz made a very interesting presentation at the recent NANOG meeting on the most recent measurements from Arbor’s ATLAS Internet observatory. ATLAS takes real-time Internet traffic measurements from 110+ ISPs, with visibility into more than 14 Tbps of Internet traffic. One of the things that makes working in and around Internet research so interesting (and gratifying) is that the set of problems is constantly changing because the way that we use the Internet and the requirements of the applications that we run on the Internet are constantly evolving. The rate of evolution has thus far been so rapid that we constantly seem to be hitting new tipping points in the set of “burning” problems that we need to address.

Craig, currently Chief Scientist at Arbor Networks, has long been at the forefront of identifying important architectural challenges in the Internet. His modus operandi has been to conduct measurement studies at a scale far beyond what might have been considered feasible at any particular point in time. His paper on Delayed Internet Routing Convergence from SIGCOMM 2000 is a classic, among the first to demonstrate the problems with wide-area Internet routing using a 2-year study of the effects of simulated failure and repair events injected from a “dummy” ISP via the many peering relationships that MERIT enjoyed with Tier-1 ISPs. The paper showed that Internet routing, previously thought to be robust to failure, would often take minutes to converge after a failure event as a result of shortcomings of BGP and the way that ISPs typically configured their border routers. This paper spawned a whole cottage industry of research into improved inter-domain routing protocols.

This presentation had three high level findings on Internet traffic:

Consolidation of Content Contributors: 50% of Internet traffic now originates from just 150 Autonomous Systems (down from thousands just two years ago). More and more content is being aggregated through big players and content distribution networks. As a group, CDNs account for approximately 10% of Internet traffic.

Consolidation of Applications: The browser is increasingly running applications. HTTP and Flash are the predominant protocols for application delivery. One of the most interesting findings from the presentation is that P2P traffic as a category is declining fairly rapidly. As a result of efforts by ISPs and others to rate-limit P2P traffic, in a strict “classifiable” sense (by port number), P2P traffic accounts for less than 1% of Internet traffic in 2009. However, the actual number is likely closer to 18% when accounting for various obfuscation techniques. Still, this is down significantly from estimates just a few years ago that 40-50% of Internet traffic consisted of P2P downloads. Today, with a number of sites providing both paid and advertiser-supported audio and video content, the fraction of users turning to P2P for their content is declining rapidly. Instead, streaming of audio and video over Flash/HTTP is one of the fastest growing application segments on the Internet.

Evolution of Internet Core: Increasingly, content is being delivered directly from providers to consumers without going through traditional ISPs. Anecdotally, content providers such as Google, Microsoft, Yahoo!, etc. are peering directly with thousands of Autonomous Systems so that web content from these companies to consumers skips any intermediary tier-X ISPs in going from source to destination.
When ranking ASes by the total amount of data either originated or transited, Google ranked third and Comcast sixth in 2009, meaning that for the first time, a non-ISP ranked in the top 10. Google accounts for 6% of Internet traffic, driven largely by YouTube videos.

Measurements are valuable in providing insight into what is happening in the network but also suggest interesting future directions. I outline a few of the potential implications below:

Internet routing: with content providers taking on an ever larger presence in the Internet topology, one important question is the resiliency of the Internet routing infrastructure. In the past, domains that wished to remain resilient to individual link and router failures would “multi-home” by connecting to two or more ISPs. Content providers such as Google would similarly receive transit from multiple ISPs, typically at multiple points in the network. However, with an increasing fraction of Internet content and “critical” services provided by an ever-smaller number of Internet sites, and with these content providers directly peering with end customers rather than going through ISPs, there is the potential for reduced fault tolerance for the network as a whole. While it is now possible for clients to receive better quality of service with direct connections to content providers, a single failure, or perhaps a small number of correlated failures, can potentially have much more impact on the resiliency of network services.

CDN architecture: The above trend can be even more worrisome if the cloud computing vision becomes reality and content providers begin to run on a small number of infrastructure providers. Companies such as Google and Amazon are already operating their own content distribution networks to some extent and clearly they and others will be significant players in future cloud hosting services. It will be interesting to consider the architectural challenges of a combined CDN and cloud hosting infrastructure.

Video is king: with an increasing fraction of Internet traffic devoted to video, there is significant opportunity in improved video and audio codecs, caching, and perhaps the adaptation of peer-to-peer protocols for fixed infrastructure settings.


Amin Vahdat is a Professor in Computer Science and Engineering at UC San Diego.