Extreme Agility at Facebook

By E. Michael Maximilien

November 11, 2009

The Facebook social utility is phenomenally successful. As of summer 2009, the site attracted around 300 million visitors per month. It is often noted that if Facebook were a nation, it would rank among the five most populous countries in the world; and the growth seems to be accelerating! In a nutshell, Facebook has changed the way everyday individuals worldwide conduct their social lives.

Robert Johnson (pictured in the photo), director of engineering at Facebook, delivered the last keynote at OOPSLA 2009. His talk, “Moving Fast at Scale”, aimed to shed some light on Facebook’s scaling issues and successes, as well as to elaborate on the kind of processes the team has used to deal with such incredible growth.

Facebook’s architecture is based on a typical hierarchical PHP Web application model, with a layer of data caching and extracted service components. The caching layer is built on the stable and fast open source memcached software, running on top of one of the largest MySQL installations anywhere. The caching layer is so critical to Facebook’s success that Facebook is now one of the main contributors to memcached.
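The caching layer described above follows the classic cache-aside pattern. Here is a minimal sketch of that pattern in Python; all the names are hypothetical, a plain dict stands in for a memcached cluster, and a second dict stands in for MySQL:

```python
# Cache-aside sketch: read from the cache first, fall back to the database
# on a miss, and invalidate the cached entry on a write.

cache = {}  # stand-in for a memcached cluster
database = {1: {"name": "alice"}, 2: {"name": "bob"}}  # stand-in for MySQL


def get_user(user_id):
    key = "user:%d" % user_id
    user = cache.get(key)            # 1. try the cache first
    if user is None:
        user = database[user_id]     # 2. on a miss, read from the database
        cache[key] = user            # 3. populate the cache for later reads
    return user


def update_user(user_id, fields):
    database[user_id].update(fields)       # write to the database...
    cache.pop("user:%d" % user_id, None)   # ...and drop the stale cache entry
```

Because most Facebook data is read far more often than it is written, almost all reads are served from step 1 and never touch the database.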

To support their extreme scale needs, the various service components use a homegrown, now Apache open source, RPC mechanism called Thrift. The RPC language was influenced by the CORBA interface definition language (IDL); it is designed to bind to various languages and is optimized for speed over the wire.
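For flavor, a Thrift service definition looks roughly like the following (a hypothetical example, not an actual Facebook interface); the numbered fields drive Thrift’s compact binary encoding, and the IDL compiler generates client and server bindings for each target language:

```thrift
// Hypothetical profile service, for illustration only.
struct Profile {
  1: i64    id,
  2: string name,
}

service ProfileService {
  Profile getProfile(1: i64 id),
}
```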

Perhaps the most interesting and revealing aspect of Robert’s talk was the discussion of Facebook’s somewhat unique development process. On the surface it appears to have the contradictory goals of minimizing downtime, scaling fast, and pushing code updates extremely quickly. First, Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and are applied directly to parts of the infrastructure. The idea is to quickly find issues and their impact on the rest of the system, and to promptly fix any bugs that result from these frequent small changes.

Second, there is only a limited QA (quality assurance) team at Facebook, but lots of peer review of code. Since the Facebook engineering team is relatively small, all team members are in frequent communication. The team uses various staging and deployment tools as well as strategies such as A/B testing and gradual, geographically targeted launches. The result, according to Robert, is a site that has experienced less than three hours of downtime in the past three years.
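One common way to implement gradual launches of the kind described above is a deterministic percentage-based feature gate. The sketch below is hypothetical (it is not Facebook’s actual gating tool): it hashes the user id into 100 buckets and enables the feature only for users whose bucket falls under the current rollout percentage.

```python
import hashlib


def feature_enabled(feature, user_id, rollout_percent):
    """Deterministically enable `feature` for ~rollout_percent% of users."""
    digest = hashlib.md5(("%s:%d" % (feature, user_id)).encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in [0, 100)
    return bucket < rollout_percent
```

Ramping `rollout_percent` from 1 toward 100 over a few days exposes the feature to a growing slice of users, while each individual user’s experience stays stable because the hash is deterministic.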

To help Facebook deal with the enormous amount of data it collects daily from its users, the team has developed various backend batch services that use Hadoop, Scribe, and Hive. For instance, data (e.g., photos, statuses, and comments) is periodically moved to central repositories, not only to facilitate access but also to enable data analytics. The team has even created a tool called HiPal that provides a SQL-like GUI to the data, allowing the marketing and business teams to run queries and make informed business decisions. Most of these tools are open source; Facebook’s culture is that the “world is a better place due to open source.”
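The core of such batch analytics is a group-by-and-count over huge log files; Hive lets analysts express this as a SQL-like query that compiles down to Hadoop MapReduce jobs. The toy sketch below shows the same shape in plain Python over a few hypothetical log records:

```python
from collections import Counter

# Hypothetical log records of the kind a Scribe pipeline might collect.
log_records = [
    {"user": "alice", "action": "status"},
    {"user": "bob",   "action": "photo"},
    {"user": "alice", "action": "comment"},
    {"user": "carol", "action": "status"},
]

# Roughly equivalent to: SELECT action, COUNT(*) FROM logs GROUP BY action
counts = Counter(record["action"] for record in log_records)
```

At Facebook’s scale the same aggregation runs in parallel across a Hadoop cluster rather than in one process, but the query shape is identical.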

Naturally, while Facebook’s process is adapted to its unusual growth and success, it is not without challenges. For instance, caching relational data allows quick access to read-mostly data (e.g., photos, statuses, and profiles), since such data is read far more often than it is written; but the same approach can fail miserably when the access pattern is write-heavy instead.

This quickly became apparent to the team when they introduced the “Like” feature, the ability for friends and fans to give a thumbs up to posted items: as soon as popular users such as US President Barack Obama or Hollywood actor Vin Diesel posted items, those items received thousands of likes within minutes. By deploying features gradually and having all developers think of “horizontal scaling by design” for every feature, the Facebook team hopes to ensure that new features integrate smoothly with the site.

Another notable aspect of the Facebook engineering team is how large the ratio of active users to developers is: it currently stands at 1.1 million users per developer. This is an attractive recruiting figure, since every new Facebook engineer knows they will quickly have a huge impact (positive or negative), which should keep the adrenaline high when pushing new features. A natural question to ask of such a high-pace, high-impact development environment is whether burnout is, or will become, a significant issue as the company matures. Other issues raised by the audience during the Q&A session related to security and privacy, as well as to the network’s social graph...

Regardless of one’s take on Facebook, and whether it continues to grow to connect more than 500 million or reach the unprecedented figure of 1 billion active users, one thing is certain: Facebook has forever changed how a large portion of the world’s population conducts their social lives. This has global, national, regional, and local consequences, as was noted during the Iranian elections of 2009. In the process, the Facebook team has also taken agility to the extreme, devising a set of principles, tools, and process improvements that allow its engineers to have a quick and real impact on you and me.

11/11/2009 - first post and also minor update to fix some typos and grammar

11/12/2009 - noted that photo is of Robert Johnson. A few more grammar fixes (nothing major)

4 Comments

Ronald Woan

November 11, 2009 07:59

I was very impressed with Robert [Johnson]'s presentation. I think Max understates the case Robert made for speed and their ability to run experiments feeding their iterative development and statistical decision making processes in rapid succession.

Competing against a weekly or, God forbid, monthly release cycle, Facebook can out-experiment their competition by orders of magnitude.

Developers are responsible for design, development, QA, and deployment. The other teams are there in an advisory capacity within those functions, as opposed to acting as gatekeepers as in most other organizations. Their philosophy is to reduce the latency between bug discovery and resolution by keeping the development teams on the hook throughout.

Ties in with the small-chunk, release-early-and-often agile philosophy.

Eugene Maximilien

November 12, 2009 12:27

Great point, Ronald. I was trying hard not to overstate Facebook's process 'extremeness', but you are correct: they seem to be breeding super-developers, and it seems to be working for them so far. Cheers.

Eugene Maximilien

November 12, 2009 12:35

One more observation. Facebook's process seems to be unusual, to say the least. It practically turns Agile on its head, which itself did the same to older processes (e.g., waterfall)... though with Facebook we are not back to waterfall. It's crowdsourcing of QA with unprecedented speed, scale, feedback, and empowered development teams.

Eugene Maximilien

November 19, 2009 06:09

FYI: Greg Linden (http://glinden.blogspot.com/) had a post at BLOG@CACM in September raising and outlining some of the same issues on frequent software updates that led Facebook to their extreme Agile process improvements. Find it here: http://bit.ly/a5ZMQ
