Kim Nash in an interview with Jonathan Heiliger, Facebook VP of technical operations, provides some juicy details on how Facebook handles operations. Operations is one of those departments everyone runs differently as it is usually an ontogeny recapitulates phylogeny situation. With 2,000 databases, 25 terabytes of cache, 90 million active users, and 10,000 servers you know Facebook has some serious operational issues. What are some of Facebook's secrets to better operations?

Frequent Releases. A major release once a week and minor releases every few days.

Create an Operations Group. At one time operations was distributed among several groups. A permanent operations group was created to isolate problems and revert problem software components to previously known good states. The ability of a separate team to handle rollbacks speaks to a great deal of standardization and advanced tool building.

Distribute Team Across Time Zones. Split the operations team across different time zones so no one has to work the graveyard shift. Facebook has 20 people in their team located in Palo Alto, California and London, England.

Be Innovative, Not Safe. Fear of failure often shuts down the organizational brain and makes it hide behind excessive rules and regulations. A technology company should have a bias towards action and innovation. Release software. Don't stifle genius. Rely on your tools and processes to recover from problems.

Expect Problems. Software pushed to production will have problems. Expect problems, but don't let that stop you from innovating.

Roll Backward. When a problem is detected in a release, the changes can be rolled either forward or backward. Rolling back means returning to a previous known-good release. Rolling forward means fixing the problems in the new release rather than reverting, so bugs in production are fixed in production. Roll forwards are the ones that end up being covered in the press, so prefer rollbacks over roll forwards.
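
To make the roll back idea concrete, here is a minimal sketch of a release history that can revert to the last known-good release. The interview does not describe Facebook's actual tooling, so everything here (the `ReleaseHistory` class, its method names, the version strings) is a hypothetical illustration, not their implementation.

```python
class ReleaseHistory:
    """Hypothetical tracker of deployed releases, oldest first."""

    def __init__(self):
        # Each entry records a version and whether it has been verified in production.
        self._releases = []

    def deploy(self, version):
        """Record a new release; it starts out unverified."""
        self._releases.append({"version": version, "known_good": False})

    def mark_good(self, version):
        """Flag a release as known good once it has proven itself."""
        for release in self._releases:
            if release["version"] == version:
                release["known_good"] = True

    def current(self):
        """Return the version currently live, or None if nothing is deployed."""
        return self._releases[-1]["version"] if self._releases else None

    def roll_back(self):
        """Discard unverified releases from the top and return the version now live.

        If the live release is already known good, nothing is discarded.
        """
        while self._releases and not self._releases[-1]["known_good"]:
            self._releases.pop()
        if not self._releases:
            raise RuntimeError("no known-good release to roll back to")
        return self._releases[-1]["version"]


history = ReleaseHistory()
history.deploy("r1")
history.mark_good("r1")
history.deploy("r2")            # r2 turns out to be broken
restored = history.roll_back()  # back to r1
```

The point of the sketch is that a roll back only works if somebody has been recording which releases were good, which is part of the tool building a dedicated operations team can take on.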

Roll Out Massive Changes Slowly. Turn on features gradually, for a few percent of users at a time. Use the slow rollout to fix problems that can only be found under real user conditions. This approach gives operations and development a lot of confidence in changes.
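
One common way to turn a feature on for a few percent of users is to hash each user into a stable bucket and compare the bucket to the current rollout percentage. The sketch below assumes that approach; the interview does not say how Facebook selects users, and the function name `in_rollout` and the feature name `new_feed` are made up for illustration.

```python
import hashlib


def in_rollout(feature: str, user_id: int, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    Hashing feature and user together gives each user a stable bucket per
    feature, so raising the percentage only adds users and never removes any.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999, i.e. 0.01% steps
    return bucket < percent * 100


# Start at 5% of users, then widen to 20% once the feature looks healthy.
users = range(10000)
enabled_at_5 = {u for u in users if in_rollout("new_feed", u, 5)}
enabled_at_20 = {u for u in users if in_rollout("new_feed", u, 20)}
```

Because the bucket is derived from the user id rather than chosen at random on each request, a given user sees consistent behavior throughout the rollout, and setting the percentage to 0 switches the feature off for everyone.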

Encourage Openness and Information Sharing. Design reviews, PR strategy, which servers to buy, etc. are often open for informal debate among employees. Facebook has created an Ideas system where employees can create an Idea by category. There's a discussion tool for discussing the idea and a rating system for rating the idea. Tools are built on top of the Facebook platform so they are available to everyone.

Live-blog Key Events. Large company meetings, monthly presentations and weekly Q&As with the management team are transcribed live.

It sounds like a relatively fun environment for pushing software live. Getting software moved into production is often harder than the original coding and testing. Now I know what you are thinking. You somehow managed to procure the ssh login. So just log in remotely and do the install yourself! Nobody will know. Oh so tempting. But it's not really good corporate citizenship. And you just might screw up, and then there will be some esplaining to do.

Emphasizing frequent releases and gutsy release policies makes it actually seem like someone is supporting developers instead of treating them like their software carries the plague. Data centers are often treated like quarantine stations and developers are treated like asymptomatic carriers of some unknown virulent disease. To be safe nothing should ever change, but that's not an attitude that makes things better. Nice to see that recognized.

To set up or not to set up a separate operations group? Facebook says "to be" and creates a separate group. Amazon says "not to be" and has developers support their own software. Secretly I think Amazon gets better results by requiring developers to support their own software. Knowing it may be you getting the "It's Down!" call gives one proper perspective. But I like not being on call and I think most developers agree. Plus the idea of "following the sun" to get 24 hour support is a smart one.

Reader Comments (7)

Great article, Brian Shire (sorry if it's misspelled) from Facebook presented at PHP|Tek in 2k7 and gave a lot of insight into what they did to scale. The concept of rolling out a change to only a small portion of your site is a pretty cool concept that I hope to try out at some point.

Release early & release often is the best catalyst for both optimizing customer value and your operations. It brings developers and sysadmins closer together and encourages flexibility and operational momentum. For massive changes, I encourage "sleeper code" releases (http://www.agileweboperations.com/avoiding-code-inventory-staged-releases) as well - push out everything you need as soon as it's ready. If the other components aren't ready, simply disable these interfaces - most importantly, you've released it and it's "running in production". Just this phrase is enough to really focus a team and get them pulling together to finish.

Being on call is fine, as long as things work. And that's sort of the point of not separating operational thinking from developers.

Even if there's a separate group, a lot of effort is needed to make sure that all the groups collaborate extensively to avoid the "it's not my problem" mentality, which is unfortunately what I see most of the time.
