Facebook has been so reliable that when a site outage does occur, it's a definite learning opportunity. Fortunately for us, in More Details on Today's Outage, Facebook's Robert Johnson gave a pretty candid explanation of what caused a rare 2.5 hour period of downtime for Facebook. It wasn't a simple problem. The root causes were feedback loops and transient spikes caused ultimately by the complexity of weakly interacting layers in modern systems. You know, the kind everyone is building these days. Problems like this are notoriously hard to fix, and finding a real solution may send Facebook back to the whiteboard. There's a technical debt that must be paid.

The outline and my interpretation (reading between the lines) of what happened is:

Remember that Facebook caches everything. They have 28 terabytes of memcached data on 800 servers. The database is the system of record, but memory is where the action is. So when a problem happens that involves the caching layer, it can and did take down the system.

Facebook has an automated system that checks for invalid configuration values in the cache and replaces them with updated values from the persistent store. We are not told what the configuration property was, but since configuration information is usually important centralized data that is widely shared by key subsystems, this helps explain why there would be an automated background check in the first place.

A change was made to the persistent copy of the configuration value which then propagated to the cache.

Production code thought this new value was invalid, which caused every client to delete the key from the cache and then try to get a valid value from the database. Hundreds of thousands of queries a second, which would normally have been served at the caching layer, went to the database, which crushed it utterly. This is an example of the Dog Pile Problem. It's also an example of the age-old reason why having RAID is not the same as having a backup: on a RAID system, when an invalid value is written or deleted, that change is faithfully mirrored everywhere; only a backup can bring the valid data back.

When a database fails to serve a request, applications will often simply retry, which spawns even more requests, which makes the problem exponentially worse. CPU is consumed, memory is used up, long locks are taken, networks get clogged. The end result is something like the Ant Death Spiral picture at the beginning of this post. Bad juju.

A feedback loop had been entered that didn't allow the databases to recover. Even if a valid value had been written to the database it wouldn't have mattered: Facebook's own internal clients were basically running a DDoS attack on their own database servers. The database was so busy handling requests that no reply would ever be seen, so the valid value couldn't propagate. And putting a valid value in the cache wouldn't have mattered either, because all the clients would still be spinning on the database, unaware that the cache now had a valid value.

What they ended up doing was: fix the code so the value would be considered valid; take down everything so the system could quiet down and restart normally.

This kind of thing happens in complex systems as abstractions leak all over each other at the most inopportune times. So the typical Internet reply to every failure of "how could they be so stupid, that would never happen to someone as smart as me", doesn't really apply. Complexity kills. Always.

Based on nothing but pure conjecture, what are some of the key issues here for system designers?

The Dog Pile Problem has a few solutions; perhaps Facebook will add one of them, but perhaps their system has been so reliable it hasn't been necessary. Should they take the hit, or play the percentages that it won't happen again or that other changes can mitigate the problem? A difficult ROI calculation when you are committed to releasing new features all the time.
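One classic dog-pile mitigation is to let only one client refill a missing cache entry while the rest degrade gracefully instead of stampeding the database. A minimal sketch in Python follows; the dict-based cache, the `fetch_from_db` helper, and the lock-key naming are all illustrative stand-ins, not anything Facebook has described:

```python
import threading

cache = {}
cache_lock = threading.Lock()

def fetch_from_db(key):
    return "value-for-" + key  # placeholder for the real (expensive) query

def cache_add(key, value):
    """Emulate memcached's atomic add: succeeds only if the key is absent."""
    with cache_lock:
        if key in cache:
            return False
        cache[key] = value
        return True

def get(key):
    if key in cache:
        return cache[key]
    # Only the client that wins the lock key refills the entry; the rest
    # fall back to a sentinel instead of piling onto the database.
    if cache_add("lock:" + key, True):
        try:
            value = fetch_from_db(key)
            cache[key] = value
            return value
        finally:
            cache.pop("lock:" + key, None)
    return None  # caller degrades gracefully (stale/empty result)
```

The key property: on a mass cache miss the database sees one query per key, not one per client.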

The need for a caching layer in the first place, with all the implied cache coherency issues, is largely a function of the inability of the database to serve as both an object cache and a transactional data store. Will the need for a separate object cache change going forward? Cassandra, for example, has added a caching layer that, along with its key-value approach, may reduce the need for external caches for database-type data (as opposed to HTML fragment caches and other transient caches).

How did invalid data get into the system in the first place? My impression from the article was that maybe someone did an update by hand, so the value did not go through a rigorous check. This happens because integrity checks aren't centralized in the database anymore; they are in code, and that code can often be spread out and duplicated in many areas. When updates don't go through a central piece of code it's an easy thing to enter a bad value. Yet the article seemed to also imply that the value entered was valid; it's just that the production software didn't think it was valid. This could argue for an issue with software release and testing policies not being strong enough to catch problems. But Facebook makes a hard push for getting code into production as fast as possible, so maybe it's just one of those things that will happen? Also, the data is stored in MySQL as a BLOB, so it wouldn't be possible to do integrity checks at the database level anyway. This argues for using a database that can handle structured value types natively.

Background validity checkers are a great way to slowly bring data into compliance. Usually, though, they are applied to data with a high potential to be unclean: when there are a lot of communication problems and updates get wedged or dropped, when transactions aren't used, or when attributes like counts and relationships aren't calculated in real time. Why would configuration data be checked when it should always be solid? Another problem is, again, that the validation logic in the checker can easily fall out of sync with validation logic elsewhere in the stack, which can lead to horrendous problems as different parts of the system fight each other over who is right.
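One way to keep a checker and the production path from fighting over who is right is to route both through a single validation predicate. A toy sketch, with a made-up validity rule and made-up names:

```python
def is_valid_config(value):
    """The one source of truth for what counts as a valid config value.
    (The rule here is purely illustrative.)"""
    return isinstance(value, int) and 0 < value <= 3600

def production_read(cache, key, persistent_store):
    # Production path and checker share the same predicate, so they can
    # never disagree about validity.
    value = cache.get(key)
    if value is not None and is_valid_config(value):
        return value
    return persistent_store[key]

def background_check(cache, persistent_store):
    """Repair pass: replaces invalid cached values from the store,
    using the *same* predicate as the production path."""
    for key, value in list(cache.items()):
        if not is_valid_config(value):
            cache[key] = persistent_store[key]
```

If the rule must change, it changes in exactly one place, so the checker and the frontends roll forward together.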

The next major issue is how the application code hammered the database server. Now, I have no idea how Facebook structures their code, but I've often seen this problem when applications are in charge of writing error recovery logic, which is a very bad policy. I've seen way too much code like this: while (1) { slam_database_with_another_request(); sleep(1); }. This system will never recover when the sh*t really hits the fan, but it will look golden on trivial tests. Application code shouldn't decide policy because this type of policy is really a global policy. It should be moved to a centralized network component on each box that is fed monitoring data and can tell what's happening with the networks and the services running on them. The network component would issue up/down events that communication systems running inside each process would know to interpret and act upon. There's no way every exception handler in every application can embody this sort of intelligence, so it needs to be moved to a common component and out of application code. Application code should never ever have retries. Ever. It's just inviting the target machine to die from resource exhaustion. In this case a third party application is used, but with your own applications it's very useful to be able to explicitly put back pressure on clients when a server is experiencing resource issues. There's no need to go Wild West out there. This would have completely prevented the DDoS attack.
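The "move retry policy out of application code" idea is close to what is now commonly called a circuit breaker: a shared component counts failures and, once a threshold is crossed, makes every caller fail fast instead of retrying into a struggling server. A minimal single-process sketch, with illustrative thresholds:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; let one probe through later.
    Thresholds and names are illustrative, not Facebook's design."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: refuse immediately, no load reaches the server.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Applications call through the breaker and never write their own retry loops; in the outage scenario above, the breaker would have opened and the database would have seen near-zero traffic while it recovered.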

There are a lot of really interesting system design issues here. To what degree you want to handle these type of issues depends a lot on your SLAs, resources, and so on. But as systems grow in complexity there's a lot more infrastructure code that needs to be written to keep it all working together without killing each other. Much like a family :-)

Jeff, the problem is there's absolutely no way for an application to come up with any reasonable rule. Why 3-5? How long is the timeout between retries? This may make sense in an Internet app, but inside your own data center it doesn't, because services, networks, etc. are all supposed to be there. You can be a lot smarter and make for better system-wide stability.

If you have something that must retry, exponential backoff is a fantastic idea. There's a reason networks use it. No matter how badly you pound your stable storage at first, requests will slow down rapidly enough that, if you wait just tens of seconds, the problem goes away. And in the common case, when everything isn't going to hell, you still have quick retries. No need for complex network monitoring, and no need to think through every special situation you might have.
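For reference, exponential backoff here usually means doubling a delay ceiling per attempt and, ideally, randomizing ("jittering") within it so recovering clients don't all retry in lockstep. A small sketch with illustrative parameters:

```python
import random

def backoff_delays(base=0.1, cap=30.0, attempts=8, rng=random.random):
    """Compute 'full jitter' exponential backoff delays:
    each delay is uniform in [0, min(cap, base * 2**attempt))."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays

# A caller would sleep for each delay between attempts, e.g.:
#   for delay in backoff_delays():
#       if try_request():
#           break
#       time.sleep(delay)
```

The jitter matters as much as the doubling: without it, every client that failed at the same moment retries at the same moment, recreating the spike.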

The reason I don't like EB for applications, Alex, is that it kills your response latency when a condition clears up. You could be waiting unnecessarily long during a retry. At the protocol level, TCP/IP has to assume it doesn't know anything about what's happening on the other side. With your own app in your own data center, that's not a valid assumption.

I can see valid reasons for retries. For instance, downloading a bunch of images from a third party server or service: a network hiccup or 3rd party server hiccup times out the download... you don't want to kill the entire process that is downloading a million photos, so shelve the bad photo and wait a few seconds, try again, and after 3-5 failures in a row kill it... It would be nicer to have this in a 'centralized' system, but that is a complex system that would take time to write, and the 3-5 rule works fine in this case.

Both are completely valid approaches, and I think that both can (and should) be used in tandem. It's completely trivial to have that monitoring component fire off signals at the app servers that the connection is believed to be good and they can now retry.

Part of the reason that you don't want that functionality confined to the monitoring system alone is that the interruption may not occur between a backend component and the monitoring system. Network blips and application hiccups can happen anywhere in the stack for any reason, and it's so simple to add in a sensible retry delay that there's really no good reason not to.

Nicolas, I understand the complexity issue, but at a certain level infrastructure needs to be self-aware so it can make conscious decisions rather than reacting dumbly and making problems worse. Distributing monitoring information is actually pretty straightforward. The trick is to make applications use core infrastructure services so that they never have to see the complexity; they get to see the success path more often than not. It's the core infrastructure services that consume the event information and implement policy. This kind of system-level architecture hasn't been common in the web world, as people are building complex apps now, not just CRUD systems.

Tom, I'm talking about an integrated software architecture for a backend. In a script or whatever where I'm downloading images, a retry is just fine. At that level it's likely the system you are accessing will provide rate limiting, which is a form of what I'm talking about. To have an app sit in a tight loop downloading a million images in a naive way, now times a million, would cause a lot of system problems. We accept that our OS will have scheduling policies on a set of scarce shared hardware resources; applications need to do the same at the cluster and data center level too.

Will the need for a separate object cache change going forward? Cassandra, for example, has added a caching layer that, along with its key-value approach, may reduce the need for external caches for database-type data (as opposed to HTML fragment caches and other transient caches).

I think that this question works in both directions, i.e. will the availability of large-memory machines enable us to store all the data in memory and use the in-memory store as the system of record? This is another way to remove the data consistency issue.

Anyway, looking backward the solution always seems trivial. The challenge, though, is that failures (especially in distributed systems) often happen at places where we don't expect them. We can't prevent failures from happening, but we can cope with them. So the lesson in this case is not how to avoid this type of failure but how to contain it in a way that doesn't lead to entire-system failure.

Perhaps something like a phased roll-out approach could have confined the failure to a sub-cluster and made its impact less catastrophic.

You might consider enforcing client access limits at the database level. Much like Apache: if all the connections are busy your request will queue until the web server has a slot free. Well, the database shouldn't serve more requests than it can reasonably handle.
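The Apache-style limit described here can be sketched as an admission gate in front of the database: a fixed number of active query slots, a bounded wait queue, and outright rejection beyond that so load is shed instead of piling up. The sizes and the reject-with-an-error behavior are illustrative choices:

```python
import threading

class AdmissionGate:
    """Serve at most max_active queries at once; let at most max_waiting
    more queue behind them; reject everything else immediately."""

    def __init__(self, max_active=2, max_waiting=4):
        self.active = threading.BoundedSemaphore(max_active)
        self.waiting = threading.BoundedSemaphore(max_active + max_waiting)

    def run(self, query):
        if not self.waiting.acquire(blocking=False):
            # Beyond capacity + backlog: shed load rather than queue forever.
            raise RuntimeError("overloaded: request rejected")
        try:
            with self.active:  # blocking here == "queued", like Apache's backlog
                return query()
        finally:
            self.waiting.release()
```

A rejected request is cheap for the server and gives the client an unambiguous back-pressure signal, which is exactly what was missing in the outage scenario.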

Because of connection persistence, I don't know that it is trivial to limit the amount of pounding the database will allow without funneling requests through some rate-limiting layer that lets your database handles stay open but only allows through n requests at a time . . . maybe each object cache server can share a finitely small number of handles, increasing nominal database latency but improving overall system robustness . . .

. . . but databases receiving too much traffic ought to be able to fail gracefully . . .

Queuing requests on the client is definitely the answer. However, there is a problem queuing connections when several loosely coupled client nodes are sending requests to the database, since there is no single authority that allows each node a certain maximum of queued requests. I can imagine a solution for this scenario: a cron task on the database server reads the load and then requests that the client nodes reduce their work queues accordingly.
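One illustrative shape for that feedback loop: the database side periodically advertises a capacity number, each client takes a fair share of it as its budget, and drains only that many queued requests per interval. The names and numbers here are entirely hypothetical:

```python
def client_budget(server_capacity, n_clients):
    """Each client's fair share of the server's advertised capacity
    (at least 1 so no client starves completely)."""
    return max(1, server_capacity // max(1, n_clients))

def drain(queue, budget):
    """Send at most `budget` queued requests this interval;
    keep the remainder queued for the next one."""
    return queue[:budget], queue[budget:]
```

When the server lowers its advertised capacity under load, every client's budget shrinks on the next interval, so aggregate pressure falls without any client-to-client coordination.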

One big lesson is about rollout strategy -- detecting this kind of problem before it takes down the site, as Nati says.

It also seems like "kill switches" to degrade noncritical (or critical!) site features could be useful reliability tools. Making it quick to force frontends to treat particular kinds of old or invalid cached values as valid, or fail requests or return empty results rather than hit the database, could slow down the storm without the full-on downtime that happened here.
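The kill-switch idea can be sketched as a feature flag consulted on the read path: when flipped, frontends serve stale cached values (or fail fast) instead of touching the database. The flag name and cache shape are made up for illustration:

```python
# Global flag table an operator can flip at runtime (illustrative).
FLAGS = {"serve_stale_on_overload": False}

def read(key, cache, db_query):
    """cache maps key -> (value, is_fresh); db_query is the expensive path."""
    entry = cache.get(key)
    if entry is not None:
        value, is_fresh = entry
        if is_fresh:
            return value
        if FLAGS["serve_stale_on_overload"]:
            return value  # degraded but cheap: no database traffic at all
    return db_query(key)
```

Flipping the flag turns a storm of database misses into slightly stale reads, buying time to recover without taking the whole site down.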

Of course, easier said than done -- you'd need a lot of switches since you never know what's going to create load, and flipping them can have hard-to-predict effects and is a difficult judgment call. It's a hard problem, and I don't want to minimize how hard Facebook's job is.