Expecting to Fail

A few days ago, after the latest in a seemingly never-ending string of problems that interrupted connectivity between two of our data centers, coworker #1 said something like “why can’t we have a network that just works?!” The exasperation in his voice is something we all felt to some degree or another. Moments later coworker #2 piped up and said “if our network was perfect, the software you write wouldn’t be nearly as robust.”

There’s a lot of truth to that. I find myself writing software a bit differently now than I did seven or eight years ago, even though I was working on high-traffic web sites then and still am now. The single biggest difference is that I try to expect everything to fail. Everything.

I never really thought about this as a design philosophy or how it affects the way I approach things, but that comment last week made me realize it was something worth talking about. The more I’ve thought about it, the clearer it is that there are six distinct issues I find myself dealing with over and over when it comes to embracing failure: redundancy, locality, caching, timeouts, logging, and monitoring.

Redundancy

Anyone using RAID almost takes the idea of redundancy for granted. The principle is simple: if an inexpensive component is likely to fail, make sure you have another one capable of taking over the task, seamlessly or otherwise. But in a well-designed system (a modern web platform or service), redundancy goes well beyond disk controllers.

At the network level, you typically want at least two connections to the Internet. You want two routers (and firewalls if you have them as separate devices from the routers) which can share state and seamlessly take over when the peer fails. These devices are typically not inexpensive, but they’re very important to the site remaining visible to the outside world.

You want multiple back end servers that are not all plugged into the same switch. By spreading out your machines across switches, you can survive a switch failure (if you do it right). You want multiple DNS, NTP, SMTP, and Web servers.

You want your data center to have a backup power system, typically a combination of batteries and a generator. In fact, there are usually multiple generators.

On your caching and database tiers, you want master/slave or master/master replication pairs to deal with potential failures. On your database boxes, you probably want hardware or software RAID to guard against a single disk failure.

Planning redundancy into all layers of the system is a form of accepting failure and planning to run in spite of it. Large-scale distributed file systems like Google’s GFS or Hadoop’s HDFS use a replication strategy to ensure that every block of every file exists on at least N nodes. Most users operate with N=3 to guard against a double failure. Maybe a disk controller in one node dies and then a CPU goes bad in a second–and they just happened to have some files or blocks in common. There’s still a live node which can be used to re-distribute the data.
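To make the N=3 argument concrete, here is a minimal Python sketch. It is not how GFS or HDFS actually place blocks (their real policies take racks and load into account); the round-robin place_replicas function, the node names, and the survives helper are all made up for illustration. It only shows that with three copies of every block, any two nodes can die and each block is still readable somewhere.

```python
import itertools

def place_replicas(block_id, nodes, n=3):
    """Pick n distinct nodes for a block (hypothetical round-robin policy)."""
    if n > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(n)]

def survives(placement, failed):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )

# Six imaginary nodes holding twelve blocks, three replicas each.
nodes = ["node%d" % i for i in range(6)]
placement = {block: place_replicas(block, nodes) for block in range(12)}

# Check every possible pair of simultaneous node failures.
double_failures_ok = all(
    survives(placement, set(pair))
    for pair in itertools.combinations(nodes, 2)
)
```

With this placement a third simultaneous failure can lose data (block 0 lives only on node0, node1, and node2), which is why the surviving node needs to re-distribute its blocks quickly after a double failure.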

In any sufficiently large system, every single point of failure (SPOF) will eventually fail. Expect it. Plan for it.

Locality

The flip side of spreading things out and having more than one of pretty much everything is that things can be rather spread out. In the extreme cases, you have multiple data centers, each with a full copy of your data and running critical services. In a case like this, you want to make sure that you’re using local services as much as possible. For example, a job doing routine data analysis on large data sets in data center A should not query a database in data center B if there’s a perfectly usable one in data center A it can query.

You might think this is about performance and latency–it is. Local data sources are going to be faster than remote. But remote services are also more likely to fail. Well, they may not fail, but they can be inaccessible due to failures in between. There are a lot more variables at work and more things to go wrong along the path when you’re talking between data centers.
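One way to express that preference in code is to try local replicas first and only reach across data centers when they fail. The sketch below assumes each replica is described by a small dict with hypothetical "host" and "dc" keys, and that the caller supplies a run_query function that raises ConnectionError when a replica is unreachable; none of those names come from a real library.

```python
def order_by_locality(replicas, local_dc):
    """Return replicas with those in local_dc first, keeping relative order."""
    # sorted() is stable, and False sorts before True, so local entries lead.
    return sorted(replicas, key=lambda replica: replica["dc"] != local_dc)

def query_with_fallback(replicas, local_dc, run_query):
    """Try local replicas first, then remote ones, until one answers."""
    last_error = None
    for replica in order_by_locality(replicas, local_dc):
        try:
            return run_query(replica)
        except ConnectionError as exc:
            last_error = exc  # remember why this replica failed, try the next
    raise RuntimeError("no replica reachable") from last_error
```

The remote data center is still there as a safety net, but a healthy local database means the job never depends on the link between sites.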

A few years ago, Hadoop was patched to incorporate locality into decisions about where jobs should try to read data. This was mainly for performance but I’m sure it also increased the overall reliability of clusters along the way. A double win.

Caching

Sometimes it’s better to have stale data than no data at all. Would a user visiting your web site rather see the home page with some slightly old data or an HTTP 503 error because some back-end server was down (or slow; we’ll get to that next) for 2 minutes in the middle of the day? Unless you’re a stock exchange (and most of us are not), the answer is almost always “give the user a stale page.”

That means your front-end proxies or your caching layer needs the ability to hold on to data and serve that up in the event of a back-end server problem. Think about using this all the way down the application stack. It’s a powerful technique.
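As a rough illustration of the idea, here is a small in-process cache in Python that serves the last good value when a refresh fails. The class name StaleOnErrorCache, the fetch callback, and the 60-second default are assumptions made up for this sketch; a production proxy or cache would also need eviction, locking, and a limit on how stale is too stale.

```python
import time

class StaleOnErrorCache:
    """Cache that prefers a stale value over an error from the back end."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable that loads a value for a key
        self._ttl = ttl_seconds    # how long a value counts as fresh
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # key -> (value, stored_at)

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # still fresh, no back-end call needed
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]    # back end failed: serve the stale copy
            raise                  # nothing cached yet, so the error surfaces
        self._entries[key] = (value, now)
        return value
```

The important design choice is that expired entries are kept rather than deleted, so there is always something to fall back on once a key has been fetched successfully at least once.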

Timeouts

Cascading failures are the worst. Almost. Failures caused by bugs that occur in very rare circumstances are the worst, but cascading failures are really, really bad too. Very often they’re the result of slowness somewhere. Service X, which usually responds in 5ms, suddenly started taking 2 seconds per request (due to a DNS server failure that resulted from a misconfiguration that nobody caught). That caused requests to pile up on the application tier, which then became overloaded by new connections from the front-end proxies that they couldn’t service fast enough, and then… BOOM!

Cascading failure.

Maybe the front-end proxies die first. Maybe Apache starts hitting MaxClients on the application layer. Maybe database connections max out. It doesn’t matter which domino falls first: service layers fall down one by one, taking the whole site into the abyss.

Having sensible timeouts on various network calls is essential to preventing cascading failures like this. You want things to gracefully degrade, not come to a grinding halt. The default timeouts in a lot of network libraries and client APIs are far, far too high.
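Here is what an explicit timeout can look like with Python’s standard socket module. The function name fetch_with_timeout, the 250 ms value, and the idea of returning a caller-supplied default are assumptions for this sketch; the right timeout depends on how fast the service normally answers. Note that the timeout applies to each socket operation (connect, send, receive) separately, not to the call as a whole.

```python
import socket

def fetch_with_timeout(host, port, request, timeout=0.25, default=None):
    """Send a request and read one reply, giving up quickly if the peer is slow.

    Returns `default` instead of blocking when the service is down or slow,
    so the caller can degrade gracefully (for example, by using cached data).
    """
    try:
        # The timeout covers the connect and stays set on the socket afterwards.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(request)
            return conn.recv(4096)
    except OSError:
        # Timeouts, refused connections, and unreachable hosts are all OSError.
        return default
```

A service that normally answers in 5ms and suddenly takes 2 seconds now costs each caller a quarter of a second instead of tying up a worker for the full 2 seconds.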

Logging

In order to understand failures when they occur, you need good logging. In my mind that means consistency, aggregation, and appropriate levels of verbosity. Consistency means recording logs in a way that makes it easy to piece back together what happened. That means having useful timestamps (system clocks synced with a common NTP server help a lot here) and processes that identify themselves in a useful manner. If you’re using syslog, you get some of that for free. Plus, packages like syslog-ng facilitate network logging and aggregation. In a sufficiently large deployment, you can have one or more log hosts that accept log messages from the other servers, filter them into appropriate files, and so on.

Finally, you need to get a logging level that’s useful enough to provide enough context in the event of a failure, but not so much that you always feel like you’re looking for a needle in a haystack. On the flip side, if you log too little, you’ll be kicking yourself for not having systems that are just a little more chatty on a routine basis, just so you can more easily see when things start to look wrong.
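For applications that do their own logging, the same ideas can be sketched with Python’s standard logging module: UTC timestamps, a host name, and a process name and ID on every line, plus a level threshold to control verbosity. The make_logger helper and the exact line layout are assumptions for illustration, not a standard format.

```python
import logging
import socket
import time

HOSTNAME = socket.gethostname()

class HostnameFilter(logging.Filter):
    """Attach the local host name to every record so lines identify their origin."""
    def filter(self, record):
        record.hostname = HOSTNAME
        return True

def make_logger(name, stream=None, level=logging.INFO):
    """Build a logger that writes consistent, UTC-stamped lines."""
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    formatter = logging.Formatter(
        fmt="%(asctime)sZ %(hostname)s %(name)s[%(process)d] %(levelname)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    formatter.converter = time.gmtime  # UTC, so hosts in different zones line up
    handler.setFormatter(formatter)
    handler.addFilter(HostnameFilter())
    logger = logging.getLogger(name)
    logger.setLevel(level)             # verbosity threshold for this service
    logger.addHandler(handler)
    logger.propagate = False           # avoid duplicate lines via the root logger
    return logger
```

Swapping the StreamHandler for logging.handlers.SysLogHandler would send the same records to a syslog daemon, which is one way to feed a central log host.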

Monitoring

You can’t always be watching logs. While it’s often enlightening to watch a tail -F syslog-all, you probably need to start getting work done at some point. That’s where monitoring comes in. Having computers watch computers makes it a lot easier to detect and then report on problems. A good monitoring system frees you from having to constantly watch a status page somewhere. It knows the expected tolerances for various metrics, tests them routinely, aggregates related alerts, and ultimately notifies you in a timely manner.

You may get the urge to write your own monitoring system. Resist that. Writing a good monitoring system is much harder than it seems on the surface. If at all possible, find an existing system that allows you to plug in your own tests and/or code. The time and effort you invest into finding a good system and deploying it will pay dividends many, many times over.
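The part that is reasonable to write yourself is the individual test you plug in. As an example, the Python sketch below checks disk usage and reports a status using the exit-code convention of Nagios-style check plugins (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function names and the 80% and 90% thresholds are assumptions for illustration; scheduling, alert aggregation, and notification are left to the monitoring system.

```python
import shutil

# Exit codes used by Nagios-style check plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def evaluate(value, warn, crit):
    """Map a measurement onto a status, where higher values are worse."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

def check_disk(path=".", warn=80.0, crit=90.0):
    """Return (status, message) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, "DISK UNKNOWN - %s" % exc
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    percent = 100.0 * usage.used / usage.total
    status = evaluate(percent, warn, crit)
    label = ("OK", "WARNING", "CRITICAL", "UNKNOWN")[status]
    return status, "DISK %s - %.1f%% used on %s" % (label, percent, path)
```

Run as a script, a check like this would print the message and exit with the status code, which is all the monitoring system needs in order to decide whether to alert.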

Conclusion

While I haven’t gone into a lot of detail on any of these items, they’re all important things to consider as your systems continue to grow. Neglecting any one of them could result in a prolonged outage, unexplained downtime, or undetected performance problems. If you haven’t already done so, it’s worth seriously thinking about how ridiculous it will look down the road if you don’t take steps today to work towards a better situation.

Comments on "Expecting to Fail"

loesprite

I’d like to say that we all know there are many ways to keep data and services safe. But at some point, you may not have enough budget for redundancy, locality, and caching. Maybe you have all the services running on a few servers in your house, and the data would easily be destroyed if 2 or 3 disks died at the same time (fire or earthquake). Given that, what we may need to think about is something like priority. We also need some cheap and effective solutions for keeping data safe.

First of all, I agree completely with the notion of proactive monitoring. I’d further add the criticality of fine-grained monitoring. So many people run SAR at the default interval of 10 minutes and don’t realize the data they’re collecting is mush! I’d also say it’s critical that all systems in a cluster are synchronized with NTP and that your monitoring samples all systems as close together as possible (within a few msecs), so when there is a problem you can look at all your logs on all your systems to see what happened and in what order.

That said, I’d suggest you check out the open source monitoring tool I wrote a number of years ago called collectl (see http://collectl.sourceforge.net/), which does everything I said above and more. Just start it and it will collect samples every 10 seconds to the nearest msec, saving very detailed logs for a week (or more if you like). Collectl runs on some of the largest clusters in the world, many of which are on the Top500 list.

Rather than me ramble more, check it out and the next time you do have a failure for which you don’t know the reason, there’s a good chance collectl will.

As an “Old Guy” who started a while ago, Networking sure has gotten complicated as time goes on. Some of us have had certain ‘rules’ drilled into our heads from training & have not bothered to update, regardless of popular opinions.

This is a great article that all Networking folks should have a gawk at; it could save their jobs!
