August 07, 2009

Fail Fast, Debugging and Systems that Never Stop

If you're a Java guy you're probably familiar with the phrase "fail fucking fast" in the context of Iterators (I added the fucking part). But fail fast is about much more than iterators; it's a general design principle--and a very important one at that.
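For anyone who hasn't run into the iterator version, here is a minimal sketch of it. The class and method names are made up for illustration. Note that the Java documentation describes this behavior as best-effort, so treat it as a bug detector rather than something to rely on for program logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical demo class; not from any library.
class FailFastIteratorDemo {

    // Returns true if the iterator noticed that the list was modified
    // behind its back and refused to continue.
    static boolean modificationDetected() {
        List<String> names = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String name : names) {
                if (name.equals("a")) {
                    // Structural change made directly on the list,
                    // not through the iterator.
                    names.remove(name);
                }
            }
        } catch (ConcurrentModificationException e) {
            // The iterator detected the fault, signaled it, and stopped.
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("modification detected: " + modificationDetected());
    }
}
```

The iterator could have tried to carry on with a list it no longer understands. Instead it stops at the next call and tells you why.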

Jim Gray on fail fast: "it [the process] should either function correctly or it should detect the fault, signal failure and stop operating." You can replace "the process" with component, class, etc. Whatever you designate as your logical unit, the fail fast principle remains. What you absolutely don't want to do is detect the fault, fail to signal it, and continue. This is why swallowing exceptions is one of my all-time pet peeves. Fail fast is often described in terms of fault detection latency: the time between the fault occurring and its detection. You want that latency to be small.
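To make the difference concrete, here is a rough sketch of the same parsing routine written both ways. PortParser and its methods are hypothetical names I picked for the example, and the valid port range is an assumption about what the caller wants.

```java
// Hypothetical example class; the names are illustrative only.
class PortParser {

    // Anti-pattern: the fault is detected, then hidden.
    // Callers receive 0 and cannot tell that anything went wrong.
    static int parsePortSwallowing(String raw) {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return 0; // exception swallowed
        }
    }

    // Fail fast: detect the fault, signal failure, stop.
    static int parsePortFailFast(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("port is missing");
        }
        int port;
        try {
            port = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // Keep the original exception as the cause so nothing is lost.
            throw new IllegalArgumentException("port is not a number: '" + raw + "'", e);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return port;
    }
}
```

With the first version the bad value surfaces much later, probably as a failed connection somewhere far from the parsing code. With the second, the stack trace points straight at the input that caused it.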

There are two advantages to failing fast. First, easier debugging and analysis. It complicates things greatly if your code detects errors but continues. Eventually something will go wrong, but by then you can't easily track down the root cause. I recently worked with a component (one that shall remain nameless) that doesn't follow the fail fast principle. During initialization the component knew it was missing a configuration file, but instead of halting it continued on. When the component was finally invoked things obviously did not work, but the reason was not obvious. I ended up wasting two days tracking down the root cause, when it really should have taken 30 seconds.
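Here is roughly what I wish that component had done. This is only a sketch: the Component class, its constructor argument, and the use of a properties file are all assumptions of mine, not the real component's API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical component that refuses to exist without its configuration.
class Component {
    private final Properties config;

    // Fails during initialization, with the file name in the message,
    // instead of limping along until the first real call.
    Component(Path configFile) {
        if (!Files.isReadable(configFile)) {
            throw new IllegalStateException(
                "Missing or unreadable configuration file: " + configFile.toAbsolutePath());
        }
        Properties loaded = new Properties();
        try (InputStream in = Files.newInputStream(configFile)) {
            loaded.load(in);
        } catch (IOException e) {
            throw new IllegalStateException(
                "Could not read configuration file: " + configFile.toAbsolutePath(), e);
        }
        this.config = loaded;
    }

    // Required settings fail fast too, rather than returning null.
    String setting(String key) {
        String value = config.getProperty(key);
        if (value == null) {
            throw new IllegalStateException("Missing required setting: " + key);
        }
        return value;
    }
}
```

An exception thrown from the constructor names the missing file at startup, which is the 30 second version of the debugging session.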

Second, the ability to build "systems that never stop". Failing fast is so important to building highly available systems that Joe Armstrong talks about it at great length in his presentation on "systems that never stop (and Erlang)." In this presentation he presents six laws:

Isolation

Concurrency

Failure Detection

Fault Identification

Live Code Upgrade

Stable Storage

Without failure detection you can't fail fast, and if you can't fail fast you can't achieve isolation. Real world example: Amazon S3's massive outage last year. A huge embarrassment to Amazon's otherwise great cloud stack. Their code didn't have adequate failure detection, allowing a single corrupt bit to propagate throughout the system. The result was hours of downtime while engineers scrambled. If Amazon had been able to detect the single corrupt bit then they could have isolated the failing component instead of letting the error bring down the entire system.
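One common way to catch this kind of corruption is to attach a checksum to every message and refuse anything that doesn't match. The sketch below is purely illustrative and says nothing about how S3 is actually built; CheckedMessage and its methods are names I made up, and CRC32 is just a convenient choice from the standard library.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical message wrapper; not based on any real S3 code.
class CheckedMessage {
    final byte[] payload;
    final long checksum;

    CheckedMessage(byte[] payload, long checksum) {
        this.payload = payload;
        this.checksum = checksum;
    }

    // Sender side: compute the checksum when the message is created.
    static CheckedMessage seal(byte[] payload) {
        return new CheckedMessage(payload.clone(), crc(payload));
    }

    // Receiver side: verify before use. A corrupt message is rejected
    // here instead of being acted on or passed along to other nodes.
    byte[] open() {
        if (crc(payload) != checksum) {
            throw new IllegalStateException("corrupt message rejected: checksum mismatch");
        }
        return payload.clone();
    }

    private static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        CheckedMessage msg = seal("node-7 is healthy".getBytes(StandardCharsets.UTF_8));
        msg.payload[0] ^= 0x01; // simulate a single flipped bit in transit
        msg.open();             // throws IllegalStateException
    }
}
```

The receiver that sees the mismatch is the only one that has to deal with it, which is the isolation part of the argument.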

Few of us will be coding S3-like systems, but that doesn't mean we can't follow the fail fast approach. It will make debugging easier and just might prevent similar embarrassments. I think some programmers don't like to fail fast because they worry they'll be blamed for the failure, so they just pretend things are good and pray nothing bad happens downstream. I suppose it depends on how much you believe in God.