Netflix: fail constantly

[Sorry for the sporadic posting. I’ve had more travel in the past 7 weeks than in the previous 2 years. I should be back to a more regular schedule soon.]

The “cloud” is still a new and curious beast for a lot of us, especially people who grew up in a more traditional hosting model. We have several generations of IT workers who have learned everything about hosting on our own hardware and networks. The flexibility of the cloud is a game-changer, and I’m continually learning new places where “conventional wisdom” will lead you down a difficult path.

Netflix has been kind enough to post their five key lessons from their cloud experiences on their tech blog. While these lessons may look simple and perhaps obvious in retrospect, there are two that really hit home with me:

1. Prepare to unlearn everything you know about running applications in your own datacenters.

3. The best way to avoid failure is to fail constantly.

First, an entire generation (or maybe two or three) of system and network administrators learned all of what we know about scale and reliability by running our own applications on our own servers in our own datacenters using our own networks. There are thousands of person-centuries of experience that have created best (or at least “good”) practice on how to be successful in this model, but this has done very little to prepare us to be successful using cloud resources. In fact, it might even be working against us.

We’ve all got a lot to un-learn.

Second, in the olden days, uptime was king, and a high time between reboots (or crashes) was considered a mark of a capable system administrator. Failure was to be avoided at all costs, and testing failover (or disaster recovery) was done infrequently, if at all, due to the high impact and high costs. We did all get used to a more frequent reboot cycle, if only to be able to install all the needed security patches, but that was just a small change in focus, not a complete sea change.

In computing clouds, it is a given and an expectation that instances will fail at random, and the solution is to have an agile application, not to focus on high availability or increasing hardware reliability. Just as there is continuous development, testing and deployment, there needs to be continuous failover testing. Netflix created a tool (Chaos Monkey) specifically to force random failures in their production systems! That’s right, they are constantly creating failures, just to continuously test their failover methods, in the live production system.

That’s a) really hardcore, b) really scary and c) really cool.

That’s one way to put your reputation on the line. And it points out just how you need to do some very non-intuitive things, and unlearn decades of good practice to be successful in the cloud.
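To make the "fail constantly" idea a bit more concrete, here is a minimal sketch of the concept in Python. It is a toy, in-memory simulation, not Netflix's actual tool: the `Instance`, `Service`, `unleash_monkey` and `run_drill` names are all hypothetical, and `replace_failed` is only a stand-in for whatever automation would bring up a replacement instance in a real cloud.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical server instance; only tracks whether it is healthy."""
    name: str
    healthy: bool = True


@dataclass
class Service:
    """A hypothetical service that needs `min_healthy` instances to stay up."""
    instances: list
    min_healthy: int = 1

    def healthy_instances(self):
        return [i for i in self.instances if i.healthy]

    def is_available(self):
        return len(self.healthy_instances()) >= self.min_healthy

    def replace_failed(self):
        # Stand-in for automated recovery: swap each dead instance for a new one.
        for idx, inst in enumerate(self.instances):
            if not inst.healthy:
                self.instances[idx] = Instance(name=inst.name + "-r")


def unleash_monkey(service, rng):
    """Kill one randomly chosen healthy instance and return it (or None)."""
    candidates = service.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def run_drill(service, rounds, seed=0):
    """Inject one failure per round and count rounds where the service went down."""
    rng = random.Random(seed)
    outages = 0
    for _ in range(rounds):
        unleash_monkey(service, rng)
        if not service.is_available():
            outages += 1
        service.replace_failed()
    return outages


if __name__ == "__main__":
    redundant = Service([Instance("a"), Instance("b"), Instance("c")], min_healthy=2)
    fragile = Service([Instance("solo")], min_healthy=1)
    print("redundant outages:", run_drill(redundant, rounds=100))
    print("fragile outages:", run_drill(fragile, rounds=100))
```

In this toy model the service with spare capacity rides out every injected failure, while the single-instance service goes down each time. That is the whole point of the exercise: find out which kind of service you have before a real failure does it for you.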

I think uptime is still king, but the uptime that matters has become
decoupled from the uptime of a single piece of hardware. This trend is
actually independent of the rise of AWS-like IaaS clouds. It stems from
a combination of reaching the limits of single system scalability and
the drastic cost savings enabled by using commodity hardware. If you
run enough systems, then even a small failure rate becomes a whole number of
systems that fail per day.
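
As a rough back-of-the-envelope illustration of that last point, the sketch below turns a per-machine annual failure rate into an expected number of failures per day. The function names and the numbers are hypothetical, and it assumes failures are independent and spread evenly across the year, which real fleets only approximate.

```python
def expected_failures_per_day(fleet_size, annual_failure_rate):
    """Expected machine failures per day, assuming independent failures
    spread evenly over a 365-day year (a simplifying assumption)."""
    return fleet_size * annual_failure_rate / 365


def chance_of_any_failure_today(fleet_size, annual_failure_rate):
    """Probability that at least one machine in the fleet fails today,
    under the same simplifying assumptions."""
    daily_rate = annual_failure_rate / 365
    return 1 - (1 - daily_rate) ** fleet_size


if __name__ == "__main__":
    # Hypothetical fleets, each with a 3.65% annual failure rate per machine.
    for size in (100, 10_000):
        print(size,
              round(expected_failures_per_day(size, 0.0365), 3),
              round(chance_of_any_failure_today(size, 0.0365), 3))
```

With these made-up numbers, a fleet of 10,000 machines at a 3.65% annual failure rate works out to about one failure every day, while a fleet of 100 machines would see one roughly every hundred days.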

In addition, I think the definition of uptime has changed, with
companies like Google, Facebook and Amazon leading the way. Uptime no
longer means merely the absence of “unplanned” downtime; it now means
no downtime at all, planned or otherwise. This shift also requires
planning for failures and doing rolling upgrades.

Netflix is in an interesting middle ground, where their fleet is large
enough that failure is not rare, but not so large that it’s common.
Hence the “Chaos Monkey” (which is a brilliant idea for any service in a
similar place on the failure curve).

That being said, you are correct that “cloud” forces this change on an
IT shop, both because it takes advantage of less reliable commodity
hardware and because the IT team now has less control over change (and,
more to the point, over its timing) in the environment. The mere act of
moving your service to the cloud forces you to plan for failure more.