Brad Porter is a Director and Senior Principal Engineer at Amazon. We work in different parts of the company, but I have known him for years and he’s actually one of the reasons I ended up joining Amazon Web Services. Last week Brad sent me the guest blog post that follows where, on the basis of his operational experience, he prioritizes the most important points in the LISA paper On Designing and Deploying Internet-Scale Services.

–jrh

Prioritizing the Principles in “On Designing and Deploying Internet-Scale Services”

At LISA 2007, James presented “On Designing and Deploying Internet-Scale Services,” what I consider to be the single best paper to come out of the highly-available systems world in many years. It gives simple, practical advice for delivering on the promise of high availability.

A few folks have commented to me that implementing all of these principles is a tall hill to climb. I thought I might help by highlighting what I consider to be the most important elements and why.

1. Keep it simple

Much of the work in recovery-oriented computing has been driven by the observation that human errors are the number one cause of failure in large-scale systems. However, in my experience complexity is the number one cause of human error.

Complexity originates from a number of sources: lack of a clear architectural model, variance introduced by forking or branching software or configuration, and implementation cruft never cleaned up. I’m going to add three new sub-principles to this.

Have Well-Defined Architectural Roles and Responsibilities: Robust systems are often described as having “good bones.” The structural skeleton upon which the system has evolved and grown is solid. Good architecture starts from having a clear and widely shared understanding of the roles and responsibilities in the system. It should be possible to introduce the basic architecture to someone new in just a few minutes on a whiteboard.

Minimize Variance: Variance arises most often when engineering or operations teams use partitioning, typically through branching or forking, as a way to handle different use cases or requirement sets. Every new use case creates a slightly different variant. Variations occur along software boundaries, configuration boundaries, or hardware boundaries. To the extent possible, systems should be architected, deployed, and managed to minimize variance in the production environment.

Clean-Up Cruft: Cruft can be defined as those things that clearly should be fixed, but no one has bothered to fix. This can include unnecessary configuration values and variables, unnecessary log messages, test instances, unnecessary code branches, and low-priority “bugs” that no one has fixed. Cleaning up cruft is a constant task, but it is necessary to minimize complexity.

2. Expect failures

At its simplest, a production host or service need only exist in one of two states: on or off. On or off can be defined by whether that service is accepting requests or not. To “expect failures” is to recognize that “off” is always a valid state. A host or component may switch to the “off” state at any time without warning.

If you’re willing to turn a component off at any time, you’re immediately liberated. Most operational tasks become significantly simpler. You can perform upgrades when the component is off. In the event of any anomalous behavior, you can turn the component off.
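If “off” is always a valid state, a client can treat an unavailable replica as a normal outcome rather than an emergency. The following sketch illustrates that idea; the hostnames and the simulated fetch() failure are hypothetical stand-ins, not part of the original paper.

```python
import random

REPLICAS = ["host-a", "host-b", "host-c"]

class HostOff(Exception):
    """Raised when a replica is in the 'off' state."""

def fetch(host: str, key: str) -> str:
    # Illustrative stand-in for a network call; host-a simulates a
    # replica that has been switched off without warning.
    if host == "host-a":
        raise HostOff(host)
    return f"{key}@{host}"

def get(key: str) -> str:
    hosts = REPLICAS[:]
    random.shuffle(hosts)          # spread load; order doesn't affect correctness
    for host in hosts:
        try:
            return fetch(host, key)
        except HostOff:
            continue               # "off" is a valid state, so just try the next replica
    raise RuntimeError("all replicas off")

value = get("user:42")             # succeeds on host-b or host-c
```

Because the client expects failure, taking any single host offline for an upgrade requires no special coordination.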

3. Support version roll-back

Roll-back is similarly liberating. Many system problems are introduced on change-boundaries. If you can roll changes back quickly, you can minimize the impact of any change-induced problem. The perceived risk and cost of a change decreases dramatically when roll-back is enabled, immediately allowing for more rapid innovation and evolution, especially when combined with the next point.

4. Maintain forward-and-backward compatibility

Forcing simultaneous upgrade of many components introduces complexity, makes roll-back more difficult, and in some cases just isn’t possible, as customers may be unable or unwilling to upgrade at the same time.

If you have forward-and-backward compatibility for each component, you can upgrade that component transparently. Dependent services need not know that the new version has been deployed. This allows staged or incremental roll-out. It also allows a subset of machines in the system to be upgraded and receive real production traffic, running simultaneously with older versions of the component, as a last phase of the test cycle.
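One common way to achieve this compatibility, sketched below under assumed field names, is to have readers ignore fields they don’t understand (forward compatibility) and supply defaults for fields an old writer omits (backward compatibility), so old and new versions of a component can exchange messages during a staged roll-out.

```python
import json

# Assumed default for messages written by the pre-upgrade version,
# which doesn't yet include a "region" field.
DEFAULTS = {"region": "us-east-1"}

def read_order(raw: str) -> dict:
    msg = json.loads(raw)
    order = dict(DEFAULTS)
    # Copy only the fields we understand; silently ignore the rest,
    # so a newer writer can add fields without breaking this reader.
    for field in ("order_id", "amount", "region"):
        if field in msg:
            order[field] = msg[field]
    return order

# Message from an old writer: no "region", so the default applies.
old = read_order('{"order_id": 1, "amount": 9.99}')

# Message from a new writer: extra unknown field is ignored.
new = read_order('{"order_id": 2, "amount": 5.00, "region": "eu-west-1", "future_field": true}')
```

With this discipline on both sides of every interface, either version of a dependent service can run against either version of the component.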

5. Give enough information to diagnose

Once you have the big-ticket bugs out of the system, the persistent bugs will only happen one in a million times or even less frequently. These problems are almost impossible to reproduce cost-effectively. With sufficient production data, you can perform forensic diagnosis of the issue. Without it, you’re blind.

Maintaining production trace data is expensive, but ultimately less expensive than trying to build the infrastructure and tools to reproduce a one-in-a-million bug, and it gives you the tools to answer exactly what happened quickly, rather than guessing based on the results of a multi-day or multi-week simulation.
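A minimal version of this kind of forensic trail, assuming a hypothetical request handler, is one structured record per request carrying a correlation id, status, and timing, so the failing one-in-a-million request can be reconstructed from stored logs:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("service")

def handle_request(payload: dict) -> dict:
    # One machine-parsable record per request, keyed by a correlation id.
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "payload_keys": sorted(payload),   # shape of the input, not its (possibly sensitive) values
    }
    start = time.monotonic()
    try:
        # ... real request processing would go here ...
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = round((time.monotonic() - start) * 1000, 3)
        log.info(json.dumps(record))       # emit even on failure paths
    return record

record = handle_request({"order_id": 1, "customer": "abc"})
```

The emit-on-every-path discipline matters: the records you need most are exactly the ones written just before something went wrong.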

I rank these five as the most important because they liberate you to continue to evolve the system as time and resource permit to address the other dimensions the paper describes. If you fail to do the first five, you’ll be endlessly fighting operational overhead costs as you attempt to make forward progress.

If you haven’t kept it simple, then you’ll spend much of your time dealing with system dependencies, arguing over roles & responsibilities, managing variants, or sleuthing through data/config or code that is difficult to follow.

If you haven’t expected failures, then you’ll be reacting when the system does fail. You may also be dealing with complicated change-management processes designed to keep the system up and running while you’re attempting to change it.

If you haven’t implemented roll-back, then you’ll live in fear of your next upgrade. After one or two failures, you will hesitate to make any further system change, no matter how beneficial.

Without forward-and-backward compatibility, you’ll spend much of your time trying to force dependent customers through migrations.

Without enough information to diagnose, you’ll spend substantial amounts of time debugging or attempting to reproduce difficult-to-find bugs.

I would disagree on roll-backs. They are liberating… but not realistic. The ability to fix bugs quickly and deploy fixes automatically is a better feature to have, from my experience, although this is not universal, of course.

Oleg, do you really feel that roll-backs are unrealistic? Partly I don’t see the choice but to support rollback and partly I’ve seen it done successfully at scale. If you are incrementally deploying new code to a multi-hundred- or even thousand-node cluster and you find an operational problem and servers are going down, what do you do? It could take hours or even days to figure out the bug, write the new code, test it, and then deploy it across the fleet. Can you and your customers really afford to be down for a portion of a day? If not, I don’t see any choice but to support rollback.

It’s not possible to detect all bugs upfront, prior to deployment. Being able to abort an upgrade and regroup is much easier on customers (and on the engineering team) than to go down hard and work for the next 36 hours straight trying to get a fix deployed. It just doesn’t work at scale.

James, I completely agree that it’s not possible to detect bugs upfront, and thus testing in production is necessary. But we do deploy the new code on a small subset (tens of boxes) first, and we monitor them for a few days, at least (the time depends on the change complexity). At this point, if any problems pop up, our ability to quickly deploy a version with better tuned error logging may be important.
Roll-back may not be an option if the data formats are changing and some amount of new-format data is already present in the distributed system.
Other than that, your paper is brilliant and totally matches our experience.
Thanks, Oleg

Oleg, that’s the right upgrade model. Unfortunately, you will encounter some failures that don’t manifest in low scale tests. Once you have a data center deployed and you encounter a failure, you really, really want to be able to rollback quickly.

I agree that it’s hard to support rollback when you have protocol or persistent-state changes. But some teams choose to do it because they feel there is a real possibility of an emergent failure that may not manifest until the new code is broadly deployed. They can’t live with that downtime risk exposure, so they choose to implement rollback.

I’m clearly biased by my past experience but, having led a high-scale service that didn’t support roll-back, I’ll NEVER go back. Never. You just haven’t lived until you have had customers and your own executives yelling at you each day, with constant interruptions from folks asking when the system will stabilize again, after three days of most of the engineering team working 18 to 20 hours a day with no sign of a solution on the horizon, and with the entire team just praying for the weekend to come quickly so workload levels will drop enough to get the service back under control.

It was just evil and I’m arguing that anyone running at scale should think hard about this scenario before deciding not to support roll-back. It wasn’t fun for us and our customers didn’t thank us for the experience.

James, obviously, I’m biased too. And maybe just my choice of words is incorrect and I should use "impractical" or "difficult" in place of "unrealistic". I’ve been too biased. Actually, roll-backs are possible in our system, too, as an exception (what would prevent you from uninstalling the new version and installing the old one, right?). But it is not an automated process, like the upgrade is.

Ah, so the challenge with the paper is it doesn’t tell you how to do all these things, only that these are the common best practices. There are hard ways to implement rollback and there are easy ways to implement rollback.

If you have a complex multi-step upgrade process, then supporting rollback as a reversal of those steps can be very impractical. However, if you follow the other tips in the paper by 1) making upgrades automatic, 2) combining software and configuration, and 3) supporting forward-and-backward compatibility, then rollback isn’t very difficult at all.

The simplest way to support rollback is to make rollback identical to "upgrade to a previous version," where you work to minimize the steps of a version upgrade. You stage the software/config versions on the hosts (double-buffering in effect) and throw a config switch (or symlink) to point to the new version. In fact, you can keep the old version resident on disk, and rollback is simply an atomic update of a single system variable. Or, even better, if your running context is a VM, you can run both VM instances simultaneously and just switch the traffic routing. The same techniques can apply to database tables, etc. The main point is that if you can get software deployment down to a simple stage/flip, then you can leave the old version staged and simply flip.
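The stage/flip idea above can be sketched in a few lines. This is a minimal illustration, not a production deployment tool: both versions stay on disk under assumed paths, "current" is a symlink, and the flip is an atomic POSIX rename, so rollback is just another flip.

```python
import os
import tempfile

def flip(link_path: str, target_dir: str) -> None:
    """Atomically repoint link_path at target_dir.

    A rename over an existing symlink is atomic on POSIX, so readers
    always see either the old target or the new one, never neither.
    """
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target_dir, tmp)
    os.replace(tmp, link_path)   # the atomic flip

# Stage two versions side by side (illustrative directory layout).
root = tempfile.mkdtemp()
for version in ("v1", "v2"):
    os.mkdir(os.path.join(root, version))

current = os.path.join(root, "current")
flip(current, os.path.join(root, "v1"))   # initial deploy of v1
flip(current, os.path.join(root, "v2"))   # upgrade: flip to v2
flip(current, os.path.join(root, "v1"))   # rollback is the same operation
```

Because the old version is never removed, rollback costs one rename rather than a re-deploy across the fleet.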

Now, forward-and-backward compatibility can be challenging to do consistently and correctly when you’re dealing with protocols and data schemas. But if you’ve solved that problem, then rapid roll-back should be relatively straightforward.

I’m obviously with James on this one… once you’ve experienced the liberating feeling of rapid roll-back, you never want to go back. The adrenaline rush of deploying a live real-time software patch is exciting and only gets more exciting the more millions of dollars are on the line, but at a certain point the rush wears off and you just want to roll it back, go to sleep, and fix it in the morning.

Good summary paper. I’m increasingly convinced that the Amazon slogan, "you built it, you manage it," is the right approach, especially for the project manager, since the first piece of scope to be lost under budget pressure is anything to do with operations. I also think that the principles apply to smaller-scale environments, and the emphasis of Continuous Delivery on testing and designing for operations removes a whole host of downstream cost and pain.

I’d be a bit more brutal than the recommendations about run-time configuration. I’d prefer to see these managed through the source-code config management system, rather than an audit process. It makes it easier to confirm that the dev process has the right tests in place.

To Oleg’s point, I believe that LMAX (a sizeable commodities trading platform) is a good example of a system that’s been designed/built with deploy/rollback times as key goals. The claim is that the time from checking in a code change to production deployment is around 20 mins, including key tests and schema migrations.

Google’s 1% or 0.1% experiments follow a similar model. For them, it’s not possible to validate a new design without running it in the wild.

What can be a challenge is the automatic deployment and configuration of some COTS components.

I particularly like: "Google’s 1% or 0.1% experiments follow a similar model. For them, it’s not possible to validate a new design without running it in the wild."

It’s almost impossible to cost-effectively run high-fidelity, full-scale tests of complex services anywhere but in production. The trick is to test everything you can before production, be able to deploy incrementally, invest deeply in health monitoring, and roll back quickly if anything suspicious shows up.

There is no way I want to be caught doing a deployment without rollback.