Thursday, January 03, 2013

The 2012 Christmas Eve outage at Amazon has people talking. The fuss isn't about what broke; it's about what Amazon said they're going to do to fix it. If you aren't familiar with their report, it's worth a quick read. If it's tl;dr, I'll sum it up: a developer whacked some data in a production database that made the load balancing service go hay-wire, and it took longer than it should have to identify the problem and restore it. (did you see how i avoided the technical jargon??)

If you're Amazon, you have to start thinking about how to make sure it never happens again. Restore confidence... and fast. Here's what they said:

We have made a number of changes to protect the ELB service from this sort of disruption in the future. First, we have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval. Normally, we protect our production service data with non-permissive access control policies that prevent all access to production data. The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated. This access was incorrectly set to be persistent rather than requiring a per access approval. We have reverted this incorrect configuration and all access to production ELB data will require a per-incident CM approval. This would have prevented the ELB state data from being deleted in this event. This is a protection that we use across all of our services that has prevented this sort of problem in the past, but was not appropriately enabled for this ELB state data. We have also modified our data recovery process to reflect the learning we went through in this event. We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event. We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state. This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration.

Here's my question: If ITIL Service Transition (thoughtful change management) and DevOps (agile processes with infrastructure-as-code were to mate, what would the outcome be?
A) A child that wanted to run fast but couldn't because of too many manual/approval steps
B) A child that ran fast but only after the change board approved it
C) Mate multiple times; some children will run fast (with scissors) others will move carefully
D) No mating required; just fix the architecture (service recovery)

This is the discussion that I'm having with my colleagues. And to be clear, we aren't talking about what Amazon could/should do, we're talking about what WE should do with our own projects.

Although there's no unanimous agreement there has been some common beliefs:1. Fix the architecture. I like to say that "cloud providers make their architecture highly available so we don't have to." This is an exaggeration, but if the cloud provider does their job right, we will have to focus less on making our application components HA and more about correctly using the providers HA components. There's little disagreement on this topic. AWS screwed up the MTTR on the ELB. We've all screwed up things before... just fix it.

2. Rescind dev-team access. So this is where it gets interesting. Remember all that Kumbaya between developers and operators? Gone. Oh shit - maybe we should have called the movement "DevTestOps"! One simple mistake and you pulled my access to production?? LOL - hell, yea. The fact is all services aren't created equal. I have no visibility into Amazon's internal target SLA's - but I'm going to guess that there are a few services that are five-9's (or 5.26 minutes of down-time per year). Certain BUSINESS CRITICAL services shouldn't be working in DevOps time. They should be thoughtfully planned out with Change Advisory Boards with Change Records and Release Windows by pre-approved Change Roles. Yes - if it's BUSINESS CRITICAL - pull out your ITIL manuals and follow the !*@$ing steps!

Again - there's little disagreement here. People who run highly available architectures know that to re-release something critical requires a special attention to detail. Run the playbook like your launching a nuclear missile: focus on the details.

To be clear, I love infrastructure-as-code. I think everything can be automated and it kills me to think about putting manual steps into tasks that we all know should run human-free. If your application is two-9's (3.6 days of down-time), automate it! Hell, give the developers access to production data - - you can fix it later! What about 99.9% uptime (8.76 hours)? Hmm... not so sure. What about 99.99% up-time? (52.56 minutes)? Well, that's not a lot of time to fix things if they go wrong. But wait - if I did DevOps automation correctly, shouldn't I be able to back out quickly? The answer is Yes - you SHOULD be able to run your SaveMyAss.py script and it MIGHT work.

For me, it's not an either/or choice between ITIL Transition Management and DevOps. IMHO, both have a time and a place. That said, I don't think that the answer is to inbreed the two - DevOps will get fat and be the loser in that battle. Keep agile agile. Use structure when you need it.