Data centers and servers can fail for all sorts of reasons. That's true for businesses running data centers for their own employees' needs. It's true even when you're Amazon, one of the largest data center operators in the world and the most popular provider of infrastructure-as-a-service cloud offerings.

As impressive as Amazon's Elastic Compute Cloud is, it hasn't been immune to the sorts of outages that have afflicted just about every cloud service from Rackspace to Office 365 and Google Docs. All of them have various backup plans and backups to the backup plans, but the latest Amazon outage is a reminder that not even the strictest precautions are always enough.

A new root cause analysis describes an Amazon outage that occurred last week in Amazon's East Coast data centers. The report shows that a series of problems left virtual machines and storage volumes without primary, backup, or secondary backup power: a cable fault took down utility power, a defective cooling fan knocked out a backup generator, and finally an incorrectly configured circuit breaker caused the secondary backup to fail.

Amazon describes the outage's cause thusly:

At approximately 8:44 pm PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53 pm PDT, one of the generators overheated and powered off because of a defective cooling fan.

At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57 pm PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power.

The generator's fan was fixed, and the generator was restarted—some two and a half hours after the initial cable fault. With power restored, Amazon quickly got everything back online.

It's important to note that Amazon's customers can take precautions too. Amazon splits its services into availability zones and regions, and spreading workloads across them can keep customers online even when Amazon is having problems. Those customer-initiated precautions aren't foolproof, though: a lengthy Amazon outage last year, which brought down sites such as Foursquare, reddit, Quora, and Hootsuite, affected multiple availability zones. In the case of last week's outage, however, hosting in multiple zones would have paid off.

"Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional," Amazon said.

For many customers, particularly ones without large data center budgets, outsourcing to Amazon or similar vendors makes a lot of sense even when you consider that there are occasional outages. Outages can be embarrassing—like RIM's worldwide outage affecting BlackBerry services last fall. Some can be puzzling, like one in Dublin last year affecting both Amazon and Microsoft that Amazon initially said was caused by a lightning strike hitting a generator, leading to an explosion and fire. It turned out to be a more mundane failure of a transformer operated by the local electricity company.

Any time there's a failure, it has become standard for cloud providers to apologize and detail the steps being taken to make sure such a thing never happens again. In the case of last week's outage, Amazon said it has "completed an audit of all our back-up power distribution circuits. We found one additional breaker that needed corrective action. We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes."
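The root cause (a breaker configured to open below the load it had to carry) is the kind of thing such an audit can check mechanically. Below is a toy sketch of that sort of validation; the breaker records and safety margin are entirely invented.

```python
# Toy configuration audit: flag any breaker whose trip threshold doesn't
# comfortably exceed the worst-case load it would carry on failover.
# All data here is invented for illustration.
breakers = [
    {"id": "bkr-01", "trip_kw": 900, "failover_load_kw": 600},
    {"id": "bkr-02", "trip_kw": 500, "failover_load_kw": 600},  # misconfigured
]

SAFETY_MARGIN = 1.25  # require 25% headroom above the worst-case load

for b in breakers:
    if b["trip_kw"] < b["failover_load_kw"] * SAFETY_MARGIN:
        print(f"{b['id']}: trips at {b['trip_kw']} kW but must carry "
              f"{b['failover_load_kw']} kW on failover -- needs corrective action")
```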

So, the breakers are fixed, but it's hard to imagine there won't be other problems in the future. This latest outage is a good reminder to take advantage of the redundancy options provided by Amazon, especially if you can afford it and if your business is dependent on always-on systems. Precautionary measures aren't foolproof, but the more precautions you take, the better.

Promoted Comments

Cascading failures can be amazing to watch. Whether it is an engineering failure on a plane or a bad piece of logic in electronics, it is fascinating to observe something happen that took a 1 in 100,000 chance to occur in the first place, then an even rarer occurrence for the second failure, and then it keeps going. Sometimes it feels like the failure is having its hand held the entire way...

They have to TEST their backup means from time to time. Not just assume "Well, it will work when we need it to!" but actually turn on the generators from time to time and run them to make sure that things are working correctly.

That is what my family does with our two generators: run them once every 3 months for 3 hours or so (plugging things into them so that the energy they generate isn't wasted) to make sure that they are still working.

man, with such expertise in home generators, you'd be the perfect candidate to head up amazon's datacenter power infrastructure.

Don't be such a douche. He's not advocating running Amazon's infrastructure, he's pointing out how basic an issue Amazon had. And he's right.

I think what he's saying is that a routine test of those generators for a few hours would have worked that kink out. They did, after all, find at least one other problem when they ran their audit. That at least demonstrates they had been missing things.

This reminds me that I need to replace the battery in my UPS.

The audit found problems with breakers, not other generators.

Also, think about the fuel cost and environmental impact of running generators for 3 hours across the entire Amazon cloud every 3 months. Most generators have self-diagnostics, and it is easy to see how a defective cooling fan could be missed by those.

Regardless, the generator wouldn't have been a problem if the breakers didn't fail, and running generators for 3 hours on a regular basis to "test" them is overkill, especially now that they know the failover will not have the same problem next time.

The problem is: after the backup power systems are installed, how often do they get retested? If never since the initial installation, then it's like never testing your backups regularly. How does anyone know their daily backups are any good?

Second, online storage or cloud data can be mirrored to another location, such as a different state or even a different country. That technology has been around for a long time, and if one location fails completely, the data and servers that run the cloud stay online. Isn't that what "Cloud Storage" is supposed to be? Not centralized in one location? Or is "Cloud Storage" nothing more than a fancy name trying to make online storage sound like more than it is?
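The mirroring the commenter describes can be sketched in a few lines. The example below assumes two pre-existing S3 buckets in different regions (the bucket names are placeholders); a production setup would more likely use the platform's built-in cross-region replication than client-side double writes.

```python
# Sketch of mirroring one object to two regions via double writes.
# Bucket names and key are placeholders.
import boto3

east = boto3.client("s3", region_name="us-east-1")
west = boto3.client("s3", region_name="us-west-2")

data = b"important payload"
for client, bucket in [(east, "myapp-backup-east"), (west, "myapp-backup-west")]:
    client.put_object(Bucket=bucket, Key="records/latest.bin", Body=data)
```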

Not necessarily overkill. Many heavy-duty datacenters already run generators live into the electrical grid, spinning them up and switching to them as primaries to test them - some even do it monthly. And if a self-diag misses a cooling fan error, that diag isn't useful.

Hey, a generator is a damned generator. The only difference is size; all of them work damned near the same way.

You just want to take issue with this because it's true and you're a troll!

I agree - end-to-end testing is the only way to go. Just because the generator turns on doesn't mean everything will work. There are plenty of problems that can come up in the cut from main/commercial power to UPS, then to the generator, and back. I once had to deal with a surge that actually welded the switching between UPS and generator. We were down for a while fixing that one.

Also, I've apparently been talking to too many Navy people. I almost called it Shore power.

Phrased another way, is this sort of triple failure a 1 in 10 occurrence, or 1 in 1000000?

Statistics like this have always amused me. You'd have to be able to view time itself in all its entirety to correctly ascertain the odds. It also depends on the units. Ever notice how they never give them?
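The naive arithmetic, at least, is easy: if the events were independent, the combined odds would be the product of the individual odds. The per-event probabilities below are invented purely for illustration; the commenter's point stands that without units, and with correlated failures, the naive product means little.

```python
# Back-of-envelope odds for the triple failure, assuming (unrealistically)
# that the three events are independent. All probabilities are invented.
p_cable_fault = 1e-4        # chance of a utility cable fault in some window
p_generator_fail = 1e-2     # chance a generator fails during a failover
p_breaker_misconfig = 1e-3  # chance the secondary breaker is misconfigured

print(f"{p_cable_fault * p_generator_fail * p_breaker_misconfig:.1e}")  # 1.0e-09

# In reality the events are conditional on each other -- the breaker only
# matters once load actually transfers -- so the odds of the full cascade,
# given the first failure, are far higher than the naive product suggests.
```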

First, I'm impressed Amazon has a secondary power backup. That shows they take redundancy seriously. Unfortunately, their implementation is somewhat lacking.

The first failure was a cooling fan on the backup generator. My question is why this caused the system to fail - couldn't it have been detected prior to a failover? I remember the summer 2003 blackout in the eastern US and Canada at the datacenter I worked with. The generator kicked over perfectly, so the systems kept running. But the AC wasn't on the generator, so the room temperature started to rise. We still had enough time to shut down nearly everything before the systems started to fail.

The second failure was a human error and is nearly inexcusable. It shows that while they had a secondary backup, it was never exercised and/or maintained. Plus an audit found more than one case of the same problem.
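The 2003 anecdote suggests one mitigation that is easy to sketch: watch room temperature while on generator power and begin orderly shutdowns before the heat kills hardware. In the hypothetical sketch below, read_room_temp_c() and begin_graceful_shutdown() are stand-ins for site-specific tooling.

```python
# Hypothetical watchdog: poll room temperature during a generator run and
# trigger a graceful shutdown before hardware starts failing.
import time

SHUTDOWN_THRESHOLD_C = 35.0  # assumed safe limit; tune for your equipment

def read_room_temp_c() -> float:
    raise NotImplementedError("wire this up to your environmental sensors")

def begin_graceful_shutdown() -> None:
    raise NotImplementedError("flush caches, stop services, power down")

def watch(poll_seconds: int = 60) -> None:
    while True:
        if read_room_temp_c() >= SHUTDOWN_THRESHOLD_C:
            begin_graceful_shutdown()
            return
        time.sleep(poll_seconds)
```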

I hope you test those generators under full load - including the instantaneous ramp-up when the genny has to cut in at full load. I've been lucky enough to handle what was left of the piston of a diesel genny that cracked right out of the cylinder head when the throttle jammed fully open when called into 'real' active service, despite regular simulated load tests. (The subsequent fire was not particularly helpful to recovery activities, as I understand from the engineers involved...)

I sympathise with the breaker problem too. I've been involved with a situation where everything looked peachy and all was well after the UPS and subsequently the genny cut in with everything working as planned, and then had it all undone by the failure of a 50 cent microswitch that turns on the fuel pump to top up the header tank... (The upside of this was a learning experience at least - namely 'priming a diesel engine that's run dry 101' - the downside is realising your load test of the genny is only good for as long as the header tank doesn't drop below refill level; how long do YOU run your regular genny tests for?)

Neither of the above involved a full-blown DC, so they were less embarrassing than the Amazon outage, but even so they taught me this much: Shit Happens.

In other words - sometimes it's good to just say, "damn, that's bad luck," then look at your procedures to see how you can stop that bad luck from happening to you (i.e., learn from it), and then finish with a "there but for the grace of God go I."

Anyone slinging around "pah, idiots, they should have done this" (a) hasn't been in this business long enough to *really* understand Murphy, and (b) is seriously asking for karma to kick them up the arse...

I've also read (though I'll be damned if I can find it on Google) that there is a certain point where redundancy starts to raise the chance of a catastrophic failure due to the added complexity of the redundant systems. My Google-fu is failing me right now though....

Catastrophic events (especially ones that the public at large becomes aware of), by definition, require the failure of several failsafes, redundancies, and countermeasures. After all, that is why they exist in the first place: to mitigate or avert these types of occurrences.

Unfortunately, maintaining said failsafes, redundancies, and countermeasures tends to increase overhead for businesses, so maintenance budgets tend to focus on protecting production systems rather than the backup systems (i.e., drivers are more likely to keep their tires inflated than to check the spare).

Whether or not one agrees with that practice in principle, reality tends toward this maxim: when dealing with limited budgets, manpower, and resources, one will focus on the most crucial systems (read: profit centers) rather than less critical systems (read: cost centers).

Ditto on the UPS battery - it's been 10 years for me. About once per year I have an outage, and it hasn't failed me... yet...

Serious data centers usually start up their generators once every week or two to keep them lubricated internally; just like a car, you do not want one to sit there without running for prolonged periods. That being said, it is both uneconomical AND dangerous to do real tests. You can't start up the generator and transition your main load to it in case it does fail - now you've just caused an outage with a test. You test it as best you can without endangering the load and hope that everything else works as advertised. Data centers also have electricians certify their systems, so it is likely that a 3rd party dropped the ball on a previous inspection in regard to the misconfigured breaker.

On the other hand, data centers do switch to generator when the utility power is suspect, such as during bad storms or rolling brownouts during peak use times. This is because the risk of a bad failover during a power event is higher than that of a non-emergency switch to generator power (for example, the generator can be pre-started to minimize reliance on the DC's UPS when the switch is controlled - a step that could fail during an actual power emergency).

In reality, the most robust system tends to be the primary with the backups being less reliable.

This sort of problem will continue to bedevil supporters of the 'cloud' concept of computing - just as it bedeviled them when it was called client/server architecture back in the day.

What would be fascinating to see is a peer-to-peer 'cloud' that allowed encrypted data to be stored the same way the latest episode of Game of Thrones is stored - not so much on proprietary servers as in pieces across an endless number of personal machines.

Unfortunately, while this would probably be a far superior solution, I'm not seeing how it would be particularly appealing to anyone with the money to develop and popularize it - nor how it would get around a 'tragedy of the commons' situation when you're not allowed to discriminate over the data you store locally, yet it is accessible to the entire network.

In fairness, their implementation provides pretty seamless multi-site redundancy (Availability Zones), and that is far and away the best resiliency (since you can't do much about someone cutting all your in/out cables, or that pesky human-error element).

There's an argument to be made that anyone who *can* scale in that direction should do so and not care too much about the redundancies within each site[1] (it will be cheaper to run each individual site, and won't give you a false sense of security).

1. Still have backup power, just don't sweat multiple levels of it, or whether it can be maintained for that much time (sufficient to get your in-flight data out, backups, clean shutdowns, etc.).

It is actually standard practice in telecommunications to do once-monthly generator checks. However, this does not always include switching the load - often it's just testing the generator.

For critical telecom systems that can't be run redundantly in multiple data centers (for whatever reason), we actually transfer the entire system to an alternate data center weekly and run production there for a week before transferring it to the next production DC. Each location is sized to handle the entire load. Usually it takes a 4-hour failure to get the budget to do this, but after that happens, the money appears.
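That rotation is simple enough to express in code. Here is a sketch that picks the active datacenter from the ISO week number; the datacenter names are placeholders, and the hard part in practice is the data replication behind the switch, not the schedule.

```python
# Sketch of a weekly production rotation: each ISO week a different
# datacenter carries the full load, so every site gets exercised for real.
import datetime

DATACENTERS = ["dc-east", "dc-central", "dc-west"]  # placeholder names

def active_datacenter(today: datetime.date) -> str:
    week = today.isocalendar()[1]  # ISO week number, 1..53
    return DATACENTERS[week % len(DATACENTERS)]

print(active_datacenter(datetime.date.today()))
```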

More and more, these systems are designed to have transactions balanced across many data centers. But sometimes you are stuck running a system designed 20 years ago, where deployment of any new system requires 3+ years.

Telecom data center design would amaze most people. For power:
* redundant power from at least 2 different substations
* onsite batteries for about 15 minutes of run time for all equipment - these are conditioned, babied, replaced
* onsite generators with high-priority fueling - often just behind a trauma hospital, before other hospitals
* redundant power supplies to all equipment, fed from the 2 different power grid inputs
* storage frames with batteries built in, so even if all the other redundant power disappears, the storage will have time to write the items in cache and spin down the drives safely

I'm more interested in this quote: "Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications..."

Ok, define "meaningful"? They didn't say "avoided disruption"; they qualified it. So if I had a service running in a multi-Availability Zone configuration, did my site go down or not? If customers couldn't get to it, it was meaningful to me...

Speculation: If you were spread across two zones, you were running at 50% maximum bandwidth/processing capacity. If three, 67%. Four, 75%. Disruption would most likely occur if instantaneous demand peaked above 50/67/75/etc percent maximum capacity, and would take the form of longer access time for customers.

Further speculation: Most services don't typically sustain >50% maximum capacity. The slashdot effect or DDoS can push almost anyone over 100% maximum capacity, leading to slowdowns. With that in mind, I think the basic result would be that while normal operation would be more or less uninterrupted, your service would be much more sensitive to spikes in demand until Amazon fixed their stuff.
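The capacity figures in that speculation are just (N - 1) / N, the share that remains when one of N evenly loaded zones goes dark:

```python
# Remaining capacity after losing one of N evenly loaded zones.
for n in [2, 3, 4]:
    print(f"{n} zones: {(n - 1) / n:.0%} of capacity remains")
# 2 zones: 50% ... 3 zones: 67% ... 4 zones: 75%
```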