Amazon, Google Cloud Outages Highlight Bigger Risk

Photo: Will Merydith/Flickr

Just when you thought the list of possible gotchas causing cloud outages could not get much longer, a post-mortem of the recent Amazon outage that took out Reddit and Heroku, among others, fingered a memory bug. The finding follows a Google outage on Wednesday that briefly locked users out of some of the company's most popular consumer services.

We’d like to share more about the service event that occurred on Monday, October 22nd in the US-East Region. We have now completed the analysis of the events that affected AWS customers, and we want to describe what happened, our understanding of how customers were affected, and what we are doing to prevent a similar issue from occurring in the future.

The short of the Amazon matter:

At 10:00AM PDT Monday, a small number of Amazon Elastic Block Store (EBS) volumes in one of our five Availability Zones in the US-East Region began seeing degraded performance, and in some cases, became “stuck” (i.e. unable to process further I/O requests). The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers.
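Amazon has not published the agent's code, so the exact defect is unknown. But as a purely hypothetical sketch of how a latent bug in a data-collection agent can quietly consume a storage server's memory, consider an agent that queues metrics for upload and never bounds that queue when its upstream collector is unreachable (all names here are invented for illustration):

```python
from collections import deque

class MetricsAgent:
    """Hypothetical data-collection agent illustrating one kind of latent
    memory bug: samples queued while the upstream collector is unreachable
    are never dropped, so memory grows without bound during a long outage."""

    def __init__(self, max_pending=None):
        # Latent bug: max_pending defaults to None, i.e. an unbounded queue.
        # The flaw is invisible as long as flushes keep succeeding.
        self.pending = deque(maxlen=max_pending)
        self.collector_up = True

    def record(self, sample):
        self.pending.append(sample)
        if self.collector_up:
            self.flush()

    def flush(self):
        # Stand-in for a successful upload to the collection server.
        self.pending.clear()

# With the collector down, the unbounded agent retains every sample,
# while a bounded queue (the fix) caps memory use.
buggy, fixed = MetricsAgent(), MetricsAgent(max_pending=1000)
buggy.collector_up = fixed.collector_up = False
for i in range(100_000):
    buggy.record(i)
    fixed.record(i)
print(len(buggy.pending), len(fixed.pending))
```

The point of the sketch is the failure mode, not the specifics: the bug costs nothing in normal operation and only bites when a dependency is down for long enough, which is exactly what makes such bugs "latent."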

And the long of it:

We apologize for the inconvenience and trouble this caused for affected customers. We know how critical our services are to our customers’ businesses, and will work hard (and expeditiously) to apply the learning from this event to our services. While we saw that some of the changes that we previously made helped us mitigate some of the impact, we also learned about new failure modes. We will spend many hours over the coming days and weeks improving our understanding of the event and further investing in the resiliency of our services.

At about 10:47 p.m. British time on Wednesday, Paul O’Brien couldn’t reach Google. At all.

“Strange,” he said, with a post to Twitter. “My phone just completely lost connectivity to all Google services. Anyone else?”

The response was immediate. “Same here in Mexico,” said someone who calls himself orb3000, who tells us he does work at Veracruz State University. “All google services are out…”

Here at the Wired newsroom in San Francisco, we saw much the same thing. “Gmail, Drive, Reader…everything is down for me,” wrote one reporter on our communal chat system, and soon countless others were complaining as well. “It’s a good thing we’re not beholden to google or anything for our digital lives,” said one particularly sarcastic type.

Six minutes later, Google was back, writes Wired Enterprise’s Cade Metz. During those six minutes, about 10 percent of people trying to reach a long list of Google services were unable to do so, according to a statement from the company. “We apologize to everyone affected and have worked hard to get our services back to normal as quickly as possible,” Google said.

Google declined to discuss the matter further. But this massive outage — however brief — shows how tenuous our “digital lives” can be. And how much we’re dependent on Google in particular. Google has gone to extreme lengths to minimize outages. But it too is fallible, and clearly, multiple services can go down in the event of an engineering mistake, technical malfunction, or natural disaster.

Have your say in the comments section or forum thread below: Is centralizing on cloud platforms a risk you can take? Will cloud providers get on top of outages to the point that they become inconsequential? If outages are the new normal, will moving to a private or hybrid cloud give you a leg up on your rivals?