Amazon Cloud Outage Caused by Storms, Worsened by Software Glitches

Power supply issues and software bugs only made a bad situation worse for AWS, which is facing growing competition in the IaaS space.

Amazon Web Services lengthy outage over the
weekend after powerful storms through the Mid-Atlantic region knocked out power
has put renewed focus on the risks of cloud computing and how users can minimize
those risks going forward.

It also has generated promises from Amazon at a time when competition in the
booming infrastructure-as-a-service (IaaS) space is growing, as the likes of
Google, which in June unveiled its Compute Engine, look to make inroads.

The storms that blew through the Mid-Atlantic
region June 29 hit Virginia particularly hard, knocking out power to hundreds
of thousands of people. The storms also cut off power to one of Amazon Web
Services' (AWS') 10 data centers on the East Coast, a situation that was exacerbated
by problems with backup power at the Virginia facility as well as unexpected
software issues that arose during the recovery efforts.

The outage at the data center knocked out
such technologies as Amazons Elastic Compute Cloud (EC2); interrupted such
high-profile Websites as Netflix, Instagram and Pinterest; and impacted other
companies that run all or part of their businesses on the Amazon compute cloud.
The outage hit June 29 in the afternoon ET, impacting such services as EC2,
Elastic Block Storage (EBS) and Relational Database Service (RDS).

However, according to a July 2 analysis from Amazon, the situation created by
the power outage was made worse by problems with power supplies and software
bugs. While several data centers in AWS' U.S. East-1 region saw power
fluctuations, two data centers were hit with a large voltage spike. One data
center switched to generator power as planned, but there were problems getting
and keeping backup power going in the second one. As a result, for more than an
hour the night of June 29, users could not create new instances in EC2.

Amazon officials said that the outage affected about 7 percent of EC2 instances
and EBS volumes, though they admitted that there was significant impact to many
customers.

Complicating matters were problems with what AWS calls "control planes," which
caused problems for customers trying to respond to the service outage and
manage their resources in the cloud environment. During the outage, there was a
large number of reboot requests from customers, which caused a bottleneck in
the server booting process. In addition, there were problems with the elastic
load balancers (ELBs), which are designed to switch traffic to other unaffected
areas in case of such situations. When the power was restored, "a large number
of ELBs came up in a state which triggered a bug we hadn't seen before,"
according to Amazon. "The bug caused the ELB control plane to attempt to scale
these ELBs to larger ELB instance sizes."

The result was a flood of requests that combined with customers launching new
EC2 instances, all of which conspired to create an ELB control plane backlog,
and "pretty soon, these requests started taking a very long time to complete,"
Amazon said.
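
As a rough illustration of how a backlog like this can build, the following
Python sketch models a first-in, first-out request queue that is served at a
fixed rate. It is not a model of Amazon's actual control plane: the function
name simulate_backlog, the per-tick request counts and the capacity figure are
all hypothetical. It only shows that once a burst of requests exceeds what can
be served per tick, the queue and the wait of the oldest pending request keep
growing even after the burst has passed.

```python
from collections import deque


def simulate_backlog(arrivals, capacity):
    """Simulate a FIFO request queue served at a fixed rate.

    arrivals: number of new requests arriving in each tick (hypothetical data).
    capacity: maximum number of requests completed per tick (hypothetical).
    Returns a list of (pending_requests, oldest_wait_in_ticks), one per tick.
    """
    queue = deque()  # each entry is the tick at which a request arrived
    history = []
    for tick, count in enumerate(arrivals):
        # New requests join the back of the queue.
        queue.extend([tick] * count)
        # Complete as many of the oldest requests as capacity allows.
        for _ in range(min(capacity, len(queue))):
            queue.popleft()
        # Measure how long the oldest still-pending request has waited.
        oldest_wait = tick - queue[0] if queue else 0
        history.append((len(queue), oldest_wait))
    return history


# A quiet period, a two-tick burst, then a return to normal load.
history = simulate_backlog([2, 2, 10, 10, 2, 2], capacity=4)
for tick, (pending, wait) in enumerate(history):
    print(f"tick {tick}: {pending} pending, oldest has waited {wait} tick(s)")
```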

The problems also reached AWS' Relational Database Service (RDS) in the
impacted data center, which couldn't be restored until EBS came back up. In
addition, another software bug meant that there was no automatic failover to an
unaffected area.

AWS officials have promised to fix these problems. The planned steps include
expanding the number of engineering staff on-site to ensure that, if there is
another outage, they can switch power to generators (manually, if necessary)
before the uninterruptible power supplies (UPSes) run out of power; improving
the recovery process; and dealing with blockages that forced assessment and
failover for the control plane to be done manually rather than automatically.

For AWS, one of the pioneers in IaaS, getting this right will be important. The
company suffered through a large service interruption last year and already has
gone through smaller ones in recent months. Meanwhile, other Web companies are
looking to get into the IaaS market, which is rapidly gaining adoption as
businesses see the advantages of not having to invest a lot of money in
creating their own infrastructures. Instead, they can essentially run their
businesses in the cloud, on someone else's servers and storage arrays, and
spend their money elsewhere, including product development and hiring staff.

Google is among the latest Web companies pitching their cloud services. In
June, the company launched its Compute Engine, a cloud service that currently
is available in limited preview. Google executives said during their Google I/O
developers conference that the company has the massive computing capabilities
within its data centers to host applications.

The outage also generated a host of blogs and articles outlining for AWS
customers ways to avoid problems in the future when service outages occur.