Netflix Amazon Outage Shows ‘Any Company Can Fail’

An outage Tuesday at Amazon Web Services, which sells computing and data storage services to businesses including Netflix, should prompt CIOs to look into new options for keeping mission-critical functions that run on public clouds from going dark, says Forrester Research analyst Rachel Dines.

Cloud services like Amazon’s AWS let businesses rent computing and data storage resources on an as-needed basis, helping keep IT costs proportional to the revenue generated, and allowing internal IT staff to stay focused on activities more valuable than making sure servers remain in working order. But outages at public cloud vendors like the one experienced at Amazon, while infrequent, can keep crucial services offline for an indeterminate amount of time, and leave customer CIOs powerless to make sure the problem is addressed as quickly as possible. For some organizations, losing Web services for several hours may not result in serious repercussions, while others would suffer significant revenue losses and hits to their reputation. Forrester’s Dines says even occasional outages can cause long-term damage to a company’s reputation. “It’s all about timing. This was a big deal because it was one of the worst possible times it could happen” as families gathered during Christmas to watch movies, Dines said.

An Amazon spokeswoman said in an emailed statement that the problem was caused by an issue in the company’s load balancing service, which distributes traffic across its network of servers. Such load balancing problems at Amazon have stalled Netflix services several times before. Netflix, which began switching from its own servers to AWS in 2009, has previously said that while it’s “easy and common to blame the cloud for outages because it’s outside of our control,” AWS has helped improve the availability of its own service.

According to Dines, one way to protect against cloud-based malfunctions is to use cloud-to-cloud continuity services that, in the event of an outage, allow traffic to be automatically routed to another cloud center, either within the same vendor’s network or at a second vendor. Dines says this form of disaster recovery is in its infancy because of the complexity of matching formats between two discrete virtual server systems. CIOs are also just beginning to bring into the cloud the kinds of mission-critical applications that require an extensive failover option. “It’s not widespread at this point by any means,” she tells CIO Journal.
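The routing decision Dines describes can be pictured with a short sketch. This is an illustration only, not any vendor's product: the `Site` type, the `pick_site` function and the site names are hypothetical, and the health checks are simulated with fixed values rather than real network probes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Site:
    """A hypothetical cloud center that can receive traffic."""
    name: str
    is_healthy: Callable[[], bool]  # stand-in for a real health probe


def pick_site(sites: Sequence[Site]) -> Site:
    """Return the first healthy site, checking them in priority order."""
    for site in sites:
        if site.is_healthy():
            return site
    raise RuntimeError("no healthy site available")


# Simulated outage: both centers at the first vendor are down,
# so traffic falls through to the second vendor.
primary = Site("vendor-a-east", lambda: False)
same_vendor_backup = Site("vendor-a-west", lambda: False)
second_vendor = Site("vendor-b", lambda: True)

chosen = pick_site([primary, same_vendor_backup, second_vendor])
print(chosen.name)
```

The hard part Dines points to, keeping the applications and data at the second site compatible and current, is outside what a sketch like this covers.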

Dines says that Netflix has likely not yet invested in this form of backup because of the expense involved. And Netflix already “has one of the most robust continuity strategies out there,” she said. For example, Netflix continually deploys “Chaos Monkey,” a tool that randomly disrupts part of its Amazon-managed network to ensure the company can quickly respond to outages. A separate “Chaos Gorilla” is used to try to bring down an entire region of its Amazon network to ensure “services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.” Dines says the takeaway is that if a resilient cloud user like Netflix can fail, “any company can fail.”
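The idea behind these tools can be shown in a toy simulation. The sketch below is not Netflix's code: the fleet, the instance and zone names, and the functions `kill_random_instance`, `kill_zone` and `traffic_shares` are all hypothetical, and "re-balancing" is reduced to splitting traffic evenly over whatever instances survive.

```python
import random
from collections import Counter


def kill_random_instance(fleet, rng):
    """Remove one randomly chosen instance (a Chaos Monkey-style test)."""
    victim = rng.choice(sorted(fleet))
    survivors = {inst: zone for inst, zone in fleet.items() if inst != victim}
    return victim, survivors


def kill_zone(fleet, zone):
    """Remove every instance in one zone (a Chaos Gorilla-style test)."""
    return {inst: z for inst, z in fleet.items() if z != zone}


def traffic_shares(fleet):
    """Split traffic evenly over surviving instances, summed per zone."""
    if not fleet:
        raise RuntimeError("no instances left to serve traffic")
    per_zone = Counter(fleet.values())
    total = sum(per_zone.values())
    return {zone: count / total for zone, count in per_zone.items()}


# Hypothetical fleet: instance id -> availability zone.
fleet = {"i-1": "zone-a", "i-2": "zone-a", "i-3": "zone-b", "i-4": "zone-c"}

victim, survivors = kill_random_instance(fleet, random.Random())
print("terminated", victim, "->", traffic_shares(survivors))

after_zone_loss = kill_zone(fleet, "zone-a")
print("zone-a down ->", traffic_shares(after_zone_loss))
```

In this toy model the service survives any single failure as long as one instance remains. Tuesday's outage involved a shared load balancing service, a dependency that this kind of per-instance or per-zone test does not exercise.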

Still, the timing of this latest outage is particularly irksome for Netflix: on a day many Americans spent relaxing with their families in front of the television, Netflix customers found alternatives to its streaming video service. And for Netflix to be the most publicly affected company is especially embarrassing for Amazon. At its first-ever customer event for AWS, held less than a month ago, the cloud company introduced Netflix as one of its most longstanding, recognizable and cloud-dependent customers, CIO Journal reported.

Netflix CEO Reed Hastings said during the event that his company began using AWS in 2009 after he read Nicholas Carr’s book, “The Big Switch.” Hastings said the book inspired him to see computing infrastructure as a commodity that should be rented as a service, much like companies purchase electricity rather than generating their own power. Given the company’s exponential usage of data storage capacity, he decided to start using AWS as an alternative to building his own data centers. “We had to take some risks,” he said. Another reason Netflix shifted to using infrastructure owned and operated by Amazon, a company staffer said, was that its own data centers suffered an outage in 2008 that left some customers without service for three days.

A spokesman for Netflix says the company is investigating the cause of Tuesday’s outage and will do what it can “to prevent a reoccurrence,” but declined to provide more detail. Given that Amazon’s issues stretched across the Americas, it isn’t clear that having backups across different regions would have helped. And maintaining alternative sites that duplicate data from the principal site in real time can be prohibitively expensive. Short of reexamining its use of outsourced cloud computing, Netflix may not have many attractive alternatives; but CIOs at other organizations may well want to evaluate how much it would be worth spending to mitigate an outage of this magnitude. “It’s a tradeoff,” says Dines. “You are giving up some level of control when things go wrong, but the cloud allows you to put your resources towards something that you hope will move the business forward.”

Comments (5 of 11)

I found this article while basically looking for "What the hell is wrong with AWS". I did experience the outage you refer to, and I also regularly get really slow load times on Netflix movies. But it's not just Netflix. It's anything I download from an s3.amazonaws URL. I've been a subscriber to a tutorial site that uses AWS for over a year now and can only download with DTA (Download Them All) because it makes multiple connections and is able to start up again when the connection fails. And it always fails. Same thing right now trying to download a Linux OS distribution some jerk hosted on AWS. LOL they aren't really a jerk but I really wish they had chosen another service to share the OS with. Luckily I'm able to use DTA for this file too. But it's a real pain because the task I'm doing right now requires me to download it via command line. So now I have to DTA it down and then put it someplace where I can get an uninterrupted download from.

I don't see why so many people still keep using AWS

9:52 pm December 30, 2012

Snuffles wrote:

This is the problem of the cloud. I'm always eager to see businesses that go to the cloud fail. Make no mistake folks, the cloud will do you harm. Eventually it will all be used against you to make you license everything from the cloud, track your usage, build patterns on your history, possibly be used against you in court cases as evidence of your character, etc. Resist using the cloud as much as you can and hope they all fail that do.

6:47 pm December 28, 2012

Emma - Chicago wrote:

I canceled Netflix and am using the TVDevo website and Amazon from now on.

4:58 am December 28, 2012

Webhosting.net, inc wrote:

Some companies are already offering services to help mitigate such an outage. There is risk in having everything in 'large' public clouds. As storage gets cheaper it's easier for smaller hosts to offer disaster recovery sites for the big boys.

4:26 am December 28, 2012

Rico wrote:

Blame Amazon? WTF??? The customers who put their eggs in one basket (datacenter) are the ones to blame. The single best solution is to use multiple cloud providers.
