Latest outage raises more questions about Amazon cloud

Massive thunderstorms notwithstanding, the fact that Amazon’s (s amzn) U.S. East data center went down again Friday night while other cloud services hosted in the same area kept running raises new questions about whether Amazon is suffering architectural glitches that go beyond acts of God. While most Amazon services were back up Saturday morning, the company was still working on provisioning the backlog for its ELB load balancers as of 5:31 p.m. Eastern time, according to the AWS dashboard.

This outage, the second this month, took down Netflix (s nflx), Instagram (s fb), Pinterest, and Heroku (s crm), as Om previously reported. The storm was undoubtedly huge, leaving 1.3 million in the Washington, D.C. area without power as of Saturday afternoon, but Joyent, an Amazon rival, also hosts cloud services from an Ashburn, Va., data center and experienced no outage, something its marketing people were quick to point out.

The implication is that Amazon, with all its talk about redundancy and availability, shouldn’t be having these issues if others are not.

Steve Zivanic, VP of marketing for Nirvanix, another rival cloud provider, said customers should simply stop defaulting to Amazon’s cloud. “It’s becoming rather clear that the answer for [Amazon’s] customers is not to try to master the AWS cloud and learn how to leverage multiple availability zones in an attempt to avoid the next outage but rather to look into a multi-vendor cloud strategy to ensure continuous business operations,” Zivanic said via email. “You can spend days, months and years trying to master AWS or you could simply do what large-scale IT organizations have been doing for decades — rely on more than one vendor.”

Another #AWS outage -> Another #Heroku outage. You'd think #Heroku would be scrambling to architect around that as quickly as possible.

The fact that Amazon, like any other data center-dependent business, is not bulletproof also raises questions about why its customers don’t pursue a multi-cloud strategy or, if they’re going to rely solely on Amazon, why they put so much of their workload in one geography, a practice Amazon itself advises against. Of course, it isn’t good practice for any vendor to blame snafus on its customers.

Presumably the tech folks at Instagram, Heroku, et al. know better. Earlier this month, I asked Byron Sebastian, the Salesforce.com VP of platforms who runs the Heroku business, if Heroku was actively seeking other cloud platform partners. He said the company is always evaluating its options.

Twitter was awash in comments. Many wondered why Amazon’s data center did not cut over to generator power, while others, like Gartner (s it) analyst Lydia Leong, preferred to wait to see what part Amazon’s data center operator played in this mess.

@cloudpundit Breakdown: grid power failed twice in June and AWS crashed twice same way. My dc in Ohio lost grid Fri, stays up on gen power.

Reached for comment Saturday afternoon, an Amazon spokeswoman reiterated that the storm caused Amazon to lose primary and backup generator power to an availability zone in its east region overnight and that service had been mostly restored. She said the company would share more details in the coming days.

21st-century infrastructure, databases included, needs to cope with these types of disasters. Building infrastructure & apps that run in one and only one datacenter will not stand the test of time. We need every aspect of a scalable cloud infrastructure to be able to run in a distributed fashion so that apps like Instagram will be up and running no matter whether outages impact a specific datacenter location.

There will always be outages, among big or smaller providers. Amazon is the biggest, so the most likely to be on everyone’s radar. Users should carefully choose their cloud providers and not just pick Amazon. It’s not that another cloud provider will not have issues as well… but a healthy market distribution among providers makes any specific outage less damaging. And for those that really care about availability, using different cloud providers at the same time is the answer. It’s just a matter of how much money/effort you are willing to put into your platforms versus the risk you’re willing to accept. I use lunacloud.com in Europe.

â€Itâ€™s becoming rather clear that the answer for [Amazon’s] customers is not to try to master the AWS cloud and learn how to leverage multiple availability zones in an attempt to avoid the next outage but rather to look into a multi-vendor cloud strategy to ensure continuous business operations,â€ Steve Zivanic, VP of marketing for Nirvanix

Steve Zivanic, VP of marketing for Nirvanix, has it right in my opinion. Multiple Cloud Providers are better than Multiple Availability Zones from a single provider. Joyent teams with Enstratus to offer this up.

Reblogged this on Virtualized Geek and commented:
I agree that a multi-vendor cloud approach is ultimately the approach needed when considering putting revenue-generating and enterprise services on the cloud. There is, however, still the problem of managing a multi-vendor cloud provider solution. There are both startups and open source initiatives gunning for your business to become the proxy to multi-vendor clouds. This includes Eucalyptus, OpenStack and CloudStack from a software perspective. From a vendor perspective you have a few companies you guys have already profiled trying to become the single point of sale and management for multiple cloud vendors.
This is a difficult nut to crack no matter the approach you take. You could choose to build a management platform yourself that distributes the load across multiple cloud vendors or you could go with one of the two options presented above. Either way it ain’t easy. Workloads are not portable across multiple cloud vendors. You have to worry about how you replicate data between vendors (see GigaOM’s recent article, “The enterprise needs a better network to the cloud”). You also need to worry about the actual difference in compute performance between multiple cloud vendors. The way that Amazon provisions and categorizes performance is completely different than the way Rackspace does.
Yeah, multi-vendor clouds are the way to go. Let me know when there’s a commercial option available and I will start a business reselling it or building services for customers around it :)

Amazon isn’t the problem and let’s educate ourselves a bit before blaming the wrong side! Bad architectures will always find an excuse to fail and their architects will always blame somebody else. Let me dumb my point down a little for the layman: Was Amazon.com down? No. Is Amazon.com using the same AWS? Yes. Got my point now? :)

No. I don’t believe Amazon is using the same AWS. Do you honestly believe Amazon isn’t giving its business higher quality of service on their infrastructure? I doubt that they even share bandwidth with AWS. I’m also sure Amazon is using SAN-based replication and other enterprise infrastructure class solutions to help enable their high levels of service for Amazon.com that they don’t offer to AWS customers.

Why wouldn’t they? AWS is resilient enough – more than any other cloud provider today. Do you know for sure or just believe that Amazon.com is not using AWS itself? Those two are different things! Sharing totally makes sense knowing the genius of Vogels and Bezos – especially from an economic point of view. Also, multi-cloud resilience comes at the cost of using only the bare basics of AWS: it offers the richest set of cloud tools, so you’ll have to dumb down to the lowest common denominator, i.e. OpenStack, and you will be sacrificing and having to rebuild a lot! AWS is a lot more geographically redundant than any other cloud, so I’m really not sure what benefit another cloud provider would bring to your architecture. Unfortunately, AWS is still lacking what Akamai has to offer and I’ve secretly hoped that Amazon would acquire them given the historically low share price. Well, Amazon is constantly evolving their CloudFront solution, but it still has a long way to go to come closer to Akamai, which offers a lot more than multi-origin CDN. That’s why all the serious businesses I’ve worked with use Akamai.

AWS is not capable of offering a differentiated service to Amazon.com. That’s just not how the cloud computing platform (separate from the other AWS services) operates. It is certainly giving it the best levels of support and whatnot, for which AWS charges a minimum of $15K/mo for the hoi polloi, but the hardware/software/service delivery is the exact same thing everyone else uses. If you don’t appreciate this, you don’t really understand what Amazon has built.

The reality is this – in order to operate in the cloud, you need to be able to understand that failure is not only an option – it’s a promise. This is the reason why Microsoft chose ClearDB to run MySQL on Windows Azure. It’s also the reason why Heroku and AppFog customers who use ClearDB didn’t notice any database level disruption in the last two outages. Check out our site for more info: http://bit.ly/Lc6XKp

Abhijeet Kumar: this particular outage affected the Elastic Load Balancer service. Our company was prevented from removing instances in the affected availability zone from the load balancer, so it kept sending users to downed servers. Architecting for an AZ failure did not save our site. This is probably what happened to others.

Also, this came on the heels of events Friday AM and just a week or so ago with EBS. It has been a very bad month for Amazon. I can’t speak for all of Amazon’s customers, but it certainly has me exploring a multi-vendor or non-AWS strategy.

Correction: It is available in all regions now at this time. What I really meant was during the issue, AWS was having an outage in one availability zone in one region, while all other regions were perfectly fine.

I am in no position to explain why there was a power issue that affected only one availability zone in Amazon’s Virginia location; that is something I guess we will hear more about soon from Amazon. Having said that, AWS is available in other regions, which were still working fine. A lot of these posts by people who do not know the facts (i.e. they don’t work for Amazon, Netflix, Pinterest, Heroku, Instagram etc.) seem to be self-serving, malicious rants without any concrete technical basis. When you talk about outages, Google had an outage in April in which several Gmail accounts were affected, Apple had an outage earlier this week that affected iCloud, and Twitter had an outage earlier this week too. Why do people go so religiously batshit when they hear of AWS having an outage in one availability zone in one region?

Amazon gets picked on because they’re relevant. From an enterprise IaaS perspective, who really cares about any of the other providers you mentioned? AWS is the premier provider for IaaS. They have mind share not just in tech companies like Netflix, Pinterest and Instagram but in the non-tech enterprise as well.

Try going into a company as a consultant and suggesting cloud vendors outside of AWS, and chances are they will ask you about AWS. The con for Amazon is that if events like this continue to happen, potential enterprise customers will get spooked off. Right or wrong, AWS is developing a perception of not being reliable. If Netflix can’t design a redundant solution on AWS, how will a non-tech company?

I agree: given that AWS is an IaaS cloud provider, and especially given that many smaller enterprises are only beginning to move to cloud-based hosting (most tech start-ups use cloud-based hosting like AWS), any AWS outage will be interesting. However, Google, Twitter etc. are supposed to be pretty highly regarded when it comes to the availability of services offered on the web. I was reading that Netflix, one of the biggies on AWS, usually didn’t get affected during most previous AWS outages. I also read they got affected only for an hour or so for this outage, which is still something. But even Google, running in its own data centers, isn’t doing way better, is it?

I have read similar posts from Netflix. The fact that AWS was available in other Availability Zones in the same region, and in all other regions, is something that should be taken into account before going batshit on Amazon for an outage like this.

Redundancy is key; without it, any business is impacted big time. So the business should ensure an HA solution is in place and have its data residing in two Availability Zones spread across different geographies.
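
For readers who want to see what the failover idea discussed in this thread might look like in practice, here is a minimal Python sketch. It assumes a hypothetical ordered list of endpoints (one per availability zone or provider) and a caller-supplied health check; the hostnames and the `is_healthy` callable are placeholders for illustration, not part of any AWS or other vendor API.

```python
from typing import Callable, Iterable, Optional


def pick_healthy(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose health check passes, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as a failed one.
            continue
    return None


# Hypothetical endpoints, listed in order of preference.
ENDPOINTS = [
    "us-east-1a.example.com",
    "us-west-2a.example.com",
    "other-provider.example.net",
]


def fake_check(endpoint: str) -> bool:
    # Stand-in for a real probe: pretend the first zone is down.
    return not endpoint.startswith("us-east-1a")


chosen = pick_healthy(ENDPOINTS, fake_check)
```

In this toy run `chosen` ends up as the second endpoint, because the stand-in check reports the first one as down. A real setup would replace `fake_check` with an actual probe and, as one commenter notes above, would still depend on the routing layer itself staying healthy.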