Amazon's EC2 contract promises its infrastructure cloud will provide 99.95 per cent "uptime" over the course of a year. But that doesn't mean the company will dish out credits in the wake of the outage that affected some users for as many as four days, if not more.
Though the EC2 service level agreement says users will be …

COMMENTS

AWS SLA basically says..

20-20 foresight

A colleague of mine and I, both interested in developing programming frameworks for cloud computing, were discussing just such a situation a year ago, as we had developed a high-availability, fault-resilient framework for distributed manufacturing systems some 15-20 years ago. We spoke the other day, and were each nodding our heads that these systems are harder to implement than many realize. We were successful, and our software now runs 80-90% of the world's semiconductor fabs at a 6-sigma plus reliability level (less than 5 minutes downtime per year). In any case, we both saw this sort of problem on the cloud horizon. Designing systems infrastructure for high availability is very difficult, and it has to be DESIGNED in, long before real code is actually written. Given it took us about 3 years to design and deliver a first release of our MES software, with the help of the senior engineers of a consortium of major semiconductor manufacturers, I can't imagine that Amazon invested anything near that level of resource in the EC2 cloud infrastructure. I may be wrong, but results speak for themselves...
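For a rough sense of scale, here is a back-of-the-envelope sketch that converts between an annual uptime percentage and minutes of downtime. It assumes a plain 365-day year, the function names are made up for illustration, and it does not model how the SLA actually measures or excludes downtime.

```python
# Back-of-the-envelope conversion between annual uptime and downtime.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def uptime_percent(downtime_minutes: float) -> float:
    """Annual uptime percentage implied by a given amount of downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    # 99.95% uptime leaves roughly 262.8 minutes (about 4.4 hours) per year.
    print(round(allowed_downtime_minutes(99.95), 1))
    # 5 minutes of downtime per year is better than 99.999% uptime.
    print(round(uptime_percent(5), 5))
    # A four-day outage alone puts the year at roughly 98.9% uptime.
    print(round(uptime_percent(4 * 24 * 60), 2))
```

By this arithmetic, a four-day outage uses up the yearly downtime allowance of a 99.95 per cent target many times over, while 5 minutes a year sits well above "five nines".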

It sounds largely like "If you didn't buy redundant service, it won't be redundant".

That's what "spreading your application across multiple availability zones" means, isn't it? I note this terminology from the EC2 SLA you linked:

>"“Region Unavailable” and “Region Unavailability” means that more than one Availability Zone in which you are running an instance, within the same Region, is “Unavailable” to you."

IOW, as long as you've requested more than one AZ [in a region], you get a refund if two go down. As far as I can see, that means Amazon is even offering refunds if your redundancy fails but your actual service stays up and running. If you spread your app across three AZs all in one region, and two go down, but your app is still running on the third, you're due a refund just for the lack of backup, no?
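To make that reading concrete, here is a minimal sketch of the quoted clause taken literally. It is only my interpretation of the wording, not how Amazon actually evaluates claims; the function name and the zone names are hypothetical.

```python
# Literal reading of the quoted "Region Unavailable" clause:
# more than one Availability Zone in which you are running an instance,
# within the same Region, is unavailable to you.
# Assumption: all zones passed in belong to the same region.
from typing import Iterable


def region_unavailable(running_zones: Iterable[str],
                       unavailable_zones: Iterable[str]) -> bool:
    """True if more than one zone you run instances in is unavailable."""
    affected = set(running_zones) & set(unavailable_zones)
    return len(affected) > 1


if __name__ == "__main__":
    # App spread across three zones, two go down, third still serving:
    print(region_unavailable(["zone-a", "zone-b", "zone-c"],
                             ["zone-a", "zone-b"]))
    # App in a single zone, and that zone goes down:
    print(region_unavailable(["zone-a"], ["zone-a"]))
```

Under this reading the first case counts as "Region Unavailable" even though the app is still up, while the single-zone customer whose only zone went down does not qualify.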

This does mean that redundancy is limited to *within* each region, and cannot operate between regions. But it's not like they said otherwise, is it? The meaning of the words seems pretty plain to me, and I'm not even a lawyer.

( reads second half of story )

Oh wait. That was what you were saying all along?

>"Engates says that Amazon's cloud service and its service-level agreement is set up in such as way that users must ensure redundancy across zones – if not across entire regions"

Well yeah, but it's not just him who says that, it's their SLA that says so.

Shouldn't the story just be "Engates says that Amazon's SLA says whatever it is that Amazon's SLA says"?

Partly cloudy with a chance of outages

@William Boyle re: "I can't imagine that Amazon invested anything near that level of resource in the EC2 cloud infrastructure." Well, that's the thing: people see these services as being so great because the price is so low. You get what you pay for; economies of scale can drive prices down some, but so can cutting corners.

(Obviously, I don't have actual quotes for this; the ones below are for effect...)

So, does the storage use RAID, does it have backups of any type? "Don't worry about it, it's a cloud." Is there any redundancy in the networking within the building? "It's a cloud, you don't need to know that." What about external connectivity? "Well, this cloud connects to the Internet." What about power? "It's a cloud, leave that to us."

I don't really think Amazon is necessarily skimping on any of this stuff. But that is the thing: if you go with any conventional hosting provider, you will get a real SLA, and you will get as much hard info as you want about what type of storage you have (plain disks, RAID, maybe even some off-site backups), what kind of power setup (power conditioning, battery backup, generator, etc.), whether there's redundancy on the internal networking, whether there are redundant outside connections, and whether your provider offers the possibility of distributing between multiple physical data centers. You can find out what kind of physical machine you are on if you pay for a dedicated one, and in general how many virtual servers are stuck onto a physical machine if you are on a shared server. Of course, you can skimp and get some provider where the answer is "best effort SLA, no backups, no power backup, no network redundancy, and as many virtual servers as they can cram on each physical server", but then at least you know that, instead of just being told "Well, it's cloud, don't worry about it."