Thoughts on Amazon S3 Outage and Cloud Computing

I noticed there was a problem this morning when the avatars on Twitter were not showing up. I know Twitter uses Amazon S3 to host the avatars but thought the issue was with Twitter because they constantly have problems keeping the service running. I gave little thought that the problem might have been Amazon S3 service instead.

The coverage on the Internet spread like wildfire and the details and speculation can be found many places, including TechMeme which has lots of links and the list is growing.

I am not posting to report the Amazon outage (a million before me have already done so) but more to reflect on it and give some of my thoughts about using services like S3 and EC2 versus other ways of accomplishing the same tasks. I don’t have the answers, just some thoughts and opinion. I would like to hear from readers who may have more experience than I with this.

I have been thinking lately about the best way to host my applications, either in part or full. Right now I use a hosting provider, RailsPlayground, to host a couple Ruby on Rails applications. I have a single VPS setup with them and have never had a problem, but I could. The applications hosted there are small and if the VPS was down for a period of time I doubt it would matter much. The pricing is right but there is no redundancy.

I have a couple additional applications I have been working on and the thought of hosting them is where my concerns are. Hosting them could be done one of several ways:

All options have their strengths and weaknesses with regards to costs, scalability and reliability. My biggest concern at this point is balancing reliability with costs. We can easily through money at the problem if money was no object. Even with virtually unlimited funds such as those Amazon has, they are obviously not excluded from downtime issues.

I don’t know the cause of Amazon’s downtime but do they have S3 or any of the other services running in a single data center? Even if the data center is huge it still has a single point of failure. It could also be their disaster recovery procedures were not tested well enough. I think we will hear more and more coming from Amazon about the exact cause of the problem, any disaster recovery issues and a statement to the effect of why this will not happen again.

Relying on the Cloud

Can we rely on what we put in the cloud to be available? How will we be effected by having our applications unavailable while there is some downtime?

I think I can characterize my use of the cloud by application level, which is probably like other users as well:

Level 1 Application – used daily but can live without, won’t really miss if gone. An example of this type of application is Twitter. Lately, this service has been down more than it has been up, but life goes on.

Level 2 Application – used daily but depending on the day, will be a productivity loss if down. An example of this is Google Docs or GMail.

Level 3 Application – used daily and cannot be down without financial loses. An example of this may be my own web applications I make a living on.

So how can I position my applications to give me the most uptime without throwing money at the problem unnecessarily? I would think a single location with a single database is asking for problems. The solution that stands out in my mind for a solution to my particular problem would be to used managed services hosting the applications on VPS slices on a cluster across multiple data centers, in different states or even continents. I know Engine Yard does this but I don’t know if Amazon does this with EC2. If they don’t, they should. If EC2 was fully down today then Amazon needs to rethink their strategy a bit.

Service Level Agreements

Problems happen to the best of companies and we know uptime is not 100% guaranteed.

Looking at part of Amazon’s Service Level Agreement (SLA), they admit they can’t ensure 100% uptime and will pay customers back for anything less than 99.9% uptime.

Service Commitment

AWS will use commercially reasonable efforts to make Amazon S3 available with a Monthly Uptime Percentage (defined below) of at least 99.9% during any monthly billing cycle (the “Service Commitment”). In the event Amazon S3 does not meet the Service Commitment, you will be eligible to receive a Service Credit as described below.

SLAs are a fair statement as to the importance of uptime to the company but no real guarantee. The company gives us their best and will refund anything less, which is more than fair.

Final Thoughts

Amazon’s S3 downtime today just really got me thinking how risky it could be for users to rely on the cloud for computing. The concept is great but even large companies like Amazon with massive infrastructure are still prone to problems.

It makes me think how I run my business and how I can minimize downtime. I certainly can guarantee 100% that the services I host will never be down. I can only do my best and have a recovery plan.

The same is true for those services like Twitter who takes its share of slams for not being up all the time. Heck, it’s a free service and they have a huge number of users, they are growing and we will experience pain. Amazon S3 has experienced its real issues today and they will get better. We can only learn from mistakes and weaknesses, this will be a learning experience. It has opened my eyes.

5 responses to “Thoughts on Amazon S3 Outage and Cloud Computing”

Last summer my 15 year old greeted me – “Dad! My cell phone is broke and I can’t text my friends!” “mmm.. this is serious. What did you use to do before I bought you the phone?” “I wasn’t able to text back then, dah!!”

Today’s uproar re: the Amazon S3 outage takes me back to that funny moment when my daughter finally got my point – that I enabled her to enjoy the world of texting. And, like many AWS bloggers today, she did not appreciate that I gave her this gift.

So, to put a big picture perspective on today’s outage – most of us start ups, if not for AWS, would have burned thru our angel and round A funds to replicate AWS before we would have hit the tipping point and had the luxury of telling our customers that “we are experiencing an outage.”

Looking back on my “old school” days of expensive networks, users running out of storage and the constant flow of cash to admin staff, I must admit to having a soft spot for the AWS team and service. In those days, a two hour outage was considered an opportunity for our users to chat with the cube neighbor or go down to the cafeteria for a donut. Fast forward to today’s demanding customers and an outage of minutes starts Armageddon. Now, imagine if by some miracle, these customers actually pay for the start up’s service.

Today, I welcomed the outage as it reinforced my need for AWS. How would my small team respond to an outage? We don’t have the talented staff nor the passion the AWS team has. We forget that Amazon is in the small group of visionary “start-ups” who helped get the net to where we are today.

@Phil Thank you for the detailed comment. I agree 100% with what you are saying and I hope my post didn’t sound like I do not appreciate Amazon. I think this can happen to anyone whether they are prepared for it or not and Amazon is no different. The services Amazon provides are phenomenal and make start-ups especially lucky for what they offer and what they take off our plates.

Small companies are not staffed to be able to handle such an outage as today brought, I can sympathized with you. I am thankful to be able to use a service like S3 so I don’t have to.

Of course as more and more companies rely on Amazon or a company like Amazon they will need to have ways to ensure up-time, much the way Google does with server failover and such.