People Using Amazon Cloud: Get Some Cheap Insurance At Least

I’m reading through Twitter streams, Amazon Forums, and other news sources trying to get a sense of how users are responding and what their problems are. It’s pretty appalling out there. B2B companies admitting they have no recent backups and just have to wait for it to come back online. A company that claims patients’ lives are at stake as they do cardiac monitoring based in the Amazon Cloud and are desperately seeking assistance. The list goes on.

There’s some basic insurance any company using the Amazon Cloud needs to take out first chance they get. It’s not hard, it’s not expensive, it’s not push a button and get hot failover to multiple Clouds, and it won’t fix your problems if you’re caught in the current outage. But it will at least give you a little more maneuvering room. Many of the accounts I’m reading boil down to a lack of options other than waiting because they have no accessible backup data. In other words, they’d love to bring up their sites again on another Amazon Region, but they can’t because they’re missing access to a reasonably current data backup, or the Amazon Machine Images are all in the affected region, or issues along those lines.

Companies need the Cloud equivalent of offsite backup. At a minimum, you need to be sure you can get access to a backup of your infrastructure–all the AMIs and Data needed to restart. Storage is cheap. Heck, if you’re totally paranoid, turn the tables and back up the Cloud to your own datacenter, which consists of just the backup infrastructure. At least that way you’ve always got the data. Yes, there will be latency issues and that data will not be up to the minute. But look at all that’s happened. Suppose you could’ve spun up in another region having lost 2 hours of data. Not good, not good at all. But is it really worse than waiting over 24 hours, or would you be feeling blessed about now if you could’ve done it 2 hours into the emergency? These are the kind of trade-offs to be thinking about for disaster recovery. It’s chewing gum and baling wire until you get an architecture that’s more resilient, but it sure beats not having any choices and waiting.
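To put that trade-off side by side, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `RecoveryOption` and `compare_options` names are made up, the figures are hypothetical rather than measured, and it optimistically assumes that waiting for the provider loses no data.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    hours_down: float       # how long customers see the site down
    hours_data_lost: float  # how stale the data is once you are back


def compare_options(outage_hours, backup_age_hours, restore_hours):
    """Return the two choices: wait out the outage, or restore elsewhere.

    Assumption: waiting eventually returns all data intact, so it costs
    downtime only. Restoring costs the restore time plus whatever was
    written after the last backup.
    """
    wait = RecoveryOption("wait for the provider", outage_hours, 0.0)
    restore = RecoveryOption(
        "restore in another region", restore_hours, backup_age_hours
    )
    return wait, restore


if __name__ == "__main__":
    # Hypothetical numbers: a 24 hour outage, a backup that is 2 hours
    # old, and 2 hours of work to bring the site up somewhere else.
    for option in compare_options(24, 2, 2):
        print(
            f"{option.name}: down {option.hours_down}h, "
            f"lost {option.hours_data_lost}h of data"
        )
```

Neither row is painless. The point is only that the second row does not exist unless the backup is somewhere you can reach.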

Another thing: make sure you test your backups. Do they restore? Can you go through the exercise of spinning up in another region to see that it works? Don’t just test once and forget about it. Pick an interval and retest. Make it routine so you know it works.
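One way to make the retest routine is to record when a restore last succeeded and let a scheduled job nag you once that date gets too old. The sketch below assumes a 30 day interval and a hypothetical `restore_test_due` helper; pick whatever interval suits your business.

```python
from datetime import datetime, timedelta


def restore_test_due(last_tested, now, interval=timedelta(days=30)):
    """Return True when the last successful restore test is too old.

    last_tested is None when no restore has ever been tested, which
    counts as overdue.
    """
    if last_tested is None:
        return True
    return now - last_tested >= interval


if __name__ == "__main__":
    # Hypothetical date of the last successful restore drill.
    last_ok = datetime(2011, 3, 1)
    if restore_test_due(last_ok, datetime.now()):
        print("Time to spin up in another region and prove the backup restores.")
```

The check itself is trivial. The value is in running it on a schedule so that a stale test shows up before the next outage does.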

Staging all the data to other locations is not that expensive compared to continuously running dual failover infrastructure. That’s one of the beauties of elasticity.

There’s a lot of grumbling about how hard it is to fail over to other regions and how expensive. Nothing is harder than explaining to your customers why your site is down. But at least get some cheap insurance in place so you have options the next time this happens. And there will be a next time, no matter whether it is Amazon, some other Cloud provider, or your own datacenter. There is always a next time.

While you’re at it, consider some other cheap insurance:

– Do you have a way to communicate with your customers when your site is down? An ops blog that you’re sure is hosted in a different cloud is cheap and cheerful.

– Can you at least get your web site home page showing? Think about how to get DNS access and a place to host that don’t rely 100% on one Cloud provider.

– Is there something about your app that would make partial access in an outage valuable? For example, on a customer service app, being able to log trouble tickets as email during an outage or scheduled downtime would be extremely helpful. Mail is cheap and easy to offer as alternate infrastructure, and it is also easy to imagine piping the email messages through a converter that would file them as tickets when the site came back up. It’s not hard to imagine being able to queue many kinds of transaction this way in an emergency. What are the key limited-functionality areas your users will want to have access to in an emergency?

– For some apps, it is easier to provide high availability for reading than for writing. Can you arrange that in an emergency, reading is still possible, just not writing or creating new objects? Customers are a lot more tractable if they know they still have access to their data, but just can’t create new data for a while. For example, a bookmarking site that lets me access my bookmarks but not create new ones during an outage is much less threatening than one that just brings up its Fail Whale equivalent on me.
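The last two ideas, staying readable during an outage and queuing work for later, can be combined in a few lines. This is a minimal sketch, not a production design: `BookmarkStore` and its methods are hypothetical names, and the in-memory list stands in for whatever durable queue (email, a file, another provider) you would really use.

```python
class BookmarkStore:
    """Toy bookmark store with a read-only emergency mode."""

    def __init__(self):
        self._bookmarks = {}    # user -> list of URLs
        self._pending = []      # writes deferred during an outage
        self.read_only = False  # flip to True when the primary is down

    def get(self, user):
        # Reads keep working in read-only mode.
        return list(self._bookmarks.get(user, []))

    def add(self, user, url):
        # In an emergency, queue the write instead of failing outright.
        if self.read_only:
            self._pending.append((user, url))
            return False
        self._bookmarks.setdefault(user, []).append(url)
        return True

    def recover(self):
        # Leave read-only mode and replay everything that was queued.
        self.read_only = False
        pending, self._pending = self._pending, []
        for user, url in pending:
            self.add(user, url)
        return len(pending)


if __name__ == "__main__":
    store = BookmarkStore()
    store.add("ann", "http://example.com/a")
    store.read_only = True                    # outage begins
    store.add("ann", "http://example.com/b")  # queued, not lost
    print(store.get("ann"))                   # existing bookmarks still readable
    print(store.recover(), "queued writes replayed")
```

The user-facing message matters as much as the code: "your bookmarks are safe, and new ones will appear shortly" is a very different experience from an error page.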

Welcome to the world of Disaster Recovery. Disasters have a User Experience too. Have you planned your customers’ Disaster UX yet?

12 Responses to “People Using Amazon Cloud: Get Some Cheap Insurance At Least”

Learning from the AWS incident, we need to develop processes to easily have clouds on different providers and mitigate the risk that one of them fails. The CloudSigma and Strategic Blue press release is a significant announcement, particularly in the context of the AWS failure: http://bit.ly/g09yRl

“Now customers of CloudSigma can choose to interface directly or through Strategic Blue for their cloud billing arrangements, depending on their preference. The partnership will complement the pre-pay billing model available directly from CloudSigma with invoice payment terms available through Strategic Blue.” Strategic Blue’s “solution transforms multi-cloud infrastructure deployment from a potentially complicated management issue to a simple billing relationship with one party, Strategic Blue.”

The next trend to watch is how cloud providers will develop “easier ways” to migrate to other clouds. We need a magic list of cloud providers that customers may choose from for multiple hosting, as a defense against cloud failure. I am paraphrasing the Gartner Magic Quadrant for cloud providers.

You write: For some apps, it is easier to provide high availability for reading than for writing. There are 5 lessons to be learned:

Lesson 1: Both Cloud and Dedicated Computing Have Single Points of Failure
Lesson 2: Size is No Protection from Outages without Redundancy
Lesson 3: All Data Centers Are Not Equal
Lesson 4: The Price-Performance-Reliability Metric
Lesson 5: Achieving a highly robust set-up is cheaper and easier in the Cloud

miha, in this case, even just architecting for failure of a region would’ve made all the difference between businesses like Netflix and Twilio that stayed up, and those that didn’t. This particular post of mine is advocating a level of basic prudence to at least have control of all your data no matter what. Folks that haven’t taken that step are a long way from being ready to start worrying about multi-cloud architectures.

I wonder whether the credit Netflix gets for not being affected by the AWS failure is because they were smart (which they are), or because they were lucky, because they used the Western California AWS locations. If you saw the movie Match Point, by Woody Allen: if one survives and is not caught, one automatically becomes a genius.

In that case – I know the people at Netflix, like Adrian Cockcroft – Netflix made an extraordinary point for all other companies porting to Amazon. But the quality of the Netflix team is hard to match by other mainstream companies porting to AWS. Yet people like Zencoder and Quora have brilliant engineering leadership, just like Netflix, and went down. Actually the Zencoder blog is a must-read: http://bit.ly/fiehlU. How come they went down and Netflix didn’t? Perhaps because it is very expensive to have three locations of the size of Netflix, instead of one.

Perhaps you can have a look at affordable IaaS providers who are both easy to use and lower cost. See this video from cloudsigma.com: http://bit.ly/g3UuTN. CloudSigma will launch in the US at four locations in June 2011.

schlafly said

You’re right, it is foolish to be so dependent on the Amazon Cloud. But I am guessing that to a lot of businesses, the main appeal of the Cloud is that they do not have to worry about backups and database outages and all those other problems. I think that this is a disaster for Amazon.

Roger, any business that outsources and assumes they gave up responsibility along with the work is going to get what they get. Another Cloud provider, which will eventually have an outage too, and more disappointments. Businesses are responsible to their customers for delivery, regardless of who they sublet it to. The smart businesses understand that reality and will deal with it.

Let’s hope most of the businesses caught in this mishap just weren’t quite pessimistic enough and will take care of their architectural shortcomings.

Unfortunately, I think the Tech World has also gotten pretty complacent about doing things on the cheap, Consumer Internet style. There are drawbacks to that approach, as we’re seeing.