Rather than wringing our hands and shaking our heads about “That Darned Cloud, I knew this would happen”, let’s talk about it a bit, because there are some things that can and should be done. Enterprises wanting to adopt the Cloud will want to have thought through these issues and not just avoided them by avoiding the Cloud. In the end, they’re issues every IT group faces with their own infrastructure and there are strategies that can be used to minimize the damage.

I remember a conversation with a customer when I was a young Vice President of R&D at Borland, then a half-billion-dollar-a-year software company (I miss it). This particular customer was waxing eloquent about our Quattro Pro spreadsheet, but they had just one problem they wanted us to solve: they wanted Quattro Pro not to lose any data if there was a power outage while the user was editing.

I was flabbergasted. “It’s a darned computer, it dies when you shut off the power!” I sputtered, in only slightly more professional terms. Of course I was wrong and hadn’t really thought the problem through. With suitable checkpoints and logging, this is actually a fairly straightforward problem to solve, and most of the software I use today deals with it just fine, thank you very much.
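The checkpoint-and-log idea can be sketched in a few lines (this is my own illustration, not how Quattro Pro actually did it): every edit is durably appended to a journal before it is applied, so after a power loss the journal can simply be replayed.

```python
import os
import tempfile

# Use a throwaway journal path for the demo.
path = os.path.join(tempfile.gettempdir(), "sheet.journal")
if os.path.exists(path):
    os.remove(path)

class JournaledDoc:
    """Toy spreadsheet that survives abrupt power loss by journaling edits."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.cells = {}
        self._replay()  # recover edits from any prior, interrupted session

    def _replay(self):
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path) as f:
            for line in f:
                cell, _, value = line.rstrip("\n").partition("=")
                self.cells[cell] = value

    def set(self, cell, value):
        # Durably log the edit *before* applying it in memory; fsync forces
        # the bytes to disk so a power cut cannot lose an acknowledged edit.
        with open(self.journal_path, "a") as f:
            f.write(f"{cell}={value}\n")
            f.flush()
            os.fsync(f.fileno())
        self.cells[cell] = value

doc = JournaledDoc(path)
doc.set("A1", "100")
doc.set("B1", "200")

# Simulate the power outage: a brand-new process re-reads the journal.
recovered = JournaledDoc(path)
```

The fsync is the key design choice: the edit reaches the disk before the user sees it applied, so the journal is never behind the in-memory state by more than the edit in flight.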

So it is with the Cloud. Your first reaction may be, “We’re a Cloud Service, of course we go down if our Cloud goes down!” But, it isn’t that black and white. I like John Dodge’s thought that the Cloud should be treated just like rubber, sugar, and steel. When Goodyear first started buying rubber from others, when Ford bought steel, and when Hershey’s bought sugar, do you think they didn’t take steps to ensure their suppliers wouldn’t control them? Or take Apple. Reports are that Japan’s recent tragedies aren’t impacting them much at all and that they’re absolutely sticking with their Japanese suppliers. This has to come down to Apple and their suppliers having had a plan in place that was robust enough to weather even a disaster of these proportions.

What can be done?

First, this particular Amazon outage is apparently regional, limited to the Virginia data center. A look at Amazon’s status as I write this shows the West Coast infrastructure is doing okay.

Most SaaS companies that own their data centers have to get huge before they can afford more than one physical location. But if you’re using a Cloud that offers multiple physical locations, you can have the extra security of multiple physical data centers very cheaply. The trick is that you have to make use of it, and that’s just software. A service like Heroku could have decided to spread the applications it hosts evenly over the two regions, or gone even further afield to offshore regions.
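A hypothetical sketch of what “it’s just software” means here, assuming two Amazon-style regions: hash each hosted app to a primary region, use the other as its standby, and flip traffic when a region goes dark. All names here are made up for illustration.

```python
import hashlib

REGIONS = ["us-east-1", "us-west-1"]  # Amazon's two US regions at the time

def placement(app_name):
    """Deterministically give each app a primary region and a standby."""
    i = int(hashlib.sha1(app_name.encode()).hexdigest(), 16) % len(REGIONS)
    return {"primary": REGIONS[i], "standby": REGIONS[(i + 1) % len(REGIONS)]}

def serve_from(app_name, down_regions=()):
    """Route to the primary, failing over to the standby if it's down."""
    p = placement(app_name)
    return p["standby"] if p["primary"] in down_regions else p["primary"]
```

With a scheme like this, roughly half the apps ride out a regional outage untouched, and the other half only need their traffic flipped to a standby that was provisioned in advance.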

This is one of the dark sides of multitenancy, and an unnecessary one at that. Architects should be designing not for one single super apartment housing all tenants, but for a relatively small number of apartments, plus the operational flexibility to allocate tenants to whatever apartments they like via a dashboard, and then change their minds and seamlessly migrate them to new accommodations as needed. This is a powerful tool that will ultimately make it easier to scale the software too, assuming usage is decomposable enough to minimize communication between the apartments. Some apps (Twitter!) are not so easily decomposed.
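The apartment model is mostly bookkeeping. Here is a toy sketch (names are my own) of metadata-driven allocation: tenants land in the least-loaded apartment, and migration is just a metadata update plus a data copy.

```python
class ApartmentAllocator:
    """Metadata-driven mapping of tenants to a small number of 'apartments'."""

    def __init__(self, apartments):
        self.apartments = {a: set() for a in apartments}
        self.tenant_home = {}

    def allocate(self, tenant):
        # New tenants land in the least-loaded apartment.
        home = min(self.apartments, key=lambda a: len(self.apartments[a]))
        self.apartments[home].add(tenant)
        self.tenant_home[tenant] = home
        return home

    def migrate(self, tenant, new_home):
        # Placement is just metadata, so moving a tenant is an update here
        # plus a data copy behind the scenes -- not a re-architecture.
        self.apartments[self.tenant_home[tenant]].discard(tenant)
        self.apartments[new_home].add(tenant)
        self.tenant_home[tenant] = new_home
```

Once placement lives in metadata like this, "evacuate every tenant out of the Virginia apartment" is a loop over `tenant_home`, not an emergency rewrite.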

This, then, is a pretty basic question to ask of your infrastructure provider: “How easy do you make it for me to access multiple physical data centers, with attendant failover and backups?” In this case, Amazon offers the capability, but Heroku took it away from those building on its stack. I suspect they’ll address this issue pretty shortly, but it would’ve been a good question to explore earlier, no? Meanwhile, what about the other vendors you may be using that build on top of Amazon? Do they make it easy to spread things around and not get taken out if one Amazon region goes down? If not, why not?

Here’s the answer you’d like to hear:

We take full advantage of Amazon’s multiple regions. If one region goes down, we’ll make it easy for your app to be up and running in another within an SLA of X.

Note that they may charge you extra for that service, and it may therefore be optional, but at least you’ve made an informed choice. Certainly all the necessary underpinnings are available from Amazon to support it.

There are some operational niceties I won’t get into too deeply here, but I do want to mention in passing that it is also possible to offer a continuum of answers to the question above, having to do with the SLA. For example, at my last startup, our Customer Service app ran in the Cloud, and we decided that if the region we were in totally failed, we wanted to bring the service back in another region within 20 minutes, with no more than 5 minutes of data loss. That pretty much dictated how we needed to use S3 (which is slow, but automatically ships your data to multiple physical data centers), EBS, and EC2 to deliver those SLAs. Smart users and PaaS vendors will look into packaging several options, because you should be backed up to S3 regardless, so what you’re basically arguing about, and paying extra for, is how “warm” the alternate site is and how much has to be spun up from scratch via S3.
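The data-loss half of such an SLA ("no more than 5 minutes") boils down to a checkpointing policy: buffer edits and ship them to durable storage before the buffered window exceeds the promise. A toy sketch, with the upload function left as a stub where a real system would PUT to S3:

```python
import time

class Checkpointer:
    """Ships buffered edits to durable storage often enough to honor an RPO."""

    def __init__(self, rpo_seconds, upload, clock=time.monotonic):
        self.rpo = rpo_seconds
        self.upload = upload   # in a real system: a function that PUTs to S3
        self.clock = clock     # injectable for testing
        self.pending = []
        self.last = clock()

    def record(self, edit):
        self.pending.append(edit)
        # Flush before the buffered window exceeds the promised data loss.
        if self.clock() - self.last >= self.rpo:
            self.upload(list(self.pending))
            self.pending.clear()
            self.last = self.clock()
```

The recovery-time half (back up within 20 minutes) is then a question of how warm the standby is: how much of the stack is already running in the other region versus spun up from scratch off the S3 copy.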

Another observation about this outage: it is largely focused on EBS latency, though there is also talk of difficulty connecting to some EC2 instances. This is the second time in recent history we’ve heard of some major EBS issues. We read that Reddit had gone down over EBS latency issues less than a month ago. Clearly anyone using EBS needs to be thinking about failure as a likely possibility. In fact, the ReadWriteWeb article I linked to implies Reddit had been seeing EBS problems for quite some time. One wonders if Heroku has too.

What will you do if you’re using EBS and it fails? Reddit says they’re rearchitecting to avoid EBS. That’s certainly one approach, but there may be others. Amazon provides considerable flexibility in combining local disk, EBS, and S3 to fashion alternatives. The trick is in making your infrastructure sufficiently metadata driven, and sufficiently well tested (you’ve thought through the scenarios and actually tried them), that you can adapt in real time when problems develop. In this respect, I have seen Netflix admonish that the only way to test is to keep taking down aspects of your production infrastructure and making sure the system adapts properly. That’s likely another good question to ask your PaaS and Cloud vendors: “Do you take down production infrastructure to test your failover?” Of course, you’d like to see it done and not just take their word for it.
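The Netflix-style drill can be simulated even on a toy model: keep killing live instances and assert that requests still land on a survivor. A hypothetical sketch (the real thing, of course, targets actual production infrastructure, not a Python object):

```python
import random

class Service:
    """Toy service: a pool of redundant instances; any live one can serve."""

    def __init__(self, instances):
        self.up = set(instances)

    def handle_request(self):
        if not self.up:
            raise RuntimeError("total outage")
        return next(iter(self.up))  # any surviving instance takes the request

def chaos_drill(service, kills, rng=random):
    """Kill random instances one at a time, verifying service after each."""
    for _ in range(kills):
        victim = rng.choice(sorted(service.up))
        service.up.discard(victim)   # simulate taking down production
        if service.up:
            assert service.handle_request() in service.up
```

The point of running the drill continuously rather than once is that failover paths rot: the assertion that held last quarter may not hold after the next deploy.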

I haven’t even touched on the possibility of utilizing multiple Cloud vendors to ensure further redundancy and failover options. It would be fascinating to see a service like S3 that is redundant across multiple data centers and multiple cloud vendors. That seems like a real winner for building the kinds of services that will be resilient to these kinds of outages. It’s early days yet for the Cloud, though some days it seems like Amazon has won. There’s plenty of opportunity for innovators to create new solutions that avoid the problems we see today.

One response to “What to Do When Your Cloud is Down”

Amen to that. As cloud tenants we know it’s evolving, and every incident is a learning experience. Yes, the ideal is platform agnostic, or even Amazon-region agnostic. I know Heroku is conscious of the safe harbor situation, whereby data is US-based only even though the Amazon layer is international. Maybe a chance to solve two problems with one solution? In crisis, opportunity.