Amazon and the cloud rules of engagement.

As Amazon’s web services outage passes its third day, the debate on the future of cloud computing is underway. The outage is costing web sites such as Reddit and Quora dearly as users turn elsewhere to get their social media needs met.

Amazon’s Elastic Compute Cloud service hosts thousands of major web sites that rely on it to serve pages to users. And users rely on these services to store their personal accounts and data remotely. So when the EC2 service goes down, so do the web sites, and that means users can’t log in to access their data. It’s a big hiccup for an industry that is projected to grow to $55 billion by 2014, according to market researcher IDC.

The duration of the outage has surprised many, since Amazon has a lot of backup computing infrastructure. If Amazon can’t safeguard the cloud, how can we rely on it? So the debate begins on the future of cloud computing and on what it will take for users and companies to put their trust in cloud vendors such as Amazon.

I love the romantic notion of cloud computing, with computing power on tap, minimal outlay, and majestic, infinitely reliable availability. This story feels like every science fiction novel of my childhood, complete with aliens, robots, and rocket ships.

Tragically, this is still a fiction, a dream. But it is a dream deeply believed, apparently, by many. Why else would we see such outrage over an outcome that was predictable, and arguably a good thing?

Cloud environments are very large machines, with large-scale components (warehouses) and large numbers of self-similar sub-components (servers, virtual machines, processes, etc.). This is made more complex by explosive growth alongside the march of progress in servers, appliances, and other components. Cloud environments are extremely valuable and powerful, but we should not expect perfect robustness.

This is not at all to say that the ‘cloud’ should be avoided. Rather, the risks need to be understood and managed.

So occasional outages like Amazon’s are healthy. We need signals that tell us to architect our cloud-reliant systems robustly, to avoid failure scenarios. Without seeing some ‘outages’ along the way, we are much more likely to end up in something of a “computing sub-prime” crisis, with blind over-commitment to a resource that doesn’t reach our ‘better than real’ expectations. (This is human nature.)

These failures, and the outrage and disbelief, are part of cloud computing transitioning from a fairy tale into a hard-nosed, down-to-earth resource. Into something that builds our future. This is a cultural transition in IT: collectively learning the upsides, downsides, and rules of engagement for the IT ‘cloud’.