Why cloud-based email doesn’t guarantee availability

27 October, 2015

Danny Bradbury

Vendors promise email business continuity for customers based on the robustness of the cloud. “Give your email service to us,” they say. “We can run it more efficiently and securely than you can.” Unfortunately, this isn’t always the case. Cloud services can fail, too.

The computing systems underpinning cloud email services run many services besides email. They are highly complex environments in which failures do happen. Gartner’s cloud performance evaluation service CloudHarmony maintains a chart showing each cloud service provider’s uptime record. Many providers on it have suffered numerous outages and multiple hours of downtime.
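
To put uptime figures in perspective, here is a minimal sketch that converts an uptime percentage into the hours of downtime it implies over a year. The function name and the sample percentages are illustrative assumptions, not figures taken from CloudHarmony.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


# Illustrative percentages only; these are not measured provider figures.
for pct in (99.0, 99.9, 99.99):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% uptime -> about {hours:.2f} hours down per year")
```

Even a service that is up 99.9% of the time can be down for nearly nine hours over a year, which is roughly the length of a working day.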

Why do these cloud failures happen, and what can customers do about them?

Code flaws

Let’s look at one of the most egregious outages in recent times: the June 2014 Exchange Online failure that disrupted Office 365’s email continuity for up to nine hours.

The mail continuity failure was caused by a directory problem. Directories are a record of end-user accounts and privileges – a kind of sophisticated phone book for the enterprise – and Microsoft maintains its own for cloud-based customers. When it failed, user accounts wouldn’t authenticate, so users couldn’t access their mail.

Human error

Sometimes, problems are compounded – or even caused – by human error. In November 2014, Microsoft’s Azure service suffered from a massive outage. A change in its storage configuration designed to improve performance contained a bug that took storage offline.

Normally, cloud vendors test changes like this multiple times on test servers and small numbers of production servers to catch such problems before deployment. In this case, an engineer assumed that the tests had been run, when they hadn’t.

Cascading failures

Cloud infrastructures are complex networks of highly interconnected systems. Occasionally, one small error can ripple through the infrastructure and turn into a far larger outage. This butterfly effect hit Amazon in 2012.

The firm replaced a single failed server and updated its DNS records (which tell other computers where to find it). The updates didn’t reach all the other computers, which kept looking for the old, failed machine. This took down an increasing number of servers, which in turn affected still more Amazon services.

What vendors can do

Some cloud service vendors are smart about their infrastructures. They run multiple, high-availability instances of their applications across different datacenters within a single vendor’s cloud, or even between multiple vendors’ infrastructures.
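
The payoff from running multiple instances can be estimated with simple arithmetic. The sketch below assumes that each instance fails independently of the others, which is a strong assumption: as the cascading failures described earlier show, real outages are often correlated. The function name and the sample availability values are illustrative.

```python
def combined_availability(availabilities):
    """Probability that at least one instance is up.

    Assumes instances fail independently, so the chance that all of them
    are down at once is the product of their individual downtime chances.
    """
    p_all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
        p_all_down *= 1.0 - a
    return 1.0 - p_all_down


# Two hypothetical instances, each available 99% of the time.
print(f"{combined_availability([0.99, 0.99]):.4%}")
```

Under that independence assumption, two instances at 99% each combine to roughly 99.99%. Spreading instances across datacenters or vendors is an attempt to make the assumption closer to true.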

Online video giant Netflix goes one step further, testing its cloud resilience by introducing real failures. It uses software tools called Chaos Monkey and Chaos Kong to deliberately take out individual production servers or even entire regions in its infrastructure. This testing helps the firm to stay agile and prepare for vendor cloud outages.
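
The idea behind this kind of testing can be illustrated with a toy simulation. This is not Netflix’s tooling or its API; the Service class and chaos_round function are hypothetical names, and the "servers" are just strings in a set.

```python
import random


class Service:
    """A toy service backed by a set of named instances (hypothetical)."""

    def __init__(self, instance_names):
        self.healthy = set(instance_names)

    def terminate(self, name):
        # Simulate the loss of one instance.
        self.healthy.discard(name)

    def is_available(self):
        # The service survives as long as one instance is still healthy.
        return len(self.healthy) > 0


def chaos_round(service, rng):
    """Terminate one randomly chosen healthy instance and return its name."""
    if not service.healthy:
        raise RuntimeError("no healthy instances left to terminate")
    victim = rng.choice(sorted(service.healthy))
    service.terminate(victim)
    return victim


service = Service(["server-a", "server-b", "server-c"])
victim = chaos_round(service, random.Random())
print(f"terminated {victim}; still available: {service.is_available()}")
```

The point of the exercise is the check at the end: if losing one instance makes the service unavailable, it is better to find that out in a planned test than during a real outage.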

What customers can do

Customers can hedge their risk by spreading their cloud presence across different locations or providers. The right approach depends on the type of cloud service they use. There are three broad categories:

Infrastructure as a Service (IaaS), which provides access to basic compute and storage resources.

Platform as a Service (PaaS), which gives you a framework on which to build software applications.

Software as a Service (SaaS), which shields you from the underlying infrastructure and platform and just gives you the applications you want to use. This is where email and collaboration services sit.

Hybrid cloud is one option for IaaS and PaaS: internal, private cloud servers are connected to a public cloud system so that one can fail over to the other.
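
Failover of this kind can be sketched in a few lines. The example below is a minimal illustration, assuming each environment is represented by a callable that either returns a result or raises an exception; call_with_failover and the two stand-in backends are hypothetical names, not part of any vendor’s product.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in order and return the first successful result."""
    last_error = None
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # any failure moves on to the next backend
            last_error = exc
    raise RuntimeError("all backends failed") from last_error


def internal_servers() -> str:
    # Stand-in for the internal environment, simulated here as being down.
    raise ConnectionError("internal servers unreachable")


def public_cloud() -> str:
    # Stand-in for the public cloud environment.
    return "served from public cloud"


print(call_with_failover([internal_servers, public_cloud]))
```

A real deployment would also need health checks, data replication between the two environments, and a way to fail back, none of which this sketch attempts.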

Users of SaaS tools such as Office 365 can use complementary public cloud services to assist them, providing an extra layer of functionality and email continuity. For example, MAXMail checks and sanitizes all email before it even reaches Office 365 servers, but can also serve as an email continuity layer, providing access to email even when Office 365 goes down.

The bottom line: don’t rely entirely on one cloud provider for services that you consider mission critical. If you can’t tolerate much downtime, then reduce your risk by spreading the workload. If your alternative provider can complement the service offered by your main provider with extra functionality, so much the better.