Cloud outages reinforce need for care when negotiating SLAs

Tony Redmond

Wed, 2011-08-17 23:00

Another month, another set of outages come along to delight those who follow Microsoft’s journey to cloud nirvana. On August 7 it was a power outage that knocked out their Dublin datacenter and disrupted service to European users of Business Productivity Online Services (BPOS). Curiously, the same outage (originally reported as being caused by a lightning strike on a transformer in the Dublin CityWest district and later assessed as a failure of the local power grid) also knocked out the Amazon EC2 service for some of Amazon’s customers. Thankfully the BPOS outage didn’t last long, and a flurry of Twitter updates helped to reassure customers that Microsoft was doing everything necessary to address the problem.

Of course, this outage didn’t affect all BPOS users because Microsoft distributes companies across its cloud datacenters around the world, taking into account requirements such as data location (some customers require data to remain inside specific national borders), server and storage availability, and ability to on-board users. In addition, BPOS (and Office 365) “double-teams” datacenters so that if a customer loses service from one datacenter another datacenter is available to take on the load.

Although BPOS outages have been reasonably common over the last year or so, Microsoft has told us that Office 365 would be different. The story goes that Office 365 was re-engineered from scratch, taking account of the problems that have occurred for BPOS since its launch in 2007. Indeed, on June 22, Microsoft replied to some user enquiries with a tweet saying that “Office 365 should provide a more stable service. It is built from ground up new…”

The issue that knocked Office 365 out on August 17 therefore came as a surprise. This was the first major outage suffered by Office 365 since its formal launch in June. It started around 3:18pm EST and lasted approximately two hours until service was slowly restored from 5:23pm onward. Inevitably, it took time for the service to resume normal operation as mail queues had to be cleared; some users reported that things only really returned to normal around 6pm. The times reported here are taken from customer tweets as people reported the loss and then return of service.

Apparently the underlying cause was a networking problem that affected Microsoft’s North American datacenters. The outage only affected Exchange Online and users could not connect to their mailbox using Outlook or OWA. Both SharePoint Online and Lync Online chugged ahead (the same outage affected Microsoft CRM but this isn’t part of Office 365). Those who were online reported that messages to external domains took far longer to be delivered. My Exchange mailbox based in a European datacenter remained online throughout and delivered its normal high quality of service.

The twittering class rapidly filled the ether with commentary that varied from advice that now would be a good time to get a coffee (but for so many hours in the middle of the working day?) to loud complaints decrying the frailty of cloud services and the inability of Microsoft to diagnose what was going on. Others commented that they were in the middle of testing or learning about Office 365 and the downtime didn’t inspire confidence. Tweets from Microsoft to advise users to check the service dashboard (which had some problems of its own during the incident) or consult the blog telling them how to get support didn’t do much to soothe troubled brows. Likewise, tweets from PR people telling of the wonder experienced by companies that had just signed up for Office 365 or those about products that integrate with Office 365 didn’t help either.

The Office 365 outage was made more embarrassing because it occurred on the same day that Microsoft Press launched a free ebook called “Microsoft Office 365: Connect and Collaborate Virtually Anywhere, Anytime”. The assertion in the PR write-up that “Office 365 is Microsoft’s smart and simple answer to cloud computing” rang a little less true following the problems with Exchange Online.

No one can legislate for power outages, as these can happen at any time in any place and affect datacenters dedicated to individual companies as well as those shared amongst many. The same is true for the network connections that link datacenters with the Internet. However, outages such as those on August 7 and August 17 demonstrate once again that cloud services do not operate in a state of utopian perfection, and companies that sign up for cloud services must realize that even the most professionally designed and operated datacenter can be taken offline by unpredictable and unexpected happenings. What’s therefore important is how quickly cloud service providers can detect out-of-course events, react appropriately, and restore service. Ideally without users noticing!

How such events are handled by the service provider is a crucial topic to cover during the negotiation of a Service Level Agreement (SLA), and it’s a topic that should involve a great deal of input from a company’s technical staff. After all, it’s natural that management focuses on the SLA terms and conditions, including the financial penalties that flow from a failure to deliver service, and the legal team will probably do an excellent job of ensuring that the right language is in place to protect the company’s interest.
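When weighing SLA terms against incidents like the two-hour August 17 outage, it helps to translate an availability percentage into the downtime it actually permits. The sketch below (illustrative arithmetic only; the function name and the assumption of a 30-day month are mine, not any provider’s actual SLA terms) shows the conversion:

```python
# Illustrative sketch: convert an SLA availability percentage into the
# maximum downtime it permits per month. Assumes a 30-day month; real
# SLAs define their own measurement periods and exclusions.

def allowed_downtime_minutes(availability_pct: float, days_in_month: int = 30) -> float:
    """Return the downtime (in minutes) that a monthly SLA of the
    given availability percentage still permits."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% uptime allows {allowed_downtime_minutes(sla):.1f} minutes of downtime per month")
```

On these assumptions, a 99.9% monthly SLA permits roughly 43 minutes of downtime, so a two-hour Exchange Online outage would already exhaust it several times over. That is exactly the kind of gap between the marketing number and the lived experience that technical staff should probe during negotiation.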

But it’s simply not good enough to assume that any cloud provider will anticipate problems that cause additional costs to mount for a company. Indeed, if any criticism can be levelled at cloud providers about how they deliver technology, it is that they make far too many assumptions about how technology is consumed within individual companies and attempt to make one solution fit all needs. The technical staff understand how users work within their company, so their role is to ask the difficult but essential questions of the service providers during SLA negotiations to ensure that all bases are covered.

For example, if one of a datacenter pair is rendered inaccessible, how quickly can user connections be transferred to its partner? Will any configuration changes be required on user PCs? Forcing an Outlook profile update to reconfigure a PC to pick up a new datacenter could become a horror story for many companies, especially those that don’t have reliable and centralized control over PCs. But of course, that’s never the case in the real world… is it?