Microsoft's Office 365 and Google Docs both went down this week. For cloud …

Outages are becoming a distressing fact of life for Microsoft’s cloud e-mail customers, and for users of other cloud services such as Google Apps. Two weeks of e-mail glitches plagued Exchange Online customers using Microsoft’s Business Productivity Online Suite (BPOS) in May. Office 365, the successor to BPOS, which launched in late June, suffered an e-mail outage in August and then again last night and this morning.

The latest Microsoft outage was caused by what the company vaguely called a “DNS issue” and affected not just Office 365 but also the consumer services Hotmail and SkyDrive. The outages were spread throughout the world.

Taken together, the outages may put second thoughts in the minds of IT executives considering the move from locally hosted Exchange servers to Microsoft’s cloud, to Google Apps or to Amazon’s cloud.

Of course, IT systems can go down whether they are run by customers in their own data centers or outsourced to cloud vendors. But large institutions with multimillion-dollar IT budgets may be able to achieve greater reliability by keeping IT in-house, without worrying about sensitive data residing in a vendor’s data center.

In response to the Hotmail and Office 365 outages, Microsoft tells Ars, “On Thursday, September 8th at approximately 8:00pm PDT, Microsoft became aware of a DNS issue causing service degradation for multiple services. We achieved full service restoration at approximately 11:30pm PDT. We are conducting a review of our processes. We appreciate your patience.”

The Hotmail outage was discussed further on Microsoft’s Windows Live blog, which said fixing the problem required “propagating our DNS configuration changes around the world.” Microsoft’s Office 365 team kept customers updated on Twitter. Despite Microsoft’s statement that all problems ended at 11:30pm PDT, the company was still receiving complaints from customers via Twitter 8 hours later. In response to those complaints, the Office 365 team tweeted “We are investigating an issue for a small number of customers.”

Google was also forced to explain itself this week.

“On Wednesday we had an outage that lasted one hour and meant that document lists, documents, drawings and Apps Scripts were inaccessible for the majority of our users,” Google Docs Engineering Director Alan Warren wrote in a blog post. “The outage was caused by a change designed to improve real time collaboration within the document list. Unfortunately this change exposed a memory management bug which was only evident under heavy usage.”

Cloud services may make the most sense for small business owners, such as Paul Burns, an IT industry analyst whose small firm Neovise uses Office 365 for e-mail and other services. Last night, Burns says he was able to access e-mail through the Outlook client on his Windows desktop, but could not get mail on his mobile phone or through the Office 365 Web interface.

The outages are starting to become “a little bit of par for the course,” Burns told Ars. “There are going to be outages even within corporate IT. People that are investing millions of millions of dollars in keeping e-mail and SharePoint up and running, they’re still going to have outages. But I think for Office 365, it strikes me for a public service that it’s happening a little too often.”

Unfortunately, Burns says it is hard to stay up to date when cloud problems happen. Microsoft’s status portal also went offline during the outage yesterday, he notes. And Burns is troubled by some less serious errors that nonetheless make life harder for customers. Earlier this week, Burns says he was unable to upload an audio file to SharePoint Online, and had to search through customer forums to find a workaround.

On the more critical e-mail issue, Burns notes that customers can regain some control by archiving e-mail locally. “On my Outlook clients I set them up to have all of my e-mail cached locally,” he says. “It doesn’t help me send or receive mail when it’s down. But let’s say they lost my data. I’m not expecting them to lose my data, but I would have it all archived.”

With Windows Azure, another cloud service, Microsoft employees have tried to prove it is enterprise-ready by using it themselves. Microsoft’s IT division is also starting to move employees to Office 365.

Google’s Warren says “We use Google Docs ourselves every day, so we feel your pain and are very sorry.”

A Microsoft spokesperson tells Ars that millions of users and more than 20 percent of the Fortune 500 are using Microsoft cloud productivity tools, and lists BPOS, Office 365, Exchange, SharePoint and Lync Online as the services in use. This could include cloud-based versions of Exchange, SharePoint and Lync offered by Microsoft partners.

According to Google, more than 4 million businesses use Google Apps, which includes Docs, Gmail and Calendar.

But for many customers, the decision on whether to go in-house or cloud is murky. When asked if he’s still glad he signed up for Office 365, Burns says “I have been reevaluating it. The outage does concern me but I’m hopeful it will improve over time.”

Update: Microsoft has sent us a revised statement on this week’s outage, which reads: “On Thursday, September 8th at approximately 8 p.m. PDT, Microsoft became aware of a Domain Name Service (DNS) problem causing service degradation for multiple cloud-based services. A tool that helps balance network traffic was being updated, and for a currently unknown reason, the update did not work correctly. As a result, the configuration was corrupted, which caused service disruption. Service restoration began at approximately 10:30 p.m. PDT, with full service restoration completed at approximately 11:30 p.m. PDT. We are continuing to review the incident.”

Probably DRM, but that aside: wouldn't it be a more reliable solution to have the software run client side and store all files locally and files in a specified folder are sync'd with the cloud on the fly? That way you could just enter your login in the client program and retrieve whatever, wherever instead of "Oh, Google's down - can't do squat today."

Local should always be the preferred method for accessing "mission critical" things. Cloud is a nice add-on for syncing between devices, backing up, or sharing with others or with low-storage devices, but services shouldn't assume 100% connectivity. This is also an argument for a peer-to-peer augment to the cloud: is there any reason why two computers connected only to each other should not be able to share documents in the same way without a central server farm?

Cloud is just a bunch of interconnected systems and systems occasionally go down. I think this points out there's some stuff you should put in the cloud, some stuff you could put in the cloud and other items that just don't belong.

“But large institutions ... may be able to achieve greater reliability ...”

While that's true, the far more important benefits are the increased control and visibility of in-house IT. Even if internal IT has twice as much down time (by whatever measure you pick), knowledge of the details of the outages and authority over the management of the systems gives leadership options they will never get from a vendor.

Obviously these things always boil down to money. The extra costs with internal IT are staff and infrastructure. With third-party hosted IT ('cloud' or otherwise) these costs are rolled into the licensing of the software/service, but there is the hidden extra cost of not having the liberties available when IT is internal.

It's an extremely difficult equation to evaluate as there are so many external variables (local job market, vendor options, lock-in, training and re-training costs, differences of scale, ...). Availability is just one variable and I don't think it dominates.

I'd love to see some charts of cloud downtimes vs company hosted downtimes.

Me too... we'll work on it for a future story!

I work for a Fortune 500 company and the brass officially ruled out the option of using a 3rd party cloud about a month ago. The reality for most corporations will be an internal "cloud", aka server consolidation and a move toward more thin clients.

The whole cloud thing is just a hipster buzzword to get you more locked in to your vendor. The only reason this ever had traction was CIOs wanting to get a nice fat outsourcing bonus. It has some positive aspects if you are an SMB though. I can definitely see them offloading the less critical parts of their "IT" infrastructure.

Remember, the big difference between local and cloud is that when an outage happens (and it WILL happen), you can blame someone in your own organization instead of someone else. I can almost guarantee you that Google and Microsoft both offer FAR more uptime for messaging and collaboration services (what we're talking about here) for their cloud users than any other private entity in the world.

The false dichotomy in the press of "local = up, cloud = down" is nonsense. It is a matter of "local = low public visibility, cloud = high public visibility". Private companies have tons of outages all the time. They just don't advertise it, most don't do a cost/benefit analysis, and far too many don't even bother doing root cause analysis.

I second the post comparing cloud uptime to internal uptimes. Obviously internal uptimes would vary depending on the organization though.

I think the balance is in cost of outages vs cost of critical services going down. For some even "mission critical" apps can be down for hours at a time without a major crisis. In that case it may well be worth it to go with cloud solutions and avoid the IT cost and infrastructure. Compare that to industries such as Wall Street banking and great pains must be taken to avoid any outages whatsoever. In those instances cloud based solutions should most likely be avoided entirely.

While that's true, the far more important benefits are the increased control and visibility of in-house IT. Even if internal IT has twice as much down time (by whatever measure you pick), knowledge of the details of the outages and authority over the management of the systems gives leadership options they will never get from a vendor.

This. In terms of perceived control, which do you think an executive is going to prefer when the network goes down: calling up the in-house IT people who can be terminated via email on a whim, or calling up some third party call center in India, where you're one of probably dozens if not hundreds of other clients calling to wonder why you can't access your data, and whose contract doesn't come up for review for another 8 months?

It doesn't mean that their information will come up faster. But it does make them feel like they've got more control over the situation. And when you're dealing with Type A personality executives, their perception that they've got some control over things is the paramount issue.

Unless of course, you can show them that their perceived "control" is costing them $50, $70, $90 a month per user, versus $6, $12, or $24. Then of course, they see the dollar signs in the executive bonus they'd receive for cutting costs so dramatically while increasing productivity. That's why there's traction in all this. It's not going to take over the world, but it won't go away either.

I don't have actual numbers to back this up, but I have a feeling that the lower cost per user that cloud services offer is part of the problem. If a company spent the same reduced amount on IT that they pay MS or Google for these services, they'd probably have increased downtime too.

Obviously it won't be a one-to-one correlation, but maybe they are pricing the services too cheaply to try and get companies on board?

Of course it could just be growing pains. I suspect Exchange 1.0 was far less reliable than the current version.

Well we'll see what the final report is that they issue, but it's worth noting that so far they say it was a DNS issue. That doesn't relate to the core services, but the infrastructure services around it. The question is, was there some automatic process that didn't work right, was there some user mistake that slipped through checks, or was there some propagating failure that took place? It's hard to know without the root cause analysis that should come later. Frankly, every organization has to deal with DNS issues as well. I am sure there are more than a handful of people here in the Ars forums who've lost multiple weekends to wonky DNS. I wish there were something better.

Remember, the big difference between local and cloud is that when an outage happens (and it WILL happen), you can blame someone in your own organization instead of someone else. I can almost guarantee you that Google and Microsoft both offer FAR more uptime for messaging and collaboration services (what we're talking about here) for their cloud users than any other private entity in the world.

The false dichotomy in the press of "local = up, cloud = down" is nonsense. It is a matter of "local = low public visibility, cloud = high public visibility". Private companies have tons of outages all the time. They just don't advertise it, most don't do a cost/benefit analysis, and far too many don't even bother doing root cause analysis.

You're forgetting the simple fact that if a company has 1,000 PCs and 10 of them died, you still have 990 others running. Whereas if your local connection gets disrupted or the "cloud" goes offline, then every single one of those 1,000 PCs connected to the service goes "down". Something to think about...

Edit:

Let me also add that there's no such thing as predictive risk because Murphy's Law dictates that while 99.999% of the time shit doesn't happen, the 0.001% will happen when you LEAST expect it and can LEAST afford it.

The problem comes down to what dies. At your small company, if your fileserver goes down then it doesn't matter how many PCs are up; little work gets done. With a single Google server, or even a thousand of them, there are stacks of others running the same thing, so a failure just increases load. But when you have a DNS problem like Microsoft supposedly had, or one flaw in the code that's running on every machine, then it all goes down. If the wrong router or server or whatever dies in a company, then it all goes down.

Individual results vary too; the smaller the company, the worse the downtime can be. A tech buddy of mine told me about a local accounting firm that had one hard drive die in its RAID array and tried to rebuild the entire thing on a slow-as-hell computer. It took most of an entire day, on the second-to-last day to file taxes for the year. Murphy's Law, that one. Where I am it was common practice to keep decks of cards around until we hired an excellent IT admin who's been working on network issues and reliability for a few years now. Downtime does happen more from our hosted Outlook than our local systems, but it was expensive to get this level of reliability too.

Neither the cloud, nor the local server, nor the network link is going to be 100% reliable. Any decent business will have a backup for contingencies in case things go wrong. I use Syncdocs to sync Google <-> laptop. That way there's no single point of failure if Google Docs goes down or someone steals the laptop or the network dies. Just common sense.

Cloud is just a bunch of interconnected systems and systems occasionally go down. I think this points out there's some stuff you should put in the cloud, some stuff you could put in the cloud and other items that just don't belong.

It's funny you guys ran this story, I was just sent a survey from VMWare on cloud services yesterday. It had quite a few questions about security & cost, but mentioned very little about uptimes. I can see this is a very big money spinner for companies like MS, Google & Amazon, but they really need to get their houses in order before IT managers move mission critical services to the cloud.

The whole selling point of cloud services is that they are "on demand", you can't really say that if there's not five nines uptime or at the very least decent disaster recovery. I think we're seeing this house of cards come down, it's nothing new, computing has moved in ebbs and flows for as long as I can remember.
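For a rough sense of what "five nines" allows, here is a minimal sketch of the arithmetic. It assumes a 365-day year, and the function names are hypothetical, made up for this illustration. The 210-minute figure is the window in Microsoft's statement (8:00pm to 11:30pm PDT), not a measured per-customer outage.

```python
# Downtime budget arithmetic for availability targets (illustrative only).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_after_outage(outage_minutes: float) -> float:
    """Yearly availability (percent) if one outage were the only downtime."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows {minutes:.1f} minutes down per year")

    # 8:00pm to 11:30pm PDT is 3.5 hours, i.e. 210 minutes.
    print(f"One 210-minute outage: {availability_after_outage(210):.3f}% for the year")
```

Five nines works out to a little over five minutes of downtime a year, so a single outage of a few hours already puts a service well short of that target for the year, even if nothing else goes wrong.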

If you remember computing before, say, the 90's, then you've already been to the cloud.

Thank you! I've been giggled at many times for so suggesting, but as a mainframe sysprog, I spent many hours trying to explain why it took a half hour to restore system availability. ("It's not a bug, it's a feature.")

I'm a copywriter for a small online business; it's just the 3 of us and I use Google Docs. I'm not much of a fan of Google, but it's something they've done really well. An outage would be a minor annoyance, but certainly nothing that serious, and far less annoying than using MS Word every day. Fuck, I remember those days, I'd rather have Google Docs go down for 10 mins every hour.

According to Google, more than 4 million businesses use Google Apps, which includes Docs, Gmail and Calendar.

I wonder how many of the 4 million are tiny part-time businesses like mine with just one user, who only ever uses the Gmail and Calendar services? And I don't even use the web interface, I sync it all to my phone and use mutt on the PC. But it wouldn't surprise me a bit if there were 2.5 million businesses using Google Apps that approximately match my profile, in which case 4 million sounds far less rosy.

EDIT: BTW, I'm not in love with Google but their Gmail and Calendar haven't let me down for more than a few minutes since 2007. I do keep current local copies of absolutely everything on a server which I can log into from my phone; not doing so would make me worry. I don't create documents with Google Docs because LibreOffice seems good enough, and OOo before that did too.

Remember, the big difference between local and cloud is that when an outage happens (and it WILL happen), you can blame someone in your own organization instead of someone else. I can almost guarantee you that Google and Microsoft both offer FAR more uptime for messaging and collaboration services (what we're talking about here) for their cloud users than any other private entity in the world.

The false dichotomy in the press of "local = up, cloud = down" is nonsense. It is a matter of "local = low public visibility, cloud = high public visibility". Private companies have tons of outages all the time. They just don't advertise it, most don't do a cost/benefit analysis, and far too many don't even bother doing root cause analysis.

I call BS. In the last year my corporate Exchange has had less downtime than Office 365's email. Saying that private stuff goes down and we don't know about it is convenient because we also don't know how often it happens. Except I do know in the case of the one I use, and it's rare.