Microsoft Explains Exchange Outage

Microsoft pledges to do better after frustrating customers with last week's Exchange Online and Lync Online outages.

Microsoft has provided more details to explain the outages suffered last week by its Exchange Online and Lync Online hosted services. Some customers were unable to reach Lync for several hours Monday, and some Exchange users went nine hours Tuesday without access to email. Many customers took to Microsoft's online forums and social media accounts to voice displeasure, not only at the service outage, but also at Microsoft's handling of the situation.

In a blog post, VP of Office 365 engineering Rajesh Jha said both outages affected Microsoft's North American data centers but that the issues were unrelated. "Email and real-time communications are critical to your business, and my team and I fully recognize our accountability and responsibility as your partner and service provider," he wrote.

Jha said the June 23 Lync Online disruption stemmed from external network failures that caused a short loss of client connectivity in Microsoft's data centers. The connectivity problem persisted only a few minutes, but Microsoft claims the ensuing traffic spike caused networking elements to become overloaded, which led to some customers' extended service issues.
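Microsoft has not published the exact mechanism behind that spike. One common way a brief connectivity loss turns into a surge is that every disconnected client retries at the same instant. The sketch below is a toy model of that general pattern, not a description of Microsoft's systems: the client count, the 60-second retry window, and the function name are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def peak_reconnects_per_second(num_clients, max_delay_s, seed=0):
    """Return the largest number of reconnects landing in any one-second bucket.

    Hypothetical model: each client waits a uniformly random delay between
    0 and max_delay_s seconds before reconnecting. A max_delay_s of 0 means
    every client retries immediately.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs match
    buckets = Counter()
    for _ in range(num_clients):
        delay = rng.uniform(0, max_delay_s) if max_delay_s > 0 else 0.0
        buckets[int(delay)] += 1  # group reconnects by whole second
    return max(buckets.values())


# Assumed figures: 10,000 clients, retries spread over up to 60 seconds.
immediate = peak_reconnects_per_second(10_000, 0)
jittered = peak_reconnects_per_second(10_000, 60)
print("peak with immediate retry:", immediate)
print("peak with randomized delay:", jittered)
```

In this model, immediate retries put all 10,000 reconnects into the same second, while randomized delays spread the same load across the window. Whether anything like this applied to Lync Online is not something Jha's post confirms.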

The June 24 Exchange Online disruption, meanwhile, stemmed from a periodic failure that caused a directory partition to stop responding to authentication requests. Jha said "a small set of customers" lost email access altogether, and that others -- due to another, previously unknown flaw -- experienced email delays. Jha did not divulge how many customers were directly affected by Exchange Online's root error, nor how many dealt with the larger ripple-out effects.

The Exchange outage was compounded by a problem in Microsoft's Service Health Dashboard publishing process. The dashboard indicated to some customers that their services were fully functional, even as those services refused to load.

Jha said Microsoft has a full understanding of the problems that caused the disruptions, and is "working on further layers of hardening" to protect against future outages. He said customers can expect a Post-Incident Report in their Service Health Dashboards. Jha promised it will contain a detailed analysis of what went wrong, how Microsoft reacted, and how the company plans to avoid similar problems going forward. Though Jha's failure to detail how many customers were affected doesn't suggest a particularly transparent tone, Microsoft has a good record for sharing technical details following a service disruption.

Though Microsoft's cloud products experience few outages, last week's problems demonstrate why service lapses can be a big concern when they occur. Microsoft, Google, and others want companies to use cloud services to handle data and applications that have traditionally been hosted and managed in-house. The big cloud players have made progress over the last year, but all it takes is one outage to make professionals reconsider whether they want essential data and services to be handled by a third party.

During Tuesday's Exchange outage, a number of customers made such concerns abundantly clear. Microsoft didn't acknowledge the problems, which started around 6:00 a.m. EDT, for several hours. Even then, communications were labored; the company relied on user forums and social media to spread the word, which, given the Service Health Dashboard problem, left some customers confused and frustrated. Some criticized the company for euphemistically calling the disruption a mere "delay" in email deliveries.

"If by 'delays' you mean 6+ hours of complete outage," wrote Twitter user JD Wallace in response to a Microsoft tweet that acknowledged some Exchange customers were "experiencing email delays."

Others complained that Microsoft was slow to estimate when service might be restored. Some customers said they waited more than an hour to talk via phone with Microsoft reps, only to be given no new information.

"Microsoft needs to work more with us. IT people are getting crazy without having [anything] to tell our users," a user with the handle JanetsyLeandro wrote in an Office 365 community forum. "We need a real update... [It's] causing a big problem to our business."

Time will tell whether the service outage affects the momentum of Exchange Online, Office 365, and other Microsoft cloud products. Was your business hit by last week's outages, and were you satisfied with Microsoft's response? Let us know in the comments.

Michael Endler joined InformationWeek as an associate editor in 2012. He previously worked in talent representation in the entertainment industry, as a freelance copywriter and photojournalist, and as a teacher. Michael earned a BA in English from Stanford University in 2005 ...

This is one critical thing about cloud-based stuff. The cloud must be up and running 24x7. No outage is really tolerable, which is quite different from the old enterprise software days, when we could at least allow some maintenance window. This is a challenge for both development and operations personnel.

But their explanation doesn't jibe with customer experience. The "external" network issue only lasted a few minutes and yet cascaded into an all-day outage. That sounds like a freeway with so much traffic that a 15-minute flat tire on the shoulder creates a parking lot that takes all day to dissipate. If you were in the shipping business, would you ever route deliveries on such a freeway?

I don't see any way to spin this positively unless we're still missing information, such as a DDoS attack or some kind of rabid spam event.

One would also expect the world's largest software company whose goal is to be the world's largest cloud resource to have a plan C and probably even a plan D. I also don't think it's unreasonable to expect that when plan A fails, a task force convenes and starts working on plan E and plan F -- possibly skipping plan C and D because they've come up with a specific response that solves the issue.

Imagine what might happen to a retailer that relied on Microsoft for credit card payments? Why would a service provider of this claimed caliber assume e-mail is such a casual service?

The communication on this issue from MSFT was poor, which heightened the frustration among the masses, I think. Although I wasn't in the office that day (whew), here's the email we received from our spam-filtering company, Mimecast: "Mimecast has identified that Office 365 servers may be issuing intermittent '4.3.2' deferrals for inbound messages. Mimecast services are working correctly and emails sent to these servers will continue to queue. Office 365 customers should contact Microsoft directly to report and investigate the issue."
At least someone is looking out for us.

Here's a thought. What if Microsoft really can, and does, handle most network spikes without any noticeable delays or outages? We know when something fails, but how do we know when a Plan B does work? I'm not saying that's the case, but we wouldn't know if it was, would we?

I wonder how many of these situations have to occur before people stop relying on the "all in one" solution providers for productivity applications? While it's true that any e-mail server can fail, companies selling all-in-one solutions seem to be particularly prone to failures.

People once thought of BlackBerry e-mail as "rock solid," until it had a few outages lasting multiple hours at a time. As with Microsoft, its main selling point was reliability.

If a minor network blip can create a traffic storm for which they aren't prepared, what happens if connectivity is lost to an entire data center for several hours?

What happens if there's nothing wrong with MS data centers, but a major fiber cut renders a large geographic swath of customers unable to connect and, once the cut is repaired, the "traffic storm" takes out cloud e-mail for everyone?

Are they running their network capacity that close to 100%? I'd think they'd have Hoover Dam spillway-sized pipes and be paying for the potential to burst even higher if traffic warrants it.

The services likely run on thousands of virtual servers. From a network perspective, it sounds like they should segment the traffic better so they can shape it and contain the storm.
