Microsoft Offers Explanations for Lync and Exchange Service Outages

Microsoft provided a somewhat more detailed public description of its Office 365 service outages that occurred earlier this week.

Rajesh Jha, corporate vice president for Office 365 engineering, described the two separate incidents, which affected Lync Online users on Monday (June 23) and Exchange Online users on Tuesday (June 24). The outages affected only Microsoft's North American datacenters, and the problems that caused them have since been fixed, he explained in a Microsoft forum post.

In the Lync Online incident, some users in North America couldn't log in to the service. Microsoft fixed that specific log-in problem "in minutes," Jha said, but "the ensuing traffic spike caused several network elements to get overloaded, resulting in some of our customers being unable to access Lync functionality for an extended duration." That extended duration appears to have been a good part of the working day on June 23, according to a chronicle kept by veteran Microsoft reporter Mary Jo Foley.

The Exchange Online outage also seems to have started as a small problem that escalated after it was detected. Jha explained that a directory partition stopped responding to authentication requests, which caused "a small set of customers to lose email access." However, the problem then spread to Microsoft's broader e-mail traffic flow, and many Exchange Online users reported being unable to send or receive e-mail. Jha said that the initial Exchange Online failure led to an "unexpected issue":

Unfortunately, the nature of this failure led to an unexpected issue in the broader mail delivery system due to a previously unknown code flaw leading to mail flow delays for a larger set of customers. Our recovery strategy was two pronged: 1) We partitioned the mail delivery system away from the failed directory partition and 2) directly addressed the root cause for the failed directory partition. In addition to fixing the root cause trigger, we are working on further layers of hardening for this pattern.

The Exchange Online problem persisted through most of the day on June 24. Jha also noted that the Service Health Dashboard, which provides Office 365 service uptime reports to subscribers, had a problem with its "publishing process, meaning not all impacted customers were notified in a timely way." He said that the problem with the Service Health Dashboard has "since been addressed."

Microsoft plans to provide more details about the outages to its customers via a "post-incident report," which will appear in the Service Health Dashboard, Jha said. Microsoft doesn't have a publicly accessible portal showing its Office 365 service health, so much of the news about the outages on Monday and Tuesday was initially relayed through Twitter posts.

Microsoft offers a "three nines" or 99.9 percent uptime service level agreement as part of its Office 365 business plans. If Microsoft fails to meet 99.9 percent uptime in a given month, the subscriber may be eligible for a service credit. However, the subscriber has to file a claim with Microsoft to get the credit. The service credit is calculated as a percentage of the monthly service fees, which is returned to the customer according to how far uptime fell short. Microsoft shows those uptime percentages and corresponding service credits in the following table:

A 99.9 percent uptime translates to about 43 minutes of downtime per month, or nearly nine hours of downtime per year. Microsoft's outages on Monday and Tuesday lasted perhaps six hours and nine hours, respectively, according to press reports.
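The downtime arithmetic can be checked with a short Python sketch. It assumes a 30-day month and a 365-day year. The function names are illustrative and are not taken from Microsoft's service level agreement. The six-hour and nine-hour outage lengths are the rough press estimates cited above, not measured figures, and the sketch does not attempt to reproduce how Microsoft itself measures uptime for credit purposes.

```python
def allowed_downtime_minutes(uptime_percent: float, period_hours: float) -> float:
    """Minutes of downtime permitted in a period at a given uptime percentage."""
    return period_hours * 60 * (1 - uptime_percent / 100)


def achieved_uptime_percent(downtime_hours: float, period_hours: float) -> float:
    """Uptime percentage for a period, given the total hours of downtime in it."""
    return 100 * (1 - downtime_hours / period_hours)


MONTH_HOURS = 30 * 24   # assumption: a 30-day month
YEAR_HOURS = 365 * 24   # assumption: a non-leap year

# Downtime budget implied by a "three nines" commitment.
monthly_budget_min = allowed_downtime_minutes(99.9, MONTH_HOURS)
yearly_budget_hours = allowed_downtime_minutes(99.9, YEAR_HOURS) / 60

# Monthly uptime if a single outage of the estimated length were the
# only downtime that month (hypothetical; durations are press estimates).
lync_uptime = achieved_uptime_percent(6, MONTH_HOURS)
exchange_uptime = achieved_uptime_percent(9, MONTH_HOURS)

print(f"Monthly budget at 99.9%: {monthly_budget_min:.1f} minutes")
print(f"Yearly budget at 99.9%: {yearly_budget_hours:.2f} hours")
print(f"Uptime with one 6-hour outage: {lync_uptime:.2f}%")
print(f"Uptime with one 9-hour outage: {exchange_uptime:.2f}%")
```

Under these assumptions, either outage alone would use up many times the monthly downtime budget, which is why affected subscribers may qualify for a service credit.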