Update on recent customer issues…

I lead the engineering organization responsible for Office 365. My team builds, operates, and supports our Office 365 service, and over the last few days we have not satisfied our customers’ needs. On Thursday, November 8, and today, November 13, we experienced two separate service issues that impacted customers served from our data centers in the Americas. All of these issues have been resolved and the service is now running smoothly. These incidents were unique to the Office 365 Exchange Online mail service and were not related to any other Microsoft services.

I’d like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused. We know that email is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider. We will provide a post-mortem, along with additional updates on how our service level agreement (SLA) was impacted, and we will be proactively issuing a service credit to our impacted customers.

I also want to provide more detail about the recent issues.

The first event occurred on November 8 from 11:24 AM to 7:25 PM PST. This service incident resulted in prolonged mail flow delays for many of our customers in North and South America. Office 365 uses multiple anti-virus engines to identify and clean virus messages from our customers’ inboxes. One of these engines identified a virus being sent to customers, but began to exhibit significant latency while handling the messages. Compounding the issue, our service was configured to allow too many retries and too long a timeout for these messages. Given the flood of these specific emails into part of our service capacity, this handling caused a significant backlog of valid email in the affected units. We resolved the issue by deploying an interceptor fix that sends the offending messages directly to quarantine. Going forward, we are instituting multiple further levels of defense. In addition to fixing the engine’s handling, we have instituted more aggressive thresholds for deferring problem messages. We have also built and implemented better recovery tools that allow us to remediate these situations much faster, and we are adding architectural safeguards that automatically remediate issues of this general nature.
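The deferral logic described above (bounded retries with a short per-attempt budget, after which a problem message is quarantined rather than allowed to block valid mail behind it) can be sketched roughly as follows. This is a minimal illustration, not the actual Office 365 implementation; the `ScanPipeline` class, its threshold values, and the `scan` callback are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

@dataclass
class ScanPipeline:
    """Hypothetical mail-scanning pipeline with aggressive deferral.

    The threshold values are illustrative: a real service would tune
    them so a pathological message cannot stall the delivery queue.
    """
    max_retries: int = 2          # aggressive: defer quickly
    timeout_secs: float = 1.0     # small per-attempt scan budget
    delivered: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)

    def process(self, message, scan):
        """Run `scan(message, timeout)`; quarantine after repeated timeouts."""
        for _ in range(self.max_retries):
            try:
                scan(message, self.timeout_secs)
                self.delivered.append(message)
                return "delivered"
            except TimeoutError:
                continue  # engine too slow on this message; retry briefly
        # Defer the problem message instead of blocking valid mail behind it.
        self.quarantined.append(message)
        return "quarantined"
```

The key design point is that the retry count and timeout together bound the worst-case delay any single message can impose on the rest of the queue, which is exactly what the original configuration failed to do.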

From 9:08 AM to 2:10 PM PST today, November 13, some customers in North and South America were unable to access email services. The service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service. This morning, the Office 365 team was performing planned, non-impacting network maintenance by shifting some load out of the data centers under maintenance. During this standard process, we experienced a ‘gray’ failure of some active network elements: the elements failed, but did not alert us to their failure. Additionally, we have an increasing load of customers on-boarding to the service. These three issues in combination degraded customer access to email services for an extended period of time. By 10:42 AM PST, remediation work was underway to balance users onto healthy sites, broaden the service access points, and remediate the failed network devices. At 2:10 PM PST all services were fully restored. A significant capacity increase was already well underway, and we are also adding automated handling of these gray failures to speed recovery time. Across the organization, we are executing a full review of our processes to proactively identify further actions needed to avoid these situations.
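The ‘gray’ failure pattern described here, where an element keeps responding and so never trips its own alerts, is commonly caught by end-to-end probing that treats service-level violations as failures even when the device reports itself healthy. The sketch below illustrates that general idea only; the function name, thresholds, and labels are assumptions for the example, not details of the Office 365 monitoring system.

```python
import statistics

def classify_element(probe_latencies_ms, error_rate,
                     latency_slo_ms=250.0, error_slo=0.01):
    """Classify a network element from end-to-end probe results.

    A hard failure produces no successful probes at all. A 'gray'
    failure is an element that still answers (so device-level alerts
    stay quiet) but degrades traffic; treating SLO violations as
    failures lets automation drain it anyway. Thresholds are
    illustrative, not real production values.
    """
    if not probe_latencies_ms:
        return "failed"            # hard failure: nothing got through
    # statistics.quantiles with n=20 yields 19 cut points; index 18 ~ p95.
    p95 = statistics.quantiles(probe_latencies_ms, n=20)[18]
    if error_rate >= error_slo or p95 > latency_slo_ms:
        return "gray"              # alive, but violating the SLO
    return "healthy"
```

Feeding a classifier like this into the load balancer is one way to automate the remediation described above: a "gray" element is drained and its users rebalanced onto healthy sites without waiting for the device itself to raise an alarm.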

As I’ve said before, all of us in the Office 365 team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business – that’s not acceptable. I want to assure you that we are investing the time and resources required to ensure we are living up to your – and our own – expectations for a quality service experience every day.

As always, if you are experiencing any service issues, we encourage you to check the Service Health Dashboard for the latest information or to contact our customer support team. Customer support is available 24 hours a day via Service Requests submitted from the Office 365 Portal.

Join the conversation

As a reseller, I would like a place where I could check the status (which, as Keane mentions, isn’t always promptly reported) without having to log in. I am subscribed to the "Office 365 Service Health RSS Notifications" RSS feed, but every time I want to view an article for more details, I am stuck having to log in. It would be far more convenient if I could quickly pull up some information for my customers calling in wondering whether it’s a problem with the service or something more local to their network or PC.

That is a great piece of feedback, particularly related to an easier method for service incident notification. The opportunity we have with a service is to institute some of these capabilities as part of regularly-scheduled updates, and we will definitely take your feedback into our resourcing and feature planning. Thanks for taking the time to post a reply.

I truly appreciate the transparency reflected in this post. I hope it is perpetuated throughout Microsoft. O365 is a journey that we are on together, and this transparency makes me, as a customer, feel more like a partner in this journey.

I appreciate the fact that you have provided an update with an actual explanation. However, that doesn’t excuse the terrible communication and process failure that occurred during the outage. In addition to addressing the technical issues that caused both of these outages, Microsoft also needs to address the monitoring and communication processes.

Also, your times for the November 13 outage are incorrect, as I was experiencing mail problems as early as 8:45 AM PST; in fact, I reported it to Microsoft before 9 AM, and our engineers were on hold with premier support for over 15 minutes by then because the support queue was overwhelmed. I’m curious to know how you identified 9:08 as the start of the incident.

Please see my reply above to Mr. Grivich. We will definitely evaluate what seems to be a miss in the timeliness of our service communication as part of the service incident post-mortem. Thanks for taking the time to provide your perspective.

I agree with the previous commenters: by following #Office365 on Twitter, I was able to gather information that there was a problem over 2 hours before Microsoft reported that they had started investigating. I think it would save people a lot of time and trouble if MS used social media more effectively to communicate issues rather than relying on the current reporting system.

Thanks for the feedback. As stated above, while the primary mechanism for communicating issues to our customers has been and will continue to be the Service Health Dashboard, we are always trying to improve other channels such as Twitter and our own community. Twitter is an excellent mechanism for providing notification, and we will continue to spend time improving our speed of response on these kinds of issues.

Rajesh … It’s nice that someone at MS decided to start communicating about the issue, but your start time is way off. Our company’s outage started well before 12:08 EST; in fact, I think I had already sent my request for SLA reimbursement by then. I know that I had started troubleshooting the issue around 11:20 EST. With your start time being so far off, I question whether your engineers really found the root cause or just poked around and hoped that they fixed it.

I’m not Rajesh, but I thought I would reply to your post. Thanks for taking the time to comment. Based on the feedback in this comment thread and in the communities, we are definitely reviewing our policies and procedures for posting to the Service Health Dashboard, and the speed at which we do so, to represent the situation with the service accurately. We’re convinced that we’ve found the root cause and remediated the issue, but we remain vigilant for any possible service impact.

This is a great post. The recognition of ‘gray’ failure is intriguing.

Thanks for the insight into how massive scaling of connected systems requires us to take off the blinders of digital certainty and give serious attention to contingent reality. We need more accounts like this.

Also, I commend the trustworthiness exhibited by the care reflected in your account and in how the breakdown is dealt with and accounted for.

It is intriguing that the November 2012 issue of Communications of the ACM, which I just read through, features system resiliency and, though not by name, ‘gray failures’ in file systems.

PS: I see by the earlier comments that another advantage to this level of transparency is gaining valuable feedback on how to improve notification and providing ways of easily knowing system status. That’s great to see.

Mr. Jha, I have to agree with the other comments: it’s nice to have the dashboard, but it’s nearly useless when there is a massive problem and everything is green for over an hour. Even something like "huh, we just had a 200% spike in calls for Exchange in the past 10 minutes; maybe we should at least put up a ‘there might be a potential problem’ notice" would keep so many of us from flooding the call center. As a fellow engineer, I believe the first step after identifying a problem should be notifying people of it.

Also, is the "increasing load of customers on-boarding to the service" related to the US VA announcement earlier that day? I sure hope y’all can deal with growth; otherwise this is going to be a limiting issue for O365.

Finally, what sort of network issue that was "gray" from 9:08 to 10:42 results in a 3+ hour *outage* for everyone as a ‘failover’? You really need to examine your failover triggers and weigh their thresholds against the remediation time.

I would agree that historically the Health Dashboard is way behind the actual event (almost 90 minutes on 11/13). Then it was updated twice with this message: "A few users are unable to access their email at this time." Both incidents used the words "a few users". When you have an issue, it rarely affects only a few users.

It’s hard not to take issue with, or "wordsmith", the message, but it seems the more appropriate words would have been "some users" or "many users". "A few users" makes it sound like a very minor issue, and ninety minutes in it was hard to believe it was only a few users. We are professionals and are held accountable for our decisions. One of those decisions was choosing Exchange Online. We deserve answers and updates we can work with, not word that a few users might be having issues.

Communication during a service incident or outage is critically important. We (like others here) have clients contacting us immediately trying to figure out what is going on (is it their computer, their ISP, the Internet, MS?). It was great to find this detailed explanation here, but it was disappointing that #Office365 on Twitter only showed ‘The issue has been resolved and the service is now restored’ at 4 PM PST on Tuesday. A link back to this report would have been helpful and would have allowed us to provide a more technical explanation of the issue.

This is great feedback that we will consider for future service incidents that warrant proactive Twitter posts. I appreciate you taking the time to provide your perspective. One of the challenges to coherent communication is coordinating all of the different communication channels in the midst of a significant service incident, particularly matching the level and depth expected by customers and partners. We try to use the Twitter handle as an ’emergency broadcast system’ to alert the broadest set of community members to a potential issue. Along with that, we try to maintain an active presence in our community forums and update the Service Health Dashboard regularly. As others have noted, we need improvement in the timeliness of our SHD updates, and we’ll work hard in future events to fulfill that requirement. We definitely recognize that it’s critical to keep customers and partners well-informed in the case of any service issue, so we’ll take your feedback as concrete recommendations on ways to improve.

The amount of information on Twitter was insufficient and not timely. But what is worse is that Microsoft’s marketing machine uses that same handle to advertise, so the signal-to-noise ratio is inadequate for it to serve as a telegraph of a problem. Three tweets over the duration of the outage is hardly overwhelming anyone with actionable information.

During an outage, we, the management of the companies that have entrusted Microsoft to run this environment, need real-time information that we can use to manage the outage within our respective companies. Unfortunately, that’s not the first time you’ve heard that from me.
-J

I am amazed you are holding to the falsehood that the issue started at 9:08 AM, which matches the 12:08 PM Eastern start of failures on the web site. Failures really started about 10:30 AM Eastern Time, with my first call.

After checking the site around 10:45 and then spending another 45 minutes tracking the problem, I called the 800 number (about 11:30 am Eastern) and received a message that there was an issue and to log a request on the web site.

Of course, being obedient, I went to the web site, and it said I should check the status. The status still did not list a problem, though it was obvious there was an issue. You wasted 90 minutes of my time because of your lack of professionalism.

Now I come here and what I see is the same misinformation continued. My complaint is about communication, not about having a problem. I am glad to see so many others offended by this. Maybe you will get a clue how important this is to those who really support the end users.

Hi Rajesh, with all due respect to your last paragraph, the Service Health Dashboard is the absolute last place you want to look when you see issues. In the case of the last 3 outages, the portal did not report any issue until well over an hour (in one case, 2 hours) after we started experiencing the problem. MS can do a much better job of reporting problems to customers. Customers are wasting valuable time troubleshooting problems that MS is already aware of but has not yet reported. I’m sure MS was aware of the issue, because when I called the 800 support number the technician already knew about the problem. If they are already aware of the problem, then put it on the portal and save us the 30-60 minutes of waiting on the phone! Thanks.

My name is Morgan Cole, and I’m a Director in our Customer Experience team for Office 365. First, thanks for taking the time to write up the comment and feedback on our service communication. As Rajesh states above, we’re doing a post-mortem of the issue, which will include a detailed walkthrough of the customer impact and timing of our communications. We value your perspective, as it will help us to hone our ability to respond with appropriate speed. We understand that it is critical for our customers to be as fully informed as possible during a service incident, and it is a consistent goal for us to continue to improve the timeliness and specificity of our communications. While the primary mechanism for communicating issues to our customers has been and will continue to be the Service Health Dashboard, we are also always trying to improve other channels such as Twitter and our own community. Again, many thanks for taking the time to post a reply and help us to gain greater insight on customer experience improvements.

Great, detailed note. It’s stellar to see ownership, responsibility, and a commitment to improve. One area to improve: the SHD doesn’t display issues. Or perhaps it’s better to say it has too big a bias toward showing green when it’s really yellow/red, as I’ve seen it "green" a couple of times while the general outages you reference were going on; yesterday around 11 it showed green, for example. With public acknowledgement like this, I am sure this will get fixed, and smart engineers are digging into it right now :)

What frustrates me the most is that when the dashboard finally does show that there is an issue with the service, it always says that "A few users are experiencing a problem with…"

Amazingly, my clients are ALWAYS among those "few users".

These problems with Office 365 have occurred probably eight (8) times this year. You’re going to need to spread your 99.9% SLA over a 50-year period to hope to achieve it.

We have never managed a mail system with anywhere near the downtime Office 365 provides. Our clients don’t care about the Cloud or recurring billing. They want a service that reliably works. This has not proven to be it.

Something is very wrong with the service health portal. It seems to take the Office 365 user forum boiling over before the service health is updated. And then we see ‘some’, ‘few’, ‘potential issue’: that, after dozens of notes showing up reporting the problem. My company switched to Office 365 on 9/22, and since then we’ve had nearly a half dozen outages of some type. When can we expect an improvement to the service health display and to reliability?

Rajesh, either your people are lying to you or you are lying to us. This outage started before 9:00 AM Pacific; we started fielding user complaints at 6:30 AM Pacific. I’m guessing our SLA reimbursements will be based on this inaccurate start time. If your people had updated the Service Health Dashboard properly, I could have saved a lot of time troubleshooting my ADFS environment. Lastly, I am offended by your staff’s statements that a "few users" were impacted. At least two Exchange clusters were impacted; a few thousand users is my best guess. I can deal with an outage. What really pisses me off is being lied to and having my vendor downplay the severity!

Hello Rajesh and Office 365 team,
This post is helpful. Here are some follow-ups that would greatly benefit us and help us communicate to our end users to increase their (and our) confidence in Office 365:
1. Are there threshold limits on the number of customer calls raised before you recognize and acknowledge an issue and send out a blast to your Office 365 customer system admins notifying us about the outage? For example, we had 3 or 4 IT resources spending 2-3 hours per incident troubleshooting this problem. That effort could have been saved if we had known about the issue, received a broadcast, and in turn notified our internal customers. Please also send that email notification to a non-Office 365 account.
2. Since this was a significant outage affecting several customers, will you be hosting a live meeting and answering questions submitted before or during the conference call?
3. There is not a single place, either on the blog site or the Portal, to check the status of the system unless you are an administrator. And I heard from my admins that there was no blog post or notification until much later. Would you be able to publish the metrics, or at least outages, in a timely manner in multiple places (blog, portal, etc.)?
4. Do you have tools to monitor the particular servers affected and to let us know the specific user population facing an issue? For the November 13 outage, it seems only folks on specific servers were affected, and from what we can tell our user base is spread across about 100 servers.

I would like to reiterate the poor communication Microsoft provides with the Office 365 offering from an operational standpoint. The dashboard is never updated in a timely manner, nor does it contain accurate times for the outage. This is not an isolated incident; it has become commonplace. I’ve grown tired of the constant apologies and need to see real change. If my customer service, communications, and sense of urgency were as poor as the O365 operations team’s, I would be unemployed.

First, let me say that I am truly sad to report that, as a longtime Microsoft Partner, it makes me very irritated when Microsoft offers explanations as compensation for what we are calling the “BLACK WEDNESDAY INCIDENT” of Office 365. I can truly tell you: you just don’t get it. As a result, the clients I work for, and who pay my bills, will be leaving Office 365 very soon.

1) To review: any outage that results in loss of communication should mean a complete credit for the month of service, or better. Microsoft needs to feel it. Partners and clients in this business lose more than double, because of the time spent trying to find out the problem and then working around a fix for it. Maybe the IT folks at Microsoft should think of it like a person on a breathing machine: if the machine stops for just a few minutes, he dies. No outage is acceptable. This is MICROSOFT; you can’t tell me you don’t have the resources to have Exchange Server do a confidence check on the health of the system, warn us of trouble, and then apply a fix.

2) You need to understand that, as partners, we deal with the clients every day. If there is a problem, we fix it; that means a workaround is put in place that minute, instead of reasons for the outage several days later. Don’t get me wrong, it’s great as an IT person to know why the problem happened. But because of this outage, and many more brownouts not spoken about, many of my clients are already going to another provider for the service, as they needed some way to communicate. In my line of work that means a loss of revenue and of confidence in my ability to do business. If the outage is treated as any less than that, then you just don’t get it. Please think of it as if you were supplying air to breathe. A message for a company trading stock, or for a notary to go to a loan signing, or from a client telling me of an issue with Office 365, going unseen because of the outage will not be accepted.

Wake up, MICROSOFT: this was yours and my “BLACK WEDNESDAY INCIDENT”, because this message comes from my clients! Remember that.

The transparency is couched in spin: "we have not satisfied our customers’ needs" means there was no stinkin’ way to get your email, and on the East Coast it came back up after you had left the office.
The only question about meeting service level agreements is whether there will be another outage before the end of the month.

Thanks for the update, but you seriously need to spend this weekend looking at how other large organizations communicate with customers during an outage. Many of us are MSPs, and we had absolutely zero information to give our customers.

Go look at the tremendous job utilities, wireless providers, and data centers (Peer1, Squarespace, etc.) did handling outages during the recent East Coast storm. They all used Twitter, blogs, Facebook, etc. to keep us up to date and informed.

A post-mortem has very little customer value; I simply assume you learned something and it won’t happen again. Give me information during the outage.

As an admin to our O365 E1 environment, a few thoughts around the incidents and this posting:

1) The detailed explanation and transparency here as to what happened is appreciated.
2) The Service Health page is typically the first place I look when I hear of a potential platform issue, not so much for the actions being taken to resolve, but to correlate any possible maintenance activity around the timeframe in question.
3) From an administrative standpoint, our company would benefit more from timely, accurate data on the Service Health pages as opposed to messages broadcast on other channels (Twitter, RSS, etc).
4) Twitter as an abbreviated emergency broadcast channel could be improved by a dedicated Service Health account (to distinguish it from advertising as another poster points out), and/or hashtags to match the ticket number (#SP2598 for example) so we could correlate a message stream directly with the incident in question.

Looks like you’ve got another outage on your hands, Rajesh. The part that ticks me off is that I can’t even see your stupid Service Health Dashboard because I’m not an O365 user. However, my hundreds of in-house mail users get upset with me because they get NDRs trying to send a simple message to one of the poor saps who signed up for your service.