Modern Service Management Blog Series Part 2: Monitoring

This is the second blog post from our blog series on Modern Service Management for Office 365. These insights and best practices are brought to you by Carroll Moon, Senior Architect for Modern Service Management.

In the initial blog post in this series, we framed the Office 365 Service Management discussion into five categories:

Monitoring and Major Incident Management…knowing if your users are impacted (regardless of root cause) and ensuring that the right things happen without heroics when users are impacted

Evergreen Management…being ready to successfully absorb the changes and to achieve business value from the evergreen service

Service Desk and Normal Incident Management…being ready to support Office 365 end-users leveraging the automation investments from the Office 365 service and being able to measure the call and escalation rates driven by your users on-premise and in the cloud

Administration and Feature Management…managing the workloads and configurations thereof through the Admin Portal as well as programmatic management

Business Consumption and Productivity…a higher order focus on the business to drive transformation using Office 365 capabilities to drive more business, more productivity, and lower costs

This blog post will focus on Monitoring and Major Incident Management for Office 365. For more thoughts on overarching cloud monitoring, read the eleven posts in this blog series that Microsoft wrote for ITIL.

Monitoring in the realm of Major Incident Management

Monitoring is a broad topic. For now, we will focus on “Availability and Performance Monitoring” for Office 365. Receiving monitoring alerts without a downstream action and workflow will not accomplish much, so we will focus on Availability and Performance Monitoring within the Major Incident Management workflows that it supports. We will use the following diagram to help in the discussion:

In the diagram above, we are representing users from the customer premises connecting to Office 365 via “A” through ExpressRoute and via “B”+“C” over the internet route. Also, many customers have users that connect directly from the internet in addition to connecting from customer premises.

Major Incident Management Scenarios and Portal Specificity

From a Major Incident scenario perspective, if we focus on “cloud only” rather than “hybrid” for simplicity, there are only three Major Incident scenarios:

I. (Customer has help desk calls OR end-to-end alerts) AND (Microsoft posts something for the customer’s tenant)

II. (Customer has help desk calls OR end-to-end alerts) AND (Microsoft has NOT posted something for the customer’s tenant)

III. (Customer does NOT have help desk calls OR end-to-end alerts) AND (Microsoft posts something for the customer’s tenant)
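The three scenarios are simple boolean combinations of two signals, so they can be expressed as a small classifier. The following is only a minimal sketch to make the logic concrete; the names `Scenario` and `classify` are hypothetical and are not part of any Office 365 tooling.

```python
from enum import Enum
from typing import Optional


class Scenario(Enum):
    """The three cloud-only Major Incident scenarios (hypothetical names)."""
    I = "customer signal AND Microsoft post"
    II = "customer signal AND no Microsoft post"
    III = "no customer signal AND Microsoft post"


def classify(help_desk_calls: bool,
             end_to_end_alerts: bool,
             microsoft_posted: bool) -> Optional[Scenario]:
    """Map the two customer-side signals and the Microsoft post to a scenario.

    Returns None when there is no customer signal and no Microsoft post,
    which means there is no Major Incident to manage.
    """
    customer_signal = help_desk_calls or end_to_end_alerts
    if customer_signal and microsoft_posted:
        return Scenario.I
    if customer_signal:
        return Scenario.II
    if microsoft_posted:
        return Scenario.III
    return None
```

For example, `classify(True, False, False)` returns `Scenario.II`: the help desk is getting calls, but nothing has been posted for the tenant yet.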

Now is a good time to speak to tenant specificity in the Office 365 Service Health Dashboard and Message Center. Most people do not know that the communications dashboards are tenant-specific. We do not have humans writing millions of paragraphs to publish uniquely to each tenant. Rather, we write one paragraph and publish it to all relevant, possibly impacted tenants. That is why we have an authenticated dashboard experience. If we have the admin log in, we know who the admin is. If we know who the admin is, we know the tenant. And if we know the tenant, we know the capacity that the tenant’s users depend upon. Thus, we can direct communications to the appropriate tenants as necessary. Our systems allow us to post to a single tenant, to every tenant on the planet, or more likely, to a subset of tenants. For example, we may get an alert that tells us “based on statistics, we know there is Outlook-connectivity impact for some North America users.” In that scenario, we might automatically post that we are investigating Outlook-connectivity issues to all tenants with users in North America so the customers can get in front of any Help Desk volume and so the IT Pros can notify their management quickly. Moments later, as more internal telemetry fires, we might know that the impact is limited to a particular unit of capacity. At that point, we would update the post to reflect impact only to the tenants who have one or more users on that particular capacity. Those tenants would continue to see the Incident, but the other tenants in North America would then see the issue as a “false positive”.
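To illustrate the scoping example above, here is a minimal sketch of how a post could start broad (all tenants with users in a region) and then narrow to the tenants on one unit of capacity. Everything in it is an assumption made for illustration: the `Tenant` type, the tenant names, the region code, and the capacity unit labels are hypothetical and do not describe how the Office 365 service is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    # Hypothetical model of a tenant, for illustration only.
    name: str
    user_regions: frozenset    # regions where the tenant has users
    capacity_units: frozenset  # units of capacity that host the tenant's users


def tenants_in_region(tenants, region):
    """Names of tenants with at least one user in the given region."""
    return {t.name for t in tenants if region in t.user_regions}


def tenants_on_capacity(tenants, unit):
    """Names of tenants with at least one user on the given unit of capacity."""
    return {t.name for t in tenants if unit in t.capacity_units}


tenants = [
    Tenant("contoso", frozenset({"NA"}), frozenset({"unit-1"})),
    Tenant("fabrikam", frozenset({"NA", "EU"}), frozenset({"unit-2", "unit-7"})),
    Tenant("tailspin", frozenset({"EU"}), frozenset({"unit-7"})),
]

# First alert: impact for some North America users, so post broadly.
initial_scope = tenants_in_region(tenants, "NA")

# More telemetry: impact is limited to one unit of capacity, so narrow the post.
narrowed_scope = tenants_on_capacity(tenants, "unit-2")

# Tenants that saw the first post but are not on the impacted capacity
# would now see the issue as a false positive.
false_positive = initial_scope - narrowed_scope
```

In this made-up data set, both North America tenants see the initial post, and only the tenant with a user on the impacted unit continues to see the Incident after the update.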

Major Incident scenario “I” is a fairly cut-and-dried scenario. In that case, the customer knows they have impact end-to-end and Microsoft has published a corresponding incident in the dashboard. The customer workflow would likely be to give the help desk a talk-track, to stand up automated voice response to deflect the help desk calls, to notify senior management, etc.

Major Incident scenario “II” is where the customer is getting help desk calls or end-to-end alerts, but Microsoft has not posted anything for the customer tenant [yet]. In this scenario, it could be a Microsoft issue that has not posted yet (in this case, soon, we will let you “tell us about issues” quickly from the admin portal). It could be a customer-side issue. Or it could be an issue in between (e.g. an Internet Service Provider issue). In this scenario, the customer would likely stand up an Incident bridge on their side to begin troubleshooting the scope and root cause of the issue. The customer would likely give their help desk a heads up, and they would likely engage senior management. The customer would pull in Microsoft support when their triage process determines that it is appropriate.

Major Incident scenario “III” is also fairly simple. In that case, there are no end-to-end alerts or user calls to the help desk, but Microsoft has posted something for the customer tenant. In that case, it could be:

A false positive (per the scope example above)

A real issue for a feature that the customer does not care about at the moment. For example, we may post a Service Incident for “the ability to assign licenses” and the customer is not assigning licenses right now, so it is not an issue. But another customer might be in the middle of massive mailbox migrations, so license assignment is very important to them at that moment.

A real issue for real users but not enough to trigger end-to-end alerts or help desk calls. Perhaps we post that “1% of emails are delayed up to 2 minutes”. In that example, the impact is probably not enough to make your end users call the help desk, nor is it severe enough to make your end-to-end monitoring fire, but the impact is real nonetheless. Or perhaps only one of the customer’s users is on a particular unit of capacity that is actually impacted. If only that user is on the capacity, the test account used for end-to-end alerts would not be impacted. And if that user is on vacation, she will not call the help desk to report the impact. Recent improvements in providing user counts for impact in the Service Health Dashboard are intended to help with this scenario; note screenshot below:

In Major Incident scenario “III”, the customer workflow is likely to give the help desk a talk-track, to ask the help desk to be on high alert and to page the appropriate team if they start receiving calls about the issue, and to email senior management with a heads up as a safety precaution.

Monitoring Scenarios

In support of the Major Incident scenarios, there are six core monitoring scenarios that we need to discuss (we will add more scenarios over time):

A) Does Microsoft think my tenant is impacted (Microsoft-side)?

B) Does Microsoft think that I need to take action to get healthy or to stay healthy with my tenant (Customer-side)?

C) Does Microsoft think that I need to be aware of an upcoming release for my tenant? NOTE: we will discuss this bullet more in the forthcoming Evergreen Management blog post

D) Does Microsoft think that I need to be aware of general Service Management information for my tenant?

E) Is AAD Connect and/or ADFS working well on both ends of the service?

F) Are the Capabilities that my users depend on working well end-to-end?

Scenario A’s information is available via the Service Health UI in the Admin Portal. It is also available via the Office 365 Service Communications API under the “Service Incident” class. There is also an Office 365 Mobile Admin app that allows for Push Notifications, and there is a SCOM Management Pack for Office 365 that pulls the relevant information from the Service Communications API. Finally, per recent announcements, soon we will let you sign up to “stay informed via your preferred channel” for Service Health information via text or email.

Scenarios B through D are all available in the Message Center under the “Prevent or Fix Issues”, “Plan for Change”, and “Stay Informed” categories respectively. As with Service Incidents, Message Center information is available programmatically through the Office 365 Service Communications API using the “Message” class with filters for each category.
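As a rough illustration of filtering by category, the sketch below routes already-retrieved Message Center items to monitoring scenarios B through D. It does not call the Service Communications API. The dictionary shape, the `category` and `title` field names, and the sample messages are assumptions made for this example, so check the API documentation for the actual schema before relying on any of them.

```python
# Mapping from monitoring scenario to Message Center category, per the text above.
SCENARIO_CATEGORY = {
    "B": "Prevent or Fix Issues",
    "C": "Plan for Change",
    "D": "Stay Informed",
}


def messages_for_scenario(messages, scenario):
    """Return the messages whose category matches the given scenario letter.

    `messages` is assumed to be a list of dicts with a "category" key
    (a hypothetical shape, not the API's documented schema).
    """
    category = SCENARIO_CATEGORY[scenario]
    return [m for m in messages if m.get("category") == category]


# Made-up sample data for illustration only.
sample_messages = [
    {"title": "Action required on a connector", "category": "Prevent or Fix Issues"},
    {"title": "Upcoming feature release", "category": "Plan for Change"},
    {"title": "General service information", "category": "Stay Informed"},
    {"title": "Another upcoming change", "category": "Plan for Change"},
]

plan_for_change = messages_for_scenario(sample_messages, "C")
```

A customer could feed each filtered list into a different workflow, for example paging an operations team for scenario B while sending scenario C items to whoever owns release readiness.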