Intercepting failures with time pattern recognition

You have both hourly and daily data for service component availability, performance, events and alerts gathered in SCOM. Since it is just heaps of data, it can be difficult to tell whether any of the logged events follow some kind of time pattern and whether any of them relate to one another.

Recognizing recurrences is very beneficial when managing sparse and, at first sight, unrelated events. After grouping and filtering all events by an identified time pattern, we are able to find correlations and (hopefully) causes of specific incidents.

In this blog post we will review the built-in SCOM reports for analyzing this kind of data and will also show you how adding some extra capabilities makes life much easier.

Note:

To keep the blog post to a more manageable size, we will focus on alert monitoring from here on. The same concept can be applied to any type of data that is logged by Operations Manager.

Alert reporting in Operations Manager

There is one main go-to resource within SCOM reporting when it comes to alert reporting: the “Most Common Alerts” report located in the Microsoft Generic Report Library.

By choosing the time range you get an overview of the top alerts created from the included management packs in that time period. Here we can start investigating which alerts are most common in our environment.

The first question everyone asks themselves when seeing this report is: what entity caused all of these alerts? The report states one fact but does not really help us move much further in our investigation.

Veeam Alert Statistics Report
This report extends Microsoft System Center reporting by adding two modes of analysis, per rule/monitor (Alert Statistics mode) and per object (Troublemakers mode), as well as two levels of detail for every mode, a comparison of time intervals, management pack selection and an interactive table.

Veeam Alert History Report
This report provides a detailed alerting history and helps to identify the most affected infrastructure objects. It helps you review the health state for the specified time period and details how many alerts with different severity and priority levels were raised on each day of the reporting period.

To gain better insight into the generated alerts, we are going to use the Veeam Alert Statistics Report, which can be displayed in two different ways, as shown below:

Alert statistics

Troublemakers

Even though these reports do extend your possibilities to figure out where the alerts originated, you still cannot use them to understand when the next alert can be expected, or whether the alerts occur in some kind of predictable time pattern.

When do things happen?

By filtering all alerts generated by a service (in this case a distributed application), we get a good overview of what's happening around that particular service, as shown below (the screenshot is from Excel):

Besides knowing that a certain number of alerts were generated, and which management pack or managed entity generated them, we could also make use of some additional knowledge: what is the most common time of day for alerts to be generated? Or which weekday stands out from the rest when it comes to problems in our environment?

Here is an example of two extra charts we like to use in our reports. The first one shows alert counts by hour of day. In this example it is quite clear that most trouble happens at 5 am, so we have narrowed down our troubleshooting range from 24 hours of the day to a single one.
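
For readers who want to build the same hour-of-day breakdown outside Excel, here is a minimal Python sketch. It assumes the alert timestamps have already been exported as text; the sample values, the timestamp format and the function name alerts_by_hour are all hypothetical and not part of SCOM or the reports above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert timestamps for illustration only; in practice these
# would come from an export of your Operations Manager alert data.
alert_times = [
    "2024-03-04 05:02", "2024-03-04 05:17", "2024-03-04 13:40",
    "2024-03-05 05:05", "2024-03-06 05:11", "2024-03-06 22:30",
]


def alerts_by_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return a 24-item list with the number of alerts per hour of day."""
    counts = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    # Keep hours without alerts as zeros so the result charts as a full day.
    return [counts.get(hour, 0) for hour in range(24)]


hourly = alerts_by_hour(alert_times)
busiest_hour = max(range(24), key=lambda hour: hourly[hour])
print(busiest_hour, hourly[busiest_hour])  # prints: 5 4
```

The list has one slot per hour, so it can be fed directly into any bar chart.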

We can also do the same thing with day of week to see if there is any pattern on a daily basis as shown in the example below.

These two charts tell us that something that triggers an alert happens every weekday at 5 am. A much better basis for investigation than a list of events, don't you think?
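
The weekday view, and the combination of weekday and hour, can be sketched the same way. This is only an illustration under assumed data: the timestamps are invented, and the helper names alerts_by_weekday and alerts_by_slot are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert timestamps for illustration only.
alert_times = [
    datetime(2024, 3, 4, 5, 2),    # Monday
    datetime(2024, 3, 5, 5, 5),    # Tuesday
    datetime(2024, 3, 6, 5, 11),   # Wednesday
    datetime(2024, 3, 7, 5, 1),    # Thursday
    datetime(2024, 3, 8, 5, 9),    # Friday
    datetime(2024, 3, 9, 14, 30),  # Saturday
]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def alerts_by_weekday(times):
    """Count alerts per weekday, including days without any alerts."""
    counts = Counter(t.weekday() for t in times)
    return {name: counts.get(i, 0) for i, name in enumerate(WEEKDAYS)}


def alerts_by_slot(times):
    """Count alerts per (weekday, hour) pair to expose recurring slots."""
    return Counter((WEEKDAYS[t.weekday()], t.hour) for t in times)


by_day = alerts_by_weekday(alert_times)
by_slot = alerts_by_slot(alert_times)
for slot, count in sorted(by_slot.items(), key=lambda item: -item[1]):
    print(slot, count)
```

Counting on the combined (weekday, hour) key is a design choice: it separates a pattern such as "weekdays at 5 am" from one that merely has a busy hour and, independently, a busy day.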

What else happens at the same time?

At the start of this post we said that we would look only at alerts, and that is how we found that alerts are generated at 5 am every weekday. But what else happens in our landscape at the same time as these alerts occur?

This is the really awesome part of having all data in one separate data mart. We can now easily tell our report (the screenshot below is also from Excel) to show us only data generated during these specific time slots (i.e. 5am):

After setting the filter you can review events, performance and availability data related to the issue you are interested in. This way your never-ending list of alerts, events and whatever else is trimmed down to a manageable size and can be dealt with in minutes.

Even before you have dealt with the root cause of this problem, you can already predict that the next time you will have trouble is tomorrow at 5 am.
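
As a rough illustration of that filter, the sketch below keeps only the rows that fall into a chosen time slot, whatever kind of data they are. The record layout (kind, time, detail), the sample rows and the function name in_time_slot are assumptions made for this example, not the schema of the data mart.

```python
from datetime import datetime

# Hypothetical rows mixing several kinds of logged data.
records = [
    {"kind": "alert", "time": datetime(2024, 3, 4, 5, 2), "detail": "sample alert"},
    {"kind": "event", "time": datetime(2024, 3, 4, 5, 0), "detail": "sample event"},
    {"kind": "performance", "time": datetime(2024, 3, 4, 11, 15), "detail": "sample counter"},
    {"kind": "event", "time": datetime(2024, 3, 9, 5, 3), "detail": "weekend event"},
    {"kind": "performance", "time": datetime(2024, 3, 5, 5, 4), "detail": "sample counter"},
]


def in_time_slot(rows, hour, weekdays=range(5)):
    """Keep rows logged during the given hour on the given weekdays.

    Weekdays use datetime.weekday() numbering: 0 is Monday, 6 is Sunday.
    The default range(5) selects Monday through Friday.
    """
    return [
        row for row in rows
        if row["time"].hour == hour and row["time"].weekday() in weekdays
    ]


slot_rows = in_time_slot(records, hour=5)
for row in slot_rows:
    print(row["kind"], row["time"], row["detail"])
```

Because the filter only looks at the timestamp, events, performance samples and availability rows can all pass through the same function.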
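
Note that "tomorrow" only holds if tomorrow is a weekday. A small sketch of that reasoning, assuming the pattern is "every weekday at a fixed hour"; the function name next_expected and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def next_expected(now, hour=5, weekdays=(0, 1, 2, 3, 4)):
    """Return the next datetime matching the recurring slot after `now`.

    Weekdays use datetime.weekday() numbering: 0 is Monday, 6 is Sunday.
    """
    if not weekdays:
        raise ValueError("at least one weekday is required")
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # If today's slot has already passed, start looking from tomorrow.
    if candidate <= now:
        candidate += timedelta(days=1)
    # Skip days that are not part of the pattern, such as the weekend.
    while candidate.weekday() not in weekdays:
        candidate += timedelta(days=1)
    return candidate


# Monday 09:00 gives Tuesday 05:00; Friday 09:00 gives the following Monday.
print(next_expected(datetime(2024, 3, 4, 9, 0)))
print(next_expected(datetime(2024, 3, 8, 9, 0)))
```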

Conclusion

By adding new time attributes to your System Center Operations Manager data, you get the possibility to troubleshoot alerts more quickly and in a much more focused way than before.

Putting all this together results in a good overview of alerts from Operations Manager that you can consume, as in this example, with standard tools like Excel.