Machine learning algorithms together with SCOM for correlations between performance, alert, and event data

Introduction

This summer we had the opportunity to do some research on the new features built into SQL Server, in particular the more advanced analytical capabilities of SQL Server 2016 such as R.

The research explores the possibility of finding correlations between certain events, such as broken hardware or server unavailability, and deviations in performance data from related computer systems.

The main purpose and final goal is to automatically find the most probable causes of certain events, especially where there are no obvious connections and/or the number of candidate causes is far too large for manual analysis.

The analysis has taken place in a smaller environment containing simulated errors.

Performance, alert, and event data have been collected over a period of 30 days. Performance data from each point in time has been compared with other data from the same entity to measure its deviation.

This deviation ratio has been used to find patterns by filtering out data considered "not deviating". Some of the filtered data has shown patterns similar to those of certain alerts and events.

Calculating deviations over 30 days (7.7 million observations) has been possible in less than 3 minutes. The method should therefore be applicable to larger environments with reasonable execution times.

Ways to utilize and implement the method for automatic correlation detection are still to be found. However, this analysis is considered the first big step towards finding correlations buried in vast amounts of data.

Problem and solution description

Performance data has been collected from 1115 different entities. Each entity has on average 7.8 different counters.

Performance data should therefore be considered as data from a total of about 8700 variables that should be treated separately.

Each of these variables has also been summarized into MaximumValue, MinimumValue and AverageValue for every hour before being used in this analysis.
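The hourly roll-up described above could be sketched as follows. This is a hedged illustration in Python with pandas, not the actual SQL Server/R implementation used in the analysis; the column names (entity, counter, timestamp, value) are assumptions, not the real SCOM schema.

```python
import pandas as pd

# A tiny invented sample of raw performance observations.
raw = pd.DataFrame({
    "entity": ["srv01"] * 4,
    "counter": ["% Processor Time"] * 4,
    "timestamp": pd.to_datetime([
        "2016-06-01 10:05", "2016-06-01 10:35",
        "2016-06-01 11:10", "2016-06-01 11:50",
    ]),
    "value": [10.0, 30.0, 50.0, 70.0],
})

# Summarize each (entity, counter) pair into hourly max, min, and average.
hourly = (
    raw.groupby(["entity", "counter", pd.Grouper(key="timestamp", freq="H")])["value"]
       .agg(MaximumValue="max", MinimumValue="min", AverageValue="mean")
       .reset_index()
)
print(hourly)
```

Each of the roughly 8,700 variables would be rolled up this way before any deviation analysis is run.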

Detect deviations

A modification of 1.5·IQR (interquartile range) has been used to identify outliers. The boxplot below illustrates 1.5·IQR.

The box in the middle represents 50 % of all values from an arbitrary counter in one entity, where 25 % of all values are below the box and 25 % above it.

Thus, the top and bottom edges of the box represent the data's third and first quartiles. The IQR (interquartile range) is the distance between these quartiles.

The dashed line represents the distance 1.5·IQR from the box. Values outside the 1.5·IQR range are considered potential outliers and have been plotted separately as small circles. In our case, these are the interesting values.

For the sake of analysis, a ratio has been calculated for each observation in the performance data.

The ratio, called the Outlying Distance Ratio (ODR), has been calculated according to the formula: ODR = (distance to the closest edge of the box) / IQR. How ODR depends on the original value is illustrated below.

Some datasets may contain few (or no) values where ODR ≠ 0. A threshold ODR > |x| can be used when filtering out normal values.
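The ODR calculation and the threshold filtering above can be sketched as follows. This is a minimal NumPy illustration under the formula given in the text; the function name, sample data, and the 1.5 threshold are assumptions for the example.

```python
import numpy as np

def odr(values):
    """ODR = distance to the closest edge of the IQR box, divided by IQR.
    Values inside the box get ODR = 0."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Distance to the nearest box edge; zero for values inside the box.
    dist = np.where(values > q3, values - q3,
                    np.where(values < q1, q1 - values, 0.0))
    return dist / iqr if iqr > 0 else np.zeros_like(dist, dtype=float)

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
ratios = odr(values)
# Filter out "normal" observations, keeping only those beyond 1.5 * IQR:
outliers = values[ratios > 1.5]
print(outliers)
```

Because the result is a ratio rather than an absolute value, the same threshold can be applied across counters with very different scales.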

Correlations between deviations and events

When recurrent events occur at the same time every day, deviations with similar patterns can be found in the performance data. An example of such a correlation is illustrated below.

As shown in the illustration above, some logical disk counters increase abnormally during the extreme outliers. In this case, the SQL Server instance on the machine is running batch jobs, but the server does not have enough memory to handle them efficiently, forcing it to use the page file on disk and causing bottlenecks on the server.

This is just one example of how an automated analysis can loop through massive amounts of data to correlate different patterns within your datacenter.
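One simple way to check such a pattern match could be to see how often deviations fall within a time window of a known event. The sketch below is a hypothetical illustration; the timestamps and the 15-minute tolerance window are invented for the example and are not from the actual analysis.

```python
from datetime import datetime, timedelta

# Invented timestamps of detected deviations and of a recurrent event.
deviations = [datetime(2016, 6, 1, 2, 0), datetime(2016, 6, 2, 2, 0),
              datetime(2016, 6, 3, 2, 5)]
events = [datetime(2016, 6, 1, 2, 10), datetime(2016, 6, 2, 2, 5),
          datetime(2016, 6, 3, 2, 0)]

window = timedelta(minutes=15)
# Fraction of deviations that occur within the window of some event.
hits = sum(any(abs(d - e) <= window for e in events) for d in deviations)
overlap = hits / len(deviations)
print(overlap)
```

A high overlap for one counter but not for others would flag that counter as a candidate for the correlation the illustration shows.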

Execution Time

Daily calculation of deviating values requires short calculation time, even for big data sets.

While the execution time mostly depends on the number of observations, the number of measure points relative to the number of counters could also affect it.

The plot below represents the execution time depending on the number of observations, where the number of measure points and entities varies.

These test results have been used to estimate the execution time for larger datasets.

The table below shows the estimated execution time for an environment 2, 10 and 100 times larger than the environment used for this analysis (using observations from 7 days).

The estimates show that the method should be applicable to systems containing at least 100 times more counters, with an execution time below 1 hour when using performance data from one week.
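An extrapolation of this kind could be sketched with a simple linear fit. The timings below are invented stand-ins, not the measured results from the analysis; the only assumption carried over from the text is roughly linear scaling in the number of observations.

```python
import numpy as np

# Invented test-run timings, roughly linear in the number of observations.
obs = np.array([1.0e6, 2.0e6, 4.0e6, 7.7e6])   # observations per test run
secs = np.array([20.0, 40.0, 80.0, 154.0])     # execution time in seconds

# Fit a straight line: execution time as a function of observation count.
slope, intercept = np.polyfit(obs, secs, 1)

# One week of observations in this environment (7.7M covered 30 days).
base = 7.7e6 * 7 / 30
for factor in (2, 10, 100):
    est_min = (slope * base * factor + intercept) / 60
    print(f"{factor}x environment: ~{est_min:.0f} minutes")
```

With real measured timings in place of the invented ones, the same fit yields the 2x, 10x, and 100x estimates referred to in the table.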

Discussion and future analysis

The earlier illustration of a correlation between events and performance data is a simple example of a case where it is possible to find similar patterns.

This method does not measure how strong a correlation is. However, the new data, containing a measure of deviation rather than absolute values, would be suitable for typical machine learning algorithms that can measure the importance of a variable in relation to a specific event.

By using the most important variables to form prediction models, it would be possible to predict events based on performance data.

That would also make it possible to find the actual cause or causes of an event, since they should be among these variables.
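A minimal sketch of ranking variables by importance is shown below. It uses a plain correlation against a binary event indicator in NumPy rather than the machine learning algorithms the text alludes to, and all variable names and data are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
event = rng.integers(0, 2, n)            # 1 when the event occurred that hour

# Three hypothetical ODR variables; only the first is tied to the event.
x_related = event * 2.0 + rng.normal(0, 0.5, n)
x_noise1 = rng.normal(0, 1, n)
x_noise2 = rng.normal(0, 1, n)
variables = {"disk_odr": x_related, "cpu_odr": x_noise1, "net_odr": x_noise2}

# Score each variable by the strength of its correlation with the event.
importance = {name: abs(np.corrcoef(v, event)[0, 1])
              for name, v in variables.items()}
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked[0])
```

A proper feature-importance measure from a trained model would replace the correlation score here, but the ranking step, pointing at the most probable cause, would look the same.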

Creating models based on all performance data to find correlations with each event could require extremely long calculation times.

Since the models only are supposed to be compared to each other, they do not necessarily need to perform very well on prediction.

Poorer models could possibly reduce the calculation time. The calculation may need to run on a daily basis, and therefore this method, as well as other possible alternatives, needs further analysis.

Finding, measuring and selecting deviating values has proven to be possible in short time even for big data sets.

For a fully automated system this new data will be useful, regardless of the method to find correlations between performance data and events.