Moogsoft Introduces Probable Root Cause

PRC is the first technique that can understand causality in unpredictable IT environments without reliance on a model.


One of the most prominent terms in the vocabulary of anyone who works in IT is ‘Root-Cause.’ Highly skilled teams across IT organizations dedicate their careers to investigating the root-cause of service-impacting incidents. The tools they use are supposed to help them identify those root-causes, typically by relying on historical models.

However, the only definitive way for root-cause analysis to be 100% accurate is to model every potential outcome of your IT environment. In today’s virtualized and highly redundant IT environments, this is clearly impossible. The outcomes and features of an enterprise-level IT environment are unpredictable at any given moment in time.

“At Moogsoft, we embrace unpredictability.”

Richard Whitehead, Chief Evangelist, Moogsoft

Incident.MOOG applies machine-learning to massive volumes of IT telemetry in real-time to identify truly anomalous features that get clustered into groups of related alerts — we call them ‘Situations.’ This takes immense heavy-lifting away from humans.

But once you have a Situation, how does the operator quickly identify what caused it?

Often the operator can do this by looking at the Situation Timeline or at the Knowledge Tab, where Incident.MOOG presents similar Situations from the past along with the remediating steps that were taken. To increase the degree of certainty, however, Moogsoft has gone a step further.

In the v5.1.7 release of Incident.MOOG, Moogsoft announced the introduction of Probable Root Cause (PRC) to some customers as an Alpha test.

What is Probable Root Cause?

Probable Root Cause (PRC) is a supervised machine-learning process that interprets patterns in user-supplied feedback to identify which alerts in a Situation are ‘root-causes.’

Once the system’s neural net is adequately trained, PRC provides insight into where to begin troubleshooting and diagnosis, reducing the burden on operators and dramatically speeding up incident resolution.

How Does It Work?

When an operator identifies the root-cause(s) of a Situation, they’ll be able to label Alerts within Situations as Causal and Non-Causal with a single click.

User-Defined Root Cause

PRC will enable Moogsoft to learn each time this is done. When new Situations are generated, Incident.MOOG will flag one or more Alerts with a ‘Root Cause Estimate.’ The Root Cause Estimate can range from 0% to 100% and represents the estimated probability that the Alert is causal; the estimate should improve as the number of labeled examples grows. Each ‘bar’ shown for an Alert represents a 10% probability that the Alert is the Root Cause for the Situation being viewed.

Root Cause Estimate
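To make the bar display concrete, here is a minimal Python sketch of how a percentage could map to bars at 10% per bar. The function names, the ten-bar total, and the choice to round down are assumptions made for illustration; the article does not say how Incident.MOOG rounds or renders the bars.

```python
def estimate_to_bars(estimate_pct: float, pct_per_bar: int = 10) -> int:
    """Return how many bars to fill for a Root Cause Estimate given in percent.

    Assumption: partial bars are rounded down, so 98% fills 9 bars.
    """
    if not 0 <= estimate_pct <= 100:
        raise ValueError("estimate must be between 0 and 100")
    return int(estimate_pct // pct_per_bar)


def render_bars(estimate_pct: float, total_bars: int = 10) -> str:
    """Draw filled bars as '#' and empty bars as '.' (hypothetical rendering)."""
    filled = estimate_to_bars(estimate_pct)
    return "#" * filled + "." * (total_bars - filled)


print(render_bars(30))  # three filled bars out of ten
```

Under these assumptions an Alert with a 30% estimate shows three filled bars, and an Alert below 10% shows none.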

Each Situation will indicate a ‘Max Root Cause,’ which is the highest Root Cause Estimate among the Alerts in that Situation. A value of 3%, for example, means that no Alert has more than a 3% probability of being the Root Cause. A value of 98% means that at least one Alert has a 98% probability of being the Root Cause.

Max Root Cause for Situations
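Read this way, Max Root Cause is simply the largest per-Alert estimate in a Situation. The short Python sketch below shows that calculation; the dictionary layout and the alert IDs are hypothetical and are not taken from the product.

```python
def max_root_cause(alert_estimates: dict) -> float:
    """Return the highest Root Cause Estimate (in percent) among a Situation's alerts."""
    if not alert_estimates:
        raise ValueError("a Situation needs at least one alert")
    return max(alert_estimates.values())


def most_likely_cause(alert_estimates: dict) -> str:
    """Return the ID of the alert that carries the highest estimate."""
    return max(alert_estimates, key=alert_estimates.get)


# Hypothetical Situation: alert ID -> Root Cause Estimate in percent.
quiet_situation = {"alert-1": 1.0, "alert-2": 3.0, "alert-3": 0.5}
clear_situation = {"alert-7": 98.0, "alert-8": 12.0}

print(max_root_cause(quiet_situation))     # no alert exceeds this value
print(most_likely_cause(clear_situation))  # where to start troubleshooting
```

A low maximum tells the operator that no Alert stands out as the cause, while a high maximum points to the Alert worth investigating first.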

How Does Incident.MOOG Learn from Probable Root Cause?

Moogsoft’s PRC feature will apply machine-learning techniques that leverage features like Severity, Host, Description, and Class, and will use a Neural Network to estimate the root cause probability for all alerts within a newly created Situation. PRC will work even if the Situation has never been seen before.
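The article does not describe the network's architecture, so the following is only a toy illustration of the general idea: learn from alerts labeled Causal or Non-Causal, then score an alert that has not been seen before. It swaps in a single sigmoid unit (logistic regression, the simplest possible neural network) trained by stochastic gradient descent. The field names, the sample alerts, and every function name are assumptions for illustration, not Moogsoft's implementation.

```python
import math

FIELDS = ("severity", "host", "description", "alert_class")


def tokens(alert):
    """Turn an alert dict into string features such as 'severity=critical'."""
    out = []
    for field in FIELDS:
        value = str(alert.get(field, "")).lower()
        if field == "description":
            out.extend(f"word={w}" for w in value.split())
        else:
            out.append(f"{field}={value}")
    return out


def predict(model, alert):
    """Return the estimated probability (0.0 to 1.0) that the alert is causal."""
    weights, bias = model
    z = bias + sum(weights.get(t, 0.0) for t in tokens(alert))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled, epochs=200, lr=0.5):
    """Fit one sigmoid unit on (alert, is_causal) pairs with gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for alert, is_causal in labeled:
            err = predict((weights, bias), alert) - (1.0 if is_causal else 0.0)
            for t in tokens(alert):
                weights[t] = weights.get(t, 0.0) - lr * err
            bias -= lr * err
    return weights, bias


# Hypothetical operator feedback: True = Causal, False = Non-Causal.
labeled = [
    ({"severity": "critical", "host": "db01",
      "description": "disk full on volume", "alert_class": "storage"}, True),
    ({"severity": "warning", "host": "web01",
      "description": "slow response time", "alert_class": "application"}, False),
    ({"severity": "critical", "host": "db02",
      "description": "disk full on volume", "alert_class": "storage"}, True),
    ({"severity": "minor", "host": "web02",
      "description": "connection retries rising", "alert_class": "application"}, False),
]
model = train(labeled)

# An alert from a host that never appeared in the feedback.
new_alert = {"severity": "critical", "host": "db03",
             "description": "disk full", "alert_class": "storage"}
estimate = predict(model, new_alert)
print(f"Root Cause Estimate: {estimate:.0%}")
```

Because the score depends on alert attributes rather than on a stored copy of a past Situation, the unseen host db03 still receives an estimate, which mirrors the claim that PRC works on Situations it has never seen. A production system would need far more feedback and a richer model than this sketch.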

With the coming integration of PRC into AIOps, Moogsoft will allow ITOps and DevOps teams to leverage machine-learning technology to learn from their everyday actions, and streamline future troubleshooting and diagnosis. Instead of applying rules and models to unpredictable environments, Moogsoft will allow you to loosen your constraints and embrace unpredictability by leveraging data-driven models.

Moogsoft is a pioneer and leading provider of AIOps solutions that help IT teams work faster and smarter. With patented AI analyzing billions of events daily across the world’s most complex IT environments, the Moogsoft AIOps platform helps the world’s top enterprises avoid outages, automate service assurance, and accelerate digital transformation initiatives.

About the Author

Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.