What Does Actionable Insight for IT Monitoring Look Like?

If you take a look at the technology landscape for IT monitoring, you’ll surely notice a consistent language used to convey the product benefits. For example, every product touts some form of insight — operational insight, proactive insight, data-driven insight, real-time insight, etc.

While monitoring and time-series tools with sexy dashboards tend to demo quite well, what are they actually telling you? In my conversations with IT Ops and DevOps professionals, it’s clear that they still struggle to detect incidents.

The disconnect between massive amounts of insight — metrics, log files, dashboards, etc. — and improved service quality is that the insight itself is rarely actionable. I love my insights and raw data as much as the next guy, but how valuable is a spike in a chart if you don’t know how to act upon it?

Monitoring Dashboards Lack Context

Actionability is all about context. In other words, actionable insight should answer the following questions: What is this data telling me? What is the cause and/or impact? What steps do I take to solve the issue?

In the screenshot below, the tool is being used to search for spikes in multiple fields across three different hosts against a five-day moving average. This visualized query clearly points out any abnormalities, indicating that an operator should investigate what’s happening.
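To make the idea behind such a query concrete, here is a minimal sketch of spike detection against a five-day moving average. It is not the query language of any particular monitoring tool; the function name `find_spikes`, the `ratio` threshold, the host names, and the sample values are all assumptions made for illustration.

```python
def find_spikes(daily_values, window=5, ratio=2.0):
    """Return the indexes of values that exceed `ratio` times the
    average of the preceding `window` values (a trailing moving average)."""
    spikes = []
    for i in range(window, len(daily_values)):
        moving_avg = sum(daily_values[i - window:i]) / window
        # Skip all-zero windows so the ratio test stays meaningful.
        if moving_avg > 0 and daily_values[i] > ratio * moving_avg:
            spikes.append(i)
    return spikes


# Hypothetical daily traffic counts per host (illustrative data only).
traffic_by_host = {
    "host-a": [100, 110, 95, 105, 100, 102, 400, 98],
    "host-b": [50, 52, 49, 51, 50, 53, 51, 50],
}

spikes_by_host = {
    host: find_spikes(values) for host, values in traffic_by_host.items()
}
print(spikes_by_host)
```

Note what the result contains: the positions of the anomalies and nothing else. It says which host spiked and when, but not why or what to do about it.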

So how does an operator leverage this chart? Let’s walk through the three questions that I previously mentioned to act upon operational data…

What is this data telling me? There are unusual peaks in traffic for certain hosts at certain times.

What is the cause/impact? Unknown.

What steps do I take to resolve the issue? Run some additional queries to pull data from related components to see if there is any impact. Once you understand the impact, identify the root cause, and figure out how to resolve it.

As another example, look at the screenshot below. This visualization nicely represents memory usage across various hosts. As opposed to the historical nature of the previous example, this chart delivers real-time insight into potential impact.

But how does an operator leverage this information?

What is this data telling me? A certain host is utilizing more memory than usual.

What is the cause/impact? Unknown.

What steps do I take to resolve the issue? Not conclusive that an issue is actually occurring.

While the information represented by these charts is undoubtedly valuable, it’s ultimately up to the operator to take the correct steps to understand the bigger picture and resolve the underlying issue. As a human-dependent process, it carries the risk of not asking the right questions, misinterpreting the data, and not taking the right resolution steps.

ITOps and DevOps Need Actionable Insight

It’s entirely possible today to automate the bulk of the human-dependent process in IT incident detection and troubleshooting. For example, algorithms can be applied to reduce operational noise (repetitive and irrelevant data), correlate events across toolsets to provide rich context, and capture and recommend resolution knowledge for recurring incidents, providing operators with truly actionable insight.
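As a rough illustration of the noise-reduction step, the sketch below collapses repeated alerts into a single record with an occurrence count. It is a simple deduplication heuristic, not a description of how any specific product works; the alert fields (`time`, `source`, `host`, `message`) and the sample data are assumptions.

```python
def deduplicate(alerts):
    """Collapse alerts that share (source, host, message) into one
    record, keeping the first occurrence plus a count and last-seen time."""
    merged = {}
    for alert in alerts:
        key = (alert["source"], alert["host"], alert["message"])
        if key not in merged:
            # Copy the first occurrence so the input is left untouched.
            merged[key] = {**alert, "count": 0}
        merged[key]["count"] += 1
        merged[key]["last_seen"] = alert["time"]
    # Dicts preserve insertion order, so results follow first-seen order.
    return list(merged.values())


# Hypothetical alert stream; "time" is seconds since the first alert.
alerts = [
    {"time": 0, "source": "storage", "host": "array-1", "message": "Disk Unavailable"},
    {"time": 5, "source": "storage", "host": "array-1", "message": "Disk Unavailable"},
    {"time": 9, "source": "db", "host": "db-1", "message": "File cannot be read at this time"},
    {"time": 12, "source": "storage", "host": "array-1", "message": "Disk Unavailable"},
]

unique_alerts = deduplicate(alerts)
for alert in unique_alerts:
    print(alert["source"], alert["message"], "x", alert["count"])
```

Four raw alerts become two records here. The repeat count is kept rather than discarded, because how often an alert recurs can itself be useful context for an operator.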

Moogsoft AIOps is a solution that can leverage existing monitoring toolsets and provide the missing context. In other words, the solution can turn insight into actionable insight.

In the example below, Moogsoft AIOps has correlated over 300 alerts from different sources into a single actionable Situation. The timeline visualization in the screenshot shows how the incident unfolded over time.

By hovering over the first critical alert, we can see the event details.

In this case, it is a critical ‘Disk Unavailable’ alert from an EMC Symmetrix array.

The next alert came from Oracle OEM, and reported a “File cannot be read at this time” alert. This makes sense given that the storage array just failed.

What the timeline shows next is a wave of alerts thrown by all the applications that were connected to this database.
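The article does not describe how the product decides which alerts belong together. Purely as an illustration of the general idea, one simple heuristic is time proximity: alerts that arrive within a short gap of each other are grouped into the same incident. The function name, the `max_gap_seconds` parameter, and the sample alerts below are assumptions.

```python
def correlate_by_time(alerts, max_gap_seconds=60):
    """Group alerts into incidents: an alert joins the current group if
    it arrives within `max_gap_seconds` of the previous alert."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if groups and alert["time"] - groups[-1][-1]["time"] <= max_gap_seconds:
            groups[-1].append(alert)
        else:
            # Gap too large (or first alert): start a new group.
            groups.append([alert])
    return groups


# Hypothetical alerts from different tools; "time" is in seconds.
incoming = [
    {"time": 0, "source": "storage", "message": "Disk Unavailable"},
    {"time": 20, "source": "database", "message": "File cannot be read at this time"},
    {"time": 45, "source": "app-1", "message": "Exception"},
    {"time": 50, "source": "app-2", "message": "Policy violation"},
    {"time": 600, "source": "app-3", "message": "Slow response"},
]

groups = correlate_by_time(incoming)
for group in groups:
    print([alert["source"] for alert in group])
```

The first four alerts land in one group ordered by arrival time, which mirrors the storage, then database, then application sequence in the timeline. Time proximity alone would also merge unrelated incidents that happen to coincide, so a real system would need more signals than this.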

Furthermore, Moogsoft AIOps captures resolution knowledge and recommends remediation steps when a clustered Situation matches a previous Situation to a certain degree of similarity. In the screenshot below, we can see that there is a previous Situation with 99% similarity that has a recommended resolving step.
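How the similarity score is computed is not stated here. As one plausible sketch, the example below compares incidents by the Jaccard similarity of their sets of alert messages and suggests the stored resolution of the closest past incident when the score clears a threshold. The similarity measure, the `min_similarity` value, the record fields, and the second past incident are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recommend_resolution(current_signature, past_incidents, min_similarity=0.9):
    """Return (resolution, score) for the most similar past incident,
    or (None, score) if the best match is below `min_similarity`."""
    best, best_score = None, 0.0
    for past in past_incidents:
        score = jaccard(current_signature, past["signature"])
        if score > best_score:
            best, best_score = past, score
    if best is not None and best_score >= min_similarity:
        return best["resolution"], best_score
    return None, best_score


# Hypothetical knowledge base of previously resolved incidents.
past_incidents = [
    {
        "signature": {"Disk Unavailable", "File cannot be read at this time", "Exception"},
        "resolution": "Contact the Storage team to fix the storage array.",
    },
    {
        "signature": {"High CPU"},
        "resolution": "Restart the worker pool.",
    },
]

current = {"Disk Unavailable", "File cannot be read at this time", "Exception"}
resolution, score = recommend_resolution(current, past_incidents)
print(resolution, score)
```

The threshold is a trade-off: set too low, operators get recommendations from incidents that only look alike; set too high, near-identical recurrences go unmatched.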

So let’s recap this example:

What is this data telling me? Several applications are experiencing exceptions and policy violations.

What is the cause/impact? These applications were connected to the Oracle Database, which cannot read from the EMC storage array because the disk is unavailable.

What steps do I take to resolve the issue? Contact the Storage team to fix the storage array.

The information in this example is presented with rich context, allowing the users to immediately understand the issue and take the necessary steps towards resolution.

The Benefits of Actionable Insight

Enterprise IT organizations are leveraging tools like Moogsoft AIOps to gain real-time, actionable insight. This allows ITOps and DevOps teams to save time when investigating and troubleshooting incidents.

Furthermore, the solution helps them massively reduce their workload by cutting down on noisy alerts, false tickets, duplicate tickets, and more.

Moogsoft AIOps helps modern IT Operations and DevOps teams become smarter, faster, and more effective by providing technological supplementation that automates mundane tasks, enables scalability, and frees up human beings to do what they do best — ideate, create, and innovate. Start your free trial today by clicking here.

About the Author

Sahil Khanna is a Sr. Product Marketing Manager at Moogsoft, where he focuses on the emergence of Algorithmic IT Operations. In his free time, Sahil enjoys banging on drums and participating in high-stakes bets.