iGLASS NOC Blog

The Impact of False Positive Monitoring Alerts and How to Minimize Them

The issue of false positives rears its unwelcome head in many fields: medicine, quality assurance, statistical hypothesis testing, and other kinds of measurement. A false positive occurs when a process, test or procedure indicates that a condition exists when, in fact, it does not.

When monitoring an IT network, a false positive alert might report that a certain server, network device or application has failed, or that an important threshold has been crossed, when nothing is actually wrong. Depending on severity, a false alert can at best waste time that would be better spent elsewhere, and at worst disrupt an entire IT department.

Let's take a look at how false positive monitoring alerts can disrupt an otherwise smooth operation, and how you can reduce and eliminate those time-wasting fake alerts.

Impacts of False Positive Monitoring Alerts

Take a moment to see if this scenario sounds familiar:

An alert pops up on your monitoring screen. Someone on your team receives the alert and opens a ticket. The technician in charge of resolving the incident (who may be a different person) is notified of a new ticket, then checks the monitoring screen to begin searching for the root cause of the problem.

If the alert has already returned to an okay state, the technician can close the ticket. However, if the alert is still raised, the technician will analyze what is happening and eventually conclude it is a false positive. Total time: 20 to 30 minutes, with one or two people affected.

For networks that throw 15 or 20 false positives a day, remediation at 20 to 30 minutes per alert can easily consume five to ten staff hours daily, the equivalent of a full-time employee. Your MTTR increases as well.
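As a rough check on that estimate, the arithmetic can be sketched in a few lines of Python. The alert counts and handling times are the figures quoted above; the 8-hour workday and the function name are assumptions made for illustration.

```python
# Rough daily cost of false positives, using the figures quoted above.
WORKDAY_HOURS = 8  # assumption: one full-time employee works 8 hours a day


def daily_hours_lost(alerts_per_day, minutes_per_alert, people=1):
    """Return staff hours spent per day chasing false positives."""
    return alerts_per_day * minutes_per_alert * people / 60


low = daily_hours_lost(15, 20)   # 15 alerts at 20 minutes each
high = daily_hours_lost(20, 30)  # 20 alerts at 30 minutes each

print(f"{low:.1f} to {high:.1f} staff hours per day")
print(f"{low / WORKDAY_HOURS:.2f} to {high / WORKDAY_HOURS:.2f} full-time equivalents")
```

If two people are pulled into each alert, passing people=2 doubles the totals.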

Furthermore, when you're dealing with a large number of false positives, your team may begin to feel annoyed by the repeated urgent, but unnecessary, demands on their time. Their confidence in the network monitoring tools you use may falter as well, and they may begin to ignore various alerts altogether. When people get into the habit of ignoring false positives rather than addressing them, your network gradually becomes vulnerable to real downtime, performance problems and other damage from true alerts that were mistakenly neglected.

How to Minimize False Positive Monitoring Alerts

Outsourcing your network monitoring to a professional NOC frees up your IT staff to work on more mission-critical and value-added projects. It also puts your network safely into the hands of pros who have spent years monitoring networks that range from moderate SMB networks to massively complex enterprise networks that span the globe.

On the issue of false positive alerts specifically, an experienced NOC will make regular adjustments to the NMS that eliminate time-wasting false alerts. While every network is different, the following are some of the steps a NOC will take to eliminate false positives and let you focus on more important matters.

Establish baseline behaviors for key network components.

Suppose you ping your servers at regular intervals to confirm they are connected and running. A transient event can cause a ping to time out and raise an alert. For example, employees arriving at work each morning and logging en masse into a large database application can create so much traffic on a particular subnet that pings to servers occasionally time out and create false alerts. Establishing a baseline of normal behavior for each key network component makes it easier to tell expected fluctuations from genuine faults, and helps prioritize where attention needs to go.
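One way to picture a baseline is a per-hour summary of ping latency, with an alert raised only when a reading is far outside what is normal for that hour. The sketch below is a minimal illustration, not how any particular monitoring product works; the function names, the sample data and the three-standard-deviation cutoff are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(samples):
    """Group (hour_of_day, latency_ms) samples and summarize each hour."""
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_unusual(baseline, hour, latency_ms, k=3.0):
    """True when latency is more than k standard deviations above that hour's norm."""
    if hour not in baseline:
        return True  # no history for this hour, so let a person look at it
    avg, spread = baseline[hour]
    return latency_ms > avg + k * spread


# Hypothetical history: the 9 a.m. login rush is normally slower than 2 p.m.
history = [(9, 180), (9, 220), (9, 200), (14, 20), (14, 22), (14, 18)]
baseline = build_baseline(history)

print(is_unusual(baseline, 9, 230))   # checked against the morning norm
print(is_unusual(baseline, 14, 230))  # checked against the afternoon norm
```

The same 230 ms reading is treated differently depending on the hour, which is the point of a baseline: the morning rush stops generating alerts while the same latency in a quiet period still does.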

Review polling methods and thresholds to prevent false alerts.

In addition to addressing network traffic problems in the example above, the monitoring rule that determines whether a fault exists needs to be reviewed. Is the monitoring system using scripts that call ping, hping or fping? Different ping tools give different results. Does the monitoring tool trigger a fault when just one ICMP packet isn't returned, or does it look for multiple time-outs? Studying the scripts that evaluate network device health can often uncover the source of false positives.
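The difference between faulting on one missed reply and faulting on several in a row can be sketched as follows. This is only the decision logic, under the assumption that each poll has already been reduced to a True (answered) or False (timed out) value; the function name and the default of three failures are illustrative.

```python
def fault_detected(poll_results, required_failures=3):
    """Return True once `required_failures` polls in a row have failed.

    poll_results is an iterable of booleans, True meaning the ping was answered.
    """
    consecutive = 0
    for answered in poll_results:
        # Any successful reply resets the count of consecutive time-outs.
        consecutive = 0 if answered else consecutive + 1
        if consecutive >= required_failures:
            return True
    return False


# Isolated time-outs during a busy period do not raise a fault...
print(fault_detected([True, False, True, False, True]))
# ...but three missed replies in a row do.
print(fault_detected([True, False, False, False]))
```

Setting required_failures=1 reproduces the behavior of a rule that faults on a single lost packet, which is the configuration most likely to produce false positives.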

Write rules that allow customizable thresholds.

Once you've studied the scripts responsible for generating false positives, you can customize them with thresholds that will eliminate false positives. For instance, if you are monitoring CPU activity on a critical server, a script that triggers whenever CPU utilization momentarily hits 100 percent will produce a steady stream of false alerts. Alert instead on the average CPU utilization over a period of time. An outsourced NOC partner can implement thresholds that are based on time of day, duration, consecutive poll cycles and more.
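A minimal sketch of an averaged threshold follows. The class name, the 90 percent limit and the five-sample window are assumptions chosen for illustration; a real rule would use whatever limits suit the server in question.

```python
from collections import deque


class SustainedThreshold:
    """Alert only when the rolling average of recent samples exceeds a limit."""

    def __init__(self, limit_percent=90.0, window=5):
        self.limit_percent = limit_percent
        self.samples = deque(maxlen=window)  # oldest sample drops off automatically

    def add(self, cpu_percent):
        """Record one poll and return True if an alert should be raised."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        return window_full and average > self.limit_percent


spike = SustainedThreshold()
print([spike.add(v) for v in [30, 100, 35, 40, 30]])    # one momentary peak

sustained = SustainedThreshold()
print([sustained.add(v) for v in [95, 98, 100, 97, 99]])  # load stays high
```

The single spike to 100 percent never raises an alert because the window average stays low, while the sustained load raises one as soon as the window fills.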

Develop a custom runbook.

A runbook is essentially an agreement between your company and your NOC partner. It documents the standard operating procedures the NOC uses to troubleshoot and resolve the most common outages.

The runbook will also define escalation procedures, any tier-1 remediation efforts, and precisely who should be notified of different alerts. There's no reason to text or email multiple people when one technician and their supervisor are the only people who need to be involved. A NOC can help you develop these and similar procedural guidelines that lead to quick response and efficient operation.
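The notification part of a runbook can be thought of as a lookup table. The sketch below is purely illustrative: the alert types, contact roles, actions and time limits are hypothetical placeholders, not a recommended runbook.

```python
# Hypothetical runbook entries: alert type -> first-line action and contacts.
RUNBOOK = {
    "server_down": {
        "tier1_action": "confirm from a second poller before opening a ticket",
        "notify": ["on-call technician"],
        "escalate_to": ["technician's supervisor"],
        "escalate_after_minutes": 30,
    },
    "cpu_high": {
        "tier1_action": "check whether a scheduled job is running",
        "notify": ["on-call technician"],
        "escalate_to": [],
        "escalate_after_minutes": 60,
    },
}

# Fallback for alert types the runbook does not cover yet.
DEFAULT_ENTRY = {
    "tier1_action": "review manually",
    "notify": ["on-call technician"],
    "escalate_to": ["technician's supervisor"],
    "escalate_after_minutes": 15,
}


def contacts_for(alert_type, minutes_open):
    """Return who should be notified for an alert that has been open this long."""
    entry = RUNBOOK.get(alert_type, DEFAULT_ENTRY)
    contacts = list(entry["notify"])
    if minutes_open >= entry["escalate_after_minutes"]:
        contacts += entry["escalate_to"]
    return contacts


print(contacts_for("server_down", 5))
print(contacts_for("server_down", 45))
```

Keeping the contact list this narrow means a new alert reaches one technician first, and the supervisor only if it stays open past the agreed limit.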

Review alerts and logged incidents.

A NOC partner can review your monitoring alerts and logged incidents to evaluate the current alerting strategy and recommend how it can be improved. They can learn from post-mortem analysis and improve the monitoring as a direct result.

Look for correlations.

A NOC can also analyze alerts to determine whether there's a relationship between false positives and a real underlying problem. Returning to the example above, in which employees flood the network with traffic after logging in each day, the false positive alert that mistakenly reported a server as down led to corrective action that relieved traffic congestion on that subnet. In that case, a false positive actually helped identify a real problem, which has now been corrected.
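A simple way to look for this kind of pattern is to count false positives by where and when they occur. The sketch below assumes each logged incident has already been labeled as a false positive or not; the field names, subnet labels and sample records are hypothetical.

```python
from collections import Counter


def false_positive_hotspots(incidents, min_count=3):
    """Return ((subnet, hour), count) pairs with at least min_count false positives."""
    counts = Counter(
        (i["subnet"], i["hour"]) for i in incidents if i["false_positive"]
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical incident log: false positives cluster on one subnet at 9 a.m.
incidents = [
    {"subnet": "subnet-a", "hour": 9, "false_positive": True},
    {"subnet": "subnet-a", "hour": 9, "false_positive": True},
    {"subnet": "subnet-a", "hour": 9, "false_positive": True},
    {"subnet": "subnet-b", "hour": 14, "false_positive": True},
    {"subnet": "subnet-a", "hour": 9, "false_positive": False},
]

print(false_positive_hotspots(incidents))
```

A cluster like the one on subnet-a at 9 a.m. suggests the false alerts share a cause worth investigating, such as the morning login congestion described above.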

iGLASS helps IT professionals improve the uptime and health of their IT infrastructure, increase productivity, improve the bottom line and enhance the quality of life for themselves and their employees.
If you struggle with work/life balance; if you're tired of nuisance alerts from multiple platforms keeping you and your staff up at night; if you need to increase uptime without increasing headcount; if you realize you're not in the monitoring business and want to focus on running your business, connect with me today. Let's start a conversation.