Use of Unsupervised Learning in AIOps

December 3rd, 2019

Machine Learning and AIOps

Machine Learning lets a system learn rules and patterns from data for tasks where defining those rules by hand is challenging. Many monitoring and event management tools in use today were designed for the simple, static IT infrastructure of the 1990s. They depend on manually built rules to correlate events, whose volume and velocity have both grown exponentially over the last decade.

The challenge in the current IT setup is to reduce alert noise and enrich alerts with context. However, rule definitions must be highly precise and specific, because every permutation and combination that could occur in your environment needs to be mapped out.

This necessitates building thousands or even hundreds of thousands of rules that need constant support and updates. Every time a change occurs, new rules need to be formulated. This constant maintenance requires at least two to three dedicated staff members.

The biggest drawback, however, is that a rule-based system can detect only what it has previously encountered. Machine Learning addresses this by finding patterns in the data itself, which optimizes and streamlines IT operations (ITOps).

How does Unsupervised Learning (Clustering Algorithms) help AIOps?

With the IT Operations tool stack becoming exponentially more complex, excessive alerts create confusion and cost valuable time that could be better spent remediating the actual root cause.

Clustering algorithms are an intuitive, intelligent way of extracting commonalities from raw data: they use features to group similar data points together. When you have a large volume of data and many features to consider, it is nearly impossible for humans or rule-based solutions to accomplish this. This is where ML-based clustering algorithms can help streamline the identification of patterns, in a fully customizable manner, to isolate critical issues.

How does it work?

The CloudFabrix system ingests data in its raw format into its AI engine, where the data is normalized and enriched with configuration information, operational categories, and other customized variables. The AI then automatically finds associations among the alerts, logically categorizing them into high-level actionable groups called Incidents, using a technique called unsupervised learning.
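The normalize-and-enrich step described above can be illustrated with a minimal Python sketch. It is not the CloudFabrix implementation: the field names, the `CONFIG_DB` lookup table, and the `normalize` and `enrich` helpers are all hypothetical, chosen only to show the general idea of mapping raw alerts onto one schema and then attaching configuration context.

```python
# Hypothetical configuration data, standing in for a real configuration source.
CONFIG_DB = {
    "web-01": {"service": "checkout", "location": "dc-east"},
    "db-01": {"service": "orders-db", "location": "dc-east"},
}

# Hypothetical mapping from source-specific severity labels to common ones.
SEVERITY_MAP = {
    "crit": "critical",
    "critical": "critical",
    "warn": "warning",
    "warning": "warning",
}


def normalize(raw: dict) -> dict:
    """Map source-specific field names and values onto one common schema."""
    host = (raw.get("host") or raw.get("hostname") or "").strip().lower()
    message = (raw.get("message") or raw.get("msg") or "").strip()
    severity = SEVERITY_MAP.get(str(raw.get("severity", "")).lower(), "unknown")
    return {"host": host, "message": message, "severity": severity}


def enrich(alert: dict, config_db: dict = CONFIG_DB) -> dict:
    """Attach service and location context looked up by host name."""
    info = config_db.get(alert["host"], {})
    return {
        **alert,
        "service": info.get("service", "unknown"),
        "location": info.get("location", "unknown"),
    }


# Two raw alerts from different (hypothetical) monitoring tools.
raw_alerts = [
    {"hostname": "WEB-01 ", "msg": "HTTP 503 rate high", "severity": "CRIT"},
    {"host": "db-01", "message": "Connection pool exhausted", "severity": "warn"},
]

prepared = [enrich(normalize(alert)) for alert in raw_alerts]
```

After this step every alert carries the same fields (`host`, `message`, `severity`, `service`, `location`), which is what makes the later grouping possible regardless of which tool produced the alert.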

The algorithms work during the clustering stage, when patterns are compared and evaluated against specific properties such as attributes, feedback, or time constraints. Examples include:

Alerts with a similar description and from the same application or service.

Alerts from the same host or location.

Topology-based correlation using Vertex Entropy.

Cascading failures.

Performance failures.

Brownouts.
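To make the first criterion in the list above concrete (similar description, same application or service), here is a small Python sketch. It uses a greedy single-pass grouping with Jaccard similarity over message tokens, which is a deliberately simple stand-in and not the algorithm CloudFabrix uses. The `tokens`, `jaccard`, and `cluster_alerts` names, the alert fields, and the `0.5` threshold are all assumptions made for illustration.

```python
import re


def tokens(text: str) -> set:
    """Split a description into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_alerts(alerts: list, threshold: float = 0.5) -> list:
    """Greedy single-pass grouping.

    An alert joins the first existing cluster that contains a member from the
    same service with a sufficiently similar description. Otherwise it starts
    a new cluster. No labelled training data is involved.
    """
    clusters = []
    for alert in alerts:
        words = tokens(alert["message"])
        for cluster in clusters:
            if any(
                alert["service"] == member["service"]
                and jaccard(words, tokens(member["message"])) >= threshold
                for member in cluster
            ):
                cluster.append(alert)
                break
        else:
            # No cluster was similar enough, so this alert starts its own.
            clusters.append([alert])
    return clusters


alerts = [
    {"service": "checkout", "message": "Disk usage above 90% on /var"},
    {"service": "checkout", "message": "Disk usage above 95% on /var"},
    {"service": "orders-db", "message": "Disk usage above 90% on /var"},
    {"service": "checkout", "message": "Login latency high"},
]

clusters = cluster_alerts(alerts)
for number, cluster in enumerate(clusters, start=1):
    print(number, [a["message"] for a in cluster])
```

The two `checkout` disk alerts end up in one group, while the identical message from `orders-db` stays separate because the service differs. A production system would likely use richer features and a more robust clustering method, but the principle of grouping by measured similarity rather than by hand-written rules is the same.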

The AI engine evaluates incoming alerts in real time and decides whether each one can be grouped under an existing incident or whether a new incident should be created. To enhance the efficacy of alert correlation, correlation pattern definitions can be tailored to the architecture and processes that drive the client’s infrastructure. For instance, this can help in navigating the following situations:

Network-related connectivity issues within the same data center

Application-specific checks on the same host

Load-related alerts from multiple servers in the same database cluster

Low memory alerts on a distributed cache
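The four situations above, together with the real-time decision to join an existing incident or open a new one, can be sketched in Python as follows. Everything here is hypothetical: `CorrelationPattern`, `Incident`, `IncidentCorrelator`, the alert fields, and the 300-second window are illustrative assumptions, not the CloudFabrix pattern format or engine.

```python
from dataclasses import dataclass, field


@dataclass
class CorrelationPattern:
    name: str
    match: dict      # field -> value an alert must have for the pattern to apply
    group_by: tuple  # fields whose shared values place alerts in one incident


@dataclass
class Incident:
    key: tuple
    last_seen: float
    alerts: list = field(default_factory=list)


# One hypothetical definition per situation listed above.
PATTERNS = [
    CorrelationPattern("network-per-datacenter", {"category": "network"}, ("datacenter",)),
    CorrelationPattern("app-checks-per-host", {"category": "application"}, ("host",)),
    CorrelationPattern("load-per-db-cluster", {"category": "load"}, ("cluster",)),
    CorrelationPattern("low-memory-per-cache", {"category": "memory"}, ("cache",)),
]


def correlation_key(alert: dict, patterns: list):
    """Return the grouping key of the first matching pattern, or None."""
    for pattern in patterns:
        if all(alert.get(f) == v for f, v in pattern.match.items()):
            return (pattern.name,) + tuple(alert.get(f) for f in pattern.group_by)
    return None


class IncidentCorrelator:
    def __init__(self, patterns: list, window_seconds: float = 300.0):
        self.patterns = patterns
        self.window_seconds = window_seconds
        self.open_incidents = {}  # key -> most recent incident for that key
        self.incidents = []       # every incident, in creation order

    def ingest(self, alert: dict) -> Incident:
        """Attach the alert to a recent incident with the same key, else open a new one."""
        key = correlation_key(alert, self.patterns)
        if key is None:
            # Alerts that match no pattern get a unique key and stand alone.
            key = ("unmatched", len(self.incidents))
        current = self.open_incidents.get(key)
        if current is not None and alert["timestamp"] - current.last_seen <= self.window_seconds:
            current.alerts.append(alert)
            current.last_seen = alert["timestamp"]
            return current
        incident = Incident(key=key, last_seen=alert["timestamp"], alerts=[alert])
        self.open_incidents[key] = incident
        self.incidents.append(incident)
        return incident


stream = [
    {"timestamp": 0, "category": "network", "datacenter": "dc-east", "host": "sw-01"},
    {"timestamp": 20, "category": "network", "datacenter": "dc-east", "host": "sw-02"},
    {"timestamp": 30, "category": "load", "cluster": "orders-db", "host": "db-01"},
    {"timestamp": 45, "category": "load", "cluster": "orders-db", "host": "db-02"},
    {"timestamp": 50, "category": "network", "datacenter": "dc-west", "host": "sw-09"},
    {"timestamp": 900, "category": "network", "datacenter": "dc-east", "host": "sw-01"},
]

correlator = IncidentCorrelator(PATTERNS)
for alert in stream:
    correlator.ingest(alert)

for incident in correlator.incidents:
    print(incident.key, len(incident.alerts))
```

In this sketch, the two early `dc-east` network alerts share one incident, the two load alerts from different servers in the `orders-db` cluster share another, and the late `dc-east` alert opens a new incident because it falls outside the time window. Each pattern is a single declarative definition that covers any number of hosts, clusters, or data centers, instead of one hand-written rule per combination.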

The final word

Implementing Machine Learning in IT operations is imperative given the volume and scale of the data being handled. The primary strategy should center on locating and prioritizing the root cause of a mission-critical problem. Once the root cause has been determined, the next step is to provide contextual recommendations to remediate the incident. The algorithms should be capable of learning on the fly, without historical training.

The most streamlined systems know how to route data from disparate sources to the appropriate algorithms. A single definition is worth hundreds of rules in such use cases. These algorithms are agile and can adapt to application and infrastructure changes, which makes them easy to deploy and maintain.

CloudFabrix will help you run your IT smoothly, seamlessly integrating with all your systems. We will leverage real-time and historical data to provide actionable insights and streamline your IT team. This will allow your team to focus on priority tasks rather than spend time on everyday tasks, saving labor, money, and time. We guarantee that the long-term impact of AIOps on your IT operations will be transformative.