Next generation AI root-cause analysis

The next generation of the Dynatrace AI engine delivers smarter, more precise answers along with increased awareness of external data and events.

To enable the new AI engine, select Problems from the navigation menu. Then click the Switch to next generation AI button on the in-product teaser.

Opt out

Once the new engine has been enabled, you can switch between the current and the enhanced causation engines at any time. This lets you try the new AI engine without risk and provide feedback before the next generation AI engine becomes the new standard.

To switch between the current and enhanced causation engines

Select Problems from the navigation menu.

Click the Browse [...] menu in the upper-right corner.

Click either Switch to new causation engine or Revert to previous AI engine.

Smarter and more precise root causes

Switching to the new causation engine provides several major improvements:

Metric and event-based detection of abnormal component state
The new AI engine automatically checks all component metrics for suspicious behavior. This involves the near real-time analysis of thousands of topologically related metrics per component and even your own custom metrics.

Seamless integration of custom metrics within the Dynatrace AI process
You can integrate all kinds of metrics by writing custom plugins, through JMX, or by using the Dynatrace REST API. The new AI causation engine seamlessly analyzes your custom metrics along with all affected transactions. It’s no longer necessary to define a threshold or to trigger an event for your custom metrics, as the Dynatrace AI automatically picks up metrics that show abnormal behavior.

Third-party event ingests
While the current Dynatrace AI doesn't consider external events for root-causes, the new Dynatrace AI seamlessly picks up any third-party events along the affected Smartscape topology.

Availability root-cause
In many cases, the shutdown or restart of hosts or individual processes is the root-cause of a detected problem. The newly introduced availability root-cause section summarizes all relevant changes in availability within the grouped vertical stack.

Grouped root-cause
To date, each problem details page presented root-cause candidates as individual components, regardless of whether the affected component was a single process or a subset of processes within a large cluster. The improved root-cause section still shows up to three root-cause candidates, but those candidates are aggregated into groups of vertical topologies. This enables you to quickly review outliers within affected service instances or process clusters.

The following sections go into greater detail about these improvements.

Metric and event-based detection of abnormal component state

The original root-cause analysis depends on events to indicate that a given component is unhealthy. Examples are a baseline-triggered slowdown event on a web service or a simple CPU saturation event on a host. Dynatrace detects more than 100 event types on various topological components; these events are raised either by automatic baselining or by thresholds.
Whenever an event is triggered on a component, the AI root-cause analysis automatically collects all the transactions (PurePaths) along the horizontal stack. The analysis continues if the horizontal stack shows that a called service is also marked as unhealthy, as shown in the figure below. With each hop along the horizontal stack, the vertical technology stack is also collected and analyzed for unhealthy states.
This automatic analysis has proved far superior to any manual analysis. Its downside, which the enhanced root-cause analysis addresses, is that it depends heavily on single events.
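The event-driven walk described above can be sketched roughly as follows. This is a minimal illustration, not the Dynatrace implementation: the topology dictionaries, the component names, and the function `event_based_candidates` are all hypothetical, and the real engine works on PurePath data rather than a static call map.

```python
from collections import deque

# Hypothetical topology. CALLS models the horizontal stack (service calls),
# RUNS_ON models the vertical stack (process and host under each service).
CALLS = {
    "web-service": ["backend-service"],
    "backend-service": [],
}
RUNS_ON = {
    "web-service": ["web-process", "web-host"],
    "backend-service": ["backend-process", "linux-host"],
}


def event_based_candidates(start, open_events):
    """Walk the horizontal stack from `start`, following only services that
    have an open event, and collect the unhealthy components of each
    vertical stack along the way."""
    candidates = []
    seen = {start}
    queue = deque([start])
    while queue:
        service = queue.popleft()
        if service not in open_events:
            continue  # healthy service: the walk does not proceed past it
        candidates.append(service)
        # Only vertical-stack components with an open event are considered.
        candidates.extend(c for c in RUNS_ON.get(service, []) if c in open_events)
        for callee in CALLS.get(service, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return candidates


# Events are open on every unhealthy component: the host is found.
all_events = {"web-service", "backend-service", "backend-process", "linux-host"}
with_host_event = event_based_candidates("web-service", all_events)

# No event on the process or the host: the walk stops at the backend service.
service_events_only = {"web-service", "backend-service"}
without_host_event = event_based_candidates("web-service", service_events_only)
```

In this sketch a component without an open event is invisible to the analysis, which is exactly the dependence on single events noted above.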
As shown within the figure below, an event is open on all unhealthy components and Dynatrace correctly detects the Linux host as root-cause:

If an event is present, the root-cause analysis correctly detects the Linux host as root-cause:

The last four years have shown that a baseline or a threshold does not trigger an event in every abnormal situation. Let’s modify the above example and remove one of the critical events within the affected topology. Assume that the Linux host CPU spikes but misses the critical threshold as shown below:

As there is no event on the Linux host, the host is shown as healthy and the old analysis would not consider the host as part of the root-cause. See the changed vertical stack diagram below and focus on the Linux host that no longer shows an open CPU event:

Compared to the situation above, Dynatrace would detect the root-cause on the backend service but would not identify a root-cause on process or host level. In many cases, the root-cause section is simply empty, as shown below:

The overall vision of the next generation of the Dynatrace AI engine was to solve the above situation of not showing a root-cause in non-event scenarios. The following considerations led to the new approach within the enhanced AI root-cause analysis:

Every host comes with around 400 different metric types and time series, depending on the number of processes and technologies running. That means that 10K hosts result in 4,000,000 metrics in total.

Every threshold you set on a metric, or even the best automatic baseline observed over a period of time, means ~1% false positive alerts. One false positive alert on a host does not sound like much, but it also means 10,000 false positive alerts on 10K hosts! With a growing number of metrics per component, we must expect a proportionally higher number of false positive alerts, which leads to alert spam.

It’s obvious that additional or more aggressive thresholds, or even baselines on all those metrics, are not a solution!
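The arithmetic behind these considerations is simple enough to check directly. The snippet below only restates the figures from the text; the variable and function names are illustrative.

```python
HOSTS = 10_000
METRICS_PER_HOST = 400  # approximate figure quoted in the text


def fleet_total(per_host, hosts=HOSTS):
    """Scale a per-host count up to the whole fleet."""
    return per_host * hosts


total_metrics = fleet_total(METRICS_PER_HOST)
false_positives = fleet_total(1)  # one false positive alert per host

print(total_metrics)    # 4000000
print(false_positives)  # 10000
```

Because the fleet-wide count is a plain product, doubling the number of alerting metrics per host doubles the expected number of false positive alerts.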

To tackle the challenge of the increasing number of metrics, the new root-cause analysis automatically checks all the available metrics on all the affected components. Suspicious metric behavior is detected by analyzing the distribution of past metric values and comparing it with the current metric values. The new analysis therefore no longer depends on events and thresholds. If an event is present, or if a user has defined a custom threshold, this is still included in the root-cause process.
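To make the idea of comparing current values against a metric's own past distribution more concrete, here is a minimal sketch. It is not the algorithm Dynatrace uses, which the text does not specify. The percentile band, the `outlier_share` parameter, and the sample data are all assumptions chosen for illustration.

```python
from statistics import quantiles


def is_suspicious(history, current, outlier_share=0.5):
    """Flag a metric when at least `outlier_share` of the current samples lie
    outside the 1st..99th percentile band of the metric's own history."""
    if not current:
        return False
    cuts = quantiles(history, n=100)  # 99 cut points: 1st..99th percentile
    low, high = cuts[0], cuts[-1]
    outside = sum(1 for value in current if value < low or value > high)
    return outside / len(current) >= outlier_share


# Hypothetical CPU usage history (percent), cycling between 20 and 40.
cpu_history = [20 + (i % 21) for i in range(200)]

# A spike that stays below a hypothetical static threshold of 90 percent.
cpu_spike = [72, 75, 78, 74]
threshold_event = any(value >= 90 for value in cpu_spike)  # no event raised
spike_flagged = is_suspicious(cpu_history, cpu_spike)

# Values in the usual range are not flagged.
normal_flagged = is_suspicious(cpu_history, [25, 30, 35, 28])
```

In this example the spike never crosses the static threshold, so no event would be raised, yet it is clearly abnormal relative to the metric's history. The check needs no per-metric configuration, which is what lets it scale to every metric on every affected component.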

See how the new root-cause analysis would tackle the missing root-cause scenario that was described above:

To sum up, the new root-cause analysis is based on a hybrid approach that can detect a root-cause even if there is no open event on a component.

Seamless integration of custom metrics within the Dynatrace AI process

The Dynatrace platform allows the ingest of customer-defined metrics and events through plugins and the REST API. Plugins for third-party integrations can be a great resource for additional root-cause information. An example is a tight integration with your continuous integration and deployment toolchain that provides information about recent rollouts, responsible product owners, and possible remediation actions.
The new analysis covers both kinds of ingested information: custom metrics and custom events sent from third-party integrations.
Let’s focus on the analysis of custom metrics first, as its main functionality was already described in the previous section.
Consider a specific JMX metric titled ‘Account creation duration’ that measures the time needed to create a new account. Once the JMX metric is registered and monitored, it becomes a first-class citizen within our root-causation engine.
In case of a problem that affects real users, the JMX metric is automatically analyzed. If it shows an abnormal distribution compared to the past, it is identified within the root-cause, as shown below:

Third-party event ingests

External events are another new information source that the enhanced AI engine analyzes during the root-cause detection process.
Such events are either semantically predefined, such as deployment, configuration change, or annotation events, or they are generic events at any severity level, such as availability, error, slowdown, resource, or purely informational events. External events can also contain key-value pairs that add context information about the event.
See the following example of a third-party deployment event that was sent through the REST event API and collected during the root-cause process:
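As a rough idea of what sending such an event could look like from a deployment pipeline, the sketch below builds a deployment event with key-value context and prepares, but does not send, an HTTP request. The field names, the `/api/v1/events` path, and the `Api-Token` authorization scheme are assumptions modeled on the Dynatrace Events API and should be checked against the API reference for your environment. The entity ID, URL, token, and property values are placeholders.

```python
import json
import urllib.request


def build_deployment_event(entity_id, name, version, source, properties):
    """Assemble a deployment event with key-value context (assumed schema)."""
    return {
        "eventType": "CUSTOM_DEPLOYMENT",
        "attachRules": {"entityIds": [entity_id]},
        "deploymentName": name,
        "deploymentVersion": version,
        "source": source,
        "customProperties": dict(properties),
    }


def build_request(base_url, api_token, event):
    """Prepare the POST request; calling urllib.request.urlopen would send it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/events",
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": "Api-Token " + api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


event = build_deployment_event(
    entity_id="HOST-0000000000000000",        # placeholder entity ID
    name="Backend rollout",
    version="1.2.3",
    source="Example CI pipeline",
    properties={"Owner": "team-backend", "Remediation": "Roll back to 1.2.2"},
)
request = build_request("https://example.invalid", "PLACEHOLDER_TOKEN", event)
```

The key-value pairs in `customProperties` are where context such as the responsible owner or a remediation hint would travel with the event.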

Availability root-cause

Changes in availability on host or process level often represent the root-cause of large-scale issues within your technology stack. Changes in availability have different causes, such as the explicit restart of application servers after software updates, the restart of hosts or virtual machines, or crashes of individual processes or servers.
While each of the Dynatrace monitored hosts and processes shows an availability chart within its component dashboard, it can be hard to quickly check the availability state of all the relevant components on the vertical stack of a service.
The newly introduced availability section within the problem root-cause section immediately collects and summarizes all relevant downtimes of the underlying infrastructure. It shows all changes in availability of the processes and hosts that run your services on top of the vertical stack.
See an example of the newly introduced availability root-cause section within the screenshot below:

Grouped root-cause

Another improvement within the new analysis is the detection of grouped root-causes. Because the old analysis detected root-cause candidates on individual components rather than on group level, it led to an information explosion in highly clustered environments.
Imagine a case where you run 25 processes within a cluster to serve a microservice. If some of the processes were identified as root-cause, the Dynatrace root-cause section showed individual instances rather than explaining the overall problem.
The new analysis identifies root-cause candidates on group level to explain the overall situation, such as a set of outliers within a large cluster of service instances.
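The difference between instance-level and group-level candidates can be illustrated with a small aggregation. This is a simplified sketch with hypothetical names and data, not the grouping logic of the Dynatrace engine.

```python
from collections import defaultdict


def group_candidates(findings):
    """Aggregate per-instance findings into one candidate per process group.

    `findings` is an iterable of (process_group, instance, is_outlier) tuples.
    Groups without any outlier are not reported."""
    totals = defaultdict(int)
    outliers = defaultdict(list)
    for group, instance, is_outlier in findings:
        totals[group] += 1
        if is_outlier:
            outliers[group].append(instance)
    return {
        group: {"instances": totals[group], "outliers": members}
        for group, members in outliers.items()
    }


# Hypothetical cluster: 25 processes serving one microservice, 3 of them abnormal.
abnormal = {3, 7, 19}
findings = [("checkout", f"checkout-{i}", i in abnormal) for i in range(1, 26)]
findings.append(("search", "search-1", False))  # a healthy, unrelated group

grouped = group_candidates(findings)
```

Instead of three unrelated entries, the result is a single candidate stating that 3 of 25 instances in the group behave abnormally, which is the kind of summary that explains the overall situation.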
While the problem details screen shows just a quick summary of the top contributors, clicking the ‘Analyze findings’ drilldown button opens a detailed analysis view.
The root-cause analysis view is capable of charting all affected service instances in a combined chart along with all the identified abnormal metrics.
This drilldown view is organized to show an identified root-cause as a grouped vertical stack, meaning the top layer always shows service findings followed by process group findings and finally all host and infrastructure findings.
As shown within the screenshot below, each vertical stack layer is shown as a tile containing all the metrics where abnormal behavior was detected.
If more than one service instance, process group instance, or Docker image is affected, the metric chart automatically groups those instances into a combined chart that shows all metric findings on the vertical stack, as shown below:

Overall benefit

By introducing the next generation of the Dynatrace AI engine, we've further improved the strengths of the existing automated root-cause detection.
Well-proven aspects such as business impact analysis and the PurePath-based analysis of single incidents are unchanged, while improvements such as metric anomaly detection, custom events, and custom metrics have been seamlessly integrated. Overall, these improvements have pushed the boundaries of automatic AI-based root-cause analysis into the future and have opened up Dynatrace as a platform for third-party integrations.