Alerts Based on Rolling Averages in Log Analytics

This post will go over how to create an alert for Log Analytics that evaluates two recent time periods for comparison. It’s a little, let’s say, “in depth” as far as Log Analytics queries go. The alert is intended to trigger when a variable threshold is met based on the recent baseline as opposed to a static metric. Used with my PingTimeLog tool found here, alerts can be triggered if recent response time goes over a rolling average value. I also include a disk free space alert to identify when a large amount of data is added to a disk.

Background

A quick background to start. The PingTimeLog tool logs response time in milliseconds. Servers that are geographically close have a lower response time, say 30ms, whereas servers located farther away can take up to 80ms. A static alert tuned for the closer server, say 40ms, would fire constantly for the farther server at 80ms. Setting the threshold at 85ms instead would leave the close server 55ms of headroom before it alerts, but only 5ms for the farther one.

A similar issue occurs with other counter data. Take hard drive free space, for example. An alert could be set up to warn at less than 10% free space, but 10% of 100GB is far different from 10% of 2TB. Also, a lot of data could be added to the larger drive before an alert is triggered. Wouldn’t it be better to know when used space increases significantly over a short amount of time?

By Percent

The first version of this alert query is based on a percent value. I start with some let statements that define the percent threshold for the alert. In this case it is triggered by a 10% overage (multiply by 1.1). I also set the start time (fromTime) and end time (toTime) of the baseline window. In this case, I am comparing the average value from 30 minutes ago to 5 minutes ago against the average value from 5 minutes ago to the current time. If the most recent 5-minute average is more than 10% over the 25-minute baseline average, an alert will trigger.

let warnPercent = 1.1;
let fromTime = ago(30m);
let toTime = ago(5m);

Next, two tables are created with let statements. The first creates the 25-minute average baseline of response time from 30 minutes ago to 5 minutes ago for each address pinged. The second let statement creates an average of the past 5 minutes for each address. Keep in mind this is based on my implementation of PingTimeLog. If you are using that tool and changed data fields, the fields in this query will need to be updated as well.

Once the Let statements are set, a new table is created using the join statement on the address field. A column is created with the Extend command to create the warnPercentThreshold. This column multiplies the 25-minute baseline value with the warn percent value.

Next, a comparison is done that only returns endpoints where the 5 minute value is over the warnPercentThreshold (25 minute value plus 10 percent). Only selected fields are returned with the Project statement. I also changed the column heading on the projected fields for readability.
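Put together, the steps above might look something like the following. Treat it as a sketch under assumptions rather than the exact query: the table name PingTimeLog_CL and the columns Address_s and ResponseTime_d are hypothetical placeholders, so substitute whatever names your PingTimeLog data actually uses.

```kusto
// Hypothetical names (not from the original post):
//   PingTimeLog_CL  - custom log table written by PingTimeLog
//   Address_s       - the endpoint that was pinged
//   ResponseTime_d  - response time in milliseconds
let warnPercent = 1.1;
let fromTime = ago(30m);
let toTime = ago(5m);
// 25-minute baseline: average response time per address
let baseline = PingTimeLog_CL
    | where TimeGenerated between (fromTime .. toTime)
    | summarize BaselineAvg = avg(ResponseTime_d) by Address_s;
// Most recent 5 minutes: average response time per address
let recent = PingTimeLog_CL
    | where TimeGenerated > toTime
    | summarize RecentAvg = avg(ResponseTime_d) by Address_s;
baseline
| join kind=inner (recent) on Address_s
// Threshold is the baseline plus 10 percent
| extend warnPercentThreshold = BaselineAvg * warnPercent
// Keep only endpoints whose recent average exceeds the threshold
| where RecentAvg > warnPercentThreshold
| project Address = Address_s,
          BaselineAvgMs = BaselineAvg,
          RecentAvgMs = RecentAvg,
          ThresholdMs = warnPercentThreshold
```

An alert rule built on a query like this would typically fire when the number of results is greater than zero, since every returned row is an endpoint over its threshold.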

This is good, but it still has a problem. Ten percent of 80 is more than 10 percent of 30 (8ms versus 3ms), so the allowed deviation from the baseline depends on the average response time. Not that significant, but still not perfect.

By Fixed Value

While walking my dog and contemplating the issues with using a percent value, a simple solution came to me. Instead of multiplying by a percent, simply add (or subtract) a fixed value to create the threshold. For example, by adding 10ms to the 25-minute average, alerts can be generated by a consistent deviation. If the endpoint with a 25-minute average of 30ms jumps to 41ms, an alert will trigger. If the endpoint with the 25-minute average of 80ms jumps to 91ms, an alert will be triggered as well.
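To see the difference between the two approaches side by side, here is a small self-contained sketch that does not need any workspace data. The endpoint labels and column names are made up for illustration; the 30ms and 80ms baselines are the figures used in this post.

```kusto
// Illustrative data only: two endpoints with the baselines discussed above
datatable(Endpoint: string, BaselineMs: real) [
    "nearby server", 30.0,
    "distant server", 80.0
]
// Percent approach: threshold scales with the baseline
| extend PercentThresholdMs = BaselineMs * 1.1
// Fixed-value approach: threshold is always baseline + 10ms
| extend FixedThresholdMs = BaselineMs + 10.0
// Margin each endpoint is allowed before an alert would fire
| extend PercentMarginMs = PercentThresholdMs - BaselineMs,
         FixedMarginMs = FixedThresholdMs - BaselineMs
```

The percent margin comes out to roughly 3ms for the nearby server and 8ms for the distant one, while the fixed margin is 10ms for both.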

To illustrate this, a hard drive free space monitor is outlined below that will alert if 10GB of data is added to a drive. Unlike the example above, this one does not use averages; it finds the minimum amount of free space reported over a given time period.

To start, two tables are created with let statements like above, this time gathering the LogicalDisk Free Megabytes counter from the Perf table. Some additional filtering also takes place to return only logical disks with drive letters; a regular expression is used for that. I also converted the megabytes value to gigabytes.

The rest is much like above. Once the two tables are created, they are joined on Computer and InstanceName, which accounts for multiple drives in each computer. The table is extended with the warnThreshold and a filter is applied so that only servers and drives with 10GB of data added over the past 5 minutes show. This, as well as the previous example, would need to run every 5 minutes to be effective. Full query is:
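A minimal sketch of such a query, reconstructed from the description above rather than copied from the original, could look like this. It assumes the standard Perf table columns (Computer, ObjectName, CounterName, InstanceName, CounterValue, TimeGenerated). The 30-minute and 5-minute windows, the warnGB name, the output column names, and the exact regular expression are assumptions carried over for illustration.

```kusto
let warnGB = 10.0;          // assumed name: alert when this much data is added
let fromTime = ago(30m);    // assumed baseline window start
let toTime = ago(5m);       // assumed baseline window end
// Baseline: minimum free space per drive from 30 to 5 minutes ago
let baseline = Perf
    | where TimeGenerated between (fromTime .. toTime)
    | where ObjectName == "LogicalDisk" and CounterName == "Free Megabytes"
    // Keep only instances that are a drive letter, such as "C:"
    | where InstanceName matches regex @"^[A-Za-z]:$"
    | extend FreeGB = CounterValue / 1024
    | summarize BaselineFreeGB = min(FreeGB) by Computer, InstanceName;
// Recent: minimum free space per drive over the past 5 minutes
let recent = Perf
    | where TimeGenerated > toTime
    | where ObjectName == "LogicalDisk" and CounterName == "Free Megabytes"
    | where InstanceName matches regex @"^[A-Za-z]:$"
    | extend FreeGB = CounterValue / 1024
    | summarize RecentFreeGB = min(FreeGB) by Computer, InstanceName;
baseline
| join kind=inner (recent) on Computer, InstanceName
// Free space must drop by more than warnGB to count
| extend warnThreshold = BaselineFreeGB - warnGB
| where RecentFreeGB < warnThreshold
| project Computer, Drive = InstanceName, BaselineFreeGB, RecentFreeGB, warnThreshold
```

Because the threshold is subtracted rather than added, the comparison flips: a row is returned when recent free space falls below the baseline minus 10GB.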