Basic Alerts

Combine metrics

Rule

In this alert we use the slim (session limit) metric and compare it to the current number of sessions on each frontend. Because we can combine metrics like this, we can easily compute a utilization percentage even though that metric doesn't exist in haproxy's CSV stats.

alert haproxy_session_limit {
macro = host_based
template = generic
$notes = This alert monitors the percentage of sessions against the session limit in haproxy (maxconn) and alerts when we are getting close to that limit and will need to raise it. This alert was created after a socket outage we experienced for that reason.
$current_sessions = max(q("sum:haproxy.frontend.scur{host=*,pxname=*,tier=*}", "5m", ""))
$session_limit = max(q("sum:haproxy.frontend.slim{host=*,pxname=*,tier=*}", "5m", ""))
$q = ($current_sessions / $session_limit) * 100
warn = $q > 80
crit = $q > 95
}

Consistently in a certain state

Some metrics represent booleans (0 for false, 1 for true). If we take a time series of such a metric and its min over a window is 1, we know it has been true for the entire duration. So the following lets us know if puppet has been left disabled for more than 24 hours:
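A sketch of what such a rule might look like (the metric name puppet.disabled, and the assumption that it is 1 while puppet is disabled, are illustrative):

alert puppet.left.disabled {
    macro = host_based
    template = generic
    $notes = Puppet has been disabled on this host for more than 24 hours
    # puppet.disabled is assumed to report 1 while puppet is disabled, 0 otherwise.
    # min over 24h is only > 0 if it was disabled for the entire window.
    $disabled = min(q("sum:puppet.disabled{host=*}", "24h", ""))
    warn = $disabled > 0
}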

Macro that establishes contacts based on host name

This is an example of one of our basic alerts at Stack Exchange. We have an IT and an SRE team, so for host-based alerts we make sure the appropriate team is alerted for its hosts using our macro and lookup functionality. Macros reduce repetition in alert definitions. The lookup table is like a case statement that lets you change values based on the instance of the alert. The generic template is meant for when warn and crit use basically the same expression with different thresholds. Templates can include other templates, so we build reusable components that we may want to include in other alerts.
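A minimal sketch of the pattern (the hostname patterns and notification names here are hypothetical; the real lookup table would enumerate your own host naming conventions):

lookup host_base_contact {
    # Hostname patterns and notification names below are hypothetical
    entry host=it-* {
        main_contact = it
    }
    entry host=* {
        main_contact = sre
    }
}

macro host_based {
    warnNotification = lookup("host_base_contact", "main_contact")
    critNotification = lookup("host_base_contact", "main_contact")
}

Any alert that sets macro = host_based then routes its notifications based on the host tag of each alert instance.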

Graphite: Verify available cluster cpu capacity.

This rule checks the current CPU capacity (average per core, over the last 5 minutes) for each host in a cluster.

We warn if more than 80% of systems have less than 20% capacity available, or if there’s less than 200% in total in the cluster, irrespective of how many hosts there are and where the spare capacity is. Critical is for 90% and 50% respectively.

By checking one value that scales with the size of the cluster and one that sets an absolute lower limit, we can cover more ground. This is, of course, specific to the context of the environment and usage patterns. Note that via groupByNode() the time series returned by Graphite only have the hostname in them.

To provide a bit more context to the operator, we also plot the last 10 hours on a graph in the alert.
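A sketch of the rule under assumed names (the Graphite path, the groupByNode position, and an idle-percent metric scaled 0-100 per core are all assumptions; the format argument "host" maps the remaining node to a Bosun host tag):

alert cluster.cpu.capacity {
    template = generic
    # Graphite path and node index are assumptions for illustration
    $idle = graphite("groupByNode(collectd.cluster1.*.cpu.idle, 2, 'averageSeries')", "5m", "", "host")
    $avail = avg($idle)
    # Transpose the per-host numbers so we can reduce across the cluster
    $num_hosts = len(t($avail, ""))
    $num_low = sum(t($avail < 20, ""))
    $total_avail = sum(t($avail, ""))
    # Warn: >80% of hosts below 20% capacity, or <200% spare in total
    warn = ($num_low / $num_hosts) > 0.8 || $total_avail < 200
    # Crit: >90% of hosts below 20% capacity, or <50% spare in total
    crit = ($num_low / $num_hosts) > 0.9 || $total_avail < 50
}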

Forecasting Alerts

Forecast Disk space

This alert mixes thresholds and forecasting to trigger alerts based on disk space. This can be very useful because it can warn about a disk that is on track to fill up before it is too late to fix the issue. It is combined with a threshold-based alert because a good general rule is to eliminate duplicate notifications / alerts on the same object. So these are applied and tuned by the operator and are not auto-magic.

Once we have string support for lookup tables, the duration that the forecast acts on can be tuned per host when relevant (some disks will have longer or shorter periodicity).

The forecastlr function returns the number of seconds until the specified value will be reached, according to a linear regression. It is a fairly naive way of forecasting, but it has been effective. There is also no reason we can't extend Bosun to include more advanced forecasting functions.
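A sketch combining both approaches (the metric name os.disk.fs.percent_free and the thresholds are assumptions; forecastlr here asks how many seconds until free space reaches 0):

alert forecast.disk.space {
    macro = host_based
    template = generic
    # Metric name is an assumption; percent_free assumed to range 0-100
    $filter = host=*,disk=*
    # Seconds until a linear regression over the last week hits 0, in days
    $days_to_zero = forecastlr(q("avg:os.disk.fs.percent_free{$filter}", "7d", ""), 0) / 60 / 60 / 24
    # Plain threshold on current free space, combined into the same alert
    # so the same disk doesn't generate duplicate notifications
    $percent_free = last(q("avg:os.disk.fs.percent_free{$filter}", "5m", ""))
    warn = ($days_to_zero < 7 && $days_to_zero > 0) || $percent_free < 10
    crit = ($days_to_zero < 2 && $days_to_zero > 0) || $percent_free < 5
}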

Anomalous Alerts

The idea of an anomalous alert in Bosun is that a deviation from the norm can be detected without having to set static thresholds for everything. This can be very useful when the amount of data makes it infeasible to set thresholds manually. From what I have seen and been told, attempts to fully automate this are noisy. So Bosun doesn't just have an "anomalous" function; rather, you can query history and do various comparisons with that data.

Anomalous response per route

At Stack Exchange we send the web route that was hit to haproxy so that it gets logged (it is removed before the response is sent to the client). With over a thousand routes, static thresholds are not feasible. So this alert looks at history using the band function and compares it to current performance. An alert is then triggered if the route makes up more than 1% of our total hits and has gotten slower or faster by more than 10 milliseconds.
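A sketch of the shape of such a rule (metric and tag names are hypothetical; tr is assumed to be response time in milliseconds):

alert route.performance.anomalous {
    template = generic
    # Metric names are assumptions for illustration
    $route_time = "sum:haproxy.route.tr{route=*}"
    $route_hits = "sum:haproxy.route.hits{route=*}"
    # band grabs the same one-hour window from each of the past 3 days
    $past = avg(band($route_time, "1h", "1d", 3))
    $current = avg(q($route_time, "1h", ""))
    # Share of total traffic per route; the untagged total joins
    # against every route via Bosun's subset-group join rules
    $hits = sum(q($route_hits, "1h", ""))
    $total = sum(t($hits, ""))
    $hit_percent = ($hits / $total) * 100
    warn = $hit_percent > 1 && abs($current - $past) > 10
}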

Graphite: Anomalous traffic volume per country

At Vimeo, we use statsdaemon to track web requests/s. Here we’ll use metrics which describe the web traffic on a per server basis (we get them summed together into one series from Graphite via the sum function), and we also use a set of metrics that are already aggregated across all servers, but broken down per country.
In the alert we leverage the banding functionality to get the same timeframe from past weeks. We then verify that the median of the total web traffic for this period is not significantly (20% or more) lower than the median of past periods. On a per-country basis, we verify that the current median is not 3 or more standard deviations below the past median. Note also that we count how many countries have issues and use that to decide whether the alert is critical or a warning.
Note that the screenshot below has been modified. The countries and values are not accurate.
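The checks above might be sketched roughly like this (the Graphite paths and thresholds are assumptions; graphiteBand fetches the same window from past weekly periods):

alert traffic.volume.anomalous {
    template = generic
    # Graphite paths are assumptions for illustration
    $total_q = "sumSeries(stats.*.web.requests)"
    $country_q = "aliasByNode(stats.web.requests_by_country.*, 3)"
    # Total traffic: current hour vs. the same hour in the past 3 weeks
    $total_now = median(graphite($total_q, "1h", "", ""))
    $total_past = median(graphiteBand($total_q, "1h", "1w", "", 3))
    # Per-country: flag countries 3+ standard deviations below past median
    $country_now = median(graphite($country_q, "1h", "", "country"))
    $country_hist = graphiteBand($country_q, "1h", "1w", "country", 3)
    $country_bad = $country_now < (median($country_hist) - 3 * dev($country_hist))
    $num_bad = sum(t($country_bad, ""))
    warn = $total_now < 0.8 * $total_past || $num_bad > 0
    crit = $num_bad > 2
}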

Alerts with Notification Breakdowns

A common pattern is to trigger an alert on a scope that covers multiple metrics
(or hosts, services, etc.), but then to send more detailed information in the
notification. This is useful when you notice that certain failures tend to go
together.

Linux TCP Stack Alert

When alerting on issues with the Linux TCP stack on a host, you probably don’t
want N alerts about all TCP stats, but just one alert that shows breakdowns:
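A sketch of the pattern (the metric names loosely follow scollector's linux.net.stat.tcp.* naming, but the exact set of stats and the thresholds here are illustrative):

alert linux.tcp {
    macro = host_based
    template = linux.tcp
    $time = "10m"
    # Illustrative subset of TCP stats; thresholds are assumptions
    $abort_failed = max(q("sum:rate:linux.net.stat.tcp.abortfailed{host=*}", $time, ""))
    $retran_segs = max(q("sum:rate:linux.net.stat.tcp.retranssegs{host=*}", $time, ""))
    $listen_drops = max(q("sum:rate:linux.net.stat.tcp.listendrops{host=*}", $time, ""))
    # One warn expression covers everything; the template then breaks
    # down which of the individual variables is elevated
    warn = $abort_failed > 1 || $retran_segs > 100 || $listen_drops > 1
}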

Dell Hardware Alert

No need for N hardware alerts when hardware is bad; just send one notification with the information needed. Metadata (generally string information) can be reported to Bosun, and scollector reports it. Therefore in the notification we can include things like the model and the service tag.

Rule

alert hardware {
macro = host_based
template = hardware
$time = "30m"
$notes = This alert triggers on omreport's "system" status, which *should* be a rollup of the entire system state. So it is possible to see an "Overall System status" of bad even though the breakdowns below are all "Ok". If this is the case, look directly in OMSA using the link below
#By Component
$power = max(q("sum:hw.ps{host=*,id=*}", $time, ""))
$battery = max(q("sum:hw.storage.battery{host=*,id=*}", $time, ""))
$controller = max(q("sum:hw.storage.controller{host=*,id=*}", $time, ""))
$enclosure = max(q("sum:hw.storage.enclosure{host=*,id=*}", $time, ""))
$physical_disk = max(q("sum:hw.storage.pdisk{host=*,id=*}", $time, ""))
$virtual_disk = max(q("sum:hw.storage.vdisk{host=*,id=*}", $time, ""))
#I believe the system should report a non-zero status if anything is bad
#(omreport system), so everything else is for notification purposes. This works out
#because not everything has all these components
$system = max(q("sum:hw.system{host=*,component=*}", $time, ""))
#Component Summary Per Host
$s_power = sum(t($power, "host"))
$s_battery = sum(t($battery, "host"))
warn = $system
}

Backup – Advanced Grouping

This shows the state of backups based on multiple conditions and multiple metrics. It also simulates a left join operation by substituting NaN values with numbers via the nv function in the rule, and by using the LeftJoin template function. This stretches Bosun when it comes to grouping. Generally you might want to capture this sort of logic in your collector when going to these extremes, but this shows that you don't have to be limited to that:
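The nv part of the trick can be sketched in isolation (metric names here are hypothetical; nv sets the NaN default a numberSet uses during binary operations, so hosts missing one metric still join against the others):

# Hypothetical backup metrics for illustration
$attempts = max(q("sum:backup.attempts{host=*}", "24h", ""))
# Without nv, a host with attempts but no success datapoints would be
# dropped by the join; nv makes it join with a default of 0 instead
$successes = nv(max(q("sum:backup.successes{host=*}", "24h", "")), 0)
$failed = $attempts - $successes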

Conditional Alerts

Swapping notification unless there is a high exim mail queue

This alert makes it so that swapping won't trigger a notification if there is a high exim mail queue on the host that is swapping. All operators (such as &&) perform a join, so if there is no exim mailq metric for a host, the result is the same as if the mail queue were high.
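One way to express this (metric names and thresholds are assumptions; here nv makes the "no exim metric" case explicit by defaulting the missing value to 1, i.e. treated as a high queue):

alert linux.swapping {
    macro = host_based
    template = generic
    # Metric names are assumptions; adjust to your collector's naming
    $swapping = max(q("sum:rate:linux.mem.pswp{host=*,direction=in}", "2h", ""))
    # Hosts with no exim metric default to 1 ("high") via nv, so the
    # join does not silently drop them
    $mail_q_high = nv(max(q("sum:exim.mailq_count{host=*}", "2h", "")) > 5000, 1)
    warn = $swapping > 500 && !$mail_q_high
}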