Kapacitor - The latest Kapacitor binary and installation packages can be found at the downloads page.

Terminal - The Kapacitor client application is driven from the command line, so a basic terminal is needed to issue commands.

The use case

This guide follows the classic use case of triggering an alert for high CPU usage on a server. CPU data is among the default system metrics that Telegraf generates.

The process

Install InfluxDB and Telegraf.

Start InfluxDB and send it data from Telegraf.

Install Kapacitor.

Start Kapacitor.

Define and run a stream task to trigger CPU alerts.

Define and run a batch task to trigger CPU alerts.

Installing

The TICKStack services can be installed to run on the host machine as a part of Systemd, or they can be run from Docker containers. This guide will focus on installing and running them all on the same host as Systemd services.

Next, install Telegraf using the Linux system packages (.deb, .rpm) if available.

Once Telegraf is installed and started, it will, as configured by default, send system metrics to InfluxDB, which automatically creates the ‘telegraf’ database.

The Telegraf configuration file can be found at its default location: /etc/telegraf/telegraf.conf. For this introduction it is worth noting some values that will be relevant to the Kapacitor tasks that will be shown below. Namely:

[agent].interval - declares the frequency at which system metrics will be sent to InfluxDB

[[outputs.influxdb]] - declares how to connect to InfluxDB and the destination database, which is the default ‘telegraf’ database.

[[inputs.cpu]] - declares how to collect the system cpu metrics to be sent to InfluxDB.

Example - relevant sections of /etc/telegraf/telegraf.conf

[agent]
## Default data collection interval for all inputs
interval = "10s"
...
[[outputs.influxdb]]
## The HTTP or UDP URL for your InfluxDB instance. Each item should be
## of the form:
## scheme "://" host [ ":" port]
##
## Multiple urls can be specified as part of the same cluster,
## this means that only ONE of the urls will be written to each interval.
# urls = ["udp://localhost:8089"] # UDP endpoint example
urls = ["http://localhost:8086"] # required
## The target database for metrics (telegraf will create it if not exists).
database = "telegraf" # required
...
[[inputs.cpu]]
## Whether to report per-cpu stats or not
percpu = true
## Whether to report total system cpu stats or not
totalcpu = true
## If true, collect raw CPU time metrics.
collect_cpu_time = false

InfluxDB and Telegraf are now running and listening on localhost. Wait about a minute for Telegraf to supply a small amount of system metric data to InfluxDB. Then, confirm that InfluxDB has the data that Kapacitor will use.
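One way to confirm this is to query InfluxDB over its HTTP API. The sketch below assumes InfluxDB 1.x on http://localhost:8086 with the default telegraf database. The helper name query_influx is made up for illustration and is not part of any tool.

```shell
# Assumed endpoint: InfluxDB 1.x on the local host, default port.
INFLUX_URL="http://localhost:8086"

# Hypothetical helper: run one InfluxQL query against the telegraf database.
query_influx() {
  curl -sG "$INFLUX_URL/query" \
    --data-urlencode "db=telegraf" \
    --data-urlencode "q=$1"
}

# Example call (needs a running InfluxDB):
#   query_influx 'SELECT mean(usage_idle) FROM cpu'
```

A JSON response that contains a series named cpu suggests that Telegraf data is arriving.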

Since InfluxDB is running on http://localhost:8086, Kapacitor finds it during startup and creates several subscriptions on InfluxDB.
These subscriptions tell InfluxDB to send all the data it receives to Kapacitor.

For more log data check the log file in the traditional /var/log/kapacitor directory.

The basic startup messages show Kapacitor listening on an HTTP port and posting data to InfluxDB.
At this point InfluxDB is streaming the data it is receiving from Telegraf to Kapacitor.

Triggering alerts from stream data

The TICKStack is now set up (excluding Chronograf, which is not covered here). This guide will now introduce the fundamentals of actually working with Kapacitor.

A task in Kapacitor represents an amount of work to do on a set of data. There are two types of tasks: stream and batch. A simple stream task will be used first to present core Kapacitor features. More sophisticated use cases will be presented next. Finally, the first simple use case will be covered as a batch task.

Kapacitor uses a DSL called TICKscript to define tasks.
Each TICKscript defines a pipeline that tells Kapacitor which data to process and how.

So what should Kapacitor be instructed to do?

The most common Kapacitor use case is triggering alerts. The example that follows will set up an alert on high CPU usage.
How should high CPU usage be defined? Telegraf writes a cpu metric to InfluxDB that records the percentage of time a CPU spent in an idle state. For demonstration purposes, assume that a critical alert should be triggered when idle usage drops below 70%.

A TICKscript can now be written to cover these criteria. Copy the script below into a file called cpu_alert.tick:

dbrp "telegraf"."autogen"

stream
    // Select just the cpu measurement from our example database.
    |from()
        .measurement('cpu')
    |alert()
        .crit(lambda: int("usage_idle") < 70)
        // Whenever we get an alert write it to a file.
        .log('/tmp/alerts.log')

Kapacitor has an HTTP API with which all communication happens.
The kapacitor client application exposes the API over the command line.
Now use this CLI tool to define the task and the database—including retention policy—that it can access:

kapacitor define cpu_alert -tick cpu_alert.tick

Note on declaring Database and Retention policy: As of Kapacitor 1.4 the database and retention policy to which the TICKscript will be applied can be declared using an optional statement in the script: e.g. dbrp "telegraf"."autogen". If not declared in the script, then it must be defined when the task is defined using the kapacitor flag -dbrp followed by the argument "<DBNAME>"."<RETENTION_POLICY>".
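As a minimal sketch of the second form, the command below passes the pair on the command line. It assumes the telegraf database and autogen retention policy used throughout this guide, and it only runs the define step when the kapacitor client is on the PATH.

```shell
# Database and retention policy assumed from the Telegraf defaults in this guide.
DB="telegraf"
RP="autogen"
DBRP="$DB.$RP"
echo "$DBRP"    # prints telegraf.autogen

if command -v kapacitor >/dev/null 2>&1; then
  # Only needed when the TICKscript itself has no dbrp statement.
  kapacitor define cpu_alert -tick cpu_alert.tick -dbrp "$DBRP" \
    || echo "define failed; check that Kapacitor is running" >&2
fi
```

Note that an unescaped "telegraf"."autogen" typed in a shell reaches the client as telegraf.autogen, because the shell strips the double quotes.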

However, nothing is going to happen until the task has been enabled.
Before being enabled, the task should first be tested to ensure it does not spam the log files or communication channels with alerts.
Record the current data stream for a bit and use it to test the new task:

kapacitor record stream -task cpu_alert -duration 60s

Since the task was defined with a database and retention policy pair, the recording knows to
only record data from that database and retention policy.

NOTE – troubleshooting connection refused – If, when running the record command, an error is returned of the type getsockopt: connection refused (Linux) or connectex: No connection could be made... (Windows), please ensure that the Kapacitor service is running. See the section above Installing and Starting Kapacitor. If Kapacitor is started and this error is still encountered, check the firewall settings of the host machine and ensure that port 9092 is accessible. Check as well the messages in /var/log/kapacitor/kapacitor.log. There may be an issue with the http or other configuration in /etc/kapacitor/kapacitor.conf and this will appear in the log. If the Kapacitor service is running on another host machine, set the KAPACITOR_URL environment variable in the local shell to the Kapacitor endpoint on the remote machine.
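For the remote case, a sketch of setting the variable follows. The host name kapacitor.example.com is a placeholder, and kapacitor_reachable is a made-up helper that assumes the ping endpoint of the Kapacitor HTTP API is reachable at /kapacitor/v1/ping.

```shell
# Placeholder endpoint; replace the host with the machine that runs Kapacitor.
export KAPACITOR_URL="http://kapacitor.example.com:9092"

# Hypothetical helper: succeeds if the endpoint answers within 5 seconds.
kapacitor_reachable() {
  curl -s -o /dev/null --max-time 5 "$KAPACITOR_URL/kapacitor/v1/ping"
}

# Example call:
#   kapacitor_reachable && echo "Kapacitor answered" || echo "no answer from $KAPACITOR_URL"
```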

Now grab the ID that was returned and put it in a bash variable for easy use later on (the actual UUID returned will be different):
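A sketch of that step is shown here. The all-zero value is only a placeholder for the ID printed by the record command, and looks_like_uuid is a made-up helper that guards against pasting something else by mistake.

```shell
# Placeholder: paste the ID that the record command printed.
rid="00000000-0000-0000-0000-000000000000"

# Hypothetical helper: succeeds if the argument has the shape of a UUID.
looks_like_uuid() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

if looks_like_uuid "$rid" && command -v kapacitor >/dev/null 2>&1; then
  # Lists the recording, including its size.
  kapacitor list recordings "$rid" || echo "could not list recording $rid" >&2
fi
```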

As long as the size is more than a few bytes it is certain that some data was captured.
If Kapacitor is not receiving data yet, check each layer: Telegraf → InfluxDB → Kapacitor.
Telegraf will log errors if it cannot communicate to InfluxDB.
InfluxDB will log an error about connection refused if it cannot send data to Kapacitor.
Run the query SHOW SUBSCRIPTIONS to find the endpoint that InfluxDB is using to send data to Kapacitor.
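One way to run that query is with the influx command line client, assuming the InfluxDB 1.x client is installed on the same host. The variable name below is chosen only for illustration.

```shell
SUBSCRIPTION_QUERY="SHOW SUBSCRIPTIONS"

if command -v influx >/dev/null 2>&1; then
  # Prints one row per subscription, including its destination endpoint.
  influx -execute "$SUBSCRIPTION_QUERY" || echo "query failed; is InfluxDB running?" >&2
else
  echo "influx client not found on PATH" >&2
fi
```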

With a snapshot of data recorded from the stream, that data can then be replayed to the new task.
The replay action replays data only to a specific task.
This way the task can be tested in complete isolation:

kapacitor replay -recording $rid -task cpu_alert

Since the data has already been recorded, it can be replayed as fast as possible instead of waiting for real time to pass.
When the flag -real-clock is set, the data will be replayed by waiting for the deltas between the timestamps to pass, though the result is identical whether real time passes or not. This is because time is measured on each node by the data points it receives.

Check the log using the command below.

sudo cat /tmp/alerts.log

Were any alerts received? Depending on how busy the host machine was, maybe not.
If so, the file contains lines of JSON, where each line represents one alert.
Each JSON line contains the alert level and the data that triggered the alert.
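To get a quick summary instead of reading the file line by line, the alerts can be counted per level. The sketch below assumes each JSON line carries a key of the form "level":"CRITICAL". The helper count_alerts and the sample lines are made up and only mimic the shape of real entries.

```shell
# Hypothetical helper: count alerts of a given level in a file of JSON lines.
count_alerts() {
  level=$1
  file=$2
  grep -c "\"level\":\"$level\"" "$file"
}

# Made-up sample lines standing in for /tmp/alerts.log.
sample=$(mktemp)
printf '%s\n' \
  '{"id":"cpu:nil","level":"CRITICAL","data":{}}' \
  '{"id":"cpu:nil","level":"OK","data":{}}' \
  '{"id":"cpu:nil","level":"CRITICAL","data":{}}' > "$sample"

count_alerts CRITICAL "$sample"   # prints 2
```

Against the real log the call would be count_alerts CRITICAL /tmp/alerts.log, run with sufficient permissions to read the file.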

The task can be modified to be really sensitive to ensure the alerts are working.
In the TICKscript change the lambda function .crit(lambda: int("usage_idle") < 70) to .crit(lambda: int("usage_idle") < 100), and define the task once more.

Any time a task needs to be updated, change the TICKscript and then run the define command again with just the TASK_NAME and -tick arguments:

kapacitor define cpu_alert -tick cpu_alert.tick

Now every data point that was received during the recording will trigger an alert.
Replay it again and verify the results.

kapacitor replay -recording $rid -task cpu_alert

Once the alerts.log results verify that it is working, change the usage_idle threshold back to a more reasonable level and redefine the task once more using the define command as shown above.

Enable the task, so it can start processing the live data stream, with:

kapacitor enable cpu_alert

Now alerts will be written to the log in real time.

To see that the task is receiving data and behaving as expected run the show command once again to get more information about it:
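A sketch of that call, guarded so that it only runs when the kapacitor client is available. The variable name TASK is chosen for illustration.

```shell
TASK="cpu_alert"

if command -v kapacitor >/dev/null 2>&1; then
  # Prints the task state, the stored TICKscript and the DOT graph.
  kapacitor show "$TASK" || echo "could not show task $TASK" >&2
else
  echo "kapacitor client not found on PATH" >&2
fi
```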

The first part has information about the state of the task and any error it may have encountered.
The TICKscript section displays the version of the TICKscript that Kapacitor has stored in its local database.

The last section, DOT, is a Graphviz DOT-formatted tree that contains information about the data processing pipeline defined by the TICKscript. Each node entry holds key-value statistics about that node, and each edge to the next node holds its own key-value statistics. The processed key on an edge indicates the number of data points that have passed along that edge of the graph.
For example, in the above, the stream0 node (aka the stream var from the TICKscript) has sent 12 points to the from1 node.
The from1 node has also sent 12 points on to the alert2 node. Since Telegraf is configured to send cpu data, all 12 points match the from/measurement criteria of the from1 node and are passed on.

NOTE: When installing graphviz on Debian or RedHat (if not already installed) use the package provided by the OS provider. The packages offered in the download section of the graphviz site are not up-to-date.

Now that the task is running with live data, here is a quick hack to use 100% of one core to generate some artificial cpu activity:

while true; do i=0; done
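The loop above runs until it is interrupted. A bounded variant is sketched here, assuming timeout from GNU coreutils is available. The helper name burn_cpu is made up for illustration.

```shell
# Hypothetical helper: keep one core busy for the given number of seconds.
burn_cpu() {
  timeout "$1" sh -c 'while true; do i=0; done'
  status=$?
  # timeout exits with 124 when it had to stop the loop, the expected case here.
  [ "$status" -eq 124 ]
}

burn_cpu 2 && echo "load finished"
```

For a real test a duration of a few minutes gives the alert time to fire, for example burn_cpu 180.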

There are plenty of ways to get a threshold alert. So, why all this pipeline TICKscript stuff?
In short because TICKscripts can quickly be extended to become much more powerful.

Gotcha - single versus double quotes

Single quotes and double quotes in TICKscripts do very different things:

The result of this search will always be empty, because double quotes were used around "server1". This means that Kapacitor will search for a series where the field "host" is equal to the value held in the field "server1". This is probably not what was intended. More likely the intention was to search for a series where the tag "host" has the value 'server1', so single quotes should be used. Double quotes denote data fields, single quotes denote string values. To match the value, the TICKscript above should look like this:
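A sketch of the two forms side by side, written to a file named quotes_example.tick (a name chosen only for illustration). It assumes host is a tag on the cpu measurement and server1 is one of its values. The double-quoted form is left as a comment for comparison.

```shell
cat > quotes_example.tick <<'EOF'
dbrp "telegraf"."autogen"

stream
    |from()
        .measurement('cpu')
        // Double quotes: compares "host" with the value of a field named "server1".
        // .where(lambda: "host" == "server1")
        // Single quotes: compares "host" with the string literal 'server1'.
        .where(lambda: "host" == 'server1')
    |alert()
        .crit(lambda: int("usage_idle") < 70)
        .log('/tmp/alerts.log')
EOF
```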

Extending TICKscripts

The TICKscript below will compute the running mean and compare current values to it.
It will then trigger an alert if the values are more than 3 standard deviations away from the mean.
Replace the cpu_alert.tick script with the TICKscript below:
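A minimal sketch of such a script, written over cpu_alert.tick with a heredoc. It relies on the TICKscript sigma() function, which reports how many standard deviations a value lies from the running mean. The threshold of 3 comes from the description above.

```shell
cat > cpu_alert.tick <<'EOF'
dbrp "telegraf"."autogen"

stream
    |from()
        .measurement('cpu')
    |alert()
        // Compare values to the running mean and standard deviation.
        .crit(lambda: sigma("usage_idle") > 3)
        .log('/tmp/alerts.log')
EOF
```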

Just like that, a dynamic threshold can be created, and, if cpu usage drops in the day or spikes at night, an alert will be issued.
Try it out.
Use define to update the task TICKscript.

kapacitor define cpu_alert -tick cpu_alert.tick

NOTE: If a task is already enabled, redefining the task with the define command will automatically reload it.
To define a task without reloading it, use -no-reload.

Now tail the alert log:

sudo tail -f /tmp/alerts.log

There should not be any alerts triggering just yet.
Next, start a while loop to add some load:

while true; do i=0; done

An alert trigger should be written to the log shortly, once enough artificial load has been created.
Leave the loop running for a few minutes.
After canceling the loop, another alert should be issued indicating that cpu usage has again changed.
Using this technique, alerts can be generated for the rising and falling edges of cpu usage, as well as any outliers.

A real world example

Now that the basics have been covered, here is a more real world example.
Once the metrics from several hosts are streaming to Kapacitor, it is possible to do something like this: aggregate and group the cpu usage for each service running in each datacenter, and then trigger an alert based on the 95th percentile.
In addition to just writing the alert to a log, Kapacitor can integrate with third-party utilities: currently Slack, PagerDuty, HipChat, VictorOps and more are supported. The alert can also be sent by email, posted to a custom endpoint, or used to trigger the execution of a custom script.
Custom message formats can also be defined so that alerts have the right context and meaning.
The TICKscript for this would look like the following example.

Example - TICKscript for stream on multiple service cpus and alert on 95th percentile
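A trimmed sketch of such a script follows, written to a file named service_cpu_alert.tick (a name chosen only for illustration). It assumes the incoming points carry service and datacenter tags and that Slack has been configured in kapacitor.conf. The warning threshold of 2.5 and the #alerts channel are illustrative values.

```shell
cat > service_cpu_alert.tick <<'EOF'
dbrp "telegraf"."autogen"

stream
    |from()
        .measurement('cpu')
    // Create a new field called 'used' which inverts the idle cpu.
    |eval(lambda: 100.0 - "usage_idle")
        .as('used')
    |groupBy('service', 'datacenter')
    |window()
        .period(1m)
        .every(1m)
    // Calculate the 95th percentile of the used cpu.
    |percentile('used', 95.0)
    |eval(lambda: sigma("percentile"))
        .as('sigma')
        .keep('percentile', 'sigma')
    |alert()
        .id('{{ .Name }}/{{ index .Tags "service" }}/{{ index .Tags "datacenter"}}')
        .message('{{ .ID }} is {{ .Level }} cpu-95th:{{ index .Fields "percentile" }}')
        // Compare values to the running mean and standard deviation.
        .warn(lambda: "sigma" > 2.5)
        .crit(lambda: "sigma" > 3.0)
        .log('/tmp/alerts.log')
        // Send alerts to Slack (needs Slack configured in kapacitor.conf).
        .slack()
        .channel('#alerts')
EOF
```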

Something as simple as defining an alert can quickly be extended to apply to a much larger scope.
With the above script, an alert will be triggered if any service in any datacenter deviates more than 3 standard deviations from normal behavior, as defined by the historical 95th percentile of cpu usage, and it will do so within 1 minute.

Triggering alerts from batch data

Instead of just processing the data in streams, Kapacitor can also periodically query
InfluxDB and then process that data in batches.
While triggering an alert based off cpu usage is more suited for the streaming case, the basic idea
of how batch tasks work is demonstrated here by following the same use case.

This TICKscript does roughly the same thing as the earlier stream task, but as a batch task:
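A sketch of the batch variant and of the commands that define it and record its data. It assumes a 5 minute query period, the file name batch_cpu_alert.tick and the log path /tmp/batch_alerts.log, none of which are fixed by Kapacitor. The kapacitor commands only run when the client is on the PATH.

```shell
cat > batch_cpu_alert.tick <<'EOF'
dbrp "telegraf"."autogen"

batch
    |query('''
        SELECT mean(usage_idle)
        FROM "telegraf"."autogen"."cpu"
    ''')
        .period(5m)
        .every(5m)
        .groupBy(time(1m), 'cpu')
    |alert()
        .crit(lambda: "mean" < 70)
        .log('/tmp/batch_alerts.log')
EOF

if command -v kapacitor >/dev/null 2>&1; then
  kapacitor define batch_cpu_alert -tick batch_cpu_alert.tick \
    || echo "define failed; check that Kapacitor is running" >&2
  # Record the batches the task's query yields for the past 20 minutes.
  rid=$(kapacitor record batch -task batch_cpu_alert -past 20m)
  echo "$rid"
fi
```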

This will record the last 20 minutes of batches using the query in the batch_cpu_alert task.
In this case, since the period is 5 minutes, the last 4 batches will be saved in the recording.

The batch recording can be replayed in the same way:

kapacitor replay -recording $rid -task batch_cpu_alert

Check the alert log to make sure alerts were generated as expected.
The sigma based alert above can also be adapted for working with batch data.
Play around and get comfortable with updating, testing, and running tasks in Kapacitor.

Loading Tasks with the Kapacitor daemon

It is also possible to save TICKscripts in a load directory declared in
kapacitor.conf. In this way tasks and task templates can be loaded and
enabled directly with the Kapacitor daemon, when it boots. Such scripts must
include the database and retention policy declaration dbrp.
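A sketch of placing a script in the load directory follows. It assumes the directory is /etc/kapacitor/load and that task scripts belong in a tasks subdirectory. Both should be checked against the [load] section of the local kapacitor.conf. The helper install_task is made up for illustration.

```shell
# Assumed default; the real value is the dir setting under [load] in kapacitor.conf.
LOAD_DIR="${LOAD_DIR:-/etc/kapacitor/load}"

# Hypothetical helper: copy a TICKscript into the tasks subdirectory.
install_task() {
  script=$1
  # Refuse scripts without a dbrp declaration, since loaded tasks must include one.
  if ! grep -q '^dbrp ' "$script"; then
    echo "missing dbrp declaration: $script" >&2
    return 1
  fi
  mkdir -p "$LOAD_DIR/tasks" && cp "$script" "$LOAD_DIR/tasks/"
}

# Example call (usually needs root for /etc/kapacitor):
#   install_task cpu_alert.tick
```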