In the Graphite Series blog posts, I'll provide a guide through all of the steps involved in setting up a monitoring and alerting system using a Graphite stack. Disclaimer: I am no expert, I am just trying to help the Graphite community by providing more detailed documentation. If there's something wrong, please comment below or drop me an email at feangulo@yaipan.com.

In the previous blog posts, we've learned how to set up Carbon (caches) and Whisper, publish metrics and visualize the information and the behavior of the Carbon processes. In this blog post, I'll present another feature of Carbon - the aggregator.

The Carbon Aggregator

Carbon aggregators buffer metrics over time before reporting them into Whisper. For example, let's imagine that you have 10 application servers reporting the number of requests received every 10 seconds:
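The per-server metric names might look like the following (the `ip-N` host names are hypothetical, chosen only for illustration):

```
PRODUCTION.host.ip-0.requests.m1_rate
PRODUCTION.host.ip-1.requests.m1_rate
...
PRODUCTION.host.ip-9.requests.m1_rate
```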

Per-server data points like these can be very insightful. For example, you may verify whether the load balancer is actually functioning correctly and distributing the load evenly across your servers.

However, other times you are only interested in the total number of requests received by all your application servers. This could easily be done by applying a Graphite function on your metrics.

sumSeries(PRODUCTION.host.*.requests.m1_rate)

The problem with this approach is that this operation is expensive. In order to render this graph we first need to read the 10 different metrics from their corresponding Whisper files, then we need to combine the results by applying the specified function, and finally build the graph. If we know that this is something we will always be interested in visualizing, we could benefit by precomputing the values.
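To make the render-time cost concrete, here is a minimal Python sketch of what a function like sumSeries conceptually does: read every matching series and combine them point by point. The host names and values are made up for illustration.

```python
# Ten hypothetical per-server series, each holding three aligned data points.
series = {
    f"PRODUCTION.host.ip-{i}.requests.m1_rate": [100.0, 110.0, 90.0]
    for i in range(10)
}

def sum_series(all_series):
    """Combine multiple series into one by summing aligned data points,
    conceptually what sumSeries does at render time."""
    return [sum(points) for points in zip(*all_series.values())]

total = sum_series(series)
print(total)  # one combined series: each point is the sum across all 10 hosts
```

Every render repeats this read-and-combine work; precomputing the aggregate stores `total` once so the graph only has to read a single series.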

To precompute the values, we can define a rule that matches metrics on a regular expression, buffers them for a specified amount of time, applies a function on the buffered data, and stores the result in a separate Whisper metric file. In our example, we would need the following:

Metric matching rule: PRODUCTION.host.*.requests.m1_rate

Buffering time interval: 60 seconds

Aggregation function: sum

Output metric: PRODUCTION.host.all.requests.m1_rate

The per-server metrics are reported every 10 seconds in our environment. Given this configuration, metrics will be buffered for 6 publishing intervals, combined using the sum function, and stored in the output Whisper metric file. Finally, we can build a graph by querying the aggregate metric data.
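The buffering arithmetic above can be sketched in a few lines of Python. This is not the aggregator's actual implementation, just a toy model of bucketing incoming data points into 60-second windows:

```python
from collections import defaultdict

PUBLISH_INTERVAL = 10   # seconds between per-server reports
BUFFER_INTERVAL = 60    # the rule's buffering time interval

buffers = defaultdict(list)

def receive(metric, value, timestamp):
    """Place each incoming data point into its 60-second bucket."""
    bucket = timestamp - (timestamp % BUFFER_INTERVAL)
    buffers[bucket].append(value)

# One host reporting every 10 seconds for one minute.
for t in range(0, 60, PUBLISH_INTERVAL):
    receive("PRODUCTION.host.ip-0.requests.m1_rate", 5.0, t)

# 6 publishing intervals land in the first bucket; sum is flushed at interval end.
print(len(buffers[0]), sum(buffers[0]))
```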

The Carbon Process Stack

The Carbon aggregators can be configured to run in front of the Carbon caches. Incoming metrics can be received by the aggregators and then passed along to the caches.

The Carbon Cache

Refer to the Carbon & Whisper blog post for instructions on how to configure and run a Carbon cache. In my environment I have a cache with the following configuration:

My Carbon cache process' Pickle receiver port is set to the default (2004). Therefore, I could start up an aggregator process with the default configuration and it would be able to communicate with my cache process.
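Assuming a default Graphite installation under /opt/graphite (an assumption about your install path), the aggregator can be started with its bundled script:

```
cd /opt/graphite/bin
./carbon-aggregator.py start
```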

The aggregation-rules configuration file is composed of multiple lines specifying the metrics that need to be aggregated and how they should be aggregated. The form of each line should be:

output_template (buffering_time_interval) = function input_pattern

This will capture any received metrics that match the input_pattern for calculating an aggregate metric. The calculation will occur every buffering_time_interval seconds and the function applied can either be sum or avg. The name of the aggregate metric will be derived from the output_template, filling in any captured fields from the input_pattern. Using the example at the beginning of this blog post, we could build the following aggregation rule:
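Plugging the rule parameters listed earlier (pattern, 60-second interval, sum function, output metric) into that format, the aggregation-rules configuration entry would look like this:

```
PRODUCTION.host.all.requests.m1_rate (60) = sum PRODUCTION.host.*.requests.m1_rate
```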

Due to the nature of the metrics that I publish, I know that all metrics will begin with the environment, followed by the host string and the corresponding host name. The rest of the metric string corresponds to the actual metric name. The following is a breakdown of the incoming metric and the resulting aggregate metric as they go through the above aggregation rule.
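For a single hypothetical host (`ip-0` is an invented name), the breakdown looks like this:

```
incoming metric:   PRODUCTION.host.ip-0.requests.m1_rate
matching pattern:  PRODUCTION.host.*.requests.m1_rate
aggregate metric:  PRODUCTION.host.all.requests.m1_rate
```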

At this point you have a Carbon aggregator process running with a single aggregation rule, sending data points to a Carbon cache. We can now start publishing data points to observe the behavior.

Aggregate The Data

In the previous blog post, we used the Stresser application to publish metrics to a Carbon cache. With some simple parameter modifications, we can configure the Stresser to publish metrics to a Carbon aggregator and simulate metric publishing from multiple hosts in order to test the aggregation functionality. Use the following configuration:
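In the same spirit, a small Python sketch can simulate several hosts publishing to the aggregator over Graphite's plaintext protocol. The host names are hypothetical, and the port assumes the aggregator's default line receiver (2023); adjust both to match your setup.

```python
import socket
import time

AGGREGATOR_HOST = "localhost"  # assumption: aggregator runs locally
AGGREGATOR_PORT = 2023         # assumption: default aggregator line receiver port

def format_metric(path, value, timestamp):
    """Render one data point in Graphite's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"

def publish(lines):
    """Send a batch of plaintext data points to the aggregator."""
    with socket.create_connection((AGGREGATOR_HOST, AGGREGATOR_PORT)) as sock:
        sock.sendall("".join(lines).encode("ascii"))

now = time.time()
# Simulate 10 hosts each reporting a one-minute request rate.
lines = [
    format_metric(f"PRODUCTION.host.ip-{i}.requests.m1_rate", 5.0, now)
    for i in range(10)
]
# publish(lines)  # uncomment once the aggregator is up and listening
print(lines[0])
```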