Don’t use Timers with exponentially-decaying reservoirs in Graphite

A common error when using the Metrics library is to record Timer metrics on things like API calls, using the default settings, then to publish those to a time-series store like Graphite. Here’s why this is a problem.

‘A histogram with an exponentially decaying reservoir produces quantiles which are representative of (roughly) the last five minutes of data. It does so by using a forward-decaying priority reservoir with an exponential weighting towards newer data. Unlike the uniform reservoir, an exponentially decaying reservoir represents recent data, allowing you to know very quickly if the distribution of the data has changed.’

This is more-or-less correct — but the key phrase is ‘roughly’. In reality, if
the frequency of updates to such a timer drops off, it could take a lot longer,
and if you stop updating a timer which uses this reservoir type, it’ll never decay
at all. The GraphiteReporter will
dutifully capture the percentiles, min, max, etc. from that timer’s
reservoir every minute thereafter, and record those to Graphite using the
current timestamp — even though the data it was derived from is becoming more
and more ancient.

Here’s a demo. Note the long stretch of 800ms 99th-percentile latencies on the
green line in the middle of this chart:

However, the blue line displays the number of events. As you can see, there were
no calls to this API for that 8-hour period — this one was a test system, and
the user population was safely at home, in bed. So while Graphite is claiming that
there’s an 800ms latency at 7am, in reality the 800ms-latency event occurred 8
hours previously.

I observed the same thing in our production systems for various APIs which suffered
variable invocation rates; if rates dropped off during normal operation, the
high-percentile latencies hung around for far longer than they should have. This
is quite misleading when you’re looking at a graph for 10pm and seeing a high
99th-percentile latency, when the actual high-latency event occurred hours
earlier. On several occasions, this caused lots of user confusion and FUD with our production
monitoring, so we needed to fix it.

Here are some potential fixes.

Modify ExponentiallyDecayingReservoir to also call rescaleIfNeeded() inside getSnapshot() — but
based on this discussion, it appears the current behaviour is intended (at least for the mean measurement), so that may not be acceptable. Another risk of this is that it leaves us in a position where the percentiles displayed for time T may actually have occurred several minutes prior to that, which is still misleading (albeit less so).

Switch to sliding time window reservoirs, but those are unbounded in size — so a timer on an unexpectedly-popular API could create GC pressure and out-of-memory scenarios. It’s also the slowest reservoir type, according to the docs. That made it too risky for us to adopt in our production code as a general-purpose Timer implementation.

Update, Dec 2017: as of version 3.2.3 of Dropwizard Metrics, there is a new SlidingTimeWindowArrayReservoir reservoir implementation, which is a drop-in replacement for SlidingTimeWindowReservoir, with much more acceptable memory footprint and GC impact. It costs roughly 128 bits per stored measurement, and is therefore judged to be ‘comparable with ExponentiallyDecayingReservoir in terms of GC overhead and performance’. (thanks to Bogdan Storozhuk for the tip)

What we eventually did in our code was to use this Reporter class instead of GraphiteReporter; it clears all Timer metrics’ reservoirs after each write to Graphite. This is dumb and dirty, reaching across logical class boundaries, but at the same time it’s simple and comprehensible behaviour: with this, we can guarantee that the percentile/min/max data recorded at timestamp T is measuring events in that timestamp’s 1-minute window — not any time before that. This is exactly what you want to see in a time-series graph like those in Graphite, so is a very valuable feature for our metrics, and one that others have noted to be important in comparable scenarios elsewhere.

Here’s an example of what a graph like the above should look like (captured from our current staging stack):

Note that when there are no invocations, the reported 99th-percentile latency is 0,
and each measurement doesn’t stick around after its 1-minute slot.

Another potential bug fix for a related issue, would be to add support to Metrics so that it can use Gil Tene’s LatencyUtils package, and its HdrHistogram class, as a reservoir. (Update: however, I don’t think this would address the “old data leaking into newer datapoints” problem as fully.) This would address some other bugs in the Exponentially Decaying Reservoir, as Gil describes:

‘In your example of a system logging 10K operations/sec with the histogram being sampled every second, you’ll be missing 9 out of each 10 actual outliers. You can have an outlier every second and think you have one roughly every 10. You can have a huge business affecting outlier happening every hour, and think that they are only occurring once a day.’