Combining Similar Time Series

This week was Monitorama 2019, in Portland. I was there. Many in the
monitoring community were there. A great time was had by many. (Note: I've
never really considered myself part of the monitoring community, but perhaps
I should make more of an effort?)

Evan Chan gave a talk entitled "Rich
Histograms at Scale: A New Hope" (slides). In it, he made everyone aware of
how much error creeps in if you do histograms the way that's easy in
systems like Prometheus--"a couple of linear buckets"--and showed that
to get even close to a 10% error rate across a span of values from 1,000 to
6e10, you need 188 exponential buckets.
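
One way to sanity-check that number: if each exponential bucket's upper bound is a fixed multiple of its lower bound, the bucket count is the log of the range divided by the log of that multiple. The sketch below assumes "10% error" maps to a growth factor of 1.1 per bucket; that mapping is my assumption, not necessarily how the talk derived it, and `exponential_bucket_count` is a made-up helper name.

```python
import math


def exponential_bucket_count(lo: float, hi: float, growth: float) -> int:
    """Buckets needed to cover [lo, hi] when each bucket's upper bound
    is `growth` times its lower bound."""
    return math.ceil(math.log(hi / lo) / math.log(growth))


# Assumption: a 10% error target is modeled as a growth factor of 1.1.
buckets = exponential_bucket_count(1_000, 6e10, 1.1)
print(buckets)  # 188
```

Under that assumption the arithmetic lands on the same 188 buckets quoted in the talk.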

This presents a problem.

Querying 10 time series to reconstruct latency estimates from a
histogram isn't that big of a problem. Making 188 queries, one per
bucket, to do the same thing is.

Not to worry, though! Evan suggested that a richer histogram model,
which stored all of the buckets in a single, rich time series, would provide
better scalability here without sacrificing accuracy. And, using
delta encoding, 188
buckets would be quite cheap to store (on the order of 1.8 bytes per
bucket).
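
To make the delta encoding idea concrete, here is a minimal sketch. It assumes cumulative bucket counts, stores each bucket as the difference from the one before it, and measures size with a simple 7-bits-per-byte varint. The bucket counts are invented, the function names are hypothetical, and this is not necessarily the encoding Evan described, so don't expect it to reproduce his 1.8 bytes per bucket.

```python
def delta_encode(values):
    """Keep the first value, then each value's difference from its predecessor."""
    deltas, prev = [], 0
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def varint_size(n):
    """Bytes needed to store a non-negative integer at 7 bits per byte."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size


# Hypothetical cumulative bucket counts for a single histogram sample.
cumulative = [0, 0, 3, 40, 950, 4200, 4980, 5000, 5000, 5000]
deltas = delta_encode(cumulative)

raw_bytes = sum(varint_size(v) for v in cumulative)
delta_bytes = sum(varint_size(d) for d in deltas)
print(deltas, raw_bytes, delta_bytes)
```

Empty buckets and the flat tail of the distribution turn into zeros, which is where the savings come from. Deltas between consecutive samples over time tend to be smaller still.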

At the end of the talk I messaged one of my colleagues who was in the
trenches with me when we rebuilt the system that hosts Heroku Metrics
(the post about it is years out of date at this point). You see, co-locating
similar time series data was exactly the design we used with InfluxDB
(v0.0.76 or something) before 2015, and exactly the design we knew
we'd keep in the new system. Because we knew our graphs would be
composed of, say, 4 memory related metrics, we put them together in
the same way you could put 188 histogram buckets together.

Non-relational databases always talk about how you've got to design
your schemas for how you want to query the data. Relational databases say
similar things about indexing strategies. There doesn't seem to be an
equivalent mantra in the time-series world--perhaps because we're too
busy worrying that if we can't write data fast enough, reads don't
matter anyway?

Not sure! But co-locating metrics that
will be read together, or written together, is very worth it, in much
the same way that denormalizing relational data can be very worth
it, or designing NoSQL table spaces with reads in mind is simply a requirement.
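
A toy illustration of why this matters on the read path. Both stores below are in-memory stand-ins I made up for this post, not Heroku's system or any real database API, and the metric names are hypothetical. The only point is the lookup count when a graph needs several related values at once.

```python
from collections import defaultdict


class PerMetricStore:
    """One series per metric: reading a graph costs one lookup per metric."""

    def __init__(self):
        self.series = defaultdict(dict)  # metric name -> {timestamp: value}
        self.lookups = 0

    def write(self, ts, values):
        for name, value in values.items():
            self.series[name][ts] = value

    def read(self, ts, names):
        out = {}
        for name in names:
            self.lookups += 1
            out[name] = self.series[name][ts]
        return out


class CoLocatedStore:
    """One row per group and timestamp: related values are stored together."""

    def __init__(self):
        self.rows = {}  # (group, timestamp) -> {metric name: value}
        self.lookups = 0

    def write(self, group, ts, values):
        self.rows[(group, ts)] = dict(values)

    def read(self, group, ts):
        self.lookups += 1
        return self.rows[(group, ts)]


# Hypothetical memory metrics that always appear on the same graph.
memory = {"rss": 412.0, "cache": 58.5, "swap": 0.0, "total": 470.5}

separate = PerMetricStore()
separate.write(1559692800, memory)
together = CoLocatedStore()
together.write("memory", 1559692800, memory)

a = separate.read(1559692800, list(memory))
b = together.read("memory", 1559692800)
print(separate.lookups, together.lookups)  # 4 1
```

Swap the 4 memory metrics for 188 histogram buckets and the difference in lookups grows accordingly.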

In Heroku's case, our metric co-location strategy uses the following
general schema (stored compressed in Cassandra):

This currently supports value cardinalities from 2 to 500, depending
on the use case (the high cardinality stuff is typical for data store
metrics), with an average of about 7.

A natural extension of this would be to turn Measurements.values
into a matrix and do delta encoding. That would provide options for
higher resolution data, or wider data across a single tag value (say,
dyno in Heroku's case).

So, while Evan put forward that a richer model for histograms is
needed for better accuracy, I'm here to suggest that, while maybe not
trivial, we shouldn't be scared that his ideas can't or won't
scale. They definitely can.