The Case For Tagging In Time Series Data

A while ago I wrote a blog post about time series database
requirements that has been
amazingly popular. Somewhere close to a dozen companies have told me they’ve
built custom in-house time series databases, and that blog post was the first
draft of a design document for it.

One of the things I said in the post was that I had no use for the “tagging”
functionality I’ve seen in time series databases such as OpenTSDB. I’ve since
reconsidered, although I think the functionality I now want is a bit different.

What does it mean to “tag” metrics? Typically, many time series databases let
you add name=value pairs (tags) to individual time series points (observations).
For example, you measure CPU usage to be 59% at 3:41PM on host inst413, and you tag
this measurement as “shard=81” because inst413 holds data for shard 81.

The use case for this is that you can now aggregate and filter metrics, looking
at for example CPU usage for all of the shard 81 hosts at a point in time.

Tags are typically attached to individual points in the systems I’m aware of,
rather than being attached to a host or similar, because the assumption is that
they’ll change over time. You might move shard 81’s data to a different set of
servers. You might destroy host inst413 and rebuild it, and it might not even be
a database anymore.

For my purposes at VividCortex, I want “tags” to
help slice-and-dice-and-filter for a variety of customer scenarios.

We currently can show you Top Queries by total time, and Top Queries by the
count of queries that lack indexes. Customers want to see Top
Queries by total time, filtered to show only those that lack indexes.

We can show you Top Users by total time, indicating the total service demand
placed on a set of hosts by queries running against them from a particular
user. But what our customers want is to see Top Queries by total time,
filtered to those that come from a specific user.

There are a variety of cases where we want to decouple the metrics from the
filtering that can be applied to them. Some we can support by joining together
related metrics, at some cost on our backend; some we cannot currently do as
well as we want, or in the way we want.

However, I do not like one thing about the common tagging functionality. I don’t
like that individual metric points all have to bear the burden of every tag for
every point.

Instead, I view tags as time series facts that exist during some time
interval. And I think these should be applicable to the metric or the host (but
I don’t see a use case where I want them per point in a metric). For example,
a query doesn’t use an index. I can tag the query’s metrics as “no_index=true”
from the timestamp I first observe that, till the timestamp that I observe it
using an index. These durations can be open-ended, in which case they apply for
all time.

Similarly, I can tag the host as “shard=81” during the time that this host
carried that shard’s data.

Finally, the tags need to permit multiple values. Imagine a query that is run by
several different users; we wouldn’t want a uniqueness constraint on it.
“user=nagios” and “user=vividcortex” should be able to coexist for the query
“SHOW GLOBAL STATUS” in MySQL, to mention a real-life use case. Without this
capability, a search for all queries that the vividcortex user runs wouldn’t
work as expected.

I'm Baron Schwartz, the founder and CEO of VividCortex. I am the author of High
Performance MySQL and lots of open-source software for performance analysis, monitoring, and system administration.
I contribute to various database communities such as Oracle, PostgreSQL, Redis and MongoDB. More about me.