Kicking the tyres on OpenTSDB

One of the big buzzwords in IT at the moment is “big data” – data in such large quantities that it’s not feasible to analyse using your traditional set of tools. Thankfully this isn’t a problem that Anchor has to deal with, though we almost wish it were.

We collect a lot of data about the servers we manage, way more than most other hosting providers: on a typical server we monitor and track a couple dozen metrics to know how healthy it is and whether that’s changing over time.

This is good, but it’d be great if we could easily store lots more data. What if we didn’t limit ourselves to keeping a year of data? What if we collected data every few seconds instead of every minute? Even if you can store 100 times as much data, can you still access and analyse it quickly enough for it to be usable?

This is what OpenTSDB was designed to address.

Why do we care?

Having lots of data is pointless unless you can extract useful information from it, and lots of what we do at Anchor needs good information. Whether it’s capacity planning for a customer or diagnosing problems, you’re analysing information to reach a conclusion.

For our purposes, having more data is a Good Thing.

What are we missing out on?

As an example, take one of our usual CPU monitoring graphs, which we might use when diagnosing performance complaints.

CPU usage with 1-min sampling granularity

Samples are taken every minute, meaning that graphs will only show sustained activity. Highly erratic or bursty CPU loads are less likely to show up on the graph, which could lead to incorrect conclusions – the CPU could be all-out saturated for short bursts and we’d never be able to tell.

Sampling the CPU usage much more frequently, say every few seconds rather than every minute, would allow this sort of behaviour to be detected.
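To put some rough numbers on that, here’s a small sketch in Python. The CPU trace is entirely made up (idle at 5%, pegged at 100% for five seconds in every minute), and the `averaged` helper is just for illustration, but it shows how much a coarse sampling interval can hide.

```python
# Hypothetical per-second CPU utilisation (%) over five minutes:
# mostly idle at 5%, but saturated for 5 seconds in every minute.
def cpu_at(second):
    return 100.0 if 30 <= second % 60 < 35 else 5.0

trace = [cpu_at(s) for s in range(300)]

def averaged(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples), window)
    ]

per_minute = averaged(trace, 60)       # what a 1-minute graph would plot
per_five_seconds = averaged(trace, 5)  # what a 5-second graph would plot

print(round(max(per_minute), 1))   # 12.9 -- looks like a quiet server
print(max(per_five_seconds))       # 100.0 -- the saturation is visible
```

On the one-minute graph this server never appears to go above about 13% CPU, even though it’s flat-out for five seconds of every minute.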

What else could we be doing?

You’ve definitely heard the saying “when all you’ve got is a hammer, everything looks like a nail”; OpenTSDB is a bit like that. Once you’ve got a system that makes it practical to sample datapoints with insanely fine granularity, you start to think about things a bit differently.

“I wish I knew how many queries MySQL is receiving” becomes “Why yes I would like to track all 315 statistics exposed by MySQL”.

Asking “how much memory is the server using?” becomes “it’d be handy to know how much memory is tied up in cache and buffers, as well as tracking memory maps and shared regions and…”

You wouldn’t limit yourself to monitoring how many open connections Apache is holding, when you could see how many connections every process on the system has, and in which state.
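As a rough idea of what “track everything” might look like in practice, the sketch below turns a handful of MySQL status counters into lines for OpenTSDB’s line-based `put` command (`put <metric> <timestamp> <value> <tag=value ...>`). The `mysql.*` metric names, the `host=db01` tag, the counter values and the `put_line` helper are all invented for illustration; they aren’t anything we’ve actually deployed.

```python
import time

def put_line(metric, value, tags, timestamp=None):
    """Format one datapoint as an OpenTSDB 'put' line.

    OpenTSDB expects at least one tag on every datapoint.
    """
    if not tags:
        raise ValueError("OpenTSDB needs at least one tag per datapoint")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {ts} {value} {tag_str}"

# Made-up values for a few real MySQL status variables.
status = {"Questions": 1834201, "Threads_connected": 12, "Slow_queries": 3}

lines = [
    put_line(f"mysql.{name.lower()}", value, {"host": "db01"},
             timestamp=1356998400)
    for name, value in status.items()
]

for line in lines:
    print(line)
# First line printed: put mysql.questions 1356998400 1834201 host=db01
```

Once every statistic is just another metric name with a few tags attached, going from three counters to all 315 is the same loop over a bigger dictionary.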

OpenTSDB also allows for ad-hoc custom querying, something which simply isn’t possible with our current static graphing tools like RRD. If you’ve ever used the Google Analytics Query Explorer you’ll know how powerful this is. Once you spot a pattern you can start hypothesising, and the data is all there to help you test it.
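To give a flavour of what an ad-hoc query could look like, here’s a sketch that builds a query URL in the `aggregator:metric{tag=value}` style that OpenTSDB uses. We’re assuming the `/api/query` HTTP endpoint from OpenTSDB 2.x here; the server address, metric name and `query_url` helper are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def query_url(base, metric, tags, start="1h-ago", aggregator="sum"):
    """Build a query URL of the form m=<aggregator>:<metric>{tag=value,...}."""
    tag_filter = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    m = f"{aggregator}:{metric}"
    if tag_filter:
        m += "{" + tag_filter + "}"
    # urlencode percent-encodes the braces, colon and equals sign for us.
    return f"{base}/api/query?" + urlencode({"start": start, "m": m})

# Hypothetical TSD address (4242 is OpenTSDB's default port) and metric.
url = query_url("http://tsdb.example.com:4242", "sys.cpu.user",
                {"host": "web01"})
print(url)

# Decoding the query string shows the expression the server would see.
print(parse_qs(urlsplit(url).query)["m"][0])  # sum:sys.cpu.user{host=web01}
```

Changing the question (a different host, a different metric, a longer time window) is just a matter of changing the arguments, rather than reconfiguring a graph definition.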

And of course, there are graphs. We’d love to hook OpenTSDB into something shinier like Graphite or Cubism. Sometimes we forward our metrics to customers, and it never hurts the company to produce slick material.

Pretty much anything has to be prettier than this.

Why aren’t we using it right now?

You’ve probably noticed that we’re talking very hypothetically about all this – we haven’t actually deployed our big OpenTSDB and become masters of time and space just yet. There are many great reasons to do so, as we’ve mentioned above, but when we do, we want to do it properly.

That means proper redundant hardware for the highly-available HBase cluster, not some janky old boxes that happen to be lying around, and deploying a lot of data collection scripts. Our existing Puppet automation framework will do a lot of this for us, but it’ll be a good chunk of well-planned work to do it right and port the data collection over to OpenTSDB.

Our Nagios setup has served us well for probably the better part of a decade now. It’s time to move on to bigger and better things, and this is as big as it gets for the foreseeable future.