Thoughts on Time-series Databases

“Preetam “recently” blogged about catena, a time-series metric store. There was another blog post about benchmarking boltdb by a Fog Creek engineer who is also looking to write a time-series database. This is something of a pattern in the Go community, which already boasts seriesly, InfluxDB, and Prometheus; there are almost certainly others.

Time-series data has been de rigueur at least since Etsy’s seminal blog post on StatsD, though in reality that was just an inflection point. Time-series modeling and graphing predate computer systems, and they have been a popular way of tracking and visualizing systems and networking data since at least the early 90s with MRTG. A few factors are converging now to make these kinds of systems more important: “Big Data” is getting much, much bigger; virtualization and containerization have increased the number of independent “nodes” for a typical distributed application; and the economies of the cloud have put the brakes on the types of performance increases typically attributed to “Moore’s Law.”

This topic is relevant to my work at Datadog, and I’ve been thinking about it for a long time now. I’ve wanted to collect my thoughts somewhere for a while, because some of them are identical to those expressed in other recent projects, and others are antithetical to them. I figured this would make my input at worst interesting.

For a primer on this subject, please read Baron’s Time-Series Database Requirements. There’s a reason that most other recent articles cite it; it contains a brief but complete description of the problem, the requirements for many large-scale time-series users, and some surprises for the uninitiated…”