Storage and beyond

Facebook time series storage

Time series databases are quite niche product, but it is extremely useful in monitoring. I already wrote about basic design of the timeseries database, but things are more tricky.

Having something like HBase for TS database is a good choice – it has small overhead, it is distributed, all data is sorted, HBase is designed for small records, its packs its sorted tables, it supports Hadoop (or vice versa if that matters) – there are many features why it is a great TS database. Until you are Facebook.

At their scale HBase is no capable of handling the write and read monitoring and statistics load. And Facebook created Gorilla – a fast, scalable, in-memory time series database.

Basically, it is a tricky cache on top of HBase, but it is not just a cache. Gorilla uses very intelligent algorithm to pack 64 bits of monitoring data by the factor of 12.

This allows Facebook to store Gorilla’s data in memory, reducing query latency by 73x and improving query throughput by 14x when compared to a traditional database (HBase)- backed time series data. This performance improvement has unlocked new monitoring and debugging tools, such as time series correlation search and more dense visualization tools. Gorilla also gracefully handles failures from a single-node to entire regions with little to no operational overhead.

Design of the fault tolerant part is rather straightforward, and Gorilla doesn’t care about consistency or even more – there is no recovering missing monitoring data. But after all, Gorilla is a cache in front of persistent TS database like HBase.

Gorilla uses sharding mechanism to deal with write load. Shards are stored in memory and on disk in GlusterFS. Facebook uses its own Paxos-based ShardManager software to store shard-to-host mapping information. If some shards have failed, read may return partial results – client knows how to deal with it, in particular, it will automatically try to read missing data from other replicas.

I personally love Gorilla for its compression XOR algorithm.

It is based on the idea that subsequent monitoring events generally do not differ in the most bits – for example, CPU usage doesn’t jump from zero to 100% in a moment, and thus XORing two monitoring events yields a lot of zeroes which can be replaced with some smaller meta tag. Impressive.