Correcting falsehoods in tech documentation, lessons learned

Database size, column types, and retention policies

So, earlier I mentioned that I had a 700meg database of samples, and wasn’t even looking at them.

I started trying to look at them, and quickly had to add some indexes, ballooning the db out to about 950meg. This was clearly going in the wrong direction.

I looked at the table schema, and quickly realised that keeping the sensor sample time (a unix timestamp) as a string instead of a number was pretty wasteful, and likewise for the sensor type and node id. Although both of those can be given meaningful names, all firmware on the nodes have those as numbers as well, because it has to go over the wire(less)

So, simply changing the database schema from some string columns to some integer columns, and my calculator tells me I’ve gone from 90 bytes per sample, down to 32 bytes per sample.

Much better. And now that the date sample is a proper number, searching the db for date windows is much faster, without even needing the index.

And for added deliciousness, the python consumer, that writes into this database didn’t have to change at all, is it a string, is it an integer? It’s both!

That’s all well and good, but it’s still going to just keep on growing forever and ever, and with samples every 5 seconds, and (currently) 3 nodes with 2-3 sensors on each one, this isn’t going to slow down.

Do I really really care about that resolution of data much longer than a few days/weeks/month in the past? Shouldn’t I really just be averaging the older data out to get some general trends? Perhaps :) And that’s exactly what RRDtool is for, I just find it a little obtuse to use at times. (Mostly, I have to decide, permanently and forever, right now, exactly how long I do want to keep things. I didn’t want to have to decide that)