Why details really matter

I have been investigating, testing, and playing with InfluxDB 2.0 and Flux in a test environment. Both are still in alpha (OSS version), but already well worth looking at. Along the way I noticed a tiny detail in one of the built-in functions that, if overlooked, could really ruin your day.

Now, yes, the detail is documented, but as someone who perhaps doesn't always read the documentation in its entirety (where's the fun in that?!), I overlooked it and made some assumptions about the default behaviour.

OK, so let's run a basic query for the CPU usage of a host over the last 30 days, knowing that there is actually only data for the past 24 hours:

from(bucket: "test_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r.cpu == "cpu-total")
  |> filter(fn: (r) => r._field == "usage_user")

And the resulting graph, which only shows the data that we actually have:

Great. Now, let's assume we need some form of aggregation, such as a sum, which is a very common requirement. This changes the query to include a built-in function named aggregateWindow(), which splits the data into time windows and applies an aggregate function (here, sum) to each window. A period of 5 minutes is often used (5 minute rollups are common):

from(bucket: "test_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r.cpu == "cpu-total")
  |> filter(fn: (r) => r._field == "usage_user")
  |> aggregateWindow(every: 5m, fn: sum)

And the graph looks roughly the same over the same time:

I didn't notice the little flat line just before the data starts...

Now, let's see how the data looks over the past 30 days (even though there's only data covering 24 hours):

Well...great; but why is the graph now putting a "0 line" in, and why do I now have data where none exists? Remember, the graph without the aggregation behaves correctly, and only shows data where data exists...

The answer is in the aggregateWindow function's default behaviour - which is documented here:

createEmpty:Boolean

For windows without data, this will create an empty window and fill it with a `null` aggregate value.

What is not immediately obvious is that it defaults to true.

This results in Flux returning lots and lots of empty windows, now summed to 0 (not null). Whilst this seems innocent enough, running a more complex query against a real database in this way can easily cause InfluxDB 2.0 to run out of memory (Flux is essentially embedded within the DB) as Flux tries to create all of these windows.
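To put a rough number on "lots and lots", here is a small Python sketch of windowed aggregation. It is a toy model for illustration only, not how Flux implements aggregateWindow(): the function name aggregate_window, the create_empty flag, the dates, and the one-sample-per-minute data are all made up for the example. Empty windows are represented as None and windows are labelled by their start time, purely to keep the sketch simple.

```python
import math
from datetime import datetime, timedelta, timezone


def aggregate_window(points, start, stop, every, create_empty=True):
    """Sum (timestamp, value) points into fixed windows of width `every`.

    Toy model only. Windows with no points are emitted as (window_start, None)
    when create_empty is True, and skipped entirely when it is False.
    """
    sums = {}
    for ts, value in points:
        if start <= ts < stop:
            index = (ts - start) // every  # which window this point falls in
            sums[index] = sums.get(index, 0.0) + value

    total_windows = math.ceil((stop - start) / every)
    rows = []
    for index in range(total_windows):
        window_start = start + index * every
        if index in sums:
            rows.append((window_start, sums[index]))
        elif create_empty:
            rows.append((window_start, None))
    return rows


# Hypothetical setup: a 30 day query range, but data for the last 24 hours only.
stop = datetime(2020, 1, 31, tzinfo=timezone.utc)
start = stop - timedelta(days=30)
every = timedelta(minutes=5)
points = [(stop - timedelta(minutes=m), 1.0) for m in range(1, 24 * 60 + 1)]

with_empty = aggregate_window(points, start, stop, every, create_empty=True)
without_empty = aggregate_window(points, start, stop, every, create_empty=False)

print(len(with_empty), len(without_empty))  # 8640 288
```

In this model, a 30 day range at 5 minute windows yields 8640 rows per series, of which only 288 hold real data. The other 8352 rows are pure padding, and that is for a single series.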

Here's the same query, but with this parameter now set to false:

from(bucket: "test_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r.cpu == "cpu-total")
  |> filter(fn: (r) => r._field == "usage_user")
  |> aggregateWindow(every: 5m, fn: sum, createEmpty: false)

Which now causes the graph to behave as it does without the aggregation function - returning only results where there is data:

Apart from the likelihood of a massively higher memory overhead (for all of those "0 filled" time windows), there will also be a growing calculation overhead for every function called after aggregateWindow(), because those functions operate on all of this data (a window full of 0s uses the same amount of memory as a window full of 1s).

So, if you are also actively using InfluxDB 2.0, keep in mind that this stuff is actively changing, and keep up with the documentation, or your day might just get a whole lot more "interesting"!