Monitoring Fast Metrics, Part 3/5, Maximizing what we Log

Howto by Guy Cirino

Background

Welcome to Part 3 of the Monitoring Fast Metrics series! Starting from a raw, tiny server instance, we built a scalable logging server, and then taught our client how to aggregate and send endless raw data for us. Job done; we might as well deploy and jump straight to the analysis. With that much, you'd already be doing better than most of the sites and apps out there, which either log way too much or miss the important fast-moving information they need!

I think we can do better, though. Going back to the mind map, we're still collecting data we aren't interested in, and we might miss out on the rare things that seem to be pretty frequent:

Thanks again, @dangerpudding

That's already pretty awesome, so feel free to jump ahead to the analysis, or take a detour through mapping IPs to ASNs. FlatHistograms already do a reasonable job of aggregating data, but they give equal resolution to every part of the range being logged, which doesn't exactly match what we want. In the case of latencies, for example, we're paying as much attention to the outliers as we are to the area of interest:

What is to be done?

Today's Agenda

When aggregating the data, there was no reason to assume that the bins had to be equally spaced. As a delay example in a web browser, I'm going to look at the transactionTime that we captured last time, which measures how long it takes from a browser firing off a request to that data sitting on the device, ready for consumption. If we assume that our requests normally take between 0 (cache hits) and 2 seconds to complete, we're much less interested in exactly how long they take once they're past 10 seconds. Something went wrong with the CDN or the code, but we don't need precise timings that far out. We really only need to recognize how often they happened.

With the FlatHistogram implementation, to be able to see things over that range, we would need to set the range from 0 to something above 10 (say 20). If we wanted to count the delays in whole seconds, we'd need 20 bins, but then we're getting the same precision in the range from 10-20, where we are less interested, as we are from 0-10. If we want more resolution around the area of interest, say for values less than 5, we would still need to increase the number of bins everywhere to get the finer grain: going to quarter-second intervals would take 80 bins!
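The bin counts above are just the range divided by the bin width. A quick sketch of that arithmetic, where `flatBinCount` is a hypothetical helper name and not something from the earlier articles:

```javascript
// Hypothetical helper: how many equal-width bins are needed to cover
// the range from `min` to `max` at a given bin width.
function flatBinCount(min, max, binWidth) {
  return Math.ceil((max - min) / binWidth);
}

console.log(flatBinCount(0, 20, 1));    // 20 bins at one-second resolution
console.log(flatBinCount(0, 20, 0.25)); // 80 bins at quarter-second resolution
```

Every extra step of resolution multiplies the bin count over the whole range, including the part we care least about.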

That leaves us with two options:

Bump up the number of bins and deal with increased amounts of data, or

Choose a different binning strategy.

As an additional caveat, I will avoid using any pre-arranged binning strategies since that complicates the implementations for the rest of that data's life. In practice, with how fast things tend to change, I want to prefer more flexible self-describing solutions that can be applied over vastly differing scales quickly with very little maintenance overhead, which leaves only binning that can be calculated at runtime from a simple description, or the raw data itself.

There has, fortunately, been a lot of research on different binning strategies, and today I will show you one that is used by Chrome, Firefox, and Netflix, which incurs little to no additional CPU cost over the FlatHistogram strategy - the ExponentialHistogram.

For more CPU and memory, we can do even better with more complex techniques, but I saw implementation issues with the lowest end hardware I was supporting at Netflix, so we'll leave that for another article, if there's interest.

The Pieces

I. Calculating the exponentially spaced buckets
Using the code from the previous article, we can get a FlatHistogram from 0 to 20 with a bucket for each number, with:
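As a rough stand-in for that flat layout, here is a minimal sketch that only computes the bucket lower bounds. The function name `flatBucketRanges` and its return shape are assumptions for illustration; the FlatHistogram from the previous article may expose this differently.

```javascript
// Stand-in for a flat layout: equally spaced lower bounds.
// The name and return shape are assumptions made for illustration.
function flatBucketRanges(minimum, maximum, bucketCount) {
  const width = (maximum - minimum) / bucketCount;
  const ranges = [];
  for (let index = 0; index < bucketCount; index++) {
    ranges.push(minimum + index * width); // lower bound of bucket `index`
  }
  return ranges;
}

const flat = flatBucketRanges(0, 20, 20);
console.log(flat); // lower bounds 0, 1, 2, ... 19
```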

Now what if we still kept 20 buckets, but wanted the numbers near the top to be much further apart than those towards the bottom? We can observe that we need coverage over the range from 0 to 200 (range of 200), but keep sliding the buckets further apart the closer we get to 200 by using increasing roots of the range. To do this, let's swap out the initializeBucketRanges from FlatHistograms with some new math:
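Here is a minimal sketch of one way to compute exponentially spaced boundaries. It is not necessarily the math the original listing used: the function name `exponentialBucketRanges` is hypothetical, and I'm assuming integer inputs and a first real bucket starting at 1, since the logarithm of 0 is undefined. At each step it takes the remaining ratio between the maximum and the current boundary and spreads it evenly, in log space, over the boundaries still to be placed.

```javascript
// Hypothetical helper, not the article's original listing.
// Returns `bucketCount` lower bounds starting at 0 and `minimum`, and ending
// at `maximum` when there is room for that many integer boundaries.
function exponentialBucketRanges(minimum, maximum, bucketCount) {
  if (minimum < 1 || maximum <= minimum || bucketCount < 3) {
    throw new RangeError("need 1 <= minimum < maximum and at least 3 buckets");
  }
  const logMax = Math.log(maximum);
  const ranges = [0, minimum]; // bucket 0 catches everything below `minimum`
  let current = minimum;
  for (let index = 2; index < bucketCount; index++) {
    // Spread the remaining distance (in log space) evenly over the
    // boundaries that still have to be placed.
    const logCurrent = Math.log(current);
    const logRatio = (logMax - logCurrent) / (bucketCount - index);
    const next = Math.round(Math.exp(logCurrent + logRatio));
    // Rounding can stall at small values, so always advance by at least 1.
    current = next > current ? next : current + 1;
    ranges.push(current);
  }
  return ranges;
}

const exponential = exponentialBucketRanges(1, 200, 20);
console.log(exponential); // 20 lower bounds, tightly packed near 0
```

With these inputs the last lower bound lands on 200, so the final bucket collects everything at or above it.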

Then we just need to change the FlatHistogram names to ExponentialHistogram in the rest of the code, and change the layout to use the name "E" instead of "F". You could also get fancy and just replace the initializeBucketRanges function on FlatHistogram's prototype and rename the class, but I'll leave that to you for now.

II. Comparing the resolutions of interest
What sort of histogram do we get when we use the same range and bucket counts?

Close to 0, the buckets are very close together, while they spread out very far near 200. If we put them on the same scale:

Success! In the case of network delays, such as we discussed before, we now have a lot more resolution around the area of interest, and significantly less at the outliers. This matches our original intention, since the exact magnitude of an outlier is probably less interesting to us than simply knowing that it happened, or how often.

III. Filtering out
There is one more useful optimization we can make to cut down on the amount of data that we're logging. Over short time intervals, such as during a user's session on a website, many of the buckets will not be filled with any counts at all. This leaves us with a lot of zeros in the data, such as in this example of transactionTime counts harvested from a far away website:

If we include the layout in the payload (as we have been), then we can always reconstruct which bins were filtered out based on the original parameters, so there's no need to log a whole bunch of bins with no counts in them. Let's apply a simple filter to the log data to remove zero counts:
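A minimal sketch of such a filter, assuming the counts arrive as a plain array with one entry per bucket. That shape, and the names `filterZeroCounts` and `restoreZeroCounts`, are my assumptions for illustration; adapt them to whatever the payload from the earlier articles looks like.

```javascript
// Client side: keep only the buckets that actually saw data.
// Assumed input shape: one count per bucket, in layout order.
function filterZeroCounts(counts) {
  const sparse = {};
  counts.forEach((count, bucketIndex) => {
    if (count !== 0) sparse[bucketIndex] = count;
  });
  return sparse;
}

// Server side: rebuild the dense array using the bucket count
// described by the layout.
function restoreZeroCounts(sparse, bucketCount) {
  const counts = new Array(bucketCount).fill(0);
  for (const [bucketIndex, count] of Object.entries(sparse)) {
    counts[Number(bucketIndex)] = count;
  }
  return counts;
}

const sampleCounts = [0, 3, 0, 0, 12, 0, 0, 1, 0, 0];
const sparse = filterZeroCounts(sampleCounts);
console.log(JSON.stringify(sparse)); // {"1":3,"4":12,"7":1}
```

Keying by bucket index is what lets the server put every count back in its place without the zeros ever crossing the wire.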

We can also drop any record that has no counts at all in the data field, assuming we aren't interested in receiving heartbeats from clients. Cutting down on the amount of data being logged at the client will reduce how much filtering is needed up at the server.
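Dropping the empty records can be sketched the same way. The record shape here, a `layout` string plus a `data` object keyed by bucket index, is a hypothetical stand-in for the real payload:

```javascript
// Hypothetical record shape: { layout: "...", data: { bucketIndex: count } }.
function hasCounts(record) {
  return Object.keys(record.data).length > 0;
}

const pending = [
  { layout: "E", data: { 1: 3, 4: 12 } },
  { layout: "E", data: {} }, // nothing happened, not worth sending
];

const toSend = pending.filter(hasCounts);
console.log(toSend.length); // 1
```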

Next Time

We have now improved our resolution in the areas of the data where we are most interested, and blown away the chaff from the rest. In the next article, we will investigate mapping an IP address to an ASN, a useful dimension to roll up against when looking for common network experiences among users, and another ingredient for our final analysis.