Ad Hoc Analytics with MongoDB's Aggregation Framework

Fair warning: this post is really long, has a lot of code, and I tend to ramble a lot. If you're looking for the graphs, they're at the bottom. :)

Background

While working on some recent projects, I had the need to run some basic dashboard analytics against moderate volumes of machine generated data. Already having some experience with MongoDB (and being quite the fan of it), I decided to do some research on real-time analytics with MongoDB.

A quick search turns up dozens of articles and presentations on how this can be achieved. However, after reading through quite a few of them, it became clear that most of the existing how-tos on the subject are based on pre-Aggregation Framework techniques, relying largely on MongoDB's atomic upsert, $inc, and $set operations.

These techniques are still largely useful, and are powering several successful applications. Unfortunately, however, they tend to be lacking when it comes to the ad hoc side of things - specifically, once multiple values from distinct events have been aggregated into a single value, the ability to slice and dice the results becomes limited. Additionally, these techniques typically require pre-aggregating at multiple levels to support pre-determined aggregation durations, rely on MapReduce, or delegate some re-reducing labor to the application itself.
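For readers unfamiliar with that formula, here's a minimal sketch of the increment-on-write pattern (collection and field names are hypothetical, purely for illustration):

```javascript
// Sketch of the classic increment-on-write technique (names are hypothetical).
// For each incoming hit, upsert a per-page-per-hour document and bump its counters.
// In the mongo shell this would be invoked as:
//   var spec = hourlyHitUpdate("/index", new Date());
//   db.hourly.update(spec.query, spec.update, { upsert: true });
function hourlyHitUpdate(page, ts) {
  // truncate the timestamp down to the containing hour
  var hour = new Date(Math.floor(ts.getTime() / 3600000) * 3600000);
  return {
    query: { page: page, hour: hour },
    update: {
      $inc: { hits: 1 },      // atomic counter increment
      $set: { lastSeen: ts }  // remember the most recent event
    }
  };
}
```

Once counts have been folded into a bucket like this, the individual events are gone - which is exactly the loss of flexibility described above.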

Given that we now have the Aggregation Framework available to us (since MongoDB 2.1), I decided to run some tests to see how feasible it is to achieve real-time, interactive, ad hoc, dashboard analytics with MongoDB.

Note that this article is intended to be platform agnostic, so all tests are implemented as MongoDB shell scripts. If you're not familiar with scripting the shell, you can read about that here.

Defining our Terms

Several of the terms I used above seem to have wildly varying definitions, so let me take a moment to define how they're being used within the scope of this article:

Real-time

Analytics must be up-to-date; results should be accurate for the data as it is now, not for what the data looked like 2 hours ago.

Interactive

Queries should respond with "web latency". If you're tempted to go browse Hacker News while you wait for a response, we're too slow.

Ad hoc

In general, we should be able to support the above two requirements without one-off indexes or other performance measures that require a priori knowledge of the queries themselves.

Dashboard analytics

This term really has no standard definition, but it should be understood that we're talking about relatively superficial analytics here: mins, maxes, averages, counts, etc. This isn't the be-all and end-all solution to BI.

Test Data

I initially reached for the ubiquitous "server request log" test data, but as I began drafting this post, Hurricane Sandy hit and I thought it would be interesting to work with event data containing some of the metrics I imagine an offshore buoy might send back. Here's what I came up with:
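A generated event might look something like this sketch (the field names and value ranges are my own illustrative assumptions, not real buoy telemetry):

```javascript
// Hypothetical buoy event generator (field names and ranges are illustrative).
function randomEvent(buoyId, ts) {
  function rand(min, max) { return min + Math.random() * (max - min); }
  return {
    buoy: buoyId,             // which of our 25 buoys sent this reading
    ts: ts,                   // event timestamp
    waveHeight: rand(0, 15),  // meters
    windSpeed: rand(0, 50),   // knots
    waterTemp: rand(0, 30)    // degrees Celsius
  };
}
```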

To see how things behave over a reasonable volume of events, we're going to create 1 million events that represent a week's worth of data from 25 distinct buoys. Since we'll be comparing 3 different schemata, I've created a file that will act as our "test runner" and "event generator" across all 3:

We'll simply load this file at the top of our individual test cases, which will allow the actual test files to focus on defining the indexes, insert statements, and queries. Also note that this data is in no way realistic - values are purely random and can jump from one extreme to the other. They're just here to give us something to crunch over.
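Here's a rough sketch of what such a generator loop might look like; `insertEvent` is a hypothetical callback that each schema-specific test file would supply:

```javascript
// Sketch of a shared event generator: 1,000,000 events from 25 buoys,
// spread evenly over one week. `insertEvent` is a hypothetical callback
// provided by each schema's test file (e.g. a plain insert, a $push, or
// an increment-on-write upsert).
var NUM_EVENTS = 1000000;
var NUM_BUOYS = 25;
var WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function generateEvents(startTime, insertEvent) {
  var step = WEEK_MS / NUM_EVENTS;
  for (var i = 0; i < NUM_EVENTS; i++) {
    var ts = new Date(startTime.getTime() + Math.floor(i * step));
    var buoy = i % NUM_BUOYS; // round-robin across the 25 buoys
    insertEvent(buoy, ts);
  }
}
```

At this rate, each buoy averages about 4 events per minute, which will matter when we look at the pre-aggregated schema later.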

Three MongoDB Real-time Analytics Schemata

There are undoubtedly more ways to crack this nut, but I settled on three different schemata to compare. Each one is evaluated with respect to its insert speed, memory requirements, and performance against 4 different queries.

Document Per Event

This first schema is the simplest of the three. For each distinct event that occurs, we'll create a corresponding document. This schema provides our baseline, naive implementation that no significant amount of thought has gone into. Here's what an individual document looks like when printed from the shell:
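Using illustrative field names (my own, not necessarily the post's originals), such a document might look like:

```javascript
// Illustrative document-per-event shape: one MongoDB document per raw
// buoy reading. Field names here are my own invention.
var eventDoc = {
  // _id omitted: mongod assigns an ObjectId automatically on insert
  buoy: 7,
  ts: new Date("2012-10-29T14:05:32Z"),
  waveHeight: 9.42, // meters
  windSpeed: 31.7,  // knots
  waterTemp: 11.2   // degrees Celsius
};
// In the shell: db.events.insert(eventDoc)
```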

Document Per Buoy Per Hour

This second approach is an attempt at seeing if the Aggregation Framework can unwind and aggregate subdocuments faster than it can operate against multiple root level documents (as we tried above).

To test this, we'll create a single root level document per buoy per hour, with all associated events being pushed onto an array. Here's what an example document will look like if you print it from the shell:
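Again with illustrative field names of my own, such a document might look like:

```javascript
// Illustrative document-per-buoy-per-hour shape: one root document per
// buoy per hour, with raw events embedded in an array. The compound _id
// and field names are hypothetical.
var hourDoc = {
  _id: "7:2012-10-29T14", // hypothetical compound key: buoy + hour
  buoy: 7,
  hour: new Date("2012-10-29T14:00:00Z"),
  events: [
    { ts: new Date("2012-10-29T14:05:32Z"), waveHeight: 9.42, windSpeed: 31.7, waterTemp: 11.2 },
    { ts: new Date("2012-10-29T14:20:11Z"), waveHeight: 2.18, windSpeed: 12.3, waterTemp: 10.9 }
  ]
};
// Inserting an event becomes an upsert that pushes onto the array:
//   db.hours.update({ _id: key }, { $push: { events: evt } }, { upsert: true });
```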

Hybrid: Increment-On-Write / Aggregation Framework

This last approach is an attempt at combining the already popular upsert/increment-on-write technique with the awesome power of the Aggregation Framework. Granted, this approach suffers from the same flaw I criticized at the start of this post, namely, a loss of resolution. It does hold some appeal when it comes to high volume feeds, though. Specifically, it allows us to set a "maximum resolution" for our data. In the example below, we'll pre-aggregate on a minute-by-minute basis, so it won't matter whether we're consuming 5 events per minute or 5,000; we essentially get a configurable slider with low latency and storage requirements on one side, and increased granularity on the other. All we have to do is find the sweet spot for the task at hand.

With that said, the only improvement this example makes to the standard formula is to rely on the Aggregation Framework to perform in-database analytics over our pre-aggregated documents, rather than using MapReduce or in-application logic.

To test this approach, we'll create a single root level document per buoy per minute, which will store our sum totals of each metric for that period. We'll also keep count of how many events are represented by each document, allowing us to provide counts and averages. Here's what an example document will look like if you print it from the shell:
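With illustrative field names of my own, a minute bucket might look like:

```javascript
// Illustrative pre-aggregated minute bucket: one document per buoy per
// minute, holding running sums of each metric plus an event count.
// The compound _id and field names are hypothetical.
var minuteDoc = {
  _id: "7:2012-10-29T14:05",
  buoy: 7,
  minute: new Date("2012-10-29T14:05:00Z"),
  count: 4, // how many raw events were folded into this bucket
  sum: {
    waveHeight: 21.6, // sum of the 4 readings, meters
    windSpeed: 98.3,  // knots
    waterTemp: 44.1   // degrees Celsius
  }
};
// Each incoming event is folded in with an atomic upsert, e.g.:
//   db.minutes.update(
//     { _id: key },
//     { $inc: { count: 1, "sum.waveHeight": e.waveHeight,
//               "sum.windSpeed": e.windSpeed, "sum.waterTemp": e.waterTemp } },
//     { upsert: true });
```

Averages fall out naturally: `sum.waveHeight / count` for this bucket is 5.4 meters.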

Notice that at the rate of events we're generating, we only have, on average, 4 events per minute per buoy. Even so, as we'll see below in the comparisons, we already start to notice some nice performance gains. These performance margins only improve as we increase the rate of events relative to the duration that we pre-aggregate them at.
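To make the re-aggregation step concrete, here is a pipeline sketch against a hypothetical minute-bucket shape carrying `count` and `sum.waveHeight` fields. The key point is that sums and counts are re-summed first and divided only at the end; averaging the per-bucket averages would weight every minute equally and give the wrong answer.

```javascript
// Pipeline sketch: hourly average wave height per buoy, re-aggregated
// from hypothetical minute buckets of shape { count, sum: { waveHeight } }.
var hourlyAvgPipeline = [
  { $group: {
      _id: {
        buoy: "$buoy",
        day: { $dayOfYear: "$minute" }, // bucket by calendar day...
        hour: { $hour: "$minute" }      // ...and hour of day
      },
      count: { $sum: "$count" },           // total raw events in the hour
      waveSum: { $sum: "$sum.waveHeight" } // total of the per-minute sums
  } },
  { $project: {
      count: 1,
      avgWaveHeight: { $divide: ["$waveSum", "$count"] } // divide last
  } }
];
// In the shell (MongoDB 2.2+): db.minutes.aggregate(hourlyAvgPipeline)
```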

Results

Observations

Wherein I hastily slap some explanations on top of these results. Any additional insight or corrections in the comments would be very welcome.

First of all, you'll probably notice that several optimizations are being left on the table with respect to the queries being run. This is intentional, as the purpose of this article is to demonstrate performance against ad hoc queries, where pre-optimizing is not an option and not every query will be fine-tuned in a development environment.

Secondly, overall, MongoDB is holding up pretty well here. It would be interesting to compare these against, e.g., MySQL, but as a very non-scientific opinion based on past experience, MongoDB has become quite competitive in this space.

Thirdly, as MongoDB's aggregation capabilities mature, developers are being given a lot of options with regards to how a challenge will be approached. As always, unique approaches come with their own strengths and weaknesses:

In the first 3 queries, our second schema handily defeats our first. The primary difference between these query pipelines is that the first relies solely on scanning our index, whereas the second relies more heavily on unwinding subdocuments. It appears that in these simple cases, then, embedding subdocuments that can later be unwound is generally worthwhile.

The exception to the above is query 4. In the last query, our second schema is over 60% slower than our first schema. What changed here is that we introduced a $match expression after the $unwind. Even though the field we're matching against isn't being indexed, MongoDB seems to be able to perform this operation much more efficiently against root level documents.
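A sketch of the pattern being described - a `$match` applied after an `$unwind`, versus the same filter against root level documents - looks roughly like this (field names are hypothetical):

```javascript
// Sketch of the $match-after-$unwind pattern discussed above, against a
// hypothetical schema embedding raw events in an `events` array.
var unwindThenMatch = [
  { $unwind: "$events" },
  { $match: { "events.windSpeed": { $gte: 30 } } }, // unindexed match, post-unwind
  { $group: { _id: "$buoy", gales: { $sum: 1 } } }
];
// Against document-per-event, the same filter runs directly on root documents:
var rootMatch = [
  { $match: { windSpeed: { $gte: 30 } } },
  { $group: { _id: "$buoy", gales: { $sum: 1 } } }
];
```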

In all 4 queries, as expected, our third schema benefits from the pre-aggregation, which significantly reduces the number of documents to re-aggregate.

We can also see trade-offs in other areas as well, such as insert speed and memory requirements. While all 3 schemata performed nicely on insert speeds (our worst approach is still over 5.2k inserts a second), we had a clear winner here with our first schema achieving over 16.7k inserts per second (and that's with a compound index!).

Lastly, just to try and wrap this thing up, the Aggregation Framework, despite being quite flexible, is extremely verbose. Personally, I'm a fan of the pipeline model - the query semantics were immediately clear to me, whereas SQL took much longer to grok. I have to wonder, though, if the true strength of this pipeline model will be found in higher level UIs targeting business users. Given that the query semantics are indeed so simple, the impedance mismatch from database to front-end can be effectively dispensed with here, which is quite promising.

Well, I have to end this somewhere, so I better just do it now. If you've made it all the way to the end, thanks for reading - you're a far better person than I am. We've covered quite a bit of code and comparisons, and I'm quite sure I've excluded or screwed up some obvious takeaways, so please have at it in the comments!

Posted by jmar777 on November 15, 2012

