In the previous article, I showed two essential ways events get tracked.

Now, I would like to focus on the latter way of tracking: storing all the event data and analyzing it ad hoc afterwards.

The main concern of this post is analyzing that data as fast as possible while preserving analytical power.

We'll explore this using a shop tracking system as an example. It tracks only one kind of event: orders. To keep the samples simple, we'll calculate only these metrics:

- average purchase total
- top products

The structure of the event document is as follows (a sample document is shown after the list):

- total: float order total
- line items: array with elements like:
  - sku: string product SKU
  - price: float item price
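For concreteness, here is what one such order event might look like. The exact field names aren't given above, so treat `line_items` (and the sample values) as my assumption:

```js
// A hypothetical order event; "line_items" is an assumed field name.
{
  total: 52.97,
  line_items: [
    { sku: "BOOK-0042", price: 18.99 },
    { sku: "MUG-0007",  price: 9.99  },
    { sku: "PEN-0013",  price: 23.99 }
  ]
}
```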

For the sake of this example, we will run the different analysis methods against a set of 1,000,000 such order events, each with 5 line items on average.

We'll use MongoDB, since schema-less documents are a natural fit for storing events, and because of its powerful analytics features. Note that those features will mainly be described in light of their application to the tasks above.

The disadvantage of using db.eval() is the write lock placed while it runs and the overhead of executing it in the JavaScript VM.
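To make that concrete, here is a minimal sketch of how the average purchase metric could be computed server-side with db.eval(). The collection name `orders` is my assumption:

```js
// Sketch: average purchase total computed inside the server's JS VM.
// Assumes the order events live in a collection named "orders".
var avgPurchase = db.eval(function () {
  var sum = 0, count = 0;
  db.orders.find({}, { total: 1 }).forEach(function (order) {
    sum += order.total;
    count += 1;
  });
  return count ? sum / count : 0;
});
```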

Map/reduce

Map/reduce is a programming model for processing large data sets. One of its highlights is that computation can be distributed across a cluster. Beyond that, it retains the benefits of db.eval(). You can learn more about how MongoDB implements it in the MongoDB docs.

Note that map/reduce is not always a suitable substitute for db.eval(). The tasks we are trying to accomplish, on the other hand, are a perfect match for it, as the sketch below shows.
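For illustration, here is the average purchase metric written as a map/reduce job. This is a sketch under the same assumptions as before (a collection named `orders`), not the exact code behind the timings below:

```js
// Sketch: average purchase total via mapReduce on an "orders" collection.
db.orders.mapReduce(
  function () {
    // Emit every order under a single key so one reduce aggregates them all.
    emit("avg_purchase", { sum: this.total, count: 1 });
  },
  function (key, values) {
    // Reduce must be re-runnable, so we only accumulate sums and counts here.
    var acc = { sum: 0, count: 0 };
    values.forEach(function (v) {
      acc.sum += v.sum;
      acc.count += v.count;
    });
    return acc;
  },
  {
    out: { inline: 1 },
    // The actual average is computed once, after the final reduce.
    finalize: function (key, value) {
      return value.sum / value.count;
    }
  }
);
```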

Speed was generally worse compared to db.eval(), because map/reduce splits the task into two phases. If map/reduce is run on a sharded cluster, the tasks are dispatched to each shard, making it faster. (Map/reduce wasn't meant to be damn fast anyway.)

| map/reduce   | time      |
| ------------ | --------- |
| avg purchase | 20,498ms  |
| top products | 112,035ms |
| total        | 132,534ms |

It also suffers from the same problem as db.eval(): map/reduce places the write lock while it runs.

Aggregation framework

While map/reduce is truly awesome, the amount of code to aggregate simple things is often overwhelming.

Assuming you're familiar with the UNIX shell, you might also wonder: why can't you easily chain different operations in map/reduce? What if you wanted to do something like:

match_criteria | sort | limit | group

Doing that with map/reduce would be a pain in the ass.

OK, that sucks, but what are the alternatives? Well, MongoDB 2.1 and higher ships with something called the "aggregation framework". Essentially, it is chainable map/reduce with common use cases implemented and optimized up front. This is not a tutorial on it, so refer to the docs when you need them; I'll just show how neat it is.
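Here is how the two metrics from our example might look as aggregation pipelines. As before, the collection and field names (`orders`, `line_items`) are my assumptions, and I'm interpreting "top products" as the SKUs that appear most often in line items:

```js
// Sketch: average purchase total as a one-stage pipeline.
db.orders.aggregate([
  { $group: { _id: null, avgPurchase: { $avg: "$total" } } }
]);

// Sketch: top products, ranked by how often a SKU appears in line items.
db.orders.aggregate([
  { $unwind: "$line_items" },                                 // one doc per line item
  { $group: { _id: "$line_items.sku", count: { $sum: 1 } } }, // count per SKU
  { $sort: { count: -1 } },                                   // most frequent first
  { $limit: 10 }                                              // top 10
]);
```

Compare that with the map/reduce sketch above: each pipeline stage maps directly onto one of the chained operations we wished for.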

Running it, we notice a great speed improvement compared to any of the previously discussed approaches:

| aggregation  | time     |
| ------------ | -------- |
| avg purchase | 2,511ms  |
| top products | 29,146ms |
| total        | 31,658ms |

Possible optimization techniques

Whatever method you end up using, it is wise to apply some optimization to metric calculation.

Computing your metrics on every access would be super-dumb. Instead, once calculated, the value should be cached and returned until the metric's dependencies change.

In our examples, caches should be invalidated only when a new order is tracked.
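One possible shape of that cache, as a rough sketch: keep each metric's last value in a hypothetical `metrics` collection and flip a `valid` flag whenever an order comes in. None of this is from the benchmarks above, just one way to do it:

```js
// Sketch: return the cached metric value, recomputing only when invalidated.
function cachedMetric(name, compute) {
  var cached = db.metrics.findOne({ _id: name, valid: true });
  if (cached) return cached.value;
  var value = compute();                       // e.g. run an aggregation pipeline
  db.metrics.save({ _id: name, value: value, valid: true });
  return value;
}

// Sketch: tracking a new order invalidates every cached metric.
function trackOrder(order) {
  db.orders.insert(order);
  db.metrics.update({}, { $set: { valid: false } }, { multi: true });
}
```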

Comparison

|                        | naive                              | db.eval() | map/reduce | aggregation |
| ---------------------- | ---------------------------------- | --------- | ---------- | ----------- |
| avg purchase           | 119,775ms                          | 8,228ms   | 20,498ms   | 2,511ms     |
| top products           | canceled after 2+ hours of waiting | 42,399ms  | 112,035ms  | 29,146ms    |
| total                  | canceled after 2+ hours of waiting | 50,627ms  | 132,543ms  | 31,658ms    |
| speed                  | exceptionally low                  | high      | medium     | super high  |
| memory usage           | high                               | low       | low        | low         |
| data transfer overhead | yes                                | no        | no         | no          |
| JS VM overhead         | no                                 | yes       | yes        | yes         |
| write lock             | no                                 | yes       | yes        | yes         |
| easily distributed     | n/a                                | no        | yes        | yes         |
| optimized              | n/a                                | n/a       | no         | yes         |
| output limit           | n/a                                | n/a       | 16MB       | 16MB        |
| chainable              | n/a                                | n/a       | no         | yes         |

Conclusion

Using the MongoDB aggregation framework proves to be the fastest option you have. It is certainly powerful enough to perform simple as well as more advanced analysis of data. It scales well and is simple to write.

If you need something more sophisticated than the aggregation framework can express, you may use map/reduce, trading off performance and losing easy chainability. It still scales well, though.

If your task is really simple but impossible to express with the aggregation framework, or very slow under map/reduce, it might be worth giving db.eval() a shot. The downside? It doesn't scale.

Querying all the documents and performing the analysis right in your programming language is the last resort. It's a loss by all means, and generally it should be used only when none of the other options leads to the desired result.