MongoDB Map-Reduce Basics

If you’re accustomed to working with relational databases, the thought of specifying aggregations with map-reduce may be a bit intimidating. Here, in the third in a series of articles on MongoDB aggregation, I explain map-reduce. After reading this, and with a little practice, you’ll be able to apply the map-reduce paradigm to a huge number of aggregation problems.

A comments example, of course

Let’s start with an example. Suppose we have a collection of comments with the following document structure:

{
text: "lmao! great article!",
author: 'kbanker',
votes: 2
}

Here we have a comment authored by ‘kbanker’ with two votes. Now, we want to find the total number of votes each comment author has earned across the entire comment collection. It’s a problem easily solved with map-reduce.

Mapping

As its name suggests, map-reduce essentially involves two operations. The first, specified by our map function, formats our data as a series of key-value pairs. Our key is the comment author’s name (this makes sense only if this username is unique). Our value is a document containing the number of votes. We generate these key-value pairs by emitting them. See below:

When we run map-reduce, the map function is applied to each document. This results in a collection of key-value pairs. What do we do with these results? It turns out that we don’t even have to think about them because they’re automatically passed on to our reduce function.

Reducing

Specifically, the reduce function will be invoked with two arguments: a key and an array of values associated with that key. Returning to our example, we can imagine our reduce function receiving something like this:

reduce('kbanker', [{votes: 2}, {votes: 1}, {votes: 4}]);

Given that, it’s easy to come up with a reduce function for tallying these votes:

Notice that running the mapReduce helper returns stats on the operation; the results of the operation itself are stored in the collection specified, here called “mr_results”.

The other stats also prove informative. First is the operation time in milliseconds. Next are the number of input documents, the number of times we called emit (this can be more than once per document), and the number of output documents. Finally, we can be assured that the operation has succeeded because “ok” is 1.

Of course, what we really want are the results. To get them, just query the output collection:

Output types: merge, reduce, and inline

MongoDB v1.8 introduces a couple changes to map-reduce’s output. First, temporary collections are no more. All output now must go into a specifically named collection or be returned inline.

If you specify the collection name only, then any existing collection of the same name will be overwritten.

But there are now two new options that allow you to preserve the existing collection either by merging the new set of results or folding them in with the reduce function. Let’s take a closer look at these.

The syntax for merging is simple:

db.comments.mapReduce(map, reduce, {out: {merge: “mr_results”}});
A merge will overwrite any existing keys with the newly-generated keys.

The syntax for reducing is similar:

db.comments.mapReduce(map, reduce, {out: {reduce: "mr_results"}});

Reducing works like this: whenever a key in the new results already exists in the output collection, the reduce function is applied to both keys, and the return value replaces the existing key.

Definitions are always hard to visualize; an example should help to clarify the difference between merge and reduce. So suppose an earlier map-reduce places the following two results into an output collection.

Notice that the values for the “hwaet” key are reduced using the reduce function. In this case, this simply means adding the votes together.

The final new map-reduce option allows you to return the results rather than writing them to a collection. Here’s how you use it in JavaScript:

db.comments.mapReduce(map, reduce, {out: {inline: 1}});

If you’re using :out => {:inline => 1} with the Ruby driver, be sure that you also specify the :raw => true parameter to prevent the Collection#map_reduce method from attempting to return an instance of Mongo::Collection.