Tag: MongoDB

In this post we are going to see how to perform a GROUP BY , COUNT (SQL like) query in MongoDB.
MongoDB is an open source, document-oriented database designed with both scalability and developer agility in mind. Mongo allows you to perform this operation in different ways:

Group approach

Aggregate approach

MapReduce approach

For this example I used Ubuntu 14.04 and MongoDB 3.2.9. We will work on the Contacts collection, a test collection that represents a simplified contacts list, with the following format:

1

2

3

4

5

6

{

"Name":"Raven",

"Surname":"Dale",

"City":"Massa Martana",

"Country":"Switzerland"

}

The documents stored in the collection have been generated using the online service Generate Data and then imported using the mongoimport util.

Our goal is to group the contacts by Country and then count how many contacts there are for each country.

1

SELECT Country,count(*)from contacts group by Country

Group approach

This approach uses the collection method group to group your collection by some attributes.
You can define the key you want to use to aggregate and a reduce/aggregation function that will be applied to the grouped rows.
In this example, we are going to aggregate by the Country key and we are simply going to count the aggregated rows (incrementing by 1 the total variable)

1

2

3

4

5

6

7

8

9

db.contacts.group(

{

key:{Country:1},

reduce:function(curr,result){

result.count+=1;

},

initial:{count:0}

}

)

Because db.collection.group() uses JavaScript, it is subject to a number of performance limitations, for example, it does not work with sharded clusters. For the full documentation about the group collection method you can take a look to this page.

Aggregate $group approach

This approach uses the MongDB aggregation framework pipeline. The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results. To group the documents we will use the $group pipeline stage. We are going to specify the field that will be used as aggregation key (_id) and the $sum accumulator operator to count the number of aggregated rows.

1

2

3

4

5

6

7

8

9

10

db.contacts.aggregate(

[

{

$group:{

_id:{_id:"$Country"},

count:{$sum:1}

}

}

]

)

Here you can find the list of all the accumulation function and the $group documentation.

MapReduce approach

Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. MongoDB provides map-reduce operations to perform aggregation. Map-reduce operations have two phases: a map stage that processes each document and emits one or more objects for each input document, and reduce phase that combines the output of the map operation. To group by Country we define a new map function that group by the Country key and emit 1 as value (this task looks like the WordCount well known problem).

1

2

3

varmapCountryFunction=function(){

emit(this.Country,1);

};

The reduce function will sum up these value.

1

2

3

varreduceCountryFunction=function(country,n){

returnArray.sum(n);

};

You can now run a mapreduce job from the collection. We can define an output collection, where the results will be stored.

1

2

3

4

5

db.contacts.mapReduce(

mapCountryFunction,

reduceCountryFunction,

{out:"map_reduce_contacts_out"}

)

Results

The output produced by the three different approaches, will look like this:

We grouped by Country and counted how many contact we have for each country.

So, Mongo provides three different ways to group and count documents in a collection, but which one should you use? Here some considerations reported in the MongoDB manual:

The Group approach provides simple access to common aggregation processes but it lacks the flexibility and capabilities of the aggregation pipeline and map-reduce and can not be used for sharded collections.

The MapReduce approach can be used for sharded collections and can store the output result to a new collection and it provides some flexibility that is not presently available in the aggregation pipeline.

The Aggregation pipeline ($group) is usually more efficient and less complex than the MapReduce approach and it can be used for shared collections.

To choose which approach is more suitable for your needs, I suggest you to read the full MongoDB documentation and to run some performance tests on your data. The decisions about the application environment and the development paradigms should be data-driven, so choose the best approach for your data and your timing/performance needs.

This website or its third-party tools use cookies, which are necessary to its functioning and required to achieve the purposes illustrated in the cookie policy. If you want to know more or withdraw your consent to all or some of the cookies, please refer to the coockie policy. Got it!Reject.