What I noticed was that it was quite slow. For example, looping over 12,200 documents to get this data out took a big fat 600 milliseconds on average! Note that there's no overhead from dressing things up as Python model instances: .collection.find() just returns dicts straight from pymongo.

After realising that I actually only need time, right and answer, I just set the fields parameter like this:

```python
times = defaultdict(list)
for pq in db.PlayedQuestion.collection.find(
        {'play.$id': {'$in': play_ids}},
        fields=('time', 'right', 'answer')  # the difference
        ):
    if pq['time'] is not None:
        if pq['right']:
            times['right'].append(pq['time'])
        elif pq['answer']:
            times['wrong'].append(pq['time'])
```

But wait a minute! If I only care about the documents whose time isn't None (null), can't I include that in the query? Well, the time field is not indexed, but it's usually a good idea to push work onto the database anyway. Let's try:
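The query isn't shown above, but it would look roughly like this, a sketch assuming the same collection and play_ids as before ('$ne': None is the standard MongoDB operator for excluding nulls; make_query and collect_times are hypothetical helper names, not anything from the original code):

```python
from collections import defaultdict


def make_query(play_ids):
    # Same filter as before, plus the not-None condition pushed
    # into MongoDB so null times never leave the database.
    return {
        'play.$id': {'$in': play_ids},
        'time': {'$ne': None},
    }


def collect_times(documents):
    # `documents` is whatever find() yields: plain dicts.
    # No `is not None` check needed here any more; the database
    # already filtered those out.
    times = defaultdict(list)
    for pq in documents:
        if pq['right']:
            times['right'].append(pq['time'])
        elif pq['answer']:
            times['wrong'].append(pq['time'])
    return times


# Against a real collection this would be run as (not executed here):
# times = collect_times(db.PlayedQuestion.collection.find(
#     make_query(play_ids), fields=('time', 'right', 'answer')))
```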

You'd think this would be a lot faster now, since the only field it returns is time, which is a floating-point number and takes up significantly fewer bytes than answer, which is a unicode string. However, this approach takes 63 milliseconds on average, which is worse than before. Having said that, things are now getting a bit weird, because I only have 1,575 documents to loop over, so a large part of that 63 milliseconds is probably spent just setting up and tearing down the query. Next: map/reduce!

As you might have noticed, this query is now ripe for an aggregation instead of this silly for-loop. So, let's try a map/reduce job on it:
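The actual map/reduce code isn't shown here, but a sketch of what it might look like follows. One subtlety I'm assuming the job has to handle: you can't compute an average directly in the reduce function, because MongoDB may call reduce repeatedly on partial results, so the sketch carries a running sum and count and divides in a finalize step:

```python
# JavaScript source for the map/reduce job. With pymongo these
# strings can also be wrapped in bson.code.Code.
map_js = """
function () {
    if (this.time !== null) {
        if (this.right) {
            emit('right', {sum: this.time, count: 1});
        } else if (this.answer) {
            emit('wrong', {sum: this.time, count: 1});
        }
    }
}
"""

reduce_js = """
function (key, values) {
    var out = {sum: 0, count: 0};
    values.forEach(function (v) {
        out.sum += v.sum;
        out.count += v.count;
    });
    return out;
}
"""

finalize_js = """
function (key, value) {
    return value.sum / value.count;
}
"""

# Against a real collection this would be run roughly as (not executed here):
# result = db.PlayedQuestion.collection.map_reduce(
#     map_js, reduce_js, 'avg_times',
#     finalize=finalize_js,
#     query={'play.$id': {'$in': play_ids}})
```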

Unfortunately, all that juggling and JavaScript malarkey isn't actually that performant: it takes 130 milliseconds on average. Bear in mind that the query reduces the collection down to a small number of documents (1,575), which might mean that a map/reduce job simply isn't the right tool.

In conclusion

I think doing this optimization dance was a bit premature. The collection is simply too small. Looping over 1.5k lightweight dicts in Python is a breeze. Also, knowing what data you need from a query makes a huge difference (200% in my case).

Also, by the time my database is big enough that silly loops in the driver don't work, MongoDB might have the promised new aggregation functions (estimated for version 2.2, two releases away), which will do basic operations such as average() natively, without the JavaScript parsing.
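For the curious, here's a rough sketch of what the same computation might look like with that aggregation framework, assuming $match/$group/$avg land as documented; the bucketing is slightly simplified compared to the loop above (everything that isn't right gets counted as wrong), and play_ids is a placeholder:

```python
play_ids = []  # placeholder; really a list of play DBRef ids

pipeline = [
    # same filter as the find() query, nulls excluded server-side
    {'$match': {'play.$id': {'$in': play_ids},
                'time': {'$ne': None}}},
    # label each document 'right' or 'wrong'
    {'$project': {'time': 1,
                  'bucket': {'$cond': ['$right', 'right', 'wrong']}}},
    # the native average, no JavaScript involved
    {'$group': {'_id': '$bucket',
                'avg_time': {'$avg': '$time'}}},
]

# Would be run as (not executed here):
# db.PlayedQuestion.collection.aggregate(pipeline)
```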

Peter is a senior web developer at Mozilla, currently working on Elmo, Socorro and various internal projects. His main weapons of choice are Django and PostgreSQL, but he dabbles in other frameworks and databases when time allows. He writes more opinionated nerdery on www.peterbe.com