Making our MongoDB Code Run Faster

09 May 2011

Yesterday I talked about the work I was doing in improving the performance around mogade's ranking functionality. Basically, we denormalized our data and leaned on Redis for its built-in sorted set data structure.

When you're working closely with code, trying to improve specific aspects of it, all types of smaller things jump out at you trying to get your attention. Today I wanted to run a some benchmarks and see if any of improvements I briefly considered yesterday would actually be worth implementing.

Idea 1 - Reducing Index Memory

In order to efficiently get leaderboard pages, there's a composite index on leaderboard_id, created_at and points. If we want today's leaderboard, we end up doing the MongoDB equivalent of where leaderboard_id = 1 and created_at >= '2011/05/09 00:00' order by points desc (assuming today is May 9th). So our index is pretty solid.

For 500,000 scores, this index takes ~26MB. My idea? Rather than storing the absolute date the score was entered, store the time as an offset in days from a certain point (say Jan 1st 2011). For example, our first document's date would change from 2011/04/11 10:32 to 100. I know, I know...people hate this type of date messing up..but I wanted to try it out.

The reason for this madness? I wanted to know what impact indexing a 32bit field (an int) rather than a 64bit (a date) field would have on memory. You'd think that at a mere 26MB, I shouldn't care...but I can't help it. As I probably could have guessed, given that this index is made up of 3 fields, and cutting the storage requirement of one field by 1/2, resulted in an approximate memory savings of 1/6th. Not a compelling improvement.

Idea 2 - Improve Paging Performance

Although my main interest with the above change was to reduce memory footprint, I couldn't help but wonder what impact it might have on fetching leaderboards. Rather than finding all scores where the date >= XYZ we're able to use a straight up equality operator (because all scores for today will have the same date value).

Using the original approach, grabbing a couple thousand random pages of scores took about 25 seconds. Using the new approach took 5 seconds. Wow, hold on a minute, a 5 times improvement? This is something I'm interested in.

Two things changed here though, first we're using an equality rather than a greater than or equal. Second, we're compare an int versus a date. I suspected that the performance gain was coming from the equality check. I rolled back my date change and tried again. The same 5 time speed improvement.

This means if we store our date without hours we can still get our 5 times performance improvement without making our data all fugly and weird.

Idea 3 - Rename fields

In MongoDB every field name is stored in every document. This leads to developers who start to try to cheat some characters out of field names. Using the full names our 500K documents took around 80MB. Renaming the fields to things like lid, n, u, cat and p reduced this to 62MB. Not bad.

In v2 we are [stupidly] using our own thin mapping layer between our models and the MongoDB ruby driver. The code takes care of this aliasing for us, so there's no reason not to use it. Hopefully though, this is something MongoDB will support natively one day. Also, our data is pretty small, so the ratio of field name length to data is significant. You might not get the same results.

Idea 4 - Removing an Index

There's a 3rd index (I say 3rd, because I'm also counting the default (and unremovable) index on _id) in our scores table. It's on the leaderboard_id and unique fields. This is the index used when finding or updating a specific player's score for a leaderboard. Since unique is a string (a sha1 hash), this index is as big as the other two indexes combined, and over 1/2 the size of our data. If I'm really interested in reducing memory usage, this seems like an ideal candidate

If you didn't read yesterday's post, you'll need to understand that we actually have 3 scores collections. One for the daily, weekly and overall scores. Also, yesterday we introduced a high_scores collection which tracks player's best score. This is duplicate data from what's available in our 3 scores collections, but we can get to it with a single read, rather than 3.

My idea is to store the _id of the daily, weekly and overall scores within the high_scores collection (in addition to the points which we are already storing). Since we are already reading and writing from high_scores when saving a new score, it isn't too much work to bring back an few more bytes. Rather than updating by leaderboard_id and unique, we can update directly against _id.

This doesn't only mean we get to drop our index, we actually get to remove the unique field since that's all it is being used for. Within high_scores we'll still have an index on leaderboard_id and unique, but we'll have this in a single collection rather than 3 collection. This should reduce that indexes memory footprint by 2/3s (plus 2/3s of whatever space is taken up by the unique field, which itself won't be small potatoes).

Conclusion

I think the most surprising outcome from this was the raw performance gain I saw when moving from >= to ==. Obviously I need to brush up on my B-Tree knowledge (the structure MongoDB uses for indexes). I'm also excited about the memory savings we'll get by removing the 3rd index (our production DB has more than 500K records). This is one of those things that, in retrospect, should have been obvious a long time ago.

For what it's worth, I enjoy doing premature optimization. You can think it's evil, but I find it fun. If I'm coding on my own time, and not having fun, something's not right. Beyond that though, I really think that there are legitimate and important reasons for premature optimization.

First, it is a great way to learn some fundamental stuff that you probably wouldn't consider looking at otherwise. I don't care if you've never had slow SQL, I think every developer should understand a bit about B-Trees, should know how to read query plans, and make sense of the various join approaches a database might take. As more and more gets abstracted away, it becomes increasing important to go out of our way to understand how, once fundamental things like memory allocation, work. Optimizing code is a good, practical way to get a hands on feel for this stuff.

Secondly, performance is a feature. A really important feature. Often times, you simply don't know how slow, or fast, something is until you start playing with it. Was my date to int change stupid? Sure. Could I have used my little brain to figure out that I'd get a 1/6 saving? Eh..perhaps. Should I have stopped once I realized I was talking about 26MB? Probably. But now I know. Plus, in exploring that memory improvement, I found a 5x performance improvement which I will be rolling out. And I won't only be rolling it out, I'm going to try to get a better understanding as to why it's really 5x faster, and apply that knowledge, wherever appropriate, moving forward.