Rethink Your Data

07 Nov 2011

NoSQL isn't just about new tools; it's also about new (and old) modeling and data access techniques. Since traditional modeling approaches are still relevant, it would be foolish to forget everything you know about data modeling. Equally foolish would be failing to learn, experiment with and consider different techniques, especially in the face of poorly performing queries.

For most applications, reads far outnumber writes. Yet most code is written so that reads are often complex while writes are dead simple. When performance is an issue, going from complex queries (involving a lot of fields, aggregation or map reduce) to simple queries can have devastatingly awesome consequences.

The Sample Scenario

We want to build a leaderboard system for groups of friends. When users go to their score page, they see not only their own best score per level, but also the best score among their friends.

This approach fits with how we've been taught to build systems: well-defined, distinct entities that map to real-world objects.

With this model, saving scores is simple. We just need to figure out how to display our results. It turns out that, at scale, there isn't a great way to do that. The two most likely solutions I see are to either do some type of aggregation (GROUP BY in SQL, map-reduce in MongoDB) or to hit the database multiple times (linearly in the number of levels we are showing). Neither works particularly well with a lot of levels, a lot of friends, or a sharded architecture.
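To make the pain concrete, here's a plain-JavaScript, in-memory stand-in for that read. The function, field names and data are illustrative, not from a real schema; the point is the shape of the work:

```javascript
// Find the best score per level among a user's friends.
// In a database this is the GROUP BY / map-reduce step: every
// friend's row for every displayed level has to be touched, so the
// work grows with both the friend list and the level count.
function bestFriendScores(scores, friends) {
  const best = {};
  for (const s of scores) {
    if (!friends.includes(s.user)) continue;
    if (best[s.level] === undefined || s.score > best[s.level].score) {
      best[s.level] = { user: s.user, score: s.score };
    }
  }
  return best;
}

const scores = [
  { user: 'Anna', level: 1, score: 920 },
  { user: 'Anna', level: 2, score: 400 },
  { user: 'Marc', level: 1, score: 700 },
  { user: 'Marc', level: 2, score: 980 },
];
bestFriendScores(scores, ['Anna', 'Marc']);
// → { 1: { user: 'Anna', score: 920 }, 2: { user: 'Marc', score: 980 } }
```

In memory this loop is cheap; against a sharded database, the equivalent aggregation has to gather rows from every shard a friend's scores might live on.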

Saving a new personal best is a single conditional upsert:

// upsert (the final `true`): update Paul's level-4 score only when the
// new score of 500 beats the stored best; insert the document if none
// exists yet. A unique index on {user, level} keeps the upsert from
// inserting a duplicate when the existing score is already higher.
db.scores.update({user: 'Paul', level: 4, score: {$lt: 500}}, {$set: {score: 500}}, true)

Saving a score has gotten more complicated. However, writes often aren't the bottleneck, and the extra complexity only kicks in when the score is a personal best; the rest of the time, it's the same code. If we wanted to, we could optimize further by having Paul know who has him as a friend.
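The write-time complexity comes from fanning a personal best out to each friend. Here's a plain-JavaScript sketch of that path, using in-memory objects rather than a live MongoDB; the structure and names (`best`, `friendBest`, etc.) are my assumptions, not the post's actual schema:

```javascript
// Each user's leaderboard doc carries their own bests plus the best
// known friend score per level, so reads become a single lookup.
// The price: a personal best fans out to every friend's document.
function recordScore(leaderboards, user, level, score) {
  const mine = leaderboards[user];
  if (score <= (mine.best[level] || 0)) return; // not a personal best: cheap path
  mine.best[level] = score;
  for (const friend of mine.friends) {
    const theirs = leaderboards[friend].friendBest;
    if (score > (theirs[level] || 0)) theirs[level] = score;
  }
}

const leaderboards = {
  Paul: { friends: ['Anna'], best: { 4: 450 }, friendBest: {} },
  Anna: { friends: ['Paul'], best: {}, friendBest: { 4: 450 } },
};
recordScore(leaderboards, 'Paul', 4, 500);
// Anna's view of her friends' best on level 4 is now 500.
```

Note the early return: a non-record score never touches anyone else's document, which is why the common case stays as cheap as before.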

To The Purists

Purists might not like this, but to me it's about more than pragmatism. For our system, this denormalized representation of a score is the correct one, because it's what a score actually is. In the real world, data is duplicated. And while duplicated data has a cost, it also has a lot of advantages.

Conclusion

If a read query is slow, and you've done what you can with indexes, then the solution may be to push some of the work onto inserts, updates or a background task. If you have difficulty visualizing it, think of it first in terms of an ideal query, then the structure which would satisfy the query, and finally how to maintain that structure.
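For the leaderboard, that three-step exercise looks like this (a plain-JavaScript sketch; the names are illustrative): the ideal query is a single keyed lookup, the structure is one document per user holding both personal and friend bests, and the maintenance is the extra work done at write time.

```javascript
// Working backwards from the ideal query: the score page should be
// one fetch with no joins and no aggregation, so the structure must
// be a single per-user document that already contains the answer.
function scorePage(leaderboards, user) {
  return leaderboards[user]; // one keyed lookup
}

const leaderboards = {
  Paul: { best: { 4: 500 }, friendBest: { 4: 980 } },
};
scorePage(leaderboards, 'Paul');
// → { best: { 4: 500 }, friendBest: { 4: 980 } }
```

Because the read is a lookup by user, it also shards trivially: one user's page only ever needs one shard.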

This approach can be abused. But more often than not, people aren't thinking about the problem from this perspective at all.

When a query is slow, don't only ask "how can I make this faster?" Also ask "how can I make this simpler?" If another part of the system must pay a price, at least that becomes an option you can consider.