The software development blog of Valeri Karpov

Menu

Monthly Archives: April 2014

From a performance perspective as well as a developer productivity perspective, MongoDB really shines when you only need to load one document to display a particular page. A traditional hard drive only needs one sequential read to load a single MongoDB document, which limits your performance overhead. In addition, much like how Nas says life is simple because all he needs is one mic, grouping all the data for a single page into one document makes understanding and debugging the page much simpler.

A place where the one document per page heuristic is particularly relevant is on pages that display historical data. Loading a single user object is fast and simple, but running an aggregation to compute the average number of times per month a user performed a certain action over the last 6 months is a costly operation that you don’t necessarily want to do on-demand. NodeJS devs are spoiled in this regard, because scheduling in NodeJS is extremely simple. You can easily schedule these aggregations to run once per day and avoid the performance overhead of running the aggregation every time a user hits the particular page.

However, before MongoDB 2.6, shipping the results of an aggregation into a separate collection required pulling the aggregation results in through the NodeJS driver and inserting them back into MongoDB. Furthermore, aggregation results were limited to 16MB in size, which made doing aggregations that would output one document per user impossible. MongoDB 2.6, however, introduced a $out aggregation pipeline stage, which writes the output of the aggregation to a separate collection, and removed the 16MB aggregation limit.

Getting transformed data $out of aggregation

Let’s take a look at how this can be used in practice in NodeJS. Recall the food journal app from the first part of this series: let’s add a route that will display the user’s average calories per day broken down on a per-week basis. This involves a slow and complex aggregation, so we’ll schedule this aggregation to run once per day and dump its data to a new collection using $out. The data for this route will get recomputed for all users using one aggregation, and each time the user hits the API endpoint all the server will do is read one document. Here’s what the aggregation looks like in NodeJS (you can also copy/paste this aggregation pipeline into a mongo shell and get the same result). You can also find this code on Github.

The particular details of the aggregation aren’t that important, what really matters is the $out stage at the end. The $out stage does something very cool: not only will the resulting documents get inserted into a new collection called weekly_calories, $out will overwrite the existing collection once the aggregation completes. In other words, if this aggregation runs for an hour, the weekly_calories collection will remain unchanged until the aggregation is done. After the aggregation finishes, the weekly_calories collection will be atomically replaced by the result of the aggregation. Note that, right now, $out doesn’t have any way of appending to the output collection, it can only overwrite the output collection. Design your aggregations accordingly.

Taking a look at the results

Using a bit of NodeJS magic, we can wrap this aggregation in a service that uses node-cron to schedule itself to run once per day at 0030 (12:30 am) server time:

We can then inject this service into an ExpressJS route and expose the route as a GET /api/weekly JSON API endpoint:

MongoDB 2.6’s improvements to the aggregation framework are a quantum leap forward, and enable you to do some amazing things. While scheduled analytics calculations certainly aren’t the only use case of $out, I hope this post showed you at least one way in which $out allows you to play to MongoDB’s strengths in a new way.

This is Part II of a 3-part series on using new MongoDB 2.6 features in NodeJS. Part III of this series is coming up in 2 weeks, in which I’ll take a look at some of MongoDB 2.6’s query framework improvements, primarily index filters.

MongoDB shipped the newest stable version of its server, 2.6.0, this week. This new release is massive: there were about 4000 commits between 2.4 and 2.6. Unsurprisingly, the release notes are a pretty dense read and don’t quite convey how cool some of these new features are. To remedy that, I’ll dedicate a couple posts to putting on my NodeJS web developer hat and exploring interesting use cases for new features in 2.6. The first feature I’ll dig in to is text search, or, in layman’s terms, Google for your MongoDB documents.

Text search was technically in 2.4, but it was an experimental feature and not part of the query framework. Now, in 2.6, text is a full-fledged query operator, enabling you search for documents by text in 15 different languages.

Getting Started With Text Search

Let’s dive right in and use text search on the USDA SR-25 data set described in this post. You can download a mongorestore-friendly version of the data set here. The data set contains 8194 food items with associated nutrition data, and each food item has a human-readable description, e.g. “Kale, raw” or “Bison, ground, grass-fed, cooked”. Ideally, as a client of this data set, we shouldn’t have to remember whether we need to enter “Bison, grass-fed, ground, cooked” or “Bison, ground, grass-fed, cooked” to get the data we’re looking for. We should just be able to put in “grass-fed bison” and get reasonable results.

Thankfully, text search makes this simple. In order to do text search, first we need to create a text index on your copy of the USDA nutrition collection. Lets create one on the food item’s description:

db.nutrition.ensureIndex({ description : "text" });

Now, we can search the data set for our “raw kale” and “grass-fed bison”, and see what we get:

Unfortunately, the results we got aren’t that useful, because they’re not in order of relevance. Unless we explicitly tell MongoDB to sort by the text score, we probably won’t get the most relevant documents first. Thankfully, with the help of the new $meta keyword (which is currently only useful for getting the text score), we can tell MongoDB to sort by text score as described here:

First, an important note on the compatibility of text search with NodeJS community projects: the MongoDB NodeJS driver is compatible with text search going back to at least 1.3.0. However, only the latest version of mquery, 0.6.0, is compatible with text search. By extension, the popular ODM Mongoose, which relies on mquery, unfortunately doesn’t have a text search compatible release at the time of this blog post. I pushed a commit to fix this and the next version of Mongoose, 3.8.9, should allow you to sort by text score. In summary, to use MongoDB text search, here are the version restrictions:

MongoDB NodeJS driver: >= 1.4.0 is recommended, but it seems to work going back to at least 1.2.0 in my personal experiments.

Now that you know which versions are supported, let’s demonstrate how to actually do text search with the NodeJS driver. I created a simple food journal (e.g. an app that counts calories for you when you enter in how much of a certain food you’ve eaten) app that is meant to tie in to the SR-25 data set. This app is available on GitHub here, so feel free to play with it.

The LeanMEAN app exposes an API endpoint, GET /api/food/search/:search, that runs text search on a local copy of the SR-25 data set. The implementation of this endpoint is here. For convenience, here is the actual implementation, where the foodItem variable is a wrapper around the Node driver’s connection to the SR-25 collection.