Indexing fundamentals and considerations

The importance of proper indexing

Indexes in MongoDB are B-tree data structures that dramatically reduce the resource (i.e., RAM, CPU, and I/O) requirements of querying. Proper indexing is critical; even a single unindexed query is enough to cause significant performance degradation.

When a query is not indexed, MongoDB performs a full collection scan, iterating through every single document in the collection to find results. In contrast, an indexed query performs and scales much better because it inspects far fewer documents.

Without sufficient indexing, the hardware costs for a performant database increase significantly. Importantly, because un-indexed queries use more resources than needed, they also impact the performance of other operations–even if those operations are indexed.

Limiting the number of documents to scan using indexes

Single-field indexes

A query like {"name":"Adam"} may seem straightforward, but if it’s not indexed, it could perform quite badly (the larger the collection, the worse the query will perform).

Rather than examining all the documents in a collection, create a Single Field Index on this “name” field:

db.collection.createIndex( {“name”: 1} )

Now, no matter how large the collection, only documents matching {"name":"Adam"} will enter into memory.

Compound indexes

When a query includes multiple filters, it is best to include all filters in a compound index. For example, {"name":"Adam","year":{"$gte":2016}} can use the following Compound Index:

db.collection.createIndex( {“name”: 1, "year": 1} )

This is preferable to a {"year":1,"name":1} index because an equivalency check on “name” is more selective than a range check on “stars”.

Sorting efficiently using indexes

The sort() method returns documents in the requested ascending (1) or descending (-1) sort order. For example, the results of .sort( {"lastName":1} ) are in ascending, alphabetical order.

For a query with sort that isn’t covered by an index, MongoDB sorts results at query time, using precious CPU time and delaying the response to the app.

However, if that same query is covered by an index which includes, in proper order, all fields used to filter and sort the result set, MongoDB does not sort results at query time. Instead, results are already sorted in the index and can be returned immediately.

Determining the order of fields

A good rule of thumb for queries with sort is to order the indexed fields in this order:

Finally, the field(s) on which you will query for a range of values in the order of most selective to least selective (see range operators below)

The reason one $in filter can precede sort fields is because MongoDB can break the query up according to that $in and merge sort the results.

An example of using this rule of thumb is in the section on “Sorting the results of a complex query on a range of values” below, including a link to further reading. Note that these ordering recommendations are good for most use-cases but may not be the best for your data. The best index in a given scenario is determined through testing using .explain.

Range operators

Query fields(s) which use the following operators should generally be included after the field(s) on which you will sort:

$in, $nin

$gt, $gte, $lt, $lte

$ne

$regex

$exists:true (note that $exists:false is not considered a range operator for sorting purposes)

Determining sort orders

MongoDB can traverse an index in either direction. The {"lastName":1} index therefore satisfies both .sort( {"lastName":1} ) and .sort( {"lastName":-1} ). Therefore, for single-field indexes, the sort order of keys does not matter.

However, for compound indexes, sort order can matter in determining whether the index can support a sort operation - the combination of index field order and each field’s sort order determines whether the index can support a sort operation.

Even though the query is using an index that includes the sort field, the scanAndOrder: 1 is evidence that an in-memory sort took place. This is because of the range criteria on count and the structure of the index. To avoid scanAndOrder in these situations, place sort fields before range fields in the index.

The omission of “scanAndOrder:1” confirms that the query is completed without an in-memory sort.

To better understand the log lines above, read the next section below. Further reading on this topic of sorting the results of complex queries is available at our blog post on sorting the results of queries on ranges.

Performing efficient $or queries

In general, index for $or queries as though each $or clause was its own query. Be sure to consider the sort clause as part of each $or clause.

SERVER-13732

Certain $or queries are affected by a known bug, SERVER-13732, which prevents efficient index selection. To avoid this bug when composing $or queries, ensure that $or is the only top-level field in the query. For example:

Since both of these $or clauses use the same query filters, each clause can use the same index:

{"year":1,"name":1,"sex":1,"count":-1}

Comparing a log line for an unindexed query vs. an indexed one

This example uses the same baby name statistics data as above and compares two log lines for the same query.

The first operation takes almost 2 seconds because it does not use an index, but the second takes 1 millisecond with an index. The key
differences between these log messages lie in the planSummary,nscannedObjects, and scanAndOrder values.

Starting in MongoDB 3.2, the following key words in MongoDB log lines changed:

scanAndOrder —> hasSortStage

nscanned —> keysExamined

nscannedObjects —> docsExamined

Example query with sort

We want to find the number of babies named “Arya” each year in ascending year order. Our database query is:

The following information observed in the log line above shows a lot of room for improvement:

planSummary: The planSummary: COLLSCAN shows the database did not use an index and instead performed a full collection scan.

nscannedObjects: The nscannedObjects:1858689 shows the number of documents inspected. By comparing this to the number of returned documents (nreturned:58), we see how inefficient the query is. To return 58 documents, the database brought almost two million documents into memory. Note that in MongoDB 3.2+ this would look like docsExamined:1858689.

scanAndOrder: The scanAndOrder:1 shows that the results were sorted in memory after the data was retrieved. Note that in MongoDB 3.2+ this would look like hasSortStage:1.

The following information observed in the log line above shows an efficient query:

planSummary: The planSummary: IXSCAN { name: 1.0, year: 1.0 } indicates what index was used for the query.

nscannedObjects: The nscannedObjects:58 shows that only 58 documents were inspected. Because this number matches the number of documents in the result set (nreturned:58), the query is very efficient. Note that in MongoDB 3.2+ this would look like docsExamined:58.

scanAndOrder: The omission of scanAndOrder:1 means results were not sorted in memory; the index allowed documents to be read in the requested order. Note that the omission of hasSortStage:1 in MongoDB 3.2+ means the same thing.

Results were not sorted in memory, and the number of documents inspected is equal to the number of documents returned. This is an efficient query.

Further reading

One of MongoDB’s strengths is its flexible data model. Flexible data modeling is powerful, but dangerous without proper indexing. If your data contains complex fields such as arrays, embedded documents, or geospatial data, read the following for important information about index behavior:

Special indexing considerations

Avoid operators/modifiers that can never use indexes

We recommend against using the below operators and modifiers that cannot use indexes.

$where operator

$where does not use indexes and instead uses the server-side JavaScript engine to examine documents in memory. Importantly, the $where operator is usually only necessary to evaluate field values against each other. When applying conditions on field values independently, use the MongoDB query language instead.

The following two examples illustrate how the $where operator can be re-written as a query.

Sometimes it’s not possible to replace a $where with a query but it is possible using MongoDB’s aggregation framework. For example, db.collection.find( {"$where": "this.count >= this.maxCount"} ) cannot be rewritten using a MongoDB query but it can be evaluated using MongoDB’s aggregation framework.

If you absolutely must use $where, limit the number of documents evaluated to a very small number using standard query operators with supporting indexes.

$size operator

$size does not use indexes and instead must examine documents in memory to count the number of elements in embedded arrays. If you must use $size, limit the number of documents evaluated to a very small number using standard query operators with supporting indexes.

$nor operator

$nor will always perform a collection scan (i.e., examine every document in a collection), so the performance impact is based on the size of the collection. If at all possible, try and avoid the $nor operator so your query patterns can properly use indexes. Below are two examples of how the $nor operator can be re-written.

When refactoring queries, be sure to test that query results are as expected and logically correct for your application.

$natural modifier

Sorting by $natural does not use indexes and instead forces a collection scan, even if a query filter is provided.

If you must use $natural, also use a small limit value to keep the number of documents scanned to a minimum. Alternatively, size your cluster to ensure that the queried collection can be part of the working set without disrupting database performance.

Count operations can use more memory than expected

A count operation can use an index for selective conditions just like a query can, but note that it will need to scan documents if the query filter is not fully selective, or if it does not target a single continuous range of index keys (e.g., it has an $in clause).

Consider a collection with the following index:

{"students":1,"grade":1}

The first three count operation examples that follow will need to examine documents in addition to using the index:

This count operation does not target a single continuous range of index keys with the $in clause.

Example 2:

db.collection.find( {"students": {"$gt": 4}, "grade": 4} ).count()

This count operation will need to examine documents in addition to using the index. Keys for grade: 4 are not contiguous; they are separated by different students values because students occurs in the index before grade.

Example 3:

db.collection.find( {"students": 4, "grade": 4, "age": 10} ).count()

The {"students":1,"grade":1} index does not include the age field, so it is not fully selective for this query.

Example 4:

db.collection.find( {"students": 4, "grade": {"$gt": 4}} ).count()

In contrast to the first three examples above, this count operation is optimal in that it will use only the index.

Avoid using a sort with geospatial $near queries

Geospatial $near, $nearSphere, and $geoNear operators automatically return results in order (sorted by distance from the point specified). Providing an additional sort with the $near or $nearSphere operator will perform more work than necessary.

If your use case requires finding points within an area that are either not sorted or sorted by criteria other than distance from a central point, use the $geoWithin operator instead.

Use the $maxDistance operator with geospatial queries

Geospatial $near, $nearSphere, and $geoNear queries should take advantage of the optional $maxDistance operator wherever possible. By limiting the distance considered, you can avoid excess use of CPU and memory.

Provide a $maxDistance value that is appropriate for your use case, both in terms of the distance to cover and the number of results that should be considered.

Using embedded documents as _id values

Embedded document _ids allow you to encode complex values that aren’t
included in the default ObjectId-based _ids generated by MongoDB and MongoDB drivers. However, we recommend using an ObjectId whenever possible.

If you must use embedded documents as _ids, there are important considerations to be aware of:

The dynamic embedded document data type is less efficient than the default ObjectId data type.

We can improve the performance of the above regex operation in three different ways.

1) Case-sensitive prefix expression

To keep the number of index entries scanned to a minimum, you can make use of a case-sensitive, prefix expression. We make the regex case-sensitive by dropping the case insensitive modifier i at the end of the regex, and make the operation a prefix expression by adding a carrot ^ to the beginning of the regex.

Thus, the following case-sensitive, prefix expression scans a limited number of index entries:

db.collection.find( {"username": {"$regex": "^John"}} )

2) Storing a lowercase version of a field

In some cases, regexes can be avoided by adding a lowercase version of the field to each document and performing an equivalency check on that field. For example:

This query returns “John.Smith@email.com”, “john.smith@email.com”, “JohN.SmitH@EmaiL.CoM” and all other letter cases.

How to paginate efficiently

The skip() and limit() methods are often used together to support pagination. However, per this explanation of skip behavior, MongoDB examines more and more index keys or documents, as the skip value increases.

Even if indexed, the following query scans 100 index keys before returning the “eleventh page” of results:

Improve the performance of this operation by querying on a field that is appropriate for the list being rendered. This strategy is very useful when an app provides a “Next Page” option but does not allow users to skip to arbitrary pages.

The query above can then be amended to use the $gt operator to find documents from any point within a collection without the use of skip().

To retrieve the first page of results, use the orderby clause but omit the filter on “createdDate”. For each subsequent page, use the last value of the current page in a $gt (if ascending sort order) or $lt (if descending sort order) filter.

Index management

Viewing existing indexes

To view existing indexes for a particular collection, follow these steps:

From your account’s Home page, navigate to the deployment with the database whose indexes you want to view.

If necessary, navigate to the desired database after navigating to the deployment.

Navigate to a collection to view its indexes.

Click the “Indexes” tab to display all indexes created for the collection.

Alternately, you can use the db.collection.getIndexes() method via the mongo shell to view all indexes for a particular collection.

Creating a new index

Best practices when creating indexes

The process of creating an index can require a lot of system resources and can be long-running (hours if not days).

If uptime and performance are important for your application:

Build the index in the background (our management portal does this by default; see next section).

Build the index during a low traffic time.

Avoid simultaneous index builds by waiting for one to complete before starting another.

Index builds take significantly longer when there is insufficient RAM. In particular, if the index you’re building is close to or exceeds the RAM available, it could be impossible to build the index within a reasonable amount of time.

Background vs. foreground index builds

Always consider whether you want to build your indexes in the background or foreground. While building them in the foreground is faster, a foreground build blocks all other operations to the database. On a production system, you almost always want to build your indexes in the background to minimally impact your live database.

Creating an index through the mLab management portal

From your account’s Home page, navigate to the deployment with the database where an index will be added.

If necessary, navigate to the desired database after navigating to the deployment.

Navigate to the collection where the index(es) will be added.

Click the “Indexes” tab.

Click the “Add” button to open the “Add new index” window.

Add your index in the editor window.

Adjust the options and custom name if desired. Note that custom names are not required; it is usually best to let MongoDB auto-generate a name.

Click “Create in background” to build your index.

Alternately, you can use the db.collection.createIndex() method via the mongo shell to add an index to your collection. If you would like for your deployment to remain available during the index building operation, specify ({"background":true}). Indexes created through mLab’s management portal are always built in the background.

Killing a running index creation operation

It’s possible you may find yourself in the position where you want to kill an in-progress index build for one or both of the following reasons:

Accidentally created the wrong one or one that you no longer want

Performance impact is too great

In this case, you can kill the operation but be sure to proceed with caution. Killing a running index creation operation will often lead to replica sets with members having different sets of indexes as well as other unexpected issues.

Therefore, unless database performance has degraded to an emergency state, we strongly encourage you to allow the index build to complete and then optionally remove it once it’s done building.

Restarting the deployment to kill an index build won’t stop the index build and remove the index as you would expect - instead the process will start up again in an offline mode and build the index in the foreground as described in MongoDB’s documentation on interrupted index builds. This can lead to a loss of high availability or possibly downtime.

Identifying and removing unnecessary indexes

Overview

While it is very important that all queries are indexed, updating indexes slows the speed of inserts and updates. As such, unnecessary indexes may cause unnecessary use of CPU and RAM.

Managing indexes across environments

There are many ways to manage the process by which indexes are created and defined in your various environments and MongoDB deployments. Therefore you should carefully consider how you manage this process.

For databases that have gone into production, we recommend that you build your indexes in the background and at times that make the most sense for your app (e.g., during off-peak activity times). You can then maintain a separate script of createIndex calls to ensure these indexes are built in your development and staging environments as well.

If you decide to put createIndex calls in your code, they should be called when your application initializes. However, if you go this route you must be careful about how this fits with your architecture and your deployment processes or else you might end up with indexes being built at unplanned (and undesirable) times.

Performance analysis

Key performance tools

Not available for Sandbox databases

Use mLab Telemetry to view real-time and historical statistics of key MongoDB metrics.