The growing popularity of MongoDB means more and more people are thinking about data in ways divergent from traditional relational models. For this reason alone, it’s exciting to experiment with new ways of modelling data. However, with additional flexibility comes the need to properly analyze the performance impact of data model decisions.

Embedding arrays in documents is a great example of this. MongoDB’s versatile array operators ($push/$pull, $addToSet, $elemMatch, etc.) offer the ability to manage data sets within documents. However, one must be careful. Data models that call for very large arrays, or arrays with high rates of modification, can often lead to performance problems.

That’s because at high levels of scale, large arrays require relatively high CPU-overhead that leads to longer insert/update/query times than desired.

Let’s use car dealerships as an example and discuss why an array of cars in a dealer document isn’t necessarily the ideal data model.

Querying Cars

One of the advantages of MongoDB is its rich query language, which supports accessing documents by the contents of an array. If we want to locate cars by make and model using a query like {“make”: MAKE, “model”: MODEL}, we need a specific feature of MongoDB’s query language, $elemMatch. This operator, while optimizable by indexes, still needs to traverse the entire “cars” array of every eligible dealer document in order to execute the query. If we query documents with “cars” arrays that contain thousands of car entries, we are basically doing mini collection-scans. As a result, we’ll notice high CPU utilization and slow query execution.

What about more complex computation? The Aggregation framework — with the $unwind operator — could support many of the queries we’d like to perform, but indexes may not be used to their full effectiveness.

Finally, dealer documents can get very large with this data model. Although projections and atomic update operators cut down the size of the documents transmitted over the wire, there may still be scenarios where the system is hauling more baggage around than it should.

Adding and updating Cars

Adding and modifying car entries can require a scan of much or all of each array being updated, resulting in slow operations. For example, $addToSet, is a way of adding new elements to arrays that requires the database to scan through every array item to make sure the new element does not already exist. $pull can be similarly inefficient.

Furthermore, sometimes modifications to the “cars” array can grow a dealer document in size such that it must be moved in memory. These moves can be very expensive, particularly when the collection is heavily indexed as each index bucket that points to the document being relocated must also be updated to point to its new memory location.

Can we fix it?

Sometimes a data model that entails very large arrays can be reformulated into a data model that is much more efficient. In our case we could alternatively model dealers and cars like this:

In this data model we avoid the excessively large arrays and, with the right indexes, perform efficient queries on dealerships with even the largest of inventories. By keeping cars in their own collection, Eric’s Mongo Cars is ready to move inventory with crazy low prices, without fear that our volume is going to bring down the system for Eddie’s Junker Shack down the road. We love those guys.

Conclusion

Storing information in document arrays is an exciting capability available in MongoDB, but we want to avoid it under the following conditions:

The arrays can get very large, even if only in some documents

Individual elements must be regularly queried and computed on

Elements are added and removed often

None of the pitfalls described above are deal-breakers in and of themselves. It’s just that when summed together, the total overhead can become noticeable. So, be wary. In these high-volume cases, it is appropriate for us to use collections, not arrays, to store data. When we store such data using collections,

Regular computation is performed using simple, efficient methods

Adding and removing elements are simple insert/remove operations

Each element is accessible using simpler queries that can be effectively indexed to scale well

There’s a lot more detail under the hood, but if you’d like to discuss it, we’ll have to get there in the comments section below.

I attended a meetup with Trisha Gee and I had exactly this concern, and her reply was as your article mentions, to move to a more document model based approach.

This resolved the issues I was facing along with my concerns about array growth. It’s also changed the way I thought about incorporating MongoDB for future models.

Tom

Nice, but it’s a shame we need to go back to relational fundamentals to be performant

http://mongolab.com/ Eric Sedor

Thanks! It’s true that this feels similar to RDBMS optimizations. We feel an important point here is that MongoDB does not rely on enforcing predefined tabular schemas to sidestep the scalability problem of managing dynamic data on disk. It puts that flexibility and responsibility in the hands of the developer. To a developer, normalization of fast data is a valid performance tuning strategy in all computing applications. It is a superset of the methodologies that guide the RDBMS sphere, not a subset. It is also important, especially at high levels of scale, to work in harmony with the tool you’re using. For MongoDB, that means acknowledging that the primary access mode is ‘querying documents in collections.’

Geoffroy

Same as relational database….so what’s advantage of MongoDB!?

http://mongolab.com/ Eric Sedor

Hi!

I think the right way to think about it is: the data structures in
MongoDB are a superset of those you can express in a relational
database. So just because MongoDB lets you create “rows” that are
complex objects with nested arrays and sub-objects, it doesn’t eliminate
all relational concepts (like normalization via references).

http://twitter.com/jchlu/ jchlu

Nice to see a rational write-up suggesting DB normalisation for performance tuning is good practise, and that good practise is good practise – no matter what the underlying technology.

hpavc

This is great, however it would be nice to have both worlds, a symbolic reference to the collection Cars within the original unnamed document at times would be wonderful.

Gene Vilain

Isn’t one of the trade-offs the cost of doing joins when you need to see dealer and car information together? Any recommendations or best practices for measuring the cost of both in order to assess the trade-off? Great stuff! thanks!

Then it would be hard to find a DeLorean from a dealer with more than 100M revenue with the altered structure.

Also I guess this are some cases where one database call on a medium size array would be faster than two database calls when we want information for dealer and car.

Would be nice to know the boundary.

http://mongolab.com/ Eric Sedor

It’s a great question, Ryan. The normalization optimization presented in this blog is designed to address fast-moving “child” data while also serving heavy user-traffic. For serving such traffic while performing hardcore data-mining of arbitrary relationships like dealer revenue and car model, two queries may be better than one. In short, bring the information to the app-side and perform computation there, rather than relying on the database to perform computation for you.

The boundary of the tradeoff may not be easy to determine in advance, but is a function of total dealer count, average dealer inventory count, and rate of inventory change. Because the challenge of the outlying dealer with the largest or fastest-moving inventory remains, RANGE of dealer inventory count is also a factor.

An alternate strategy, if the desired data-relationships are known, is to annotate each car document with the necessary information about the dealership. So a rough car document for this example might look like:

{ vin: 123, model: “DeLorean”, dealer: { id: 1, revenue: “100M”}}

This would require two updates to modify dealer revenue (one to the dealer document and one to all cars matching dealer.id). In this example, that would probably be a semi-regular bulk/batch process, not a fast-moving real-time one. All updates to cars would still benefit from the efficiency of having their own collection (and the app could have the dealer document in memory to assist with all car document construction). Then, data-mining based on car model and dealership information could be performed with a simple query to an index like { model: 1, dealer.revenue: 1}.

Again, this annotation strategy is most useful if the relationships are known in advance, the parent information is slow-moving, and the goal is ideal performance for data-mining the known relationship. In terms of tradeoff, the cost is spent in disk space rather than query-time.

Thanks for the great write up! We’re happy MongoLab customers ourselves. Just curious, when you say “very large arrays”, what order of magnitude are we talking about? 1,000? 100,000? 1M+

Thanks!

http://mongolab.com/ Eric Sedor

Really good question, thank you! I definitely warn against 100k or higher, especially if array elements are coming and going. In general, begin being concerned in the 10,000s range. That said, your exact mileage may vary. For example, with a high $push/$pull/$addToSet load on an array of subdocuments–with multiple indexes on subdocument fields–could be problematic as early as the 1,000s. Please feel free to email support@mongolab.com if you have questions about specific operations on your MongoLab cluster(s)!

http://www.vp-zalec.net/ Jernej Jerin

What about if you are storing in the fields large chunks of text. Would not there be quite a large overhead when transmitting data back and forth even when having just a couple of hundred subdocuments? So this magnitude probably varies with regard to size of each subdocument in array, right?

http://mongolab.com/ Eric Sedor

One thing that dramatically helps a case in which network bandwidth is the concern, is that having “elements” in their own collection allows you to selectively query for the elements that you want, rather than querying for a document that contains those elements and asking the DB to extract the value with a query projection and the $slice operator, which may not even be applicable for some array use-cases.

Because projection with $slice is both a data copy AND requires some degree of iterating through the array, it is exactly the sort of operation we discuss–one that offers unpredictable performance depending on array length and size.

The sheer number of bytes devoted to the array incurs a cost during potentially any move or rewrite associated with any write operation, even if the operation doesn’t target the array itself.

By contrast, the number of elements in the array more directly impacts operations like $pull, $addToSet, and $elemMatch, which require iteration through the array.

For cases in which moving sub-elements to their own collection does NOT allow for more selective querying of those elements, queries will still require more or less the same network bandwidth because the same data has to be transmitted.

So a query to get dealerid, then another query using the id. What happens if you need to get a hundred ids and use them in another query? That would be a huge ‘match’ list.

http://mongolab.com/ Eric Sedor

This is the challenge in dealing with the relationship resolution component at high points of scale. In addition to large query bodies, resolving sets of ids incurs less efficient index traversal, and large result sets to boot.

For this reason, the slower-moving components of the data model, dealers, might well be cached in memory by the app, so reversing a multi-dealer extrapolation from a set of cars doesn’t involve the db directly.

Still, tuning large, multi-dealer operations on cars does become critical; that topic is explored in this blog.

https://www.linkedin.com/in/ianweisberger Ian Weisberger

Forgive me if I’m wrong, but this seems incorrect. Doing “joins” in mongodb is obscenely inefficient because of a lack of in-database join capabilities. So the application would have to do the “joins”. Why not just index the subdocument array?

Keep the first schema, and run db.cars.ensureIndex({‘cars.make':1})

http://mongolab.com/ Eric Sedor

Even including the cost of $elemMatch operations for more sophisticated car queries, the first strategy can be performant with the index you suggest, so you’re not wrong. Even fast-moving child relationships can be modeled this way at low points of scale.

The idea here is that as the speed at which dealers move inventory increases, so does the cost of frequent $push/$pull/$addToSet operations to change car inventories. The Power of 2 Sizes option mitigates that cost for an additional range of use-cases, but regularly rewriting arrays can still add up depending on how heavily-indexed they are.

The second model is intended for maximum query versatility (including aggregation pipelines that won’t require $unwinding into unmapped memory) at high scale. In this environment, the cost of two distinct queries to resolve the relationship is more easily paid for full query features and faster writes.

This even buys the ability to more readily embed and manipulate arrays in car documents, should it be necessary.

Designing your schema with MongoDB really doesn’t have to be a binary decision: (a) store it normalized or (b) store it in an array decision like that presented here. Depending on how the data is accessed you might also decide to perform any number of different partial denormalizations. For example, if your car dealership view needs a list of manufacturers they have in stock you might have that as an array on the dealership record. You would update it each time their inventory changes, or periodically (eventual consistency), or both. You might also store a count by type on the dealer record. Or you might keep the 10 most recent additions in an array so that the initial page load for a dealer can be accomplished in one round trip to the database, while requests to ‘show more’ are met by querying the full car collection. What you will need really depends on the query patterns not on some abstract storage model. The one caveat is that you must be clear in your code / design documents as to what’s ground-truth data and what’s computed or cached data so other developers aren’t confused.

Akash Gupta

How would a query run on such a structure? JOINS are not supported by MongoDB. I am just starting out with MongoDB and the word “denormalization” used in it’s context sends shivers down my spine.

Also, by linking the previous embedded sub-doc, don’t you think the total read time might get worse?

Randy

Good analysis. I was running into the same problem trying to keep thousands of recordset under one entity. It’s a tradeoff worth taking for massive data

jonnyonion

what about storing only the ids of the cars in the dealers collection as an array. And then use this field in combination with $in operator in cars.

Like: hey that’s me, and these are the keys to my cars ;)

In your case you need an additional index in cars “dealership”, and each index means RAM.

Antonio de Perio

Mate, I have this exact same modelling problem right now. This article was exactly what I needed. Thanks!

mason

Hi. Great article, thanks. We recently came across just this issue at our company and I am tasked with refactoring the data model. I am curious how you would go about maintaining the ordering of the cars objects in a separate collection as they would be in the embedded array?

About

This blog is brought to you by MongoLab, the fully-managed MongoDB Database-as-a-Service (DBaaS) platform that automates the operational aspects of running MongoDB in the cloud.