Schema Design in MongoDB vs Schema Design in MySQL

For people used to relational databases and doing MySQL database design, using NoSQL solutions such as MongoDB brings interesting challenges. One of them is schema design: while in the relational world, normalization is a good way to start, how should we design our collections when creating a new MongoDB application?

Let’s see with a simple example how we would create a data structure for MySQL (or any relational database) and for MongoDB. We will assume in this post that we want to store people information (their name) and the details from their passport (country and validity date).

Relational Design

In the relational world, the basic idea is to try to stick to the 3rd normal form and create two tables (I’ll omit indexes and foreign keys for clarity – MongoDB supports indexes but not foreign keys):

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

mysql>select*from people;

+----+------------+

|id|name|

+----+------------+

|1|Stephane|

|2|John|

|3|Michael|

|4|Cinderella|

+----+------------+

mysql>select*from passports;

+----+-----------+---------+-------------+

|id|people_id|country|valid_until|

+----+-----------+---------+-------------+

|4|1|FR|2020-01-01|

|5|2|US|2020-01-01|

|6|3|RU|2020-01-01|

+----+-----------+---------+-------------+

One of the good things with such a design is that it’s equally easy to run any query (as long as we don’t consider joins as something difficult to use):

SELECT name FROM people pl LEFT JOINpassports psON ps.people_id=pl.idWHERE ps.idISNULL

etc

MongoDB design

Now how should we design our collections in MongoDB to make querying easy?

Using the 3rd normal form is of course possible, but that would probably be inefficient as all joins should be done in the application. So out of the 3 queries above, only the query #1 could be easily run. So which other designs could we have?

A first option would be to store everything in the same collection:

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

>db.people_all.find().pretty()

{

"_id":ObjectId("51f7be1cd6189a56c399d3bf"),

"name":"Stephane",

"country":"FR",

"valid_until":ISODate("2019-12-31T23:00:00Z")

}

{

"_id":ObjectId("51f7be3fd6189a56c399d3c0"),

"name":"John",

"country":"US",

"valid_until":ISODate("2019-12-31T23:00:00Z")

}

{

"_id":ObjectId("51f7be4dd6189a56c399d3c1"),

"name":"Michael",

"country":"RU",

"valid_until":ISODate("2019-12-31T23:00:00Z")

}

{"_id":ObjectId("51f7be5cd6189a56c399d3c2"),"name":"Cinderella"}

By the way, we can see here that MongoDB is schemaless: there is no problem in storing documents that do not have the same structure.

The drawback is that it is no longer clear which attributes belong to the passport, so if you want to get all passport information for Michael, you will need to correctly understand the whole data structure.

A second option would be to embed passport information inside people information – MongoDB supports rich documents:

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

>db.people_embed.find().pretty()

{

"_id":ObjectId("51f7c0048ded44d5ebb83774"),

"name":"Stephane",

"passport":{

"country":"FR",

"valid_until":ISODate("2019-12-31T23:00:00Z")

}

}

{

"_id":ObjectId("51f7c70e8ded44d5ebb83775"),

"name":"John",

"passport":{

"country":"US",

"valid_until":ISODate("2019-12-31T23:00:00Z")

}

}

{

"_id":ObjectId("51f7c71b8ded44d5ebb83776"),

"name":"Michael",

"passport":{

"country":"RU",

"valid_until":ISODate("2019-12-31T23:00:00Z")

}

}

{"_id":ObjectId("51f7c7258ded44d5ebb83777"),"name":"Cinderella"}

Or we could embed the other way (however this looks a bit dubious as some people may not have a passport like Cinderella in our example):

Shell

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

>db.passports_embed.find().pretty()

{

"_id":ObjectId("51f7c7e58ded44d5ebb8377b"),

"country":"FR",

"valid_until":ISODate("2019-12-31T23:00:00Z"),

"person":{

"name":"Stephane"

}

}

{

"_id":ObjectId("51f7c7ec8ded44d5ebb8377c"),

"country":"US",

"valid_until":ISODate("2019-12-31T23:00:00Z"),

"person":{

"name":"John"

}

}

{

"_id":ObjectId("51f7c7fa8ded44d5ebb8377d"),

"country":"RU",

"valid_until":ISODate("2019-12-31T23:00:00Z"),

"person":{

"name":"Michael"

}

}

{

"_id":ObjectId("51f7c8058ded44d5ebb8377e"),

"person":{

"name":"Cinderella"

}

}

That’s a lot of options! How can we choose? Here is where you should be aware of a fundamental difference between MongoDB and relational databases when it comes to schema design:

Collections inside MongoDB should be designed with the most frequent access patterns of the application in mind, while in the relational world, you can forget how data will be accessed if your tables are normalized.

So…

If you read people information 99% of the time, having 2 separate collections can be a good solution: it avoids keeping in memory data is almost never used (passport information) and when you need to have all information for a given person, it may be acceptable to do the join in the application.

Same thing if you want to display the name of people on one screen and the passport information on another screen.

But if you want to display all information for a given person, storing everything in the same collection (with embedding or with a flat structure) is likely to be the best solution.

Conclusion

We saw in this post one of the fundamental differences between MySQL database design and MongoDB database design when it comes to creating the right data structure for an application: with MongoDB, you need to know the data access pattern of the application. This should not be neglected as creating a wrong schema design is a recipe for disaster: queries will be difficult to write and to optimize, they will be slow and they will sometimes need to be replaced by custom code. All that can lead to low performance and frustration.

The next question is: which way is better? And of course, there is no definite answer: MongoDB fans will say that by making all access patterns equal, normalization make them equally bad, and normalization fans will say that a normalized schema provides good performance for most applications and that you can always denormalize to help a few queries run faster.

Related

Stéphane joined Percona in July 2012, after working as a MySQL DBA for leading French companies such as Dailymotion and France Telecom.
In real life, he lives in Paris with his wife and their twin daughters. When not in front of a computer or not spending time with his family, he likes playing chess and hiking.

8 Comments

Your last two paragraphs minimize the design effect of the structural differences between a relational DBMS like MySQL and MongoDB. In MySQL with InnoDB data are clustered by the primary key, whereas in MongoDB they are clustered around full BSON documents. This is a more complex form of clustering and if you get it wrong at any point you can have significant performance problems that require app rewrites because in Mongo your object model determines your storage access patterns far more strongly than is the case in MySQL.

Also, MongoDB currently limits the scope of transactions to a single document. Making updates across multiple documents atomic and isolated is an application issue. This is *far* harder to get right than joins and can lead to effectively unsolvable bugs in applications. (I imagine this will eventually be corrected but probably not quickly as it’s a hard problem within MongoDB too.)

The real design issue with a DBMS like MongoDB is that you have to correctly predict both query patterns as well as the transactional update model relatively early on. This tends to be possible only for cases where a single application with a relatively simple object model is accessing the DBMS. So one effect of using MongoDB is that you need to partition application services to meet these conditions or face the possibility of expensive rewrites that ripple far into the application logic. In other words, MongoDB forces you to structure application service architecture around data. That is not necessarily a bad thing but it’s a significant design assumption and deserves to be highlighted when making the DBMS choice.

@Robert: you’re right, unlike relational DBMS, MongoDB forces you to predict your access patterns first, and this is by design. That can be hard for new applications, but if you simply want to migrate part of an existing application to MongoDB, that may be much easier. Anyway this is something you should always have in mind.

As for the transactions that are limited to a single document, it is something you should be aware of from the beginning as well. But it’s not really different than MyISAM (it’s even a bit better) and we still see people running successfully with MyISAM.

For more indepth information on comparison between mysql and nodejs in terms of querying, schema design, data representation, relationships, transaction and performance, read on Comparison between MySQL and MongoDB.

Hmm it looks like your site ate my first comment (it was extremely long) so I guess I’ll just sum it up what I submitted and
say, I’m thoroughly enjoying your blog. I as well am an aspiring blog
writer but I’m still new to everything. Do you have any points for novice blog writers?
I’d genuinely appreciate it.