How to Create MongoDB Joins Using SQL

It has become the common way of talking about querying data that we can generally understand.

It isn’t intended for document databases, and it isn’t always appropriate. It has, for example, great difficulty with embedded arrays and other non-relational devices.

However, a SQL query is quite often a good way to start working on a MongoDB query.

It gets you started, and it is much easier to criticize and alter something already drafted than to create it from scratch.

In this article, I’ll show you various types of SQL join queries that benefit from this sort of approach and how you can fine-tune the automatically-generated MongoDB queries for better performance, or to get the output closer to what you need.

We’ll demonstrate this approach with INNER JOIN, queries with several JOINS, the IN clause, GROUP BY, and finally LEFT OUTER JOINS.

Reservations to SQL and joins

Although I’ve more years of experience of SQL than I like to admit publicly, I am worried about using a relational language for a document database.

XML and JSON are uneasy fits into the relational model because there are different ways of joining document collections, particularly when there are embedded lists and objects.

There is no SQL equivalent, for example, to the request to search through a collection, to look through an array of objects within each document, and, if you get a match, return just the items of the list that match, and whatever other items you specify within the documents as part of the returned collection.

However, it is very convenient to use SQL occasionally to avoid the drudgery of keying in the basic aggregation, and then taking the resulting mongo shell code and altering it to your requirements.

It used to just interpret a very small number of simple queries, but it now does joins.

I generally use it to get started with an aggregate query that I can then transfer to the Aggregation Editor to fine-tune.

In this article, just to provide a SQL playground for MongoDB, I’ll be using an old Sybase relational database from the eighties, called Pubs.

Because it originally had very little data in it, it isn’t much used now but a large number of queries have been written using it. I’ve not used the original data, but instead created a much larger volume of spoof data and added a few things to exercise a database.

There are more titles than there are publishers, so if each publisher has, say, five titles, then the publishers table will have five times as many documents as the titles collection.

Doing a lookup for every title is five times as many lookups as doing a lookup for every publisher. What we’ve done suggests that, when using this utility, it pays to put the smaller collection first.

In most relational databases, the order in which you specify the tables will make no difference because the query optimiser will work out the best way of doing it.

However, the optimiser has the advantage that it has a lot of extra information that it can use to make choices: any SQL interpreter can only translate literally from SQL to Javascript. We will need an index.

In the relational original of this database, the pub_id field was a primary key for the publishers and the foreign key for the titles. All MongoDB collections have an_id key that corresponds to a relational key. Unfortunately, our transfer of the PUBS database didn’t assign the primary key to the _id.

MongoDB aggregations generally can use just one index, but MongoDB can use a separate index under the covers for the $lookup stage. We can, and should, index the foreign key in a lookup, but we would do a lot better by drastically reducing the amount of data being accessed, and supporting that operation with an index.

When looking at the query, it is obvious that the best strategy is to locate the document in the publishers collection for the ‘Cavendish Academic Trust’ and get the pub_id value to get the titles that have that same value.

To help achieve that, we put the pub_name and pub_id in the index, in that order. This means that MongoDB can get the pub_name out of the index and altogether avoid pulling the publisher collection into memory.

We run the query again and we can see from the profiler that it refuses to use the index. (see ‘How to Investigate MongoDB Query Performance’ on how to use the profiler to do this). This is because we still have work to do.

I’ve highlighted the match stage and I’m moving it up the sequence by clicking the up-arrow.

It should be the first thing that mongo does when executing the aggregate query, so we can move it with the ordering arrows.

Click, click, click. Now we need to edit this stage that we’ve just moved.

{
"publishers.pub_name" : "Cavendish Academic Trust"
}

Needs to be …

{
"pub_name" : "Cavendish Academic Trust"
}

…because the publishers object is now created later in the $project stage.

So we now do that by clicking on the tab where you see the mouse-pointer and editing the stage.

We now run it from the aggregation window and check the profile to see if it now uses an index scan (IXSCAN) rather than a collection scan (COLLSCAN). Yes, the query is now lightning fast, and uses our index. From the system.profile:

"millis" : 0,
"planSummary" : "IXSCAN { pub_name: 1, pub_id: 1 "},

For the result, I would like something nearer what a relational database would do: provide a list of documents providing a title. In real life, we’d probably want more than the title. I’d rewrite stage 5 to add the fields we want in the result with an alias (BookName in this example) and add a new projection to take out the two objects containing the joins.

Now this may be enough for you. It is pretty fast, but I’m not easily satisfied.

Unless you are desperate for the last smidgen of performance, the existing indexes will do.

We can avoid a COLLSCAN by moving the $match stage up as before, and iron out those “authors” and “titles” objects since we don’t need them. (If we find a title with more than one author, we will need an embedded array.)

A quick check with the profiler shows that we are examining fewer documents and using the index.

The overall time hasn’t changed much because of all the unindexed joins going on – we can, and should, index a lookup with indexes on the foreign keys au_idand title_id. However, we have already tweaked the result to give us a more readable output.

So you can see that the SQL Query feature can get you started, but there is generally more work to be done to get the query exactly as you want it.

Using SQL Query to get a head start

Even when the SQL window can’t give us the entire query in the form we want, it can give us a really good start, and it is far quicker for someone with SQL skills than hand-cutting mongo shell aggregate queries.

Imagine we want a list of the number of titles published by year ordered by year.

We can’t group by year from the SQL window, but we can tap in a rough query that we can tidy up to our requirement

Yes, it isn’t what we want, because each title was published at a different date so the $count value will always be 1.

We put that mongo shell code in the Aggregation Editor as I’ve already demonstrated, and fix things, testing as we go, to produce this.

There are two facilities in Studio 3T that turn it into a pleasure: Firstly, using the SQL Query feature to rough out a working query via GROUP BY and JOIN, and secondly, fine tuning the resulting mongo shell query in a controlled way using the Aggregation Editor, where you can test every stage independently and change the order for best performance.

That way, you can get aggregations that perform well, and do it quickly.

Downloads

Download the collections used in this article to create your own Pubs database and try out the example queries.

Was this article helpful?

About The Author

Phil Factor

Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.Visit https://studio3t.com/knowledge-base/author/phil_factor/ for the complete list of Phil Factor's MongoDB tutorials.