Note: An updated version of this post can be found HERE. I recommend you follow this link to read the updated version.

One of the key tenants of MongoDB schema design is to account for the absence of server-side joins. Data is joined all the time inside of application code of course, but traditionally there’s been no way to perform joins within the server itself.

This changed in 3.2 with the introduction of the $lookup operator within the aggregation framework. $lookup performs the equivalent of a left outer join – eg: it retrieves matching data from another document and returns null data if no match is found.

Here’s an example using the MongoDB version of the Sakila dataset that I converted from MySQL back in this post:

1: db.films.aggregate([

2: {$match:{"Actors.First name" : "CHRISTIAN",

3:"Actors.Last name" : "GABLE"}},

4: {$lookup: {

5: from: "customers",

6:as: "customerData",

7: localField: "_id",

8: foreignField: "Rentals.filmId"

9: }},

10: {$unwind: "$customerData"},

11: {$project:{"Title":1,

12:"FirstName":"$customerData.First Name",

13:"LastName" :"$customerData.Last Name"}},

14: ])

What we’re doing here is finding all customers who have ever hired a film staring “Christian Gable”; We start by finding those films in the films collection (lines 2-3), then use $lookup to retrieve customer data (lines 4-9). Films embeds actors in the “Actors” array; the customers collection embeds films that have been hired in the "Rentals" array.

The result of the join contains all the customers who have borrowed the movie returned as an array, so we use the $unwind operator to “flatten” them out (line 10). The resulting output looks like this:

One thing that we need to be careful here is with join performance. The $lookup function is going to be executed once for each document returned by our $match condition. There is - AFAIK - no equivalent of a hash or sort merge join operation possible here, so we need to make sure that we've used an index. Unfortunately, the explain() command doesn’t help us. It tells us only if we have used an index to perform the initial $match, but doesn't show us if we used an index within the $lookup.

Here's the explain output from the operation above (TL;DR):

1: > db.films.explain().aggregate([

2: ... {$match:{"Actors.First name" : "CHRISTIAN",

3: ... "Actors.Last name" : "GABLE"}},

4: ... {$lookup: {

5: ... from: "customers",

6: ... as: "customerData",

7: ... localField: "_id",

8: ... foreignField: "Rentals.filmId"

9: ... }},

10: ... {$unwind: "$customerData"},

11: ... {$project:{"Title":1,

12: ... "FirstName":"$customerData.First Name",

13: ... "LastName" :"$customerData.Last Name"}},

14: ...

15: ... ])

16: {

17:"waitedMS" : NumberLong(0),

18:"stages" : [

19: {

20:"$cursor" : {

21:"query" : {

22:"Actors.First name" : "CHRISTIAN",

23:"Actors.Last name" : "GABLE"

24: },

25:"fields" : {

26:"Title" : 1,

27:"customerData.First Name" : 1,

28:"customerData.Last Name" : 1,

29:"_id" : 1

30: },

31:"queryPlanner" : {

32:"plannerVersion" : 1,

33:"namespace" : "sakila.films",

34:"indexFilterSet" : false,

35:"parsedQuery" : {

36:"$and" : [

37: {

38:"Actors.First name" : {

39:"$eq" : "CHRISTIAN"

40: }

41: },

42: {

43:"Actors.Last name" : {

44:"$eq" : "GABLE"

45: }

46: }

47: ]

48: },

49:"winningPlan" : {

50:"stage" : "COLLSCAN",

51:"filter" : {

52:"$and" : [

53: {

54:"Actors.First name" : {

55:"$eq" : "CHRISTIAN"

56: }

57: },

58: {

59:"Actors.Last name" : {

60:"$eq" : "GABLE"

61: }

62: }

63: ]

64: },

65:"direction" : "forward"

66: },

67:"rejectedPlans" : [ ]

68: }

69: }

70: },

71: {

72:"$lookup" : {

73:"from" : "customers",

74:"as" : "customerData",

75:"localField" : "_id",

76:"foreignField" : "Rentals.filmId",

77:"unwinding" : {

78:"preserveNullAndEmptyArrays" : false

79: }

80: }

81: },

82: {

83:"$project" : {

84:"Title" : true,

85:"FirstName" : "$customerData.First Name",

86:"LastName" : "$customerData.Last Name"

87: }

88: }

89: ],

90:"ok" : 1

91: }

However, we can see the queries created by the $lookup function if we enable profiling. For instance if we turn profiling on can see a full collection scan of customers has have been generated for every film document that has been joined:

These “nested” collection scans are bad news. Below is the results of a benchmark in which I joined two collections using $lookup with and without an index. As you can see, the unindexed $lookup degrades steeply as the number of rows to be joined increases. The solution is obvious:

Always create an index on the foreignField attributes in a $lookup, unless the collections are of trivial size.

The MongoDB company is putting a lot of new features into the aggregation framework: they clearly intend to create a very powerful and flexible capability that matches and maybe even exceeds what can be done with SQL. Indeed, the aggregation framework seems poised to become a dataflow language similar to Pig. Anyone wanting to do any serious work in MongoDB should make sure they are very comfortable with aggregate. If you use $lookup to perform joins in aggregate, make sure there is an index on the ForiegnField attribute.

Reader Comments (4)

Very difficult to understand the context here because no one ever "hires a film." In the real-world, what does that mean? In order to understand mongo's attempt to improve relational aggregation in this example, we first need to understand the real-world relationship between a film and hiring... so what is that relationship? Because it's not obvious, because "hiring a film" is not a phrase people ever use.

Well, back in the "old days" there were stored that contained DVD and VCR movies inside little plastic packages. People would "hire" or "rent" these movies for a few days and watch them at home. Sort of like netflix but with physical media. The Sakila schema was created for MySQL based on the idea of a video rental store like this.

So admittedly the sample data is a bit archaic, though the concept might easily be applied to car rentals or any other business that rents things out.

However, the key point here is not the business use case, but rather the peformance of the $lookup operator. Regardless of WHY you use the operator, the performance is going to be relatively poor if the lookup is unindexed.