Entries in MongoDB

Graph databases such as Neo4j specialize in traversing graphs of relationships – such as those you might find in a social network. Many non-graph databases have been incorporating graph compute engines to perform similar tasks. In the MongoDB 3.3 release, we now have the ability to perform simple graph traversal using the $graphLookup aggregation framework function. This will become a production feature in the 3.4 release.

The new feature is documented in MongoDB Jira SERVER-23725. The basic syntax is shown here:

{$graphLookup: {
    from: <name of collection to look up into>,
    startWith: <expression>,
    connectFromField: <name of field in documents from "from">,
    connectToField: <name of field in documents from "from">,
    as: <name of field in output documents>,
    maxDepth: <optional - non-negative integer>,
    depthField: <optional - name of field in output documents>
}}

I started playing with this capability using the POKEC dataset, which represents data from a real social network in Slovakia. The relationship file soc-pokec-relationships.txt.gz contains the social network for about 1.2 million people. I loaded it into MongoDB using this Perl script.

We can expand the social network for a single person using a pipeline like this:

db.socialGraph.aggregate([
    {
        $match: {person: 1476767}
    },
    {
        $graphLookup: {
            from: "socialGraph",
            startWith: [1476767],
            connectFromField: "friends",
            connectToField: "person",
            as: "socialNetwork",
            maxDepth: 2,
            depthField: "depth"
        }
    },
    {
        $project: {_id: 0, name: 1,
                   "Network": "$socialNetwork.name",
                   "Depth": "$socialNetwork.depth"}
    },
    {$unwind: "$Network"}
])

What we are doing here is starting with person 1476767, then following the elements of the friends array out to two levels – i.e. to “friends of friends”.
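The traversal semantics are easy to see in plain JavaScript. Below is a minimal in-memory sketch of what $graphLookup does – not how the server implements it – assuming socialGraph documents shaped like {person: <id>, friends: [<id>, ...]} (that document shape is my assumption, inferred from the pipeline above):

```javascript
// Breadth-first expansion: startWith values are matched against
// connectToField, then each matched document's connectFromField values
// become the next frontier, up to maxDepth hops. The added "depth"
// property plays the role of depthField.
function graphLookup(docs, startWith, connectFromField, connectToField, maxDepth) {
  const results = [];
  const visited = new Set();                 // don't revisit a person
  let frontier = [...startWith];             // values matched against connectToField
  for (let depth = 0; depth <= maxDepth; depth++) {
    const nextFrontier = [];
    for (const doc of docs) {
      const key = doc[connectToField];
      if (frontier.includes(key) && !visited.has(key)) {
        visited.add(key);
        results.push({ ...doc, depth });     // record the hop count
        nextFrontier.push(...doc[connectFromField]);
      }
    }
    frontier = nextFrontier;
  }
  return results;
}
```

With maxDepth 1, a start person plus their direct friends come back; each extra level of maxDepth pulls in one more ring of the network.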

Increasing the maxDepth exponentially increases the amount of data we have to cope with. This is the notorious “six degrees of separation” effect – most people in a social network are linked by six or seven hops, so once we get past that we are effectively traversing the entire set. Unfortunately, this meant that traversing more than three levels deep caused me to run out of memory:

assert: command failed: {
    "ok" : 0,
    "errmsg" : "$graphLookup reached maximum memory consumption",
    "code" : 40099
} : aggregate failed

The graph lookup can consume at most 100MB of memory, and currently doesn't spill to disk, even if the allowDiskUse: true clause is specified within the aggregation arguments. SERVER-23980 is open to correct this, but it doesn't appear to have been scheduled yet.
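Some rough arithmetic shows why memory blows up so quickly as the depth grows (the branching factor of 20 friends per person is purely an assumption for illustration):

```javascript
// Candidate documents at each depth grow roughly as branchingFactor^depth,
// until the traversal saturates the whole network.
const avgFriends = 20;                 // assumed average friend count
const candidates = [];
for (let depth = 1; depth <= 4; depth++) {
  candidates.push(Math.pow(avgFriends, depth));
}
// depth 1: 20, depth 2: 400, depth 3: 8000, depth 4: 160000
```

With a fixed 100MB budget and no spill to disk, it only takes a few levels before a traversal of a million-person network hits the ceiling.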

So I tried building a “flatter” network so that I wouldn’t run out of memory. This JavaScript builds the network and this JavaScript runs some performance tests. I tried expanding the network both with and without a supporting index on the connectToField (person, in this case). Here are the results (note the logarithmic scale):

For shallow networks, having an index on the connectToField makes an enormous difference. But as the depth increases, the index performance advantage decreases and eventually performance matches that of the unindexed case. In this example data the crossover just happens to fall at around seven degrees of separation, but it will clearly depend on the nature of the data.

The $graphLookup operator is a very powerful addition to the MongoDB aggregation framework and continues the trend of providing richer query capabilities within the server. Mastering the aggregation framework is clearly a high priority for anyone wanting to exploit the full power of MongoDB.

Note: An updated version of this post can be found HERE. I recommend you follow this link to read the updated version.

One of the key tenets of MongoDB schema design is to account for the absence of server-side joins. Data is joined all the time inside application code of course, but traditionally there’s been no way to perform joins within the server itself.

This changed in 3.2 with the introduction of the $lookup operator within the aggregation framework. $lookup performs the equivalent of a left outer join – i.e., it retrieves matching data from another collection and returns null data if no match is found.
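The left-outer-join semantics can be sketched in a few lines of plain JavaScript – an in-memory analogy, not the server's implementation. Note that the real operator matches a dotted path like "Rentals.filmId" into an array of subdocuments; this sketch simplifies the foreignField to a scalar or an array of scalars:

```javascript
// For each input document, collect every document in `from` whose
// foreignField value matches the localField value; a document with no
// match still comes through, with an empty array (the "outer" part).
function lookup(input, from, localField, foreignField, as) {
  return input.map(doc => ({
    ...doc,
    [as]: from.filter(f => {
      const v = f[foreignField];
      return Array.isArray(v) ? v.includes(doc[localField]) : v === doc[localField];
    }),
  }));
}
```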

Here’s an example using the MongoDB version of the Sakila dataset that I converted from MySQL back in this post:

 1: db.films.aggregate([
 2:   {$match: {"Actors.First name": "CHRISTIAN",
 3:             "Actors.Last name": "GABLE"}},
 4:   {$lookup: {
 5:        from: "customers",
 6:        as: "customerData",
 7:        localField: "_id",
 8:        foreignField: "Rentals.filmId"
 9:   }},
10:   {$unwind: "$customerData"},
11:   {$project: {"Title": 1,
12:        "FirstName": "$customerData.First Name",
13:        "LastName": "$customerData.Last Name"}},
14: ])

What we’re doing here is finding all customers who have ever hired a film starring “Christian Gable”. We start by finding those films in the films collection (lines 2-3), then use $lookup to retrieve customer data (lines 4-9). Films embeds actors in the “Actors” array; the customers collection embeds films that have been hired in the “Rentals” array.

The result of the join contains, for each film, all the customers who have borrowed it, returned as an array, so we use the $unwind operator to “flatten” them out (line 10). The resulting output looks like this:
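$unwind's flattening step can likewise be sketched in plain JavaScript (again an in-memory analogy only): one output document is emitted per element of the array field, with the array field replaced by that element.

```javascript
// Emit one copy of each document per element of its array field.
function unwind(docs, field) {
  return docs.flatMap(doc =>
    (doc[field] || []).map(v => ({ ...doc, [field]: v }))
  );
}
```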

One thing we need to be careful about here is join performance. The $lookup function is going to be executed once for each document returned by our $match condition. There is – as far as I know – no equivalent of a hash or sort-merge join operation possible here, so we need to make sure that we've used an index. Unfortunately, the explain() command doesn’t help us. It tells us only whether we have used an index to perform the initial $match, but doesn't show us whether we used an index within the $lookup.

Here's the explain output from the operation above (TL;DR):

> db.films.explain().aggregate([
... {$match:{"Actors.First name" : "CHRISTIAN",
...          "Actors.Last name" : "GABLE"}},
... {$lookup: {
...      from: "customers",
...      as: "customerData",
...      localField: "_id",
...      foreignField: "Rentals.filmId"
... }},
... {$unwind: "$customerData"},
... {$project:{"Title":1,
...      "FirstName":"$customerData.First Name",
...      "LastName" :"$customerData.Last Name"}},
... ])
{
    "waitedMS" : NumberLong(0),
    "stages" : [
        {
            "$cursor" : {
                "query" : {
                    "Actors.First name" : "CHRISTIAN",
                    "Actors.Last name" : "GABLE"
                },
                "fields" : {
                    "Title" : 1,
                    "customerData.First Name" : 1,
                    "customerData.Last Name" : 1,
                    "_id" : 1
                },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "sakila.films",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "$and" : [
                            { "Actors.First name" : { "$eq" : "CHRISTIAN" } },
                            { "Actors.Last name" : { "$eq" : "GABLE" } }
                        ]
                    },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                            "$and" : [
                                { "Actors.First name" : { "$eq" : "CHRISTIAN" } },
                                { "Actors.Last name" : { "$eq" : "GABLE" } }
                            ]
                        },
                        "direction" : "forward"
                    },
                    "rejectedPlans" : [ ]
                }
            }
        },
        {
            "$lookup" : {
                "from" : "customers",
                "as" : "customerData",
                "localField" : "_id",
                "foreignField" : "Rentals.filmId",
                "unwinding" : {
                    "preserveNullAndEmptyArrays" : false
                }
            }
        },
        {
            "$project" : {
                "Title" : true,
                "FirstName" : "$customerData.First Name",
                "LastName" : "$customerData.Last Name"
            }
        }
    ],
    "ok" : 1
}

However, we can see the queries created by the $lookup function if we enable profiling. For instance, with profiling turned on we can see that a full collection scan of customers has been generated for every film document that has been joined:

These “nested” collection scans are bad news. Below are the results of a benchmark in which I joined two collections using $lookup, with and without an index. As you can see, the unindexed $lookup degrades steeply as the number of documents to be joined increases. The solution is obvious:

Always create an index on the foreignField attributes in a $lookup, unless the collections are of trivial size.
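The difference an index makes can be sketched in plain JavaScript: the unindexed case is a scan of the `from` collection per input document, while an index is analogous to probing a prebuilt map. This is a sketch of the idea, not the server's actual implementation:

```javascript
// Build a map from each foreignField value to the documents containing it
// (the in-memory analogue of an index on the foreignField).
function buildIndex(from, foreignField) {
  const index = new Map();
  for (const doc of from) {
    const keys = Array.isArray(doc[foreignField]) ? doc[foreignField] : [doc[foreignField]];
    for (const k of keys) {
      if (!index.has(k)) index.set(k, []);
      index.get(k).push(doc);
    }
  }
  return index;
}

// With the index, each input document costs one map probe instead of a
// full scan of the `from` collection.
function indexedLookup(input, index, localField, as) {
  return input.map(doc => ({ ...doc, [as]: index.get(doc[localField]) || [] }));
}
```

Building the map is a one-off linear cost; the per-document probe is what turns the O(n×m) nested scan into something close to O(n+m).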

The MongoDB company is putting a lot of new features into the aggregation framework: they clearly intend to create a very powerful and flexible capability that matches and maybe even exceeds what can be done with SQL. Indeed, the aggregation framework seems poised to become a dataflow language similar to Pig. Anyone wanting to do any serious work in MongoDB should make sure they are very comfortable with aggregate. If you use $lookup to perform joins in aggregate, make sure there is an index on the foreignField attribute.

I wanted to do some experimenting with MongoDB, but I wasn’t really happy with any of the sample data I could find on the web. So I decided that I would translate the MySQL “Sakila” schema into MongoDB collections as part of the learning process.

For those that don’t know, Sakila is a MySQL sample schema that was published about 8 years ago. It’s based on a DVD rental system. OK, not the most modern data ever, but DVDs are still a thing, aren’t they?

You can get the MongoDB version of Sakila here. To load it, unpack using tar zxvf sakilia.tgz, then use mongoimport to load the resulting JSON documents. On Windows you should be able to double-click on the file to get to the JSON.

The Sakila database schema is shown below. There are 16 tables representing a fairly easy to understand inventory of films, staff, customers and stores.

When modelling MongoDB schemas, we partially set aside our relational modelling experience – “normalization” is not the desired end state. Instead of basing our decisions on the nature of the data, we base them on the nature of the operations. The biggest decision is which “entities” get embedded within documents, and which get linked. I’m not the best person to articulate these principles – the O’Reilly book “MongoDB Applied Design Patterns” does a pretty good job, and this presentation is also useful.

My first shot at mapping the data – which may prove to be flawed as I play with MongoDB queries – collapsed the 16 tables into just 3 documents: FILMS, STORES and CUSTOMERS. ACTORS became a nested document in FILMS, STAFF and INVENTORY were nested into STORES, while RENTALS and PAYMENTS nested into CUSTOMERS. Whether these nestings turn out to be good design decisions will depend somewhat on the application. Some operations are going to be awkward while others will be expedited.
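In rough terms, the resulting shapes look something like the sketch below. The field names here are my assumptions, pieced together from the queries elsewhere in this post; the real collections may differ in detail:

```javascript
// FILMS embeds its ACTORS as an array of subdocuments:
const film = {
  _id: 1,
  Title: "ACADEMY DINOSAUR",
  Actors: [
    { "First name": "PENELOPE", "Last name": "GUINESS" },
    { "First name": "CHRISTIAN", "Last name": "GABLE" },
  ],
};

// CUSTOMERS embeds RENTALS, which in turn embed PAYMENTS:
const customer = {
  _id: 101,
  "First Name": "MARY",
  "Last Name": "SMITH",
  Rentals: [
    { filmId: 1, Payments: [{ Amount: 2.99 }] },
  ],
};
```

Embedding this way makes “all rentals for a customer” a single-document read, at the cost of making queries that cut across the nesting (say, “all customers for a film”) require a $lookup or an unwind.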

Here’s a look at the FILMS collection:

Here is STORES:

And here is CUSTOMERS:

Looks like I have to fix some float rounding issues on customers.rentals.payments.amount.

The code that generates the schema is here. It’s pretty slow, mainly because of the very high number of lookups on rentals and payments. It would be better to bulk collect everything and scan through it, but that would make the code pretty ugly. If this were Oracle I’m pretty sure I could make it run faster, but with MySQL, SQL tuning is much harder.

The code is pretty straightforward. To insert a MongoDB document we get the DBCollection, then create BasicDBObjects which we insert into the DBCollection. To nest a document we create a BasicDBList and insert BasicDBObjects into it. Then we add the BasicDBList to the parent BasicDBObject. The following snippet illustrates that sequence. It's mostly boilerplate code, with the only human decision being the nesting structure.
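For comparison, the equivalent nesting in mongo-shell JavaScript is just object and array literals; the names below are illustrative only:

```javascript
// The same parent -> list -> children sequence as the Java API:
const parent = { name: "parent" };      // cf. new BasicDBObject()
const children = [];                    // cf. new BasicDBList()
children.push({ name: "child1" });      // BasicDBObjects added to the list
children.push({ name: "child2" });
parent.kids = children;                 // attach the list to the parent
// db.myCollection.insert(parent);      // then insert into the DBCollection
```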

We recently added support for MongoDB in Toad for Cloud Databases, so I took the opportunity of writing my first MongoDB Java program and taking the Toad functionality for a test drive.

MongoDB is a non-relational, document-oriented database that is extremely popular with developers (see for instance this Hacker News poll). Toad for Cloud Databases allows you to work with non-relational data using SQL by normalizing the data structures and converting SQL to the non-relational calls.

I wanted to get started by creating some MongoDB collections with familiar data. So I wrote a Java program that takes data out of the Oracle sample schema, and loads it into Mongo as documents. The program is here.

The key parts of the code are shown here:

 1: while (custRs.next()) { // For each customer
 2:     String custId = custRs.getString("CUST_ID");
 3:     String custFirstName = custRs.getString("CUST_FIRST_NAME");
 4:     String custLastName = custRs.getString("CUST_LAST_NAME");
 5:
 6:     // Create the customer document
 7:     BasicDBObject custDoc = new BasicDBObject();
 8:     custDoc.put("_id", custId);
 9:     custDoc.put("CustomerFirstName", custFirstName);
10:     custDoc.put("CustomerLastName", custLastName);
11:     // Create the product sales document
12:     BasicDBObject customerProducts = new BasicDBObject();
13:     custSalesQry.setString(1, custId);
14:     ResultSet prodRs = custSalesQry.executeQuery();
15:     Integer prodCount = 0;
16:     while (prodRs.next()) { // For each product sale
17:         String timeId = prodRs.getString("TIME_ID");
18:         Integer prodId = prodRs.getInt("PROD_ID");
19:         String prodName = prodRs.getString("PROD_NAME");
20:         Float Amount = prodRs.getFloat("AMOUNT_SOLD");
21:         Float Quantity = prodRs.getFloat("QUANTITY_SOLD");
22:         // Create the line item document
23:         BasicDBObject productItem = new BasicDBObject();
24:         productItem.put("prodId", prodId);
25:         productItem.put("prodName", prodName);
26:         productItem.put("Amount", Amount);
27:         productItem.put("Quantity", Quantity);
28:         // Put the line item in the salesForCustomer document
29:         customerProducts.put(timeId, productItem);
30:         if (prodCount++ > 4) { // Just 5 for this demo
31:             prodCount = 0;
32:             break;
33:         }
34:     }
35:     // Put the salesForCustomer document in the customer document
36:     custDoc.put("SalesForCustomer", customerProducts);
37:
38:     System.out.println(custDoc);
39:     custColl.insert(custDoc); // Insert the customer
40:     custCount++;
41:
42: }

Here’s how it works:

Lines   Description
1-4     We loop through each customer, retrieving the key customer details
7-10    We create a basic MongoDB document that contains the customer details
12      We create another MongoDB document that will contain all the product sales for the customer
16-21   We fetch the data for an individual sale for that customer from Oracle
23-27   We create a document for that single sale
29      We add the sale to the document containing all the sales
36      We add all the sales to the customer document
39      We insert the customer document into the collection

The MongoDB API is very straightforward; much easier than similar APIs for HBase or Cassandra.

When we run the program, we create JSON documents in Mongo DB that look like this:

Toad for Cloud “renormalizes” the documents so that they resemble something that we might use in a more conventional database. So in this case, Toad creates two tables from the Mongo collection, one for customers, and one which contains the sales for a customer. You can rename the auto-generated foreign keys and the sub-table name to make this a bit clearer, as in the example below:

We can more clearly see the relationships in the .NET client by using Toad’s visual query builder (or we could have used the database diagram tool):

MongoDB has a pretty rich query language, but it’s fairly mysterious to those of us who are used to SQL, and it’s certainly not as rich as SQL. Using Toad for Cloud, you can issue ANSI-standard SQL against your MongoDB tables and quickly browse or perform complex queries. Later this year, this Mongo support will emerge in some of our commercial data management tools such as Toad for Data Analysts and our soon-to-be-announced BI tools.