Aggregate grouping is what I’m titling this blog post, but I don’t know if it’s the best name. Have you ever used MySQL’s GROUP_CONCAT function or the FOR XML PATH('') workaround in SQL Server? That’s basically what I’m writing about today. With Couchbase Server, the easiest way to do it is with N1QL’s ARRAY_AGG function, but you can also do it with an old school MapReduce View.

I’m writing this post because one of our solution engineers was working on this problem for a customer (who will go unnamed). Neither of us could find a blog post like this with the answer, so after we worked together to come up with a solution, I decided I would blog about it for my future self (which is pretty much the main reason I blog anything, really. The other reason is to find out if anyone else knows a better way).

Before we get started, I’ve made some material available if you want to follow along. The source code I used to generate the "patient" data used in this post is available on GitHub. If you aren’t .NET savvy, you can just use cbimport on sample data that I’ve created. (Or, you can use the N1QL sandbox, more information on that later). The rest of this blog post assumes you have a "patients" bucket with that sample data in it.

Requirements

I have a bucket of patient documents. Each patient has a single doctor. The patient document refers to a doctor by a field called doctorId. There may be other data in the patient document, but we’re mainly focused on the patient document’s key and the doctorId value. Some examples:

Next, we can assume that each doctor can have multiple patients. We can also assume that a doctor document exists, but we don’t actually need that for this tutorial, so let’s just focus on the patients for now.

Finally, what we want for our application (or report or whatever), is an aggregate grouping of the patients with their doctor. Each record would identify a doctor and a list/array/collection of patients. Something like:

doctor

patients

58

01257721, 450mkkri, 8g2mrze2 …​

8

05woknfk, 116wmq8i, 2t5yttqi …​

…​ etc …​

…​ etc …​

This might be useful for a dashboard showing all the patients assigned to doctors, for instance. How can we get the data in this form, with N1QL or with MapReduce?

N1QL Aggregate grouping

Start by selecting the doctorId from each patient document, and the key to the patient document. Then, apply ARRAY_AGG to the patient document ID. Finally, group the results by the doctorId.

SELECT p.doctorId AS doctor, ARRAY_AGG(META(p).id) AS patients
FROM patients p
GROUP BY p.doctorId;

Note: don’t forget to run CREATE PRIMARY INDEX ON patients for this tutorial to enable a primary index scan.

Imagine this query without the ARRAY_AGG. It would return one record for each patient. By adding the ARRAY_AGG and the GROUP BY, it now returns one record for each doctor.

Here’s a snippet of the results on the sample data set I created:

If you don’t want to go through the trouble of creating a bucket and importing sample data, you can also try this in the N1QL tutorial sandbox. There aren’t patient documents in there, so the query will be a little different.

I’m going to group up emails by age. Start by selecting the age from each document, and the email from each document. Then, apply ARRAY_AGG to the email. Finally, group the results by the age.

SELECT t.age AS age, ARRAY_AGG(t.email) AS emails
FROM tutorial t
group by t.age;

Here’s a screenshot of some of the results from the sandbox:

Aggregate group with MapReduce

Similar aggregate grouping can also be achieved with a MapReduce View.

Start by creating a new View. From Couchbase Console, go to Indexes, then Views. Select the "patients" bucket. Click "Create Development View". Name a design document (I called mine "_design/dev_patient". Create a view, I called mine "doctorPatientGroup".

We’ll need both a Map and a custom Reduce function.

First, for the map, we just want the doctorId (in an array, since we’ll be using grouping) and the patient’s document ID.

function (doc, meta) {
emit([doc.doctorId], meta.id);
}

Next, for the reduce function, we’ll take the values and concatenate them into an array. Below is one way that you can do it. I do not claim to be a JavaScript expert or a MapReduce expert, so there might be a more efficient way to tackle this:

Summary

I think the N1QL method is easier, but there may be performance benefits to using MapReduce in some cases. In either case, you can accomplish aggregate grouping just as easily (if not more easily) as in a relational database.

This is a relatively complex query that involves the following steps (and more):

Identify and scan the correct index(es)

Fetch the corresponding data

Project the fields named in the SELECT clause

Find distinct results

UNION the results together

Stream the results back to the web console

In Couchbase Server 4.x, you could use the EXPLAIN N1QL command to get an idea of the query plan. Now, in Couchbase Server 5.0 beta, you can view the plan visually.

This tooling shows you, at a glance, the costliest parts of the query, which can help you to identify improvements.

Query monitoring

It’s important to have tooling to monitor your queries in action. Couchbase Server 5.0 beta has tooling to monitor active, completed, and prepared queries. In addition, you have the ability to cancel queries that are in progress.

Start by clicking "Query" on the Web Console menu, and then click "Query Monitor". You’ll see the "Active", "Completed", and "Prepared" options at the top of the page.

Let’s look at the "Completed" queries page. The query text and other information about the query is displayed in a table.

Next, you can sort the table to see which query took the longest to run (duration), return the most results (result count), and so on. Finally, if you click "edit", you’ll be taken to the Query Workbench with the text of that query.

New Couchbase Web Console

If you’ve been following along, you’ve probably already noticed the new Couchbase Web Console. The UI has been given an overhaul in Couchbase Server 5.0. The goal is to improve navigation and optimize the UI.

This new design maximizes usability of existing features from Server 4.x, while leaving room to expand the feature set of 5.0 and beyond.

cbimport and cbexport

New command line tooling includes cbimport and cbexport for moving data around.

cbimport supports importing both CSV and JSON data. The documentation on cbimport should tell you all you want to know, but I want to highlight a couple things:

Load data from a URI by using the -d,--dataset <uri> flags

Generate keys according to a template by using the -g,--generate-key <key_expr> flags. This gives you a powerful templating system to generate unique keys that fit your data model and access patterns

Summary

Tooling for Couchbase Server 5.0 beta is designed to make your life easier. These tools will help you whether you’re writing queries, integrating with data, monitoring, or performing administrative tasks.

We’re always looking for feedback. Inside of the Web Console, there is a feedback icon at the bottom right of the screen. You can click that to send us feedback about the tooling directly. Or, feel free to leave a comment below, or reach out to me on Twitter @mgroves.

There are two system catalogs that are already available to you in Couchbase Server 4.5 that I’ll be talking about today.

system:active_request - This catalog lists all the currently executing active requests or queries. You can execute the N1QL query SELECT * FROM system:active_requests; and it will list all those results.

system:completed_requests - This catalog lists all the recent completed requests (that have run longer than some threshold of time, default of 1 second). You can execute SELECT * FROM system:completed_requests; and it will list these queries.

New to N1QL: META().plan

Both active_requests and completed_requests return not only the original N1QL query text, but also related information: request time, request id, execution time, scan consistency, and so on. This can be useful information. Here’s an example that looks at a simple query (select * from `travel-sample`) while it’s running by executing select * from system:active_requests;

First, I want to point out that phaseTimes is a new addition to the results. It’s a quick and dirty way to get a sense of the query cost without looking at the whole profile. It gives you the overall cost of each request phase without going into detail of each operator. In the above example, for instance, you can see that parse took 500µs and primaryScan took 107.3891ms. This might be enough information for you to go on without diving into META().plan.

However, with the new META().plan, you can get very detailed information about the query plan. This time, I’ll execute SELECT *, META().plan FROM system:active_requests;

Note the new "plan" part. It contains a tree of operators that combine to execute the N1QL query. The root operator is a Sequence, which itself has a collection of child operators like Authorize, PrimaryScan, Fetch, and possibly even more Sequences.

Enabling the profile feature

To get this information when using cbq or the REST API, you’ll need to turn on the "profile" feature.

You can do this in cbq by entering set -profile timings; and then running your query.

You can also do this with the REST API on a per request basis (using the /query/service endpoint and passing a querystring parameter of profile=timings, for instance).

Notice the profile setting. It was previously set to off, but I set it to "timings".

You may not want to do that, especially on nodes being used by other people and programs, because it will affect other queries running on the node. It’s better to do this on a per-request basis.

It’s also what Query Workbench does by default.

Using the Query Workbench

There’s a lot of information in META().plan about how the plan is executed. Personally, I prefer to look at a simplified graphical version of it in Query Workbench by clicking the "Plan" icon (which I briefly mentioned in a previous post about the new Couchbase Web Console UI).

Let’s look at a slightly more complex example. For this exercise, I’m using the travel-sample bucket, but I have removed one of the indexes (DROP INDEX `travel-sample.def_sourceairport;`).

I then execute a N1QL query to find flights between San Francisco and Miami:

Executing this query (on my single-node local machine) takes about 10 seconds. That’s definitely not an acceptible amount of time, so let’s look at the plan to see what the problem might be (I broke it into two lines so the screenshots will fit in the blog post).

Looking at that plan, it seems like the costliest parts of the query are the Filter and the Join. JOIN operations work on keys, so they should normally be very quick. But it looks like there are a lot of documents being joined.

The Filter (the WHERE part of the query) is also taking a lot of time. It’s looking at the sourceairport and destinationairport fields. Looking elsewhere in the plan, I see that there is a PrimaryScan. This should be a red flag when you are trying to write performant queries. PrimaryScan means that the query couldn’t find an index other than the primary index. This is roughly the equivalent of a "table scan" in relational database terms. (You may want to drop the primary index so that these issues get bubbled-up faster, but that’s a topic for another time).

Let’s add an index on the sourceairport field and see if that helps.

CREATE INDEX `def_sourceairport` ON `travel-sample`(`sourceairport`);

Now, running the same query as above, I get the following plan:

This query took ~100ms (on my single-node local machine) which is much more acceptible. The Filter and the Join still take up a large percentage of the time, but thanks to the IndexScan replacing the PrimaryScan, there are many fewer documents that those operators have to deal with. Perhaps the query could be improved even more with an additional index on the destinationairport field.

Beyond Tweaking Queries

The answer to performance problems is not always in tweaking queries. Sometimes you might need to add more nodes to your cluster to address the underlying problem.

The servTime value indicates how much time is spent by the Query service to wait on the Key/Value data storage. If the servTime is very high, but there is a small number of documents being processed, that indicates that the indexer (or the key/value service) can’t keep up. Perhaps they have too much load coming from somewhere else. So this means that something weird is running someplace else or that your cluster is trying to handle too much load. Might be time to add some more nodes.

Similarly, the kernTime is how much time is spent waiting on other N1QL routines. This might mean that something else downstream in the query plan has a problem, or that the query node is overrun with requests and are having to wait a lot.

We want your feedback!

The new META().plan functionality and the new Plan UI combine in Couchbase Server 5.0 to improve the N1QL writing and profiling process.

Stay tuned to the Couchbase Blog for information about what’s coming in the next developer build.

We want feedback! Developer releases are coming every month, so you have a chance to make a difference in what we are building.

Bugs: If you find a bug (something that is broken or doesn’t work how you’d expect), please file an issue in our JIRA system at issues.couchbase.com or submit a question on the Couchbase Forums. Or, contact me with a description of the issue. I would be happy to help you or submit the bug for you (my Couchbase handlers high-five me every time I submit a good bug).

Feedback: Let me know what you think. Something you don’t like? Something you really like? Something missing? Now you can give feedback directly from within the Couchbase Web Console. Look for the icon at the bottom right of the screen.

In some cases, it may be tricky to decide if your feedback is a bug or a suggestion. Use your best judgement, or again, feel free to contact me for help. I want to hear from you. The best way to contact me is either Twitter @mgroves or email me matthew.groves@couchbase.com.

In this series of blog posts, I’m going to lay out the considerations when moving to a document database when you have a relational background. Specifically, Microsoft SQL Server as compared to Couchbase Server.

Data Types in JSON vs SQL

Couchbase (and many other document databases) use JSON objects for data. JSON is a powerful, human readable format to store data. When comparing to data types in relational tables, there are some similarities, and there are some important differences.

All JSON data is made up of 6 types: string, number, boolean, array, object, and null. There are a lot of data types available in SQL Server. Let’s start with a table that is a kind of "literal" translation, and work from there.

A string in SQL Server is often defined by a length. nvarchar(50) or nvarchar(MAX) for instance. In JSON, you don’t need to define a length. Just use a string.

A number in SQL Server varies widely based on what you are using it for. The number type in JSON is flexible, in that it can store integers, decimal, or floating point. In specialized circumstances, like if you need a specific precision or you need to store very large numbers, you may want to store a number as a string instead.

A boolean in JSON is true/false. In SQL Server, it’s roughly equivalent: a bit that represents true/false.

In JSON, any value can be null. In SQL Server, you set this on a field-by-field basis. If a field in SQL Server is not set to "nullable", then it will be enforced. In a JSON document, there is no such enforcement.

There are some other specialized data types in SQL Server, including hierarchyid, and xml. Typically, these would be unrolled in JSON objects and/or referenced by key (as explored in part 1 of this blog series on data modeling). You can still store XML/JSON within a string if you want, but if you do, then you can’t use the full power of N1QL on those fields.

Migrating and translating data

Depending on your organization and your team, you may have to bring in people from multiple roles to ensure a successful migration. If you have a DBA, that DBA will have to know how to run and manage Couchbase just as well as SQL Server. If you are DevOps, or have a DevOps team, it’s important to involve them early on, so that they are aware of what you’re doing and can help you coordinate your efforts. Moving to a document database does not mean that you no longer need DBAs or Ops or DevOps to be involved. These roles should also be involved when doing data modeling, if possible, so that they can provide input and understand what is going on.

For a naive migration (1 row to 1 document), you can write a very simple program to loop through the tables, columns, and values of a relational database and spit out corresponding documents. A tool like Dapper would handle all the data type translations within C# and feed them into the Couchbase .NET SDK.

Completely flat data is relatively uncommon, however, so for more complex models, you will probably need to write code to migrate from the old relational model to the new document model.

Here are some things you want to keep in mind when writing migration code (of any kind, but especially relational-to-nonrelational):

Give yourself plenty of time in planning. While migrating, you may discover that you need to rethink your model. You will need to test and make adjustments, and it’s better to have extra time than make mistakes while hurrying. Migrating data is an iterative cycle: migrate a table, see if that works, adjust, and keep iterating. You may have to go through this cycle many times.

Test your migration using real data. Data can be full of surprises. You may think that NVARCHAR field only ever contains string representations of numbers, but maybe there are some abnormal rows that contain words. Use a copy of the real data to test and verify your migration.

Be prepared to run the migration multiple times. Have a plan to cleanup a failed migration and start over. This might be a simple DELETE FROM bucket in N1QL, or it could be a more nuanaced and targeted series of cleanups. If you plan from the start, this will be easier. Automate your migration, so this is less painful.

ETL or ELT? Extract-Transform-Load, or Extract-Load-Transform. When are you going to do a transform? When putting data into Couchbase, the flexibility of JSON allows you to transfer-in-place after loading if you choose.

An example ETL migration

I wrote a very simple migration console app using C#, Entity Framework, and the Couchbase .NET SDK. It migrates both the shopping cart and the social media examples from the previous blog post. The full source code is available on GitHub.

This app is going to do the transformation, so this is an ETL approach. This approach uses Entity Framework to map relational tables to C# classes, which are then inserted into documents. The data model for Couchbase can be better represented by C# classes than by relational tables (as demonstrated in the previous blog post), so this approach has lower friction.

I’m going to to use C# to write a migration program, but the automation is what’s important, not the specific tool. This is going to be essentially "throwaway" code after the migration is complete. My C# approach doesn’t do any sort of batching, and is probably not well-suited to extremely large amounts of data, so it might be a good idea to use a tool like Talend and/or an ELT approach for very large scale/Enterprise data.

I created a ShoppingCartMigrator class and a SocialMediaMigrator class. I’m only going to cover the shopping cart in this post. I pass it a Couchbase bucket and the Entity Framework context that I used in the last blog post. (You could instead pass an NHibernate session or a plain DbConnection here, depending on your preference).

This is the simplest approach. A more complex approach could involve putting a temporary "fingerprint" marker field onto certain documents, and then deleting documents with a certain fingerprint in the cleanup. (E.g. DELETE FROM sqltocb WHERE fingerprint = '999cfbc3-186e-4219-ab5d-18ad130a9dc6'). Or vice versa: fingerprint the problematic data for later analysis and delete the rest. Just make sure to cleanup these temporary fields when the migration is completed successfully.

When you try this out yourself, you may want to run the console application twice, just to see the cleanup in action. The second attempt will result in errors because it will be attempting to create documents with duplicate keys.

What about the other features of SQL Server?

Not everything in SQL Server has a direct counterpart in Couchbase. In some cases, it won’t ever have a counterpart. In some cases, there will be a rough equivalent. Some features will arrive in the future, as Couchbase is under fast-paced, active, open-source development, and new features are being added when appropriate.

Also keep in mind that document databases and NoSQL databases often force business logic out of the database to a larger extent than relational databases. As nice as it would be if Couchbase Server had every feature under the sun, there are always tradeoffs. Some are technical in nature, some are product design decisions. Tradeoffs could be made to add relational-style features, but at some point in that journey, Couchbase stops being a fast, scalable database and starts being "just another" relational database. There is certainly a lot of convergence in both relational and non-relational databases, and a lot of change happening every year.

With that in mind, stay tuned for the final blog post in the series. This will cover the changes to application coding that come with using Couchbase, including:

SQL/N1QL

Stored Procedures

Service tiers

Triggers

Views

Serialization

Security

Concurrency

Autonumber

OR/Ms and ODMs

Transactions

Summary

This blog post compared and contrasted the data features available in Couchbase Server with SQL Server. If you are currently using SQL Server and are considering adding a document database to your project or starting a new project, I am here to help.

In this series of blog posts, I’m going to lay out the considerations when moving to a document database when you have a relational background. Specifically, Microsoft SQL Server as compared to Couchbase Server.

In three parts, I’m going to cover:

Data modeling (this blog post)

The data itself

Applications using the data

The goal is to lay down some general guidelines that you can apply to your application planning and design.

Why would I do this?

Before we get started, I want to spend a little bit of time on motivation. There are 3 main reasons why one might consider using a document data store instead of (or in addition to) a relational database. Your motivation may be one or all three:

Speed: Couchbase Server uses a memory-first architecture which can provide a great speed boost as compared to a relational databases

Scalability: Couchbase Server is a distributed database, which allows you to scale out (and scale back in) capacity by just racking up commodity hardware. Built-in Couchbase features like auto-sharding, replication, load balancing make scaling a lot smoother and easier than relational databases.

Flexibility: Some data fits nicely in a relational model, but some data can benefit from the flexibility of using JSON. Unlike SQL Server, schema maintenance is no longer an issue. With JSON: the schema bends as you need it to.

It should be noted that document databases and relational databases can be complimentary. Your application may be best served by one, the other, or a combination of both. In many cases, it simply is not possible to completely remove relational databases from your design, but a document database like Couchbase Server can still bring the above benefits to your software. The rest of this blog series will assume you have a SQL Server background and are either replacing, supplimenting, or starting a new greenfield project using Couchbase.

The ease or difficulty of transitioning an existing application varies widely based on a number of factors. In some cases it may be extremely easy; in some cases it will be time-consuming and difficult; in some (shrinking number of) cases it may not even be a good idea.

Understanding the differences

The first step is to understand how data is modeled in a document database. In a relational database, data is typically stored flat in a table and it is given structure with primary and foreign keys. As a simple example, let’s consider a relational database for a web site that has a shopping cart as well as social media features. (In this example, those features are unrelated to keep things simple).

In a document database, data is stored as keys and values. A Couchbase bucket contains documents; each document has a unique key and a JSON value. There are no foreign keys (or, more accurately, there are no foreign key constraints).

Here’s a high-level comparison of SQL Server features/naming as compared to Couchbase:

Table 1. SQL Server compared to Couchbase

SQL Server

Couchbase Server

Server

Cluster

Database

Bucket

Row(s) from table(s)

Document

Column

JSON key/value

Primary Key

Document Key

These comparisons are a metaphorical starting point. Looking at that table, it might be tempting to take a simplistic approach. "I have 5 tables, therefore I’ll just create 5 different types of documents, with one document per row." This is the equivalent of literally translating a written language. The approach may work sometimes, but it doesn’t take into account the full power of a document database that uses JSON. Just as a literal translation of a written language doesn’t take into account cultural context, idioms, and historical context.

Because of the flexibility of JSON, the data in a document database can be structured more like a domain object in your application. Therefore you don’t have an impedence mismatch that is often addressed by OR/M tools like Entity Framework and NHibernate.

There are two main approaches you can use when modeling data in Couchbase that we will examine further:

Denormalization - Instead of splitting data between tables using foreign keys, group concepts together into a single document.

Referential - Concepts are given their own documents, but reference other documents using the document key.

Denormalization example

Let’s consider the "shopping cart" entity.

To represent this in a relational database would likely require two tables: a ShoppingCart table and a ShoppingCartItem table with a foreign key to a row in ShoppingCart.

When creating the model for a document database, the decision has to be made whether to continue modeling this as two separate entities (e.g. a Shopping Cart document and corresponding Shopping Cart Item documents) or whether to "denormalize" and combine a row from ShoppingCart and row(s) from ShoppingCartItem into a single document to represent a shopping cart.

In Couchbase, using a denormalization strategy, a shopping cart and the items in it would be represented by a single document.

Based on these mappings and an analysis of the use cases, I could decide that it would be modeled as a single document in Couchbase. ShoppingCartItemMap only exists so that the OR/M knows how to populate the Items property in ShoppingCart. Also, it’s unlikely that the application will be doing reads of the shopping cart without also needing to read the items.

In a later post, OR/Ms will be discussed further, but for now I can say that the ShoppingCartMap and ShoppingCartItemMap classes are not necessary when using Couchbase, and the Id field from Item isn’t necessary. In fact, the Couchbase .NET SDK can directly populate a ShoppingCart object without an OR/M in a single line of code:

This isn’t to say that using Couchbase will always result in shorter, easier to read code. But for certain use cases, it can definitely have an impact.

Referential example

It’s not always possible or optimal to denormalize relationships like the ShoppingCart example. In many cases, a document will need to reference another document. Depending on how your application expects to do reads and writes, you may want to keep your model in separate documents by using referencing.

Let’s look at an example where referencing might be the best approach. Suppose your application has some social media elements. Users can have friends, and users can post text updates.

One way to model this:

Users as individual documents

Updates as individual documents that reference a user

Friends as an array of keys within a user document

With two users, two updates, we would have 4 documents in Couchbase that look like this:

The Update to FriendbookUser relationship can be modeled as either a Guid or as another FriendbookUser object. This is an implementation detail. You might prefer one, the other, or both, depending on your application needs and/or how your OR/M works. In either case, the underlying model is the same.

Here’s the mapping I used for these classes in Entity Framework. Your mileage may vary, depending on how you use EF or other OR/M tools. Focus on the underlying model and not the details of the OR/M mapping tool.

If, instead of storing these entities as separate documents, we applied the same denormalization as the shopping cart example and attempted to store a user and updates in one document, we would end up with some problems.

Duplication of friends: each user would store the details for their friends. This is not tenable, because now a user’s information would be stored in multiple places instead of having a single source of truth (unlike the shopping cart, where having the same item in more than one shopping cart probably doesn’t make any domain sense). This might be okay when using Couchbase as a cache, but not as a primary data store.

Size of updates: Over a period of regular use, an individual user could post hundreds or thousands of updates. This could lead to a very large document which could slow down I/O operations. This can be mitigated with Couchbase’s sub-document API, but also note that Couchbase has a ceiling of 20mb per document.

Note: There’s an N+1 problem here too (friends of friends, etc), but I’m not going to spend time on addressing that. It’s a problem that’s not unique to either database.

Additionally, it may not be the case that when the application reads or writes a user that it will need to read or write friends & updates. And, when writing an update, it’s not likely that the application will need to update a user. Since these entities may often be read/written on their own, that indicates that they need to be modeled as separate documents.

Note the array in the Friends field in the user document and the value in the User field in the update document. These values can be used to retrieve the associated documents. Later in this post, I’ll discuss how to do it with key/value operations and how to do it with N1QL.

To sum up, there are two ways to model data in a document database. The shopping cart example used nested objects, while the social media example used separate documents. In those examples, it was relatively straightforward to choose. When you’re making your own modeling decisions, here’s a handy cheat sheet:

Table 2. Modeling Data Cheat Sheet

If …​

Then consider…​

Relationship is 1-to-1 or 1-to-many

Nested objects

Relationship is many-to-1 or many-to-many

Separate documents

Data reads are mostly parent fields

Separate document

Data reads are mostly parent + child fields

Nested objects

Data reads are mostly parent or child (not both)

Separate documents

Data writes are mostly parent and child (both)

Nested objects

Key/value operations

To get document(s) in Couchbase, the simplest and fastest way is to ask for them by key. Once you have one of the FriendbookUser documents above, you can then execute another operation to get the associated documents. For instance, I could ask Couchbase to give me the documents for keys 2, 3, and 1031 (as a batch operation). This would give me the documents for each friend. I can then repeat that for Updates, and so on.

The benefit to this is speed: key/value operations are very fast in Couchbase, and you will likely be getting values directly from RAM.

The drawback is that it involves at least two operations (get FriendbookUser document, then get the Updates). So this may involve some extra coding. It may also require you to think more carefully about how you construct document keys (more on that later).

N1QL

In Couchbase, you have the ability to write queries using N1QL, which is SQL for JSON. This includes the JOIN keyword. This allows me to, for instance, write a query to get the 10 latest updates and the users that correspond to them.

N1QL allows you to have great flexibility in retrieving data. I don’t have to be restricted by just using keys. It’s also easy to pick up, since it’s a superset of SQL that SQL Server users will be comfortable with quickly. However, the tradeoff here is that indexing is important. Even more so than SQL Server indexing. If you were to write a query on the Name field, for instance, you should have an index like:

CREATE INDEX IX_Name ON `SocialMedia` (Name) USING GSI;

Otherwise the query will fail to execute (if you have no indexing) or it will not be performant (if you only have a primary index created).

There are pros and cons in deciding to use referencing or not. The values in friends and user are similar to foreign keys, in that they reference another document. But there is no enforcement of values by Couchbase. The management of these keys must be handled properly by the application. Further, while Couchbase provides ACID transactions for single document operations, there is no multi-document ACID transaction available.

There are ways to deal with these caveats in your application layer that will be discussed further in later blog posts in this series, so stay tuned!

Key design and document differentiation

In relational databases, rows of data (typically, not always) correspond to a primary key, which is often an integer or a Guid, and sometimes a composite key. These keys don’t necessarily have any meaning: they are just used to identify a row within a table. For instance, two rows of data in two different tables may have the same key (an integer value of 123, for instance), but that doesn’t necessarily mean the data is related. This is because the schema enforced by relational databases often conveys meaning on its own (e.g. a table name).

In document databases like Couchbase, there isn’t anything equivalent to a table, per se. Each document in a bucket must have a unique key. But a bucket can have a variety of documents in it. Therefore, it’s often wise to come up with a way to differentiate documents within a bucket.

Meaningful keys

For instance, it’s entirely possible to have a FriendbookUser document with a key of 123, and an Update document with a key of 456. However, it might be wise to add some more semantic information to the key. Instead of 123, use a key of FriendbookUser::123. The benefits to putting semantic information in your key include:

Readability: At a glance, you can tell what a document is for.

Referenceability: If you have a FriendbookUser::123 document, then you could have another document with a key FriendbookUser::123::Updates that has an implicit association.

If you plan on using N1QL, then you may not need keys to be this semantically meaningful. In terms of performance, the shorter the key is, the more of them can be stored in RAM. So only use this pattern if you plan on making heavy use of key/value operations instead of N1QL queries.

Discriminator fields

When using N1QL, another tactic that can be used in addition to or instead of meaningful keys is to add field(s) to a document that are used to differentiate the document. This is often implemented as a type field within a document.

There’s nothing magical about the type field. It’s not a reserved word within a document and it’s not treated specially by Couchbase Server. It could just as easily be named documentType, theType, etc. But it can be useful within your application when using N1QL to query documents of a certain kind.

SELECT d.*
FROM `default` d
WHERE d.type = 'address'

You may even take it a step further and add an embedded object to your documents to act as a kind of faux 'meta data':

That may be overkill for some applications. It’s similar to a pattern I’ve seen in relational databases: a 'root' table to simulate inheritence within a relational database, or perhaps the same fields tacked on to every table.

Conclusion of part 1

This blog post covered data modeling using denormalization, data modeling using referencing, key design, and discriminating fields. Modeling data in a document database is a thought process, something of an art form, and not a mechanical process. There is no prescription on how to model your data in a document database: it depends greatly on how your application interacts with your data.