Introduction to ReQL

ReQL is the RethinkDB query language. It offers a very powerful and convenient way to manipulate JSON documents. This document is a gentle introduction to ReQL concepts. You don’t have to read it to be productive with RethinkDB, but it helps to understand some basics.

ReQL is different from other NoSQL query languages. It’s built on three key principles:

ReQL embeds into your programming language. Queries are constructed by making function calls in the programming language you already know. You don’t have to concatenate strings or construct specialized JSON objects to query the database.

All ReQL queries are chainable. You begin with a table and incrementally chain transformers to the end of the query using the . operator.

All queries execute on the server. While queries are constructed on the client in a familiar programming language, they execute entirely on the database server once you call the run command and pass it an active database connection.

Let’s look at these concepts in more detail.

Note: the following examples use the Python driver, but most of them also apply to RethinkDB drivers for other languages.

ReQL embeds into your programming language

You start using ReQL in your program similarly to how you’d use other databases:

But this is where the similarity ends. Instead of constructing strings and passing them to the database server, you access ReQL by using methods from the rethinkdb package:

r.table_create('users').run(conn) # create a table `users`
r.table('users').run(conn) # get an iterable cursor to the `users` table

Every ReQL query, from filters, to updates, to table joins is done by calling appropriate methods.

This design has the following advantages:

You can use the same programming environment and tools you’re already used to.

Learning the language is no different from learning any other library.

There is little to no chance of security issues that arise from string injection attacks.

All ReQL queries are chainable

In ReQL, you can chain commands at the end of other commands using the . operator:

# Get an iterable cursor to the `users` table (we've seen this above)
r.table('users').run(conn)
# Return only the `last_name` field of the documents
r.table('users').pluck('last_name').run(conn)
# Get all the distinct last names (remove duplicates)
r.table('users').pluck('last_name').distinct().run(conn)
# Count the number of distinct last names
r.table('users').pluck('last_name').distinct().count().run(conn)

Almost all ReQL operations are chainable. You can think of the . operator similarly to how you’d think of a Unix pipe. You select the data from the table and pipe it into a command that transforms it. You can continue chaining transformers until your query is done. In ReQL, data flows from left to right.

Even if you have a cluster of RethinkDB nodes, you can send your queries to any node and the cluster will create and execute distributed programs that get the data from relevant nodes, perform the necessary computations, and present you with final results without you ever worrying about it.

This design has the following advantages:

The language is easy to learn, read, and modify.

It’s a natural and convenient way to express queries.

You can construct queries incrementally by chaining transformations and examining intermediary results.

ReQL is efficient

Server-side execution

While queries are built up on the client, they’re only sent to the server once you call the run command. All processing happens on the server—the queries don’t run on the client, and don’t require intermediary network round trips between the client and the server. For example, you can store queries in variables, and send them to the server later:

# Create the query to get distinct last names
distinct_lastnames_query = r.table('users').pluck('last_name').distinct()
# Send it to the server and execute
distinct_lastnames_query.run(conn)

Laziness

ReQL queries are executed lazily:

# Get up to five user documents that have the `age` field defined
r.table('users').has_fields('age').limit(5).run(conn)

For this query RethinkDB will perform enough work to get the five documents, and stop when the query is satisfied. Even if you don’t have a limit on the number of queries but use a cursor, RethinkDB will do just enough work to allow you to read the data you request. This allows queries to execute quickly without wasting CPU cycles, network bandwidth, and disk IO.

Like most database systems, ReQL supports primary and secondary indexes to allow efficient data access. You can also create compound indexes and indexes based on arbitrary ReQL expressions to speed up complex queries.

Parallelism

All ReQL queries are automatically parallelized on the RethinkDB server as much as possible. Whenever possible, query execution is split across CPU cores, servers in the cluster, and even multiple datacenters. If you have large, complicated queries that require multiple stages of processing, RethinkDB will automatically break them up into stages, execute each stage in parallel, and combine data to return a complete result.

Query optimization

While RethinkDB doesn’t currently have a fully-featured query optimizer, ReQL is designed with one in mind. For example, the server has enough information to reorder the chain for efficiency, or to use alternative implementation plans to improve performance. This feature will be introduced into future versions of RethinkDB.

ReQL queries are functional

So far we’ve seen only simple queries without conditions. ReQL supports a familiar syntax for building more advanced queries:

This query looks just like any other Python code you would normally write. Note that RethinkDB will execute this query on the server, and it doesn’t execute native Python code.

The client drivers do a lot of work to inspect the code and convert it to an efficient ReQL query that will be executed on the server:

Whenever possible, the client drivers use operator overloading to support expressions such as user['age'] > 30.

The lambda expression is executed only once on the client. Internally, the driver passes a special object to the lambda function which allows constructing a representation of the query. This representation is then sent to the server over the network and evaluated on the cluster.

This technology has limitations. While most operations allow you to write familiar code, you can’t use native language’s operations that have side effects (such as print) or control blocks (such as if and for). Instead, you have to use alternative ReQL commands:

RethinkDB will seamlessly evaluate the js command by calling into the V8 engine during the evaluation of the filter query. You can combine most queries this way into progressively more complex ones.

Subqueries

Let’s say we have another table authors, and we’d like to get a list of authors whose last names are also in the users table we’ve seen before. We can do it by combining two queries:

# Find all authors whose last names are also in the `users` table
r.table('authors').filter(lambda author:
r.table('users').pluck('last_name').contains(author.pluck('last_name'))).
run(conn)

Here, we use the r.table('users').pluck('last_name') query as the inner query in filter, combining the two queries to build a more sophisticated one. Even if you have a cluster of servers and both the authors table and the users table are sharded, RethinkDB will do the right thing and evaluate relevant parts of the query above on the appropriate shards, combine bits of data as necessary, and return the complete result.

A few things to note about this query:

We compose the query on the client and call run only once. Remember to call run only once on the complex query when you’re ready for it to be executed.

Expressions

Composing queries isn’t limited to simple commands and inner queries. You can also use expressions to perform complex operations. For example, suppose we’d like to find all users whose salary and bonus don’t exceed $90,000, and increase their salary by 10%: