Introducing Riak, Part 1: The language-independent HTTP API

Store and retrieve data using Riak's HTTP interface

This is Part 1 of a two-part series about Riak, a highly
scalable, distributed data store written in Erlang and based on Dynamo,
Amazon's high availability key-value store. Learn the basics about Riak and
how to store and retrieve items using its HTTP API. Explore how to use its
Map/Reduce framework for doing distributed queries, how links allow
relationships to be defined between objects, and how to query those
relationships using link walking.

03 Apr 2012 - Per author response to reader feedback about the paragraph immediately following Listing 9 of Example: Distributed grep, corrected the third sentence to read: "Save the code in Listing 10 in a directory somewhere."

29 Mar 2012 - In response to reader feedback, updated original text of second paragraph in Introduction.

Simon Buckle is an independent consultant. His interests include
distributed systems, algorithms, and concurrency. He has a master's degree
in Computing from Imperial College, London. Check out his website at simonbuckle.com.

Introduction

Typical modern relational databases perform poorly on certain types of
applications and struggle to cope with the performance and scalability
demands of today's Internet applications. A different approach is needed.
In the last few years, a new type of data store, commonly referred to as
NoSQL, has become popular as it directly addresses some of the
deficiencies of relational databases. Riak is one such example of this
type of data store.

Other articles in this series

Riak is not the only NoSQL data store out there. Two other popular data stores are MongoDB and Cassandra. Although the three are similar in many ways, there are also some significant differences. For example, Riak is a distributed system, whereas MongoDB is a single-system database; Riak has no concept of a master node, making it more resilient to failure. Though also based on Amazon's description of Dynamo, Cassandra omits certain features such as vector clocks, using timestamps instead for conflict resolution, so it is important that clocks on the clients are synchronized.

Another strength of Riak is that it is written in Erlang. MongoDB and
Cassandra are written in general-purpose languages (C++ and Java,
respectively), whereas Erlang was designed from the ground up to support
distributed, fault-tolerant applications. It is therefore well suited to
building systems such as NoSQL data stores, which share some
characteristics with the applications Erlang was originally created for.

Map/Reduce jobs can be written in either Erlang or JavaScript. For this
article, we have chosen to write the map and
reduce functions in JavaScript. While Erlang
code may be slightly quicker to execute, we chose JavaScript because of
its accessibility to a larger audience. See Resources
for links to learn more about Erlang.

Getting started

If you want to try out some of the examples in this article, you need to
install Riak (see Resources) and Erlang on your
system.

You also need to build a cluster containing three nodes running on your
local machine. All data stored in Riak is replicated to a number of nodes
in the cluster. A property (n_val) on the bucket the data is stored in
determines how many nodes the data is replicated to. The default value of
this property is three, so we need a cluster of at least three nodes
(after which you can create as many as you like) for replication to be
effective.

After you download the source code, you need to build it. The basic steps
are as follows:

Unpack the source:
$ tar xzvf riak-1.0.1.tar.gz

Change directory: $ cd riak-1.0.1

Build: $ make all rel

This will build Riak (./rel/riak). To run multiple nodes locally you need
to make copies of ./rel/riak — one copy for each additional node.
Copy ./rel/riak to ./rel/riak2, ./rel/riak3 and so on, then make the
following changes to each copy:

In riakN/etc/app.config, change the following values to something
unique: the port specified in the http{} section, handoff_port, and
pb_port

Open up riakN/etc/vm.args and change the name, again to something
unique, for example,
-name riak2@127.0.0.1

The Riak API

There are currently three ways of accessing Riak: an HTTP API (RESTful
interface), Protocol Buffers, and a native Erlang interface. Having more
than one interface gives you the benefit of being able to choose how to
integrate your application. If you have an application written in Erlang
then it would make sense to use the native Erlang interface so you have
tight integration between the two. There are also other factors, such as
performance, that may play a part in deciding which interface to use. For
example, a client that uses the Protocol Buffers interface will perform
better than one that interacts with the HTTP API; less data is
communicated and parsing all those HTTP headers can be (relatively) costly
in terms of performance. However, the benefit of having an HTTP API is
that most developers today, particularly Web developers, are familiar
with RESTful interfaces, and most programming languages have built-in
primitives for requesting resources over HTTP, for example, opening a
URL, so no additional software is needed. In this article, we will focus
on the HTTP API.

All the examples will use curl to interact with Riak through its HTTP
interface. This is just to get a better understanding of the underlying
API. There are a number of client libraries available in various
languages, and you should consider using one of them when developing an
application that uses Riak as the data store. The client libraries provide
an API to Riak that makes it easy to integrate into your application; you
won't have to write code yourself to handle the kinds of responses you
will see when using curl.

The API supports the usual HTTP methods: GET,
PUT, POST, and
DELETE, which are used for retrieving,
updating, creating, and deleting objects, respectively. Each one will be
covered in turn.

Storing objects

You can think of Riak as implementing a distributed map from keys (strings)
to values (objects). Riak stores values in buckets. There is no need to
explicitly create a bucket before storing an object in one; if an object
is stored in a bucket that doesn't exist, it will be created automatically
for us.

Buckets are a virtual concept in Riak and exist primarily as a means of
grouping related objects. Buckets also have properties, and the values of
these properties define what Riak does with the objects stored in them.
Here are some examples of bucket properties:

n_val— The number of times an object should be replicated across the
cluster

allow_mult— Whether to allow concurrent updates

You can view a bucket's properties (and their current values) by making a
GET request on the bucket itself.
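As a minimal sketch of forming such a request with Python's standard urllib (the bucket name "artists" is just an example; actually sending the request requires a node listening on the default port 8098):

```python
import urllib.request

# Build (but do not send) a GET request for the bucket's properties.
# Sending it requires a running Riak node on localhost:8098.
req = urllib.request.Request("http://localhost:8098/riak/artists")

# With no data attached, urllib treats this as a GET request.
print(req.get_method())  # GET
```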

To store an object, we do an HTTP POST to one of
the URLs shown in Listing 3.

Listing 3. Storing an object

POST -> /riak/<bucket> (1)
POST -> /riak/<bucket>/<key> (2)

Keys can either be allocated automatically by Riak (1) or defined by the
user (2).

When storing an object with a user-defined key it's also possible to do an
HTTP PUT to (2) to create the object.

The latest version of Riak also supports the following URL format:
/buckets/<bucket>/keys/<key>, but we will use the older format
in this article in order to maintain backwards compatibility with earlier
versions of Riak.

If no key is specified, Riak will automatically allocate a key for the
object. For example, let's store a plain text object in the bucket "foo"
without explicitly specifying a key (see Listing
4).
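Listing 4 is not reproduced above, but the request it performs can be sketched with Python's standard urllib. This is a hedged illustration: the payload is arbitrary, and actually sending the request requires a running local node.

```python
import urllib.request

# POST to the bucket URL, with no key, so Riak allocates one for us.
req = urllib.request.Request(
    "http://localhost:8098/riak/foo",    # bucket "foo", no key
    data=b"this is a test",              # a plain-text value
    headers={"Content-Type": "text/plain"},
    method="POST",
)
# On success, the Location header of the response holds the generated
# key, in the form /riak/foo/<generated-key>.
```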

By examining the Location header, you can see the key that Riak allocated
to the object. It's not very memorable, so the alternative is to have the
user provide a key. Let's create an artists bucket and add an artist who
goes by the name of Bruce (see Listing 5).

If the object was stored correctly using the key that we specified, we will
get a 204 No Content response from the server.

In this example, we are storing the value of the object as JSON, but it
could just as easily have been plain text or some other format. It is
important when storing an object to ensure that the Content-Type header
is set correctly. For example, if you want to store a JPEG image, you
should set the content type to image/jpeg.
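In the same spirit as Listing 5, here is a hedged sketch (standard library only; the JSON fields are illustrative) of storing Bruce under a user-defined key with the Content-Type header set:

```python
import json
import urllib.request

# PUT to /riak/<bucket>/<key> stores the object under our own key.
payload = json.dumps({"name": "Bruce"}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8098/riak/artists/Bruce",
    data=payload,
    headers={"Content-Type": "application/json"},  # tell Riak it's JSON
    method="PUT",
)
# A successful store returns 204 No Content, as described above.
```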

Retrieving an object

To retrieve a stored object, do a GET on the
URL formed from the bucket and the key of the object you want to
retrieve. If the object exists, it will be returned in the body of the
response; otherwise, the server returns a 404 Object Not Found response
(see Listing 6).

Listing 7. Adding Bruce's nickname

As mentioned earlier, Riak creates buckets automatically, and buckets
have properties. One of those properties, allow_mult, determines whether
concurrent writes are allowed. By default, it is set to false. If
concurrent updates are allowed, then each update should also send the
X-Riak-Vclock header, set to the value that was seen when the object was
last read by the client.
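A sketch of such an update (the vector-clock string below is a placeholder; in practice you copy it from the X-Riak-Vclock response header of your last GET on the object, and the nickname value is illustrative):

```python
import json
import urllib.request

# Placeholder value; a real vector clock is an opaque string returned
# by Riak in the X-Riak-Vclock header of a previous read.
vclock_from_last_read = "PLACEHOLDER-VCLOCK"

body = json.dumps({"name": "Bruce", "nickname": "The Boss"}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8098/riak/artists/Bruce",
    data=body,
    headers={
        "Content-Type": "application/json",
        "X-Riak-Vclock": vclock_from_last_read,  # echo back the clock we read
    },
    method="PUT",
)
```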

Riak uses vector clocks to determine the causality of modifications to
objects. How vector clocks work is beyond the scope of this article, but
suffice it to say that when concurrent writes are allowed, conflicts may
occur, and it is left up to the application to resolve them (see
Resources).

Removing an object

Removing an object follows a similar pattern to the previous commands: we
simply do an HTTP DELETE to the URL that
corresponds to the object we want to delete:
$ curl -i -X DELETE
http://localhost:8098/riak/artists/Bruce.

If the object was removed successfully we will get a 204 No Content
response from the server; if the object we are trying to delete does not
exist, the server responds with a 404 Object Not Found.

Links

So far, we have seen how to store objects by associating an object with a
particular key so it can be retrieved later on. It would be useful if we
could extend this simple model to express how (and whether) objects are
related to each other. We can: Riak achieves this via links.

So, what are links? Links allow the user to create relationships between
objects. If you are familiar with UML class diagrams, you can think of a
link as an association between objects with a label describing the
relationship; in a relational database, the relationship would be
expressed using a foreign key.

Links are "attached" to objects via the Link header. Below is an example
of what a Link header looks like. The target of the relationship, that
is, the object we are linking to, is the part between the angle
brackets. The relationship type, in this case "performer", is expressed
by the riaktag property:
Link:
</riak/artists/Bruce>; riaktag="performer".

Let's add some albums and associate them with the artist Bruce who
performed on the albums (see Listing 8).
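Listing 8 is not shown above; here is a hedged sketch (the album value is illustrative) of one such request, which stores an album and links it to Bruce via the Link header:

```python
import json
import urllib.request

# The Link header points at the artist object and tags the relationship.
link = '</riak/artists/Bruce>; riaktag="performer"'
req = urllib.request.Request(
    "http://localhost:8098/riak/albums/TheRiver",
    data=json.dumps({"title": "The River"}).encode("utf-8"),
    headers={"Content-Type": "application/json", "Link": link},
    method="PUT",
)
```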

Now that we have set up some relationships, it's time to query them via
link walking, the name given to the process of querying the
relationships between objects. For example, to find the artist who
performed the album The River, you would do this:
$ curl -i
http://localhost:8098/riak/albums/TheRiver/artists,performer,1.

The part at the end of the URL is the link specification. The first part
(artists) specifies the bucket that we should
restrict the query to. The second part
(performer) specifies the tag we want to use to
limit the results, and finally, the 1 indicates
that we do want to include the results from this particular phase of the
query.
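The structure of the link specification can be made concrete with a small helper (not part of Riak or any client library, just an illustration) that assembles a walk URL from (bucket, tag, keep) phases:

```python
# Each phase is (bucket, tag, keep); "_" acts as a wildcard for the
# bucket or tag, and keep controls whether a phase's results are
# included in the response.
def link_walk_url(base, bucket, key, phases):
    spec = "/".join(
        "%s,%s,%d" % (b, t, int(keep)) for b, t, keep in phases
    )
    return "%s/riak/%s/%s/%s" % (base, bucket, key, spec)

one_hop = link_walk_url(
    "http://localhost:8098", "albums", "TheRiver",
    [("artists", "performer", True)],
)
# one_hop == "http://localhost:8098/riak/albums/TheRiver/artists,performer,1"

two_hops = link_walk_url(
    "http://localhost:8098", "albums", "TheRiver",
    [("artists", "_", False), ("artists", "collaborator", True)],
)
# two_hops ends with "/artists,_,0/artists,collaborator,1"
```

The two-hop URL here is exactly the transitive query discussed below.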

It's also possible to issue transitive queries. Let's assume we have set
up the relationships between albums and artists as in Figure 1.

Figure 1. Example relationship between albums and artists

It's now possible to issue queries such as "Which artists collaborated
with the artist who performed The River?" by executing the following:
$ curl -i
http://localhost:8098/riak/albums/TheRiver/artists,_,0/artists,collaborator,1.
The underscore in the link specification acts like a wildcard character
and indicates that we don't care what the relationship is.

Running Map/Reduce queries

Map/Reduce is a framework popularized by Google for running distributed
computations in parallel over huge datasets. Riak also supports
Map/Reduce, allowing more powerful queries to be performed on the data
stored in the cluster.

A Map/Reduce function consists of both a map phase and a reduce phase. The
map phase is applied to some data and produces zero or more results; this
is equivalent in functional programming terms to mapping a function over
each item in a list. The map phases occur in parallel. The reduce phase
then takes all of the results from the map phases and combines them
together.

For example, consider counting the number of each instance of a word across
a large set of documents. Each map phase would calculate the number of
times each word appears in a particular document. These intermediate
totals, once calculated, would then be sent to the reduce function that
would tally the totals and emit the result for the whole set of documents.
See Resources for a link to Google's Map/Reduce
paper.
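The word-count shape can be simulated locally, outside Riak entirely; this toy sketch just illustrates the division of labor between the two phases:

```python
from collections import Counter

# Map phase: one call per document, emitting a per-document tally.
def map_word_count(document):
    return Counter(document.lower().split())

# Reduce phase: merge all the intermediate tallies into one total.
def reduce_word_count(partials):
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

docs = ["the quick brown fox", "the lazy dog", "the fox"]
totals = reduce_word_count(map_word_count(d) for d in docs)
# totals["the"] == 3, totals["fox"] == 2, totals["dog"] == 1
```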

Example: Distributed grep

For this article, we are going to develop a Map/Reduce function that will
do a distributed grep over a set of documents stored in Riak. Just like
grep, the final output will be a set of lines that match the supplied
pattern. In addition, each result will also indicate the line number in
the document where the match occurred.

To execute a Map/Reduce query we do a POST to
the /mapred resource. The body of the request is a JSON representation of
the query; as in previous cases, the Content-Type header must be present
and always be set to application/json. Listing 9
shows the query that we will execute to do the distributed grep. Each part
of the query will be discussed in turn.
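Since Listing 9 is not reproduced above, here is a hedged sketch of the general shape of a /mapred request body; the function names are placeholders for whatever the code in Listing 10 actually exports, and "Sherlock" stands in for the grep pattern:

```python
import json

query = {
    "inputs": "documents",  # the bucket whose objects feed the map phase
    "query": [
        {"map": {"language": "javascript",
                 "name": "GrepUtils.map",   # placeholder function name
                 "arg": "Sherlock",         # passed to map as its arg
                 "keep": True}},
        {"reduce": {"language": "javascript",
                    "name": "GrepUtils.reduce"}},  # placeholder name
    ],
}
body = json.dumps(query)
# This body would be POSTed to http://localhost:8098/mapred with
# Content-Type: application/json against a running node.
```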

Each query consists of a number of inputs, for example, the set of
documents we want to do some computation on, and the name of a function
to run during both the map and reduce phases. It is also possible to
include the source of the map and
reduce functions directly inline in the query
by using the source property instead of name, but I have not done that
here. However, in order to use named functions, you will need to make
some changes to Riak's default configuration. Save the code in Listing
10 in a directory somewhere. For each node in the cluster, locate the
file etc/app.config, open it up, and set the property js_source_dir to
the directory where you saved the code. You will need to restart all the
nodes in the cluster for the changes to take effect.

The code in Listing 10 contains the functions that
will be executed during the map and reduce phases. The
map function looks at each line in the document
and checks whether it matches the supplied pattern (the
arg parameter). The
reduce function in this particular example
doesn't do much; it behaves like an identity function and just returns
its input.
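The article's functions are written in JavaScript, but the map phase's logic can be mirrored in a short Python sketch (line numbering and output format modeled on the sample output in Listing 13; the sample text is illustrative):

```python
import re

# Mirror of the map logic: scan each line of the document, and for every
# match emit "<line number>. <line>".
def grep_map(document_text, pattern):
    matches = []
    for lineno, line in enumerate(document_text.splitlines(), start=1):
        if re.search(pattern, line):
            matches.append("%d. %s" % (lineno, line))
    return matches

sample = ("To Sherlock Holmes she is always THE woman.\n"
          "Nothing relevant on this line.\n"
          "Good-night, Mister Sherlock Holmes.")
results = grep_map(sample, "Sherlock")
# results == ['1. To Sherlock Holmes she is always THE woman.',
#             '3. Good-night, Mister Sherlock Holmes.']
```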

Before we can run the query, we need some data. I downloaded a couple of
Sherlock Holmes e-books from the Project Gutenberg Web site (see Resources). The first text is stored in the
"documents" bucket under the key "s1"; the second text in the same bucket
with the key "s2".

Listing 11 is an example of how you would load such a
document into Riak.

The arg property in the query contains the
pattern that we want to grep for in the documents; this value is passed in
to the map function as the
arg parameter.

The output from running the Map/Reduce job over the sample data is in Listing 13.

Listing 13. Sample output from running the Map/Reduce
job

[["1. Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan
Doyle","9. Title: The Adventures of Sherlock Holmes","62. To Sherlock Holmes
she is always THE woman. I have seldom heard","819. as I had pictured it from
Sherlock Holmes' succinct description,","1017. \"Good-night, Mister Sherlock
Holmes.\"","1034. \"You have really got it!\" he cried, grasping Sherlock
Holmes by" …]]

Streaming Map/Reduce

To finish off this section on Map/Reduce, we'll take a brief look at Riak's
streaming Map/Reduce feature. It's useful for jobs that have map phases
that take a while to complete, since streaming the results allows you to
access the results of each map phase as soon as they become available, and
before the reduce phase has executed.

We can apply this to good effect to the distributed grep query. The
reduce step in the example doesn't actually do much. In fact, we can get
rid of the reduce phase altogether and emit the results from each map
phase directly to the client. To achieve this, we modify the query by
removing the reduce step and adding
?chunked=true to the end of the URL to indicate
that we want to stream the results (see Listing
14).
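Sketching the streaming variant (again with a placeholder function name): the reduce entry disappears, the map phase keeps its results, and the target URL gains the chunked flag:

```python
import json

stream_query = {
    "inputs": "documents",
    "query": [
        {"map": {"language": "javascript",
                 "name": "GrepUtils.map",  # placeholder function name
                 "arg": "Sherlock",
                 "keep": True}},           # return map results to the client
    ],
}
body = json.dumps(stream_query)
url = "http://localhost:8098/mapred?chunked=true"  # stream the results
```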

The results of each map phase — in this example, lines that match
the query string — will now be returned to the client as each map
phase completes. This approach would be useful for applications that need
to process the intermediary results of a query when they become
available.

Conclusion


Riak is an open source, highly scalable key-value store based on
principles from Amazon's Dynamo paper. It's easy to deploy and to scale:
additional nodes can be added to the cluster seamlessly. Features such as
link walking and support for Map/Reduce allow for more complex queries.
In addition to the HTTP API, there is also a native Erlang API and
support for Protocol Buffers. In Part 2 of this series, we'll explore a
number of client libraries available in various languages and show how
Riak can be used as a highly scalable cache.

Introduction to programming in Erlang (Martin Brown,
developerWorks, May 2011) explains how Erlang's functional programming
style compares with other programming paradigms such as imperative,
procedural and object-oriented programming.
