
Luwak

The Riak key/value store is extremely fast and reliable, but its limits on object size preclude its use for some applications. For example, storing high-definition still images or video is difficult or impossible. Luwak is a service layer on top of Riak that provides a simple, file-oriented abstraction for enormous objects.

Installation

To use Luwak, you must build Riak from source after modifying a few of Riak's configuration files. Start with a clean clone of Riak:

git clone git://github.com/basho/riak

In the top-level directory of your clone, edit the rebar.config file. Add luwak as a dependency to make rebar fetch it at build time:

Opening a get stream actually kicks off a Riak MapReduce job. The job recurses down the tree, following links until it reaches the actual data blocks, and then sends all of those blocks to the intermediary get stream process. You can close a get stream before it is finished, but there is currently no way to cancel the MapReduce job in progress. When you close a get stream early, the intermediary process exits, and Erlang ought to drop the messages the MapReduce job is still sending to it.

Operational Considerations

Luwak is purely library code. It has no supervisor tree and does not need to be started like a traditional Erlang application. Luwak only requires that its code be on the load path of the client and on the load path of every Riak node.

Let It Crash Design

Luwak is designed according to Erlang's "let it crash" philosophy. Most of the publicly accessible functions in Luwak will intentionally crash on failure. Luwak, however, will never corrupt your files. If a write operation is aborted at any step before it returns to the caller, no changes will have been made to the file. Likewise, a file can be read while it is being written. Reads that start before the write completes simply see the previous version of the file.

Internal Architecture

Luwak stores a large object in Riak as a metadata document, named for the object being stored, and a sequence of immutable block documents (henceforth referred to simply as 'blocks'), each of which is named with a hash of its contents. All blocks for an object are the same size, except the last, which may be shorter when the object's length is not evenly divisible by the block size. All hash operations use Skein-512.
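The block layout described above can be sketched in a few lines of Python. This is only an illustration, not Luwak's code: `BLOCK_SIZE`, `block_key`, and `split_into_blocks` are hypothetical names, the block size is deliberately tiny, and SHA-512 from the standard library stands in for Skein-512, which Python does not ship.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny value for illustration only


def block_key(data: bytes) -> str:
    # Stand-in hash: Luwak uses Skein-512, which is not in Python's stdlib.
    return hashlib.sha512(data).hexdigest()


def split_into_blocks(payload: bytes, block_size: int = BLOCK_SIZE):
    """Return (key, data) pairs; only the last block may be short."""
    blocks = []
    for offset in range(0, len(payload), block_size):
        chunk = payload[offset:offset + block_size]
        blocks.append((block_key(chunk), chunk))
    return blocks


blocks = split_into_blocks(b"abcdefghij")
print([len(data) for _, data in blocks])  # [4, 4, 2]
```

Because a block's name depends only on its contents, two identical chunks always map to the same key, no matter which object they belong to.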

When a new object is passed to Luwak for storage in Riak, a new metadata document is created for it. This metadata document uses Riak's built-in links to point to a top-level document describing a Merkle tree over the object's data.

Node and block documents are immutable by definition in Luwak, because each key is computed from the document's contents. To change the contents of a file, Luwak must therefore create a new tree. As with most immutable data structures, this is done by computing new nodes only for the parts that actually changed and keeping references to the subtrees that stay the same.
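The idea of rebuilding only the changed path can be sketched with a toy content-addressed tree. Everything here is an assumption made for illustration: the in-memory `store` dict stands in for Riak, `put`, `build`, `replace`, and `read` are hypothetical helpers, `ORDER` is unrealistically small, and SHA-512 again substitutes for Skein-512.

```python
import hashlib
import json

ORDER = 2  # children per node; hypothetical, tiny value for illustration


def put(store, kind, payload):
    """Store an immutable document under the hash of its contents."""
    # payload is bytes for a block, a list of child keys for a node.
    body = payload if kind == "block" else json.dumps(payload).encode()
    key = hashlib.sha512(kind.encode() + b":" + body).hexdigest()
    store[key] = (kind, payload)
    return key


def build(store, chunks):
    """Build a tree bottom-up over the chunks and return the root key."""
    level = [put(store, "block", c) for c in chunks]
    while len(level) > 1:
        level = [put(store, "node", level[i:i + ORDER])
                 for i in range(0, len(level), ORDER)]
    return level[0]


def replace(store, key, path, new_data):
    """Return the root key of a new tree with the block at `path` replaced."""
    kind, payload = store[key]
    if kind == "block":
        return put(store, "block", new_data)
    children = list(payload)  # copy; the stored node is never mutated
    children[path[0]] = replace(store, children[path[0]], path[1:], new_data)
    return put(store, "node", children)


def read(store, key):
    """Concatenate every block reachable from `key`, left to right."""
    kind, payload = store[key]
    if kind == "block":
        return payload
    return b"".join(read(store, child) for child in payload)


store = {}
old_root = build(store, [b"aaaa", b"bbbb", b"cccc", b"dddd"])
new_root = replace(store, old_root, [1, 0], b"XXXX")  # third block

print(read(store, old_root))  # b'aaaabbbbccccdddd'
print(read(store, new_root))  # b'aaaabbbbXXXXdddd'
```

In this sketch the update writes only three new documents (one block, one node, one root), the untouched left subtree is shared by both roots, and the old root stays fully readable. Every write, however small, still has to create a new path from the changed block up to a new root.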

Creating a new tree for every change may seem slow, but it has several benefits:

It breaks up the read-update-write cycle that most KV stores require. Assuming that a node document's child hashes or a data document's data are already in memory, Luwak only needs to execute a create-write cycle, which removes a database round trip. The only part of Luwak that mutates is the top-level document, and only to update the root tree hash.

It allows efficient versioning and conflict detection for Luwak objects. Two documents that have many blocks in common can share storage for those blocks. Showing a diff via the interface then becomes an exercise in tree walking. Conflict resolution is even easier: when a conflict is detected between two top-level documents (TLDs) in Riak, the conflict resolution code merely has to add the hashes of the older trees to the ancestors list.

Tree density is kept optimal. Luwak trees are immutable and support only appends and overwrites. This allows the B+ tree algorithm to be tuned so that Luwak tree nodes are completely full except at the very tail of the tree. With optimally full nodes, retrieval has to descend at most about log(n) / log(tree_order) levels before reaching the requested block, where n is the number of blocks.
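To get a feel for the log(n) / log(tree_order) bound in the last point, the following sketch counts levels for a completely full tree. The function name and the example figures (one million blocks, a tree order of 250) are hypothetical and are not Luwak's defaults. An integer loop is used instead of floating-point logarithms to avoid rounding surprises at exact powers.

```python
def levels_to_block(block_count: int, tree_order: int) -> int:
    """Smallest depth d such that tree_order ** d >= block_count."""
    if block_count < 1 or tree_order < 2:
        raise ValueError("need at least one block and a tree order of 2 or more")
    depth, capacity = 0, 1
    while capacity < block_count:
        capacity *= tree_order  # one more level multiplies reachable blocks
        depth += 1
    return depth


# Hypothetical figures: one million blocks, 250 children per node.
print(levels_to_block(1_000_000, 250))  # 3
```

With these assumed numbers, 250^2 = 62,500 is too few leaves and 250^3 = 15,625,000 is enough, so a lookup descends three levels below the root.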