We all remember the great key-value wars of the early 21st century. People, in their tiny dawn of the Internet-era minds, conflated data models, distribution, replication, and stability requirements into a mish-mash of ideologies and technological philosophies.

"Give us replication or... give us an acceptable alternative!"

"If I have to write another schema migration, I swear to Monty, I'm going to become a hardware store day labourer."

"Serialized JSON is the answer to everything!"

"But what if I want to search by a property of the JSON object and not just the id?"

"You just write a map-reduce function and re-reduce until you get your answer."

"Did you just tell me to go computationally fuck myself?"

"I believe I did, Bob. I believe I did." [q]

The core of their debate came down to representation versus distribution. The squealers claimed to have perfect representation, but hacked together distribution methods. The kvalers claimed to have perfect distribution, but limited representation models.

Trees sitting next to tables

What was the problem with representation? Why is perfect representation diametrically opposed to perfect distribution? Perfect representation came down to one issue: asking for ranges of things. Who is between 18 and 29? Which orders completed between last Thursday and today?

The squealers worshiped at the altar of their B+ tree, offering up mathematically perfect schemas in return for logarithmic access times to their data. Squealers ignored the one downside of their Lord of All Data Structures: B+ trees were manipulated in-place on disk, requiring huge amounts of logically contiguous disk space for large datasets. Distributing and replicating a B+ tree across multiple machines was impossible. Squealers took to chopping up their data into smaller and smaller collections until entire copies of B+ trees could be kept on multiple machines. Squealers claimed they never needed cross-table joins in the first place and they actually enjoyed the additional administration of dealing with statement-based replication [a].

The kvalers worshiped at the altar of their hash table, offering up non-colliding keys in return for separation of data enabling perfect distribution and replication. The kvalers ignored the one downside of their Lord of All Data Structures: hash tables provide no data locality. Kvalers happily maintained hand-crafted indexes if range queries or search-within-the-dataset queries were required. Kvalers pretended to never have heard of ad-hoc queries or business intelligence requirements.

Who put my tree on the table?

Trees are made of nodes. Nodes must have names to be found. Traditionally, B+ trees were stored in files with nodes being referenced by positions within the B+ tree file. Arbitrarily chopping a B+ tree in half and distributing it to another node was not practical. Even with a B+ tree designed to reference outside its own file [1], the entire collection of mini-B+ trees needed to be present for operations to complete, resulting in no advantage in distribution or replication.

The B+ tree scalability problem came down to one issue: how can you name a node and later reference it independent of its storage location?

One mild summer's day in Goolgonia, historically California, someone stumbled upon the answer: a new way of naming nodes [2]. What if, instead of offsets within a file, a Central Naming Service could provide a globally unique node identifier serving as a recall key for a node's location? Instead of storing nodes within a sequentially allocated data file, B+ tree nodes could be stored on anything accepting the name of the node and storing its contents.

The kvalers squealed.

The squealers begged the kvalers, "Please, may we have access to your infinite-storage-capacity, automatically-redundant-with-self-managing-failover, and incorruptible storage system?" The response was delivered with a wry smile: "We've been waiting for you."

In a sigh of ecstasy, the squealers and kvalers realized what was born. The squealer's toes curled at the thought of being free from routine sharding, free from identifying hot spots and spinning off copies, free from arbitrarily chopping tables in half, and free from single points of failure. The kvaler's eyes rolled upwards while synthesizing thoughts of sequential data access, retrieving records by ranges, and iterating over database snapshots just by holding on to a root node.

After an exhaustive journey, the kvalers and squealers drifted off to a sound sleep that night. The story of the squealers versus kvalers came to a close, but the journey of the sqvalers had just begun.

wtfbbq? (aka Back to Life, Back to Reality)

Over the past few years I've been looking for a stable, low-maintenance, scalable sorted data platform requiring zero up-front decisions. I looked high and low to no avail. I even went as far as writing a single-purpose BigTable clone called TinyTable for my own use, but it still had annoying edge cases when I overflowed a single disk or RAID array.

A potential solution dawned on me after looking at an R-tree implementation using CouchDB's append-only storage format. To store a node, you pass in the data and get back its position. To retrieve a node, you pass in a position and get back the data. It's that simple.

All file writing and reading is completely encapsulated within the couch_file module. The two storage calls are straightforward. No seeking, block management, or any other file-level details are leaked through the interface.

Important point: Position is an opaque type. The code doesn't care if Position is an integer or a tuple or a binary. Position is passed through to the storage layer on lookups and returned to the caller on appends.
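For concreteness, the whole storage contract can be sketched in two lines of Erlang. The call names match couch_file's actual append_term/pread_term interface; the surrounding comments are mine:

```erlang
%% The entire storage contract couch_file exposes to the B+ tree code:
{ok, Position} = couch_file:append_term(Fd, Term),    % store a term, learn its name
{ok, Term2}    = couch_file:pread_term(Fd, Position). % present the name, get the term back
%% Position flows through the tree code untouched, so the storage layer
%% is free to define it as an integer offset, a tuple, or a binary key.
```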

Let's do this

To make CouchDB store documents remotely, we only have to replace the implementation of the two functions listed above. For our remote storage let's use Riak as our Key-Value store (because it's awesome). CouchDB persists Erlang terms to disk and Riak persists Erlang terms to disk. We get to remove redundant code from CouchDB since Riak is converting terms for us. Riak also automatically replicates everything we store, easily handles adding more machines to increase capacity, and deals with failures transparently.

append_term's implementation becomes riak_write(Bucket, Key, Value).

pread_term's implementation becomes riak_read(Bucket, Key).

We replaced Fd with a Bucket name here. Bucket is the database filename as if it were stored locally (e.g. <>).
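A sketch of the two replacement implementations, assuming the native riak_kv client API of that era (riak:local_client/0, riak_object:new/3, parameterized-module put/get calls); exact names and arities vary by Riak version:

```erlang
%% Hypothetical sketch: store and fetch CouchDB nodes through Riak.
append_term(Bucket, Term) ->
    Key = term_to_binary({node(), now()}),              % globally unique node name
    {ok, Client} = riak:local_client(),
    ok = Client:put(riak_object:new(Bucket, Key, Term), 2),
    {ok, Key}.                                          % Key plays the role of Position

pread_term(Bucket, Key) ->
    {ok, Client} = riak:local_client(),
    {ok, Obj} = Client:get(Bucket, Key, 2),
    {ok, riak_object:get_value(Obj)}.
```

Note the symmetry with the original calls: the caller still hands over a term and gets back an opaque name, only now the name is a Riak key rather than a file offset.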

We also have to generate unique keys for our nodes. The CouchDB file format is append-only, so we never have to worry about updating a value once it is stored [e]. Where does Key come from? Key is equivalent to Position in the original couch_file calls. Key is now simply {node(), now()}. That's the Erlang way of making a globally unique value within your cluster. node() is the name of your VM (unique within a cluster) and now() returns the current time (guaranteed to be monotonically increasing per VM) [k].

We turn our nice {node(), now()} into a binary and use it as a key: term_to_binary({node(), now()}), et voilà, we have a globally unique key to store values in Riak [m].
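A quick shell session showing the property we rely on (pure stdlib Erlang, nothing Riak-specific):

```erlang
%% node() is constant for a VM and now() is strictly increasing within
%% that VM, so two keys generated on one node can never collide.
K1 = term_to_binary({node(), now()}),
K2 = term_to_binary({node(), now()}),
true = (K1 =/= K2),       % distinct keys
true = is_binary(K1).     % ready to use as a Riak key
```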

All the tricky couch_file page alignment code is now gone. The only files written to the local file system are couch.uri and couch.log.

We did this

Does it work? Sure, it works. Check out my CouchDB Riak Integration for yourself [p]. Modify your bin/couchdb script to include the Riak client libraries (riak_kv) and connect to a local Riak cluster:

-pa ../riak/deps/riak_kv/ebin -pa ../riak/deps/riak_core/ebin -setcookie riak -name 'couchdb@127.0.0.1'

Insert those after $ERL_START_OPTIONS and before -env ERL_LIBS.

What did it take to convert CouchDB to use completely remote storage for documents? It took: 7 files changed, 87 insertions(+), 502 deletions(-)

See the commit log for quirks, caveats, and how to use the admin interface properly in my proof-of-concept implementation.

Patches welcome? Maybe? To make this mergeable upstream, we need a way to have both local-disk and Riak storage. The original couch_file needs to be restored, with per-database flags selecting remote versus local storage. A dozen other quirks and deficiencies would have to be fixed and accounted for as well. It makes more sense to use the CouchDB B+ tree code (and/or the R-tree code) with a modified couch_file to create an independent data platform [L].

[a]: Yes, this is the shitty MySQL way of doing things. Let's not tell them about PostgreSQL's log shipping replication.

[1]: e.g. Replacing your node reference from {Offset} to {Filename, Offset} still means all referenced files must be replicated.

[2]: ANWONN: The most advanced, world-changing, node naming scheme to ever be conceived by an almost god-like being. Do you question it? Here, read a 600 page autobiography about how amazing, unique, and intelligent I am.

[3]: This provides more location independence than the Sequential Append-Only Write-Once-Read-Many Just-Try-To-Corrupt-Me-I-Dare-You B+ tree file format. [[Yes, this footnote isn't referenced in the article above. I'm not sure when the source to the footnote got cut, but I like the footnote anyway.]]

[k]: Yeah, I just re-wrote Twitter's Snowflake in one line of Erlang. And we can deal with unsigned longs too because we're Not Java. Let's ignore the fact that {node(), now()} is about 300 bits instead of 64 bits.

[m]: Yes, you can (and should) make it much more efficient storage-wise. I'm going for concise and readable here.

[p]: The CouchDB code tree is very messy. It really needs to be cleaned up. I wish they would discover Rebar already. I don't give a shit about the bulk of it, still, I keep it professional.