Let's Build Something Using Amazon's DynamoDB

06 Feb 2012

A couple weeks ago, Amazon released DynamoDB as part of AWS. DynamoDB is a NoSQL database with a focus on scalability, reliability and performance. DynamoDB has generated a lot of excitement, if for no other reason than the fact that Amazon is an authoritative figure in the NoSQL space. Their Dynamo paper, published in 2007, has been exceedingly influential.

DynamoDB At A Glance

The most important thing to understand about DynamoDB is that it doesn't support secondary indexes. Data can only be retrieved by the key. However, there are two types of keys. The first is called a hash. It's a single value and it's what you would normally think of when you are talking about a key. The second is a composite of a hash and a range. This type of key lets you query data by either the hash component or the hash and range component. Additionally, records are automatically sorted by the range component.
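To make the two key types a little more concrete, here is a minimal in-memory sketch of a table with a hash + range key. It only imitates the behavior described above (lookup by hash, optional range condition, rows kept sorted by range). The class and method names are made up for illustration and are not part of any DynamoDB API, and the range is assumed to be numeric.

```javascript
// In-memory model of a table with a composite (hash + range) key.
// This only imitates the described behavior; it is not the DynamoDB API.
class CompositeKeyTable {
  constructor() {
    this.partitions = new Map(); // hash value -> rows sorted by range
  }

  put(hash, range, item) {
    const rows = this.partitions.get(hash) || [];
    // Replace any existing row with the same hash + range, then re-sort.
    const kept = rows.filter((row) => row.range !== range);
    kept.push({ range, item });
    kept.sort((a, b) => a.range - b.range);
    this.partitions.set(hash, kept);
  }

  // Query by hash alone, or by hash plus a lower bound on the range.
  query(hash, minRange = -Infinity) {
    const rows = this.partitions.get(hash) || [];
    return rows.filter((row) => row.range >= minRange).map((row) => row.item);
  }
}

const table = new CompositeKeyTable();
table.put('asset-1', 300, 'third');
table.put('asset-1', 100, 'first');
table.put('asset-1', 200, 'second');
table.put('asset-2', 150, 'other');

console.log(table.query('asset-1'));      // items ordered by range
console.log(table.query('asset-1', 200)); // hash + range condition
```

Note that there is no way to ask this table for "everything with range 150" without knowing the hash, which is exactly the restriction the rest of this post works around.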

I know, that's pretty vague and it sounds a little crazy. If you've never dealt with this sort of system, you might think it far too limiting. You will have to model your data differently, but hopefully when we get our hands dirty, not only will it make sense, but it won't seem so odd.

Beyond this technical point, the draw of DynamoDB is all about the infrastructure. You get fast and reliable performance (historically a major weak point of storage solutions running on AWS/EBS), transparent scalability and reliability via replication. It essentially makes it possible to scale up to extreme levels without having to do anything special.

<Application>

The application that we are going to build is a simple API that can be used to store and retrieve change logs. A change log record would look something like:
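The sketch below is one hypothetical shape for such a record, using the fields this post refers to (id, user, asset, created and changes). The values, and the structure of changes, are assumptions made for illustration.

```javascript
// A hypothetical change log record; all values are made up for illustration.
const changeLog = {
  id: 'log-1',       // will become the hash key of the logs table
  user: 'user-42',   // who made the change
  asset: 'asset-7',  // what was changed
  created: new Date('2012-02-06T12:00:00Z'),
  changes: { title: { from: 'Draft', to: 'Final' } }, // nested object
};

console.log(changeLog);
```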

Once saved, we'll be able to retrieve a change log item by id. We'll also be able to get all the change log items made by a user, or for an asset, or a combination of the two. If you think about it, it's the core of what we'd need if we were building an audit log for a system.

We'll only look at the DynamoDB-related parts, but if you are interested, you can get the working example from github. It's written using node.js + CoffeeScript along with a 3rd party DynamoDB driver which I've contributed to.

Saving Change Logs

The first thing we'll do is save a change log. To do that, we must first create a table in DynamoDB:

The above code creates a table named logs which will use a hash key (as opposed to a hash+range key). We've named the key field id and said it'll be a string. The read and write values have to do with how DynamoDB distributes workload and scales. They are the expected read and write capacity, measured in what Amazon calls capacity units, where one unit is 1KB read or written per second. For the purpose of this post, it really doesn't matter.

The most important line in all of that is ddb.putItem('logs', this.serialize(), {}, callback) which is where the data is actually sent to DynamoDB. putItem can be used to do inserts or upserts, which is what the 3rd parameter controls (we left it blank which makes it default to upsert).

There are a couple things worth taking a good look at. First of all, DynamoDB only supports strings and numbers, which is where the serialize method comes into play. Our created date is converted to a number, and changes is turned into a string (a real app might be interested in storing this as a compressed value). DynamoDB doesn't support embedded objects like some other NoSQL solutions, so changes can't be stored as-is. Besides this, all DynamoDB really cares about is that we provide a field with the name and type that we defined our table key as, which we do with the id field.
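As a rough sketch of that conversion, and assuming a record with id, user, asset, created and changes fields, a serialize function and its inverse might look like the following. The field names are assumptions, and the functions are illustrative rather than the ones from the example app.

```javascript
// Flatten a change log so it only contains strings and numbers.
function serialize(log) {
  return {
    id: log.id,
    user: log.user,
    asset: log.asset,
    created: log.created.getTime(),       // Date -> milliseconds since epoch
    changes: JSON.stringify(log.changes), // nested object -> JSON string
  };
}

// Undo serialize when a record is read back.
function deserialize(item) {
  return {
    id: item.id,
    user: item.user,
    asset: item.asset,
    created: new Date(item.created),
    changes: JSON.parse(item.changes),
  };
}

const log = {
  id: 'log-1',
  user: 'user-42',
  asset: 'asset-7',
  created: new Date('2012-02-06T00:00:00Z'),
  changes: { title: { from: 'Draft', to: 'Final' } },
};

const stored = serialize(log);
console.log(stored);              // only strings and numbers
console.log(deserialize(stored)); // back to a Date and a nested object
```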

Getting a Change Log

Next we want to make it so that change logs can be retrieved by id. So, given the following code:

The first parameter is the name of the table, next is the hash value we want to retrieve. The next two parameters are the range key value (which we'll never have with this table since it uses a hash key only) and an options parameter, to specify things such as which fields to get. Our deserialize method undoes the work serialize did when we first stored our record.

Searching

So far we've kept things simple. Creating a table involved identifying our key, inserting involved sending along a bunch of attributes, and to get a specific item we submitted its id. Few apps can be built with just that functionality. In fact, even for our simple demo app, it's unlikely that we'll ever want individual change logs. Rather, we'll want change logs belonging to an asset, or possibly a user.

To achieve this, we need to maintain our own indexes. An index is nothing more than the value we are indexing and a reference to the record the value belongs to. We can achieve this by using another table. And, while we are at it, it makes sense to get change logs back ordered by creation date. Let's visualize what we are talking about:

Now, if we set the key of logs_by_asset to be a hash on asset and a range on created, we'll be able to find change logs by asset. How? Well, first we'll get all the ids which belong to a certain asset via logs_by_asset, then we can retrieve those records by id in logs.

There's a bit going on here. First, we get all the ids for a specific asset. If we wanted to, we could also specify a created value (a table with a composite key can be queried by hash or hash and range). We transform those ids into an array because they come back looking like [{id: '1'}, {id: '2'}, {id: '3'}] and we want to just query via ['1', '2', '3']. Finally we use batchGetItem to get all the change logs that match the ids. As you can probably guess, batchGetItem works a lot like getItem except that it takes an array of keys (in fact, it can even batch get from multiple tables at once).
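The two-step lookup can be sketched without any driver at all. The snippet below uses plain in-memory structures as stand-ins for the logs and logs_by_asset tables. The function names saveLog and findByAsset are hypothetical, created is assumed to be a number, and nothing here calls DynamoDB.

```javascript
// In-memory stand-ins for the two tables; not the DynamoDB API.
const logs = new Map();  // "logs": id -> record (hash key: id)
const logsByAsset = [];  // "logs_by_asset": { asset, created, id } entries

function saveLog(log) {
  logs.set(log.id, log);
  // The index is maintained by hand on every write.
  logsByAsset.push({ asset: log.asset, created: log.created, id: log.id });
}

function findByAsset(asset) {
  // Step 1: query the index by hash (asset), ordered by range (created).
  const entries = logsByAsset
    .filter((entry) => entry.asset === asset)
    .sort((a, b) => a.created - b.created);
  // Step 2: turn [{id: '1'}, {id: '2'}] into ['1', '2'].
  const ids = entries.map((entry) => entry.id);
  // Step 3: the "batch get", fetching each record by id from the main table.
  return ids.map((id) => logs.get(id));
}

saveLog({ id: '2', asset: 'a', user: 'u1', created: 200 });
saveLog({ id: '1', asset: 'a', user: 'u2', created: 100 });
saveLog({ id: '3', asset: 'b', user: 'u1', created: 150 });

console.log(findByAsset('a').map((log) => log.id)); // ids ordered by created
```

The cost of the pattern is visible in saveLog: every write now touches two tables, and nothing keeps them consistent if the second write fails.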

</Application>

The real focus here is to introduce DynamoDB and show how to deal with the restrictions it imposes, namely how to create your own indexes to support more advanced queries. If you also want to query by user you'll need another table, and if you want to query by asset+user you'll need yet another (in this case the hash key can be @asset + ':' + @user). If you want a different sort, you guessed it, you'll need another table.
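For the asset+user case, a sketch of the combined hash value and a hand-maintained index might look like this. The table name logs_by_asset_user and the helper names are hypothetical, and the index is an in-memory stand-in rather than a real table.

```javascript
// Build the hash value for a hypothetical logs_by_asset_user index table.
function assetUserKey(asset, user) {
  return asset + ':' + user;
}

// In-memory stand-in for that index: hash value -> ids.
const logsByAssetUser = new Map();

function indexLog(log) {
  const key = assetUserKey(log.asset, log.user);
  const ids = logsByAssetUser.get(key) || [];
  ids.push(log.id); // assumes logs are indexed in creation order
  logsByAssetUser.set(key, ids);
}

indexLog({ id: '1', asset: 'a', user: 'u1' });
indexLog({ id: '2', asset: 'a', user: 'u2' });
indexLog({ id: '3', asset: 'a', user: 'u1' });

console.log(logsByAssetUser.get(assetUserKey('a', 'u1')));
```

One thing to watch with a joined key like this is the separator: if an asset or user id can itself contain ':', two different pairs could produce the same hash value.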

There's more you can do with DynamoDB, like deleting records, or even doing a linear scan on arbitrary fields (which is expensive and won't scale, so I'm not sure when you'd do it). But understanding that records are retrieved by hash key or hash+range key, and what that means with respect to modeling, is the best place to start.

My Thoughts on DynamoDB

From an infrastructure point of view, DynamoDB is a dream come true. Take everything you know about scaling a database and throw it out. Stop worrying about RAID, or worse, RAIDED EBS, replication, availability zones and so on. I generally like to manage all my own stuff and run my own servers, but there's something simply awesome about DynamoDB's potential.

However, beyond the infrastructure, the actual storage engine leaves a lot to be desired. It's where other NoSQL solutions were 1-2 years ago, which is significant when you consider how fast the field has evolved. The lack of secondary indexes isn't a deal breaker for me, but it's an increasingly rare limitation.

For me, paging records is always a good measure of how helpful a database wants to be. Paging records in SQL Server or Oracle, for example, feels a lot like being given the finger. DynamoDB doesn't fare any better. Commands that can return multiple items take a Limit option, which is good. But for an offset you need to provide an ExclusiveStartKey, which is the last key that you received. Worse, even when you don't provide a limit you might still get a partial result if the full result is too big (>1MB). DynamoDB will let you know this happened by also providing a LastEvaluatedKey in the reply. In other words, if you are hoping for a limit and offset, which I believe every database solution should strive to provide, you'll be as disappointed as I am.
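In practice this means paging is a loop that feeds each LastEvaluatedKey back in as the next ExclusiveStartKey. The sketch below shows that loop against an in-memory stand-in. The queryPage function and the shape of its reply are assumptions made for illustration, not the real protocol or driver.

```javascript
// In-memory stand-in for a multi-item read. Returns at most `limit` items that
// come after `exclusiveStartKey`, plus a lastEvaluatedKey when more remain.
// `limit` must be at least 1.
const rows = [10, 20, 30, 40, 50].map((created) => ({ created }));

function queryPage(limit, exclusiveStartKey) {
  const start = exclusiveStartKey === undefined
    ? 0
    : rows.findIndex((row) => row.created === exclusiveStartKey) + 1;
  const items = rows.slice(start, start + limit);
  const more = start + limit < rows.length;
  return {
    items,
    lastEvaluatedKey: more ? items[items.length - 1].created : undefined,
  };
}

// Keep asking for pages until the reply no longer carries a key.
function fetchAll(limit) {
  const all = [];
  let startKey;
  do {
    const page = queryPage(limit, startKey);
    all.push(...page.items);
    startKey = page.lastEvaluatedKey;
  } while (startKey !== undefined);
  return all;
}

console.log(fetchAll(2).map((row) => row.created));
```

There is no way to jump straight to "page 3" here: getting there means walking through pages 1 and 2 first, or remembering the key where page 2 ended.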

There's also the fact that it only supports strings and numbers and doesn't support embedded objects. This isn't too uncommon, but I think we can agree more type support is better than less.

Then there's the pricing. Billing per write and read capacity units doesn't bother me. Sure, it's ambiguous at first, but I can see how it better reflects Amazon's actual cost than, say, charging $X for Y RAM and Z HDD. They are essentially charging by I/O, which is probably a better all-around measure of CPU, HDD and RAM usage. What does bother me though is that they round up to the nearest 1KB. Now, maybe in a real world app that would just be a blip. However, given that a high number of queries will likely go to a secondary index (and thus only return short ids), I have a feeling it really could add up. It kinda feels like they are providing an inferior experience (lack of secondary indexes), and forcing you to pay more because of it.
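To put a number on the rounding, here is a small sketch that takes the description above at face value: every read is rounded up to the next 1KB. The item sizes are hypothetical and the function is only arithmetic, not a statement of how Amazon's billing is actually implemented.

```javascript
// Capacity units consumed by one read, assuming each item is rounded up to
// the next 1KB as described in this post.
function capacityUnits(itemBytes) {
  return Math.max(1, Math.ceil(itemBytes / 1024));
}

// Hypothetical sizes: a 40-byte index entry and a 1500-byte change log.
console.log(capacityUnits(40));   // 1
console.log(capacityUnits(1500)); // 2
```

Under that assumption, a 40-byte index entry pays for a full unit while using about 4% of it, which is where the "it really could add up" feeling comes from.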

My last point is about the communication protocol. Admittedly, this is something most devs won't have to worry about or even know. I'm quite familiar with the MongoDB and Redis protocols, and I can safely say that, in comparison, I hate the DynamoDB protocol. First of all, even though it's JSON, they've somehow made it feel like XML. Not only is it incredibly verbose, but whenever you send attributes over, you have to encode them as such: {"S": "MyStringValue"} or {"N": "MyNumericalValue"}. If only JSON had a built-in way to distinguish strings from numbers... There are also a couple inconsistencies, which is a shame to see in such a young protocol. These inconsistencies are quite evident in the way errors are handled. I tried to build a local emulator backed by MongoDB for development purposes, but abandoned the project after being frustrated with DynamoDB's error handling.
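For the curious, the attribute encoding can be sketched in a few lines. The function name encodeAttributes is made up for illustration. It only covers the string and number types discussed in this post, and it produces just the attribute envelope, not a full request.

```javascript
// Wrap each attribute in the typed envelope the protocol expects: strings as
// {"S": "..."} and numbers as {"N": "..."} (the number is sent as a string).
function encodeAttributes(item) {
  const encoded = {};
  for (const [name, value] of Object.entries(item)) {
    if (typeof value === 'number') {
      encoded[name] = { N: String(value) };
    } else if (typeof value === 'string') {
      encoded[name] = { S: value };
    } else {
      throw new TypeError('unsupported type for attribute ' + name);
    }
  }
  return encoded;
}

console.log(JSON.stringify(encodeAttributes({ id: 'log-1', created: 1328486400000 })));
// {"id":{"S":"log-1"},"created":{"N":"1328486400000"}}
```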

Ultimately, I think the idea is great, but the execution is a couple years behind what's currently available. The real question is where do they plan on taking it and when do they plan on getting there.