If you’re like a lot of folks, you’re building an application in AWS & using a NoSQL database for persistent data. DynamoDB fits the bill nicely. Little or no ops to worry about, at least in the traditional sense.

However, there are knobs to turn & dials to set. Here are a few you should be thinking about.

1. You can replicate across regions

DynamoDB introduced a feature in 2015 called Streams. If you come from the relational database world, you can think of a stream as a transaction log. It captures before & after images of each item you change. Couple streams with Lambda functions, and you have triggers that can do almost anything you want, including copying each change to a table in another region.
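Here is a minimal sketch of what such a trigger might look like in Python. It assumes the stream is configured to carry both old and new images, and it only builds a plan of what a replica table should receive rather than writing anywhere. The names plan_replication and sample_event are made up for illustration; a real function would apply the plan to a table in another region.

```python
# Sketch of a Lambda handler fed by a DynamoDB stream.
# ASSUMPTION: the stream view type includes both old and new images.
# `plan_replication` is a hypothetical helper, not part of any AWS library.

def plan_replication(event):
    """Turn stream records into a list of (action, key, new_image) tuples."""
    actions = []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        key = change["Keys"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            # The after image is what a replica table should end up holding.
            actions.append(("put", key, change.get("NewImage")))
        elif record["eventName"] == "REMOVE":
            actions.append(("delete", key, None))
    return actions


def handler(event, context):
    # A real function would apply each action to the remote table;
    # this sketch only returns the plan so it can run anywhere.
    return plan_replication(event)


# Hand-written sample in the shape of a stream event (illustrative data).
sample_event = {
    "Records": [
        {
            "eventName": "MODIFY",
            "dynamodb": {
                "Keys": {"id": {"S": "42"}},
                "OldImage": {"id": {"S": "42"}, "title": {"S": "Founder"}},
                "NewImage": {"id": {"S": "42"}, "title": {"S": "Consultant"}},
            },
        },
        {
            "eventName": "REMOVE",
            "dynamodb": {
                "Keys": {"id": {"S": "7"}},
                "OldImage": {"id": {"S": "7"}, "title": {"S": "Intern"}},
            },
        },
    ]
}
```

Calling handler(sample_event, None) returns one put and one delete, which is all a simple replica needs to stay in step with the source table.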

2. You can manage retrieval costs

DynamoDB automatically creates and manages an index on the primary key. But chances are that your application will read data based on other columns too. You can create secondary indexes on these other columns to support those access patterns. Without an index DynamoDB would have to scan every row to find your data. The index can dramatically reduce how much data gets read, which lowers cost and makes retrieval faster too!
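To get a feel for the difference, here is a toy model in plain Python. It is not the DynamoDB API, just an illustration of why an index touches fewer rows than a scan. The data and the function names scan, build_index and query_index are invented for the example.

```python
from collections import defaultdict

# Illustrative rows; "id" plays the role of the primary key.
items = [
    {"id": 1, "city": "Boston", "name": "Ann"},
    {"id": 2, "city": "Austin", "name": "Bob"},
    {"id": 3, "city": "Boston", "name": "Cy"},
    {"id": 4, "city": "Denver", "name": "Dee"},
    {"id": 5, "city": "Austin", "name": "Eve"},
]


def scan(rows, attr, value):
    """Look at every row and keep the matches; returns (matches, rows examined)."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row.get(attr) == value:
            matches.append(row)
    return matches, examined


def build_index(rows, attr):
    """Group rows by the value of one attribute, like a secondary index."""
    index = defaultdict(list)
    for row in rows:
        index[row[attr]].append(row)
    return index


def query_index(index, value):
    """Fetch only the rows filed under `value`; returns (matches, rows examined)."""
    matches = index.get(value, [])
    return matches, len(matches)


city_index = build_index(items, "city")
scan_matches, scan_examined = scan(items, "city", "Boston")
index_matches, index_examined = query_index(city_index, "Boston")
```

Both approaches find the same two Boston rows, but the scan examines all five rows while the index lookup examines only the two it returns. Since you pay for what gets read, that gap is where the savings come from.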

3. You can do SQL-like queries

That’s right, if you thought NoSQL meant no SQL you were only half right. By loading your DynamoDB data into HDFS, you can let Elastic MapReduce (EMR) have at it. That opens the door to using HiveQL to query the data the way you wanted to in the first place.

4. Partitions are handy & useful

By default DynamoDB partitions your data behind the scenes, because that’s what good distributed databases are supposed to do. It uses the partition key portion of the primary key to figure out where the data should go. And much like with Redshift, you have the option of also using a sort key, which keeps items that share a partition key stored together in sorted order. This is important. Going across those different partitions brings a lot of latency costs that will surprise you.
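A rough sketch of the idea in Python follows. The hash function DynamoDB really uses is internal, so MD5 stands in for it here, and the partition count, the field names and the helpers partition_for and place are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; the real service picks this for you


def partition_for(partition_key, num_partitions=NUM_PARTITIONS):
    """Hash the key to choose a partition (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def place(rows):
    """File each row under its partition, then order rows by key and sort key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_for(row["customer"])].append(row)
    for stored in partitions.values():
        # Rows with the same customer end up adjacent, ordered by date.
        stored.sort(key=lambda row: (row["customer"], row["order_date"]))
    return partitions


orders = [
    {"customer": "alice", "order_date": "2016-03-02", "total": 40},
    {"customer": "bob", "order_date": "2016-01-15", "total": 12},
    {"customer": "alice", "order_date": "2016-01-09", "total": 75},
    {"customer": "carol", "order_date": "2016-02-20", "total": 18},
    {"customer": "alice", "order_date": "2016-02-11", "total": 5},
]

placed = place(orders)
alice_partition = partition_for("alice")
alice_orders = [r for r in placed[alice_partition] if r["customer"] == "alice"]
```

Every one of alice's orders lands in the same partition, already in date order, so reading her history never has to hop between partitions.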

I was attending an excellent talk recently called Data at Scale, part of the Database Month series that Eric Benari hosts. In it Mark Uhrmacher presented some phenomenal solutions which worked for flash sale site ideeli. They allowed the company to support its incredible business model, where 15% of traffic would happen in 15 minutes every day. He called it a “self-imposed denial of service attack”. Interesting analogy.

What occurred to me though, is that a lot of companies and startups struggle to understand which database solutions will work for them, and what the strengths and weaknesses of each are, and further what tradeoffs they’ll grapple with.

One concept that we hear a lot is “eventually consistent”. Many of the new NoSQL databases achieve their speed & availability this way. But what’s it all about?

Let’s change a smartphone contact

I’m sure you have a smartphone in your pocket, and for demonstration’s sake I’ll use the iPhone configured with iCloud.

Let’s go ahead and dial up your *OWN* contact card. Click “Edit” and go ahead and change something. Let’s change your title to “rock star”. Now click “Done”. We’ll wait a minute. Now go to your desktop and open up Contacts. Scroll through to your contact and verify that the Title field now shows “rock star”.

How does all this happen? When you click the “Done” button, the iPhone sends changes up to iCloud. iCloud then lets your laptop know a change has happened and those then sync up.

Now let’s run through the same exercise, but change it in two places. We’ll change the smartphone contact to “Founder” and the desktop Contacts record title to “Consultant”. Wait a little bit and you’ll notice they will both eventually show “Consultant”.

How long were laptop & phone out of sync?

As you probably noticed, iCloud seems to lean in favor of the desktop client. It’s not clear to me what rules it uses here, nor does it seem to be configurable. Nevertheless eventually both the desktop and smartphone will have the same contact card for you. Quite a feat of magic!

Handling collisions

There is only one *YOU* and presumably your digital rolodex reflects that too. You have one and only one contact card. Or do you? As far as these digital tools are concerned there are actually THREE! One on your desktop, one in iCloud and one on your phone. Each time you make a change in any of those places, it syncs *UP* to iCloud and then down to the other devices.

Collisions happen if you make changes in two places. Imagine if you’re a road warrior and your laptop was offline for some days, or your smartphone for that matter. In those cases that syncing would happen much later, and collisions more likely.
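One common way to settle a collision is "last write wins": keep whichever edit carries the latest timestamp. The sketch below shows that rule in Python. To be clear, this is an assumption for illustration and not a description of how iCloud actually decides; the Edit class, the resolve function and the timestamps are all invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    device: str
    title: str
    timestamp: float  # seconds on a clock all devices are assumed to share


def resolve(edits):
    """Last write wins: pick the newest edit, device name breaks exact ties."""
    return max(edits, key=lambda edit: (edit.timestamp, edit.device))


phone_edit = Edit("phone", "Founder", 100.0)
desktop_edit = Edit("desktop", "Consultant", 101.5)

winner = resolve([phone_edit, desktop_edit])
```

Here the desktop edit is newer, so "Consultant" wins no matter which order the edits arrive in. The rule is simple, but notice that the phone's change is silently thrown away, and it only works as well as the clocks agree.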

In the high frequency world of online databases

With online databases, all of this becomes vastly more complex. Web based applications may have 100,000 simultaneous users. Some may be coming from IMEA while others come from the Americas. It gets pretty darn complex when you have databases in each of those regions.

We deploy applications this way so that one datacenter, say the one in the East Coast region, can fail while all the others still operate. They can still change data, read and write, without being impacted by the New York outage.

Once that datacenter is restored, the databases will then sync up and reconcile missing data.

MariaDB and Amazon RDS read replicas

MySQL and its variants, such as MariaDB, Percona and Amazon RDS, can do something like this with read replicas. The read-only copies of the database are asynchronous and take time to catch up to changes. You can have the read-only copies in different regions.
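The catching-up part is easy to picture with a small model. This is a toy in plain Python, not how MySQL replication is implemented: the primary keeps an ordered log of changes and the replica replays it whenever it gets the chance. The class and method names are invented for the example.

```python
class Primary:
    """Accepts writes and records each one in an ordered change log."""

    def __init__(self):
        self.data = {}
        self.log = []  # list of (key, value) changes, oldest first

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class ReadReplica:
    """Read-only copy that replays the primary's log some time later."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # how many log entries have been replayed so far

    def read(self, key):
        return self.data.get(key)

    def lag(self):
        """Number of changes the replica has not seen yet."""
        return len(self.primary.log) - self.applied

    def catch_up(self, max_entries=None):
        """Replay pending changes, optionally only the first few."""
        pending = self.primary.log[self.applied:]
        if max_entries is not None:
            pending = pending[:max_entries]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)


primary = Primary()
replica = ReadReplica(primary)
primary.write("title", "Founder")
primary.write("title", "Consultant")
stale_read = replica.read("title")  # replica has replayed nothing yet
lag_before = replica.lag()
replica.catch_up()
fresh_read = replica.read("title")
```

Until catch_up runs, a read from the replica returns old data (here, nothing at all) even though the primary already holds "Consultant". That window is the replication lag your application has to tolerate.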

Although you can try to do the same for writes by sharding your MySQL instances, this starts to get very messy very fast. Imagine backing up 10 shards, 10x the complexity, and even more when you want to go and do a restore.

Amazon’s DynamoDB

Amazon’s DynamoDB is a technology based around the original Dynamo whitepaper, which attempts to solve a whole class of problems by relaxing consistency constraints in favor of eventual consistency.

What you get is more availability. It’s hard for the whole cluster to go down. That’s great for applications because they can continue to operate if one or more nodes fail. It also scales writes, which is a sort of holy grail in the database world as it’s typically hard to do.

But remember all this comes at a cost. Traditionally scaling writes is hard to do because all changes are kept in one place. You maintain a single authoritative master. If you want to imagine why this matters, think back to our smartphone example. We changed our contact card on our phone and our desktop at the same time. One of those two changes won the battle. But that’s a case where we’re not overly concerned.

If you imagine a bank doing the same thing, and you wire $1000 via phone and desktop, you can quickly see that there is a whole class of applications that won’t be happy with eventual consistency. Your web application may be one of those. Or it may not. Consider carefully before you go with Amazon RDS or DynamoDB as your datastore.
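The bank scenario can be sketched in a few lines of Python. This is a deliberately simplified model, not any real banking or database system: two copies of the account each accept a withdrawal against their own view of the balance, then reconcile afterward. A single authoritative master is shown for contrast. All class and function names are invented.

```python
class Replica:
    """One copy of the account that accepts withdrawals on its own."""

    def __init__(self, balance):
        self.balance = balance
        self.pending = []  # withdrawals accepted here, not yet shared

    def withdraw(self, amount):
        # The check uses only this copy's balance, which may be stale.
        if amount > self.balance:
            return False
        self.balance -= amount
        self.pending.append(amount)
        return True


def reconcile(starting_balance, replicas):
    """Apply every accepted withdrawal once to get the true balance."""
    total = sum(sum(replica.pending) for replica in replicas)
    return starting_balance - total


class SingleMaster:
    """One authoritative copy: every withdrawal sees the latest balance."""

    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if amount > self.balance:
            return False
        self.balance -= amount
        return True


# Eventually consistent: phone and desktop each talk to a different copy.
phone = Replica(1000)
desktop = Replica(1000)
phone_ok = phone.withdraw(1000)
desktop_ok = desktop.withdraw(1000)
final_balance = reconcile(1000, [phone, desktop])

# Single master: the second wire is checked against the updated balance.
master = SingleMaster(1000)
first_ok = master.withdraw(1000)
second_ok = master.withdraw(1000)
```

With two independent copies both wires are accepted and the reconciled balance goes negative, while the single master accepts the first and rejects the second. That is the tradeoff in miniature: the replicas stay available, but the single master stays correct.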