Rate-Limited Scans in Amazon DynamoDB

Today we’re lucky to have another guest post by David Yanacek from the Amazon DynamoDB team. David is sharing his deep knowledge of Amazon DynamoDB to help explain how to manage performance and throughput usage on your DynamoDB tables.

When you scan your table in Amazon DynamoDB, you should follow the DynamoDB best practices for avoiding sudden bursts of read activity. You may also want to cap a background Scan job at a small share of your table’s provisioned throughput, so that it doesn’t interfere with your more important operations. Fortunately, the Google Guava libraries for Java include a RateLimiter class, which makes it easy to limit the amount of provisioned throughput you use.

Let’s say that you have an application that scans a DynamoDB table once a day in order to produce reports, take a backup, compute aggregates, or do something else that involves scanning the whole table. It’s worth pointing out that Amazon DynamoDB is also integrated with Amazon Elastic MapReduce (EMR) and with Amazon Redshift. These integrations let you export your tables to other locations, like Amazon S3, or perform complex analytics and queries that DynamoDB does not natively support. However, it’s also common to do this sort of scan activity in the application itself instead of using EMR or Redshift, so let’s go over the best practices for doing this scan without interfering with the rest of the application.

To illustrate, let’s say that you have a table that is 50 GB in size and is provisioned with 10,000 read capacity units per second. Assume that you will perform this scan at night, when normal traffic to your table consumes only 5,000 read capacity units per second. This gives you plenty of extra provisioned throughput for scanning your table, but you still don’t want the scan to interfere with your normal workload. If you allow your scan to consume 2,000 read capacity units, it will take about an hour to complete, according to the following calculation: an eventually consistent read of 4 KB costs 0.5 read capacity units, so each unit covers 8 KB of scanned data, and scanning 50 GB (52,428,800 KB) costs 52,428,800 ÷ 8 = 6,553,600 read capacity units. At 2,000 units per second, the scan takes about 3,277 seconds, or roughly 55 minutes.
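The same arithmetic can be sketched in code; the 50 GB table size and the 2,000-unit scan budget are the figures from the example above:

```java
public class ScanDurationEstimate {
    // Seconds needed to scan a table of tableSizeGb at unitsPerSecond read
    // capacity units, using eventually consistent reads (0.5 units per 4 KB,
    // which means each read capacity unit covers 8 KB of scanned data).
    static double scanSeconds(double tableSizeGb, double unitsPerSecond) {
        double tableSizeKb = tableSizeGb * 1024 * 1024;
        double totalUnits = tableSizeKb / 8.0; // 8 KB scanned per unit
        return totalUnits / unitsPerSecond;
    }

    public static void main(String[] args) {
        // 50 GB table, 2,000 units per second budgeted for the scan:
        double seconds = scanSeconds(50.0, 2000.0); // 3,276.8 seconds
        System.out.printf("%.0f minutes%n", seconds / 60.0); // about 55 minutes
    }
}
```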

To make the most of your table’s provisioned throughput, you’ll want to use the Parallel Scan API operation so that your scan is distributed across your table’s partitions. But be careful that your scan doesn’t consume all of your table’s provisioned throughput and cause the critical parts of your application to be throttled. To avoid throttling, you need to rate limit your client application, and Guava’s RateLimiter class makes that easy:
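The loop might look like the following sketch, which uses Guava’s RateLimiter and the AWS SDK for Java (v1); the table name ProductCatalog and the processRecords helper are placeholders:

```java
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ReturnConsumedCapacity;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;
import com.google.common.util.concurrent.RateLimiter;

public class RateLimitedScan {
    public static void main(String[] args) {
        AmazonDynamoDB dynamodb = AmazonDynamoDBClientBuilder.defaultClient();

        // Limit the scan to 25.0 read capacity units per second.
        RateLimiter rateLimiter = RateLimiter.create(25.0);

        // Pagination token for looping through each "page" of results.
        Map<String, AttributeValue> exclusiveStartKey = null;

        // Start by consuming 1 permit; we don't yet know what a page costs.
        int permitsToConsume = 1;

        do {
            // Pause until enough throughput has "recharged".
            rateLimiter.acquire(permitsToConsume);

            // Scan one page, bounding the page size with Limit and asking
            // DynamoDB to report how much capacity the request consumed.
            ScanRequest scan = new ScanRequest()
                    .withTableName("ProductCatalog") // placeholder table name
                    .withLimit(100)
                    .withReturnConsumedCapacity(ReturnConsumedCapacity.TOTAL)
                    .withExclusiveStartKey(exclusiveStartKey);
            ScanResult result = dynamodb.scan(scan);
            exclusiveStartKey = result.getLastEvaluatedKey();

            // Record what this page cost, so the next acquire() matches it
            // (minus the 1 permit we already took before the request).
            double consumedCapacity =
                    result.getConsumedCapacity().getCapacityUnits();
            permitsToConsume = Math.max(1, (int) (consumedCapacity - 1.0));

            // Process this page of results (placeholder helper).
            processRecords(result);
        } while (exclusiveStartKey != null);
    }

    private static void processRecords(ScanResult result) {
        // application-specific processing goes here
    }
}
```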

The preceding code example limits the consumed capacity to 25.0 read capacity units per second, as determined by the following algorithm:

1. Initialize a RateLimiter object with a target rate of 25.0 capacity units per second.

2. Initialize a pagination token to null. We use this token for looping through each “page” of the Scan results.

3. Acquire read capacity units from the rate limiter. The first time through, we consume “1” because we don’t know how much throughput each “page” of the scan will consume. This pauses the application until we have “recharged” enough throughput.

4. Perform the scan, passing in the ExclusiveStartKey, and also a Limit. If unbounded, a single page of the scan can read up to 1 MB of data and consume 128 read capacity units, which could cause an uneven workload on the table. Also pass “TOTAL” in ReturnConsumedCapacity so that DynamoDB returns the amount of throughput consumed by the request.

5. Record the amount of consumed throughput, so that next time around the loop we ask for the right number of permits from the rate limiter.

6. Process the results of that “page” of the scan.

The preceding algorithm shows a good basic approach to scanning a table “gently” in the background without interfering with production traffic. However, it could be improved upon. Here are a few other best practices you could build into such a background scan job:

Parallel scan – To distribute the workload uniformly across the partitions of the table, pass the Segment and TotalSegments parameters into the Scan operation. You can use multiple threads, processes, or machines to scale out the scan work on the client side.
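For instance, each worker in a parallel scan builds a request for its own segment. This sketch assumes the AWS SDK for Java (v1); the segment count, page size, and table name are illustrative:

```java
import com.amazonaws.services.dynamodbv2.model.ScanRequest;

public class ParallelScanSegments {
    // Split the scan into 4 segments, e.g. one per worker thread.
    static final int TOTAL_SEGMENTS = 4;

    // Each worker scans its own segment (numbered 0 to TOTAL_SEGMENTS - 1),
    // so the work is distributed across the table's partitions.
    static ScanRequest requestForSegment(int segment) {
        return new ScanRequest()
                .withTableName("ProductCatalog") // placeholder table name
                .withTotalSegments(TOTAL_SEGMENTS)
                .withSegment(segment)
                .withLimit(100);
    }
}
```

Each worker would then run the rate-limited scan loop on its own segment, ideally with its own share of the overall rate limit.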

Estimating page sizes – The code above uses a limit of “100” on every page of the scan. A more sophisticated approach could involve computing a Limit based on the throughput consumed by each page of the scan. Ideally, each page would consume a fairly small number of read capacity units so that you avoid sudden bursts of read activity.
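One possible heuristic, not from the original code, is to scale the next page’s Limit by how far the last page’s consumed capacity diverged from a target cost:

```java
public class PageSizeEstimator {
    // Aim for each page to cost roughly this many read capacity units.
    static final double TARGET_UNITS_PER_PAGE = 10.0;

    // Scale the previous Limit by the ratio of target to actual cost.
    // E.g. if the last page of 100 items cost 20 units but we want pages
    // costing 10 units, try a Limit of 50 items next time.
    static int nextLimit(int previousLimit, double consumedUnits) {
        if (consumedUnits <= 0) {
            return previousLimit; // nothing consumed; keep the page size
        }
        int next = (int) (previousLimit * TARGET_UNITS_PER_PAGE / consumedUnits);
        return Math.max(1, next); // always ask for at least one item
    }
}
```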

Rounding to 4 KB boundaries – Every 4 KB of data scanned consumes 0.5 read capacity units (0.5 and not 1.0 because Scan uses eventually consistent reads). Therefore, if you specify a Limit that results in scanning a size that isn’t evenly divisible by 4 KB, you waste some throughput. Ideally, the algorithm estimates how many items fit into a 4 KB chunk and adjusts the Limit accordingly.
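A sketch of that adjustment: estimate the average item size from the pages seen so far, then pick a Limit that scans a whole number of 4 KB chunks. The numbers here are illustrative:

```java
public class ChunkAlignedLimit {
    // Choose a Limit so the scanned data lands close to a whole multiple
    // of 4 KB. avgItemSizeBytes would be estimated from previous pages;
    // chunks is how many 4 KB chunks (0.5 read capacity units each) we
    // want the page to cost.
    static int limitForChunks(double avgItemSizeBytes, int chunks) {
        double itemsPerChunk = (4 * 1024) / avgItemSizeBytes;
        return Math.max(1, (int) (itemsPerChunk * chunks));
    }
}
```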

Recording progress – If the server process were to crash, or if errors occurred beyond the automatic retries in the SDK, we want to resume where we left off the next time around. Imagine a two-hour scan job crashing when it is 99% complete. You could build a “stateful cursor” in a DynamoDB table by saving the LastEvaluatedKey in an item after every page. Be careful, though: this saves only how far the scan got, so your application must be able to deal with processing a page more than once.
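As a sketch, the checkpoint could be a single item per scan job, written after each page. This assumes the AWS SDK for Java (v1); the ScanCheckpoints table, its key schema, and the idea of serializing the LastEvaluatedKey (for example, as JSON) are all illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

public class ScanCheckpoint {
    // Save how far the scan has gotten. Processing must stay idempotent,
    // because a crash can replay the last checkpointed page.
    static void saveProgress(AmazonDynamoDB dynamodb, String jobId,
                             String serializedLastEvaluatedKey) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("JobId", new AttributeValue().withS(jobId));
        item.put("LastEvaluatedKey",
                 new AttributeValue().withS(serializedLastEvaluatedKey));
        dynamodb.putItem("ScanCheckpoints", item); // illustrative table name
    }
}
```

On restart, the job would read this item back, deserialize the key, and pass it as the ExclusiveStartKey of its first Scan request.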

The Scan operation in DynamoDB is useful and necessary for performing various occasional background operations. However, applications that perform scans should do so by following the DynamoDB best practices. Hopefully, Google Guava’s RateLimiter makes doing so a bit easier. Also, you might want to check out our earlier blog post on using Google Guava’s Map builder API for writing shorter code when working with maps in the AWS SDK for Java.