Another Octopress blog about programming and infrastructure.

Analysing DynamoDB Index Usage in Hive Queries

May 30th, 2015

Elastic Map Reduce allows you to conveniently run SQL-like queries against DynamoDB using Hive.
This overcomes many of the limitations of the built-in DynamoDB query functionality and makes it significantly
more useful for storing raw analytical data.

While the abstraction provided by this handler is pretty good, it is still subject to the same underlying throughput
and indexing limitations faced when accessing data through the DynamoDB API directly. In particular, access efficiency
is extremely sensitive to the use of appropriate indexes - full table scans are both slow and expensive.

The documentation provides some guidance
with regard to performance optimisation, however it does not go into how the handler maps a Hive query to
a DynamoDB scan or query, nor under what circumstances indexes will be used to avoid scanning the entire table.

In this blog post you’ll find several Hive queries run against an example DynamoDB table, along with the resulting
DynamoDB request to observe which indexes are used.

No luck on this one either, it doesn’t recognise that the only key value can be 3 - although I’m not too surprised about that.
Be sure that your key values are explicit enough for the handler to recognise.

Surprisingly only one scan request was made as far as I could tell, so while unfortunately it wasn’t able to issue
this as two queries at least it didn’t issue two scans.

Conclusion

Unfortunately it seems that neither global secondary indexes or local indexes are supported, however scenarios involving
a query on a single primary key are recognised pretty well. This makes it practical to use a primary key as a method
of partitioning your data to avoid EMR queries taking longer over time as the table grows.

With that in mind you may also be able to design queries which avoid a full scan but still achieve the same outcome.

It’s also worth noting that filtering on the DynamoDB side is never used, although that isn’t a huge issue seeing
as throughput consumption is calculated before any filters are applied.

You can also use the EXPLAIN Hive command to give you some clue about how a query will be executed, however as far as
I’m aware the raw DynamoDB request isn’t exposed in either the EXPLAIN output or the query logs. I’d be glad to know if there
is a more convenient way to view the raw DynamoDB request than inspecting the network traffic.