Friday, March 2, 2012

Cassandra Triggers for Indexing and RDBMS Synchronization

We love Cassandra as a data store, but unfortunately it doesn't support all of our use cases natively. We need to support ad hoc structural queries, online analytics, and full-text searches of the data stored in Cassandra. These use cases are best supported by other storage mechanisms: indexes for search and RDBMS + BI for reporting and analysis.

Initially we took a batch approach to the problem, relying on Hadoop and map/reduce jobs to keep the external systems up to date. We would run map/reduce jobs over column families to bulk update the external systems. This had obvious drawbacks. Until the batch process completes, the index and the RDBMS are out of sync with Cassandra. Additionally, each job would scan large portions of the column family even though only a small number of records had changed.

To keep the other systems synchronized, we could have complicated the Cassandra clients by embedding the logic to orchestrate updates to all of the relevant systems, but that seemed like a nightmare. Instead, we decided to go for real-time, trigger-like functionality. This takes the burden off the client and lets us keep the other systems in sync in near real time.

We implemented our own trigger mechanism using Aspect-Oriented Programming (AOP). Our mechanism is roughly based on Jonathan Ellis's Crack-Smoking Commit Log (CSCL). For each column family mutation, we write an entry to a commit log. The log entries are then processed asynchronously by the triggers. Once a trigger executes successfully, the log entry is removed.
We've released the project on GitHub:
https://github.com/hmsonline/cassandra-triggers
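
To make the commit-log flow described above a little more concrete, here is a rough sketch in Python. It is not code from the project: the names (CommitLog, logged, process_log, index_trigger) are hypothetical, a decorator stands in for the AOP advice, the log lives in memory, and the log is drained by an explicit call rather than by an asynchronous background worker. It also assumes the log entry is written after the mutation succeeds, which is one possible ordering.

```python
import itertools
from collections import OrderedDict


class CommitLog:
    """In-memory stand-in for the trigger commit log (hypothetical)."""

    def __init__(self):
        self._entries = OrderedDict()
        self._ids = itertools.count(1)

    def append(self, column_family, row_key, columns):
        entry_id = next(self._ids)
        self._entries[entry_id] = (column_family, row_key, dict(columns))
        return entry_id

    def pending(self):
        # Snapshot, so entries can be removed while iterating.
        return list(self._entries.items())

    def remove(self, entry_id):
        del self._entries[entry_id]


def logged(log):
    """Decorator playing the role of AOP advice around a mutation."""
    def wrap(mutate):
        def advised(column_family, row_key, columns):
            result = mutate(column_family, row_key, columns)
            # Assumption: record the mutation only after it succeeds.
            log.append(column_family, row_key, columns)
            return result
        return advised
    return wrap


def process_log(log, triggers):
    """Run the triggers for each pending entry.

    An entry is removed only if every trigger for its column family
    succeeds; otherwise it stays in the log for a later retry.
    """
    for entry_id, (column_family, row_key, columns) in log.pending():
        try:
            for trigger in triggers.get(column_family, []):
                trigger(row_key, columns)
        except Exception:
            continue  # keep the entry so it can be retried
        log.remove(entry_id)


# Usage: a dict stands in for Cassandra, another for the external index.
log = CommitLog()
store = {}
index = {}


@logged(log)
def insert(column_family, row_key, columns):
    store.setdefault((column_family, row_key), {}).update(columns)


def index_trigger(row_key, columns):
    index[row_key] = dict(columns)


insert("users", "u1", {"name": "Ada"})
process_log(log, {"users": [index_trigger]})
```

Because an entry is deleted only after its triggers succeed, a trigger that fails (say, because the index is down) leaves the entry in the log for another pass. The flip side is that a trigger may see the same entry more than once, so triggers in a design like this should be idempotent.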

The design is certainly heavy and the documentation is still a bit rough around the edges, but it's a small amount of code and it is working like a champ.
We've set up installation and configuration instructions. Let us know if you have any trouble getting started.