Thursday, November 8, 2012

Big Data Quadfecta: (Cassandra + Storm + Kafka) + Elastic Search

In my previous post, I discussed our BigData Trifecta, which includes Storm, Kafka and Cassandra. Kafka played the role of our work/data queue. Storm filled the role of our data intake / flow mechanism, and Cassandra our system of record for all things storage.

Cassandra is fantastic for flexible, scalable storage, but unless you purchase Datastax Enterprise, you're on your own for unstructured search. Thus, we wanted a mechanism that could index the data we put in Cassandra.

Initially, we went with our Cassandra trigger mechanism (https://github.com/hmsonline/cassandra-indexing), connected to a SOLR backend. That was sufficient, but as we scale our use of Cassandra, we anticipate a much greater load on SOLR, which means the additional burden of managing master/slave relationships. Trying to get ahead of that, we wanted to look at alternatives.

We had evaluated Elastic Search (ES) back in mid-2011, before choosing SOLR. ES was better in almost every aspect: performance, administration, scalability, etc. BUT, it still felt young, and finding commercial support for ES was difficult compared to SOLR.

We now have Storm solidly in our technology stack. With Storm acting as our intake mechanism, we decided to move away from the trigger-based mechanism and instead orchestrate the data flow between Cassandra and ES using Storm.

We simply tacked an ES bolt onto the end of our Storm topology, and with little effort we have an index of all the data we write to Cassandra.

For the bolt, we implemented the same "mapper" pattern that we put in place when we refactored the Storm Cassandra bolt. To use the bolt, you just need to implement TupleMapper, which has the following methods:

Similar to the Cassandra bolt, where you map a tuple into a Cassandra row, here you simply map the tuple to a document that can be posted to Elastic Search (ES). ES needs four pieces of information to index a document: the document itself (JSON), the index to which the document should be added, the id of the document, and the type.
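
To make the four pieces of information concrete, here is a minimal sketch of what such a mapper could look like. It is an illustration only, not the actual bolt code: the interface shape, the method names (mapToDocument, mapToIndex, mapToType, mapToId), the SimpleMapper class, and the "events"/"event" values are all assumptions made for this example, and a plain Map stands in for Storm's Tuple so the sketch runs without any dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EsMapperSketch {

    // Hypothetical mapper contract: one method per piece of information
    // ES needs. A Map stands in for Storm's Tuple (assumption).
    interface TupleMapper {
        String mapToDocument(Map<String, Object> tuple); // the JSON document
        String mapToIndex(Map<String, Object> tuple);    // index to add it to
        String mapToType(Map<String, Object> tuple);     // document type
        String mapToId(Map<String, Object> tuple);       // document id
    }

    // Illustrative implementation: fixed index and type, id taken from the
    // tuple's "id" field, and every field serialized as a JSON string.
    static class SimpleMapper implements TupleMapper {
        private final String index;
        private final String type;

        SimpleMapper(String index, String type) {
            this.index = index;
            this.type = type;
        }

        @Override
        public String mapToDocument(Map<String, Object> tuple) {
            StringBuilder json = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<String, Object> e : tuple.entrySet()) {
                if (!first) {
                    json.append(',');
                }
                json.append('"').append(escape(e.getKey())).append("\":\"")
                    .append(escape(String.valueOf(e.getValue()))).append('"');
                first = false;
            }
            return json.append('}').toString();
        }

        @Override
        public String mapToIndex(Map<String, Object> tuple) {
            return index;
        }

        @Override
        public String mapToType(Map<String, Object> tuple) {
            return type;
        }

        @Override
        public String mapToId(Map<String, Object> tuple) {
            return String.valueOf(tuple.get("id"));
        }

        // Minimal escaping of backslashes and quotes; a real mapper would
        // use a JSON library instead.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }

    public static void main(String[] args) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("id", "42");
        tuple.put("name", "alice");

        TupleMapper mapper = new SimpleMapper("events", "event");
        System.out.println("index: " + mapper.mapToIndex(tuple));
        System.out.println("type:  " + mapper.mapToType(tuple));
        System.out.println("id:    " + mapper.mapToId(tuple));
        System.out.println("doc:   " + mapper.mapToDocument(tuple));
    }
}
```

Keeping the mapping in its own small interface means the bolt itself never needs to know how your tuples are laid out; swapping index naming or id selection only touches the mapper.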