Re: Best practice for integrating Kafka and HBase

OK, thanks Thomas for the information, and sorry for the delayed response.

You are putting about 150k messages/second into Kafka, at roughly 250 bytes per message.

You would like to do some light processing/formatting, but not a complex topology.

There are a number of considerations for this architecture. On the HBase side, you'll need anywhere from 10-15 or so Region Servers at minimum. I've definitely seen a single Region Server do 30k writes/second, but with mixed reads as well this number can be lower. It's a safe bet to assume that a typical Region Server on bare metal can do at least 10k sustained writes per second, particularly if you pre-split regions and have a good distribution of rowkeys across regions.

So make sure you have a rowkey design that distributes writes well; you can also bucket the rowkeys based on projected volume and the number of region servers.
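A minimal sketch of that bucketing idea in plain Java (the bucket count and key format here are assumptions, not a prescription): prefix the natural key with a stable hash bucket so writes fan out across pre-split regions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical salted-rowkey helper. NUM_BUCKETS would typically be chosen
// relative to your region server count and projected volume.
public class SaltedRowKey {
    static final int NUM_BUCKETS = 16;

    static byte[] rowKey(String messageId) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        int bucket = Math.floorMod(messageId.hashCode(), NUM_BUCKETS);
        // Zero-pad the bucket so keys within a bucket sort together.
        String key = String.format("%02d-%s", bucket, messageId);
        return key.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(new String(rowKey("msg-42"), StandardCharsets.UTF_8));
    }
}
```

You would pre-split the table on the bucket prefixes ("00", "01", ... "15") so each region takes a predictable slice of the write load.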

Regarding delivery semantics: there are all sorts of reasons why you can get duplicates in a Kafka-based pipeline, but using HBase actually mitigates this, assuming you have a unique ID per message and that ID is part of your rowkey. Otherwise, as you mentioned, you'll have to de-dup at some point.
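To make the dedup point concrete, here is a toy sketch (the Map is just a stand-in for the HBase table, and the IDs are made up): replaying a message produces the same rowkey, so the second write lands on the same row rather than creating a duplicate. In HBase it simply becomes another version of the same cell.

```java
import java.util.HashMap;
import java.util.Map;

// Idempotent-write sketch: duplicates collapse when the unique message ID
// is part of the rowkey. HashMap stands in for the HBase table.
public class IdempotentWrite {
    static final Map<String, String> table = new HashMap<>();

    static void put(String messageId, String payload) {
        // Rowkey derived from the unique ID, so redeliveries hit the same row.
        table.put("row-" + messageId, payload);
    }

    public static void main(String[] args) {
        put("evt-1001", "first delivery");
        put("evt-1001", "redelivered after consumer restart");
        System.out.println(table.size()); // prints 1: the duplicate overwrote, not appended
    }
}
```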

In order to keep up with 150k messages per second, you'll need to do some testing to see how many messages you can read and then write to HBase per partition.

I'm not a huge fan of Storm, for a number of reasons, not the least of which is deploying and managing another distributed system for what amounts to a simple transformation (protobuf deserialization). Flume might be an OK choice; you could certainly code your transformation in an interceptor, but you'll likely need the same number of agents as region servers in order to process the data effectively. Note that the AsyncHBaseSink doesn't support Kerberos, so your mileage may vary. (I'm a pretty big fan of the Flume-Kafka integration in general, but it may not be appropriate for all use cases.)
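If you go the Flume route, the agent would look roughly like the sketch below. This is an assumption-laden fragment: the interceptor class is hypothetical (you'd write it yourself), and the plain HBaseSink is shown instead of AsyncHBaseSink because of the Kerberos limitation mentioned above.

```properties
# Hypothetical Flume agent: Kafka source -> protobuf-decoding interceptor -> HBase sink
agent.sources = kafka-src
agent.channels = mem-ch
agent.sinks = hbase-sink

agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-src.zookeeperConnect = zk1:2181
agent.sources.kafka-src.topic = events
agent.sources.kafka-src.channels = mem-ch
# com.example.ProtobufDecodeInterceptor is a made-up class name for your transformation
agent.sources.kafka-src.interceptors = proto
agent.sources.kafka-src.interceptors.proto.type = com.example.ProtobufDecodeInterceptor$Builder

agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 100000

agent.sinks.hbase-sink.type = hbase
agent.sinks.hbase-sink.table = events
agent.sinks.hbase-sink.columnFamily = d
agent.sinks.hbase-sink.channel = mem-ch
```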

Another option is Spark Streaming with the direct Kafka approach, which gives you simplified parallelism: no need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, all reading data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.

The other option is just to write a Java application. A relatively simple standalone Java process can do this with no problems, though you'd certainly need multiple threads and multiple processes across multiple machines. Still, 150k/second should be achievable on 2-4 nodes if done correctly. You should probably read this if you roll your own.
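The threading shape of such a process can be sketched in pure Java. The Kafka poll loop and the buffered HBase puts are stubbed out here (the real clients need their own dependencies and tuning); the point is just the consumer-threads-feeding-writer-threads structure.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the "plain Java" option: a bounded queue between the Kafka-facing
// side and a pool of HBase writer threads, with poison pills for shutdown.
public class PipelineSketch {
    static final int WRITERS = 4;
    static final String POISON = "POISON";
    static final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
    static final AtomicLong written = new AtomicLong();

    public static void main(String[] args) throws Exception {
        ExecutorService writers = Executors.newFixedThreadPool(WRITERS);
        for (int i = 0; i < WRITERS; i++) {
            writers.submit(() -> {
                try {
                    String msg;
                    while (!(msg = queue.take()).equals(POISON)) {
                        written.incrementAndGet(); // stand-in for a buffered HBase put
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Stand-in for the Kafka consume loop: enqueue 100k messages.
        for (int i = 0; i < 100_000; i++) queue.put("msg-" + i);
        for (int i = 0; i < WRITERS; i++) queue.put(POISON); // one pill per writer
        writers.shutdown();
        writers.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println(written.get()); // prints 100000
    }
}
```

The bounded queue gives you backpressure for free: if HBase slows down, the consumer side blocks instead of running out of memory.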

In any case, make sure you up your handler threads in HBase and tune HBase for writes. Also make sure that you have enough Kafka partitions to properly parallelize your consumption.
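For reference, a write-tuning starting point in hbase-site.xml might look like the fragment below. The values are illustrative assumptions, not recommendations; you'd want to load-test against your own hardware.

```xml
<!-- Hypothetical write-tuning sketch; values must be validated by load testing -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value> <!-- raise the default (30) for high concurrent write load -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- 256 MB flush threshold, up from the 128 MB default -->
</property>
```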

Currently, it's not clear how the data will be accessed/queried later on by the data scientists (we assume they will use tools like R, Spark machine learning, or even pose some kind of SQL-like queries). Also, we do not need random writes, i.e., only new data will be added and there are no updates. Therefore, I'm not sure whether we need HBase at all.