Applied Research. Big Data. Distributed Systems. Open Source.

Spark Streaming has been getting some attention lately as a real-time data
processing tool, often mentioned alongside Apache Storm. If you ask me, no real-time data
processing tool is complete without Kafka integration, hence I added an example Spark Streaming application to
kafka-storm-starter that demonstrates how to read from Kafka and write
to Kafka, using Avro as the data format and
Twitter Bijection for handling the data serialization.

In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state
of Kafka integration in Spark Streaming. All this with the disclaimer that this happens to be my first experiment with
Spark Streaming.

The only thing that’s even better than Apache Kafka and
Apache Storm is to use the two tools in combination. Unfortunately, their
integration can still be a pretty challenging task, at least judging by the many discussion threads on the respective
mailing lists. In this post I am introducing kafka-storm-starter,
which contains many code examples that show you how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using
Apache Avro as the data serialization format. I will also briefly summarize the current
state of their integration at a high level to give you additional context on where the two projects are headed in this
regard.

I am happy to announce the first public release of Wirbelsturm, a Vagrant- and
Puppet-based tool for performing one-click local and remote deployments, with a focus on big data-related infrastructure.
Wirbelsturm’s goal is to make tasks such as “I want to deploy a multi-node Storm cluster” simple, easy, and fun.
In this post I will introduce you to Wirbelsturm, talk a bit about its history, and show you how to launch a multi-node
Storm (or Kafka or …) cluster faster than you can brew an espresso.