Getting Started with Kafka Streams – building a streaming analytics Java application against a Kafka Topic

Kafka Streams is a lightweight Java library for creating advanced streaming applications on top of Apache Kafka Topics. It provides easy-to-use constructs that allow Java developers to quickly compose, in an almost declarative style, streaming pipelines that perform running aggregates, real-time filtering, time windows and joins of streams. Results of the streaming analysis can easily be published to a Kafka Topic or to external destinations. Despite the close integration with Kafka and the many out-of-the-box library elements, an application created with Kafka Streams is just a Java application that can be deployed and run wherever Java applications can run (which is of course virtually anywhere).

In this article I will show you my first steps with Kafka Streams. I will create a simple Kafka Streams application that streams messages from a Kafka Topic, dimensions them (groups them by a specific key) and then keeps a running count per key. The running count is produced to a second Kafka Topic (as well as written to the console). Anyone interested in the outcome of this streaming analysis can consume that topic, without any dependency on the Kafka Streams based logic.

This next figure shows the application. Country messages, simple JSON messages that describe a country with properties such as name, continent, population and size, are produced to a Kafka Topic. (This is done from a simple Node.js application that reads the data from a CSV file and publishes the records to the Kafka Topic; this application is described in the article: NodeJS – Publish messages to Apache Kafka Topic with random delays to generate sample events based on records in CSV file.) The Kafka Streams application consists of a single Java class that creates a stream from the Kafka Topic. Elements in the stream are assigned a key, the continent, and are then counted by key. The result (the running count of countries per continent) is routed to an outbound stream that produces messages to a second Kafka Topic.
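Before looking at the actual Kafka Streams code, it may help to see what the topology computes, reduced to its essence. The sketch below simulates the "assign a key, then count by key" logic in plain Java with a map; the CountryMessage class here is a hypothetical stand-in for the POJO used in the article, and no Kafka is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RunningCountSketch {

    // Hypothetical stand-in for the CountryMessage POJO from the article.
    static class CountryMessage {
        final String name;
        final String continent;
        CountryMessage(String name, String continent) {
            this.name = name;
            this.continent = continent;
        }
    }

    // The essence of selectKey + count-by-key: group each message by its
    // continent and keep a running total per group.
    static Map<String, Long> countByContinent(List<CountryMessage> stream) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (CountryMessage cm : stream) {
            counts.merge(cm.continent, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<CountryMessage> stream = new ArrayList<>();
        stream.add(new CountryMessage("France", "Europe"));
        stream.add(new CountryMessage("Japan", "Asia"));
        stream.add(new CountryMessage("Germany", "Europe"));
        // In the real application every updated count is produced to the
        // outbound Kafka Topic; here we simply print the final state.
        System.out.println(countByContinent(stream)); // {Europe=2, Asia=1}
    }
}
```

The real Kafka Streams version does the same thing continuously over an unbounded stream, and persists the intermediate counts (as discussed later, via a changelog topic) rather than in a plain in-memory map.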

My starting point in this article is:

a running Kafka Cluster somewhere on a server (or as is actually the case, in a VM running on my laptop)

optional: run the NodeJS application to produce country messages to the Kafka Topic countries. Alternatively: manually publish country messages, create another application to publish country messages, or use Kafka Connect to bring country messages across to the countries Topic.

Run the Java application using the Maven-generated JAR file and all JARs downloaded by Maven; this will produce messages on the Kafka Topic (which we can inspect, for example using Kafka Tool) and print messages to the console (which are even easier to inspect).

Note: a Serde is an object that carries a serializer and a deserializer for a specific data type, used to serialize and deserialize keys and values into and from messages on a Kafka Topic. Whenever our Java client consumes or produces elements, a Serde for those elements has to be provided. In this case, I have crafted the countryMessageSerde for the CountryMessage Java class, which is instantiated from the JSON message that is the value of the consumed Kafka messages. This Serde carries a serializer and a deserializer based on JsonPOJOSerializer and JsonPOJODeserializer, generic JSON-to-Java mappers that use the Jackson library.
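To make the Serde idea concrete without needing the Kafka jars on the classpath, here is a self-contained sketch. The Serializer, Deserializer and Serde types below are minimal stand-ins for the real interfaces in org.apache.kafka.common.serialization, and the hand-rolled single-field "JSON" round trip stands in for the Jackson-based JsonPOJOSerializer/JsonPOJODeserializer from the article.

```java
import java.nio.charset.StandardCharsets;

public class SerdeSketch {

    // Minimal stand-ins for Kafka's serialization contracts; the real
    // interfaces live in org.apache.kafka.common.serialization.
    interface Serializer<T>   { byte[] serialize(String topic, T data); }
    interface Deserializer<T> { T deserialize(String topic, byte[] data); }

    // A Serde simply pairs a serializer and a deserializer for one type.
    static class Serde<T> {
        final Serializer<T> serializer;
        final Deserializer<T> deserializer;
        Serde(Serializer<T> s, Deserializer<T> d) { serializer = s; deserializer = d; }
    }

    // Toy one-field version of the article's CountryMessage POJO.
    static class CountryMessage {
        final String name;
        CountryMessage(String name) { this.name = name; }
    }

    // Toy JSON round trip for the single "name" field; the real code maps
    // the full POJO with Jackson instead of string manipulation.
    static final Serde<CountryMessage> countryMessageSerde = new Serde<>(
        (topic, cm) -> ("{\"name\":\"" + cm.name + "\"}").getBytes(StandardCharsets.UTF_8),
        (topic, bytes) -> {
            String json = new String(bytes, StandardCharsets.UTF_8);
            // Extract the value between the quotes after the colon.
            String name = json.substring(json.indexOf(':') + 2, json.length() - 2);
            return new CountryMessage(name);
        });

    public static void main(String[] args) {
        byte[] msg = countryMessageSerde.serializer.serialize("countries", new CountryMessage("France"));
        System.out.println(countryMessageSerde.deserializer.deserialize("countries", msg).name); // France
    }
}
```

With the real library, such a pair is typically wrapped via Serdes.serdeFrom(serializer, deserializer) and handed to the stream operations that consume or produce CountryMessage elements.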

4. Compile the application and build a JAR file – using mvn package

(note: we will later on do a little tweaking on the exact dependencies and set the correct Java version)

Add the following plugin to the Maven pom file to ensure that compilation is done for Java version 1.8; this is required for the lambda expressions (Java 8) and the diamond operator (Java 7) used in the code.
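The plugin configuration looks along these lines (the plugin version shown is an assumption; any reasonably recent maven-compiler-plugin will do):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.1</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```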

All JAR files that follow from the dependencies defined in the pom.xml file are downloaded to the directory Kafka-Streams-Country-Counter\target\dependency (the Maven goal mvn dependency:copy-dependencies copies them there by default).

6. Produce Country Messages to Kafka Topic

optional: run the NodeJS application to produce country messages to the Kafka Topic countries. Alternatively: manually publish country messages, create another application to publish country messages, or use Kafka Connect to bring country messages across to the countries Topic.

7. Run the Java application using the Maven-generated JAR file and all JARs downloaded by Maven

(note: on Linux, the semicolon separating the JAR files on the classpath should be a colon)
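A portable way to launch the application is to pick the classpath separator based on the operating system. The JAR name and the main class (nl.amis.streams.countries.App) below are placeholders for whatever your own build produces; the script only echoes the command, so replace the echo with the real invocation.

```shell
# Choose the classpath separator: colon on Linux/macOS, semicolon on Windows shells.
SEP=":"
case "$(uname -s)" in
  CYGWIN*|MINGW*|MSYS*) SEP=";" ;;
esac

# Application JAR plus all dependency JARs copied by Maven (names are placeholders).
CP="target/Kafka-Streams-Country-Counter-1.0-SNAPSHOT.jar${SEP}target/dependency/*"

# Replace 'echo' with the actual call once the names match your build.
echo java -cp "$CP" nl.amis.streams.countries.App
```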

I ran into several exceptions at this point. I will list them and show the resolutions:

Exception in thread "StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Failed to rebalance
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:299)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:218)

Exception in thread "StreamThread-1" java.lang.ExceptionInInitializerError
    at org.rocksdb.RocksDB.loadLibrary(RocksDB.java:47)
    at org.rocksdb.RocksDB.<clinit>(RocksDB.java:23)
    at org.rocksdb.Options.<clinit>(Options.java:21)
    at org.apache.kafka.streams.state.internals.RocksDBStore.<init>(RocksDBStore.java:126)

This exception can occur on Windows and is caused by the fact that the version of RocksDB that Kafka Streams 0.10.0.0 depends on does not include the required Windows DLL; RocksDB 4.9 does include that DLL.
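One way to get the newer RocksDB on the classpath is to declare it explicitly in the pom.xml, so it takes precedence over the transitive dependency; the exact version string (4.9.0 here) is an assumption based on the 4.9 release mentioned above:

```xml
<dependency>
  <groupId>org.rocksdb</groupId>
  <artifactId>rocksdbjni</artifactId>
  <version>4.9.0</version>
</dependency>
```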

Note: the topic in the blue rectangle – countries-streaming-analysis-app-Counts-changelog – is created by the Kafka Streams library as an intermediate change log for the running count. Instead of keeping the temporary results only in memory, they are produced to a Kafka Topic as well.

4 Comments

Nice article, thanks. Do you maybe have an idea how to do ntile on Kafka Streams, not only Top N: e.g. best third, second-best third, last third per dimension? I am actually trying to do an RFM calculation on the fly and store the calculated scores in a KTable as the result.

Hi Lucas, This is a very nice article. The example is very good. I like that you use Node and Java at the same time. I suggest using localhost instead of ubuntu as the host in your code, so it will work for everyone without modification. I've run it on Windows 7, the same for the Kafka server and ZooKeeper. Regards,