Streaming query processing with Apache Kafka and Apache Spark (Java)

Introduction

This is an example of how an application can process
data from Apache Kafka with Apache Spark on OpenShift. The application is
based on the Graf Zahl tutorial; the primary
difference is the choice of language: this version is written in Java.

jGraf Zahl will count, as she does, the words on an Apache Kafka topic
and display the top-k in a web page for the user. There isn’t much
more to her, as you might expect.
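
To make the counting step concrete, here is a minimal sketch of top-k word counting in plain Java. It is not the Spark streaming job that jGraf Zahl runs: it works on an in-memory list, and the class and method names (TopWords, count, topK) are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopWords {
    // Count how many times each word occurs in the input.
    static Map<String, Long> count(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    // Return the k most frequent words, highest count first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topK(Map<String, Long> counts, int k) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for words read from a Kafka topic.
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(topK(count(words), 2));
    }
}
```

In a streaming setting the counts would be updated continuously as new words arrive, but the ranking step would look much the same.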

Architecture

jGraf Zahl is composed of a single pod that serves as both a stream
processor and a web server, using
Spark Framework (not to be confused with Apache Spark).
A production-ready application would separate the processor from the web UI
with an operational data store, e.g. a database or in-memory data grid.
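
As a rough illustration of the single-pod design, the sketch below shows one way a processing thread and a web handler could share state inside one JVM. It is an assumption about the general shape, not jGraf Zahl's actual code; the SharedCounts class and its methods are hypothetical, and a production system would put a real data store in this position instead.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class SharedCounts {
    // Thread-safe map shared by the processing side and the web side.
    private final ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();

    // Called by the stream-processing side for each word it sees.
    public void record(String word) {
        counts.merge(word, 1L, Long::sum);
    }

    // Called by the web side: returns a sorted copy, so rendering a page
    // is not affected by updates that arrive afterwards.
    public Map<String, Long> snapshot() {
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) throws InterruptedException {
        SharedCounts shared = new SharedCounts();

        // Stand-in for the stream processor running in the background.
        Thread processor = new Thread(() -> {
            for (String w : "to be or not to be".split(" ")) {
                shared.record(w);
            }
        });
        processor.start();
        processor.join();

        // Stand-in for a web request reading the current counts.
        System.out.println(shared.snapshot());
    }
}
```

Swapping the in-memory map for a database or data grid is what would let the processor and the web UI run, scale, and fail independently.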

For jGraf Zahl to have anything to do she needs some data to
consume. To help her out here we also provide the following services.

Apache Kafka is provided by Strimzi.
Basic instructions for setting up Strimzi 0.1 are provided in this
document. To experiment with the latest Strimzi version please
refer to the official documentation.

A source of some data to count. For this we provide a word
fountain, generating words to the topic that jGraf Zahl will
consume.

Installation

Installing and deploying jGraf Zahl uses Oshinko S2I, specifically the
Oshinko Java builder. S2I (source-to-image)
is a technology for taking a source repository that has a specific
layout and building it into a container image that is then deployed
as a pod on OpenShift.

Apache Kafka is closer to infrastructure for the other
components than to an application itself, so instead of using S2I, it
is deployed directly from a template and pre-built container images.

First, make sure you are connected to an OpenShift cluster and are in
a project with Oshinko installed. See Get Started if
you need help.

Second, load the Apache Kafka infrastructure components into your
project and start them. Since the following command will initialize
both the Kafka and Zookeeper servers, you might want to wait a moment
before proceeding to the next step.

Third, launch the word fountain, so jGraf Zahl will have something to
count. The word fountain uses the SERVERS environment variable to
find the Apache Kafka deployment to use. In the second step, when you
created Strimzi, you created a service named kafka on
port 9092. Note: The first time you run this step and the next, you’ll have
to wait for the builder images to be pulled down from the internet, so
if you’re on a thin pipe you may want to start both at the same time
and grab a drink.
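
For illustration, here is a minimal Java sketch of how a Kafka client could resolve its broker list from a SERVERS environment variable. The word fountain's own implementation is not shown in this document, so the class name KafkaServers and the fallback to kafka:9092 (the service name and port mentioned above) are assumptions.

```java
import java.util.Map;

public class KafkaServers {
    // Assumed fallback: the service name and port created with Strimzi.
    static final String DEFAULT_SERVERS = "kafka:9092";

    // Pick the broker list from the given environment, falling back to
    // the default when SERVERS is unset or blank.
    static String resolve(Map<String, String> env) {
        String servers = env.get("SERVERS");
        if (servers == null || servers.trim().isEmpty()) {
            return DEFAULT_SERVERS;
        }
        return servers.trim();
    }

    public static void main(String[] args) {
        System.out.println("Using Kafka servers: " + resolve(System.getenv()));
    }
}
```

Passing the environment in as a map, rather than calling System.getenv() inside resolve, keeps the lookup easy to test without touching the real process environment.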