Contents

Introduction

This is intended to be an example of how an application can process
data from Apache Kafka with Apache Spark on OpenShift.

Graf Zahl will count, as he does, the words on an Apache Kafka topic
and display the top-k in a web page for the user. There isn’t much
more to him, as you might expect.

Architecture

Graf Zahl is composed of a single pod that serves both as a stream
processor as well as web server, using
Flask. A production-ready application would
separate the processor from the web UI by an operational data store,
i.e. a database or in-memory data grid.

For Graf Zahl to have anything to do he needs some data to
consume. To help him out here we also provide the following services.

Apache Kafka is provided by Strimzi.
Basic instructions for setting up Strimzi 0.1 are provided in this
document. To experiment with the latest Strimzi version please
refer to the official documentation.

A source of some data to count. For this we provide a word
fountain, generating words to the topic that Graf Zahl will
consume.

Installation

Installing and deploying Graf Zahl utilizes
Oshinko S2I, specifically the
Oshinko pyspark
builder. S2I
is a technology for taking a source repository that has a specific
layout and building it into a container image that is then deployed
as a pod on OpenShift.

Apache Kafka is more similar to infrastructure for the other
components and not an application itself, so instead of using S2I, it
is directly deployed from a template and pre-built container images.

First, make sure you are connected to an OpenShift cluster and are in
a project with Oshinko installed. See Get Started if
you need help.

Second, load the Apache Kafka infrastructure components into your
project and start them. Since the following command will initialize
both the Kafka and Zookeeper servers, you might want to wait a moment
before proceeding to the next step.

Third, launch the word fountain, so Graf Zahl will have something to
count. The word fountain uses the SERVERS environment variable to
find the Apache Kafka deployment to use. In the second step, when you
created strimzi you created a service with the name kafka on
port 9092. Note: The first time this step and the next run you’ll have
to wait for the builder images to be pulled down from the internet, so
if you’re on a thin pipe you may want to start both at the same time
and grab a drink.