DataStream API Tutorial

In this guide we will start from scratch and go from setting up a Flink project to running
a streaming analysis program on a Flink cluster.

Wikipedia provides an IRC channel where all edits to the wiki are logged. We are going to
read this channel in Flink and count the number of bytes that each user edits within
a given window of time. This is easy enough to implement in a few minutes using Flink, but it will
give you a good foundation from which to start building more complex analysis programs on your own.

Setting up a Maven Project

We are going to use a Flink Maven Archetype for creating our project structure. Please
see Java API Quickstart for more details
about this. For our purposes, the command to run is this:

Note: For Maven 3.0 or higher, it is no longer possible to specify the repository (-DarchetypeCatalog) via the command line. If you wish to use the snapshot repository, you need to add a repository entry to your settings.xml. For details about this change, please refer to the official Maven documentation.

You can edit the groupId, artifactId and package if you like. With the above parameters,
Maven will create a project structure that looks like this:

The root directory contains our pom.xml file, which already has the Flink dependencies added, and
src/main/java contains several example Flink programs. We can delete the example programs, since
we are going to start from scratch:

$ rm wiki-edits/src/main/java/wikiedits/*.java

As a last step we need to add the Flink Wikipedia connector as a dependency so that we can
use it in our program. Edit the dependencies section of the pom.xml so that it looks like this:

The program is very basic now, but we will fill it in as we go. Note that I won’t give
import statements here since IDEs can add them automatically. At the end of this section I’ll show
the complete code with import statements if you simply want to skip ahead and enter that in your
editor.

The first step in a Flink program is to create a StreamExecutionEnvironment
(or ExecutionEnvironment if you are writing a batch job). This can be used to set execution
parameters and create sources for reading from external systems. So let’s go ahead and add
this to the main method:

This creates a DataStream of WikipediaEditEvent elements that we can further process. For
the purposes of this example we are interested in determining the number of added or removed
bytes that each user causes in a certain time window, let’s say five seconds. For this we first
have to specify that we want to key the stream on the user name, that is to say that operations
on this stream should take the user name into account. In our case the summation of edited bytes
in the windows should be per unique user. To key a stream we have to provide a KeySelector, like this:
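To make the idea of keying concrete, here is a minimal plain-Java sketch. It is not Flink code: the Edit class is a hypothetical stand-in for WikipediaEditEvent, partitionByKey is an illustrative helper, and the hash-modulo routing is a simplifying assumption rather than the scheme Flink actually uses. It only shows the property that matters here, namely that all elements with the same key end up in the same partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Plain-Java model of keying a stream; not Flink code. */
class KeyBySketch {

    /** Hypothetical stand-in for WikipediaEditEvent. */
    static final class Edit {
        final String user;
        final long byteDiff;

        Edit(String user, long byteDiff) {
            this.user = user;
            this.byteDiff = byteDiff;
        }

        String getUser() {
            return user;
        }

        @Override
        public String toString() {
            return user + ":" + byteDiff;
        }
    }

    /**
     * Routes each element to a partition chosen from its key.
     * Assumption: simple hash modulo, used here only for illustration.
     */
    static <T> Map<Integer, List<T>> partitionByKey(
            List<T> elements, Function<T, String> keySelector, int parallelism) {
        Map<Integer, List<T>> partitions = new TreeMap<>();
        for (T element : elements) {
            String key = keySelector.apply(element);
            int index = Math.floorMod(key.hashCode(), parallelism);
            partitions.computeIfAbsent(index, i -> new ArrayList<>()).add(element);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Edit> edits = List.of(
            new Edit("alice", 10), new Edit("bob", 5), new Edit("alice", -3));
        // Both "alice" edits land in the same partition, whatever its index is.
        System.out.println(partitionByKey(edits, Edit::getUser, 4));
    }
}
```

The key selector is the only piece the caller supplies. Everything downstream, such as a per-user sum, can then work on one key at a time.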

This gives us a Stream of WikipediaEditEvent that has a String key, the user name.
We can now specify that we want to have windows imposed on this stream and compute a
result based on elements in these windows. A window specifies a slice of a Stream
on which to perform a computation. Windows are required when computing aggregations
on an infinite stream of elements. In our example we will say
that we want to aggregate the sum of edited bytes for every five seconds:

The first call, .timeWindow(), specifies that we want to have tumbling (non-overlapping) windows
of five seconds. The second call specifies a Fold transformation on each window slice for
each unique key. In our case we start from an initial value of ("", 0L) and add to it the byte
difference of every edit in that time window for a user. The resulting Stream now contains
a Tuple2<String, Long> for every user which gets emitted every five seconds.
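The windowing and folding described above can be modelled in plain Java. This is a sketch for intuition only, not Flink code: the Edit class, its timestampMillis field, and sumBytesPerUser are hypothetical names, and the sketch assumes each edit carries its own timestamp. Each edit is assigned to the five-second window that contains its timestamp, and the byte differences are summed per user within that window.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of a keyed, tumbling-window fold; not Flink code. */
class TumblingWindowFoldSketch {

    /** Hypothetical stand-in for WikipediaEditEvent. */
    static final class Edit {
        final long timestampMillis;
        final String user;
        final long byteDiff;

        Edit(long timestampMillis, String user, long byteDiff) {
            this.timestampMillis = timestampMillis;
            this.user = user;
            this.byteDiff = byteDiff;
        }
    }

    /**
     * Assigns each edit to the window [start, start + sizeMillis) containing its
     * timestamp, then sums the byte differences per user inside that window.
     * Returns window start -> (user -> summed byte difference).
     */
    static Map<Long, Map<String, Long>> sumBytesPerUser(List<Edit> edits, long sizeMillis) {
        Map<Long, Map<String, Long>> windows = new TreeMap<>();
        for (Edit e : edits) {
            long windowStart = e.timestampMillis - Math.floorMod(e.timestampMillis, sizeMillis);
            windows.computeIfAbsent(windowStart, w -> new LinkedHashMap<>())
                   .merge(e.user, e.byteDiff, Long::sum);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Edit> edits = List.of(
            new Edit(1_000, "alice", 120),
            new Edit(4_999, "alice", -20),
            new Edit(5_000, "bob", 7),
            new Edit(9_000, "alice", 3));
        // Prints {0={alice=100}, 5000={bob=7, alice=3}}
        System.out.println(sumBytesPerUser(edits, 5_000));
    }
}
```

Because the windows do not overlap, every edit contributes to exactly one window, and each (window, user) pair yields one result.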

The only thing left to do is print the stream to the console and start execution:

result.print();

see.execute();

That last call is necessary to start the actual Flink job. All operations, such as creating
sources, transformations, and sinks, only build up a graph of internal operations. Only when
execute() is called is this graph of operations submitted to a cluster or executed on your local
machine.
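The deferred-execution behaviour can be illustrated with a toy pipeline in plain Java. This is a sketch, not how Flink is implemented: LazyPipelineSketch and its map, execute, and log methods are hypothetical names. Registering a step only records it; no step runs until execute() is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy model of deferred execution; not Flink code. */
class LazyPipelineSketch {
    private final List<Function<String, String>> steps = new ArrayList<>();
    private final List<String> log = new ArrayList<>();

    /** Registers a transformation; nothing is computed yet. */
    LazyPipelineSketch map(String name, Function<String, String> fn) {
        steps.add(input -> {
            log.add("ran " + name);
            return fn.apply(input);
        });
        return this;
    }

    /** Runs the registered steps, in order, over each input element. */
    List<String> execute(List<String> inputs) {
        List<String> out = new ArrayList<>();
        for (String input : inputs) {
            String value = input;
            for (Function<String, String> step : steps) {
                value = step.apply(value);
            }
            out.add(value);
        }
        return out;
    }

    /** Names of the steps that have actually run so far. */
    List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        LazyPipelineSketch pipeline = new LazyPipelineSketch()
            .map("trim", String::trim)
            .map("upper", String::toUpperCase);
        System.out.println(pipeline.log());                       // [] : nothing has run yet
        System.out.println(pipeline.execute(List.of(" alice "))); // [ALICE]
        System.out.println(pipeline.log());                       // [ran trim, ran upper]
    }
}
```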

The number in front of each line tells you on which parallel instance of the print sink the output
was produced.

This should get you started with writing your own Flink programs. To learn more
you can check out our guides about basic concepts and the DataStream API. Stick
around for the bonus exercise if you want to learn about setting up a Flink cluster on
your own machine and writing results to Kafka.

Note how we first transform the Stream of Tuple2<String, Long> to a Stream of String using
a MapFunction. We are doing this because it is easier to write plain strings to Kafka. Then,
we create a Kafka sink. You might have to adapt the hostname and port to your setup. "wiki-result"
is the name of the Kafka topic that we are going to create next, before running our program.

Build the project using Maven because we need the jar file for running on the cluster:

$ mvn clean package

The resulting jar file will be in the target subfolder: target/wiki-edits-0.1.jar. We’ll use
this later.

Now we are ready to launch a Flink cluster and run the program that writes to Kafka on it. Go
to the location where you installed Flink and start a local cluster:

$ cd my/flink/directory
$ bin/start-cluster.sh

We also have to create the Kafka topic, so that our program can write to it:

You can see how the individual operators start running. There are only two, because
the operations after the window get combined into one operation for performance reasons. In Flink
we call this chaining.
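As a rough analogy for chaining, consider two steps fused into a single function in plain Java. This is only a sketch of the idea, not Flink's mechanism: FORMAT, TAG, and CHAINED are hypothetical names for steps that would follow the window. Once fused, each element passes through both steps in one call, with no hand-off between separate operators.

```java
import java.util.function.Function;

/** Plain-Java analogy for operator chaining; not Flink code. */
class ChainingSketch {
    /** Hypothetical step 1: turn a byte count into text. */
    static final Function<Long, String> FORMAT = bytes -> "bytes=" + bytes;

    /** Hypothetical step 2: tag the text on its way to a sink. */
    static final Function<String, String> TAG = text -> "sink: " + text;

    /** Both steps fused into one function that is applied per element. */
    static final Function<Long, String> CHAINED = FORMAT.andThen(TAG);

    public static void main(String[] args) {
        System.out.println(CHAINED.apply(42L)); // sink: bytes=42
    }
}
```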

You can observe the output of the program by inspecting the Kafka topic using the Kafka
console consumer: