Apache Flink: past, present and future

Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink Forward 2018.

Apache Flink is an in memory streaming engine. It provides the foundation and the main building blocks to allow stateful computation over data streams between multiple sources and sinks and on various platforms:

Past

It all started in August 2014 when Flink 0.6.0 was released as a Batch Processor on top of a streaming runtime. Before that, it was named StratoSphere and started in 2009 as an academic research project in Berlin’s Technical University.

Version 0.6 exposed the DataStream (streaming) and DataSet (batch) APIs on top of a common pipelined streaming processing engine. It means that operators do not wait for their predecessors to finish in order to start processing data. It results in a very efficient engine capable of handling large data sets and with a better memory handling. Batch is thus treated as a special case of streaming where the streams are bounded (finite).

Flink then acquires over time multiple skills to enhance the streaming experience.

Exactly-once Fault Tolerance

API 0.9 introduced the concept of states and allowed the creation of a new wave of applications able to remember incoming things. Indeed, for stateless jobs it was easy to handle fault: just retry on the failed data. However it gets more complicated for long running distributed stateful jobs, you have to re-initialize the state. Flink learned to remember with the introduction of continuous snapshots to recover and resume state.

However Flink does streaming, and snapshots involve taking a full picture at an instant and persisting it to storage. And snapshots usually take resources and slow down computing. To tackle this challenge, data Artisans implemented asynchronous snapshotting with stream barriers:

Barrier messages are injected into the stream, splitting it into messages batches, and once the barrier reaches an operator, a snapshot is created with the messages batch part of the checkpoint. Flink also has mechanisms to handle merged streams and states snapshotting.

Upon failure, Flink selects the latest completed checkpoint, it redeploys the entire distributed dataflow with the states that were snapshotted as part of this checkpoint and it re-starts the sources from the position of the checkpoint.

This snapshotting process allows high throughput and low latency streaming.

SQL and Machine Learning

API 0.9 also introduced three new modules:

Flink SQL and the Table API adds a higher level of abstraction on structured data sources allowing users to execute SQL-like queries on both batch and streaming distributed datasets.

Gelly Graph processing API contains an utility set for graph analysis that wraps datasets for vertices and edges to apply graph algorithms.

Timekeeping

Event time support came with Flink 0.10. Messages could now be processed in the order they were emitted as long as they had a timestamp. Such datasets include logs, financial transaction and sensor data.

Very often, events are collected from several sources. It is usually not guaranteed that events arrive in the exact order of their timestamps into the stream processor. Consequently, stream processors must take out-of-order elements into account in order to produce results which are correct and consistent with respect to the timestamps of the events.

Flink learned to understand event-time processing as well as ingestion-time and processing-time processing. For example, taking the Star Wars movie series, Flink can process them in different orders:

Processing-time: as they were created: first the originals (4, 5, 6), then the prequels (1, 2, 3), rogue one, and then the new trilogy (7, 8 and 9 to come)

Event-time: defined by their numbering, prequels, rogue one, originals and finally the new trilogy

Present

This all brings us to a processing engine with the following properties:

Continuous & real-time

Stateful & exactly-once

High throughput & low latency

Event time & process functions

and multiple levels of abstraction:

It is now possible to develop data-driven applications:

Modelize and automate all the processes, problems and opportunities

Operate in realtime on multi-channel data sources (and sinks)

Consume all variety of datasets, structured, semi-structured and unstructured

Flink as a library, and not just a framework, makes deploying an app as easy as starting a process with the application code and Flink sources bundled together.

Flink SQL was drastically improved up to supporting unified sources and sink definition in YAML, which allows user to run a SQL job with just YAML configuration and a SQL query through the SQL client CLI, no coding skills required.

Flink is Hadoop-free and supports more and more platforms and technologies including Kubernetes.

Future

Streaming and batch convergence

In Flink’s contexts, the current API expose batch and streaming as different side-by-side abstractions whereas SQL allows to have a unified semantic on stream and batch inputs.

Flink is heading on a convergence for both batch and stream processing, today through SQL apis and tomorrow hopefully with the Java and others APIs.

One very interesting project to notice on this subject is Alibaba’s Blink, a fork of Flink’s internal using Flink SQL to seamlessly process batch and stream.

SQL pattern detection

The SQL API already supports a lot of features. data Artisans is currently working on supporting pattern detection over streams with SQL using the MATCH_RECOGNIZE clause.

Let’s take a practical example. We have an incoming stream of messages on taxi rides with multiple informations, including a unique ride ID and an indicator if the message is the start of a ride. We want to do some pattern matching and retrieve all the rides with mid-stops.

It would be over complicated to implement with existing SQL features, even on a fixed dataset. Here’s how we would select rides with mid-stops with MATCH_RECOGNIZE:

In terms of performances, the given numbers from this early first version were pretty impressive (counting in multiple hundred thousands to millions of events per seconds) and will surely get better.

A short demonstration showed a distributed setup coordinating transactions between Europe, Brazil, the US, Australia and Japan at around 20K/updates per second. Of course, as distributed systems rely heavily on network latency, distance takes a toll on performance.

The only unfortunate point is that right now the ledger is only available as a library on the data Artisans platform, or under Apache 2.0 license in single-node mode for local development and testing.

A world of microservices

If we take a step back, we can look at the history of data processing as following:

First came in offline batch processing, manually launch, usually single node

Then he had Big Data with map/reduce distributed processing

And finally the lambda architecture, trying to solve the latency issue of map/reduce to treat the last events

and multiple steps in services:

Client-server

CORBA & DCOM

EJB & N-Tier architecture (N being 3)

SOA & ESB

HTTP-API & REST

Today, we are coming up to live (near real-time) stream processing and microservices architecture. We are observing a convergence of the two different subjects:

Reactive systems need to be available (responsive) under failure (resilience) and under load (elasticity) by message-driven.

The next step would be to lift the service into a stream stage, input Events into commands and to capture responses and convert them into output events, thus deriving microservices from data streams:

Microservices would be part of a streaming pipeline.

A pipeline could be exposed as microservices.

We could independently upgrade parts of the pipeline.

Flink is already able to handle this, we only need to shift our perspective.

Conclusion

We now have a good overview of Apache Flink. It’s already a fully matured platform and framework for data-driven application and is heading in to become the de facto standard of evented data-driven applications.