Stateful stream processing with Apache Flink

Apache Flink is designed for applications that continuously apply the same business logic to a stream of inputs. That describes most business applications.

Consistency is a fundamental aspect of all systems that store data. In streaming systems, consistency is often classified by the terms "at least once" and "exactly once," which describe how often input events may be processed in case of a failure. Flink's fault tolerance mechanism is based on a sophisticated algorithm that copies consistent checkpoints of an application's state to remote persistent storage. While executing an application, Flink periodically takes checkpoints of the application's state. In case of a failure, Flink restarts the application and resets its state to the last successful checkpoint, which is loaded from remote storage. Because a checkpoint includes the reading position of all sources (given that the sources can be reset), the state of the application remains consistent as if the failure never happened, i.e., as if each input event had been processed exactly once, even though some input events might have been processed twice. For many egress connectors and storage systems, Flink is also able to achieve end-to-end exactly-once semantics.
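The recovery logic above can be illustrated with a minimal sketch (this is not Flink code, just a simulation of the semantics): the application state and the source's reading position are snapshotted together, so after a failure the job rewinds to the checkpoint and replays, ending in the same state as a run with no failure.

```python
# Minimal sketch (not Flink's implementation): checkpoint-based recovery
# with a resettable source. State and source offset are snapshotted
# together, so replay after a failure yields the same final state as a
# failure-free run, i.e. effectively exactly-once state updates.

def run_with_recovery(events, checkpoint_every, fail_at):
    state = {"count": 0, "sum": 0}          # application state
    checkpoint = (dict(state), 0)           # (state snapshot, source offset)
    pos = 0
    failed = False
    while pos < len(events):
        if not failed and pos == fail_at:   # simulate a crash ...
            failed = True
            state, pos = dict(checkpoint[0]), checkpoint[1]  # ... and restore
            continue
        state["count"] += 1                 # apply the business logic
        state["sum"] += events[pos]
        pos += 1
        if pos % checkpoint_every == 0:     # periodic checkpoint
            checkpoint = (dict(state), pos)
    return state

events = [3, 1, 4, 1, 5, 9, 2, 6]
with_failure = run_with_recovery(events, checkpoint_every=3, fail_at=5)
no_failure = run_with_recovery(events, checkpoint_every=3, fail_at=-1)
assert with_failure == no_failure  # state as if the failure never happened
```

Note that events 4 and 5 are processed twice in the failing run (once before the crash, once during replay), yet the final state is identical, which is exactly the "as if exactly once" guarantee described above.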

Because application state can grow to several terabytes in size, and because shorter checkpoint intervals reduce the time to recover from failures, it is important that the checkpointing mechanism add as little overhead and latency to processing as possible. Flink reduces checkpointing overhead by using state backends that perform checkpoints asynchronously and incrementally. When a checkpoint is taken, the state backend creates a local snapshot of the state updates since the last checkpoint and immediately continues processing. The local snapshot is then asynchronously copied to the remote storage location.
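The idea of incremental, asynchronous checkpoints can be sketched in a few lines (an illustrative toy, not Flink's RocksDB state backend): only the keys modified since the last checkpoint are snapshotted, and the copy to "remote storage" runs on a background thread while processing continues.

```python
# Illustrative sketch (not the actual RocksDB state backend):
# incremental, asynchronous checkpointing. Only keys changed since the
# last checkpoint are snapshotted, and the upload to simulated remote
# storage runs on a background thread while processing continues.
import threading

class IncrementalBackend:
    def __init__(self):
        self.state = {}
        self.dirty = set()      # keys modified since the last checkpoint
        self.remote = {}        # simulated remote checkpoint storage

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        # Synchronous part: a cheap local snapshot of the delta only.
        delta = {k: self.state[k] for k in self.dirty}
        self.dirty = set()
        # Asynchronous part: copy the delta to remote storage.
        upload = threading.Thread(target=self.remote.update, args=(delta,))
        upload.start()
        return upload

backend = IncrementalBackend()
backend.put("a", 1)
backend.put("b", 2)
upload = backend.checkpoint()
backend.put("a", 10)            # processing continues during the upload
upload.join()
backend.checkpoint().join()     # the second checkpoint ships only {"a": 10}
assert backend.remote == {"a": 10, "b": 2}
```

The second checkpoint copies a single entry instead of the full state, which is what keeps checkpointing overhead small even when the total state is very large.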

[Figure: Local state management and checkpointing in Apache Flink. Image: Data Artisans]

The state-handling feature that most distinguishes Flink from other stream processors is savepoints. Savepoints are the foundation for many features that manage the lifecycle of a stateful application. In essence, a savepoint is a consistent snapshot of the state of an application. Whereas checkpoints are triggered by Flink and automatically discarded once they are superseded by a newer checkpoint, savepoints are triggered by the user and remain under the user's control. Savepoints are used to initialize the state of an application when the application is started. While this sounds much like the purpose of checkpoints, starting an application from a savepoint offers many more degrees of freedom.

For instance, an application can be started from a savepoint that was taken with a previous version of the application. The new version can contain bug fixes or other improvements and will recompute, and thereby repair or improve, all results that have been produced since the savepoint was taken. Savepoints can be used to upgrade an application to a newer Flink version or migrate it to a different cluster. An application can also be rescaled by starting it from a savepoint with more or fewer compute resources. Finally, savepoints can be used to run A/B tests by starting two different versions with the same initial state from the same point in time.
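How rescaling from a savepoint can work is worth a brief sketch. The following toy model (hypothetical helper names; the constant, hashing, and assignment scheme are simplified assumptions, not Flink's exact implementation) shows the key-group idea: each key maps to a fixed key group, and restarting with a different parallelism simply reassigns whole key groups to the new set of tasks, so no key's state is lost or duplicated.

```python
# Hypothetical sketch of redistributing keyed state from a savepoint
# when restarting with a different parallelism. Each key belongs to a
# fixed "key group"; rescaling reassigns whole key groups to the new
# set of parallel tasks.

MAX_KEY_GROUPS = 128  # fixed for the lifetime of the application

def key_group(key):
    return hash(key) % MAX_KEY_GROUPS

def assign_groups(parallelism):
    # Map each key group to one of `parallelism` tasks (contiguous ranges).
    return {g: g * parallelism // MAX_KEY_GROUPS for g in range(MAX_KEY_GROUPS)}

def restore(savepoint, parallelism):
    # savepoint: {key: state}; returns one state dict per parallel task.
    owner = assign_groups(parallelism)
    tasks = [{} for _ in range(parallelism)]
    for key, state in savepoint.items():
        tasks[owner[key_group(key)]][key] = state
    return tasks

savepoint = {f"user-{i}": {"clicks": i} for i in range(10)}
# The same savepoint can initialize the job at parallelism 2 or 4.
for p in (2, 4):
    tasks = restore(savepoint, p)
    merged = {k: v for task in tasks for k, v in task.items()}
    assert merged == savepoint  # all state preserved across rescaling
```

Because the number of key groups stays constant while the task count changes, the same savepoint serves as a valid starting point for any parallelism up to that limit.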

Why use Apache Flink

In this article we discussed stateful stream processing as a generic design pattern that can be applied to many business use cases. We presented common use cases that are solved with stateful streaming applications and introduced Apache Flink. Because Flink is able to maintain very large state with exactly-once consistency guarantees, perform local state accesses with low latency, and manage the lifecycle of applications via savepoints, it is an excellent choice to power stateful streaming applications.

In the second article of our series, we will show in detail how to implement stateful stream processing applications with Apache Flink’s APIs. We will present a realistic stream processing scenario and walk through the source code of some interesting Flink applications.

Fabian Hueske and Stephan Ewen are committers and Project Management Committee members of the Apache Flink project. They are two of the three original creators of Apache Flink and co-founders of Data Artisans, a Berlin-based startup devoted to fostering Flink. Fabian is currently writing a book, Stream Processing with Apache Flink, to be published by O’Reilly later this year. Follow them on Twitter at @fhueske and @StephanEwen.

—

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.