Production Readiness Checklist

The production readiness checklist provides an overview of configuration options that should be carefully considered before bringing an Apache Flink job into production.
While the Flink community has attempted to provide sensible defaults for each configuration option, it is important to review this list and ensure the options chosen are appropriate for your needs.

Set An Explicit Max Parallelism

The max parallelism, which can be set per job and per operator, determines the maximum parallelism to which a stateful operator can scale.
There is currently no way to change the maximum parallelism of an operator after a job has started without discarding that operator’s state.
Maximum parallelism exists, rather than stateful operators being infinitely scalable, because it has some impact on your application’s performance and state size.
To be able to rescale state, Flink has to maintain metadata whose size grows linearly with max parallelism.
In general, you should choose a max parallelism that is high enough to fit your future scalability needs, while keeping it low enough to maintain reasonable performance.

You can explicitly set maximum parallelism by using setMaxParallelism(int maxParallelism).
If no max parallelism is set, Flink derives one as a function of the operator’s parallelism when the job is first started.
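To make the default concrete, here is a minimal sketch of how such a value could be derived. It assumes the rule is "one and a half times the operator parallelism, rounded up to the next power of two, clamped to the range 128 to 32768". The class name, the method names, and both bounds are illustrative assumptions and should be checked against the Flink version you run.

```java
// Sketch only: mirrors the assumed default rule, not Flink's actual source.
class DefaultMaxParallelism {

    // Assumed clamp bounds (2^7 and 2^15); verify against your Flink version.
    static final int LOWER_BOUND = 1 << 7;
    static final int UPPER_BOUND = 1 << 15;

    /** Rounds x up to the nearest power of two (x itself if it already is one). */
    static int roundUpToPowerOfTwo(int x) {
        if (x <= 1) {
            return 1;
        }
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    /** Derives a default max parallelism from the operator parallelism. */
    static int compute(int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        if (parallelism >= UPPER_BOUND) {
            // Already at or above the cap; also avoids integer overflow below.
            return UPPER_BOUND;
        }
        int candidate = roundUpToPowerOfTwo(parallelism + parallelism / 2);
        return Math.min(Math.max(candidate, LOWER_BOUND), UPPER_BOUND);
    }

    public static void main(String[] args) {
        for (int p : new int[] {1, 64, 200, 1000}) {
            System.out.println("parallelism " + p + " -> max parallelism " + compute(p));
        }
    }
}
```

Under these assumptions, a job started with parallelism 200 would be fixed at a max parallelism of 512. A later need for more than 512 subtasks could then only be met by discarding state, which is why setting the value explicitly is the safer choice.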

Set UUIDs For All Operators

As mentioned in the documentation for savepoints, users should set uids for each operator in their DataStream program.
Uids are necessary for Flink’s mapping of operator states to operators, which, in turn, is essential for savepoints.
By default, operator uids are generated by traversing the JobGraph and hashing specific operator properties.
While this is convenient from a user perspective, it is also very fragile, as changes to the JobGraph (e.g., exchanging an operator) result in new uids.
To establish a stable mapping, we need stable operator uids provided by the user through setUid(String uid).
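The fragility of generated ids can be illustrated with a toy model. The sketch below is not Flink’s algorithm: the Op class, assignIds, and the hashing scheme (an MD5-based UUID over the upstream id and the operator name) are hypothetical stand-ins, chosen only to show how an id derived from graph structure shifts when the graph changes, while a user-supplied uid does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Toy model of operator id assignment in a linear pipeline (hypothetical, not Flink code).
class OperatorIdDemo {

    /** Stand-in for an operator: a name plus an optional user-supplied uid. */
    static final class Op {
        final String name;
        final String uid; // null means "generate an id from the graph structure"

        Op(String name, String uid) {
            this.name = name;
            this.uid = uid;
        }
    }

    /** Generated id: depends on the upstream operator's id and on this operator's name. */
    static String generatedId(String upstreamId, String name) {
        byte[] input = (upstreamId + "/" + name).getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(input).toString();
    }

    /** Walks the pipeline in order and returns operator name -> id. */
    static Map<String, String> assignIds(Op... pipeline) {
        Map<String, String> idsByName = new LinkedHashMap<>();
        String upstreamId = "";
        for (Op op : pipeline) {
            String id = op.uid != null ? op.uid : generatedId(upstreamId, op.name);
            idsByName.put(op.name, id);
            upstreamId = id;
        }
        return idsByName;
    }

    public static void main(String[] args) {
        // Version 1 of the job, and version 2 with a filter inserted before "count".
        Map<String, String> v1 = assignIds(new Op("source", null), new Op("count", null));
        Map<String, String> v2 = assignIds(
                new Op("source", null), new Op("filter", null), new Op("count", null));
        System.out.println("generated id stable: " + v1.get("count").equals(v2.get("count")));

        // Same change, but "count" carries an explicit uid.
        Map<String, String> p1 = assignIds(new Op("source", null), new Op("count", "count-uid"));
        Map<String, String> p2 = assignIds(
                new Op("source", null), new Op("filter", null), new Op("count", "count-uid"));
        System.out.println("explicit uid stable: " + p1.get("count").equals(p2.get("count")));
    }
}
```

In this model, state saved under the old generated id of "count" would have no operator to map to after the filter is inserted, whereas the explicitly named operator keeps its id across both versions of the job.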

Choose The Right State Backend

Currently, Flink’s savepoint binary format is state backend specific.
A savepoint taken with one state backend cannot be restored using another, and you should carefully consider which backend you use before going to production.

In general, we recommend avoiding MemoryStateBackend in production because it stores its snapshots inside the JobManager rather than on persistent disk.
When deciding between FsStateBackend and RocksDB, it is a choice between performance and scalability.
FsStateBackend is very fast as each state access and update operates on objects on the Java heap; however, state size is limited by available memory within the cluster.
On the other hand, RocksDB can scale based on available disk space and is the only state backend to support incremental snapshots.
However, each state access and update requires (de-)serialization and potentially reading from disk, which leads to average performance that is an order of magnitude slower than the heap-based state backends.
Carefully read through the state backend documentation to fully understand the pros and cons of each option.

Configure JobManager High Availability

The JobManager serves as a central coordinator for each Flink deployment, being responsible for both scheduling and resource management of the cluster.
It is a single point of failure within the cluster, and if it crashes, no new jobs can be submitted, and running applications will fail.

Configuring High Availability, in conjunction with Apache ZooKeeper, allows for a swift recovery and is highly recommended for production setups.