Details

Description

There is currently basically no difference between savepoints and checkpoints in Flink and both are created through exactly the same process.

However, savepoints and checkpoints have a slightly different meaning which we should take into account to keep Flink efficient:

Savepoints are (typically infrequently) triggered by the user to create a state from which the application can be restarted, e.g. because Flink, some code, or the parallelism needs to be changed.

Checkpoints are (typically frequently) triggered by the System to allow for fast recovery in case of failure, but keeping the job/system unchanged.

This means that savepoints and checkpoints can have different properties in that:

Savepoint should represent a state of the application, where characteristics of the job (e.g. parallelism) can be adjusted for the next restart. One example for things that savepoints need to be aware of are key-groups. Savepoints can potentially be a little more expensive than checkpoints, because they are usually created a lot less frequently through the user.

Checkpoints are frequently triggered by the system to allow for fast failure recovery. However, failure recovery leaves all characteristics of the job unchanged. This checkpoints do not have to be aware of those, e.g. think again of key groups. Checkpoints should run faster than creating savepoints, in particular it would be nice to have incremental checkpoints.

For a first approach, I would suggest the following steps/changes:

In checkpoint coordination: differentiate between triggering checkpoints
and savepoints. Introduce properties for checkpoints that describe their set of abilities, e.g. "is-key-group-aware", "is-incremental".

In state handle infrastructure: introduce state handles that reflect incremental checkpoints and drop full key-group awareness, i.e. covering folders instead of files and not having keygroup_id -> file/offset mapping, but keygroup_range -> folder?

Backend side: We should start with RocksDB by reintroducing something similar to semi-async snapshots, but using BackupableDBOptions::setShareTableFiles(true) and transferring only new incremental outputs to HDFS. Notice that using RocksDB's internal backup mechanism is giving up on the information about individual key-groups. But as explained above, this should be totally acceptable for checkpoints, while savepoints should use the key-group-aware fully async mode. Of course we also need to implement the ability to restore from both types of snapshots.

One problem in the suggested approach is still that even checkpoints should support scale-down, in case that only a smaller number of instances is left available in a recovery case.