We have noticed that Checkpoint Duration (Async) accounts for most of the checkpoint time compared to Checkpoint Duration (Sync). I thought asynchronous checkpoints were offered only by the RocksDB state backend; we use the filesystem state backend.

Re: Flink weird checkpointing behaviour

Hi Pawel

First of all, I don't think the Akka timeout exception is related to the checkpoint taking a long time. Also, both RocksDBStateBackend and FsStateBackend have an asynchronous part of the checkpoint, which in general uploads the state data to the DFS. That is why the async part takes more time than the sync part of the checkpoint in most cases.
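
For reference, here is a minimal sketch of configuring FsStateBackend with asynchronous snapshots enabled via its second constructor argument; the HDFS URI is a placeholder, not your actual path:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FsBackendSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds. The sync part captures a
        // consistent snapshot of the state; the async part writes it out.
        env.enableCheckpointing(60_000);

        // The second constructor argument enables asynchronous snapshots,
        // so the upload to the DFS happens off the processing path.
        // The HDFS URI below is a placeholder.
        env.setStateBackend(
                new FsStateBackend("hdfs://namenode:8020/flink/checkpoints", true));
    }
}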

You could check whether the checkpoint alignment time is much longer than before: back pressure in a job causes downstream tasks to receive checkpoint barriers later, and a task must receive the barriers from all of its inputs before it can trigger its checkpoint [1]. If long alignment time dominates the overall checkpoint duration, you should look for the tasks that cause the back pressure.
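
The alignment statistics are exposed through the JobManager's monitoring REST API. A rough sketch of fetching them (the host, port, and job id are placeholders; take the job id from the web UI):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class AlignmentStatsProbe {
    public static void main(String[] args) throws Exception {
        // Placeholders: your JobManager address and the job id.
        String jobId = "<job-id>";
        URL url = new URL("http://localhost:8081/jobs/" + jobId + "/checkpoints");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream();
             Scanner body = new Scanner(in).useDelimiter("\\A")) {
            // The JSON response lists recent checkpoints with their
            // end-to-end duration; the per-checkpoint details include the
            // alignment duration and the bytes buffered during alignment.
            System.out.println(body.hasNext() ? body.next() : "");
        }
    }
}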

Also, a long checkpoint duration might be caused by low write performance of the DFS.
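
One way to rule that out is to time a plain write to the same DFS the checkpoints go to, outside of Flink. A rough probe using the Hadoop FileSystem API (the namenode URI and target path are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsWriteProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder URI -- point this at the DFS used for checkpoints.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());

        byte[] chunk = new byte[1 << 20]; // 1 MiB buffer
        long start = System.nanoTime();
        try (FSDataOutputStream out = fs.create(new Path("/tmp/dfs-write-probe"))) {
            for (int i = 0; i < 256; i++) { // write 256 MiB in total
                out.write(chunk);
            }
            out.hsync(); // flush to the datanodes before stopping the clock
        }
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Wrote 256 MiB in " + millis + " ms ("
                + (256_000.0 / millis) + " MB/s)");
    }
}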

[1] Excerpt from the Flink documentation on data streaming fault tolerance:

    Apache Flink offers a fault tolerance mechanism to consistently
    recover the state of data streaming applications. The mechanism
    ensures that even in the presence of failures, the program’s state
    will eventually reflect every record from the data stream exactly
    once. Note that there is a switch to ...
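
The switch the excerpt alludes to is presumably the checkpointing mode: barrier alignment can be skipped by downgrading the guarantee to at-least-once. A minimal sketch:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AtLeastOnceSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // AT_LEAST_ONCE skips barrier alignment: records arriving during
        // the checkpoint are processed instead of buffered, which lowers
        // checkpoint latency under back pressure at the cost of possible
        // duplicates on recovery.
        env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);
    }
}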


Re: Flink weird checkpointing behaviour

I think it is definitely worth checking the alignment time, as Yun
Tang suggested. There were some changes in the network stack between
those versions that could influence this behavior.

I've also added Stefan as cc; he might have more ideas about what
would be worth checking.

Best,

Dawid
