1.6 UI issues

Hello!

We are migrating to the latest 1.6 version and all the jobs seem to work fine, but when we check individual jobs through the web interface we run into an issue: after clicking on a job, either it takes too long to load the job's information or it never loads at all.

From our experience, you could first check the jobmanager.log to see whether it contains entries similar to the one below:

max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes

If you see such logs, you should increase akka.framesize to a larger value (the default is '10485760b') [1]. Otherwise, you could check the GC log of the JobManager to see whether the GC overhead is too heavy for it; if so, consider increasing the JobManager's memory.
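For reference, both settings mentioned above go into flink-conf.yaml; the values below are illustrative only, not tuned recommendations:

```yaml
# flink-conf.yaml (illustrative values)
# Maximum akka message size; the default is 10485760b (10 MiB)
akka.framesize: 20971520b
# JobManager heap, in case GC overhead turns out to be the bottleneck
jobmanager.heap.size: 2048m
```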

Re: 1.6 UI issues

Hello guys. Happy new year!

Context: we started to have some trouble with the UI after bumping our Flink version from 1.4 to 1.6.3. The UI couldn't render the job details page, so inspecting jobs has become impossible for us with the new version.

It looks like we have a workaround for our UI issue.

After some investigation we realized that, starting from Flink 1.5, we get a timeout on the actor call restfulGateway.requestJob(jobId, timeout) in ExecutionGraphCache. So we increased the web.timeout parameter, and we no longer get timeout exceptions on the JobManager side.

Also, in SingleJobController on the AngularJS side we needed to tweak web.refresh-interval to ensure that the front end waits for the back-end request to finish. Otherwise the AngularJS side can issue another request in SingleJobController, and for reasons we don't understand yet, when the older request finishes the UI is not updated. We will take a closer look at this behavior.
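For reference, the two knobs mentioned above live in flink-conf.yaml; the values below are illustrative only:

```yaml
# flink-conf.yaml (illustrative values, both in milliseconds)
# Timeout for asynchronous web/REST operations such as requestJob (default: 10000)
web.timeout: 60000
# How often the web UI polls the back end for fresh job state
web.refresh-interval: 30000
```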

Thanks for the log. The log file does not contain anything suspicious.
Are you sure you sent me the right file? The timestamps don't seem to match. In the attached log, the job seems to run without problems.

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)


Re: 1.6 UI issues

Hi Oleksandr,

The requestJob call should only take longer if either the `JobMaster` is overloaded and too busy to respond to the request, or if the ArchivedExecutionGraph is very large (e.g. very large accumulators) and generating it and sending it over to the RestServerEndpoint takes too long. This is due to a change introduced with Flink 1.5: instead of simply handing over a reference from the JobMaster to the RestServerEndpoint, the ArchivedExecutionGraph now needs to be sent through the network stack to the RestServerEndpoint.

If you did not change the akka.framesize then the maximum size of the ArchivedExecutionGraph should only be 10 MB, though. Therefore, I would guess that your `JobMaster` must be quite busy if the requests time out.
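To make the frame-size constraint concrete, here is a small self-contained Java sketch (not Flink code; `FakeGraph` is a made-up stand-in for a large serialized payload such as an ArchivedExecutionGraph with many accumulators) that measures an object's encoded size against the 10 MiB default frame limit:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

public class FrameSizeCheck {

    // Default akka.framesize in Flink: 10485760 bytes (10 MiB).
    static final int FRAME_SIZE = 10_485_760;

    // Hypothetical stand-in for a large payload; not a real Flink class.
    static class FakeGraph implements Serializable {
        final HashMap<String, String> accumulators = new HashMap<>();
    }

    // Java-serializes the payload and returns its encoded size in bytes.
    static int encodedSize(Serializable payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(payload);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        FakeGraph graph = new FakeGraph();
        for (int i = 0; i < 1_000; i++) {
            graph.accumulators.put("acc-" + i, "value-" + i);
        }
        int size = encodedSize(graph);
        System.out.println("encoded size: " + size + " bytes, fits in one frame: "
            + (size <= FRAME_SIZE));
    }
}
```

A payload whose encoded size exceeds the frame limit is what produces the "max allowed size ... actual size of encoded class ..." log lines quoted earlier in the thread.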

Re: 1.6 UI issues

First, congratulations to you and the whole Flink community! It is great to see such success and recognition of Apache Flink and your work.

Thanks also for the previous answer and the good tips. On our side, we have made several more steps toward understanding the issue.

So I think we have two related problems in Flink, which can be reproduced in our setup:

UI issue

Looks like there are some routing problems on the Angular side of the Flink UI. Angular refreshes the job state (about 20 KB in our case) every 10 s by default (web.refresh-interval).

If one of the refresh calls takes longer than web.refresh-interval, the next request is made anyway.

After a while the first requests started to complete, but the UI is not rendered correctly in this case: only the name tabs are shown, and neither the graph nor the metrics were requested and rendered.

What do you think, should I create a Jira bug for this issue?

The second issue is the reason why we observe this behavior. After some profiling in JVisualVM and JMC, it looks like the hot spot for us is adding metrics into the HashMap.

In the tested setup we had 60 TaskManagers, and from every TaskManager we get 6114 metrics (operators × number of metrics), which creates 366840 inserts per 10 seconds, i.e. roughly 36k inserts per second. The problem is that, with a small refresh interval, the many requests from the UI effectively DDoS the back-end system in our case.
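As a sanity check of those numbers, the reported rate can be reproduced with a few lines (a standalone back-of-the-envelope illustration, not Flink code):

```java
// Back-of-the-envelope check of the reported MetricStore insert rate:
// 60 TaskManagers, 6114 metrics each, fetched every 10 seconds.
public class MetricInsertRate {

    static long insertsPerFetch(long taskManagers, long metricsPerTaskManager) {
        return taskManagers * metricsPerTaskManager;
    }

    static long insertsPerSecond(long perFetch, long fetchIntervalSeconds) {
        return perFetch / fetchIntervalSeconds;
    }

    public static void main(String[] args) {
        long perFetch = insertsPerFetch(60, 6114);        // 366840
        long perSecond = insertsPerSecond(perFetch, 10);  // 36684, i.e. ~36k/s
        System.out.println(perFetch + " inserts per 10 s, ~" + perSecond + " per second");
    }
}
```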

If you think it is interesting, I can share profiler snapshots with you; the most interesting part is the list of hot methods.

Also, we are using the G1 garbage collector for our JobManager, which has 8 GB of heap. We have noticed that young GC takes a very significant amount of time, especially during the Scan RS phase. Is there any recommendation from the community about which GC algorithm we should use for the JobManager (and TaskManager)?


Re: 1.6 UI issues

Hi Oleksandr,

Thanks a lot for the kind wishes and for the detailed investigation.

1. I think if the cluster cannot serve the information within web.refresh-interval, it would be best to increase it. I quickly looked into the `ExecutionGraphCache`, which is used for storing the `ArchivedExecutionGraph`, and it looks like one could change the logic a bit. At the moment we invalidate the ExecutionGraph cache entries after web.refresh-interval and request an update from the cluster. This has the benefit (given that the response is fast) that we see the updated state sooner. Instead, one could invalidate the old ExecutionGraph cache entry only after the response for the new request has arrived. This would prevent your situation, because you would keep the old state as long as the request is in flight. The downside of this approach is that you might wait another UI refresh interval until you see the results if the response is very fast. You could open a JIRA issue to discuss this further.
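The alternative invalidation strategy described above (keep the old entry until the fresh response has arrived) can be sketched roughly as follows; this is an illustrative stand-alone cache, not Flink's actual `ExecutionGraphCache` API:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative sketch: serve the stale cached value while a refresh is in
// flight, and swap in the new value only once the response has arrived.
public class StaleWhileRefreshCache<K, V> {

    private final Map<K, V> values = new ConcurrentHashMap<>();
    private final Map<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    /**
     * Returns the currently cached value (possibly stale, null before the
     * first load completes) and triggers at most one concurrent refresh.
     */
    public V get(K key, Supplier<CompletableFuture<V>> loader) {
        V cached = values.get(key);
        if (!inFlight.containsKey(key)) {
            CompletableFuture<V> fresh = loader.get();
            if (inFlight.putIfAbsent(key, fresh) == null) {
                fresh.whenComplete((value, error) -> {
                    if (error == null) {
                        values.put(key, value); // swap only after the response arrived
                    }
                    inFlight.remove(key);
                });
            }
        }
        return cached;
    }
}
```

On the very first request the cache has nothing to serve until the initial load completes; a real implementation would fall back to waiting on the in-flight future in that case.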

2. The high load caused by the MetricStore is indeed a problem. We should also open a JIRA issue to investigate what we could improve here. One thing we should definitely do is make the fetching interval configurable, so that one doesn't have to recompile Flink in order to change it. I actually quickly added it [1,2].


As a workaround for the GC pressure, one can use a more predictable GC than G1 with its ergonomics. We have switched to the Parallel GC for the JobManager and hope it will be good enough for all our use cases, while on the TaskManager side we still prefer G1 due to its latency promises.
