Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

I'm running a YARN cluster of 8 * 4 core instances = 32 cores, with a configuration of 3 slots per TM. The cluster is dedicated to a single job that runs at full capacity in "FLIP6" mode. So in this cluster, the parallelism is 21 (7 TMs * 3, one container dedicated for Job Manager).
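The capacity math above can be sketched as follows (numbers taken from the setup described; plain arithmetic, not Flink code):

```python
# Capacity math for the cluster described above (assumed numbers).
nodes, cores_per_node = 8, 4      # 8 * 4-core instances
slots_per_tm = 3                  # 3 slots per TaskManager

total_cores = nodes * cores_per_node   # 32 cores in the cluster
task_managers = nodes - 1              # one container goes to the Job Manager
parallelism = task_managers * slots_per_tm

print(total_cores, task_managers, parallelism)  # 32 7 21
```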

When I run the job in 1.6.0, seven Task Managers are spun up as expected. But if I run with 1.6.2 only four Task Managers spin up and the job hangs waiting for more resources.

Our Flink distribution is set up by script after building from source. So aside from flink jars, both 1.6.0 and 1.6.2 directories are identical. The job is the same, restarting from savepoint. The problem is repeatable.

Has something changed in 1.6.2, and if so can it be remedied with a config change?

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Hi Cliff,

This does not sound right. Could you share the logs of the YARN cluster entrypoint with the community for further debugging, ideally at DEBUG level? The YARN logs would also help us fully understand the problem. Thanks a lot!

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Hi Cliff,

The TaskManagers fail to start with exit code 31, which indicates an initialization error on startup. If you check the TaskManager logs via `yarn logs -applicationId <APP_ID>`, you should see why the TMs don't start up.

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Hi Till,

Yes, it turns out the problem was having flink-queryable-state-runtime_2.11-1.6.2.jar in flink/lib. Queryable State apparently bootstraps itself and, in my case, it brought the Task Manager down when it found no available ports. What's a little troubling is that I had not configured Queryable State at all, so I would not expect it to get in the way. I haven't looked further into it, but if Queryable State wants to enable itself, it should at worst take an unused port by default, especially since many people run in shared environments like YARN.
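For anyone hitting this before a fix lands: if the jar has to stay in flink/lib, one possible workaround (assuming the queryable-state options available in Flink 1.6; check your version's docs) is to give the server and proxy a port range in flink-conf.yaml, so that multiple TaskManagers on one YARN node don't collide on a single fixed port:

```yaml
# flink-conf.yaml excerpt (sketch): let the queryable state server and client
# proxy pick a free port from a range instead of one fixed port per host.
# The ranges below are arbitrary examples.
queryable-state.server.ports: 50100-50200
queryable-state.proxy.ports: 50300-50400
```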

Re: Task Manager allocation issue when upgrading 1.6.0 to 1.6.2

Good to hear Cliff.

You're right that it's not a nice user experience. The problem with queryable state is that one would need to look at the actual user job to decide whether it uses queryable state, but by then it is already too late to start the infrastructure needed for querying the state. You're right, though, that we should at least pick a random port by default. I've created a corresponding issue for this: https://issues.apache.org/jira/browse/FLINK-10866.
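The "random port" behaviour mentioned above is cheap to get from the OS: binding to port 0 lets the kernel assign any free ephemeral port. A minimal sketch with plain Python sockets (not Flink code):

```python
import socket

# Bind to port 0: the OS picks an unused ephemeral port, so two services
# starting on the same host never fight over one hard-coded port number.
def bind_ephemeral(host="127.0.0.1"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock, sock.getsockname()[1]

s1, p1 = bind_ephemeral()
s2, p2 = bind_ephemeral()
print(p1 != p2 and p1 > 0 and p2 > 0)  # two distinct free ports
s1.close()
s2.close()
```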
