Flink requests far more containers than it actually needs

Description

Flink requests a new container when it is notified that an allocated container has completed. For example, if one container fails, Flink should request one replacement container from the RM. In practice it requests n+1 containers, where n is the total number of containers requested since the cluster was created. This is wasteful.

When requesting a container, Flink sends a ContainerRequest to the RM through AMRMClient. AMRMClient stores each ContainerRequest internally and expects it to be removed later. Flink never removes it, so the stored ContainerRequests accumulate with every request and their number grows far beyond what is needed.
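The accumulation described above can be illustrated with a small stand-alone model. This is only a sketch: PendingRequestModel and askAfterOneFailure are hypothetical names, and the class imitates the bookkeeping idea rather than the real Hadoop AMRMClient.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the request bookkeeping inside AMRMClient.
// It is NOT the real Hadoop class; it only models "requests stay until removed".
class PendingRequestModel {
    private final List<String> pending = new ArrayList<>();

    // Models adding a ContainerRequest: it is remembered until someone removes it.
    void addContainerRequest(String request) {
        pending.add(request);
    }

    // Models what the RM is asked for: every request that is still stored.
    int outstandingAsk() {
        return pending.size();
    }

    // A cluster requests `initial` containers at startup, then one replacement
    // after a failure, and never removes any stored request.
    static int askAfterOneFailure(int initial) {
        PendingRequestModel client = new PendingRequestModel();
        for (int i = 0; i < initial; i++) {
            client.addContainerRequest("startup-" + i);
        }
        client.addContainerRequest("replacement");
        return client.outstandingAsk();
    }
}
```

In this model, askAfterOneFailure(n) yields n+1 even though only one container is actually needed, which matches the n+1 behavior reported here.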

In our environment, a cluster initially allocated 100 containers. Later, when it requested one container from the RM, the RM returned more than 2000 containers, because the request actually contained more than 2000 ContainerRequests. Although Flink returns the excess containers, this behavior wastes time and resources on YARN.

Flink could therefore remove the ContainerRequest once the request has been sent to the RM, so that it receives exactly the number of containers it explicitly asked for.
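A minimal sketch of the proposed cleanup, again as a stand-alone model rather than actual Flink or Hadoop code. RemovingRequestModel and onContainerAllocated are hypothetical names. The real AMRMClient does offer removeContainerRequest as the counterpart of addContainerRequest. Removing a request when a container is allocated is an assumption made for this sketch, since the exact point of removal is left open above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of request bookkeeping with cleanup; not the real AMRMClient.
class RemovingRequestModel {
    private final Deque<String> pending = new ArrayDeque<>();

    void addContainerRequest(String request) {
        pending.add(request);
    }

    // Models removing a stored request so it is no longer asked for.
    void removeContainerRequest(String request) {
        pending.remove(request);
    }

    int outstandingAsk() {
        return pending.size();
    }

    // Assumed hook: when a container arrives, drop one stored request for it.
    void onContainerAllocated() {
        String matched = pending.peek();
        if (matched != null) {
            removeContainerRequest(matched);
        }
    }

    // Startup requests are satisfied and removed, then one replacement is requested.
    static int askAfterOneFailure(int initial) {
        RemovingRequestModel client = new RemovingRequestModel();
        for (int i = 0; i < initial; i++) {
            client.addContainerRequest("startup-" + i);
        }
        for (int i = 0; i < initial; i++) {
            client.onContainerAllocated();
        }
        client.addContainerRequest("replacement");
        return client.outstandingAsk();
    }
}
```

With the cleanup in place, the outstanding ask after one failure is 1 in this model, regardless of how many containers were requested at startup.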