Hello,

For the last couple of days I have been tuning the HDFS rebalance on our cluster. The cluster had a large disparity in data distribution because the balancer had not been run, and it needed to move ~150 TB of data to rebalance. I managed to tune the settings to move around 180 GB every 5 minutes, so it should have been able to rebalance in about 3 days.

The job failed over the weekend because the Kerberos ticket expired. After I restarted it, the transfer rate dropped to around 60 GB every 5 minutes. I restarted the balancer again and now it's around 25 GB every 5 minutes. At this rate it will take over 2 weeks to balance the cluster.

I am running the balancer with INFO-level logging and do not see any block failures during any iteration. I am also monitoring disk and network I/O during the balancer runs, and we are nowhere near saturating either.

Has anyone experienced behavior like this with the balancer?

For reference, these are the settings I'm using:

balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
balancer.Balancer: dfs.balancer.moverThreads = 12000 (default=1000)
balancer.Balancer: dfs.balancer.dispatcherThreads = 400 (default=200)
balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 32 (default=5)
balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648)
balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 104857600 (default=10485760)
balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
balancer.Balancer: dfs.blocksize = 268435456 (default=134217728)

hdfs dfsadmin -setBalancerBandwidth 50000000

Here is some of the output from the balancer runs, all with the same settings.
Columns: time stamp, iteration number, bytes already moved, bytes left to move, bytes being moved.

***************************************************************************************************
May 25, 2019 6:00:19 AM    0    186.66 GB    137.33 TB    300 GB
May 25, 2019 6:05:10 AM    1    358.00 GB    137.27 TB    300 GB
May 25, 2019 6:09:00 AM    2    536.24 GB    137.10 TB    300 GB
***************************************************************************************************
May 27, 2019 4:01:46 AM    0    63.69 GB     112.29 TB    300 GB
May 27, 2019 4:05:33 AM    1    132.05 GB    112.28 TB    300 GB
May 27, 2019 4:09:24 AM    2    213.01 GB    112.25 TB    300 GB
***************************************************************************************************
May 28, 2019 4:35:39 PM    0    25.34 GB     95.48 TB     300 GB
May 28, 2019 4:39:34 PM    1    54.76 GB     95.46 TB     300 GB
May 28, 2019 4:43:23 PM    2    88.69 GB     95.45 TB     300 GB
***************************************************************************************************

Thanks,
Joey
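For anyone who wants to compare the three runs numerically, the rates and remaining-time estimates implied by the output above can be worked out with a short script. This is only a rough sketch in Python, not part of any Hadoop tooling: the names `RUNS`, `FMT`, and `summarize` are made up for illustration, the rows are typed in by hand from the output above, and it assumes 1 TB = 1024 GB and that the rate between the first and last printed row of a run is representative of the whole run.

```python
from datetime import datetime

# Timestamp layout used in the balancer output above (English month names assumed).
FMT = "%b %d, %Y %I:%M:%S %p"

# Rows typed in by hand from the output above:
# (time stamp, iteration, GB already moved, TB left to move)
RUNS = {
    "May 25": [
        ("May 25, 2019 6:00:19 AM", 0, 186.66, 137.33),
        ("May 25, 2019 6:05:10 AM", 1, 358.00, 137.27),
        ("May 25, 2019 6:09:00 AM", 2, 536.24, 137.10),
    ],
    "May 27": [
        ("May 27, 2019 4:01:46 AM", 0, 63.69, 112.29),
        ("May 27, 2019 4:05:33 AM", 1, 132.05, 112.28),
        ("May 27, 2019 4:09:24 AM", 2, 213.01, 112.25),
    ],
    "May 28": [
        ("May 28, 2019 4:35:39 PM", 0, 25.34, 95.48),
        ("May 28, 2019 4:39:34 PM", 1, 54.76, 95.46),
        ("May 28, 2019 4:43:23 PM", 2, 88.69, 95.45),
    ],
}


def summarize(rows):
    """Return (GB moved per minute, estimated days remaining) for one run.

    The rate is the change in "bytes already moved" between the first and
    last row, divided by the wall-clock time between those two rows.
    """
    first, last = rows[0], rows[-1]
    start = datetime.strptime(first[0], FMT)
    end = datetime.strptime(last[0], FMT)
    elapsed_min = (end - start).total_seconds() / 60
    gb_per_min = (last[2] - first[2]) / elapsed_min
    remaining_gb = last[3] * 1024  # assumption: binary units, 1 TB = 1024 GB
    days_left = remaining_gb / gb_per_min / (60 * 24)
    return gb_per_min, days_left


for name, rows in RUNS.items():
    gb_per_min, days_left = summarize(rows)
    print(f"{name}: {gb_per_min * 5:.1f} GB per 5 min, ~{days_left:.1f} days left")
```

Because iterations in these runs take a little under 5 minutes each, the per-5-minute figures from this calculation will not line up exactly with a per-iteration reading of the table, so treat the result as an order-of-magnitude check rather than a precise forecast.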