What I've observed is that only one broker is able to finish the above successfully the first time around. In subsequent iterations, no broker is able to shut down using the admin command, and every time it fails with an error message reporting the same number of leaders on every broker.


Joel Koshy
added a comment - 16/Jan/13 19:15

I set up a local cluster of three brokers and created a bunch of topics with replication factor 2. I was able to do multiple iterations of rolling bounces without issue. Since this was local, I did not use your Python script, as it kills PIDs returned by ps.

Would you by any chance be able to provide a scenario to reproduce this locally? That said, I believe John Fung also tried to reproduce this in a distributed environment but was unable to do so, so I'll probably need to take a look at the logs in your environment.

Neha Narkhede
added a comment - 16/Jan/13 19:17

> Would you by any chance be able to provide a scenario to reproduce this locally?

I would suggest you try it out in a distributed environment set up with a large number of partitions and traffic. Since it is internal, I can pass the connection URL on to you.

Joel Koshy
added a comment - 18/Jan/13 19:00

I think this is why it happens:
https://github.com/apache/kafka/blob/03eb903ce223ab55c5acbcf4243ce805aaaf4fad/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L150

It could occur as follows. Suppose there's a partition 'P' assigned to brokers x and y; leaderAndIsr = y, {x, y}.
1. Controlled shutdown of broker x; leaderAndIsr -> y, {y}.
2. After the above completes, kill -15 and then restart broker x.
3. Immediately do a controlled shutdown of broker y, so now y is in the list of shutting-down brokers.
Due to the above, x will not start its follower fetcher to 'P' on broker y.
Adding sufficient wait time between (2) and (3) seems to address the issue (in your script there's no sleep), but we should handle it properly in the shutdown code. Will think about a fix for that.
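The faulty check can be sketched as follows (a minimal Python simulation with illustrative names; the actual logic lives in ReplicaStateMachine.scala linked above): the controller treats a leader that has merely begun controlled shutdown as unavailable, so the restarted follower never starts fetching.

```python
# Hypothetical simulation of the controller's fetcher-start check
# (names are illustrative; the real code is in kafka.controller.ReplicaStateMachine).

def can_start_fetcher_buggy(leader, live_brokers, shutting_down_brokers):
    # Buggy check: only brokers that are live AND not shutting down
    # qualify, so a still-alive leader that has merely begun controlled
    # shutdown is treated as unavailable.
    return leader in (live_brokers - shutting_down_brokers)

def can_start_fetcher_fixed(leader, live_brokers, shutting_down_brokers):
    # The committed fix effectively considers live-or-shutting-down
    # brokers: a broker that is shutting down but still holds
    # leadership remains a valid fetch target.
    return leader in (live_brokers | shutting_down_brokers)

# Scenario from the comment: partition P on brokers x and y, leader y.
live = {"x", "y"}          # both broker processes are up
shutting_down = {"y"}      # step 3: controlled shutdown of y begins

# Restarted broker x tries to start its follower fetcher to leader y:
print(can_start_fetcher_buggy("y", live, shutting_down))   # False
print(can_start_fetcher_fixed("y", live, shutting_down))   # True
```

With the buggy check, x never catches up to leader y, so y's controlled shutdown can never move its leaders anywhere, matching the observed "same number of leaders" failure.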

Joel Koshy
added a comment - 18/Jan/13 21:37

Here's a simple fix. I don't really see any good reason why we shouldn't allow starting a fetcher to a broker that is shutting down but not completely shut down yet, if a leader still exists on that broker.


Neha Narkhede
added a comment - 20/Jan/13 18:45

+1 on the fix. There is also a problem with the script I wrote: the fix is correct, but the script uses the shutdown command in a way that is neither recommended nor intended. It shuts down one broker, restarts it, and then, without waiting for the restart to complete and for the first broker to re-register itself in ZooKeeper, proceeds to shut down the next broker. Since the replication factor is 2, if both of these brokers were replicas for some partitions, those partitions become under-replicated, and the script is never able to shut down any other broker after that.
I think we should include this fix.
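A safer rolling-bounce loop can be sketched like this (a hedged Python sketch, not the actual script; `shutdown`, `restart`, and `is_registered` are caller-supplied placeholders, e.g. invoking the controlled-shutdown admin command and checking the broker's registration znode in ZooKeeper). The key difference from the failing script is waiting for each broker to re-register before bouncing the next one:

```python
import time

def wait_until(predicate, timeout_s=120.0, poll_s=1.0):
    """Poll `predicate` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return False

def rolling_bounce(broker_ids, shutdown, restart, is_registered):
    """Bounce brokers one at a time, waiting between bounces.

    The three callbacks are hypothetical hooks: `shutdown` runs the
    controlled-shutdown admin command, `restart` brings the broker
    process back, and `is_registered` checks that the broker has
    re-registered itself (e.g. its ZooKeeper registration exists).
    """
    for broker_id in broker_ids:
        shutdown(broker_id)
        restart(broker_id)
        # The step the original script skipped: wait for the restarted
        # broker to re-register before touching the next broker, so
        # RF=2 partitions are not left under-replicated mid-bounce.
        if not wait_until(lambda: is_registered(broker_id), poll_s=0.1):
            raise RuntimeError(f"broker {broker_id} did not re-register")
```

This is only a sketch of the ordering; in practice one would also want to wait for under-replicated partition counts to return to zero before proceeding.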


Joel Koshy
added a comment - 21/Jan/13 21:37

I committed the fix to 0.8 with a small edit: used the liveOrShuttingDownBrokers field.

Another small issue is that we send a stop-replica-fetchers request to the shutting-down broker even if controlled shutdown did not complete. This "prematurely" forces the broker out of the ISR of those partitions. I think it should be safe to avoid sending the stop-replica request if controlled shutdown has not completely moved leadership of partitions off the shutting-down broker.
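The suggested guard could look something like this (a hedged Python sketch with hypothetical names, not the controller's actual API): only send the stop-replica request once controlled shutdown has moved every leader off the shutting-down broker.

```python
# Hypothetical sketch of the suggested guard; `leadership` maps each
# partition to its current leader broker id, and `send_stop_replica`
# stands in for the controller's stop-replica request.

def partitions_led_by(broker_id, leadership):
    """Partitions whose current leader is `broker_id`."""
    return [p for p, leader in leadership.items() if leader == broker_id]

def maybe_send_stop_replica(broker_id, leadership, send_stop_replica):
    remaining = partitions_led_by(broker_id, leadership)
    if remaining:
        # Controlled shutdown is incomplete; sending stop-replica now
        # would prematurely force the broker out of these ISRs.
        return False
    send_stop_replica(broker_id)
    return True
```

Under this guard, a broker that still leads any partition keeps its replica fetchers, so an aborted controlled shutdown leaves the ISRs intact.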