Kafka Replica out-of-sync for over 24 hrs

Is there a way to force the replica to catch up to the leader? The replica has been out of sync for over 24 hrs. I tried restarting and I don't see any movement. I tried moving the replica to a different broker, but that does not work; the reassignment is stuck. I also created an additional replica, and that command is also stuck waiting for the out-of-sync replica to catch up to the leader.

Re: Kafka Replica out-of-sync for over 24 hrs

I'm not sure there's a way to force it to sync here. From what you're describing and the error you shared, I think what's happening is that the replica fetcher thread fails and the broker stops replicating data from the leader. That would explain why you see the broker out of sync for a long time.
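To confirm which replicas are actually outside the ISR, it can help to compare the Replicas and Isr columns for each partition. Below is a minimal sketch, assuming the describe output of the kafka-topics tool has the usual "Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ..." line layout (the exact spacing can differ between versions). The function names and the "events" topic are hypothetical, made up for illustration.

```python
import re

# Matches one partition line of describe output, for example:
#   Topic: events  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,3
# Assumption: this layout matches your Kafka version; adjust if it does not.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+|none)\s+"
    r"Replicas:\s*(?P<replicas>[\d,]*)\s+Isr:\s*(?P<isr>[\d,]*)"
)


def parse_ids(csv):
    """Turn a comma-separated broker list such as '1,2,3' into [1, 2, 3]."""
    return [int(x) for x in csv.split(",") if x]


def out_of_sync_replicas(describe_output):
    """Return {(topic, partition): [broker ids listed in Replicas but not in Isr]}."""
    lagging = {}
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip the per-topic summary line and anything unrecognized
        replicas = parse_ids(m.group("replicas"))
        isr = set(parse_ids(m.group("isr")))
        missing = [b for b in replicas if b not in isr]
        if missing:
            lagging[(m.group("topic"), int(m.group("partition")))] = missing
    return lagging
```

Feeding it the saved describe output gives a map of partitions to the broker ids that are assigned but not in sync, which should point at the broker whose fetcher thread has stopped.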

Are you using Cloudera's distribution of Kafka or is this Apache Kafka?

Re: Kafka Replica out-of-sync for over 24 hrs

It's a fairly new issue that I personally haven't seen before with any of the current customers running the Cloudera Distribution of Kafka, but the latest released versions (Cloudera Distribution of Kafka 3.1.1 and Kafka in CDH 6.0) are based on Apache Kafka 1.0.1. The plan for CDH 6.1 is to rebase Cloudera Kafka onto Apache Kafka 2.0, so it's probably just a matter of time until this becomes a more common issue.

You mentioned that restarting the Kafka service would then cause the problematic partitions to change. Is that the case when you only shut down a single broker and start it up again? I'm asking because one potential way to work around this is to identify which broker is lagging behind and not joining the ISR, shut down that broker, delete the topic partition data (for the affected partitions only) from disk, and then start the broker up again.

The broker will start and self-heal by replicating all the data from the current leader of those partitions. Obviously this can take a long time, depending on how many partitions are affected and how much data needs to be replicated.
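For the delete step, here is a minimal sketch of how the cleanup could be limited to the affected partitions. It relies on Kafka keeping each partition in a directory named <topic>-<partition> under each log.dirs entry. The function names, the dry_run flag, and whatever log_dir value you pass in are hypothetical, and the broker must be stopped before anything is removed.

```python
import os
import shutil


def partition_dirs(log_dir, partitions):
    """Map each (topic, partition) pair to its directory under one log dir."""
    return [os.path.join(log_dir, f"{topic}-{partition}")
            for topic, partition in partitions]


def remove_partition_data(log_dir, partitions, dry_run=True):
    """Remove only the listed partition directories and return their paths.

    With dry_run=True (the default) nothing is deleted; the function only
    reports which existing directories would be removed.
    Assumption: the broker that owns log_dir has already been shut down.
    """
    targets = [p for p in partition_dirs(log_dir, partitions) if os.path.isdir(p)]
    if not dry_run:
        for path in targets:
            shutil.rmtree(path)
    return targets
```

Running it once with dry_run=True and checking the returned paths before running it for real is a cheap way to make sure nothing else under the data directory gets touched. If the broker has several log.dirs entries, the same call would need to be repeated for each one.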

Re: Kafka Replica out-of-sync for over 24 hrs

Just to be clear, you're only deleting data for the specific partitions that are impacted and not everything under the broker's data directory. I wasn't sure what you meant by rm -rf here, so I wanted to clarify.