Jay Kreps
added a comment - 17/Feb/14 23:47

One question to consider is how we want to handle retry backoff. Currently this code retries immediately. Some exceptions, like TimeoutException and NotLeaderForPartitionException, should indeed be retried immediately, while for others, like LeaderNotAvailableException, you might want to wait a bit.

If you think about it, I'm not really sure the user actually cares about the number of retries. Currently they care about whether retries are possible at all, because retries imply possible duplicates/loss. If we fixed that, then I suspect you would always want to retry within the timeout you had specified. To avoid hammering the server it might be good to back off, but this is ultimately not really the concern of the client.

The current implementation is simple, though. We might want to just take it as is.

The ideal implementation would (1) make retries idempotent, (2) always retry, (3) use an "exponential" backoff of something like 0, 1, 2, 4, 8, 16, etc. to avoid killing the server, and (4) bound this by the produce time limit the client specifies.

Jay Kreps
added a comment - 19/Feb/14 01:17

Cool, I posted an update that addresses your comments and will check in with that. If you do have a chance, I would appreciate a post-review of the metadata fetch node selection logic, as that is a bit complex.

Jun Rao
added a comment - 19/Feb/14 02:42

There is a compilation error after the checkin.

/Users/jrao/Intellij/kafka_gradle/core/src/main/scala/kafka/tools/newproducer/MirrorMaker.scala:87: value BLOCK_ON_BUFFER_FULL is not a member of object org.apache.kafka.clients.producer.ProducerConfig
    producerProps.setProperty(ProducerConfig.BLOCK_ON_BUFFER_FULL, "true")
                                             ^
four warnings found
one error found
:core:compileScala FAILED