The caller of Producer.send() should also be able to tell whether a send failure is recoverable (that is, whether it might succeed with more retries). Otherwise it is hard for the client developer to guess the right value for message.send.max.retries, since a transient error, such as a restarting broker, can last an unknown amount of time. If I want to implement guaranteed delivery semantics, the client needs enough information to decide whether to keep retrying a message or give up.

This could be done by having Producer.send() throw different exception types (that is, different flavors of FailedToSendMessageException), for example UnrecoverableFailedToSendException or RetriesExhaustedFailedToSendException (perhaps with shorter names). Both could be subclasses of FailedToSendException.
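A rough sketch of what that hierarchy could look like from the caller's side, in Java. The class names are the ones proposed above and are not existing Kafka classes; the Sender interface and the sendUntilDone helper are made up for illustration, and a real client would also back off between attempts.

```java
// Hypothetical base type for send failures (proposed above, not an existing Kafka class).
class FailedToSendException extends RuntimeException {
    FailedToSendException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Retrying cannot help, e.g. the message exceeds the size limit.
class UnrecoverableFailedToSendException extends FailedToSendException {
    UnrecoverableFailedToSendException(String message, Throwable cause) {
        super(message, cause);
    }
}

// The producer gave up after its configured retries; another round might succeed.
class RetriesExhaustedFailedToSendException extends FailedToSendException {
    RetriesExhaustedFailedToSendException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class GuaranteedSend {
    // Stand-in for Producer.send(); hypothetical, for illustration only.
    interface Sender {
        void send(String message);
    }

    // Keeps calling send() while the failure is recoverable; returns the attempts used.
    static int sendUntilDone(Sender sender, String message, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                sender.send(message);
                return attempt;
            } catch (UnrecoverableFailedToSendException e) {
                throw e; // more retries cannot succeed, so give up immediately
            } catch (RetriesExhaustedFailedToSendException e) {
                if (attempt >= maxAttempts) {
                    throw e; // the caller's own budget is spent
                }
                // otherwise fall through and try again
            }
        }
    }
}
```

The point is that the catch clauses alone let the client decide between retrying and giving up, without parsing error messages.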

Another approach might be to have FailedToSendException carry information, such as how many retries were attempted and whether the message might be recoverable with more retries. It should also wrap the root cause, so that debugging is possible.
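As a sketch of this second approach, assuming a single exception type: the field names and accessors below are hypothetical and only illustrate the kind of information the exception could expose.

```java
// Hypothetical single exception type that carries retry details and the root cause.
class FailedToSendException extends RuntimeException {
    private final int retriesAttempted;
    private final boolean recoverable;

    FailedToSendException(String message, int retriesAttempted,
                          boolean recoverable, Throwable cause) {
        super(message, cause); // keep the root cause available through getCause()
        this.retriesAttempted = retriesAttempted;
        this.recoverable = recoverable;
    }

    // How many retries the producer made before giving up.
    int getRetriesAttempted() {
        return retriesAttempted;
    }

    // Whether more retries might succeed.
    boolean isRecoverable() {
        return recoverable;
    }
}
```

A caller could then check isRecoverable() before retrying and log getCause() when it gives up.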

> Producer should not retry on non-recoverable error codes
> --------------------------------------------------------
>
> Key: KAFKA-998
> URL: https://issues.apache.org/jira/browse/KAFKA-998
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8, 0.8.1
> Reporter: Joel Koshy
> Assignee: Guozhang Wang
> Attachments: KAFKA-998.v1.patch
>
>
> Based on a discussion with Guozhang. The producer currently retries on all error codes (including messagesizetoolarge which is pointless to retry on). This can slow down the producer unnecessarily.
> If at all we want to retry on that error code we would need to retry with a smaller batch size, but that's a separate discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Also, a producer can throw QueueFullException. From the client's point of view, this should also be a retryable situation (depending on load). So it might make sense for QueueFullException to be a subclass of FailedToSendException, and more likely a subclass of RetriesExhaustedFailedToSendException (or whatever better name that ends up with).

Ok, I filed KAFKA-1025 to track the issue for reasoning about whether a failed send should be recoverable.

Apologies for the late review. A couple of comments:

* I think this could reset needRetry back to false even if subsequent partitions in the iteration do need a retry: needRetry = needRetry && !fatalException(topicPartitionAndError._2). The logic is actually a bit confusing. Instead, it might be clearer to just do: failedTopicPartitions.exists(<some entry for which we need to retry>)
* Can you enhance the logging a bit to indicate that there were fatal sends that will not be retried? For example, "Dropping messages to topic x due to message size limit.." or something like that.
* Can you rebase?
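To make the first point concrete, here is a small Java sketch of how an && accumulation can lose a needed retry. It assumes the flag starts out true, which may not match the actual patch, and the SendError enum is a hypothetical stand-in for Kafka's numeric error codes.

```java
import java.util.Map;

final class NeedRetrySketch {
    // Hypothetical error kinds standing in for Kafka's numeric error codes.
    enum SendError { LEADER_NOT_AVAILABLE, MESSAGE_SIZE_TOO_LARGE }

    // Assumption: only the size-limit error counts as fatal here.
    static boolean isFatal(SendError error) {
        return error == SendError.MESSAGE_SIZE_TOO_LARGE;
    }

    // Mirrors the accumulation under review: one fatal entry forces the
    // result to false, even if other partitions still need a retry.
    static boolean needRetryAccumulated(Map<String, SendError> failed) {
        boolean needRetry = true;
        for (SendError error : failed.values()) {
            needRetry = needRetry && !isFatal(error);
        }
        return needRetry;
    }

    // The suggested alternative: retry if any failed entry is retryable.
    static boolean needRetryExists(Map<String, SendError> failed) {
        return failed.values().stream().anyMatch(error -> !isFatal(error));
    }
}
```

With one fatal and two retryable failures, the accumulated version reports that no retry is needed while the exists-style version reports that one is.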

Thanks for the patch Joel. Do you mean rebase on 0.8 (it was originally on trunk)?

Oh, I thought this was for 0.8, but it does apply on trunk. Do people think this is small and important enough to apply to 0.8?

Another comment after thinking about the patch: in dispatchSerializedData, would it be better to just drop data that has hit the message size limit? That way there is no need to return needRetry, so the dispatchSerializedData signature remains the same. The disadvantage is that we won't propagate a FailedToSendMessageException for such messages to the caller. For the producer in async mode that is probably fine, since right now the caller can't really do much with that exception. In sync mode the caller could perhaps decide to send fewer messages at once, but even then we don't really say which topics or messages hit the message size limit, so I think it is fine in that case as well. Furthermore, this would be covered by KAFKA-1026 to a large degree.
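A minimal Java sketch of the drop-on-size-limit idea, under stated assumptions: FailedSend, SendError and the retryable helper are hypothetical and are not the actual dispatchSerializedData code. Fatal failures are logged and dropped, and only the remaining ones are handed back for another attempt, so no separate needRetry flag is needed.

```java
import java.util.ArrayList;
import java.util.List;

final class DropOversizedSketch {
    // Hypothetical error kinds standing in for Kafka's numeric error codes.
    enum SendError { LEADER_NOT_AVAILABLE, MESSAGE_SIZE_TOO_LARGE }

    // Hypothetical record of one failed send.
    static final class FailedSend {
        final String topic;
        final SendError error;

        FailedSend(String topic, SendError error) {
            this.topic = topic;
            this.error = error;
        }
    }

    // Returns only the sends worth retrying; oversized ones are logged and dropped.
    static List<FailedSend> retryable(List<FailedSend> failed, List<String> log) {
        List<FailedSend> retry = new ArrayList<>();
        for (FailedSend send : failed) {
            if (send.error == SendError.MESSAGE_SIZE_TOO_LARGE) {
                log.add("Dropping messages to topic " + send.topic
                        + " due to message size limit");
            } else {
                retry.add(send);
            }
        }
        return retry;
    }
}
```

The trade-off is the one described above: the caller never learns about the dropped messages except through the log.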

>> Do people think this is small and important enough to apply to 0.8?

+1.

Guozhang, do you mind submitting a reviewboard request?

10. ErrorMapping.fatalException(): Should we rename it to unrecoverableException? MessageSizeTooLarge doesn't seem like a fatal exception.

I am not sure it's worth patching this in 0.8. The workaround is to reduce the batch size, as well as reducing the number of retries and the retry interval.
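For reference, the workaround could look roughly like this in a Java client's producer properties. The values are illustrative only, not recommendations, and batch.num.messages applies to the async producer; the class and method names are made up for the example.

```java
import java.util.Properties;

final class ProducerWorkaroundConfig {
    // Builds producer properties with a smaller batch and less retrying.
    static Properties build() {
        Properties props = new Properties();
        // Smaller batches (illustrative value).
        props.setProperty("batch.num.messages", "50");
        // Fewer retries, so a send that cannot succeed fails sooner (illustrative value).
        props.setProperty("message.send.max.retries", "1");
        // Shorter pause between retries, in milliseconds (illustrative value).
        props.setProperty("retry.backoff.ms", "50");
        return props;
    }
}
```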

Yeah, sorry, I forgot I had filed KAFKA-1025 to address my concerns about exposing recoverability.

Thanks,

Jason

[~junrao] Would it be an easier short-term fix to at least set the root cause on the FailedToSendMessageException, so that we could see the MessageSizeTooLargeException as the cause of the FTSME? Or is that not easy to do?

My feeling is that it may not be very easy to do a quick fix. Currently, the cause exceptions are swallowed in several places just so that we can pass back the unsuccessfully sent messages.
