We have 3 c1.xlarge instances with Kafka brokers installed behind an Elastic Load Balancer in AWS. Every minute we lose some events because of the following exception:

- Disconnecting from dualstack.kafka-xyz.us-east-1.elb.amazonaws.com:9092
- Error in handling batch of 64 events
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
        at sun.nio.ch.IOUtil.write(IOUtil.java:75)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at kafka.network.BoundedByteBufferSend.writeTo(BoundedByteBufferSend.scala:51)
        at kafka.network.Send$class.writeCompletely(Transmission.scala:76)
        at kafka.network.BoundedByteBufferSend.writeCompletely(BoundedByteBufferSend.scala:25)
        at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:88)
        at kafka.producer.SyncProducer.send(SyncProducer.scala:87)
        at kafka.producer.SyncProducer.multiSend(SyncProducer.scala:128)
        at kafka.producer.async.DefaultEventHandler.send(DefaultEventHandler.scala:52)
        at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:46)
        at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:119)
        at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:98)
        at kafka.producer.async.ProducerSendThread$$anonfun$processEvents$3.apply(ProducerSendThread.scala:74)
        at scala.collection.immutable.Stream.foreach(Stream.scala:254)
        at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:73)
        at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:43)
- Connected to dualstack.kafka-xyz.us-east-1.elb.amazonaws.com:9092 for producing

Has anybody faced this kind of timeout before? Does it indicate any resource misconfiguration? The CPU usage on the brokers is pretty small. Also, in spite of setting the batch size to 100, the failing batch usually only has 50 to 60 events. Is there any other limit I am hitting?

> Hi all,
>
> We are sending our ad impressions to Kafka 0.7.0. I am using async
> producers in our web app. I am pooling Kafka producers with commons
> pool. Pool size - 10. batch.size is 100.
>
> We have 3 c1.xlarge instances with Kafka brokers installed behind an
> Elastic Load Balancer in AWS. Every minute we lose some events because
> of the following exception:
>
> - Disconnecting from dualstack.kafka-xyz.us-east-1.elb.amazonaws.com:9092
> - Error in handling batch of 64 events
> java.io.IOException: Connection timed out
> [stack trace as above]
> - Connected to dualstack.kafka-xyz.us-east-1.elb.amazonaws.com:9092 for producing
>
> Has anybody faced this kind of timeout before? Does it indicate any
> resource misconfiguration? The CPU usage on the brokers is pretty small.
> Also, in spite of setting the batch size to 100, the failing batch
> usually only has 50 to 60 events. Is there any other limit I am hitting?
>
> Any help is appreciated.
>
> Regards,
> Vaibhav
> GumGum

1) Reduce the producer pool size to 1 or 2, because it looks like connections are sitting idle. Our volume does not call for that big a pool.
2) Reduce the batch size so that the webapp dumps the data to the brokers more frequently. It's better for us anyway.
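For reference, a sketch of what the relevant 0.7-era producer properties might look like after these changes. The property names are the 0.7-style settings; the values are illustrative only, and the pool size itself is a commons-pool setting on the application side, not a Kafka property:

```properties
# Illustrative 0.7-style async producer config (values are examples only)
producer.type=async
batch.size=30          # was 100; smaller batches reach the brokers more often
queue.time=5000        # ms the async queue waits before flushing
num.retries=1          # retry on send exceptions (defaults to 0)
```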

I wrote a test producer to check whether num.retries is working or not. I found that it is not: no matter how many retries I set, whenever a message send fails, the message never gets to the broker. I am using Kafka 0.7.0.

Is this a known problem? Do I need to file a JIRA issue?

Because we are using the async producer, we have no way to catch the exception ourselves and act on it. Is that right? Any ideas how we can ensure that every single message is sent, with retries?

Do producers currently leave the sockets to the brokers open indefinitely?

It might make sense to add a second producer config param similar to "reconnect.interval" which limits on time instead of message count (and then reconnect based on whichever criterion is hit first). Folks going through ELBs on AWS could set reconnect.interval.sec to something like 50 sec as a workaround for low-volume producers.
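A minimal sketch of the proposed dual-criterion policy. Note that reconnect.interval.sec is the hypothetical time-based setting being proposed here; the 0.7 producer only has the count-based reconnect.interval:

```python
import time

class ReconnectPolicy:
    """Force a reconnect when either a message-count limit or an
    elapsed-time limit is reached, whichever comes first. The time-based
    limit is the proposed (hypothetical) reconnect.interval.sec."""

    def __init__(self, max_messages, max_seconds, clock=time.time):
        self.max_messages = max_messages   # like reconnect.interval
        self.max_seconds = max_seconds     # like the proposed reconnect.interval.sec
        self.clock = clock
        self.sent = 0
        self.connected_at = clock()

    def record_send(self):
        self.sent += 1

    def should_reconnect(self):
        # Whichever criterion trips first wins.
        return (self.sent >= self.max_messages or
                self.clock() - self.connected_at >= self.max_seconds)

    def reset(self):
        # Called after reopening the socket to the broker.
        self.sent = 0
        self.connected_at = self.clock()
```

With max_seconds set to ~50, a low-volume producer would cycle its connection before the ELB's idle timeout closes it underneath.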

> Do producers currently leave the sockets to the brokers open indefinitely?
>
> It might make sense to add a second producer config param similar to
> "reconnect.interval" which limits on time instead of message count
> (and then reconnect based on whichever criterion is hit first). For
> folks going through ELBs on AWS, they'd set the reconnect.interval.sec
> to something like 50 sec as a workaround for low-volume producers.
>
> - Niek
>
> On Tue, Jun 26, 2012 at 4:52 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
> > Set num.retries in producer config property file. It defaults to 0.
> >
> > Thanks,
> >
> > Jun

I don't think the num.retries (0.7.1) is working. Here is how I tested it.

I wrote a simple producer that sends messages with the following strings: "_____1_____", "_____2_____", ... As you can see, all the messages are sequential. I tailed the topic log on the broker. After sending each message, I added a Thread.sleep for 15 seconds.

Every time I send a message, it immediately appears in the broker log. But if I restart the broker to simulate a producer connection drop (during the 15-second producer sleep period), it prints the following message in the logs:

But the message that was sent right after the broker restart never reaches the broker. The message after that (the 2nd message after the restart) gets to the broker fine and the sequence continues. Thus, if I restart the broker in the sleep period between messages 4 and 5, I don't get message 5; I get messages 1, 2, 3, 4, 6, 7, ...

I tried setting num.retries to 1 and 2, thinking that on the first retry it might reconnect and on the second retry it would resend the message. But that doesn't work; the number of retries doesn't improve the situation.

Can you see any flaw in my testing? What can I do to better test this scenario? How can I ensure that no messages are dropped? I don't think I am losing the message because it was only in broker memory. Please correct me if I am wrong.

(The batch size doesn't come into play here because, I think, the flush time is small - 5 seconds - so it sends every message as it comes.) I am sleeping for 15 seconds between messages.

Here is my broker output (each message is followed by a few bytes of binary framing, omitted here):

_____0_____
_____1_____
_____2_____
_____3_____
_____4_____
_____6_____
_____7_____
_____8_____
_____9_____

Notice that number 5 is missing. I restarted the broker between 4 and 5, and you can see that message 5 is missing. On the producer, for some reason, the error appears between 6 and 7. Don't know why.

Just to clarify: num.retries > 0 does not guarantee that all messages will be received at the broker. It guarantees a retry on exceptions - so it cannot handle the corner case where the broker goes down after the message is written to the socket buffer but before the buffer is flushed (in which case no exception is thrown). This is addressed in 0.8 with producer acks.
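The distinction can be illustrated with a small simulation (this is not Kafka code; send() stands in for the producer's socket write):

```python
def send_with_retries(send, message, num_retries):
    """Sketch of the num.retries semantics described above: resend only
    when send() raises. If the broker dies after the bytes land in the
    socket buffer, send() returns normally, no retry happens, and the
    message is silently lost."""
    for attempt in range(num_retries + 1):
        try:
            send(message)
            return True          # no exception: the producer assumes success
        except IOError:
            continue             # retry only on an observed failure
    return False
```

In the message-5 scenario the write "succeeded" at the socket level, so the loop above never sees an exception and never resends; message 6 hit an IOException and was retried, which is why it got through.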

That said, you have a fairly large interval between messages, so this is rather surprising. It might help to correlate this with the broker-side logs to see whether the "Message sent" for message 5 was actually received on the broker.

Just to remove all the variables around me restarting the broker, I did a test with Amazon ELB (0.7.1 producer and 0.7.0 broker). Thus, no broker restarts; the connections were getting broken because the Amazon ELB was closing them.

I found exactly the same result. In spite of specifying num.retries and reconnect.time.interval.ms = 50000, we lose one batch. I understand that num.retries does not guarantee that all messages will be sent, but I feel it should handle this case. Please let me know if my expectation is unjust.
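For anyone else producing through an ELB, the settings in question look like this (values taken from the test above; the 50-second interval assumes the ELB's default idle timeout of roughly 60 seconds, and as described above it did not by itself prevent the dropped batch):

```properties
# Settings used in the ELB test above (0.7.1 producer)
reconnect.time.interval.ms=50000   # reconnect before the ELB idle timeout fires
num.retries=2                      # resend only on observed send exceptions
```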

From the log, it seems that while message 6 was being sent, it hit an exception (potentially due to the broker being down) and caused a resend, and message 6 reached the broker. When message 5 was sent, it didn't hit any exception, so there was no resend. The reason message 5 didn't reach the broker could be that the broker was shut down before the producer socket buffer was flushed.