I wish to use either the ConsumerConnector or the SimpleConsumer to read messages from all partitions across multiple brokers using a fixed number of threads (hopefully leveraging asynchronous IO for high performance).

I know that the ConsumerConnector sounds like what I need, but the documentation was not clear about a few things. Would somebody who has used it (or knows the code) be willing to answer these questions about it:

- Does the ConsumerConnector manage connections to multiple brokers, or just a single broker?
- Does the ConsumerConnector require a thread for each partition on each broker? (If not, how many threads does it require?)
- Does the ConsumerConnector use actual asynchronous IO, or does it mimic it by using a dedicated behind-the-scenes thread (and the traditional Java socket API)?

If the ConsumerConnector won't serve my purpose, I think I could use the SimpleConsumer to implement a version of this, but I worry that it's not thread safe. That is, can I use the same SimpleConsumer instance on Thread-A, then on Thread-B, and then again on Thread-A (though never at the same time)?

Thanks in advance!

In general, you should use the consumer connector - unless you have a good reason to load balance and manage offsets manually (which is taken care of in the consumer connector).

> - Does the ConsumerConnector manage connections to multiple brokers,
> or just a single broker?

Multiple brokers.

> - Does the ConsumerConnector require a thread for each partition on
> each broker? (If not, how many threads does it require?)

You can specify how many streams you want - if there are more partitions than threads, then a given thread can consume from multiple partitions. If there are more threads than available partitions, there will be idle threads.

> - Does the ConsumerConnector use actual asynchronous IO, or does it
> mimic it by using a dedicated behind-the-scenes thread (and the
> traditional java socket API)?

The consumer connector uses SimpleConsumers for each broker that it connects to. These consumers fetch from each broker and insert chunks into blocking queues which the consumer iterators then dequeue.

Joel
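To make the threading model concrete, here is a minimal sketch of the pattern described above, written against the 0.8-era Java high-level consumer API (package and property names differed slightly in 0.7; the ZooKeeper address, group id, and topic name are placeholders): ask for a fixed number of streams and hand each stream to its own thread.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class FixedThreadConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // placeholder ZK address
        props.put("group.id", "fixed-thread-group");      // placeholder group id

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Ask for a fixed number of streams; the connector balances all
        // partitions (across all brokers) over these streams.
        int numStreams = 4;
        Map<String, Integer> topicCountMap =
            Collections.singletonMap("my-topic", numStreams);
        List<KafkaStream<byte[], byte[]>> streams =
            connector.createMessageStreams(topicCountMap).get("my-topic");

        // One thread per stream: more partitions than streams means a
        // stream carries several partitions; more streams than partitions
        // means some threads sit idle.
        ExecutorService pool = Executors.newFixedThreadPool(numStreams);
        for (final KafkaStream<byte[], byte[]> stream : streams) {
            pool.submit(new Runnable() {
                public void run() {
                    for (MessageAndMetadata<byte[], byte[]> record : stream) {
                        handle(record.message()); // blocks until data arrives
                    }
                }
            });
        }
    }

    static void handle(byte[] message) { /* application logic */ }
}

The number of streams caps the consuming threads regardless of how many partitions or brokers exist, which is what the original question was after.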

“unless you have a good reason to load balance and manage offsets manually”

In general one consumer connector consumes more than one partition. On the client side, we want to get the partition offset for every message, so that if a crash happens (some messages have been fetched from Kafka but the results have not been flushed to disk) we can use the offset info to rewind the Kafka consumer.

Do you think this is a good reason to use SimpleConsumer rather than ConsumerConnector?

I think this is a common request, so is there any existing solution?

Thanks,
Yonghui


> Do you think this is a good reason to use SimpleConsumer rather than
> ConsumerConnector?

An alternative to using SimpleConsumer in this use case is to use the zookeeper consumer connector and turn off auto commit. After your consumer process is done processing a batch of messages you can call commitOffsets - the main caveat to be aware of is that if your consumer processes batches very fast, you would write to zookeeper that often - so in fact setting an autocommit interval and being willing to deal with duplicates is almost equivalent. KAFKA-657 would help, I think - since once that API is available you can store your offsets anywhere you like.

Joel
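A minimal sketch of the turn-off-auto-commit pattern suggested above, again assuming the 0.8-era property names ("auto.commit.enable"; 0.7 spelled it "autocommit.enable") and placeholder connection details. commitOffsets checkpoints everything the connector has handed out so far, so the batch size trades ZooKeeper write traffic against duplicates on restart.

import java.util.Collections;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class BatchCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // placeholder
        props.put("group.id", "batch-commit-group");      // placeholder
        props.put("auto.commit.enable", "false");         // commit manually instead

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        KafkaStream<byte[], byte[]> stream = connector
            .createMessageStreams(Collections.singletonMap("my-topic", 1))
            .get("my-topic").get(0);

        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        int inBatch = 0;
        while (it.hasNext()) {
            process(it.next().message());   // application processing/flush
            if (++inBatch == 1000) {        // batch boundary
                connector.commitOffsets();  // checkpoint everything consumed so far
                inBatch = 0;
            }
        }
    }

    static void process(byte[] message) { /* application logic */ }
}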

> An alternative to using SimpleConsumer in this use case is to use the
> zookeeper consumer connector and turn off auto commit.

Keep in mind that this works only if you don't care about controlling per-partition rewind capability. The high level consumer will not give you control over which partitions your consumer consumes and which partitions it commits the offsets for. If you need to rewind consumption for a subset of those partitions, then ZookeeperConsumerConnector will not work for you.

Thanks,
Neha

In order to support rollbacks and checkpoints, there would have to be a way both to supply partition offsets to the consumer before reading and to retrieve partition offsets from the consumer once reading is complete.

From what I've read here, it appears that neither the ConsumerConnector nor the ZookeeperConsumerConnector has either of those capabilities. In order to finely manage offsets, only the SimpleConsumer will work. Is that the correct interpretation?

--Tom
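For the fine-grained control described above, here is a rough sketch of checkpoint-driven consumption against the 0.7-era SimpleConsumer API (constructor and FetchRequest signatures changed in later releases; the broker address, topic, partition number, and the loadCheckpoint/saveCheckpoint helpers are placeholders standing in for your own durable store).

import kafka.api.FetchRequest;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.Message;
import kafka.message.MessageAndOffset;

public class CheckpointedFetcher {
    public static void main(String[] args) {
        // Socket timeout and buffer size are arbitrary example values.
        SimpleConsumer consumer =
            new SimpleConsumer("broker-host", 9092, 10000, 64 * 1024);
        long offset = loadCheckpoint();     // resume from the last flush checkpoint
        while (true) {
            ByteBufferMessageSet messages =
                consumer.fetch(new FetchRequest("my-topic", 0, offset, 1024 * 1024));
            for (MessageAndOffset mo : messages) {
                process(mo.message());
                offset = mo.offset();       // offset to resume from next time
            }
            saveCheckpoint(offset);         // persist only after processing is durable
        }
    }

    static long loadCheckpoint() { return 0L; }   // hypothetical helper
    static void saveCheckpoint(long offset) { }   // hypothetical helper
    static void process(Message message) { }      // application logic
}

Because the application owns the offset, it can rewind any single partition to any saved checkpoint - exactly what the high-level consumer does not allow.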


1. The offset stored in ZK is only used when the consumer connects again.
2. Joel's suggestion that "in fact setting an autocommit interval and being willing to deal with duplicates is almost equivalent" makes sense. But if a crash happens just after an offset is committed, then unprocessed messages in the consumer will be skipped after reconnecting.

Please correct me if I am wrong. In ConsumerConnector, if the ConsumerIterator could return the partition offset together with each message, then we could save the offset on the client side and commit the offset only after all the messages before that offset are done (with autoCommit turned off). I roughly went through the code; to use this option I would need to change some code.

Another option is to use SimpleConsumer, as we discussed before, but this option requires more code work on the client side, since one consumer may have more than one SimpleConsumer. We would need to manage these consumers with ZK and merge the results from each one.

I lean toward option 1.

Thanks,
Yonghui

From: Neha Narkhede <[EMAIL PROTECTED]>
Date: Friday, December 21, 2012, 2:13 AM
To: <[EMAIL PROTECTED]>
Cc: Yonghui Zhao <[EMAIL PROTECTED]>
Subject: Re: Proper use of ConsumerConnector

> An alternative to using SimpleConsumer in this use case is to use the
> zookeeper consumer connector and turn off auto commit.

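As an aside, here is roughly what Yonghui's option 1 could look like if the iterator did return the partition and offset with each message, as the 0.8-era MessageAndMetadata later did (the iterator discussed in this thread did not expose these, which is exactly the code change Yonghui mentions; the indexer helpers are hypothetical stand-ins for something like senseidb).

import java.util.HashMap;
import java.util.Map;

import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class FlushCheckpointer {
    // Highest offset handed to the indexer, tracked per partition.
    private final Map<Integer, Long> pending = new HashMap<Integer, Long>();

    void consume(ConsumerConnector connector, KafkaStream<byte[], byte[]> stream) {
        for (MessageAndMetadata<byte[], byte[]> record : stream) {
            handOffToIndexer(record.message());               // e.g. senseidb ingestion
            pending.put(record.partition(), record.offset()); // save offset client-side
            if (indexerFlushedToDisk()) {
                persistWithCheckpoint(pending); // store offsets beside the flushed data
                connector.commitOffsets();      // commit to ZK only after the flush
                pending.clear();
            }
        }
    }

    void handOffToIndexer(byte[] message) { }                  // hypothetical helper
    boolean indexerFlushedToDisk() { return false; }           // hypothetical helper
    void persistWithCheckpoint(Map<Integer, Long> offsets) { } // hypothetical helper
}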

Does the ConsumerConnector keep track of the offsets of data downloaded from the server (and queued for consumption by the end user of the API), or does it keep track of the actual offset that has been consumed by the end user?

--Tom

On Fri, Dec 21, 2012 at 10:37 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>> But if crash happens just after offset committed, then unprocessed
>> message in consumer will be skipped after reconnected.
>
> If the consumer crashes, you will get duplicates, not lose any data.
>
> Thanks,
> Neha


In our project we use senseidb to consume Kafka data. Senseidb processes each message immediately but doesn't flush to disk immediately. So if senseidb crashes, all results not yet flushed will be lost, and we want to rewind Kafka; the offset we want to rewind to is the flush checkpoint. In this case, we will lose some data.

Sent from my iPad


It seems that a common thread is that while ConsumerConnector works well for the standard case, it just doesn't work for any case where manual offset management (explicit checkpoints, rollbacks, etc.) is required.

If any Kafka devs are looking for a way to improve it, I think modifying it to be more modular regarding offset management would be great! You could provide an interface for loading/committing offsets, then provide a ZK implementation as the default. It would be backwards compatible, but be useful in all of the use cases where explicit offset management is required.

(Of course, I know I'm just an armchair Kafka dev, so there may be reasons why this won't work, or would be an extremely low priority, or...)

--Tom
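Purely as an illustration of the proposal above, a hypothetical sketch of what such a pluggable offset-storage interface might look like - none of these names existed in the client at the time, and KAFKA-657 later explored this direction.

import java.util.Map;

/** Hypothetical pluggable offset store; a ZK-backed implementation
 *  would be the default, but any durable store could be plugged in. */
public interface OffsetStore {
    /** Called on startup/rebalance: where should each partition resume? */
    Map<TopicPartition, Long> loadOffsets(String groupId);

    /** Called on commit: make these offsets durable. */
    void commitOffsets(String groupId, Map<TopicPartition, Long> offsets);
}

/** Simple value class identifying one partition of a topic. */
final class TopicPartition {
    final String topic;
    final int partition;

    TopicPartition(String topic, int partition) {
        this.topic = topic;
        this.partition = partition;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof TopicPartition)) return false;
        TopicPartition other = (TopicPartition) o;
        return partition == other.partition && topic.equals(other.topic);
    }

    @Override public int hashCode() {
        return 31 * topic.hashCode() + partition;
    }
}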


Besides this, work has started on scaling the offset storage for the consumer as part of this JIRA - https://issues.apache.org/jira/browse/KAFKA-657. It is true that the team is currently focused on developing and stabilizing replication, but we welcome ideas and contributions to the consumer client re-design project as well.

