We are starting to use Kafka in production but we found an unexpected (at least for me) behavior with the use of partitions. We have a bunch of topics with a few partitions each. We try to consume all data from several consumers (just one consumer group).

The problem is in the rebalance step. The rebalance splits the partitions per topic between all consumers. So if you have 100 topics but only 2 partitions each and 10 consumers, only two consumers will be used. That is, for each topic all partitions will be listed and shared between the consumers in the consumer group in order (not randomly).
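To make the numbers above concrete, here is a minimal sketch of a per-topic split. It is not the Kafka 0.7.1 rebalance code; the function name `assign_per_topic` and the topic and consumer names are hypothetical, and the split simply gives each topic's partitions, in order, to the first consumers in sorted order.

```python
from collections import defaultdict


def assign_per_topic(topics, consumers):
    """Split each topic's partitions across the sorted consumers, one topic at a time.

    topics: dict mapping topic name -> number of partitions (hypothetical input shape).
    consumers: list of consumer ids in the same consumer group.
    """
    assignment = defaultdict(list)
    ordered = sorted(consumers)
    for topic, num_partitions in sorted(topics.items()):
        # Every topic is divided independently of all other topics.
        per_consumer, extra = divmod(num_partitions, len(ordered))
        start = 0
        for position, consumer in enumerate(ordered):
            count = per_consumer + (1 if position < extra else 0)
            for partition in range(start, start + count):
                assignment[consumer].append((topic, partition))
            start += count
    return dict(assignment)


# The scenario from the message: 100 topics, 2 partitions each, 10 consumers.
topics = {f"topic-{i:03d}": 2 for i in range(100)}
consumers = [f"consumer-{i:02d}" for i in range(10)]

per_topic = assign_per_topic(topics, consumers)
print(len(per_topic), "of", len(consumers), "consumers own partitions")
```

Because every topic is split from the start of the same sorted consumer list, the same two consumers receive a partition from each topic and the other eight receive nothing.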

This behavior is also described in algorithm 1 of the original kafka paper [1].

I don't understand this decision. Why is it split by topic? Does it make sense to divide all partitions from all topics between all the consumers in the consumer group? I don't see the reason for this so I would like to hear your opinion before changing the code.

Currently, partition is the smallest unit that we distribute data among consumers (in the same consumer group). So, if the # of consumers is larger than the total number of partitions in a Kafka cluster (across all brokers), some consumers will never get any data. Such a decision is done on a per topic basis. If a consumer consumes multiple topics, it would make sense to divide partitions across all topics to consumers. We haven't done that yet. Part of the reason is that we need to figure out how to balance the data across topics since they can be of different sizes. We can look into that post 0.8.

For now, the solution is to increase the number of partitions on the broker.
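The open question above, how to balance topics of different sizes, can be illustrated with a small sketch. This is only one possible heuristic (give the heaviest remaining partition to the least-loaded consumer), not something Kafka implements; the function name `assign_by_load`, the topic names, and the per-partition message rates are all hypothetical.

```python
import heapq


def assign_by_load(partition_rates, consumers):
    """Greedy sketch: hand the heaviest remaining partition to the least-loaded consumer.

    partition_rates: dict mapping (topic, partition) -> estimated messages/sec (assumed known).
    """
    heap = [(0.0, consumer) for consumer in sorted(consumers)]
    heapq.heapify(heap)
    assignment = {consumer: [] for consumer in consumers}
    # Heaviest partitions first; ties are broken by (topic, partition) for determinism.
    ordered = sorted(partition_rates.items(), key=lambda item: (-item[1], item[0]))
    for partition, rate in ordered:
        load, consumer = heapq.heappop(heap)
        assignment[consumer].append(partition)
        heapq.heappush(heap, (load + rate, consumer))
    return assignment


# Hypothetical rates: one busy topic and two quiet ones, 2 partitions each.
rates = {
    ("clicks", 0): 900.0, ("clicks", 1): 900.0,
    ("logins", 0): 100.0, ("logins", 1): 100.0,
    ("errors", 0): 50.0, ("errors", 1): 50.0,
}
by_load = assign_by_load(rates, ["c1", "c2", "c3"])
loads = {c: sum(rates[p] for p in owned) for c, owned in by_load.items()}
print(loads)
```

Even with this heuristic the result is uneven, because a partition is the smallest unit that can be moved: a single heavy partition cannot be shared between two consumers.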

On Mon, Jan 7, 2013 at 9:03 AM, Pablo Barrera González <[EMAIL PROTECTED]> wrote:

> Hello
>
> We are starting to use Kafka in production but we found an unexpected (at least for me) behavior with the use of partitions. We have a bunch of topics with a few partitions each. We try to consume all data from several consumers (just one consumer group).
>
> The problem is in the rebalance step. The rebalance splits the partitions per topic between all consumers. So if you have 100 topics but only 2 partitions each and 10 consumers, only two consumers will be used. That is, for each topic all partitions will be listed and shared between the consumers in the consumer group in order (not randomly).
>
> This behavior is also described in algorithm 1 of the original kafka paper [1].
>
> I don't understand this decision. Why is it split by topic? Does it make sense to divide all partitions from all topics between all the consumers in the consumer group? I don't see the reason for this so I would like to hear your opinion before changing the code.
>
> We are using kafka 0.7.1.
>
> Thank you in advance
>
> Pablo
>
> [1] "Kafka: a Distributed Messaging System for Log Processing", Jay Kreps, Neha Narkhede and Jun Rao.
> http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

That is a good suggestion. Ideally, the partitions across all topics should be distributed evenly across consumer streams instead of a per-topic based decision. There is no particular advantage to the current scheme of per-topic rebalancing that I can think of. Would you mind filing a JIRA to track this improvement?
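As a rough illustration of distributing partitions across all topics at once, the sketch below pools every (topic, partition) pair and deals them out round-robin. It is an assumption-laden sketch, not the design later tracked in JIRA: the function name `assign_across_topics` and the topic and consumer names are hypothetical, and it ignores differences in data volume between topics.

```python
from itertools import cycle


def assign_across_topics(topics, consumers):
    """Pool all (topic, partition) pairs and deal them round-robin to the consumers.

    topics: dict mapping topic name -> number of partitions (hypothetical input shape).
    """
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    all_partitions = [
        (topic, partition)
        for topic, num_partitions in sorted(topics.items())
        for partition in range(num_partitions)
    ]
    # zip stops when the finite partition list is exhausted.
    for partition, consumer in zip(all_partitions, cycle(ordered)):
        assignment[consumer].append(partition)
    return assignment


# Same scenario as in the thread: 100 topics, 2 partitions each, 10 consumers.
topics = {f"topic-{i:03d}": 2 for i in range(100)}
consumers = [f"consumer-{i:02d}" for i in range(10)]

global_assignment = assign_across_topics(topics, consumers)
for consumer, owned in global_assignment.items():
    print(consumer, len(owned))
```

With the pooled list, all ten consumers receive the same number of partitions, whereas a per-topic split of the same input leaves eight of them idle.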

On Mon, Jan 7, 2013 at 9:10 AM, Jun Rao <[EMAIL PROTECTED]> wrote:

> Pablo,
>
> Currently, partition is the smallest unit that we distribute data among consumers (in the same consumer group). So, if the # of consumers is larger than the total number of partitions in a Kafka cluster (across all brokers), some consumers will never get any data. Such a decision is done on a per topic basis. If a consumer consumes multiple topics, it would make sense to divide partitions across all topics to consumers. We haven't done that yet. Part of the reason is that we need to figure out how to balance the data across topics since they can be of different sizes. We can look into that post 0.8.
>
> For now, the solution is to increase the number of partitions on the broker.
>
> Thanks,
>
> Jun

Thank you Jun and Neha

I was trying to avoid adding more partitions. I have enough partitions if you count all partitions in all topics. I understand the problem with different data load per topic but the current scheme does not solve this problem either, so we shouldn't be worse off if we consider all partitions from all topics at the same time.

I will open the JIRA ticket to track this.

Thanks again for the clarification.

Cheers

Pablo

2013/1/7 Neha Narkhede <[EMAIL PROTECTED]>

> Pablo,
>
> That is a good suggestion. Ideally, the partitions across all topics should be distributed evenly across consumer streams instead of a per-topic based decision. There is no particular advantage to the current scheme of per-topic rebalancing that I can think of. Would you mind filing a JIRA to track this improvement?
>
> Thanks,
> Neha


(From http://kafka.apache.org/design.html) one potential benefit of the existing rebalancing logic is to reduce the number of connections to brokers per consumer instance. However, if you have a large number of partitions and few brokers and/or consumer instances then it wouldn't really help; so I agree it would be good to implement KAFKA-687. KAFKA-564 <https://issues.apache.org/jira/browse/KAFKA-564> may also be related - i.e., it may be easier to implement along with/after KAFKA-687,

