I hope I understood what you are asking. If not, then pardon me :) From the Hadoop developer handbook, a few lines:

The *Partitioner* class determines which partition a given (key, value) pair will go to. The default partitioner computes a hash value for the key and assigns the partition based on this result. It guarantees that all the records mapping to one key go to the same reducer.
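A minimal standalone sketch of that default behaviour (the class and method names here are illustrative, not Hadoop's actual HashPartitioner, which operates on arbitrary key types): mask off the sign bit of the key's hash so the result is non-negative, then take it modulo the number of reducers.

```java
public class HashPartitionDemo {
    // Mimics the default hash partitioning: non-negative hash of the key,
    // modulo the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every record with the same key maps to the same partition,
        // so one reducer sees all values for that key.
        System.out.println(getPartition("user-42", 5));
        System.out.println(getPartition("user-42", 5));
    }
}
```

Because the partition depends only on the key (and the reducer count), the "same key, same reducer" guarantee falls out directly.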

> Hi,
>
> quick question. How are the data from the map tasks partitioned for the reducers?
>
> If there is 1 reducer, it's easy, but if there is more, are all the same keys guaranteed to end on the same reducer? Or not necessarily? If they are not, how can we provide a partitioning function?
>
> Thanks,
>
> JM

However, I'm wondering how it's working when it's done into HBase. Do we have default partitioners so we have the same guarantee that records mapping to one key go to the same reducer? Or do we have to implement this one our own?

JM

2013/4/10 Nitin Pawar <[EMAIL PROTECTED]>:
> You can write your custom partitioner as well. Here is the link:
> http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning


> Jean-Marc:
> Take a look at HRegionPartitioner which is in both mapred and mapreduce packages:
>
> * This is used to partition the output keys into groups of keys.
> * Keys are grouped according to the regions that currently exist
> * so that each reducer fills a single region so load is distributed.
>
> Cheers

If the table we are outputting to has regions, then partitions are built around that, and that's fine. But if the table is totally empty with a single region, even if we setNumReduceTasks to 2 or more, all the keys will go on the same first reducer because of this:

    if (this.startKeys.length == 1) {
      return 0;
    }

I think it will be better to return something like keycrc % numPartitions instead. That still allows the application to spread the reducing process over multiple nodes (racks) even if there is only one region in the table.
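One way to read the keycrc % numPartitions suggestion, as a standalone sketch (the class and method names are mine, not HBase's; HRegionPartitioner itself works on the job's key/value types): take a CRC of the row key bytes and spread it across the reduce tasks.

```java
import java.util.zip.CRC32;

public class CrcPartitionSketch {
    // Sketch of the suggested fallback: when the table has a single region,
    // partition by a CRC of the key modulo the number of reduce tasks
    // instead of sending everything to reducer 0.
    static int getPartition(byte[] rowKey, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(rowKey);
        // CRC32.getValue() returns a non-negative long, so the modulo is safe
        return (int) (crc.getValue() % numPartitions);
    }

    public static void main(String[] args) {
        byte[] key = "stat-row-0001".getBytes();
        System.out.println(getPartition(key, 4));
    }
}
```

Like the hash default, this keeps the "same key, same reducer" property while letting more than one reducer do work against a one-region table.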

In my usecase, I have millions of lines producing some statistics. At the end, I will have only about 600 lines, but it will take a lot of map and reduce time to go from millions to 600; that's why I'm looking to have more than one reducer. However, with only 600 lines, it's very difficult to pre-split the table. Keys are all very close.

Does anyone see anything wrong with changing this default behaviour when startKeys.length == 1? If not, I will open a JIRA and upload a patch.

No. The reducer will simply write on the table the same way you are doing a regular Put. If a split is required because of the size, then the region will be split, but at the end, there will not necessarily be any region split.

In the usecase described below, all the 600 lines will "simply" go into the only region in the table and no split will occur.

The goal is to partition the data for the reducer only. Not in the table.

JM

2013/4/10 Graeme Wallace <[EMAIL PROTECTED]>

> Whats the behavior then if you return hash % num_reducers and you have no
> splits defined. When the reducer writes to the table does the region server
> local to the reducer create a new region ?
>
> Graeme
>
> > > Thanks Ted.
> > >
> > > It's exactly where I was looking at now. I was close. I will take a deeper look.
> > >
> > > Thanks Nitin for the link. I will read that too.
> > >
> > > JM
> > >
> > > 2013/4/10 Nitin Pawar <[EMAIL PROTECTED]>
> > >
> > >> To add what Ted said,
> > >>
> > >> the same discussion happened on the question Jean asked
> > >>
> > >> https://issues.apache.org/jira/browse/HBASE-1287

bq. I think it will be better to return something like keycrc%numPartitions

Can you explain how keycrc is obtained? I think if we change this logic, we should make it serve a (relatively) more general use case. But I didn't find, in hadoop 1.0, how a Partitioner can accept parameters.
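On passing parameters: a Hadoop Partitioner can pick up settings from the job configuration if it implements Configurable, since the framework calls setConf() when it instantiates the class reflectively. A minimal standalone sketch of that pattern (a plain Properties object stands in for Hadoop's Configuration, and the property name is hypothetical):

```java
import java.util.Properties;

public class ConfigurablePartitionerSketch {
    // Stand-in for a Configurable partitioner: reads its parameter from
    // the configuration it is handed instead of hard-coding it.
    private int seed;

    void setConf(Properties conf) {
        // Hypothetical property name; in Hadoop this would be conf.getInt(...)
        seed = Integer.parseInt(conf.getProperty("partitioner.seed", "0"));
    }

    int getPartition(String key, int numPartitions) {
        // Mix the configured seed into the key hash, then map to a partition
        return ((key.hashCode() ^ seed) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("partitioner.seed", "42");
        ConfigurablePartitionerSketch p = new ConfigurablePartitionerSketch();
        p.setConf(conf);
        // Same key, same configuration -> same partition every time
        System.out.println(p.getPartition("row-0001", 4));
    }
}
```

The point is only the shape: whatever a more general keycrc-style logic needs (a seed, a modulus, a flag) would arrive through the configuration, not through a changed method signature.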


Then just use the default partitioner?

I looked for partitioners in the HBase scope only; that's why I also thought we were using HTablePartitioner. But looking at the default one used, I found org.apache.hadoop.mapreduce.lib.partition.HashPartitioner, like St.Ack confirmed. And it's doing exactly what I was talking about, for the key hash (and not keycrc).

Changing HRegionPartitioner behaviour also will be useless because TableMapReduceUtil will overwrite the number of reducers if we have set more than the number of regions.

> Then just use the default partitioner?
>
> The suggestion that you use HTablePartitioner seems inappropriate to your task. See the sink doc here:
>
> http://hadoop.apache.org/docs/r2.0.3-alpha/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html
>
> St.Ack
