I have a cluster of boxes with 3 reducers per node. I want to limit a particular job to only run 1 reducer per node.

This job is network IO bound, gathering images from a set of webservers.

My job has certain parameters set to meet "web politeness" standards (e.g. limit connections and connection frequency).

If this job runs from multiple reducers on the same node, those per-host limits will be violated. Also, this is a shared environment and I don't want long-running, network-bound jobs uselessly taking up all reduce slots.

I think setting tasktracker.reduce.tasks.maximum to 1 may meet your requirement.

Best,

-- Nan Zhu
School of Computer Science, McGill University
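For reference, a minimal sketch of where that knob lives, assuming Hadoop 1.x property names (the full name should be mapred.tasktracker.reduce.tasks.maximum; the class below is illustrative only). The important caveat: each TaskTracker reads this value from its own mapred-site.xml at startup, so it caps every job on that node, not just one particular job.

// Illustrative helper, not from the thread: prints the reduce-slot cap seen
// in the local Hadoop configuration. A TaskTracker reads this property from
// its own mapred-site.xml when it starts, so setting it inside a job's
// configuration does not change how many slots a node actually offers.
import org.apache.hadoop.mapred.JobConf;

public class ShowReduceSlots {
    public static void main(String[] args) {
        JobConf conf = new JobConf(); // picks up mapred-default.xml and mapred-site.xml
        int slots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2); // 2 is the stock default
        System.out.println("Reduce slots per TaskTracker: " + slots);
    }
}

Because the setting is cluster-wide, dropping it to 1 would also shrink reduce capacity for every other job sharing those nodes.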



Looking at the Job File for my job, I see that this property is set to 1; however, I have 3 reducers per node (I'm not clear what configuration is causing this behavior).

My problem is that, on a 15-node cluster, I set 15 reduce tasks on my job in hopes that each would be assigned to a different node, but in the last run 3 nodes had nothing to do and 3 other nodes had 2 reduce tasks assigned.


There's no readily available way to do this today (you may be interested in MAPREDUCE-199 though), but if your job scheduler's not doing multiple assignments on reduce tasks, then only one is assigned per TT heartbeat, which gives you almost what you're looking for: 1 reduce task per node, round-robin'd (roughly).

-- Harsh J
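If the FairScheduler is what's assigning tasks (as David later suspects), the relevant knob is, to the best of my recollection, mapred.fairscheduler.assignmultiple; treat the property name and default below as assumptions to verify against your version's FairScheduler documentation. A quick sketch for checking it:

// Illustrative check, not from the thread: reports whether the local
// configuration allows the FairScheduler to hand out more than one task per
// TaskTracker heartbeat. The value that actually matters is the one in the
// JobTracker's mapred-site.xml, since that is where the scheduler runs.
import org.apache.hadoop.mapred.JobConf;

public class CheckAssignMultiple {
    public static void main(String[] args) {
        JobConf conf = new JobConf(); // reads mapred-site.xml from the classpath
        // The fallback 'false' is only for printing; the real default depends
        // on the Hadoop / FairScheduler version in use.
        boolean assignMultiple = conf.getBoolean("mapred.fairscheduler.assignmultiple", false);
        System.out.println("mapred.fairscheduler.assignmultiple = " + assignMultiple);
    }
}

With multiple assignment off, at most one reduce task goes out per heartbeat, which is the rough one-per-node, round-robin behavior described above.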


The suggestion to add a combiner is to help reduce the shuffle load (and perhaps reduce the number of reducers needed?), but it doesn't affect scheduling of a set number of reduce tasks, nor does a scheduler currently care if you add that step in or not.

> I guess the FairScheduler is doing multiple assignments per heartbeat, hence
> the behavior of multiple reduce tasks per node even when they should
> otherwise be fully distributed.
>
> Adding a combiner will change this behavior? Could you explain more?
>
> Thanks!
> David
>
> From: Michael Segel
> Sent: Monday, February 11, 2013 8:30 AM
> Subject: Re: How can I limit reducers to one-per-node?
>
> Adding a combiner step first then reduce?

For crawler-type apps, typically you direct all of the URLs to crawl from a single domain to a single reducer. Typically, you also have many reducers so that you can get decent bandwidth.

It is also common to take the normal web politeness standards with a grain of salt, particularly by treating them as an average rate and doing several requests over a single connection, then waiting a bit longer than would otherwise be done. This helps the target domain and improves your crawler's utilization.

Large-scale crawlers typically work out of a large data store with a flags column that is pinned into memory. Successive passes of the crawler can scan the flag column very quickly to find domains with work to be done. This work can be done using map-reduce, but it is only vaguely like a map-reduce job.
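As a concrete illustration of the "route each domain to one reducer" suggestion, a rough Partitioner sketch; the class name and the assumption that the map output key is the URL itself (as Text) are mine, not from the thread:

// Illustrative sketch: send every URL from the same host to the same reduce
// task by partitioning on the URL's host name. Adapt to whatever key layout
// the job really uses.
import java.net.URI;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostPartitioner<V> extends Partitioner<Text, V> {
    @Override
    public int getPartition(Text urlKey, V value, int numReduceTasks) {
        String host;
        try {
            host = new URI(urlKey.toString()).getHost();
        } catch (Exception e) {
            host = null;
        }
        if (host == null) {
            host = urlKey.toString(); // fall back to the raw key on malformed URLs
        }
        // Same host => same hash => same reducer; mask the sign bit before the modulo.
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be wired in with job.setPartitionerClass(HostPartitioner.class) if the new (org.apache.hadoop.mapreduce) API is in use.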

I tried that approach at first, one domain to one reducer, but it failed me: my data set has many domains with just a few thousand images (trivial), but we also have reasonably many massive domains with 10 million+ images.

One host downloading 10 or 20 million images, while obeying politeness standards, will take multiple weeks. So I decided to randomly distribute URLs to each host and, per host, follow web politeness standards. The domains with 10M+ images should be able to support the load (they're big sites, like iTunes for example), and the smaller ones are (hopefully) randomized across hosts enough to be reasonably safe.
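To make the per-host politeness concrete, a minimal sketch of the kind of throttle each reducer could keep; the class name and delay handling are illustrative only:

// Illustrative per-host throttle, not from the thread: a reducer remembers
// when it last fetched from a host and sleeps until a minimum delay has
// passed before contacting that host again.
import java.util.HashMap;
import java.util.Map;

public class PolitenessThrottle {
    private final long minDelayMs;
    private final Map<String, Long> lastFetchMs = new HashMap<String, Long>();

    public PolitenessThrottle(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    /** Blocks until it is "polite" to contact the given host again. */
    public synchronized void acquire(String host) throws InterruptedException {
        Long last = lastFetchMs.get(host);
        if (last != null) {
            long wait = last + minDelayMs - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
        }
        lastFetchMs.put(host, System.currentTimeMillis());
    }
}

Since URLs are randomized across reducers, each reducer throttles independently, so a big domain can still see up to one connection per reducer at a time; that is the trade-off described above for the 10M+ image sites.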


