A server that is not running a datanode can only be the namenode or the jobtracker. I think
running the copy jobs on such a server could bring unpredictable risks.
I found the following in the book "Hadoop: The Definitive Guide":
When copying data into HDFS, it’s important to consider cluster balance. HDFS works
best when the file blocks are evenly spread across the cluster, so you want to ensure
that distcp doesn’t disrupt this. Going back to the 1,000 GB example, by specifying -m
1 a single map would do the copy, which―apart from being slow and not using the
cluster resources efficiently―would mean that the first replica of each block would
reside on the node running the map (until the disk filled up). The second and third
replicas would be spread across the cluster, but this one node would be unbalanced.
By having more maps than nodes in the cluster, this problem is avoided―for this reason,
it’s best to start by running distcp with the default of 20 maps per node.
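To make the quoted passage concrete, here is a toy simulation of the effect. It is only a
sketch under a simplified model that I am assuming, not HDFS's real placement code: the first
replica goes to the writing node if that node is a datanode, otherwise to a random datanode,
and the other replicas go to random other datanodes. Rack awareness is ignored, and the node
names ("node00", "gateway") are made up for illustration.

```python
import random
from collections import Counter


def simulate_placement(datanodes, writer, num_blocks, replication=3, seed=0):
    """Count block replicas per datanode under a simplified placement model.

    Assumption: the first replica lands on the writer when the writer is a
    datanode, otherwise on a random datanode. Remaining replicas land on
    distinct random datanodes other than the first. Racks are ignored.
    """
    rng = random.Random(seed)
    counts = Counter({node: 0 for node in datanodes})
    for _ in range(num_blocks):
        first = writer if writer in datanodes else rng.choice(datanodes)
        candidates = [n for n in datanodes if n != first]
        others = rng.sample(candidates, replication - 1)
        for node in [first] + others:
            counts[node] += 1
    return counts


nodes = ["node%02d" % i for i in range(30)]

# Case 1: the upload runs on node00, which is itself a datanode.
local = simulate_placement(nodes, writer="node00", num_blocks=3000)
# Case 2: the upload runs on a machine that is not a datanode.
remote = simulate_placement(nodes, writer="gateway", num_blocks=3000)

print("writer is a datanode:     max=%d min=%d" % (max(local.values()), min(local.values())))
print("writer is not a datanode: max=%d min=%d" % (max(remote.values()), min(remote.values())))
```

In the first case the writing node ends up holding a replica of every block, while in the
second case the replicas are spread roughly evenly over the 30 nodes.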
But distcp is a method for copying large sets of files between two Hadoop clusters. Is there
any way to copy data into HDFS from the local file system using multiple maps? Of course,
writing a MapReduce program would work, but is there any other choice?
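One possible alternative, short of a MapReduce program, might be to spread plain
"hadoop fs -put" uploads over many nodes so that no single node receives every first replica.
Below is a rough sketch of the planning step only. It assumes the local files are visible on
every host (for example through a shared mount); the host names, paths, and the
plan_uploads helper are hypothetical, and actually running the commands on each host
(by ssh or otherwise) is left out.

```python
import shlex


def plan_uploads(local_files, hosts, hdfs_dir):
    """Assign files to hosts round-robin and build one put command per file.

    Returns a dict mapping each host to the list of shell commands it should
    run. Nothing is executed here; this only produces the plan.
    """
    plan = {host: [] for host in hosts}
    for i, path in enumerate(local_files):
        host = hosts[i % len(hosts)]
        cmd = "hadoop fs -put %s %s" % (shlex.quote(path), shlex.quote(hdfs_dir))
        plan[host].append(cmd)
    return plan


# Hypothetical file list, host names, and HDFS target directory.
files = ["/data/a.log", "/data/b.log", "/data/c.log"]
plan = plan_uploads(files, ["h1", "h2"], "/user/shangan/in")
for host, cmds in plan.items():
    for cmd in cmds:
        print(host, cmd)
```

The more hosts take part, the closer this gets to the "more maps than nodes" situation the
book describes, since each uploading host only keeps the first replica of its own share.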
2010-10-08
From: Taeho Kang
Sent: 2010-10-08 13:49:55
To: common-user
Cc:
Subject: Re: Re: how to make hadoop balance automatically
Have the dfs upload done by a server not running a datanode and your
blocks will be randomly distributed among active datanodes.
On Fri, Oct 8, 2010 at 2:39 PM, shangan <shangan@corp.kaixin001.com> wrote:
> Is there any way to change the default storage policy? For example: don't store the
> first copy of a block on the local node, but distribute the copies randomly instead.
>
>
> 2010-10-08
>
>
>
>
> From: Raj V
> Sent: 2010-09-28 22:28:12
> To: common-user
> Cc:
> Subject: Re: how to make hadoop balance automatically
>
> The first copy of a block is always stored on the local node. If you want a
> balanced distribution, do the data moving from the name node and don't make
> the name node into a data node.
> Raj
> ________________________________
> From: Neil Xu <neil.xuxf@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: Tue, September 28, 2010 3:13:01 AM
> Subject: Re: how to make hadoop balance automatically
> Hi, Shangan
> you can find something useful at
> https://issues.apache.org/jira/browse/HADOOP-1652
> and the document
> https://issues.apache.org/jira/secure/attachment/12370966/BalancerUserGuide2.pdf
> shows how to rebalance.
> I think you can try to set more mappers (much larger than the number of
> nodes), and see if it will be improved.
> Neil
> On September 28, 2010 at 4:09 PM, shangan <shangan@corp.kaixin001.com> wrote:
>> I have a cluster of 30 nodes, and I put data into the cluster from one node, which I
>> call "NodeA" here. As a result, this node always stores more data than the other
>> nodes: for example, the other nodes store 10G to 15G each, while NodeA stores
>> 50G to 60G.
>>
>> Does anyone know what causes this and how to avoid it?
>> By the way, I know there is a balancer tool that can rebalance the data.
>>
>> 2010-09-28
>>
>>
>>
>> shangan
>>
>