Dear all:
I have run into a very common, very basic problem: I cannot recover a failed Hadoop DataNode gracefully without stopping all services on all nodes and then starting all services on all nodes.

Originally, I had a Hadoop cluster with 5 nodes (1 NameNode: host001, plus 4 DataNodes: host002, host003, host004, host005). When I shut off a DataNode (host005), HMC noticed the host005 DataNode was down and raised a blinking warning on the HMC monitor. However, when I powered this host005 DataNode back on, host005 did not recover its Hadoop services (DataNode and TaskTracker), so I still had only 4 workable Hadoop nodes.
My method of recovering host005's services is to “stop all services, then start all services” from HMC, but that is clumsy and not practical in the real world, so I would like to know if anybody can suggest a better way.
What should I do to recover a failed DataNode without stopping all services and then starting all services?
What is the correct procedure?
Regards,
Jeff

Hi Vipul,
Can you please clarify whether your issue is related to the original issue in this post? If your issue concerns the 2.0 alpha, please post your question here:
http://hortonworks.com/community/forums/forum/hdp-2-0-alpha-feedback-2/
Thanks for using HDP.
Thanks,
Seth

Hi Vipul,
thanks for trying Hortonworks Data Platform.
A bit more information about what you are attempting here would be helpful. Was this an attempt to add nodes? A new install? What version of HDP is this happening on? You could also help us by following the instructions here: http://hortonworks.com/community/forums/topic/hmc-installation-support-help-us-help-you
Thanks,
Ted.

Jeff,
it is not clear what you mean by this:
“(2) host005 has replicate problem,so we can’t add node”

As for your questions:
HDP supports adding new nodes and decommissioning nodes.
HMC will support node decommissioning in a future release.
It looks like you are mixing up HDP and HMC:
HDP is the Hortonworks Data Platform, the actual Hadoop distribution and ecosystem.
HMC is the Hortonworks Management Center, the management and monitoring component, which is in its early implementation stages.

In general, if the cluster is working correctly and one of the nodes loses power and then comes back up, you do not need to do anything; it will rejoin the cluster automatically.
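One quick way to confirm that a restarted node actually rejoined is to check the per-node sections of `hadoop dfsadmin -report` on the NameNode. A minimal sketch (the report format assumed here is Hadoop 1.x's, with one “Name: host:port” line per reported DataNode; dead nodes are also listed, so check the report's own summary lines too):

```shell
# Count the per-DataNode sections in `hadoop dfsadmin -report` output.
# Each reported DataNode (live or dead) gets a "Name: host:port" line.
count_reported_nodes() {
  grep -c '^Name: '
}

# Typical use on the NameNode (host001):
#   hadoop dfsadmin -report | count_reported_nodes
```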

However, it seems that something in your setup is incorrect, given that you have been asking installation questions in this forum for many weeks in a row.

Would you like to take this offline and hold a WebEx session to perform a clean install and then simulate node failure and recovery?

Dear Ted:

I used “Add Nodes”, and it still failed.
Below are the steps:
(1) Add node host005.
(2) host005 had a replication problem, so we could not add the node,
(3) so I changed the hostname host005 to a new name, host105, to avoid the replication problem.
(4) Add node host105.
(5) HMC failed to add the node.
(6) “Please uninstall cluster.”
…too bad…

It seems that Hortonworks HDP 1.1 is not suitable for production.
If HDP 1.1 does not support decommissioning,
how can we regularly maintain the Hadoop cluster nodes under HDP 1.1?
How can we remove a bad DataNode and then replace it with a new DataNode?
What if one DataNode loses power suddenly and gets it back later?
How do we recover this DataNode?
I don't know whether Hortonworks HDP 1.1 can handle this basic problem,
but I do hope HDP 1.1 can be more competitive in the Hadoop world.

Jeff,
Once the node has been properly or accidentally removed from the cluster, the way to get it, or any new node, back into the cluster is to use the “Add Nodes” facility of HMC. I was assuming that you wanted to bring the node you simulated failure on back into the cluster without treating it as a new node.
Ted.

Dear Ted :
I still failed to recover my DataNode host005.
Actually, I originally had 5 nodes (1 NameNode host001 + 4 DataNodes host002, host003, host004, host005) in the Hadoop cluster. Later, I tried to simulate a DataNode failure by powering off host005. About 5 minutes after powering it off, I
(1) powered up host005
(2) ran “service hmc-agent start” ===> start the puppet agent on host005
(3) ran “hadoop datanode” ===> start the DataNode on host005
(4) ran “hadoop tasktracker” ===> start the TaskTracker on host005
but from then on, no matter what I did, the host005 DataNode never came back t...

In the end,
my questions are:
1. How can I successfully remove a malfunctioning DataNode, and then add a replacement DataNode via HMC or some other service?
2. If HMC does not currently support decommissioning, is there any other method that meets my requirement? (When a DataNode fails, we need to remove it by a proper procedure and then add a new node as a replacement.)
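For reference, stock Hadoop 1.x does have a manual decommission path that works independently of HMC. A minimal sketch, assuming your hdfs-site.xml sets `dfs.hosts.exclude` to an exclude file; the `/etc/hadoop/conf/dfs.exclude` path here is an assumption about your layout:

```shell
# On the NameNode: list the node to retire in the exclude file that
# dfs.hosts.exclude points at, then tell the NameNode to re-read it.
echo "host005" >> /etc/hadoop/conf/dfs.exclude
hadoop dfsadmin -refreshNodes

# The NameNode now re-replicates host005's blocks to the other DataNodes.
# Watch the report until host005 shows "Decommission Status : Decommissioned":
hadoop dfsadmin -report
# Only then stop the DataNode/TaskTracker daemons on host005 and remove it.
```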

Jeff,
The process should be:
* power on the box
* manually restart the DataNode and TaskTracker services (these will not start automatically when the computer boots)
* once these services are started, they should rejoin the cluster automatically, provided that the NameNode is already running.

This assumes that you have fixed whatever caused the DataNode to die in the first place. Also note that the proper procedure for shutting down a node, if you cannot stop the whole cluster, is to stop the Hadoop services on the box first and then, after they have fully stopped, shut down the computer.
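A concrete sketch of the steps above, as run on the recovered box; the daemon-start commands and service users vary by install, so treat the names below (hadoop-daemon.sh, the hdfs and mapred users) as assumptions to adapt to your setup:

```shell
# 1. After powering the box on, bring the HMC/puppet agent back up:
service hmc-agent start

# 2. Manually restart the Hadoop daemons (they do not start on boot):
su - hdfs   -c "hadoop-daemon.sh start datanode"
su - mapred -c "hadoop-daemon.sh start tasktracker"

# 3. On the NameNode, verify the node has rejoined the cluster:
hadoop dfsadmin -report
```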