Removing a datanode (failed or decommissioned) should not require a namenode restart

Description

I've heard of several Hadoop users using dfsadmin -report to monitor the number of dead nodes, and alert if that number is nonzero. This mechanism tends to work well, except when a node is decommissioned or fails: the namenode then requires a restart before the node is entirely removed from HDFS. More details here:
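The monitoring check described above can be sketched roughly as follows. The report line is a made-up sample in the 0.20-era summary format; the exact layout of `hadoop dfsadmin -report` output varies between Hadoop releases, so the pattern would need adjusting for a given version:

```shell
# Extract the dead-node count from the summary line of
# `hadoop dfsadmin -report` and alert when it is nonzero.
# The sample below stands in for real command output; in a live
# cluster you would instead capture:
#   report=$(hadoop dfsadmin -report)
report="Datanodes available: 3 (4 total, 1 dead)"
dead=$(printf '%s\n' "$report" | sed -n 's/.*(\([0-9][0-9]*\) total, \([0-9][0-9]*\) dead).*/\2/p')
if [ "$dead" != "0" ]; then
  echo "ALERT: $dead dead datanode(s)"
fi
```

The problem described in this issue is exactly that this check keeps firing (or never clears) after a node is decommissioned, because the dead node lingers in the report until the namenode is restarted.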

Allen Wittenauer
added a comment - 03/May/10 20:52
I've seen this as well.
The basic premise is that you are removing a node from the grid permanently. So you:
a) add node to dfs.hosts.exclude
b) dfsadmin -refreshNodes
c) wait for decom to finish
d) remove node from both dfs.hosts and dfs.hosts.exclude
If you check the web UI and dfsadmin -report, it is still listed as valid.
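The four steps above can be sketched as a shell session. The file paths and hostname below are placeholders: the real files are whichever ones dfs.hosts and dfs.hosts.exclude point to in hdfs-site.xml. The refreshNodes calls are shown commented out because they only make sense against a running namenode:

```shell
# Throwaway local stand-ins for the two host files; on a real cluster
# these are whatever dfs.hosts and dfs.hosts.exclude point to.
HOSTS_FILE=./dfs.hosts
EXCLUDE_FILE=./dfs.exclude
NODE="dn42.example.com"    # hypothetical node being removed
printf '%s\n' dn1.example.com "$NODE" > "$HOSTS_FILE"
: > "$EXCLUDE_FILE"

# (a) add the node to the exclude file
echo "$NODE" >> "$EXCLUDE_FILE"

# (b) make the namenode re-read dfs.hosts and dfs.hosts.exclude
#     (requires a running cluster)
# hadoop dfsadmin -refreshNodes

# (c) wait for decommissioning to finish; the node's entry in
#     `hadoop dfsadmin -report` should show status "Decommissioned"

# (d) remove the node from both files, then refresh again
grep -v "^$NODE\$" "$HOSTS_FILE" > "$HOSTS_FILE.tmp" || :
mv "$HOSTS_FILE.tmp" "$HOSTS_FILE"
grep -v "^$NODE\$" "$EXCLUDE_FILE" > "$EXCLUDE_FILE.tmp" || :
mv "$EXCLUDE_FILE.tmp" "$EXCLUDE_FILE"
# hadoop dfsadmin -refreshNodes
```

After step (d), the complaint in this issue is that the node still shows up as valid in the web UI and in dfsadmin -report until the namenode is restarted.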

Nigel Daley
added a comment - 10/Jan/11 20:43
At this point I don't see how this 6-month-old unassigned issue is a blocker for 0.22. I also think this is an improvement, not a bug. Removing from 0.22 blocker list.

Matthias Friedrich
added a comment - 07/Feb/11 07:07
We also got complaints from our admins about this because it makes it really hard to set up professional monitoring. My company operates close to 100,000 machines (only a handful of Hadoop nodes, though), so it's a big concern for us that our infrastructure behaves well.
Also, node decommissioning is one of the things QA departments typically test during product evaluation, so this could hamper Hadoop adoption in some organizations.

Matt Foley
added a comment - 05/May/11 23:54
HDFS-1773 seems to be a duplicate of this, and it is resolved/fixed in trunk (v23) and 0.20.204.0. (Thanks, Koji.) The only requirement seems to be that dfs.hosts is used.
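For reference, using dfs.hosts means pointing the namenode at an include file in hdfs-site.xml. A minimal sketch, with placeholder paths:

```xml
<!-- hdfs-site.xml excerpt; file paths are placeholders -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs.hosts</value>
  <!-- datanodes allowed to connect to the namenode -->
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
  <!-- datanodes to be decommissioned -->
</property>
```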