
Description

The current Hadoop network topology (described in previous issues such as HADOOP-692) works well in the classic three-tier network it was designed for. However, it does not take into account other failure models or infrastructure changes that can affect network bandwidth efficiency, such as virtualization.
A virtualized platform has the following characteristics that should not be ignored by the Hadoop topology when scheduling tasks, placing replicas, balancing, or fetching blocks for reading:
1. VMs on the same physical host are affected by the same hardware failure. In order to match the reliability of a physical deployment, replication of data across two virtual machines on the same host should be avoided.
2. The network between VMs on the same physical host has higher throughput and lower latency and does not consume any physical switch bandwidth.
Thus, we propose to make the Hadoop network topology extensible and to introduce a new level in the hierarchical topology, a node-group level, which maps well onto an infrastructure based on a virtualized environment.
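To make the extra level concrete, here is a minimal sketch (the path layout and class name are illustrative assumptions, not the committed Hadoop API) of how pairwise distance changes once a node-group level sits between rack and (virtual) node. VMs in the same node group share a physical host and are the cheapest to reach:

```java
// Hypothetical sketch: network locations of the form /rack/nodegroup/vm,
// where the node group stands for the physical host.
public class NodeGroupDistance {
    /** Distance = number of tree edges on the path between two leaves. */
    static int distance(String locA, String locB) {
        String[] a = locA.split("/"), b = locB.split("/");
        int depth = a.length; // both leaves assumed at the same depth
        int common = 0;
        while (common < depth && a[common].equals(b[common])) common++;
        return 2 * (depth - common);
    }

    public static void main(String[] args) {
        // vm1 and vm2 share a physical host (same node group): cheapest path
        System.out.println(distance("/rack1/host1/vm1", "/rack1/host1/vm2")); // 2
        // same rack, different physical host
        System.out.println(distance("/rack1/host1/vm1", "/rack1/host2/vm3")); // 4
        // different rack
        System.out.println(distance("/rack1/host1/vm1", "/rack2/host3/vm4")); // 6
    }
}
```

With the classic three-tier topology the first two cases would be indistinguishable; the node-group level is what lets schedulers and the HDFS client prefer co-hosted VMs.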


Activity

Hi Allen Wittenauer, most of the work has been committed (1.2.0 for branch-1, 2.1.0-beta for branch-2). However, it is still open because the YARN side hasn't been completed (YARN-18 and YARN-19). I will mark this umbrella as resolved when those two JIRAs are completed.

Junping Du
added a comment - 20/Mar/15 19:45

Junping, thank you for your explanations. Beyond separating the compute and storage of a virtual cluster, can you comment on isolation? It sounds like you would use a multitude of virtual HDFS instances in order to fence off virtual clusters from each other.

Jan Kunigk
added a comment - 21/Mar/13 11:51

Jan, thanks for the questions. It doesn't have to be multiple clusters, each with a dedicated HDFS. It also makes sense to set up purely compute-only clusters on top of the same HDFS cluster by separating the TaskTracker (or NodeManager) and DataNode into different VMs. The NodeGroup awareness here helps guarantee node-group-level (physical host) locality, so you can power off or suspend your compute cluster without any effect on other clusters. Given this, you don't have to suspend your HDFS cluster to free resources for other applications.
In the other case, if you want to suspend a virtual cluster (HDFS included), I would recommend stopping the HDFS service before you suspend the cluster and starting it again after you resume. That avoids the data re-replication caused by the DataNodes' heartbeat outage, and there is no need for an extra storage tier for persistence.

Junping Du
added a comment - 19/Mar/13 08:26

Junping,
Referring to one of your earlier comments on 08/Jun/12:
> For 2. It's right that VMs on the same host will not share storage directly
> but could do so (with getting virtual disks) through Hypervisor FS (Like VMFS in VMware vSphere) layer.
> Another way (should recommend for hadoop case) is to go through RDM (Raw Disk Mapping) configuration
> in hypervisor that each VM can get some dedicated physical disks.

Are you envisioning a usage model where each virtual cluster has its own distributed filesystem?
When I use virtualization I would most likely suspend my virtual clusters from time to time...
Can you comment on what would happen to the HDFS data in this case? Would one have to persist it in a different storage tier?

Jan Kunigk
added a comment - 18/Mar/13 16:09

Andy, you can simply start the TaskTracker daemon on some nodes and the DataNode daemon on other nodes. If you need to follow up with details on how to do it, please send mail to me or user@hadoop.apache.org, as that seems unrelated to this JIRA.

Junping Du
added a comment - 27/Jan/13 14:09

Hi Andy, HADOOP-8468-total.patch is out of date. We have already divided it into several patches, and most of them are checked into trunk now (except YARN-18 and YARN-19). If you are interested in backporting to other branches, I would suggest using the sub-JIRA patches.

Junping Du
added a comment - 25/Jan/13 15:34

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
--------------------------
File to patch: hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java: No such file or directory
Skip this patch? [y] n
File to patch:
Skip this patch? [y] y
Skipping patch.
1 out of 1 hunk ignored
can't find file to patch at input line 4972
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
1 out of 1 hunk ignored
patching file hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestTopologyResolver.java
can't find file to patch at input line 5059
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
3 out of 3 hunks ignored
can't find file to patch at input line 5105
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
1 out of 1 hunk ignored
patching file hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImplWithNodeGroup.java
can't find file to patch at input line 5168
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
4 out of 4 hunks ignored
can't find file to patch at input line 5257
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/NodeType.java
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
1 out of 1 hunk ignored
can't find file to patch at input line 5269
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
2 out of 2 hunks ignored
patching file hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNodeWithNodeGroup.java
can't find file to patch at input line 5357
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
3 out of 3 hunks ignored
can't find file to patch at input line 5413
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
10 out of 10 hunks ignored
patching file hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueueWithNodeGroup.java
can't find file to patch at input line 5635
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

+++ hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
--------------------------
File to patch:
Skip this patch? [y]
Skipping patch.
12 out of 12 hunks ignored
patching file hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoSchedulerWithNodeGroup.java
patching file hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueueWithNodeGroup.java
can't find file to patch at input line 6174
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------

Hi Konstantin, thanks for your question and for reading the results carefully. Yes, TestDFSIO has no locality awareness in task scheduling, as you said. However, after tasks are scheduled, the work in this umbrella (let's call it HVE for short) increases the probability that a client chooses a data block on its local physical host, for two reasons:
1. HVE makes sure the replicas are spread across 3 physical hosts (out of 6 hosts in total), so for any HDFS read there is a 50% chance of a replica living on the same physical host (previously it was between 1/3 and 1/2).
2. With HVE, the HDFS client can correctly sort the replicas so that a node-group-local replica is chosen in preference to a rack-local one.
The first reason is specific to this case, but the second one applies more generally.
Does that make sense?

Junping Du
added a comment - 23/Nov/12 07:59
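The probability argument above can be checked with a rough simulation. The numbers here are assumptions matching the comment (6 physical hosts, and an assumed 2 VMs per host), and replica placement without HVE is modeled simply as 3 distinct VMs chosen uniformly at random:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// With HVE, the 3 replicas sit on 3 distinct physical hosts, so a uniformly
// placed reader's host holds a replica with probability 3/6 = 0.5. Without
// the node-group constraint, replicas land on 3 distinct VMs that may cover
// only 2 distinct hosts, so the chance falls between 1/3 and 1/2.
public class LocalReplicaChance {
    static double chanceWithoutHve(int hosts, int vmsPerHost, int trials, long seed) {
        Random rand = new Random(seed);
        int vms = hosts * vmsPerHost;
        double sum = 0;
        for (int t = 0; t < trials; t++) {
            // 3 replicas on 3 distinct VMs, physical hosts not considered
            Set<Integer> replicaVms = new HashSet<>();
            while (replicaVms.size() < 3) {
                replicaVms.add(rand.nextInt(vms));
            }
            Set<Integer> replicaHosts = new HashSet<>();
            for (int vm : replicaVms) {
                replicaHosts.add(vm / vmsPerHost); // VMs 2k and 2k+1 share host k
            }
            // a uniformly placed reader is co-hosted with a replica with
            // probability (#distinct replica hosts) / (#hosts)
            sum += replicaHosts.size() / (double) hosts;
        }
        return sum / trials;
    }

    public static void main(String[] args) {
        double p = chanceWithoutHve(6, 2, 100_000, 42L);
        System.out.printf("without HVE: %.3f; with HVE: 0.500%n", p);
    }
}
```

Under these assumptions the simulated value lands around 0.45, consistent with the "between 1/3 and 1/2" range in the comment.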

Junping,
I checked your article with the performance results and have a question about it.
How do you explain the performance gain with DFSIO?
MapReduce-wise, DFSIO is completely unaware of the locality of the data it reads, because a mapper's input is just a file containing the name of the file the mapper should read. So the input file holding the name is local to the task, but not the file that it then reads.
I'm not saying there is anything wrong with your results; I just think it needs more explanation.

Konstantin Shvachko
added a comment - 23/Nov/12 07:31

Junping Du
added a comment - 30/Oct/12 17:45
As a follow-up to the meetup, I have quickly summarized how to configure and use HVE in an attached draft user guide. Please help review and comment.

Junping Du
added a comment - 17/Sep/12 08:52
Thanks for the great comments, Konstantin. The revised doc (revised-1.0) already addresses the full policy definition.
Hi guys, I am backporting the patches to branch-1. I hope I can get your support and help with reviewing.

It's good that you formulated the policies. Now I can see the differences. In way-2 you actually don't need to say "virtual node"; it is an implementation detail. You only care that the first replica is on the local physical node. So way-2 is the same as the original.
In way-1, I agree, only one change is needed. Rather surprising.

I briefly checked the patch, and I see now that your abstractions are driven by the implementation. Whether you define it way-1 or way-2, implementation-wise you still introduce a new inner level in the topology.
I do not think you need the new class InnerNodeWithNodeGroup. It doesn't have new members or constructors. It overrides isRack(), but only because the old implementation assumed racks are on the second level. I'd rather add a nodeType member than check children of children.

So, I think I understand your motivation for the design. Thanks for clarifying your thoughts. I still think the terminology is better when talking about extending the topology with new leaves, but your way is also valid and does not change the policy much. You choose. Either way, please add the full policy definition to the document.

Konstantin Shvachko
added a comment - 22/Jun/12 19:31
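The nodeType-member alternative suggested in the comment above can be sketched as follows. The names here are illustrative (this is not the actual patch): instead of an InnerNodeWithNodeGroup subclass inferring "rack-ness" from its depth in the tree by checking children of children, each inner node carries an explicit type set at construction:

```java
// Sketch of an inner-node type tag replacing depth-based rack detection.
public class InnerNodeSketch {
    enum NodeType { RACK, NODE_GROUP }

    static class InnerNode {
        final NodeType nodeType; // set once at construction; no depth checks needed

        InnerNode(NodeType type) { this.nodeType = type; }

        boolean isRack() { return nodeType == NodeType.RACK; }
    }

    public static void main(String[] args) {
        System.out.println(new InnerNode(NodeType.RACK).isRack());       // true
        System.out.println(new InnerNode(NodeType.NODE_GROUP).isRack()); // false
    }
}
```

The design trade-off is that the tag must be assigned correctly when the topology is built, but queries like isRack() then stay O(1) and independent of where a level happens to sit in the hierarchy.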

> What changed your mind? Sounds like the right direction to me.
From the comments above, you can see that way-1 inherits the original policy almost as faithfully as way-2. But way-1 is simpler to implement, for reasons such as: DatanodeDescriptor does not have to be remapped to an additional virtual-node layer, and the NetworkTopology structure is easier to extend at an InnerNode than at a leaf node. Thoughts?

Junping Du
added a comment - 21/Jun/12 09:40

Hi Konstantin,
Thanks for your comments. Please see my reply:
> If you put it in terms when virtual nodes are added as the fourth level, then you don't need to change a word in the old policy.
It still needs a slight change, as the first replica should be placed on the local virtual node, not the local node. Let me show two different ways of translating the original rules you listed above (in rule 2 I omit "on two different nodes", as it duplicates rule 0).
Original:
0. No more than one replica is placed at any one node
1. First replica on the local node
2. Second and third replicas are in the same rack
3. Other replicas on random nodes, with the restriction that no more than two replicas are placed in the same rack, if there are enough racks

Two ways: 1) node, rack -> node, nodegroup; 2) node, rack -> virtual node, node, rack. The substituted terms mark the additional layer.
way 1:
0. No more than one replica is placed at any one nodegroup
1. First replica on the local node
2. Second and third replicas are in the same rack
3. Other replicas on random nodes, with the restriction that no more than two replicas are placed in the same rack, if there are enough racks
way 2:
0. No more than one replica is placed at any one node
1. First replica on the local virtual node
2. Second and third replicas are in the same rack
3. Other replicas on random nodes, with the restriction that no more than two replicas are placed in the same rack, if there are enough racks
So you can see it is equivalent in words.

Junping Du
added a comment - 21/Jun/12 08:53
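Way-1's tightened rule 0 can be sketched as a small validity check. Locations and names here are illustrative (network locations assumed to be of the form /rack/nodegroup/host), not HDFS's actual block placement code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Under way-1, "no more than one replica per node" tightens to "no more than
// one replica per node group", since two VMs in one node group share a
// physical host; rule 3's rack limit is unchanged.
public class Way1Check {
    static boolean isValidPlacement(List<String> replicaLocations) {
        Map<String, Integer> perNodeGroup = new HashMap<>();
        Map<String, Integer> perRack = new HashMap<>();
        for (String loc : replicaLocations) {
            String[] p = loc.split("/"); // ["", rack, nodegroup, host]
            String rack = p[1];
            String nodeGroup = p[1] + "/" + p[2];
            perNodeGroup.merge(nodeGroup, 1, Integer::sum);
            perRack.merge(rack, 1, Integer::sum);
        }
        for (int n : perNodeGroup.values()) if (n > 1) return false; // rule 0 (way-1)
        for (int n : perRack.values()) if (n > 2) return false;      // rule 3
        return true;
    }

    public static void main(String[] args) {
        // two replicas inside one node group = same physical host: rejected
        System.out.println(isValidPlacement(Arrays.asList(
                "/r1/ng1/vm1", "/r1/ng1/vm2", "/r2/ng5/vm9")));  // false
        System.out.println(isValidPlacement(Arrays.asList(
                "/r1/ng1/vm1", "/r2/ng5/vm9", "/r2/ng6/vm11"))); // true
    }
}
```

Note how the first placement would have been legal under the original rule 0 (three distinct nodes), which is exactly the failure mode the node-group level is meant to rule out.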

> 3rd on local node of 2nd
How so?
Junping, try to rewrite the policy I stated earlier using your terms for the 4-level topology with node groups as the third level, and you will see many words change. If you put it in terms where virtual nodes are added as the fourth level, then you don't need to change a word in the old policy. I thought it is a good thing to keep old policies consistent with new use cases. It confirms (1) that it's a good policy, and (2) that it's a good design.
> Agree. That's what I try to do previously also.
What changed your mind? Sounds like the right direction to me.

Konstantin Shvachko
added a comment - 21/Jun/12 08:00

Hi Konstantin,
Those are good suggestions. The updated proposal should address most of them. A few comments below:
> So my motivation with virtual node extension is that it formally inherits the existing policy, but semantically adds a new level of topology.
Agreed. That's what I tried to do previously as well. The current way maps node -> (virtual) node and adds a "nodegroup" level, so that the policy is almost exactly the same: 1st on the local (virtual) node, 2nd off-rack, 3rd on the local node of the 2nd. The only difference is making sure the 2nd and 3rd are off-nodegroup (and if the 1st cannot be on the local (virtual) node, it can be on a nodegroup-local node).
> But from the failure scenarios viewpoint they are bound to the same node, meaning that node failure takes all of them down
Yes. So adding a node-group level should capture the failure relationship between (virtual) nodes well. I think the key points for mapping the current node to the VM level are:
A virtual node (VM) plays the role of a leaf node. There are still failures that happen only within a VM, such as daemon failure, OS failure, and some physical failures (e.g. disk failure, since in most Hadoop deployments a VM should mount separate physical disks rather than sharing disks with other VMs). So VMs still show some independence even in failure-group semantics.
A virtual node is where the JVM runs and Java network calls happen. In the current code base, the IP (hostname) of a node (reader, datanode) is used to maintain data locality. Only the VM-level IP is readily available to the JVM and RPC calls, so it makes sense for it to represent the node IP.
Thoughts?

Junping Du
added a comment - 16/Jun/12 11:58

Sorry, got distracted with the Hadoop event of the week.
Here is the current replication policy:
0. No more than one replica is placed at any one node
1. First replica on the local node
2. Second and third replicas on two different nodes in a different rack
3. Other replicas on random nodes, with the restriction that no more than two replicas are placed in the same rack, if there are enough racks

With my thinking that the virtual node level is added, the policy remains unchanged, with a single optional clarification:
(1) First replica on the virtual node, then on the local node

With your approach of adding the hypervisor layer, the policy needs to be revised by replacing "node" with "node group".

So my motivation for the virtual node extension is that it formally inherits the existing policy, but semantically adds a new level of topology.

> Each VM on the same physical machine plays independently

As you correctly mention in the design doc, topology is about failure scenarios rather than independence of VMs. VMs are independent as entities reporting to the NameNode. But from the failure-scenario viewpoint they are bound to the same node, meaning that a node failure takes all of them down.
So the policy should not change, only its implementation should.

> VMs lives on the same physical machine can belong to different logical Hadoop clusters

Well, you can run two DNs or TTs on the same node belonging to different clusters even now, but nobody does that, because operationally it's just too much hassle. I'm not sure virtualization will make it different.
I have heard of attempts to run multiple clusters on the same physical nodes for isolation purposes, but I didn't hear that it was successful.

Konstantin Shvachko
added a comment - 15/Jun/12 23:03

Hey Konstantin, thanks for a lot of good suggestions.
For 1: Conceptually, there are two ways to look at the change we proposed. One way is, as you said, to add a VM-level extension below the physical host (making the physical host an inner node rather than a leaf). The other way is to look at the VM (virtual node) as taking the place of the former physical node, i.e. as the container of processes, and to add an inner-node layer between node and rack. We prefer the second way for the following reasons:
1) Each VM on the same physical machine acts independently in general, but is related to the others in reliability and in lower communication overhead. Each VM has an independent hostname and IP, and it is where the Hadoop daemons run.
2) VMs living on the same physical machine can belong to different logical Hadoop clusters; a physical host is no longer dedicated to one logical Hadoop cluster, but can be shared. Also, the physical host's IP and host info (the hypervisor's IP and info) should not be visible to Hadoop.
3) In data-locality-related policies, the VM maps well onto the former physical node as the first choice for placing the 1st replica, scheduling a task, etc.
For 2: It's right that VMs on the same host will not share storage directly, but they could do so (obtaining virtual disks) through a hypervisor FS layer (like VMFS in VMware vSphere). Another way (recommended for the Hadoop case) is to use an RDM (Raw Disk Mapping) configuration in the hypervisor so that each VM gets some dedicated physical disks. In both cases, the virtual disk drives (and their capacity) for each VM are independent and can be reported by the DN without any overlap.
For 3: Yes, it looks like we are missing the replica removal policy in the proposal. I will revise it as you suggest. Thanks!
For 4: YARN does a good job of resolving the fixed-task-slot issue that exists in MRv1. Beyond that, there are still scenarios for running multiple VMs per physical node, such as: tenant task isolation at the VM level; separating data nodes and compute nodes to let a Hadoop MapReduce (YARN) cluster scale in and out automatically; and supporting standardized, customized nodes (a cloud requirement) in a heterogeneous hardware environment.
Thoughts?

Junping Du
added a comment - 08/Jun/12 09:01

Junping, I went over the design document. It is pretty comprehensive. A few comments on the design.

Conceptually you are extending current Network Topology by introducing a new layer of leaf nodes. Current topology assumes that physical nodes are the leaves of the hierarchy and you add virtual nodes that can reside on physical nodes. I think this is a more logical way to look at the new topology, rather than saying that you introduce the second layer (node groups) over the nodes, as document does.

The document should clarify how local storage is used by VMs on a physical box. I think the assumption is that VMs never share storage resources; otherwise there could be a reporting problem. That is, if two VMs share a drive and send two DF reports to the NameNode, the drive will be counted twice, which can cause problems. I'd recommend updating the pictures and adding a section on the reporting of DNs' resources to the NN, so that this issue is explicitly covered in the design.

For block replication there are 3 policies to consider:

block placement policy, when a new block is created

block replication policy, when under-replicated blocks are recovered

replica removal policy, when replicas are removed for over-replicated blocks
You covered the first two, and probably need to look into the third as well.
For the first two it would be good to write down the entire modified policy rather than just listing the differences. And make sure they converge to the existing policies if the virtual node layer is not defined.

For YARN I am not convinced you will need to run multiple VMs per node, if not for the sake of generality. It seems YARN should rely on the NodeManager to report resources and manage Containers of a node as a whole. Not sure how multiple VMs on a node can help here.
For MRv1, on the contrary, running multiple VMs per node can be useful for modeling variable slots. In this case, again, the VMs should not share memory; otherwise reporting will go wrong.
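The convergence point above can be sketched in a few lines: the node-group-aware policies only need one extra exclusion rule, and when no node group is defined the rule vanishes, so the behavior reduces to the existing policy. The `Node` type and `excluded` helper below are hypothetical names for illustration, not Hadoop's actual classes:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    node_group: Optional[str] = None  # None => no virtual (node-group) layer

def excluded(candidate: Node, chosen: List[Node]) -> bool:
    """True if `candidate` must not receive another replica, given the
    replicas already chosen. With node_group=None this degenerates to the
    classic same-node check, i.e. the existing placement policy."""
    for c in chosen:
        if candidate.name == c.name:
            return True  # never two replicas on the same (virtual) node
        if candidate.node_group is not None and candidate.node_group == c.node_group:
            return True  # never two replicas in one node group (same host)
    return False

vm1 = Node("vm1", "rack1", "ng1")
vm2 = Node("vm2", "rack1", "ng1")  # same physical host as vm1
vm3 = Node("vm3", "rack1", "ng2")  # same rack, different host
print(excluded(vm2, [vm1]))  # True: same node group
print(excluded(vm3, [vm1]))  # False: same rack is fine on a different host
```

The same exclusion test would apply to all three policies (placement, re-replication, replica removal), which is why writing them out in full, as suggested, makes the convergence easy to verify.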

Konstantin Shvachko
added a comment - 08/Jun/12 07:05

Hi Luke,
Yes, I agree that when the number of nodes in a logical cluster is much smaller than the number of (available) physical hosts, it is good to do such placement for reliability if the infrastructure allows it (although it may trade off a bit more network traffic across the rack/core switch, isn't it?). Is noting this approach in the proposal and describing its use scenario good enough for the proposal?

Junping Du
added a comment - 06/Jun/12 02:21

Actually, the two approaches are orthogonal. Avoiding placing more than one data node of the same logical cluster on the same physical host will increase reliability even if the new topology algorithm is in place.

VM placement is only NP-hard if the instance configuration is arbitrary and you require absolutely optimal placement. It's easier if the number of instance types is limited, a la AWS. I suspect greedy algorithms exist that approximate the optimal placement. We don't need millisecond response time for such a placement algorithm either, since it runs only once at logical cluster deploy time and when there are physical host failures.

It's definitely easier to do such placement when number of nodes of a logical cluster is much smaller than the number of physical hosts, which is the case for AWS and SmartCloud.
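For flavor, a greedy heuristic of the kind alluded to above could look like the following first-fit-decreasing sketch (all names hypothetical, not any real placement engine's API); it packs VMs onto hosts while refusing to co-locate two VMs of the same logical cluster:

```python
def place_vms(vms, capacity):
    """vms: list of (cluster_id, size); capacity: dict host -> free units.
    First-fit decreasing: try big VMs first (the usual bin-packing heuristic),
    and never put two VMs of one logical cluster on the same physical host."""
    free = dict(capacity)
    clusters_on = {h: set() for h in free}  # cluster ids placed per host
    placement = {}
    order = sorted(range(len(vms)), key=lambda i: -vms[i][1])  # big first
    for i in order:
        cluster, size = vms[i]
        for host in free:
            if free[host] >= size and cluster not in clusters_on[host]:
                free[host] -= size
                clusters_on[host].add(cluster)
                placement[i] = host
                break
        else:
            raise RuntimeError(f"no feasible host for VM {i} of cluster {cluster}")
    return placement

# Two VMs of cluster A land on different hosts even though h1 could fit both.
print(place_vms([("A", 2), ("A", 2), ("B", 1)], {"h1": 4, "h2": 2}))
# -> {0: 'h1', 1: 'h2', 2: 'h1'}
```

Like any first-fit heuristic it can fail where an optimal packing exists, but it is cheap, deterministic, and only needs to run at deploy time or after a host failure, which matches the response-time argument above.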

Luke Lu
added a comment - 05/Jun/12 21:34

I will update the proposal a bit to list the first approach. It is a workaround that requires no Hadoop code change. However, this "1-1 mapping" of data node to physical host is subject to the following restrictions:
1. The number of nodes may be larger than the number of physical hosts.
2. Even if the number of nodes is smaller than the number of physical hosts, some hosts may be fully occupied by other logical Hadoop clusters or other applications.
3. Clouds/datacenters are formed of heterogeneous hosts, and some hosts are not suitable for deploying Hadoop nodes, e.g. hosts attached to shared storage only.
In general, VM placement in the cloud is a complex bin-packing problem, which is NP-hard and should be optimized for a balance of resource utilization and reliability. Applying an absolute rule like the first approach is not the best way. In addition, the Hadoop network topology should reflect the physical (or virtual) topology of the bottom layer, but should not impose strict requirements/restrictions on the deployment topology.
Thoughts?

Junping Du
added a comment - 05/Jun/12 11:20

This is a comment on the proposal, which, IMO, is missing a viable option. There are essentially two approaches to address the problem.

Enhance VM placement to ensure a 1-1 mapping of data node to physical host within a logical hadoop cluster. This approach doesn't require any modification to Hadoop to achieve the same data reliability/redundancy. It can be a viable option for Hadoop clusters with fewer nodes than physical hosts, e.g. large public or company-wide clouds.

For Hadoop clusters with more data nodes than physical hosts, the analysis in the proposal is spot on and the extra layer is required to achieve optimal data reliability.

Luke Lu
added a comment - 04/Jun/12 21:10

It looks like this move action will actually move the sub-JIRAs out of the parent (umbrella) JIRA. Do we need three parent JIRAs in Common/HDFS/MapReduce?
To your question on running Hadoop inside VMs: I don't have a concrete number for now, but we know some enterprise customers would like to run Hadoop clusters in their virtualized datacenters/private clouds.

Junping Du
added a comment - 04/Jun/12 17:07

You can move the JIRAs. More Actions -> Move. If it is possible to split them up, it is nice to keep them separate, but it is not totally necessary. If they do span multiple projects and are hard to split up you can leave them under HADOOP. The main reason for this is that some people only watch the HDFS lists, while others only look at the MAPREDUCE lists, and may miss changes that are not filed under the appropriate group.

I am interested to see where this goes, and it seems very logical to me to be able to express to Hadoop what your topology really does look like. I am not sure how many groups are running Hadoop inside VMs except perhaps on EC2, but I have a very limited view into that right now.

Robert Joseph Evans
added a comment - 04/Jun/12 16:44

Hi Robert,
Thanks for your reply. So you are suggesting re-creating the sub-tasks in the proper projects (Common, HDFS, MAPREDUCE), is that right?
For a patch that spans projects (like the 1st sub-JIRA, which mixes COMMON and HDFS), should we create both a Common and an HDFS JIRA for it?

I have been looking at some of your patches, but there is a lot here to go through and it is likely to take some time.

Could you please move your JIRAs to the appropriate project. HDFS JIRAs should be moved out of HADOOP and into HDFS, Mapreduce should go to MAPREDUCE, and only the ones that stay in HADOOP should be for code that goes under the hadoop-common-project directory.

Robert Joseph Evans
added a comment - 04/Jun/12 15:56

Junping Du
added a comment - 04/Jun/12 14:41
The work is divided into 7 patches, attached to the respective sub-tasks. There are some dependencies between the patches; only three are independent: P1, P3 and P6.