Fault tolerant Hadoop Job Tracker

Details

Type: New Feature

Status: Resolved

Priority: Major

Resolution: Incomplete

Affects Version/s: None

Fix Version/s: None

Component/s: None

Labels: None

Environment: High availability enterprise system

Description

The Hadoop framework was designed, in an effort to enhance performance, with a single JobTracker (master node). Its responsibilities range from managing the job submission process and computing the input splits to scheduling tasks on the slave nodes (TaskTrackers) and monitoring their health.
In some environments, such as the IBM and Google Internet-scale computing initiative, there is a need for high availability, and performance becomes a secondary issue. In these environments, having a system with a Single Point of Failure (such as Hadoop's single JobTracker) is a major concern.
My proposal is to provide a redundant version of Hadoop by adding support for multiple replicated JobTrackers. This design can be approached in many different ways.

Amar Kamat added a comment - 04/Nov/08 11:46

Francesco, HADOOP-3245 allows the jobtracker to (re)start. If the history files are maintained on DFS, then the jobtracker can start on any other machine. The only issue that prevents us from achieving complete fault tolerance is HADOOP-4016. Once that gets fixed, we can port the jobtracker to any machine and continue from there. Hence the jobtracker is fault tolerant, but the switch is not seamless and porting requires manual intervention.

Francesco Salbaroli added a comment - 04/Nov/08 11:58

Thanks for your quick response, Amar. Just a quick question: is it possible now to have multiple instances of the JobTracker running simultaneously on different machines, with transparent failover to a redundant copy?
It is a problem reported by the IBM HiPODS team working on the academic initiative.

Steve Loughran added a comment - 05/Nov/08 14:55

This is an interesting project which will provide much thesis work, especially from the testing and proof-of-correctness perspectives.
- There are some implicit assumptions about the ability of the infrastructure to provision hardware, namely that Cold Standby is inappropriate. If a virtual machine can be provisioned and brought up live within a minute, Cold Standby is surprisingly viable and, on pay-as-you-go infrastructure, cost-effective.
- The statement that forwarding all state changes to all slaves (Hot Standby) is best needs to be qualified with estimated load values and the impact of the events on the network. Is there a cluster size or map/reduce job lifetime at which the state traffic will become an issue, or is it just load on the nodes?
- How do you intend to implement failover without notifying the task trackers? DNS update?
- I would like to see some coverage of the election protocol, in particular how to coordinate such an election over an infrastructure which denies multicast IP (e.g. Amazon EC2).
- Determining the liveness of the JobTracker is going to be hard. Using Lamport's definitions, it is only live if it is capable of performing work within bounded time, so the true way to determine health is to submit work to the system. Early failures (IPC deadlock, host outage, etc.) may be detectable early, but some failure modes may be hard to detect. Some of the ongoing work in HADOOP-3628 can act as a starting point, but it is inadequate if you really want "HA".
- Ignoring HDFS availability, what is going to happen when the farm partitions and both partitions have slaves and a set of task trackers? Who will be in charge?
- I would have expected some citations for an MSc project; presumably this is an early draft.
- Take a look at Anubis; this is how we implement partition awareness/HA, though it currently uses multicast to bootstrap, so it will not work on EC2 without adding a new discovery mechanism (SimpleDB?): http://wiki.smartfrog.org/wiki/display/sf/Anubis

Francesco Salbaroli added a comment - 06/Nov/08 11:32

Thank you for the comment, Steve.
1) You're right, but I don't have exact details about the infrastructure on which the Hadoop cluster will run (here at IBM it will probably be possible to run Hadoop on top of the IBM BlueCloud cloud computing infrastructure). So in an environment with fault detection and fast, automatic VM provisioning, cold standby can be an option. This point needs further investigation and I hope to receive feedback again on this.
2) In Hot Standby I assumed a very small number of replicas of the JobTracker (between 2 and 4), so the network traffic shouldn't be a major concern. But this can be verified only after extensive test sessions.
3) Yes, DNS update seems to be the best option.
4) I wasn't aware of the multicast limitation on Amazon EC2. Two possible solutions: define the JobTracker nodes statically, or maintain a list of nodes on DFS or in a shared cache.
5) I thought about a heartbeat mechanism.
6) Network partitioning is an issue. Ignoring HDFS, using an election protocol will produce separate, smaller, fully functional clusters (which is unwanted behaviour), but the system should be able to detect multiple running masters and re-run the election to reach another stable state.
7) This will be part of my M.Sc., but I produced this document only to propose my ideas to the community and gather feedback. And, yes, this is an early draft.
8) I will take a look at it.
So, thank you again for the interest you show in my work, and I hope to hear from you and other community members soon.
Francesco
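The heartbeat mechanism mentioned in point 5 can be sketched as follows. This is a minimal, self-contained illustration, not Hadoop code: the class name, timeout value, and explicit clock parameter are all hypothetical, chosen only to show the rule "declare the master dead if no heartbeat arrives within a timeout".

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of heartbeat-based failure detection for a JobTracker.
// A replica calls recordHeartbeat() whenever the master's heartbeat arrives;
// isAlive() reports whether the master has missed its deadline.
public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final AtomicLong lastHeartbeat;

    public HeartbeatMonitor(long timeoutMillis, long now) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeat = new AtomicLong(now);
    }

    public void recordHeartbeat(long now) {
        lastHeartbeat.set(now);
    }

    // Alive as long as the last heartbeat is within the timeout window.
    public boolean isAlive(long now) {
        return now - lastHeartbeat.get() <= timeoutMillis;
    }
}
```

Note that, as Steve points out above, a heartbeat only shows that the process responds; it does not prove the JobTracker can actually schedule work within bounded time.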

Francesco Salbaroli added a comment - 19/Nov/08 15:03

What does the community think about using the JGroups reliable multicast system for communication and status monitoring between master and slaves?
It has two major benefits:
1) It implements reliable multicast communications.
2) It abstracts from the protocol used (it can exploit the benefits of multicast UDP where available, or use TCP where multicast is forbidden, e.g. Amazon EC2).
Regards,
Francesco
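For context on the JGroups proposal: JGroups' default convention is that the first member of the current view acts as group coordinator, and when the coordinator fails it simply drops out of the view and the next-oldest member takes over. The following is a stdlib-only simulation of that rule (it does not use the JGroups API itself; the class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of JGroups-style coordinator election: members are ordered by
// join time (the "view"), and the first member of the view is coordinator.
public class ViewCoordinator {
    private final List<String> view = new ArrayList<>();

    public void join(String member)  { if (!view.contains(member)) view.add(member); }
    public void crash(String member) { view.remove(member); }

    // The coordinator (cluster master) is the oldest surviving member.
    public String coordinator() {
        return view.isEmpty() ? null : view.get(0);
    }
}
```

In the real library, the membership list comes from the group's View object and updates are delivered through view-change callbacks; this sketch only captures the ordering rule.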

Francesco Salbaroli added a comment - 11/Dec/08 11:30

I will release a preliminary test version of fault tolerant Hadoop before 17th Dec.
Features will include:
- The JGroups 2.6.7 toolkit for reliable multicast communication, which is based on a highly configurable protocol stack to adapt to different environments (I will post documentation about it).
- A complete wrapper around the Hadoop source code, to minimize modifications in the source tree.
- Dynamic JobTracker address resolution using HDFS as a support.
Enhancements in future versions:
- Higher level of abstraction
- Better exception handling
I'll post the source code at the beginning of next week (hopefully).
Can I be added to the developers?
Best regards,
Francesco

Francesco Salbaroli added a comment - 18/Dec/08 10:47

This is a very preliminary (and only locally tested) release of fault tolerant Hadoop.
The Hadoop source tree is only slightly modified, in the org.apache.hadoop.mapred.TaskTracker class.
The package containing the fault tolerance wrapper is org.apache.hadoop.mapred.faulttolerant.
The JGroups library (jgroups-all.jar) must be copied into the lib/ folder.
To run the FT version of Hadoop:
1) Configure hadoop-site.xml to match the environment
2) Format the HDFS filesystem ($HADOOP_HOME/bin/hadoop namenode -format)
3) Run the HDFS daemons (to run locally: $HADOOP_HOME/bin/start-dfs.sh)
4) Run one or more instances of FTJobTracker ($HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.faulttolerant.FTJobTracker)
5) Run one or more instances of FTTaskTracker ($HADOOP_HOME/bin/hadoop org.apache.hadoop.mapred.faulttolerant.FTTaskTracker)
Regards,
Francesco Salbaroli
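For step 1 above, a minimal hadoop-site.xml for a single-machine test might look like the following. The property names are the standard keys of the 0.x configuration; the host names and ports are example values only:

```xml
<configuration>
  <!-- NameNode address: where the HDFS daemons from step 3 listen -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- JobTracker address: example value, adjust to the environment -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```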

Nigel Daley
added a comment - 05/Jan/09 22:35 Francesco, given this is a new feature, it can be integrated into the next major release (which is 0.21). It can't be integrated into a sustaining release.

Bo Shi added a comment - 08/Jan/09 17:31

> In my opinion, it is better to avoid using active copies due to the high
> complexity of the coordination protocol and, instead, using a master-slave
> model with soft-state shared between copies through a distributed cache
> mechanism or saved on HDFS.

Please forgive me if I'm being naive here (I see that I'm a bit late to the show), but wouldn't using ZooKeeper to persist jobtracker state effectively mask this complexity?
Has anyone explored refactoring the job tracker to use ZooKeeper instead of engineering a new master/slave replication system?

Suresh Srinivas added a comment - 08/Jan/09 17:51

Sorry for the late comments.
For a master/slave HA solution, the two main problems are:
1. A mechanism that determines a master in the cluster during startup and failover, handling loss of quorum, split-brain, and fencing in case of split-brain. It also requires comprehensive management tools for configuring, managing and monitoring the cluster.
2. Sharing state information between master and slave, so that a slave node can take over as master.
Currently the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While sharing information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add more information on how the design handles these issues, and some notes on how an administrator uses this functionality to manage the cluster?
An analysis of the impact on job tracker performance due to the introduction of this feature also needs to be done.
> Has anyone explored refactoring the job tracker to use Zookeeper instead of engineering a new master/slave replication system?
Storing the jobtracker state in ZooKeeper may not be a viable option, given that ZooKeeper is intended for storing small amounts of data (KBs) and the JobTracker has a lot more data than that to persist.

dhruba borthakur added a comment - 08/Jan/09 18:25

Zookeeper might not work well for maintaining JobTracker state (or, for that matter, Namenode persistent state) because these processes have lots of metadata to store.

Francesco Salbaroli added a comment - 09/Jan/09 10:30 - edited

> Sorry for the late comments:
> For a master/slave HA solution, two main problems are:
> 1. Mechanism that determines a master in a cluster during startup and failover.

The JGroups library (whose manual can be found here: http://www.jgroups.org/javagroupsnew/docs/manual/pdf/manual.pdf) automatically handles the election of a group coordinator. The node elected group coordinator is also the master of the cluster. In case of a failure, a new group coordinator (and, consequently, a new cluster master) will be elected.

> Handling loss of quorum,

The shared state resides entirely on HDFS (see issues HADOOP-1876 and HADOOP-3245), so, as of now, there is no shared soft-state between nodes. However, the facilities for managing a shared state are present and can be used in a future update.

> split-brain and fencing in case of split-brain.

The JGroups library tries to handle network partitions and merging automatically, but given that:
- there is no shared soft-state, and
- there is only one access point to HDFS in the whole Hadoop cluster (the NameNode),
the network partition problem should not be an issue (only one partition at a time can access HDFS). In future versions a more elegant way of dealing with network partitions should be added.

> It also requires comprehensive management tools for configuring, managing and monitoring the cluster.

I am now adding JMX support. After the initial testing phase I will post results and an updated version.

> 2. Sharing state information between master and slave, so that a slave node can take over as master.
> Currently the proposed solution addresses mainly the second problem. I have not seen much information on how the first problem is addressed. While the sharing of information between master and slave can be done in many ways, managing the master/slave cluster is a more complicated problem. Could you please add more information on how the design handles these issues and some notes on how an administrator uses this functionality to manage the cluster.

I hope I have given an answer to your questions. If you need more, feel free to contact me.

> Also analysis of the impact of job tracker performance due to the introduction of this feature needs to be done.

I am about to begin the testing phase; results will follow.
Regards,
Francesco

dhruba borthakur added a comment - 09/Jan/09 16:10

> the network partition problem should not be an issue (only one partition at a time can access the HDFS).

I think the problem still exists. Suppose a network partition occurs between a master and the remainder of the nodes. A new master is elected all right, and the new master will now own the metadata state. The new master will start to update JobTracker metadata stored in HDFS because he thinks he is the sole owner of this metadata. At this time, how can you guarantee that the old master is not continuing to update the same metadata?

Doug Cutting added a comment - 09/Jan/09 18:19

> Zookeeper might not work well for maintaining JobTracker state (or for that matter, Namenode persistent state) because these processes have lots of metadata to store.

That's the key concern. ZooKeeper's in-memory data structures would probably take much more space than those in the namenode and/or jobtracker do today. Other than that, ZooKeeper seems ideally suited to these tasks. Perhaps if ZooKeeper were to support namespace partitioning and rebalancing (hard problems), then it could be used to store such data. It would certainly vastly simplify many things.

Francesco Salbaroli added a comment - 14/Jan/09 16:22

Some bug fixes in release 0.3 to make it work in a distributed environment.
This version has been tested in a distributed VMware Infrastructure 3 environment (using RHEL 5.2 as the guest OS).

Sharad Agarwal added a comment - 16/Jun/09 08:37

If I understand correctly, the current patch doesn't share state between master and slaves; it relies on HADOOP-3245 for keeping the state. I assume that, for this to work, the state has to be kept on HDFS instead of the local filesystem. In case a new master is elected, the jobtracker is started using the state from HDFS, right?
Also, reading the master info from HDFS at frequent intervals from each node may not scale well. I think ZooKeeper would be better suited in the case where we are just doing master election and keeping watch on master changes.

Hari A V added a comment - 19/Jan/11 10:14

Hi,
In my team, we have also been analysing how to provide HA for the JobTracker. Our approach is quite similar to Francesco's.
The complete HA solution can be divided into three aspects:
1. Sharing of job-related state between master and slave job trackers
This can be achieved with issues HADOOP-1876 and HADOOP-3245.
2. Failure detection and master election
We prefer ZooKeeper for this. We had quite a bad experience with JGroups in some of our previous projects, including deadlocks, network traffic overhead, etc. (maybe the latest version of JGroups is stable). We were forced to replace JGroups. ZooKeeper is the best solution available for leader election. We have seen ZooKeeper used very well in similar situations in the "Katta" project and also in some of our internal projects.
3. How to notify JobClients and TaskTrackers about the new master on failure
One option would be DNS, as mentioned.
Another option is providing a list of job tracker IPs to JobClients and TaskTrackers. They can silently retry on all available IPs in case of failure. On the server side, slave job trackers will not accept any service request. This way we can avoid split-brain and network partition scenarios. A ZooKeeper cluster inherently avoids split-brain issues in leader election.
We have not yet started our work. Please provide your valuable opinions.
Thanks,
Hari
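Hari's option 3 (clients retrying over a fixed list of JobTracker addresses, where only the active master accepts requests) can be sketched as follows. The connection attempt is abstracted behind a predicate, since the real RPC plumbing lives in Hadoop's IPC layer; the class and method names here are invented for illustration:

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of client-side failover: try each known JobTracker address in turn
// until one accepts the request. Slave JobTrackers reject service requests,
// so only the current master "accepts".
public class FailoverClient {
    // Returns the first address that accepts, or null if none do.
    public static String findMaster(List<String> addresses, Predicate<String> accepts) {
        for (String addr : addresses) {
            if (accepts.test(addr)) {
                return addr;   // this JobTracker is the active master
            }
        }
        return null;           // no master reachable; caller should back off and retry
    }
}
```

The design choice worth noting is that the rejection happens server-side: because slaves refuse service, a client can safely probe every address without risking two masters accepting work at once.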

Leitao Guo added a comment - 23/Feb/11 11:59

We also have a solution for JobTracker HA, based on LVS:
(1) Start a JobTracker attached to a virtual IP. All TaskTrackers and JobClients connect to the JT via the virtual IP;
(2) 2 LVS servers (for hot standby) monitor the state of the JobTracker;
(3) When the JobTracker goes down, LVS will trigger a script to start another JobTracker on another server. Job information will be recovered using issues HADOOP-1876 and HADOOP-3245.
This solution needs no changes to the JobTracker, but the deployment is a little complicated.

Leitao Guo added a comment - 23/Feb/11 12:05

@Hari, your solution is the same as HBase master HA; I think it works!
I think I can contribute to this issue. Should I start a new JIRA about ZK+JT, or reassign this issue to me?

Arun C Murthy added a comment - 28/Feb/11 18:43

Leitao and Hari - apologies for coming in late. I've missed this so far.
HADOOP-1876 and HADOOP-3245 have had too many issues in the past and we have since moved away from this model - in fact we never deployed either at any reasonable scale due to the issues we have seen with them. Also, we have actually removed a lot of this code in future versions of Hadoop, since it didn't work well at all and complicated the JobTracker to a very large extent.
OTOH, we have been working on a completely revamped architecture for Hadoop Map-Reduce via MAPREDUCE-279. You guys might be interested... also, we would love your feedback based on your experiences there. Thanks!

Leitao Guo added a comment - 01/Mar/11 06:40

Thanks for your response, Arun.
Although HADOOP-1876 and HADOOP-3245 do not work well, I think failover for the JobTracker is still worth considering.
If the JobTracker on one server goes down, we need to restart the JobTracker or migrate it to another server. In our scenario, we may not care whether the job continues from the same progress it had before the JobTracker failed, but automatic failover is needed. Integrating ZooKeeper with the JobTracker is a workable solution for failover, I think.

Wang Xin added a comment - 22/Apr/11 16:19

I would like to know the status of this issue. I want to get the HA feature in the 0.21 version, and I would like to know how to integrate it with ZK. Can someone tell me how to run JobTrackers with JGroups or other alternatives?

Hari A V added a comment - 08/Jun/11 09:09

Hi,
Sorry for a very late response.
@Arun: Yes, MAPREDUCE-279 is a completely new architecture. It may still take a while to get done. For those who use the 0.20 version and need a simple "availability solution", a much simpler approach would be helpful.
@Leitao: Yes, it's similar to HMaster HA. It works. I have finished the development of a ZK-based framework and integrated it with the JT. I am in the process of contributing it back. As a first step, I have opened a JIRA in ZooKeeper for a generic LeaderElectionService (ZOOKEEPER-1080). I will upload the patch soon.
ZK+JT may not be a full-fledged HA solution, but what it tries to address is:
1. Avoiding manual intervention during a JobTracker failure.
2. Recovering and continuing the jobs (even re-submitting them) without requiring action from the clients who submitted them.
The solution remains very simple, as there is no need to synchronize the state of the jobs.
Cons:
- A job may take longer to finish during failover, due to the re-submission of jobs.
Please provide suggestions.
-Hari
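The ZooKeeper recipe behind a LeaderElectionService like the one proposed in ZOOKEEPER-1080 works as follows: each candidate creates an ephemeral sequential znode, the candidate holding the smallest sequence number is the leader, and when the leader's session expires its znode disappears so the next-smallest takes over. The following is a stdlib-only simulation of that rule (no ZooKeeper dependency; the class and method names are invented for illustration):

```java
import java.util.TreeMap;

// Simulation of ZooKeeper's ephemeral-sequential leader election recipe:
// each candidate gets a monotonically increasing sequence number, and the
// live candidate with the smallest number is the leader.
public class LeaderElection {
    private final TreeMap<Integer, String> candidates = new TreeMap<>();
    private int nextSeq = 0;

    // Candidate joins: analogous to creating an ephemeral sequential znode.
    public int join(String id) {
        candidates.put(nextSeq, id);
        return nextSeq++;
    }

    // Session expiry: the ephemeral znode disappears.
    public void leave(int seq) {
        candidates.remove(seq);
    }

    // Leader = smallest surviving sequence number.
    public String leader() {
        return candidates.isEmpty() ? null : candidates.firstEntry().getValue();
    }
}
```

In the real recipe each candidate watches only its immediate predecessor's znode rather than the leader itself, which avoids a "herd effect" where every candidate wakes up on each leader change.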