Enabling HDFS HA

An HDFS high availability (HA) cluster uses two NameNodes—an active NameNode and a standby NameNode. Only one NameNode can be active at any point in time. HDFS HA depends on
maintaining a log of all namespace modifications in a location available to both NameNodes, so that in the event of a failure, the standby NameNode has up-to-date information about the edits and
location of blocks in the cluster.
Important: Enabling and disabling HA causes a service outage for the HDFS service and all services that depend on HDFS.
Before enabling or disabling HA, ensure that there are no jobs running on your cluster.

Enabling HDFS HA Using Cloudera Manager

You can use Cloudera Manager to configure your CDH 5 cluster for HDFS HA and automatic failover. In Cloudera Manager 5, HA is implemented using Quorum-based storage. Quorum-based storage
relies upon a set of JournalNodes, each of which maintains a local edits directory that logs the modifications to the namespace metadata. Enabling HA enables automatic failover as part of the same
command.

Important:

Enabling or disabling HA causes the previous monitoring history to become unavailable.

Some parameters will be automatically set as follows once you have enabled JobTracker HA. If you want to change the
value from the default for these parameters, use an advanced configuration snippet.

mapred.jobtracker.restart.recover: true

mapred.job.tracker.persist.jobstatus.active: true

mapred.ha.automatic-failover.enabled: true

mapred.ha.fencing.methods: shell(/bin/true)

Enabling High Availability and Automatic Failover

The Enable High Availability workflow leads you through adding a second (standby) NameNode and configuring JournalNodes.

Check the checkbox next to the host where you want the standby NameNode to be set up and click OK. The standby NameNode cannot be on the same host as
the active NameNode, and the host you choose should have the same hardware configuration (RAM, disk space, number of cores, and so on) as the active NameNode.

Check the checkboxes next to an odd number of hosts (a minimum of three) to act as JournalNodes and click OK. JournalNodes should run on hosts
with hardware specifications similar to those of the NameNodes. Cloudera recommends placing one JournalNode on each of the hosts running the active and standby NameNodes, and the third JournalNode
on a host with similar hardware, such as the JobTracker.

Click Continue.

In the JournalNode Edits Directory property, enter a directory location for the JournalNode edits directory into the fields for each JournalNode host.

You may enter only one directory for each JournalNode. The paths do not need to be the same on every JournalNode.

The directories you specify should be empty, and must have the appropriate permissions.

Extra Options: Decide whether Cloudera Manager should clear existing data in ZooKeeper, the standby NameNode, and the JournalNodes. If the directories are not
empty (for example, you are re-enabling a previous HA configuration), Cloudera Manager does not delete the contents automatically; you direct it to do so by leaving the default checkbox
selected. Clearing the directories is the recommended default. If you choose not to clear them, the data must be in sync across the edits directories of the JournalNodes and must have the same
version data as the NameNodes.

Click Continue.

Cloudera Manager executes a set of commands that stop the dependent services; delete, create, and configure roles and directories as appropriate; create a nameservice and failover controller;
restart the dependent services; and deploy the new client configuration.

Important: If you change the NameNode Service RPC Port (dfs.namenode.servicerpc-address) while automatic
failover is enabled, this will cause a mismatch between the NameNode address saved in the ZooKeeper /hadoop-ha znode and the NameNode address that the Failover
Controller is configured with. This will prevent the Failover Controllers from restarting. If you need to change the NameNode Service RPC Port after Auto Failover has been enabled, you must do the
following to re-initialize the znode:

Stop the HDFS service.

Configure the service RPC port:

Go to the HDFS service.

Click the Configuration tab.

Select Scope > NameNode.

Select Category > Ports and Addresses.

Locate the NameNode Service RPC Port property or search for it by typing its name in the Search box. Change the port value as needed and save the change.

On a ZooKeeper server host, run zookeeper-client, and execute the following to remove the configured nameservice. This example assumes the name of the nameservice is nameservice1. You can identify the nameservice from the Federation and High Availability section on the HDFS Instances tab:

rmr /hadoop-ha/nameservice1

Click the Instances tab.

Select Actions > Initialize High Availability State in ZooKeeper.

Start the HDFS service.

Fencing Methods

To ensure that only one NameNode is active at a time, a fencing method is required for the shared edits directory. During a failover, the fencing method is responsible for ensuring that
the previous active NameNode no longer has access to the shared edits directory, so that the new active NameNode can safely proceed writing to it.

By default, Cloudera Manager configures HDFS to use a shell fencing method (shell(./cloudera_manager_agent_fencer.py)) that
takes advantage of the Cloudera Manager Agent. However, you can configure HDFS to use the sshfence method, or you can add your own shell fencing scripts, instead of or
in addition to the one Cloudera Manager provides.

The fencing parameters are found in the Service-Wide > High Availability
category under the configuration properties for your HDFS service.

For details of the fencing methods supplied with CDH 5, and how fencing is configured, see Fencing Configuration.

Enabling HDFS HA Using the Command Line

Important:

If you use Cloudera Manager, do not use these command-line instructions.

This information applies specifically to CDH 5.5.x. If you use a lower version of CDH, see the documentation for
that version located at Cloudera Documentation.

This section describes the software configuration required for HDFS HA in CDH 5 and explains how to set configuration properties and use the command line to deploy HDFS HA.

Configuring Software for HDFS HA

Configuration Overview

As with HDFS federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is
designed such that all the nodes in the cluster can have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.

HA clusters reuse the Nameservice ID to identify a single HDFS instance that may consist of multiple HA NameNodes. In addition, there is a new abstraction
called NameNode ID. Each distinct NameNode in the cluster has a different NameNode ID. To support a single configuration file for all of the NameNodes, the relevant
configuration parameters include the Nameservice ID as well as the NameNode ID.

Changes to Existing Configuration Parameters

The following configuration parameter has changed for YARN implementations:

fs.defaultFS - formerly fs.default.name, the default path prefix used by the Hadoop FS client when none is given.
(fs.default.name is deprecated for YARN implementations, but will still work.)

Optionally, you can configure the default path for Hadoop clients to use the HA-enabled logical URI. For example, if you use mycluster as the Nameservice
ID as shown below, this will be the value of the authority portion of all of your HDFS paths. You can configure the default path in your core-site.xml file:
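
For example, a sketch of the corresponding core-site.xml entry:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>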

New Configuration Parameters

To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.

The order in which you set these configurations is unimportant, but the values you choose for dfs.nameservices and dfs.ha.namenodes.[Nameservice ID] will determine the keys of those that follow. This means that you should decide on these values before setting the rest of the configuration
options.

Configure dfs.nameservices

dfs.nameservices - the logical name for this new nameservice

Choose a logical name for this nameservice, for example mycluster, and use this logical name for the value of this configuration option. The name you
choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.

Note: If you are also using HDFS federation, this configuration setting should also include the list of other Nameservices, HA or otherwise, as a
comma-separated list.
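
For example, with mycluster as the nameservice ID, a minimal hdfs-site.xml entry would look like this:

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>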

dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice

Configure a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used mycluster as the NameService ID previously, and you wanted to use nn1 and nn2 as the individual IDs of the NameNodes,
you would configure this as follows:
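
A sketch of the corresponding hdfs-site.xml entry:

<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>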

Note: If you have Hadoop Kerberos security features enabled, and you intend to use HSFTP, you should also set the https-address similarly for each NameNode.

Configure dfs.namenode.shared.edits.dir

dfs.namenode.shared.edits.dir - the location of the shared storage directory

Configure the addresses of the JournalNodes which provide the shared edits storage, written to by the Active NameNode and read by the Standby NameNode to stay up-to-date with all the
file system changes the Active NameNode makes. Although you must specify several JournalNode addresses, you configure only a single URI that lists all of them. The URI should be in
the form:

qjournal://<host1:port1>;<host2:port2>;<host3:port3>/<journalId>

The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. Though it is not a
requirement, it's a good idea to reuse the Nameservice ID for the journal identifier.

For example, if the JournalNodes for this cluster were running on the machines node1.example.com, node2.example.com, and
node3.example.com, and the nameservice ID were mycluster, you would use the following as the value for this setting (the default port for
the JournalNode is 8485):
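
A sketch of the resulting entry, assuming the hosts and nameservice ID above:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>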

dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state

On each JournalNode machine, configure the absolute path where the edits and other local state information used by the JournalNode will be stored; use only a single path per
JournalNode. (Redundancy of this data is provided by running multiple separate JournalNodes; you can also place this directory on a locally-attached RAID-1 or RAID-10 array.)
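
For example, assuming a hypothetical local path /data/1/dfs/jn:

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/1/dfs/jn</value>
</property>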

Client Failover Configuration

dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode

Configure the name of the Java class which the DFS client will use to determine which NameNode is the current active, and therefore which NameNode is currently serving client requests.
The only implementation which currently ships with Hadoop is the ConfiguredFailoverProxyProvider, so use this unless you are using a custom one. For example:
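
A sketch, again assuming mycluster as the nameservice ID:

<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>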

Fencing Configuration

dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the active NameNode during a failover

It is desirable for correctness of the system that only one NameNode be in the active state at any given time.

Important: When you use Quorum-based Storage, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential
for corrupting the file system metadata in a "split-brain" scenario. But when a failover occurs, it is still possible that the previously active NameNode could serve read requests to clients - and
these requests may be out of date - until that NameNode shuts down when it tries to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using
Quorum-based Storage.

To improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last
fencing method in the list.

Note: If you choose to use no actual fencing methods, you still must configure something for this setting, for example shell(/bin/true).

The fencing methods used during a failover are configured as a carriage-return-separated list, and these will be attempted in order until one of them indicates that fencing has
succeeded.
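
For example, a sketch that tries sshfence first and falls back to a method that always succeeds:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>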

For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.

Configuring the sshfence fencing method

sshfence - SSH to the active NameNode and kill the process

The sshfence option uses SSH to connect to the target node and uses fuser to kill the process listening on the service's TCP
port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, you must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files.

Important: The files must be accessible to the user running the NameNode processes (typically the hdfs user on
the NameNode hosts).

Optionally, you can configure a non-standard username or port to perform the SSH as shown below. You can also configure a timeout, in milliseconds, for the SSH, after which this fencing
method will be considered to have failed:
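
A sketch combining these options; the username and port placeholders, key path, and timeout value are illustrative:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence([[username][:port]])</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
</property>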

Configuring the shell fencing method

shell - run an arbitrary shell command to fence the active NameNode

The string between '(' and ')' is passed directly to a bash shell and cannot include any closing parentheses.

When executed, the first argument to the configured script will be the address of the NameNode to be fenced, followed by all arguments specified in the configuration.

The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the '_' character replacing any '.' characters in the
configuration keys. The configuration used has already had any NameNode-specific configurations promoted to their generic forms - for example dfs_namenode_rpc-address
will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.

The following variables referring to the target node to be fenced are also available:

$target_host - hostname of the node to be fenced

$target_port - IPC port of the node to be fenced

$target_address - the two variables above, combined as host:port

$target_nameserviceid - the nameservice ID of the NameNode to be fenced

$target_namenodeid - the NameNode ID of the NameNode to be fenced

You can also use these environment variables as substitutions in the shell command itself. For example:
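
A sketch, where /path/to/my/script.sh is a hypothetical fencing script:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>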

If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method
in the list will be attempted.

Note: This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (for
example, by forking a subshell to kill its parent in some number of seconds).

Automatic Failover Configuration

The previous sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the
active node has failed. This section describes how to configure and deploy automatic failover. For an overview of how automatic failover is implemented, see Automatic Failover.

Deploying ZooKeeper

In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Because ZooKeeper itself has light resource requirements, it is acceptable to collocate the
ZooKeeper nodes on the same hardware as the HDFS active and standby NameNodes. Operators using MapReduce v2 (MRv2) often choose to deploy the third ZooKeeper process on the same node as the YARN
ResourceManager. For best performance and isolation, configure the ZooKeeper nodes to store their data on disk drives separate from the HDFS metadata.

See the ZooKeeper documentation for instructions on how to set up a ZooKeeper ensemble. In the following sections
we assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZooKeeper command-line interface (CLI).

Configuring Automatic Failover

Note: Before you begin configuring automatic failover, you must shut down your cluster. It is not currently possible to transition from a manual
failover setup to an automatic failover setup while the cluster is running.
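
Automatic failover requires two settings: dfs.ha.automatic-failover.enabled in hdfs-site.xml and ha.zookeeper.quorum in core-site.xml. A minimal sketch, with illustrative ZooKeeper hostnames, follows. In your hdfs-site.xml file, add:

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

In your core-site.xml file, list the host-port pairs of the ZooKeeper ensemble:

<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>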

As with the parameters described earlier in this document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For
example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting dfs.ha.automatic-failover.enabled.my-nameservice-id.

There are several other configuration parameters which you can set to control the behavior of automatic failover, but they are not necessary for most installations. See the configuration
section of the Hadoop documentation for details.

Initializing the HA state in ZooKeeper

After you have added the configuration keys, the next step is to initialize the required state in ZooKeeper. You can do so by running the following command from one of the NameNode
hosts.

Note: The ZooKeeper ensemble must be running when you use this command; otherwise it will not work properly.

$ hdfs zkfc -formatZK

This will create a znode in ZooKeeper in which the automatic failover system stores its data.

Securing access to ZooKeeper

If you are running a secure cluster, you will probably want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the
metadata in ZooKeeper or potentially triggering a false failover.

In order to secure the information in ZooKeeper, first add the following to your core-site.xml file:
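
A minimal sketch; the file paths are illustrative:

<property>
  <name>ha.zookeeper.auth</name>
  <value>@/path/to/zk-auth.txt</value>
</property>
<property>
  <name>ha.zookeeper.acl</name>
  <value>@/path/to/zk-acl.txt</value>
</property>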

Note the '@' character in these values – this specifies that the configurations are not inline, but rather point to a file on disk.

The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZooKeeper CLI. For example, you may specify something like digest:hdfs-zkfcs:mypassword where hdfs-zkfcs is a unique username for ZooKeeper, and mypassword is some unique string
used as a password.

Next, generate a ZooKeeper Access Control List (ACL) that corresponds to this authentication, using a command such as the following:
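
A sketch; the classpath and ZooKeeper version are illustrative, and the hash in the output depends on your username and password:

$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=

Copy the portion of the output after the '->' string into the ACL file, prefixed by the string digest: and suffixed with the permissions :rwcda, and put the plain-text authentication (digest:hdfs-zkfcs:mypassword in this example) into the authentications file.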

Automatic Failover FAQ

Is it important that I start the ZKFC and NameNode daemons in any particular order?

No. On any given node you may start the ZKFC before or after its corresponding NameNode.

What additional monitoring should I put in place?

You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit,
and should be restarted to ensure that the system is ready for automatic failover. Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes, automatic
failover will not function.

What happens if ZooKeeper goes down?

If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with
no issues.

Can I designate one of my NameNodes as primary/preferred?

No. Currently, this is not supported. Whichever NameNode is started first will become active. You may choose to start the cluster in a specific order such that your preferred node starts
first.

How can I initiate a manual failover when automatic failover is configured?

Even if automatic failover is configured, you can initiate a manual failover using the hdfs haadmin -failover command. It will perform a coordinated
failover.

Deploying HDFS High Availability

After you have set all of the necessary configuration options, you are ready to start the JournalNodes and the two HA NameNodes.

Install and Start the JournalNodes

Install the JournalNode daemons on each of the machines where they will run.

To install JournalNode on Red Hat-compatible systems:

$ sudo yum install hadoop-hdfs-journalnode

To install JournalNode on Ubuntu and Debian systems:

$ sudo apt-get install hadoop-hdfs-journalnode

To install JournalNode on SLES systems:

$ sudo zypper install hadoop-hdfs-journalnode

Start the JournalNode daemons on each of the machines where they will run:

$ sudo service hadoop-hdfs-journalnode start

Wait for the daemons to start before formatting the primary NameNode (in a new cluster) and before starting the NameNodes (in all cases).

Format the NameNode (if new cluster)

If you are setting up a new HDFS cluster, format the NameNode you will use as your primary NameNode; see Formatting
the NameNode.
Important: Make sure the JournalNodes have started. Formatting will fail if you have configured the NameNode to communicate with the
JournalNodes, but have not started the JournalNodes.

Start the NameNodes

Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are
using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab), and then, for each
command executed by this user, $ <command>.
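
A sketch of the start sequence, assuming the packaged init scripts used elsewhere in this section (on a Kerberos-enabled cluster, replace the sudo -u form with kinit as described in the note above). Start the primary NameNode first:

$ sudo service hadoop-hdfs-namenode start

Then, on the standby NameNode host, bootstrap the standby and start it:

$ sudo -u hdfs hdfs namenode -bootstrapStandby
$ sudo service hadoop-hdfs-namenode start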

Starting the standby NameNode with the -bootstrapStandby option copies over the contents of the primary NameNode's metadata directories (including the
namespace information and most recent checkpoint) to the standby NameNode. (The location of the directories containing the NameNode metadata is configured using the configuration options dfs.namenode.name.dir and dfs.namenode.edits.dir.)

You can visit each NameNode's web page by browsing to its configured HTTP address. Next to the configured address, the page shows the HA state of the NameNode (either "Standby" or
"Active"). Whenever an HA NameNode starts and automatic failover is not enabled, it is initially in the Standby state. If automatic failover is enabled, the first NameNode that is started becomes
active.

Restart Services (if converting existing non-HA cluster)

If you are converting from a non-HA to an HA configuration, you need to restart the JobTracker and TaskTracker (for MRv1, if used), or ResourceManager, NodeManager, and JobHistory Server
(for YARN), and the DataNodes:

On each DataNode:

$ sudo service hadoop-hdfs-datanode start

On each TaskTracker system (MRv1):

$ sudo service hadoop-0.20-mapreduce-tasktracker start

On the JobTracker system (MRv1):

$ sudo service hadoop-0.20-mapreduce-jobtracker start

Verify that the JobTracker and TaskTracker started properly:

$ sudo jps | grep Tracker

On the ResourceManager system (YARN):

$ sudo service hadoop-yarn-resourcemanager start

On each NodeManager system (YARN; typically the same ones where DataNode service runs):

$ sudo service hadoop-yarn-nodemanager start

On the MapReduce JobHistory Server system (YARN):

$ sudo service hadoop-mapreduce-historyserver start

Deploy Automatic Failover (if it is configured)

If you have configured automatic failover using the ZooKeeper FailoverController (ZKFC), you must install and start the zkfc daemon on each of the machines
that runs a NameNode. Proceed as follows.

To install ZKFC on Red Hat-compatible systems:

$ sudo yum install hadoop-hdfs-zkfc

To install ZKFC on Ubuntu and Debian systems:

$ sudo apt-get install hadoop-hdfs-zkfc

To install ZKFC on SLES systems:

$ sudo zypper install hadoop-hdfs-zkfc

To start the zkfc daemon:

$ sudo service hadoop-hdfs-zkfc start

It is not important that you start the ZKFC and NameNode daemons in a particular order. On any given node you can start the ZKFC before or after its corresponding NameNode.

You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and
should be restarted to ensure that the system is ready for automatic failover.

Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes, then automatic failover will not function. If the ZooKeeper cluster crashes, no
automatic failovers will be triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.

Verifying Automatic Failover

After the initial deployment of a cluster with automatic failover enabled, you should test its operation. To do so, first locate the active NameNode. As mentioned above, you can tell
which node is active by visiting the NameNode web interfaces.

Once you have located your active NameNode, you can cause a failure on that node. For example, you can use kill -9 <pid of NN> to simulate a JVM
crash. Or you can power-cycle the machine or its network interface to simulate different kinds of outages. After you trigger the outage you want to test, the other NameNode should automatically
become active within several seconds. The amount of time required to detect a failure and trigger a failover depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.

If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further
diagnose the issue.
