Configuring Software for HDFS HA

追风骚年 · 2023-07-26



  Note:


The subsections that follow explain how to configure HDFS HA using Quorum-based storage. This is the only implementation supported in CDH 5.


Configuration Overview

As with HDFS Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster can have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.

HA clusters reuse the NameService ID to identify a single HDFS instance that may consist of multiple HA NameNodes. In addition, there is a new abstraction called NameNode ID. Each distinct NameNode in the cluster has a different NameNode ID. To support a single configuration file for all of the NameNodes, the relevant configuration parameters include the NameService ID as well as the NameNode ID.
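For example, a per-NameNode setting is formed by suffixing the generic key with both IDs. With mycluster as the NameService ID and nn1 and nn2 as the NameNode IDs (the values used throughout the examples below), the RPC address settings become:

dfs.namenode.rpc-address.mycluster.nn1
dfs.namenode.rpc-address.mycluster.nn2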


Changes to Existing Configuration Parameters

The following configuration parameter has changed for YARN implementations:

fs.defaultFS - formerly fs.default.name, the default path prefix used by the Hadoop FS client when none is given. (fs.default.name is deprecated, but will still work.)

Optionally, you can also configure the default path for Hadoop clients to use the new HA-enabled logical URI. For example, if you use mycluster as the NameService ID as shown below, this will be the value of the authority portion of all of your HDFS paths. You can configure the default path in your core-site.xml file:

  • For YARN:
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://mycluster</value>
    </property>
  • For MRv1:
    <property>
      <name>fs.default.name</name>
      <value>hdfs://mycluster</value>
    </property>


New Configuration Parameters

To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.

The order in which you set these configurations is unimportant, but the values you choose for dfs.nameservices and dfs.ha.namenodes.[nameservice ID] will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.

Configure dfs.nameservices

dfs.nameservices

Choose a logical name for this nameservice, for example mycluster, and use this logical name for the value of this configuration option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.


  Note:

If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.


<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
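If you are also using HDFS Federation (see the note above), the value becomes a comma-separated list of all nameservice IDs. A minimal sketch, assuming a hypothetical second nameservice named yourcluster:

<property>
  <name>dfs.nameservices</name>
  <value>mycluster,yourcluster</value>
</property>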


Configure dfs.ha.namenodes.[nameservice ID]

dfs.ha.namenodes.[nameservice ID]

Configure a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used mycluster as the NameService ID previously, and you wanted to use nn1 and nn2 as the individual IDs of the NameNodes, you would configure this as follows:


<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>


  Note:

In this release, you can configure a maximum of two NameNodes per nameservice.


Configure dfs.namenode.rpc-address.[nameservice ID]

dfs.namenode.rpc-address.[nameservice ID].[name node ID]

For both of the previously-configured NameNode IDs, set the full address and RPC port of the NameNode process. Note that this results in two separate configuration options. For example:


<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>


  Note:

If necessary, you can similarly configure the servicerpc-address setting for each NameNode.
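A minimal sketch of such a configuration, using the same hosts as above; the property name follows the same [nameservice ID].[name node ID] pattern, and the port 8022 is only an illustrative assumption, not a value from this document:

<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8022</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8022</value>
</property>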


Configure dfs.namenode.http-address.[nameservice ID]

dfs.namenode.http-address.[nameservice ID].[name node ID]

Similarly to rpc-address above, set the addresses for both NameNodes' HTTP servers to listen on. For example:


<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>machine1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>machine2.example.com:50070</value>
</property>


  Note:

If you have Hadoop's Kerberos security features enabled, and you intend to use HSFTP, you should also set the https-address similarly for each NameNode.
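A minimal sketch of such a configuration, again using the hosts from the examples above; the port 50470 is an assumption (a common NameNode HTTPS default for this Hadoop generation), so adjust it to your deployment:

<property>
  <name>dfs.namenode.https-address.mycluster.nn1</name>
  <value>machine1.example.com:50470</value>
</property>
<property>
  <name>dfs.namenode.https-address.mycluster.nn2</name>
  <value>machine2.example.com:50470</value>
</property>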


Configure dfs.namenode.shared.edits.dir

dfs.namenode.shared.edits.dir

Configure the addresses of the JournalNodes which provide the shared edits storage, written to by the Active NameNode and read by the Standby NameNode to stay up-to-date with all the file system changes the Active NameNode makes. Though you must specify several JournalNode addresses, you should only configure one of these URIs. The URI should be in the form:


qjournal://<host1:port1>;<host2:port2>;<host3:port3>/<journalId>


The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. Though it is not a requirement, it's a good idea to reuse the nameservice ID for the journal identifier.

For example, if the JournalNodes for this cluster were running on the machines node1.example.com, node2.example.com, and node3.example.com, and the nameservice ID were mycluster, you would use the following as the value for this setting (the default port for the JournalNode is 8485):


<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>


Configure dfs.journalnode.edits.dir

dfs.journalnode.edits.dir

On each JournalNode machine, configure the absolute path where the edits and other local state used by the JournalNodes will be stored. Only a single path may be configured per JournalNode; redundancy of the data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID-1 or RAID-10 array.

For example:


<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/1/dfs/jn</value>
</property>


Now create the directory (if it doesn't already exist) and make sure its owner is hdfs, for example:


$ sudo mkdir -p /data/1/dfs/jn
$ sudo chown -R hdfs:hdfs /data/1/dfs/jn


Client Failover Configuration

dfs.client.failover.proxy.provider.[nameservice ID]

Configure the name of the Java class which the DFS Client will use to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The only implementation which currently ships with Hadoop is the ConfiguredFailoverProxyProvider, so use this unless you are using a custom one. For example:


<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>


Fencing Configuration

dfs.ha.fencing.methods

It is desirable for correctness of the system that only one NameNode be in the Active state at any given time.


  Important:

When you use Quorum-based Storage, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata in a "split-brain" scenario. But when a failover occurs, it is still possible that the previously active NameNode could serve read requests to clients - and these requests may be out of date - until that NameNode shuts down when it tries to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using Quorum-based Storage.


To improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list.


  Note:

If you choose to use no actual fencing methods, you still must configure something for this setting, for example shell(/bin/true).


The fencing methods used during a failover are configured as a carriage-return-separated list, and these will be attempted in order until one of them indicates that fencing has succeeded.

There are two fencing methods which ship with Hadoop:

  • sshfence
  • shell

For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.
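Putting the advice above together, here is a sketch of a dfs.ha.fencing.methods value that tries sshfence first and falls back to shell(/bin/true) as the guaranteed-success last method, with each method on its own line:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>
    sshfence
    shell(/bin/true)
  </value>
</property>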

Configuring the sshfence fencing method

sshfence

The sshfence option uses SSH to connect to the target node and uses fuser to kill the process listening on the service's TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, you must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files.


  Important:

The files must be accessible to the user running the NameNode processes (typically the hdfs user).


For example:


<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>


Optionally, you can configure a non-standard username or port to perform the SSH as shown below. You can also configure a timeout, in milliseconds, for the SSH, after which this fencing method will be considered to have failed:


<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence([[username][:port]])</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>30000</value>
  <description>
    SSH connection timeout, in milliseconds, to use with the builtin sshfence fencer.
  </description>
</property>


Configuring the shell fencing method

shell

The shell fencing method runs an arbitrary shell command, which you can configure as shown below:


<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>


The string between '(' and ')' is passed directly to a bash shell and may not include any closing parentheses.

When executed, the first argument to the configured script will be the address of the NameNode to be fenced, followed by all arguments specified in the configuration.

The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the '_' character replacing any '.' characters in the configuration keys. The configuration used has already had any NameNode-specific configurations promoted to their generic forms - for example dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.

The following variables referring to the target node to be fenced are also available:

  • $target_host - hostname of the node to be fenced
  • $target_port - IPC port of the node to be fenced
  • $target_address - the above two, combined as <host:port>
  • $target_nameserviceid - the nameservice ID of the NameNode to be fenced
  • $target_namenodeid - the namenode ID of the NameNode to be fenced


These environment variables may also be used as substitutions in the shell command itself. For example:


<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>


If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.


  Note:

This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (for example, by forking a subshell to kill its parent in some number of seconds).
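A minimal sketch of such a script-level timeout, following the subshell-kills-parent pattern described in the note above; the 30-second limit and the placeholder fencing action (here just /bin/true) are illustrative assumptions:

#!/bin/bash
# Hypothetical fencing script that enforces its own 30-second timeout.
# The first argument is the address of the NameNode to be fenced (see above).
target="$1"

# Watchdog subshell: kill this (parent) script if it is still running after
# 30 seconds. Inside the subshell, $$ still refers to the parent's PID.
( sleep 30; kill -9 $$ ) 2>/dev/null &
watchdog=$!

# Placeholder for the real fencing action against "$target"; /bin/true simply
# succeeds so that the sketch is runnable end to end.
/bin/true "$target"
status=$?

# Cancel the watchdog and report the fencing result through the exit code.
kill "$watchdog" 2>/dev/null
exit "$status"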


Automatic Failover Configuration

The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.

Component Overview

Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).

Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:

  • Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.
  • Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node can take a special exclusive lock in ZooKeeper indicating that it should become the next active NameNode.

The ZKFailoverController (ZKFC) is a new component: a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:

  • Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds promptly with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.
  • ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special lock znode. This lock uses ZooKeeper's support for "ephemeral" nodes; if the session expires, the lock node will be automatically deleted.
  • ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has "won the election", and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.

Deploying ZooKeeper

In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Operators using MapReduce v2 (MRv2) often choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.

See the ZooKeeper documentation for instructions on how to set up a ZooKeeper ensemble. In the following sections we assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZooKeeper command-line interface (CLI).
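For example, a quick connectivity check with the ZooKeeper CLI, assuming zkCli.sh is on the PATH and one ensemble member runs on zk1.example.com (the hostname used in the configuration below):

$ zkCli.sh -server zk1.example.com:2181
[zk: zk1.example.com:2181(CONNECTED) 0] ls /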

Configuring Automatic Failover


  Note:

Before you begin configuring automatic failover, you must shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.


Configuring automatic failover requires two additional configuration parameters. In your hdfs-site.xml file, add:


<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>


This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:


<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>


This lists the host-port pairs running the ZooKeeper service.

As with the parameters described earlier in this document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting dfs.ha.automatic-failover.enabled.my-nameservice-id.
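For example, a sketch that enables automatic failover only for the mycluster nameservice used throughout this document:

<property>
  <name>dfs.ha.automatic-failover.enabled.mycluster</name>
  <value>true</value>
</property>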

There are several other configuration parameters which you can set to control the behavior of automatic failover, but they are not necessary for most installations. See the configuration section of the Hadoop documentation for details.

Initializing the HA state in ZooKeeper

After you have added the configuration keys, the next step is to initialize the required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.


  Note:

The ZooKeeper ensemble must be running when you use this command; otherwise it will not work properly.


$ hdfs zkfc -formatZK


This will create a znode in ZooKeeper in which the automatic failover system stores its data.

Securing access to ZooKeeper

If you are running a secure cluster, you will probably want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.

In order to secure the information in ZooKeeper, first add the following to your core-site.xml file:


<property>
  <name>ha.zookeeper.auth</name>
  <value>@/path/to/zk-auth.txt</value>
</property>
<property>
  <name>ha.zookeeper.acl</name>
  <value>@/path/to/zk-acl.txt</value>
</property>


Note the '@' character in these values – this specifies that the configurations are not inline, but rather point to a file on disk.

The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZooKeeper CLI. For example, you may specify something like digest:hdfs-zkfcs:mypassword, where hdfs-zkfcs is a unique username for ZooKeeper and mypassword is some unique string used as the password.
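With this example, the zk-auth.txt file referenced above would contain a single line (treat the credentials as placeholders to replace with your own):

digest:hdfs-zkfcs:mypassword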

Next, generate a ZooKeeper Access Control List (ACL) that corresponds to this authentication, using a command such as the following:


$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=


Copy and paste the section of this output after the '->' string into the file zk-acl.txt (the ACL file configured above), prefixed by the string "digest:". For example:


digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda


To put these ACLs into effect, rerun the zkfc -formatZK command as described above.

After doing so, you can verify the ACLs from the ZooKeeper CLI as follows:


[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa


Automatic Failover FAQ

Is it important that I start the ZKFC and NameNode daemons in any particular order?

No. On any given node you may start the ZKFC before or after its corresponding NameNode.

What additional monitoring should I put in place?

You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should be restarted to ensure that the system is ready for automatic failover. Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes, automatic failover will not function.

What happens if ZooKeeper goes down?

If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.

Can I designate one of my NameNodes as primary or preferred?

No. Currently, this is not supported. Whichever NameNode is started first will become active. You may choose to start the cluster in a specific order such that your preferred node starts first.

How do I initiate a manual failover when automatic failover is configured?

Even if automatic failover is configured, you can initiate a manual failover using the hdfs haadmin -failover command. It will perform a coordinated failover.
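For example, a coordinated failover from nn1 to nn2 (the NameNode IDs used throughout this document) would look like the following; this sketch assumes the first ID is the currently active NameNode, so verify the argument order against the haadmin usage output for your release:

$ hdfs haadmin -failover nn1 nn2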

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-High-Availability-Guide/cdh5hag_hdfs_ha_software_config.html


 
