iSCSI failover cluster with Pacemaker

This howto shows the steps to follow to create an operational two-node iSCSI cluster in failover mode. By coupling DRBD synchronization with Pacemaker, we will build a redundant shared storage system at low cost.

This howto will also teach you the basic commands needed to get started with Pacemaker.

Initializing the cluster

Before launching the Pacemaker cluster, we need to set its basic configuration parameters in /etc/corosync/corosync.conf. This file is organized in several sections:

totem: this section holds the multicast configuration used by all nodes to exchange information such as configuration, quorum and heartbeat messages.

The most important parameters are mcastaddr and mcastport, the multicast address and port that the nodes of the cluster will use to communicate with each other. They must not overlap with those of another cluster that may be running on the same network.

bindnetaddr is the address of the network where the nodes are located.

In the service section, name designates the cluster service to load. In our case, only pacemaker can be used.
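Putting these sections together, a minimal corosync.conf could look like the sketch below; the network address, multicast address and port are example values to adapt to your own network:

```
totem {
        version: 2
        interface {
                ringnumber: 0
                # Network where the cluster nodes live (example value)
                bindnetaddr: 192.168.1.0
                # Multicast address/port, must be unique per cluster on this network (example values)
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

service {
        # Cluster service to load; only pacemaker is possible here
        name: pacemaker
        ver: 0
}
```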

The last step is to start the cluster with /etc/init.d/corosync start and to make sure that this script is launched automatically at each boot:

# chkconfig --add corosync
# chkconfig --level 3 corosync on

This configuration file must be manually updated on each node taking part in the cluster. It is the only configuration step that must be performed on all nodes: once the file is created and the cluster service started on each node, the rest of the configuration is done by talking to the cluster daemon on a single node, using the crm command. The cluster daemon automatically distributes the configuration to all nodes of the cluster.

Once the cluster daemons are running, you can check the status of the cluster by launching the command crm_mon on any node.

At this stage, our cluster is “empty”, but you can already use the crm configure show command to display the current (still empty) configuration.

To finish initializing the cluster, we will set some global properties:

we disable the use of a STONITH device (Shoot The Other Node In The Head: on fail-over, a mechanism that sends a command to completely shut down the failed node, to be sure it is really deactivated);

by default, quorum is lost when half or fewer of the nodes remain in the cluster. In our two-node fail-over cluster, if one node disappears, quorum is automatically lost (only 1 node survives, which is exactly half of the total of 2 nodes). And a cluster that loses quorum will not switch its resources anymore. We therefore need to ignore the loss of quorum to still be able to switch the resources, hence this setting;

by setting a positive resource stickiness, we ensure that after a fail-over the resources continue to run on the surviving node and are not switched back to the failed node when it returns (switching resources stops them and increases the risk of problems).
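These three properties can be set with the crm command; the stickiness value of 100 below is an arbitrary positive example:

```shell
# Disable STONITH (no fencing device in this setup)
crm configure property stonith-enabled=false
# Keep switching resources even when quorum is lost (two-node cluster)
crm configure property no-quorum-policy=ignore
# Any positive value keeps resources on the surviving node after fail-over
crm configure rsc_defaults resource-stickiness=100
```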

Lastly, we will have to create at least one DRBD resource (this happens outside Pacemaker), as described in this other howto.

Adding resources to the cluster

In Pacemaker, resources are managed by resource agents (RAs). These are scripts that the cluster executes to perform tasks like starting and stopping a service, requesting its status, checking its health, and so on.

To see which classes and providers are available for your resource agents, you can do:
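For example (the exact lists depend on the agent packages installed on your system):

```shell
# List the available resource agent classes (lsb, ocf, service, stonith, ...)
crm ra classes
# List the agents provided by the ocf class, heartbeat provider
crm ra list ocf heartbeat
# Show the parameters and operations of a given agent
crm ra meta ocf:linbit:drbd
```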

With the keyword params, you specify a parameter and its value, each pair separated from the next one by a space. The parameter names are those displayed by the crm ra meta <ra> command.

With the keyword op, you change the default values of the defined operations (like starting, stopping or monitoring the resource). These operations and their default values are also shown by the crm ra meta <ra> command.
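As an illustration, a DRBD primitive combining params and op could look like this; iscsi-drbd and the DRBD resource name r0 are example values, and the monitor intervals are a common convention for master/slave resources:

```shell
crm configure primitive iscsi-drbd ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
```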

Once the resource is defined, you can modify a given parameter of it with the following commands:
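For instance, assuming a resource named iscsi-drbd with a drbd_resource parameter:

```shell
# Display the current value of a parameter
crm resource param iscsi-drbd show drbd_resource
# Change it to a new value
crm resource param iscsi-drbd set drbd_resource r1
```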

Master-slave resources

Until now, with the “primitive” configuration directive, we have been defining resources that the cluster starts or stops on a single node. Such a resource can only be started on one node and is stopped on all the others.

As you may guess, for DRBD it is slightly different. Indeed, a DRBD resource must be running on both nodes of the cluster, one being primary and the other secondary. And when a switch or fail-over occurs, it is not a matter of starting or stopping the resource, but of promoting or demoting it.

In Pacemaker, this can be achieved with the so-called “master-slave” (ms) configuration stanza. For our cluster, the command will be:
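A sketch of this command; the meta values other than master-max are the usual defaults for a two-node DRBD setup:

```shell
crm configure ms iscsi-meta iscsi-drbd \
    meta master-max="1" master-node-max="1" \
    clone-max="2" clone-node-max="1" notify="true"
```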

In the command above, iscsi-meta is the name we give to this resource, iscsi-drbd being the primitive that must be configured as a master-slave resource. The master-max meta attribute specifies how many masters (primaries) can run at the same time. Obviously, in our two-node setup, we only need one primary in the whole cluster.

Ordering and grouping resources

As you can see, to run an iSCSI target in our cluster, we need to define 4 primitives and 1 master-slave resource. But when started, all these resources must run together on the same node. There is no point in running the iSCSI target on one node and the DRBD primary on another.
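By way of illustration, the remaining primitives could look like this; all the names, the IQN, the IP address and the device path are placeholders to adapt to your setup:

```shell
# Virtual IP that the iSCSI initiators will connect to
crm configure primitive iscsi-ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.100" cidr_netmask="24"
# The iSCSI target itself
crm configure primitive iscsi-target ocf:heartbeat:iSCSITarget \
    params iqn="iqn.2011-01.example.com:storage"
# A logical unit exported by the target, backed by the DRBD device
crm configure primitive iscsi-lun ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn="iqn.2011-01.example.com:storage" lun="1" path="/dev/drbd0"
```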

Also, some resources must be started before others: DRBD must be primary before we set up the target and the logical unit on it, and the target must be started before we start the logical unit.
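With the example names used in this howto (iscsi-target being a placeholder for the target primitive), pairwise colocation and ordering constraints could look like this:

```shell
# The target must run on the node where the DRBD resource is master
crm configure colocation target-with-drbd inf: iscsi-target iscsi-meta:Master
# DRBD must be promoted to master before the target is started
crm configure order drbd-before-target inf: iscsi-meta:promote iscsi-target:start
```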

The first value is the identifier we give to the constraint. Then comes the advisory score, here infinity (meaning mandatory), followed by the resources (with their state after the colon) that must be grouped together.

As explained in the help of the crm command, I also tried to use more than 2 resources in the order and colocation stanzas, but it did not lead to the same result as using only a combination of pairwise ordering and colocation constraints.

I still don't know why.

And that's it: we have now configured everything needed to let the cluster run our iSCSI target in fail-over mode.