Setting up Linux Operating System Clusters on Hyper-V (2 of 3)

Background

This blog post is the second in a series of three that walks through setting up Linux operating system clusters on Hyper-V. The walk-through uses Red Hat Cluster Suite (RHCS) as the clustering software and Hyper-V’s Shared VHDX as the shared storage needed by the cluster software.

Part 1 of the series showed how to set up a Hyper-V host cluster and a shared VHDX. Then it showed how to set up five CentOS 6.7 VMs in the host cluster, all using the shared VHDX.

This post will set up the Linux OS cluster with the CentOS 6.7 VMs, running RHCS and the GFS2 file system. RHCS is specifically for use with RHEL/CentOS 6.x; RHEL/CentOS 7.x uses a different clustering software package that is not covered by this walk-through. The GFS2 file system is specifically designed to be used on shared disks accessed by multiple nodes in a Linux cluster, and so is a natural example to use.

Let’s get started!

Set up a guest cluster with the five CentOS 6.7 VMs running RHCS + GFS2 file system

After steps 1 and 2, we should reboot all the nodes so the changes take effect. Otherwise, we need to manually start or stop the above service daemons on every node.

Optionally, remove the “rhgb quiet” kernel parameters for every node, so you can easily see which cluster daemon fails to start on VM bootup.

Use a web browser to access https://my-vm1:8084, the web-based HA configuration tool luci. First log in as root and grant the user ricci permission to administer and create a cluster, then log out and log in as ricci.

Create a 5-node cluster “my-cluster”
We can confirm the cluster is created properly by checking the status of the service daemons and checking the cluster status (clustat):

service modclusterd status
service cman status
service clvmd status
service rgmanager status
clustat
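The same checks can be wrapped in a small loop so that a stopped daemon stands out. This is only a sketch: it assumes that each init script’s “status” action exits with 0 when the daemon is running, and STATUS_CMD is a hypothetical variable introduced here so the loop can be dry-run on a machine without the cluster packages.

```shell
# Daemon names are the ones checked in the commands above.
DAEMONS="modclusterd cman clvmd rgmanager"

# Print one line per daemon; return non-zero if any daemon is not running.
# STATUS_CMD is a hypothetical override for dry runs; it defaults to `service`.
check_daemons() {
    failed=0
    for d in $DAEMONS; do
        if ${STATUS_CMD:-service} "$d" status >/dev/null 2>&1; then
            echo "$d: running"
        else
            echo "$d: NOT running"
            failed=1
        fi
    done
    return $failed
}
```

On a cluster node, “check_daemons && clustat” would then print the cluster status only when all four daemons report as running.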

e.g., when we run the commands in my-vm3, we get:

Add a fencing device (we use SCSI-3 Persistent Reservation) and associate all the VMs with it.

Fencing is used to prevent erroneous/unresponsive nodes from accessing the shared storage, so data consistency can be achieved.

“SCSI-3 PR, which stands for Persistent Reservation, supports multiple nodes accessing a device while at the same time blocking access to other nodes. SCSI-3 PR reservations are persistent across SCSI bus resets or node reboots and also support multiple paths from host to disk. SCSI-3 PR uses a concept of registration and reservation. Systems that participate, register a key with SCSI-3 device. Each system registers its own key. Then registered systems can establish a reservation. With this method, blocking write access is as simple as removing registration from a device. A system wishing to eject another system issues a preempt and abort command and that ejects another node. Once a node is ejected, it has no key registered so that it cannot eject others. This method effectively avoids the split-brain condition.”

This is how we add SCSI-3 PR in RHCS:

NOTE 1: In /etc/cluster/cluster.conf, we need to manually specify devices="/dev/sdb" and add an <unfence> element for every VM. The web-based configuration tool doesn’t support this, but it is required; otherwise cman can’t work properly.
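As a rough illustration of what NOTE 1 asks for, the helper below prints one possible shape of the relevant cluster.conf elements so they can be reviewed before being merged by hand. It is a sketch, not the configuration used in this walk-through: the fence device name “scsi3pr”, the method name “1”, and the host names my-vm1 through my-vm5 are assumptions, and the element layout follows the usual fence_scsi pattern rather than anything shown above.

```shell
# Print a <clusternode> stanza that fences and unfences through one device.
# Arguments: node name, node id, fence device name (the device name is hypothetical).
node_stanza() {
    cat <<EOF
<clusternode name="$1" nodeid="$2">
    <fence>
        <method name="1">
            <device name="$3"/>
        </method>
    </fence>
    <unfence>
        <device action="on" name="$3"/>
    </unfence>
</clusternode>
EOF
}

# Print the matching <fencedevice> line; devices="/dev/sdb" is the value from NOTE 1.
fence_device() {
    echo "<fencedevice agent=\"fence_scsi\" name=\"$1\" devices=\"/dev/sdb\"/>"
}

fence_device scsi3pr
for i in 1 2 3 4 5; do
    node_stanza "my-vm$i" "$i" scsi3pr   # host names assumed to be my-vm1..my-vm5
done
```

The output is meant for comparison with the file that luci generated; the stanzas still have to be placed inside the existing <clusternodes> and <fencedevices> sections manually.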

NOTE 2: when we change /etc/cluster/cluster.conf manually, remember to increase “config_version” by 1 and propagate the new configuration to other nodes by “cman_tool version -r”.
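The version bump in NOTE 2 is easy to forget, so it can be scripted. The following is a minimal sketch that assumes config_version appears exactly once, as an attribute of the form config_version="N"; the function names are made up for this example.

```shell
# Print the next config_version for a cluster.conf-style file.
next_config_version() {
    current=$(sed -n 's/.*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1)
    echo $(( current + 1 ))
}

# Rewrite the file with config_version increased by 1.
# A temporary copy is written first so a failed sed leaves the original intact.
bump_config_version() {
    file=$1
    new=$(next_config_version "$file")
    sed "s/config_version=\"[0-9][0-9]*\"/config_version=\"$new\"/" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}
```

After “bump_config_version /etc/cluster/cluster.conf” on the node where the file was edited, “cman_tool version -r” propagates the new configuration as described in NOTE 2.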

Add a Quorum Disk to help to better cope with the Split-Brain issue.

“In RHCS, CMAN (Cluster MANager) keeps track of membership by monitoring messages from other cluster nodes. When cluster membership changes, the cluster manager notifies the other infrastructure components, which then take appropriate action. If a cluster node does not transmit a message within a prescribed amount of time, the cluster manager removes the node from the cluster and communicates to other cluster infrastructure components that the node is not a member. Other cluster infrastructure components determine what actions to take upon notification that node is no longer a cluster member. For example, Fencing would disconnect the node that is no longer a member.

“A cluster can only function correctly if there is general agreement between the members regarding their status. We say a cluster has quorum if a majority of nodes are alive, communicating, and agree on the active cluster members. For example, in a thirteen-node cluster, quorum is only reached if seven or more nodes are communicating. If the seventh node dies, the cluster loses quorum and can no longer function.

“A cluster must maintain quorum to prevent split-brain issues. Quorum doesn’t prevent split-brain situations, but it does decide who is dominant and allowed to function in the cluster. Quorum is determined by communication of messages among cluster nodes via Ethernet. Optionally, quorum can be determined by a combination of communicating messages via Ethernet and through a quorum disk. For quorum via Ethernet, quorum consists of a simple majority (50% of the nodes + 1 extra). When configuring a quorum disk, quorum consists of user-specified conditions.”

Without a quorum disk, our 5-node cluster needs at least 3 communicating nodes to keep quorum, so if more than 2 nodes fail, the whole cluster will stop working.
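The simple-majority rule can be checked with a few lines of shell arithmetic. This sketch assumes every node carries exactly one vote and that no quorum disk is configured; the function names are illustrative.

```shell
# Smallest number of votes that is a strict majority of n one-vote nodes.
majority() { echo $(( $1 / 2 + 1 )); }

# Number of node failures the cluster can absorb while staying quorate.
tolerated_failures() { echo $(( $1 - $(majority "$1") )); }

for n in 5 13; do
    echo "$n nodes: quorum=$(majority "$n"), tolerated failures=$(tolerated_failures "$n")"
done
# prints:
#   5 nodes: quorum=3, tolerated failures=2
#   13 nodes: quorum=7, tolerated failures=6
```

The 13-node line matches the example in the quoted text (seven nodes needed), and the 5-node line matches the limit of 2 failed nodes for this cluster.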

In my-vm1, use “fdisk /dev/sdc” to create a partition. Here we don’t run mkfs against it.

Run “mkqdisk -c /dev/sdc1 -l myqdisk” to initialize the qdisk partition and run “mkqdisk -L” to confirm it’s done successfully.

Use the web-based tool to configure the qdisk:
Here a heuristic is defined to help check the health of every node. On every node, the ping command is run every 2 seconds. If 10 successful runs of ping aren’t achieved within 2*10 = 20 seconds, the node considers itself failed. As a consequence, it won’t vote, it will be fenced, and it will try to reboot itself.

After we “apply” the configuration in the Web GUI, /etc/cluster/cluster.conf is updated with the new lines:

Note 1: “Expected votes”: cman uses the expected votes value to determine whether the cluster has quorum. The cluster is quorate if the sum of the votes of the current members is more than half of the expected votes value. Here we have n = 5 nodes. RHCS automatically sets the vote value of the qdisk to n - 1 = 4 and the expected votes value to n + (n - 1) = 2n - 1 = 9. If only 1 node is alive, the effective vote count is 1 + (n - 1) = n = 5, which is larger than (2n - 1)/2 = n - 1 = 4 (using integer division, as in C), so the cluster will continue to function.
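The vote arithmetic in Note 1 can be reproduced directly, since shell arithmetic truncates integer division the same way C does. This sketch assumes one vote per node and that the surviving nodes can still reach the quorum disk; the function names are made up for the example.

```shell
# Votes held by the quorum disk in an n-node cluster (n - 1, per Note 1).
qdisk_votes() { echo $(( $1 - 1 )); }

# Expected votes: one per node plus the quorum disk's votes (2n - 1).
expected_votes() { echo $(( $1 + $(qdisk_votes "$1") )); }

# Succeeds if ALIVE nodes out of TOTAL, plus the quorum disk, hold more
# than half of the expected votes (integer division, as in C).
is_quorate() {
    alive=$1
    total=$2
    votes=$(( alive + $(qdisk_votes "$total") ))
    half=$(( $(expected_votes "$total") / 2 ))
    [ "$votes" -gt "$half" ]
}

for live in 5 4 1; do
    if is_quorate "$live" 5; then
        echo "$live of 5 nodes alive: quorate"
    else
        echo "$live of 5 nodes alive: NOT quorate"
    fi
done
```

With n = 5 all three cases report quorate, including the single-survivor case (5 votes against a threshold of 4).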

Note 2: In practice, “ping -c3 -t2 10.156.76.1” wasn’t always reliable – sometimes the ping failed after a timeout of 19 seconds and the related node was rebooted unexpectedly. Maybe it’s due to the firewall rule of the gateway server 10.156.76.1. In this case, replace “10.156.76.1” with “127.0.0.1” as a workaround.

Create a GFS2 file system in the shared storage /dev/sdb and test IO fencing

Create a 30GB LVM partition with fdisk

[root@my-vm1 ~]# fdisk /dev/sdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x73312800.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

WARNING: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

NOTE: The above fdisk command is run on node 1. On nodes 2 through 5, we need to run the “partprobe /dev/sdb” command to force the kernel to discover the new partition (alternatively, we can simply reboot those nodes).

Create physical & logical volumes, run mkfs.gfs2 and mount the file system

Run the following on node1:

Now on nodes 1 through 4, “clustat” shows node 5 is offline and “cman_tool status” shows the current “Total votes: 8”. The sg_persist command shows that the current SCSI owner of /dev/sdb has changed from node 5 to node 1 and that there are only 4 registered keys:

In short, the dead node 5 properly went offline and was fenced, and node 1 fixed a file system issue (“Found 1 revoke tags”) by replaying node 5’s GFS2 journal, so we have no data inconsistency issue.

Now let’s resume node 5. We’ll find that the cluster still doesn’t accept node 5 as an online cluster member until node 5 reboots and rejoins the cluster in a known-good state.

Note: node 5 will be automatically rebooted by the qdisk daemon.

If we perform the above experiment by shutting down a node’s network (by “ifconfig eth0 down”), e.g., on node 3, we’ll get the same result, that is, node 3’s access to /mydata will be rejected and eventually the qdisk daemon will reboot node 3 automatically.

Wrap Up

Wow! That’s a lot of steps, but the result is worth it. You now have a 5-node Linux OS cluster with a shared GFS2 file system that can be read and written from all nodes. The cluster uses a quorum disk to prevent split-brain issues. These steps to set up a RHCS cluster are the same as you would use to set up a cluster of physical servers running CentOS 6.7, but in the Hyper-V environment Linux is running in guest VMs, and shared storage is created on a Shared VHDX instead of a real physical shared disk.

In the last blog post, we’ll show setting up a web server on one of the CentOS 6.7 nodes, and demonstrate various failover cases.