About this blog

Welcome to the Data Center Automation Blog, where you can read perspectives from storage management experts. This blog provides insights into the data center automation solution, as well as technical details about specific IBM products.

... in a nutshell, a group of nodes is considered to have quorum if it represents more than half the nodes in the cluster.

A TieBreaker is needed to decide who takes control in a situation where it's not possible to decide based on the number of operational nodes in a sub-cluster; in other words, when you have an even number of nodes in a cluster and a cluster split leaves half the nodes in each sub-cluster. The most obvious example is a two-node cluster ... if the two nodes cannot talk to each other, a TieBreaker is needed to decide who should take control ... who should proceed with the necessary automation actions to keep resources highly available.
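The quorum rule above is just integer arithmetic, and a quick sketch makes the tie case obvious (the `has_quorum` function name is ours for illustration, not a TSAMP command):

```shell
# Succeeds (exit 0) only if the sub-cluster holds a strict majority
# of the cluster: MORE than half of the defined nodes.
has_quorum() {
  subcluster=$1
  total=$2
  [ $((2 * subcluster)) -gt "$total" ]
}

has_quorum 2 3 && echo "2 of 3: quorum"                  # strict majority
has_quorum 1 2 || echo "1 of 2: tie, TieBreaker needed"  # even split
has_quorum 2 4 || echo "2 of 4: tie, TieBreaker needed"  # even split
```

Note that 2 out of 4 is not a majority either, which is why any even-sized cluster can end up needing the TieBreaker, not just a two-node one.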

For the sake of this explanation, we're keeping things simple by only talking about a "Network" TieBreaker (there are other types, such as "disk" and "nfs"). We would specify a pingable system on the network that is independent of the clustered nodes, for example the gateway router used by the clustered nodes. In fact, it is considered a best practice to use the default gateway router as the Network TieBreaker device, also known as a "Quorum Device".
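On a TSAMP cluster, defining and activating a network TieBreaker is done with the RSCT resource commands. A hedged sketch of what that can look like follows; the resource name "mynetworktb" is ours, and the exact DeviceInfo syntax and samtb_net path vary by release, so verify against the SA MP documentation for your version before running anything:

```shell
# Define a network TieBreaker resource that "reserves" the tiebreaker
# by pinging the gateway router (10.20.30.1 in the example below)
mkrsrc IBM.TieBreaker Type="EXEC" Name="mynetworktb" \
  DeviceInfo='PATHNAME=/usr/sbin/rsct/bin/samtb_net Address=10.20.30.1 Log=1' \
  PostReserveWaitTime=30

# Tell the cluster to use it as the operational quorum tiebreaker
chrsrc -c IBM.PeerNode OpQuorumTieBreaker="mynetworktb"

# Verify which tiebreaker is active
lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
```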

Consider a two-node cluster as follows:

"node1" and "node2" are our clustered nodes, each configured with 10.20.30.1 as its default gateway for basic TCP/IP communications.

Now consider a node failure scenario. In my example, "node1" suffers a power failure.

"node2" can no longer ping "node1" (no response to heartbeats).

"node2" is only 1 node out of a 2 node domain which is not considered a majority (more than half), so it uses the defined Network TieBreaker we setup when we first deployed the cluster, the gateway router.

"node2" successfully pings 10.20.30.1 and therefore regains quorum. If the resources were not already running on "node2", the TSAMP product would then perform the necessary automation actions to bring the resources online on "node2" in order to keep them highly available.

Now consider a network adapter failure scenario. First, let's assume the power to "node1" was restored and both nodes are communicating (heartbeating) again. At some point there is a break in network connectivity that isolates "node1" from the rest of the network.

"node2" can no longer heartbeat/ping "node1".

"node1" can no longer heartbeat/ping "node2".

In this case, both nodes lose quorum and attempt to ping the Network TieBreaker device, again the gateway router in this example.

"node1" cannot reach the default gateway because of whatever problem caused it to be isolated from the network in the first place.

"node2" is able to ping the gateway, our Network TieBreaker, so it regains quorum and hosts the resources in TSAMP's effort to keep resources highly available.

If "node1" had been hosting the online resources, it would have been forced to reboot at this point, to ensure the resources can be brought online on a surviving node without fear that they would be running concurrently on more than one server.

That's how a "Network TieBreaker" works. It rests on one assumption: if "node1" can communicate (ping) with the default gateway and "node2" can communicate (ping) with the default gateway, then "node1" must be able to communicate (heartbeat) with "node2". If for some strange reason your network would allow each node to ping a common gateway/device, but not each other, then a "Network" style TieBreaker is not for you.