To Arbitrate or Not to Arbitrate

Arbitration is not always used to determine cluster membership.
Some cluster software products rely exclusively on multiple
cluster membership communication links (heartbeats). These approaches
are described in the following sections.

No Arbitration: Multiple Paths

Some approaches do not use arbitration, but instead rely on
multiple membership paths to ensure that the heartbeat or essential
intra-cluster communication remains unbroken. In this approach,
the event of a node failing entirely is considered more likely than
the event of several LAN paths all failing at the same time. Such
systems assume that a loss of communication means a node failure,
and packages are allowed to fail over when a loss of heartbeat is
detected.
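
The decision rule described above can be sketched in a few lines of
Python. This is a minimal illustration, not the logic of any particular
cluster product; the function name `peer_presumed_failed`, the path
names, and the timeout value are all assumptions made for the example.

```python
def peer_presumed_failed(last_heartbeat, now, timeout=5.0):
    """Return True when every heartbeat path has been silent longer than timeout.

    last_heartbeat maps a path name to the time (in seconds) at which the
    last heartbeat arrived on that path. Names and values are hypothetical.
    """
    return all(now - seen > timeout for seen in last_heartbeat.values())


# One path is still delivering heartbeats, so the peer is considered alive.
paths = {"lan0": 100.0, "lan1": 92.0, "lan2": 91.0}
print(peer_presumed_failed(paths, now=101.0))  # False

# Every path has gone silent: the no-arbitration model assumes a node
# failure and allows packages to fail over.
paths = {"lan0": 90.0, "lan1": 92.0, "lan2": 91.0}
print(peer_presumed_failed(paths, now=101.0))  # True
```

Note that the rule cannot distinguish a dead peer from a live peer whose
links have all failed; it only observes silence.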

This model is illustrated in Figure 1-1 and Figure 1-2.
In Figure 1-1, three separate LAN failures would be required
to break communication between the cluster nodes. This assumes that
hubs are separately powered, of course, and that other HA design
criteria are met.

Figure 1-1 Multiple Heartbeat Failures

In Figure 1-2, on the other hand, a single node failure
would result in the loss of heartbeat communication. In the no-arbitration
model, the loss of heartbeat would be interpreted by the cluster
manager as a failure of node 1, and therefore the cluster could
re-form with packages failing over from node 1 to node 2.

Figure 1-2 Single Node Failure

No Arbitration: Multiple Media

It is possible to define multiple membership paths for intra-cluster communication
that employ different types of communication from node to node.
One path could use conventional LAN links, while a second path might
employ a disk.

This model is illustrated in Figure 1-3. Both a LAN
connection and a disk link provide redundant membership communication.

Figure 1-3 Multiple Paths with Different Media

Note that the configuration could be expanded to include multiple
disk links plus multiple LAN links, as in Figure 1-4. In such
a configuration, at least four links would have to fail before the
heartbeat was lost.

Figure 1-4 Additional Multiple Paths with Different Media
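
The reasoning behind redundant paths can be made concrete with a rough
calculation. The sketch below assumes that links fail independently and
uses made-up failure probabilities chosen only for illustration; real
links often share failure causes (power, cabling routes), so the result
should be read as an optimistic estimate.

```python
import math


def all_links_down_probability(link_failure_probs):
    """Probability that every link is down at once, assuming independence."""
    return math.prod(link_failure_probs)


# Hypothetical figure: each of four links is down 1% of the time.
four_links = [0.01] * 4
print(all_links_down_probability(four_links))  # about 1e-08

# Hypothetical figure for a whole-node failure, for comparison.
node_failure_prob = 0.001
print(all_links_down_probability(four_links) < node_failure_prob)  # True
```

Under these assumed numbers, a simultaneous loss of all four links is far
less likely than a node failure, which is the premise of the
no-arbitration model.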

No Arbitration: Risks

Inter-node communication is very unlikely to be lost in the above
configurations, but it is still possible for the heartbeat to disappear
while both nodes are still running. In this scenario, known as split
brain, each node assumes that the other has failed and both may run the
same packages, which can cause data corruption.
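
The failure mode described above (often called split brain) can be
shown with a small simulation. This is an illustrative sketch only: the
function `nodes_running_package` and the node names are hypothetical,
and each node is assumed to apply the no-arbitration rule independently.

```python
def nodes_running_package(nodes_alive, heartbeat_ok, owner):
    """Return the set of nodes that end up running the package.

    nodes_alive maps a node name to True if that node is still running.
    The owner keeps its package; any other live node takes the package
    over when it no longer sees a heartbeat.
    """
    running = set()
    for node, alive in nodes_alive.items():
        if not alive:
            continue
        if node == owner:
            running.add(node)   # the owner keeps running its package
        elif not heartbeat_ok:
            running.add(node)   # silence is read as "owner has failed"
    return running


# Genuine node failure: node2 takes over, and only one copy runs.
print(sorted(nodes_running_package(
    {"node1": False, "node2": True}, heartbeat_ok=False, owner="node1")))
# ['node2']

# All links lost while both nodes run: two copies of the package are active.
print(sorted(nodes_running_package(
    {"node1": True, "node2": True}, heartbeat_ok=False, owner="node1")))
# ['node1', 'node2']
```

In the second case both nodes may write to the same shared data, which is
how the corruption arises.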

The risk of split brain syndrome is considerably greater with
extended distance clusters and disaster tolerant solutions in which
nodes are located in different data centers at some distance from
each other. For these types of solutions, some form of arbitration
is essential.

The unlikely but possible scenario of split brain can be definitively
avoided with an arbitration device. In other words, this source of data
corruption can be eliminated. The HP Serviceguard family of clustering
software includes this level of protection.
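
The idea of an arbitration device can be sketched as a tie-breaker that
at most one node can claim. The code below is a simplified model under
stated assumptions, not a description of how Serviceguard implements
arbitration; the class `ArbitrationDevice` and the function
`survivors_after_heartbeat_loss` are hypothetical names.

```python
import threading


class ArbitrationDevice:
    """Stand-in for an arbitration device: at most one claimant wins."""

    def __init__(self):
        self._guard = threading.Lock()
        self.holder = None

    def try_claim(self, node):
        with self._guard:               # make the check-and-set atomic
            if self.holder is None:
                self.holder = node
            return self.holder == node


def survivors_after_heartbeat_loss(nodes, device):
    """Each node that lost the heartbeat tries to claim the device.

    Only the node that holds the claim continues; the others are
    expected to halt rather than run packages.
    """
    return [node for node in nodes if device.try_claim(node)]


device = ArbitrationDevice()
print(survivors_after_heartbeat_loss(["node1", "node2"], device))
# ['node1']
```

Whichever node claims the device first survives, so two nodes can never
both conclude that they own the packages.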