Cluster

For our purposes, a cluster is a collection of loosely coupled cooperating computing elements
which we refer to as nodes.
Failures in clusters are not observed
instantaneously or simultaneously by every node.
Instead, failures occur
asynchronously, and are observed stochastically and independently by the
nodes of the cluster.
It is not guaranteed that any particular
failure will be observed in the same way by every node in the cluster, or
even observed at all by every node.

Subcluster

At any given point in time, a cluster is divided into zero or more
subclusters (or partitions) of live nodes.
Each live node has a view of subcluster membership, and
acknowledges membership in no more than one of these subclusters.
The most desirable state is that there is only one subcluster in the cluster, and that all (active)
nodes belong to that subcluster.
However, communication failures can cause a single cluster to divide into multiple
subclusters which are partially or completely unaware of each other.

Primary Subcluster

One of these subclusters may be designated the primary subcluster.
Traditionally, this primary subcluster is said to have quorum (or more specifically
have cluster-quorum, or cluster-wide-quorum).
How and why the primary subcluster is chosen is implementation-dependent.
How and why each node associates itself with a particular subcluster is
implementation-dependent.

Membership

Membership calculation is the process whereby each node associates itself
with a particular subcluster, and obtains a (probabilistically current) view
of subcluster membership. Quorum calculation is the process whereby the
primary subcluster (if any) is selected.

Normally, a membership algorithm will try to make the primary subcluster as
large as it can, but it is provable that this cannot be achieved under all
circumstances. The membership algorithm is under no obligation to try to
make non-primary subclusters as large as possible.

At any given point in time, each node can view the subcluster of
which it is a member as either having stable membership or being in
transition. Stable membership is defined to mean that no event has yet been
observed by the node which would cause it to recalculate subcluster
membership and/or quorum. In transition means that a particular
node has observed an event which may cause a membership change, but that the
process of recomputing membership and/or quorum has not yet completed.
Different nodes in a subcluster may have different views of the
stable/transition state at any given point in time.

Because failures occur asynchronously and are observed stochastically,
membership provides only probabilistic (and not absolute) assurances of
ability to communicate with any particular node.
Because it is impossible to know whether a node in a subcluster might
have observed an event which would make it go into transition,
in a very real sense, membership is only probabilistically certain,
never absolutely certain.

Determination of Quorum

The quorum calculation process is required to select no more than one
primary subcluster at a time, but need not select any at all. Under some
circumstances, designating more than one primary subcluster at a time can
lead to irrecoverable application failures.
Nevertheless, no quorum algorithm can provide an absolute guarantee of this property
in the presence of arbitrary failures.
Different quorum implementations provide different
degrees of certainty for any given configuration and set of expected failures.

A common method for defining quorum is to say that a subcluster of size n in
a cluster of m total nodes has quorum if it contains a majority of the
cluster's nodes. That is, it has n members where n > INT(m/2).
This simple method works quite well for larger clusters with generally reliable communications.
However, it breaks down for 2-node clusters, where a lone surviving node (n = 1) can never
satisfy n > 1, and may perform poorly for geographically dispersed clusters
without highly reliable communications between sites.
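
The rule n > INT(m/2) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code of any particular cluster manager; the function name has_quorum is hypothetical.

```python
def has_quorum(n: int, m: int) -> bool:
    """Return True if a subcluster of n nodes has quorum in a cluster of m nodes.

    Hypothetical helper: applies the rule n > INT(m/2) from the text.
    """
    if m <= 0 or not 0 <= n <= m:
        raise ValueError("expected m > 0 and 0 <= n <= m")
    return n > m // 2  # integer division plays the role of INT(m/2)


# A 5-node cluster split 3/2: only the 3-node side has quorum.
split_5 = [has_quorum(3, 5), has_quorum(2, 5)]

# A 2-node cluster split 1/1: neither side has quorum.
split_2 = [has_quorum(1, 2), has_quorum(1, 2)]
```

Because two disjoint subclusters cannot both contain more than half of the m nodes, this rule never selects more than one primary subcluster, provided every node agrees on the value of m.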

Strongly Connected Subclusters

Strongly connected subclusters are subclusters where each node
can communicate with every member of its subcluster.
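
As a minimal sketch of this definition, assume that connectivity is available as a mapping from each node to the set of nodes it can currently communicate with. Both that representation and the name is_strongly_connected are assumptions made for illustration only.

```python
def is_strongly_connected(members, can_reach):
    """Check that every member can communicate with every other member.

    members:   iterable of node names in one subcluster
    can_reach: dict mapping a node to the set of nodes it can talk to
               (hypothetical representation of link state)
    """
    members = set(members)
    return all(
        members - {node} <= can_reach.get(node, set())
        for node in members
    )


full_mesh = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
broken_link = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a"}}  # c cannot reach b

mesh_ok = is_strongly_connected(["a", "b", "c"], full_mesh)
broken_ok = is_strongly_connected(["a", "b", "c"], broken_link)
```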

Consensus Subclusters

Consensus subclusters are subclusters where during the time when every
member of a subcluster views subcluster membership as being stable (as
defined above) each node has precisely the same view of subcluster
membership as every other member of its subcluster.
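
The consensus property only constrains the periods in which every member considers membership stable. The following sketch checks it for one snapshot, assuming each node reports a pair of (is_stable, membership view); that data layout and the name consensus_holds are hypothetical.

```python
def consensus_holds(views):
    """Check the consensus property for one snapshot of a subcluster.

    views: dict mapping node -> (is_stable, iterable of member names)
           (hypothetical representation of each node's local view)
    """
    if not all(stable for stable, _ in views.values()):
        # Some node is in transition, so the property imposes no constraint.
        return True
    distinct_views = {frozenset(view) for _, view in views.values()}
    return len(distinct_views) <= 1


agree = {"a": (True, ["a", "b"]), "b": (True, ["a", "b"])}
disagree = {"a": (True, ["a", "b"]), "b": (True, ["b"])}
in_transition = {"a": (True, ["a", "b"]), "b": (False, ["b"])}

results = [consensus_holds(v) for v in (agree, disagree, in_transition)]
```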

Cluster Membership Quality of Service

The lowest grade of membership algorithm meets only the letter of the law
above, never selects a primary subcluster, and makes no guarantees regarding
strong connectivity or consensus membership. Many traditional
high-performance clustering applications tolerate such mechanisms, but most
high-availability applications require stronger guarantees.

Properties such as primary subcluster selection, strong connectivity, and
consensus membership can be referred to collectively as cluster quality of
service (CQOS) properties.

Fencing

We use the term fencing to refer to the act of separating a cluster node
from the resources it manages, without the cooperation of the node being fenced.
A cluster node which is subjected to fencing is typically thought
to be errant, and the nature of the fault it has experienced is unknown. Relying on a third-party
(involuntary) fencing mechanism therefore increases the probability that an errant node is
no longer using its resources.
Using fencing is considered more reliable than simply relying on an errant node to stop using
resources on its own.

Relationship Between Quorum and Fencing

Normally, quorum and fencing are used in combination; that is, only the designated
(primary) subcluster would fence nodes. However, in certain configurations, reliably determining
a designated subcluster is difficult. In these cases, fencing can be used
to keep cluster resources from being improperly used, in spite of an imperfect
quorum method.

If reliable fencing is used
without quorum (or with an imperfect quorum mechanism), then certain undesirable
behaviors (such as mutual fencing) may occur, but fencing
will still protect cluster resources from being improperly used
in spite of any imperfections in the quorum method.