High-Availability Framework

The Sun Cluster system makes all components on the “path” between
users and data highly available, including network interfaces, the applications themselves,
the file system, and the multihost devices. In general, a cluster component is highly
available if it survives any single (software or hardware) failure in the system.

The following
table shows the kinds of Sun Cluster component failures (both hardware and software)
and the kinds of recovery that are built into the high-availability framework.

Table 3–1 Levels of Sun Cluster Failure Detection and Recovery

Failed Cluster Component    Software Recovery                   Hardware Recovery

Data service                HA API, HA framework                N/A

Public network adapter      Internet Protocol (IP) Network      Multiple public network
                            Multipathing                        adapter cards

Cluster file system         Primary and secondary replicas      Multihost devices

Mirrored multihost device   Volume management (Solaris Volume   Hardware RAID-5 (for
                            Manager and VERITAS Volume          example, Sun StorEdge™
                            Manager, which is available in      A3x00)
                            SPARC based clusters only)

Global device               Primary and secondary replicas      Multiple paths to the
                                                                device, cluster transport
                                                                junctions

Private network             HA transport software               Multiple private
                                                                hardware-independent
                                                                networks

Node                        CMM, failfast driver                Multiple nodes

Sun Cluster software's high-availability framework detects a node failure
quickly and creates a new equivalent server for the framework resources on a remaining
node in the cluster. At no time are all framework resources unavailable. Framework
resources that are unaffected by a crashed node are fully available during recovery.
Furthermore, framework resources of the failed node become available as soon as they
are recovered. A recovered framework resource does not have to wait for all other
framework resources to complete their recovery.
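
This per-resource independence can be pictured with a short sketch (hypothetical
Python, not Sun Cluster code): each framework resource that the failed node hosted
recovers on its own schedule, and each becomes available the moment its own recovery
completes rather than waiting on the slowest one.

    # Hypothetical illustration: resources from a failed node recover
    # independently; each is available as soon as its own recovery ends.
    import threading
    import time

    def recover(resource, seconds):
        time.sleep(seconds)            # simulated recovery work
        print(resource, "recovered and available")

    # Resources hosted by the crashed node, with differing recovery times.
    failed = [("device-group-A", 1), ("cluster-fs", 3)]

    threads = [threading.Thread(target=recover, args=r) for r in failed]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # device-group-A serves requests again while cluster-fs is still
    # completing its own recovery.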

Most highly available framework resources are recovered transparently to the
applications (data services) that use them. The semantics of framework resource
access are fully preserved across node failure. The applications simply cannot
detect that the framework resource server has been moved to another node. Failure
of a single node is completely transparent to programs on remaining nodes that use
the files, devices, and disk volumes attached to that node. This transparency exists
if an alternative hardware path to the disks exists from another node, for example
with multihost devices that have ports to multiple nodes.

Cluster Membership Monitor

To ensure that data is kept safe from corruption, all nodes must reach a consistent
agreement on the cluster membership. When necessary, the Cluster Membership Monitor
(CMM) coordinates a cluster reconfiguration of cluster services (applications) in
response to a failure.

The CMM receives information about connectivity to other nodes from the cluster
transport layer. The CMM uses the cluster interconnect to exchange state information
during a reconfiguration.

After detecting a change in cluster membership, the CMM performs a synchronized
configuration of the cluster. In a synchronized configuration, cluster resources might
be redistributed, based on the new membership of the cluster.
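
A rough model of this process (hypothetical Python, not the actual CMM
implementation): each node tracks the last heartbeat received from every peer over
the interconnect, and a peer that has been silent past a timeout is dropped from the
membership that the surviving nodes agree on.

    # Hypothetical model: a peer that misses heartbeats past the timeout
    # is excluded when the new cluster membership is computed.
    import time

    HEARTBEAT_TIMEOUT = 10.0   # assumed value, in seconds

    last_heartbeat = {"node1": time.time(), "node2": time.time() - 30.0}

    def compute_membership(now):
        return sorted(node for node, seen in last_heartbeat.items()
                      if now - seen < HEARTBEAT_TIMEOUT)

    print("new membership:", compute_membership(time.time()))
    # node2 is dropped; a synchronized reconfiguration would follow.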

See About Failure Fencing for more information
about how the cluster protects itself from partitioning into multiple separate clusters.

Failfast Mechanism

If the CMM detects a critical problem with a node, it notifies the cluster framework
to forcibly shut down (panic) the node and to remove it from the cluster membership.
The mechanism by which this occurs is called failfast. Failfast
causes a node to shut down in two ways.

If a node leaves the cluster
and then attempts to start a new cluster without having quorum, it is “fenced”
from accessing the shared disks. See About Failure Fencing for details about this use of failfast.

If one or more cluster-specific
daemons die (clexecd, rpc.pmfd, rgmd, or rpc.fed), the CMM detects the failure and the
node panics (see the watchdog sketch at the end of this section).

When the death of a cluster daemon causes a node to panic, a message similar
to the following (the exact text varies by release) is displayed on the console for
that node:
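
    panic[cpu0]/thread=40e60: Failfast: Aborting because "pmfd" died 35 seconds ago.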

After the panic, the node might
reboot and attempt to rejoin the cluster. Alternatively, if the cluster is composed
of SPARC based systems, the node might remain at the OpenBoot™ PROM
(OBP) prompt. The next action of the node is determined by the setting of the
auto-boot? parameter. You can set auto-boot? with the eeprom(1M) command at the
OpenBoot PROM ok prompt.
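
The daemon-death case can be approximated by a user-level watchdog sketch
(hypothetical Python; the real failfast mechanism panics the node from the kernel
rather than exiting a process, and the pids below are made up):

    # Hypothetical sketch: poll the pids of critical cluster daemons and
    # abort as soon as one of them has died. The real failfast driver
    # acts inside the kernel instead.
    import os
    import sys
    import time

    CRITICAL = {"rgmd": 1234, "rpc.pmfd": 1235}   # example pids only

    def alive(pid):
        try:
            os.kill(pid, 0)        # signal 0 checks existence only
            return True
        except ProcessLookupError:
            return False
        except PermissionError:
            return True            # process exists under another user

    while True:
        for name, pid in CRITICAL.items():
            if not alive(pid):
                sys.exit('Failfast: aborting because "%s" died' % name)
        time.sleep(1)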

Cluster Configuration Repository (CCR)

The CCR uses a two-phase commit algorithm for updates: An update must be successfully
completed on all cluster members or the update is rolled back. The CCR uses the cluster
interconnect to apply the distributed updates.
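
The update rule can be sketched as follows (hypothetical Python, an in-process
simulation rather than the CCR's distributed protocol): every member must
acknowledge the prepare phase before anyone commits, and a single refusal rolls the
update back everywhere.

    # Hypothetical two-phase commit simulation: commit only if every
    # member stages the update successfully; otherwise roll back on all.
    class Member:
        def __init__(self, name, healthy=True):
            self.name, self.healthy = name, healthy
            self.staged = self.committed = None

        def prepare(self, update):      # phase 1: stage the update and vote
            if self.healthy:
                self.staged = update
            return self.healthy

        def commit(self):               # phase 2a: make the update durable
            self.committed, self.staged = self.staged, None

        def rollback(self):             # phase 2b: discard the staged update
            self.staged = None

    def apply_update(members, update):
        if all(m.prepare(update) for m in members):
            for m in members:
                m.commit()
            return True
        for m in members:
            m.rollback()
        return False

    nodes = [Member("node1"), Member("node2"), Member("node3", healthy=False)]
    print(apply_update(nodes, {"key": "value"}))   # False: rolled back everywhere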

Caution – Although the CCR consists of text files, never edit the CCR
files manually. Each file contains a checksum record to ensure consistency between
nodes. Manually updating CCR files can cause a node or the entire cluster to stop
functioning.
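
The role of the checksum record can be seen in a small sketch (hypothetical Python;
the actual CCR file format and checksum algorithm are not shown here): a node
recomputes the checksum from the file's contents and rejects the file if it no
longer matches the stored record, which is exactly what a manual edit produces.

    # Hypothetical sketch: a file whose contents no longer match its
    # stored checksum record fails validation, as after a manual edit.
    import hashlib

    def make_record(payload):
        digest = hashlib.sha256(payload.encode()).hexdigest()
        return payload + "\nchecksum " + digest

    def validate(record):
        payload, _, stored = record.rpartition("\nchecksum ")
        return hashlib.sha256(payload.encode()).hexdigest() == stored

    good = make_record("cluster.name sc-cluster")
    print(validate(good))                                   # True
    print(validate(good.replace("sc-cluster", "edited")))   # False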

The CCR relies on the CMM to guarantee that a cluster is running only when quorum
is established. The CCR is responsible for verifying data consistency across the cluster,
performing recovery as necessary, and facilitating updates to the data.