Detecting Cluster Failure on a System That Uses Hitachi TrueCopy or Universal Replicator Data Replication

Detecting Primary Cluster Failure

When the primary cluster for a given protection group fails, the secondary
cluster in the partnership detects the failure. The cluster that fails might
be a member of more than one partnership, resulting in multiple failure detections.

The following actions take place when a primary cluster failure occurs.
During a failure, the appropriate protection groups are in the Unknown state.

Heartbeat failure is detected by a partner cluster.

The heartbeat is activated in emergency mode to verify that
the heartbeat loss is not transient and that the primary cluster has failed.
The heartbeat remains in the Online state during the default
timeout interval, while the heartbeat mechanism continues to retry the primary
cluster.

The query interval is set by using the Query_interval heartbeat property.
If the heartbeat still fails after the interval that you configured, a
heartbeat-lost event is generated and logged in the system log. When you use
the default interval, the emergency-mode retry behavior might delay
heartbeat-loss notification for about nine minutes. Messages are displayed in
the graphical user interface (GUI) and in the output of the geoadm status
command.
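
The detection sequence described above can be sketched as a small model. This is only an illustration of the logic, not the product's implementation: the function name detect_failure, the probe callable, the event strings other than heartbeat-lost, and the retry count of 3 are all assumptions made for the example. A real heartbeat spaces its retries according to the Query_interval property; the sketch does not wait between retries.

```python
from typing import Callable, List


def detect_failure(probe: Callable[[], bool], emergency_retries: int = 3) -> List[str]:
    """Return the events raised while checking one partner cluster.

    probe is a hypothetical callable that returns True when the partner
    answers a heartbeat request. emergency_retries is an assumed value,
    not a documented default.
    """
    events: List[str] = []
    if probe():
        # The partner answered, so nothing needs to be reported.
        return events

    # First missed heartbeat: enter emergency mode and keep retrying.
    events.append("emergency-mode")
    for _ in range(emergency_retries):
        if probe():
            # The loss was transient; no heartbeat-lost event is raised.
            events.append("recovered")
            return events

    # Every retry failed, so the partner cluster is treated as failed.
    events.append("heartbeat-lost")
    return events


# A transient loss: two missed requests, then the partner answers again.
answers = iter([False, False, True])
print(detect_failure(lambda: next(answers)))  # ['emergency-mode', 'recovered']

# A real failure: the partner never answers.
print(detect_failure(lambda: False))  # ['emergency-mode', 'heartbeat-lost']
```

The point of the retry loop is that a single missed heartbeat never produces a heartbeat-lost event on its own, which is why notification can lag behind the actual failure.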

Detecting Secondary Cluster Failure

When a secondary cluster for a given protection group fails, a cluster
in the same partnership detects the failure. The cluster that failed might
be a member of more than one partnership, resulting in multiple failure detections.

During failure detection, the following actions occur:

Heartbeat failure is detected by a partner cluster.

The heartbeat is activated in emergency mode to verify that
the secondary cluster has failed.

The cluster notifies the administrator. The system identifies
all protection groups for which the failed cluster was acting as secondary.
The state of those protection groups is marked Unknown.
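
The last step can be sketched as follows. This is a simplified model under stated assumptions, not the product's internal data structure: the function name mark_unknown, the dictionary layout, the protection group names, and the initial state value "OK" are hypothetical. The cluster names are placeholders.

```python
from typing import Dict, List


def mark_unknown(protection_groups: List[Dict[str, str]], failed_cluster: str) -> List[str]:
    """Mark every protection group whose secondary is the failed cluster.

    Returns the names of the affected protection groups. The dictionary
    keys ("name", "secondary", "state") are illustrative only.
    """
    affected: List[str] = []
    for pg in protection_groups:
        if pg["secondary"] == failed_cluster:
            pg["state"] = "Unknown"
            affected.append(pg["name"])
    return affected


# Hypothetical configuration: two protection groups with different secondaries.
groups = [
    {"name": "pg-a", "secondary": "cluster-newyork", "state": "OK"},
    {"name": "pg-b", "secondary": "cluster-paris", "state": "OK"},
]
print(mark_unknown(groups, "cluster-newyork"))  # ['pg-a']
print(groups[1]["state"])  # OK
```

Only protection groups that used the failed cluster as their secondary change state; groups in other roles or partnerships are left untouched.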