A Cluster Built on an S2600 Fails to Start Because a Single Node Fails to Start

Publication Date: 2012-07-20

Issue Description

At a deployment site (see the attachment for the topology), an S2600 with a single controller is installed. The controller software version is 1.04.01.205.T01 and the SES version is S021. The server runs the AIX6100-01 operating system with HACMP cluster software version 5.4.1.0. Two nodes are connected to controller A, and two private LUNs are mapped to the nodes. The two hosts also share two LUNs to form a cluster storage network. A private LUN has been activated. During cluster startup, a single node fails to start.
According to the storage logs, the following operations are performed on the S2600:
A reservation command is executed:
Oct 14 23:25:56 AK-I kernel: [372919227]Reserve (6)[16] command for Host LUN 0, Device Lun 8 @ [jif=372919227] SCSI_PrintDebugInfo : 1382
A command to clear the reservation is executed later:
Oct 14 23:25:56 AK-I kernel: [372919957]SCSI_ClearReserveExec
Oct 14 23:25:56 AK-I kernel: [372919957] @ [jif=372919957] SCSI_ClearReserveExec : 2200
Oct 14 23:25:56 AK-I kernel: [372919957]This is the master controller
Oct 14 23:25:56 AK-I kernel: [372919957] @ [jif=372919957] SCSI_ClearReserveExec : 2207
Oct 14 23:25:56 AK-I kernel: [372919957]Enter SCSI_ClearReserve
Oct 14 23:25:56 AK-I kernel: [372919957] @ [jif=372919957] SCSI_ClearReserve : 2286
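When the storage log is long, the reservation and clear events can be picked out with a small script. The following is a minimal sketch in Python, assuming the log lines keep exactly the format shown in the excerpt above. The function name summarize_reservation_events and the regular expressions are illustrative and are not part of any S2600 tool. The bracketed number is treated as an opaque stamp.

```python
import re

# Matches the "Reserve (6)" line format seen in the excerpt above (assumed stable).
RESERVE_RE = re.compile(
    r"\[(?P<stamp>\d+)\]Reserve \(6\)\[\d+\] command for Host LUN (?P<host_lun>\d+), "
    r"Device Lun (?P<dev_lun>\d+)"
)
# Matches the line that marks entry into the clear-reservation routine.
CLEAR_RE = re.compile(r"\[(?P<stamp>\d+)\]Enter SCSI_ClearReserve\b")


def summarize_reservation_events(lines):
    """Return (kind, stamp, host_lun, device_lun) tuples in log order."""
    events = []
    for line in lines:
        m = RESERVE_RE.search(line)
        if m:
            events.append(
                ("reserve", int(m["stamp"]), int(m["host_lun"]), int(m["dev_lun"]))
            )
            continue
        m = CLEAR_RE.search(line)
        if m:
            # Clear events carry no LUN numbers in the excerpt, so they stay None.
            events.append(("clear", int(m["stamp"]), None, None))
    return events
```

A reserve event with no later clear event for the same period is the pattern to look for when a second node cannot start.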

Alarm Information

None

Handling Process

Based on the analysis, the failure is caused by starting and stopping the node and the private LUN in the wrong sequence. The following workaround can be adopted:
1. Startup sequence: Start HA, and then run the varyonvg command to manually activate the volume group to which the LUN belongs.
2. Stop sequence: Run the varyoffvg command to manually deactivate the volume group to which the LUN belongs, and then stop HA.
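The ordering in the two sequences can be written down as a small checklist generator. This is only a sketch in Python: the function names and the volume group name privatevg are hypothetical, and the HA start and stop entries are placeholders for the site's own HACMP procedure, not literal commands. The deactivation step uses varyoffvg, the AIX command that deactivates a volume group.

```python
def startup_steps(volume_group):
    """Workaround startup order: HA first, then activate the private volume group."""
    return [
        "start HA",                   # placeholder for the site's HACMP start procedure
        f"varyonvg {volume_group}",   # activate the volume group afterwards
    ]


def stop_steps(volume_group):
    """Workaround stop order: deactivate the private volume group, then stop HA."""
    return [
        f"varyoffvg {volume_group}",  # deactivate the volume group first
        "stop HA",                    # placeholder for the site's HACMP stop procedure
    ]


if __name__ == "__main__":
    # "privatevg" is a hypothetical volume group name used for illustration.
    for step in startup_steps("privatevg") + stop_steps("privatevg"):
        print(step)
```

The point of the sketch is the order: the private volume group is active only while HA is already running.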

Root Cause

If no private LUN is configured, the node that starts first during cluster startup sends a reservation command and then a LOGOUT command that clears the reservation. The other node later repeats the same commands.
However, if the node that starts first does not clear its reservation after the reservation takes effect, the other node fails to start.
During cluster startup, the reservation is cleared by the LOGOUT command, which works by ending the whole session. On the primary node, a private LUN shares the same session as the shared disk; that is, they are two connections of the same session. Therefore, if a private LUN is active, the LOGOUT command cannot end the whole session, the reservation is not cleared, and the second node cannot start.
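The mechanism described above can be illustrated with a toy model. The following Python sketch is not the S2600 firmware logic; the classes Session and Lun and the function second_node_can_start are hypothetical names. It only assumes what the text states: a reservation is cleared when the whole session ends, and an active private LUN is a second connection in the same session.

```python
class Session:
    """Toy model of one host session made up of one or more connections."""

    def __init__(self, host, connections):
        self.host = host
        self.active = set(connections)

    def logout(self, connection):
        """Close one connection; return True only if the whole session has ended."""
        self.active.discard(connection)
        return not self.active


class Lun:
    """Toy model of a shared LUN that one host at a time can reserve."""

    def __init__(self):
        self.reserved_by = None

    def reserve(self, host):
        if self.reserved_by not in (None, host):
            return False  # another host still holds the reservation
        self.reserved_by = host
        return True

    def on_session_end(self, host):
        if self.reserved_by == host:
            self.reserved_by = None


def second_node_can_start(private_lun_active):
    shared = Lun()
    connections = {"shared"}
    if private_lun_active:
        connections.add("private")  # same session, second connection
    node1 = Session("node1", connections)

    shared.reserve("node1")
    if node1.logout("shared"):       # session ends only if no connection remains
        shared.on_session_end("node1")
    return shared.reserve("node2")   # the second node repeats the reservation
```

In this model, second_node_can_start(False) succeeds and second_node_can_start(True) fails, which mirrors the behavior described for the active private LUN.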