Part 6: Datacenter Activation Coordination – Who has a say?

I recently worked with a customer that had a three-member database availability group (DAG) that was extended to two sites in a site resilience configuration. During scheduled maintenance in the primary datacenter, the customer encountered an interesting situation. In this case, the customer had two DAG members deployed in their primary datacenter and the third member deployed in a remote datacenter. In addition, Datacenter Activation Coordination (DAC) mode was enabled for their DAG.

There was a need to shut down the servers in the primary datacenter. After completing the maintenance tasks, the servers in the primary datacenter were powered on. It was then noted that all of the databases were dismounted on the servers in the primary datacenter. This was verified with Get-MailboxDatabase -Status | fl name,mounted:

[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted

Name    : Mailbox Database 1252068500
Mounted : False

Name    : Mailbox Database 1370762657
Mounted : False

Name    : Mailbox Database 1511135053
Mounted : False

Name    : Mailbox Database 1757981393
Mounted : False

So, the administrator issued a mount command, but an error was returned.

The error indicates that the DAG members must have quorum and automount consensus in order to mount databases. Because DAC mode was enabled for the DAG, all of the following must be true for automount consensus to be reached:

The node must be a member of a cluster.

The cluster must have quorum.

The node must be able to contact another member with a DACP bit set to 1, or it must be able to contact all other servers on the started servers list.
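The three criteria above can be sketched as a simple decision function. This is only an illustrative model of the logic as described, not Exchange's actual implementation; all names are invented.

```python
# Minimal sketch of the DAC-mode automount consensus checks described
# above. Illustrative only; not Exchange's actual implementation.

def automount_consensus(is_cluster_member, cluster_has_quorum,
                        dacp_responses, started_servers):
    """Decide whether this DAG member may mount its databases.

    dacp_responses: {server_name: bit} for the peers that answered the
        DACP inquiry. A peer whose Cluster service is stopped never
        answers, even if its Replication service is reachable.
    started_servers: the other DAG members on the started servers list.
    """
    if not (is_cluster_member and cluster_has_quorum):
        return False
    # A peer already advertises a DACP bit of 1 ...
    if any(bit == 1 for bit in dacp_responses.values()):
        return True
    # ... or every server on the started servers list answered the inquiry.
    return set(started_servers) <= set(dacp_responses)
```

For example, from MBX-1's point of view in the scenario below: MBX-2 answers the inquiry with a DACP bit of 0 and MBX-3 does not answer at all, so consensus fails even though the cluster itself has quorum.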

When the DAG members in the primary datacenter were shut down, the remaining DAG member went into a lost quorum state. Therefore, the DACP bit of the third member changed to 0 in response to the cluster service state change. When the servers in the primary datacenter restarted, they were unable to contact another DAG member with a DACP bit set to 1. Reviewing the properties of the DAG, we saw that all three DAG members were on the started Mailbox servers list.

It was possible for the servers in the primary datacenter to contact the Microsoft Exchange Replication service on all servers on the started Mailbox servers list. So why then is automount consensus not reached?

When reviewing the status of servers in the cluster, we noted that the server in the remote datacenter was marked as down:

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Down

Why did MBX-3 report a status of Down? Traditionally, when a lost quorum condition is encountered, we expect the Cluster service on the servers where quorum was lost to terminate. Looking at the properties of the Cluster service, we see that the default recovery action is to restart the service when it terminates.

In reviewing the System log on MBX-3 for events that occurred at the time MBX-1 and MBX-2 were shut down, we saw the following event:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 10:03:04 AM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

We saw that the Cluster service acknowledged that there were no longer enough servers to maintain quorum, and that it entered the stopped state gracefully rather than terminating. When MBX-1 and MBX-2 were gracefully shut down in this example, they communicated with all servers in the cluster, announcing that they would be leaving. In other words, the servers in the primary datacenter did not simply disappear unexpectedly, as they would in the case of a network failure or other catastrophic failure.

Since MBX-3 was informed that the other servers were leaving, it determined that not enough votes would remain to satisfy quorum, and it gracefully stopped its Cluster service rather than terminating it. When MBX-1 and MBX-2 were brought back online, they formed a cluster using their own votes (2 of the 3 votes, enough for a majority) and then began the process of determining automount consensus. MBX-3 was still a member of the cluster, but its Cluster service was not started, so it had no response to the DACP bit inquiry. The requirement that, when no server advertises a DACP bit of 1, all servers on the DAG's started servers list must be contacted was therefore not met.
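The vote arithmetic behind "2 of the 3 votes" is ordinary majority math, which can be sketched as follows. This is a simplification of node-majority quorum, not cluster code:

```python
# Majority-vote arithmetic for cluster quorum: a node-majority cluster
# stays up only while more than half of the total votes are present.

def votes_needed(total_votes):
    # Majority of N votes is floor(N / 2) + 1.
    return total_votes // 2 + 1

def has_quorum(active_votes, total_votes):
    return active_votes >= votes_needed(total_votes)
```

In this three-member DAG, two votes form a majority, so MBX-1 and MBX-2 can re-form the cluster on their own, while MBX-3 alone (one vote of three) cannot maintain quorum.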

To resolve this, the administrator simply needs to start the Cluster service on MBX-3. In most cases this results in the databases mounting automatically, because it allows the criteria for automount consensus to be satisfied. Here is a sample showing the databases automatically mounted after starting the Cluster service on MBX-3.

PS C:\> Get-ClusterNode | fl name,state

Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Up

[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted

Name    : Mailbox Database 1252068500
Mounted : True

Name    : Mailbox Database 1757981393
Mounted : True

Name    : Mailbox Database 1370762657
Mounted : True

Name    : Mailbox Database 1511135053
Mounted : True

To illustrate the difference, here is an example where MBX-1 and MBX-2 were powered off instead of being gracefully shut down. The events on MBX-3 show that the servers left unexpectedly and the Cluster service was terminated due to a lost quorum condition.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
Cluster node 'MBX-2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
Cluster node 'MBX-1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Based on these events, the Cluster service was terminated and the Service Control Manager issued a restart. When MBX-1 and MBX-2 came back up, MBX-3 successfully joined the cluster. In this scenario, each DAG member could contact all servers on the started mailbox servers list and receive a response, so automount consensus was reached and the databases mounted automatically.
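The contrast between the two shutdown paths can be summarized in a small sketch. The behavior it models (a graceful stop stays stopped; a termination triggers the Service Control Manager's recovery restart) follows the events above, but the code itself is only an illustration with invented names:

```python
# Illustrative contrast between the two scenarios. Only a running
# Cluster service can answer the DACP inquiry when the peers return.

def service_state_after_quorum_loss(peers_left_gracefully):
    if peers_left_gracefully:
        # Graceful peer shutdown: the Cluster service stops cleanly
        # (event 1177 only) and stays stopped.
        return "stopped"
    # Unexpected peer loss (events 1135 followed by 1177): the service
    # terminates and the Service Control Manager restarts it.
    return "running"

def answers_dacp_inquiry(service_state):
    return service_state == "running"
```

This is why the graceful-shutdown case needed administrator intervention (starting the Cluster service on MBX-3) while the power-off case recovered on its own.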

========================================================

Datacenter Activation Coordination Series:

Part 1: My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2: Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3: Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4: Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5: Datacenter Activation Coordination: How do I Force Automount Consensus? (https://aka.ms/T5sgqa)
Part 6: Datacenter Activation Coordination: Who has a say? (https://aka.ms/W51h6n)
Part 7: Datacenter Activation Coordination: When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover (https://aka.ms/Oieqqp)
Part 8: Datacenter Activation Coordination: Stop! In the Name of DAG... (https://aka.ms/Uzogbq)
Part 9: Datacenter Activation Coordination: An error caused a change in the current set of domain controllers (https://aka.ms/Qlt035)


What I was trying to highlight here: it is commonly assumed that if the Replication service on a machine is reachable, that machine can cast a vote in this process. That is not true. The Replication service has to be started, but the node itself must also be an active member of the cluster for its vote to count.