Part 8: Datacenter Activation Coordination: Stop! In the Name of DAG…

Sometimes, even when following a specific process, it takes only one mistake to send the entire process off course. Recently I’ve worked with several customers who found themselves unable to complete their datacenter switchover steps. Let’s explore several examples of what happened…

In the first example, we have a four-member database availability group (DAG). Two members are deployed in the primary datacenter along with the witness server, and the other two members are installed in a remote datacenter with an alternate witness server. Each datacenter is an Active Directory site with a defined subnet. In this example, AD site Exchange-A is the primary datacenter and AD site Exchange-B is the remote datacenter. Here is an example network diagram:

In preparation for testing the witness server, MBX-1, MBX-2, and the router are powered down. This leaves MBX-3 and MBX-4 in a lost quorum state in the remote datacenter. The administrator starts the datacenter switchover process with Stop-DatabaseAvailabilityGroup, as shown in this example:

WARNING: Active Directory couldn't be updated in Exchange-A site(s) affected by the change to 'DAG'. It won't be completely usable until after Active Directory replication occurs.
An error caused a change in the current set of domain controllers.
    + CategoryInfo : NotSpecified: (0:Int32) [], ADServerSettingsChangedException
    + FullyQualifiedErrorId : 372697AD

Next, the cluster service is stopped on MBX-3 and MBX-4.

Stop-Service ClusSvc

To complete the switchover, Restore-DatabaseAvailabilityGroup is used.

WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-12_14-07-52.764_restore-databaseavailabilitygroup.log".
Unable to get the status of the cluster service on server 'MBX-2'. Error: 'Cannot open Service Control Manager on computer 'MBX-2'. This operation might require other privileges.'
    + CategoryInfo : InvalidArgument: (:) [Restore-DatabaseAvailabilityGroup], FailedToGetServiceStatusForNodeException
    + FullyQualifiedErrorId : A9B129A5,Microsoft.Exchange.Management.SystemConfigurationTasks.RestoreDatabaseAvailabilityGroup

The command returns an error indicating that it cannot contact server MBX-2 in order to determine the status of the Cluster service. Why is the task attempting to contact a server in the primary site that is down? Using Get-DatabaseAvailabilityGroup to review the properties of the DAG shows us why:

We can examine StoppedMailboxServers and note that MBX-3 and MBX-4 are on the stopped list when they should be on the started servers list. This happened because in this instance the administrator stopped the wrong Active Directory site. When using Stop-DatabaseAvailabilityGroup, the administrator should have specified site Exchange-A but accidentally specified Exchange-B. This means the restore task is attempting to force the Cluster service on either MBX-1 or MBX-2 online and subsequently evict MBX-3 and MBX-4 from the cluster.
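The check described above can be reproduced with a one-line query. This is a minimal sketch that assumes the DAG is literally named DAG, as in this example; substitute your own DAG name.

```powershell
# List which DAG members Active Directory currently records as started and stopped.
# 'DAG' is the example DAG name used in this post; replace it with your own.
Get-DatabaseAvailabilityGroup -Identity DAG |
    Format-List Name,StartedMailboxServers,StoppedMailboxServers
```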

If this mistake is made, how do you fix it? The first step is to correct the stopped and started servers lists. To do this, first stop the correct set of servers.
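As a sketch of that first corrective step, the command might look like the following. It assumes the DAG name and site names from this example, and it assumes the primary datacenter servers are still unreachable, which is why ConfigurationOnly is included.

```powershell
# Mark the members in the failed primary site (Exchange-A) as stopped.
# -ConfigurationOnly updates Active Directory only; it does not try to
# contact the powered-off servers in that site.
Stop-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange-A -ConfigurationOnly
```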

WARNING: Active Directory couldn't be updated in Exchange-A site(s) affected by the change to 'DAG'. It won't be completely usable until after Active Directory replication occurs.
An error caused a change in the current set of domain controllers.
    + CategoryInfo : NotSpecified: (0:Int32) [], ADServerSettingsChangedException
    + FullyQualifiedErrorId : 372697AD

Next, use Get-DatabaseAvailabilityGroup to confirm that all four servers in the DAG now appear on the StoppedMailboxServers list.
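With all four members on the stopped list, the servers in the remote datacenter then need to be moved back to the started list. The following is a sketch of how that could be done, assuming the site name Exchange-B and the DAG name DAG from this example.

```powershell
# Move the surviving members in the remote site (Exchange-B) back to the started list.
Start-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange-B

# Re-check the lists: MBX-1 and MBX-2 should be stopped, MBX-3 and MBX-4 started.
Get-DatabaseAvailabilityGroup -Identity DAG |
    Format-List Name,StartedMailboxServers,StoppedMailboxServers
```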

The failures that are displayed when Start-DatabaseAvailabilityGroup is run for MBX-3 and MBX-4 are expected, because the Cluster service on those nodes is not in a started state at this time. Using Get-DatabaseAvailabilityGroup we note that the servers listed are now correct for both the StartedMailboxServers and StoppedMailboxServers lists.

The third step is to ensure the Cluster service is stopped on each node, which can be accomplished by using Stop-Service.

Stop-Service ClusSvc

The last step is to use Restore-DatabaseAvailabilityGroup. This cmdlet will complete the datacenter switchover process by forcing the Cluster service to start and by evicting the nodes on the StoppedMailboxServers list.
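A minimal sketch of that final step follows. It assumes the names used in this example and that the alternate witness server is already configured on the DAG, so no witness parameters are passed.

```powershell
# Run from the Exchange Management Shell on a surviving member in the remote
# datacenter, after the Cluster service has been stopped on MBX-3 and MBX-4.
Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite Exchange-B
```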

In the second example we have a four-member DAG. Two members are in the primary datacenter with the witness server, and two members are in a remote datacenter with an alternate witness server configured. Both datacenters are in the same Active Directory site. Here is an example network diagram:

In preparation for testing the witness server, MBX-1, MBX-2, and the router are powered down. This leaves MBX-3 and MBX-4 in a lost quorum state in the remote datacenter. So the administrator starts the datacenter switchover process by issuing Stop-DatabaseAvailabilityGroup, as illustrated in the following example:

WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-12_16-57-27.326_restore-databaseavailabilitygroup.log".
Unable to form quorum for database availability group 'DAG'. Please try the operation again, or run the Restore-DatabaseAvailabilityGroup cmdlet and specify the site with servers known to be running.
    + CategoryInfo : InvalidArgument: (:) [Restore-DatabaseAvailabilityGroup], DagTaskQuorumNotAchievedException
    + FullyQualifiedErrorId : C7FE0CB9,Microsoft.Exchange.Management.SystemConfigurationTasks.RestoreDatabaseAvailabilityGroup

The command returns an error indicating that a quorum cannot be formed because no servers are known to be running. Why has this occurred? Using Get-DatabaseAvailabilityGroup we can review the properties of the DAG:

Specifically, we are interested in StoppedMailboxServers. In this example, all four DAG members appear in the StoppedMailboxServers list. Why is that? In our scenario, all Exchange servers are in the same Active Directory site. The administrator issued the Stop-DatabaseAvailabilityGroup command with the ActiveDirectorySite parameter when the MailboxServer parameter should have been used instead. The MailboxServer parameter was needed so that the administrator could stop individual servers instead of all of the servers in the site.
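For comparison, stopping only the failed members might look like the sketch below. It assumes the server names from this example and that MBX-1 and MBX-2 cannot be contacted, which is why ConfigurationOnly is included.

```powershell
# Stop only the members in the failed primary datacenter, one server at a time.
# -ConfigurationOnly is used because MBX-1 and MBX-2 are powered off and
# cannot be contacted.
Stop-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-1 -ConfigurationOnly
Stop-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-2 -ConfigurationOnly
```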

If this mistake is made, you can recover from it fairly easily. The first step is to fix the started and stopped mailbox server lists. You can use Start-DatabaseAvailabilityGroup to correct this.
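As a sketch of that correction, assuming the server and DAG names from this example, the surviving members are started individually because they share an Active Directory site with the failed members:

```powershell
# Move only the remote-datacenter members back to the started list.
Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-3
Start-DatabaseAvailabilityGroup -Identity DAG -MailboxServer MBX-4
```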

The failures that are displayed are expected. The Cluster service on the DAG members is not started at this time. We can use Get-DatabaseAvailabilityGroup to verify that the StartedMailboxServers and StoppedMailboxServers lists are correct.

The second step is to ensure that the Cluster service is stopped on MBX-3 and MBX-4.

Stop-Service ClusSvc

The last step is to run the Restore-DatabaseAvailabilityGroup command. This will complete the datacenter switchover process by forcing the Cluster service to start and by evicting the nodes on the stopped servers list.

In the last example we have a four-member DAG. Two members are installed in a primary datacenter with the witness server, and two members are installed in a remote datacenter with an alternate witness server configured. Both datacenters are in the same Active Directory site. The same situation described in this example can occur when multiple Active Directory sites are used, but in my experience, this problem most commonly occurs with just a single Active Directory site. Here is an example network diagram:

In preparation for testing the witness server, MBX-1, MBX-2, and the router are powered down. This leaves MBX-3 and MBX-4 in a lost quorum state in the remote datacenter. So, the administrator starts the datacenter switchover process with Stop-DatabaseAvailabilityGroup:

WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-12_16-57-27.326_restore-databaseavailabilitygroup.log".
Unable to form quorum for database availability group 'DAG'. Please try the operation again, or run the Restore-DatabaseAvailabilityGroup cmdlet and specify the site with servers known to be running.
    + CategoryInfo : InvalidArgument: (:) [Restore-DatabaseAvailabilityGroup], DagTaskQuorumNotAchievedException
    + FullyQualifiedErrorId : C7FE0CB9,Microsoft.Exchange.Management.SystemConfigurationTasks.RestoreDatabaseAvailabilityGroup

As with the previous examples, the problem occurred because the administrator issued the Stop-DatabaseAvailabilityGroup command in a way that added all servers to the stopped servers list. This is verified with Get-DatabaseAvailabilityGroup.

The extent of the issue is realized when we attempt to correct the started and stopped mailbox server lists and proceed with the switchover process. As with the previous examples, we use Start-DatabaseAvailabilityGroup with the MailboxServer parameter to start the individual servers in the remote datacenter.

The failures that are displayed are expected because the Cluster service on the DAG members is not in a started state. Using Get-DatabaseAvailabilityGroup, we note that the servers are correct on both the StartedMailboxServers and StoppedMailboxServers lists.

WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-08-12_17-55-16.974_restore-databaseavailabilitygroup.log".
Couldn't start the Cluster service on 'MBX-3'. Service state: Stopped. Try forcing the cluster to start without quorum by running "net start clussvc /fq" from a command prompt on that node.
    + CategoryInfo : InvalidArgument: (:) [Restore-DatabaseAvailabilityGroup], FailedToStartClusSvcException
    + FullyQualifiedErrorId : 6CD04940,Microsoft.Exchange.Management.SystemConfigurationTasks.RestoreDatabaseAvailabilityGroup

As shown above, Restore-DatabaseAvailabilityGroup failed because it could not start the Cluster service on MBX-3 using force quorum. The error suggests that the administrator should attempt to start the service manually with the /fq (force quorum) switch.

net start clussvc /fq

System error 1058 has occurred. The service cannot be started, either because it is disabled or because it has no enabled devices associated with it.

After attempting to manually start the Cluster service with the force quorum switch, the above error is displayed, which indicates that the Cluster service is disabled rather than simply stopped.

When reviewing Service Control Manager, we note that the Cluster service on the remaining members is set to Disabled.
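The same check can be made from a shell instead of the Services console. This is a sketch using WMI, assuming the example server names MBX-3 and MBX-4 and that remote WMI access to them is permitted.

```powershell
# Show the state and startup mode of the Cluster service on each remaining member.
# MBX-3 and MBX-4 are the example server names from this post.
foreach ($server in 'MBX-3', 'MBX-4') {
    Get-WmiObject -Class Win32_Service -ComputerName $server -Filter "Name='ClusSvc'" |
        Format-List SystemName,Name,State,StartMode
}
```

A StartMode of Disabled on these servers matches the 1058 error returned by net start.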

When reviewing the system event log, we see the following event at or about the time the Stop-DatabaseAvailabilityGroup was issued.

This is where the extent of the mistake is exposed. Stop-DatabaseAvailabilityGroup was not only run against servers that should not have been stopped, but it was also run without the ConfigurationOnly parameter. When the cmdlet is run without ConfigurationOnly, the Cluster service is forcibly cleaned up on every server being stopped that is still reachable. This in turn prevents Restore-DatabaseAvailabilityGroup from succeeding.

In order to overcome this situation the administrator must re-establish the Cluster and then proceed with database activation. The first step is to ensure that the Cluster service is completely cleaned up from the DAG members in the remote datacenter.
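One way to perform that cleanup is with cluster.exe, sketched below. It assumes the command is run locally, from an elevated prompt, on each remote-datacenter member (MBX-3 and MBX-4 in this example).

```powershell
# Remove any remaining cluster configuration from the local node.
# cluster.exe is installed with the Failover Clustering feature on
# Windows Server 2008 and 2008 R2. On 2008 R2 and later, the
# Clear-ClusterNode cmdlet in the FailoverClusters module is an alternative.
cluster.exe node $env:COMPUTERNAME /forcecleanup
```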

The second step is to use Active Directory Users and Computers to locate the DAG’s cluster name object (CNO). Right-click the CNO and select Reset Account, and then right-click the CNO again and select Disable Account. Allow sufficient time for the disabled account to replicate around Active Directory.

The third step is to manually create the cluster. There are three methods to manually create the cluster.

Windows 2008 and Windows 2008 R2 utilizing Failover Cluster Manager:

Launch Failover Cluster Manager.

In the upper right corner select “Create a cluster…”

In the “Before you begin” dialog, select Next.

On the “Selected Server” dialog, enter the server names of all servers in the remote datacenter. In our example, we will add MBX-3 and MBX-4. Select the Add button after each server name. Select Next when completed.

On the “Validation Warning” dialog, select No. Select Next when completed.

On the “Access Point for Administering the Cluster” dialog, in the “Cluster Name:” field, enter the name of the DAG. In our example we will use DAG (creative eh?). In the networks dialog, enter the IP address assigned to the DAG in the remote datacenter (if you are not sure, you can use Get-DatabaseAvailabilityGroup | fl Name,DatabaseAvailabilityGroupIpAddresses to list the IP addresses assigned to the DAG). Select Next when complete.

On the “Confirmation” dialog, select Next.

At this time the Cluster service should be configured on both servers. On the “Summary” dialog, select Finish.
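On Windows Server 2008 R2, the same cluster could also be created from PowerShell rather than Failover Cluster Manager. The following is a sketch under stated assumptions, not a procedure taken from the steps above: the IP address shown is a hypothetical placeholder, and the DAG and server names are the ones from this example.

```powershell
# Requires the FailoverClusters module (Windows Server 2008 R2 or later).
Import-Module FailoverClusters

# Hypothetical placeholder: replace with the IP address assigned to the DAG
# in the remote datacenter.
$dagIpAddress = '192.0.2.10'

# Create the cluster using the DAG's name and the two remote members.
# -NoStorage keeps cluster creation from adding eligible disks as cluster resources.
New-Cluster -Name DAG -Node MBX-3, MBX-4 -StaticAddress $dagIpAddress -NoStorage
```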

The last step is to use the Exchange Management Shell and run the following command:

Set-DatabaseAvailabilityGroup -Identity DAG

Running this command without specifying any other values ensures that the DAG settings from Active Directory are applied to the new cluster.

At this time, the started and stopped mailbox server lists are accurate, and the Cluster service for the DAG has been re-established. Running Set-DatabaseAvailabilityGroup against the DAG with no other parameters ensures that the configuration of the new cluster matches the DAG configuration stored in Active Directory.

This completes the datacenter switchover for the database availability group. The procedure can now continue with database activation and changes required for client access.

This blog post covers three common scenarios I see where administrators make mistakes when using Stop-DatabaseAvailabilityGroup. When used incorrectly, the cmdlet can have unintended results and the steps outlined here can be used to work around them.

========================================================

Datacenter Activation Coordination Series:

Part 1: My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2: Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3: Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4: Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5: Datacenter Activation Coordination: How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6: Datacenter Activation Coordination: Who has a say? (https://aka.ms/W51h6n)
Part 7: Datacenter Activation Coordination: When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover. (https://aka.ms/Oieqqp)
Part 8: Datacenter Activation Coordination: Stop! In the Name of DAG... (https://aka.ms/Uzogbq)
Part 9: Datacenter Activation Coordination: An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

I came across something similar to scenario one while preparing for our annual disaster recovery test (where I had to create a split-brain environment…been meaning to e-mail you about my findings on that) and discovered that if I lose the DACP-bit (and the PAM) at the recovery data center that I can simply run a Start-DAG to get things going again. I found this out by trial and error when I had to restart the Exchange services in a lab.

I recently went through this process and the article was a big help. One note, on Windows 2012, when you recreate the cluster and add the nodes to the site, the cluster service will grab local disks on the nodes and assign them as cluster resources. This will prevent Exchange from finding them. The disk resources have to be deleted from the cluster and then marked as online in Disk Manager on each node.

Thanks for this insight into "anything that can go wrong will go wrong". I have one query. We ran through a two-datacentre node majority (3/2) switchover recently. It all worked, but after the Stop-DatabaseAvailabilityGroup against the primary site, Get-DatabaseAvailabilityGroup at the secondary always failed with code 0x46 or 0x6d9. After a bit of digging it looked like the cluster services on the recovery side had stopped themselves because the cluster had lost quorum, so Get-DatabaseAvailabilityGroup could not work. AD showed the three nodes had stopped OK and the restart-dag worked after the usual failures as the cluster service started. Should Get-DatabaseAvailabilityGroup work between the stop and the restart?

Hi Tim,
Thanks a lot for these great posts. One question: we have a four-member DAG, hence the FSW comes into the picture for the additional vote as per the n/2 + 1 method (3 votes should be available for the cluster to work). Now suppose two nodes and the FSW in the primary DC went offline. Quorum is lost and the cluster will be down.
Now we have two votes available, with MBX-3 and MBX-4 online. Can we bring the alternate FSW configured in the DR DC into the picture for the additional vote that is required to bring this cluster online? This would make 3 votes online and the cluster should come online.
Why do the entire datacenter switchover process?
I know I am missing something here. Hope you will help me find out what.

Thanks Tim… Your blogs are always awesome. But I have a question based on the second scenario.
DAG001 and DAG002 (DAG001 is in 10.1.0.0/16 and DAG002 is in 10.2.0.0/16) are both in the same site, assume Site A, and each DAG contains 4 Exchange servers. Both DAGs are in the same site but they are in isolated networks.

DAG001 databases are distributed between 2 subnets, and the same for DAG002 (we have already enabled DAC mode). Considering the current infrastructure, if my DAG001 goes down (the entire subnet 10.1.0.0/16, even the FSW, which is the tie breaker, also went down) and I run Restore-DatabaseAvailabilityGroup, will it affect DAG002?