Just a quick blog post and a reminder about something I worked on a couple of months ago (or even longer), which took a considerable amount of time to figure out. Then last week a colleague pinged me with a question about a case they had been dealing with for quite some time.

When trying to join a second DAG member in a virtualized Hyper-V environment, the following error message is presented:

“WriteError! Exception = Microsoft.Exchange.Cluster.Replay.DagTaskOperationFailedException: A server-side database availability group administrative operation failed. Error The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API ‘”AddClusterNode() (MaxPercentage=100) failed with 0x5b4. Error: This operation returned because the timeout period expired”‘ failed.. —> Microsoft.Exchange.Cluster.Replay.AmClusterApiException: An Active Manager operation failed. Error: An error occurred while attempting a cluster operation. Error: Cluster API ‘”AddClusterNode() (MaxPercentage=100) failed with 0x5b4. Error: This operation returned because the timeout period expired”‘ failed. —> System.ComponentModel.Win32Exception: This operation returned because the timeout period expired”

If you look at the cluster logs, you will notice the cluster heartbeat failing on UDP port 3343.

Obviously you need to go through the normal troubleshooting steps for network connectivity and firewall rules. In this case everything seemed right; as my friend pointed out, every knob had been turned and he was ready to give up. That’s when I asked…

Have you disabled TCP/UDP checksum offload on both the host and guest network cards?

After disabling these options on both the guest and the parent partition, the cluster could be formed without any issues.

So in this case, after you have made sure the obvious has been checked and you still cannot join your cluster nodes in a virtual environment, take a look at the NIC properties for TCP/UDP Checksum Offload for IPv4 and IPv6.
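For reference, on Windows Server 2012 and later the same settings can be flipped from PowerShell instead of the NIC’s Advanced properties dialog. A minimal sketch, assuming the adapter is named "Ethernet" (a placeholder; substitute your own adapter name), run on both the parent partition and the guest:

    # Show the current checksum offload settings for the adapter
    Get-NetAdapterChecksumOffload -Name "Ethernet"

    # Disable TCP and UDP checksum offload for IPv4 and IPv6,
    # then retry joining the node to the DAG/cluster
    Disable-NetAdapterChecksumOffload -Name "Ethernet" -TcpIPv4 -TcpIPv6 -UdpIPv4 -UdpIPv6

On older systems the equivalent switches live in Device Manager, under the adapter’s Advanced tab.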

There are weeks where the same type of issue keeps recurring and makes you study and understand a little bit more about Windows internals.

These past two weeks I had a couple of cases involving hangs that were caused by the File System Cache, and they will be the topic of this two-part blog.

With every support incident it is very important to understand the system and make sure you collect as much information as possible to help with the issue. In this case, a server had a brief hang, meaning that it was unresponsive for a couple of minutes but would recover afterwards. The biggest issue was that the backup operation never completed successfully, since the application would time out before the server recovered.

During the hang period a kernel dump was collected to try to discover the reason the server was unresponsive, and luckily the dump was collected at the right time.

The investigation pointed to a large number of outstanding pages within the File System Cache, or dirty pages, exceeding the threshold set by the operating system. The easiest way to check these values is with the WinDBG extension !defwrites.
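If you have never run it, !defwrites prints a short cache write-throttle analysis with the relevant counters side by side. A rough sketch of what to look for (values omitted here, and the exact set of fields varies a little between OS versions):

    0: kd> !defwrites
    *** Cache Write Throttle Analysis ***

        CcTotalDirtyPages:     <dirty pages currently in the cache>
        CcDirtyPageThreshold:  <page count at which write throttling engages>
        MmAvailablePages:      <pages available to the system>
        ...

Comparing CcTotalDirtyPages against CcDirtyPageThreshold is a quick way to see how close the system is to throttling writes.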

However, CcDirtyPageThreshold is set by default to 50% of physical memory, and if you look at the current values at the moment of the dump, we are below that 50% threshold, which means the operating system had started to lower the threshold in an attempt to clear up the dirty pages faster; check KB920739.
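As a rough illustration of what that default means (a hypothetical server with 16 GB of RAM and the standard 4 KB page size):

    16 GB RAM ÷ 4 KB per page            = 4,194,304 pages
    default CcDirtyPageThreshold ≈ 50%   = 2,097,152 pages

A threshold noticeably below that figure in the dump is what indicates the operating system has already started pulling it down.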

********************

The System Internals Cache Manager uses a variable that is named CcDirtyPageThreshold. By default, the value of CcDirtyPageThreshold may be set too high for scenarios where there are many lazy writes. By default, the CcDirtyPageThreshold global kernel variable is set to a value that is half of the physical memory. This variable triggers the cache manager’s write throttles.

*********************

If you want to check all the variables that contain Dirty, use x nt!*Dirty*.

Once you find what you want, for instance fffff800`06020ae0 nt!CcTotalDirtyPages, just dump it:
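Putting those two steps together in the debugger, a quick sketch (the symbol address is the one from this dump; yours will differ, so prefer the symbol name):

    0: kd> x nt!*Dirty*
    fffff800`06020ae0 nt!CcTotalDirtyPages
    ...

    0: kd> dq nt!CcTotalDirtyPages L1

On x64 the counter is pointer-sized, so dq ... L1 shows the single quadword holding the page count; on a 32-bit system dd ... L1 would be the equivalent.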

We know it is one of the CPUs generating the PCI Parity event, but we cannot tell definitively which one, since the system has two physical CPUs, as you can check with the !sysinfo cpuinfo extension or with the Windows Object Manager extension (!object).
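Both of those run straight against the same dump; a minimal sketch of the first (the output is machine-specific, so it is omitted here):

    0: kd> !sysinfo cpuinfo

!sysinfo cpuinfo lists the manufacturer, family/model/stepping, and speed for each processor the firmware reports, which is enough to confirm how many physical packages are present.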