I paid a customer a visit a while ago and was asked to assist with a SQL Server Failover Cluster issue they were experiencing. They had internally transferred the case from the SQL team to the folks who look after the Windows Server platform, as initial troubleshooting had not turned up anything SQL-related.

My aim in this post is to:

explain what the issue was (disks intended as local storage were added to the cluster)

provide a little bit of context on cluster disks and asymmetric storage configuration

discuss how the issue was resolved by removing the disks from the cluster

Issue definition and scope

An attempt to move the SQL Server role/group from one node to another in a 2-node Failover Cluster failed. This is what they observed:

From the image above, you can see that all disk resources are online. Would you suspect at this stage that storage is involved? In the cluster events, there was the standard Event ID 1069 confirming that the cluster resource ‘SQL Server’ of type ‘SQL Server’ in clustered role ‘SQL Server (MSSQLSERVER)’ failed. Additionally, this is what was in the cluster log – “failed to start service with error 2”:

Error code 2 means that the system cannot find the file specified:
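If you ever need to translate a Win32 error code like this yourself, it only takes a couple of lines. Here is a minimal sketch in Python (ctypes.FormatError is Windows-only); the same lookup is also available from a command prompt via net helpmsg 2:

```python
import ctypes

# Translate a Win32 error code into its system message.
# Error 2 is ERROR_FILE_NOT_FOUND.
code = 2
print(f"Error {code}: {ctypes.FormatError(code)}")
# -> Error 2: The system cannot find the file specified.
```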

A little bit of digging around reveals that this is the image path we are failing to get to:
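The path the Service Control Manager tries to start is recorded in the service's registry key. A minimal sketch for reading it from Python on Windows, assuming the default instance, whose service name is MSSQLSERVER (as the event above shows):

```python
import winreg

# The ImagePath value of a service key holds the binary that the
# Service Control Manager tries to start - the path that error 2
# says cannot be found.
service = "MSSQLSERVER"  # default instance service name
key_path = rf"SYSTEM\CurrentControlSet\Services\{service}"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
    image_path, _ = winreg.QueryValueEx(key, "ImagePath")

print(image_path)
```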

Now that we have all this information, let’s look at how you would resolve this specific issue. Before that, however, I would like to provide a bit of context on cluster disks, particularly Asymmetric Storage Configuration.

Let us first take a look at the disks node in Failover Cluster Manager (FCM) before adding the disks.

Here’s what we have (ordered by the disk number column):

The Failover Cluster Witness disk (1 GB)

SQL Data (50 GB)

SQL Logs (10 GB)

Other Stuff (5 GB)
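For reference, the same inventory can be pulled without opening FCM. Below is a minimal sketch that shells out to the documented Get-ClusterResource cmdlet from Python; run it on one of the cluster nodes:

```python
import subprocess

# List all Physical Disk resources in the cluster with their state
# and owner, mirroring what the FCM disks view shows.
ps = (
    "Get-ClusterResource | "
    "Where-Object ResourceType -eq 'Physical Disk' | "
    "Select-Object Name, State, OwnerNode, OwnerGroup | "
    "Format-Table -AutoSize"
)
result = subprocess.run(
    ["powershell", "-NoProfile", "-Command", ps],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```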

The following window is presented when you perform the Add Disk operation in FCM:

Both disks are added as cluster disks when you click OK at this stage. After adding the disks (each of which is visible to only one node), we see the following:

Nothing has changed for the four disks we have already seen in FCM, and the two “local” disks are now included:

Cluster Disk 1 is online on node PTA-SQL11

Cluster Disk 2 is offline on node PTA-SQL11 as it is not physically connected to the node

At this stage, everything still works fine as the SQL binaries volume is still available on this node. Note that the “Available Storage” group is running on PTA-SQL11.
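For completeness, the add operation we just performed in FCM, and the ownership check, can also be scripted. Here is a sketch using the documented Get-ClusterAvailableDisk, Add-ClusterDisk and Get-ClusterGroup cmdlets (run on a cluster node; note that this adds every disk the cluster considers available, exactly like clicking OK in the dialog above):

```python
import subprocess

def run_ps(command: str) -> str:
    """Run a PowerShell command on this node and return its output."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Equivalent of clicking OK in the Add Disks dialog: every disk the
# cluster considers available is added as a cluster disk - including,
# as in this case, disks that only one node can see.
print(run_ps("Get-ClusterAvailableDisk | Add-ClusterDisk"))

# Confirm which node currently owns the Available Storage group.
print(run_ps("Get-ClusterGroup 'Available Storage'"))
```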

What happens when you move the Available Storage group?
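The move can be done from FCM or scripted; below is a minimal sketch using the documented Move-ClusterGroup cmdlet, with the node name PTA-SQL12 taken from this environment:

```python
import subprocess

# Scripted equivalent of moving the Available Storage group in FCM.
# PTA-SQL12 is the second node in this example environment.
subprocess.run(
    ["powershell", "-NoProfile", "-Command",
     "Move-ClusterGroup 'Available Storage' -Node PTA-SQL12"],
    check=True,
)
```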

Let’s take a look at FCM again:

Now we see that:

Cluster Disk 1 is now offline

Cluster Disk 2 is now online

The owner of the “Available Storage” group is now PTA-SQL12

This means that PTA-SQL12 can see the SQL binaries volume and PTA-SQL11 cannot, which causes downtime. Moving the SQL group to PTA-SQL12 works just fine, as the SQL binaries drive is online on that node. You may also want to ensure that the resources are configured to automatically recover from failures. Below is an example of the default configuration on a resource:
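Those recovery settings are exposed as common properties on the resource object, so they can also be inspected from a script. A minimal sketch, assuming the resource is named ‘SQL Server’ as in the event earlier:

```python
import subprocess

# RestartAction, RestartThreshold and RestartPeriod are the cluster
# resource common properties behind the recovery settings shown above.
ps = (
    "Get-ClusterResource 'SQL Server' | "
    "Format-List Name, RestartAction, RestartThreshold, RestartPeriod"
)
result = subprocess.run(
    ["powershell", "-NoProfile", "-Command", ps],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```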

Process, People and Technology

It may appear that the technology is at fault here, but the Failover Cluster service actually does its bit to protect us from shooting ourselves in the foot. Here are some examples:

Validation

The Failover Cluster validation report does a good job of letting you know that disks are only visible from one node. By the way, there’s also good information here on what is considered when determining whether a disk can be clustered.

A warning in a validation report is more of a “proceed with caution”. Failures/errors mean that the solution does not meet the requirements for Microsoft support. Also, be careful when validating storage, as services may be taken offline.
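Validation can also be kicked off from a script rather than the wizard. A minimal sketch that runs only the storage tests via the documented Test-Cluster cmdlet; per the caution above, treat this as a maintenance-window activity on a production cluster:

```python
import subprocess

# Run only the storage category of cluster validation. As noted
# above, storage validation can take clustered disks offline, so
# avoid running it against production storage during business hours.
result = subprocess.run(
    ["powershell", "-NoProfile", "-Command",
     "Test-Cluster -Include 'Storage'"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # includes the path to the generated report
```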

Logic

In the following snippet from the cluster log, we see an example of the Failover Cluster Resource Control Manager (RCM) preventing a move of the “Available Storage” group in order to avoid downtime.

Back online and way forward

To get the service up and running again, we had to remove both Cluster Disk 1 and Cluster Disk 2 as cluster disks and make them “local” drives again. The root cause: an administrator had added disks that were never meant to be part of the cluster as clustered disks.
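The removal itself can be done from the disk’s context menu in FCM or scripted. A minimal sketch using the documented Remove-ClusterResource cmdlet with the cluster disk names from earlier:

```python
import subprocess

# Remove the two wrongly added disks from the cluster so they can
# become plain local volumes again. -Force skips the confirmation
# prompt, so double-check the resource names before running this.
for disk in ("Cluster Disk 1", "Cluster Disk 2"):
    subprocess.run(
        ["powershell", "-NoProfile", "-Command",
         f"Remove-ClusterResource '{disk}' -Force"],
        check=True,
    )
```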

After removal, the disks need to be brought online from a tool such as the Disk Management console; they are automatically placed in an offline state to avoid the issues that could arise from having a non-clustered disk online on two or more nodes in a shared-disk scenario.
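Bringing a removed disk back online can be done from Disk Management, or scripted via the documented Get-Disk/Set-Disk cmdlets. A sketch follows; the disk number used here is a placeholder, so confirm it against the Get-Disk output on the node first:

```python
import subprocess

# Show all disks on this node so the right disk number can be
# identified - the removed cluster disks will show as Offline.
subprocess.run(
    ["powershell", "-NoProfile", "-Command", "Get-Disk"],
    check=True,
)

# Bring the disk online. Disk number 5 is a placeholder; replace it
# with the number reported by Get-Disk for the disk in question.
subprocess.run(
    ["powershell", "-NoProfile", "-Command",
     "Set-Disk -Number 5 -IsOffline $false"],
    check=True,
)
```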

I got curious after this and reached out to folks who specialize in SQL Server to get their views on whether the SQL binaries drive should or should not be shared. One strong view is to keep the binaries on non-shared (non-clustered) drives, especially because of SQL Server patching. What happens, for example, if patching fails in a shared-drive scenario?