In this guide I will give you a quick overview of how to troubleshoot Storage Spaces: ordinary storage pools with and without tiering, as well as Storage Spaces Direct (S2D). The troubleshooting is based on an issue we had with our test Storage Spaces Direct cluster.

What happened was that we started to experience really bad performance on the VMs. Response times were going through the roof; we saw response times of 8000 ms on normal OS operations. We traced it down to faulty SSD drives: Kingston V310 consumer SSDs. These do not have power loss protection, and that's a problem because S2D, and Windows storage in general, wants to write to a safe place. The caching on these Kingston drives worked for a while, but after too much writing it failed. You can read all about SSDs and power loss protection here.

So what I will cover here is how to identify slow disks.

First you will notice performance issues on your VMs or physical server. Open Resource Monitor and look at the response times.

You should get an output like this. Look for the SMB Connectivity Error Check, check the errors for the last 5 minutes, and run the test a few times to see whether the numbers increase, as that is an indication of RDMA problems.
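As a quick sanity check from PowerShell, before digging into the error counters, you can verify that the adapters actually report RDMA as operational and that SMB sees RDMA-capable interfaces. This is a generic sketch, not the exact check shown above:

```powershell
# Verify that the NICs report RDMA as enabled
Get-NetAdapterRdma | Format-Table Name, Enabled -AutoSize

# Check whether the SMB client sees the interfaces as RDMA capable
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable -AutoSize

# Show active SMB multichannel connections and their RDMA capability
Get-SmbMultichannelConnection |
    Format-Table ServerName, ClientRdmaCapable, ServerRdmaCapable -AutoSize
```

If RDMA shows as disabled or not capable here, fix that before chasing disk latency.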

As you can see there were some issues with RDMA here, but nothing serious. Node 4 is being looked after; I will come back with a guide for that.

Now let's have a look at the disks on the physical machine.

If you have S2D you should check that all drives in the storage pool are attached to a caching device. I have NVMe as the caching device and SSD and HDD as performance and capacity. The caching device will show up as :0 and the disks bound to it as 1:0, 2:0 and so on, depending on how many drives you have.
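A quick way to see how the pool classifies your drives is to list them by media type and usage; in an S2D pool the caching devices show up with Usage "Journal". This is a generic sketch and the FriendlyName values will differ on your hardware:

```powershell
# List all physical disks with media type, usage and health, sorted so the
# cache (Journal) devices and capacity drives are easy to tell apart
Get-PhysicalDisk |
    Sort-Object MediaType |
    Format-Table FriendlyName, MediaType, Usage, OperationalStatus, HealthStatus, Size -AutoSize
```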

Now we can go back to Performance Monitor. On node 1 we can see that the SSDs are disks 4 and 5. In the screenshot I have here, the SSDs were 1 and 2.

If you look closely at Avg. Disk sec/Read, Avg. Disk sec/Transfer and Avg. Disk sec/Write you will see high numbers. They should be in the 0.001-0.003 range, or as low as possible. We saw numbers in the range of 1.870, and note that these are average numbers, not maximums, so you can imagine how high the latency got. This was caused by Windows Server telling the disk to write to a safe place. When the caching on the SSD is not working, the safe place is directly on the NAND cells, which manage about 200 IOPS, while the SSD should do around 25k IOPS in writes.
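You can pull the same latency counters from PowerShell instead of the Perfmon GUI, which is handy for comparing nodes quickly. A sketch, with the instance names depending on your disk numbering:

```powershell
# Sample the physical-disk latency counters once across all disks.
# Values are in seconds, so 0.001 = 1 ms and 1.870 = 1870 ms.
$counters = "\PhysicalDisk(*)\Avg. Disk sec/Read",
            "\PhysicalDisk(*)\Avg. Disk sec/Write",
            "\PhysicalDisk(*)\Avg. Disk sec/Transfer"

(Get-Counter -Counter $counters).CounterSamples |
    Sort-Object CookedValue -Descending |
    Format-Table InstanceName, Path, CookedValue -AutoSize
```

The worst offenders sort to the top; anything sustained above a few milliseconds on an SSD deserves a closer look.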

To replace failed drives in a storage pool I used this guide. I will paste it in here for reference.

PowerShell

# Find the failed Disk

Get-PhysicalDisk

# Shutdown, take the disk out and reboot. Set the missing disk to a variable
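Only the first lines of the referenced guide survived here, but the standard replacement sequence continues roughly like this. A hedged sketch: the pool name "Pool" and the missing-disk filter are assumptions, so adjust them to your environment:

```powershell
# Find the disk the pool can no longer reach (assumption: after the physical
# swap it shows OperationalStatus "Lost Communication")
$missingDisk = Get-PhysicalDisk |
    Where-Object OperationalStatus -eq "Lost Communication"

# Retire the missing disk so the pool stops trying to use it
$missingDisk | Set-PhysicalDisk -Usage Retired

# Add the new drive to the pool ("Pool" is a placeholder name)
$newDisk = Get-PhysicalDisk | Where-Object CanPool -eq $true
Add-PhysicalDisk -StoragePoolFriendlyName "Pool" -PhysicalDisks $newDisk

# Rebuild the virtual disks onto the new drive
Get-VirtualDisk | Repair-VirtualDisk

# When the repair has finished, remove the retired disk from the pool
Remove-PhysicalDisk -StoragePoolFriendlyName "Pool" -PhysicalDisks $missingDisk
```

Let each repair finish before touching the next drive; swapping several disks at once risks losing resiliency.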

In the end we discovered that S2D was not working with SES, and as such the HDDs weren't even in the enclosure – just the SSDs! A change of HDDs gave us two enclosures with the SSDs and HDDs as expected, and all is working fine now.

We have come across exactly the same situation (we were using Intel 520 SSDs without power protection). We have swapped the SSDs out with Intel Data Centre models (which DO have power protection), however we are not seeing any difference… The counters are not increasing even with heavy write operations.

Once you swapped your SSDs out, did you need to do anything else? E.g. reboot the nodes, repair the CSVs?

Hello Steve, how are your virtual disks? Did you swap one disk at a time and let it rebuild? You can run Optimize-StoragePool to rebalance the disks. You could also check out the show-prettypool.ps1 script from Cosmos Darwin; it will show you the balance of the disks.
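For reference, the rebalance call looks like this; the pool name is a placeholder (S2D pools are typically named "S2D on <clustername>"):

```powershell
# Rebalance data evenly across all drives in the pool
Get-StoragePool -FriendlyName "S2D on Cluster01" | Optimize-StoragePool
```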

Also, send me a picture of your Perfmon and physical disks when you have load on them.