Friday, January 29, 2016

Storage Spaces and Latent Sector Errors / Unrecoverable Read Errors

I emailed S2D_Feedback@microsoft.com to ask how Storage Spaces Direct (S2D) handles Latent Sector Errors (LSEs), otherwise known as Unrecoverable Read Errors (UREs). Here is the email I sent:

My company is in the process of evaluating different options
for upgrading our production server environment. I’m tasked with finding a
solution that meets our needs and is within our budget.

I’m trying to compare and contrast storage spaces direct
with storage spaces utilizing JBOD enclosures. Data resiliency, integrity and
availability are paramount. So I’m primarily looking at both of these technologies
from that perspective. Thus, if we go the JBOD route, we’re looking at
implementing 3 enclosures and utilizing the enclosure awareness of storage
spaces. This solution has existed longer than storage spaces direct and I would
think has been tested more thoroughly. I like the scalability and elegance of
storage spaces direct though. From a conceptual overview and a hardware setup
perspective it just seems easier to grasp and it seems like a better solution.

My question is, how do both of these setups handle
unrecoverable read errors/latent sector errors? Does one solution handle them
better than the other?

There are horror stories about hardware RAID controllers
evicting drives because of URE/LSE and then during RAID rebuilds encountering
additional UREs/LSEs and bricking the storage. This is more worrisome when SATA
disks are used (due to UREs/LSEs occurring more often and sooner with SATA
disks compared to SAS disks.) How does storage spaces/S2D differ in this
regard? I know one of the selling points of S2D is the use of SATA disks. I’m
curious as to how this problem has been addressed since SATA disks are being
promoted. What happens if there is a URE/LSE in end user data? What happens if
there is a URE/LSE in the metadata used by storage spaces/S2D or the underlying
file system?

Here is the response I received:

Both Spaces Direct and Shared Spaces (with JBOD) rely on the same software RAID
implementation; the difference is in the connectivity. The software RAID
implementation does not throw away the entire drive on failure. We trigger
activity to move the data off the drive while keeping the copy until the data is
moved (if we have copies available). On a write failure we try to move the
impacted range right away while background activity moves the untouched data
off the disk. Some disks fail to write but can continue to support reads, in
which case the data on those drives can still be used to serve user requests.
Until the data on the failed drive is rebuilt on spare capacity the drive is not
removed; the user can still force removal, but it is not automated. On a URE we
trigger a rebuild to recover the lost copy. This is triggered both when read
errors are detected while satisfying a user request and by the background scrub
process. The background scrub process detects UREs by validating sector-level
checksums across copies.

So it would appear that if you use Storage Spaces, you don't have to worry about an LSE/URE taking out a drive and then a subsequent LSE/URE during the rebuild taking out another drive, thus taking down your array.
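
To make the behavior described in the response a little more concrete, here is a minimal Python sketch of a mirrored store that repairs a single bad sector from a surviving copy instead of evicting the whole disk. This is only a toy model built on my own assumptions, not Microsoft's implementation: the `MirroredSpace` class, its method names, the use of CRC32 as the checksum, and the choice to keep the checksums in a separate table are all hypothetical.

```python
import zlib


class MirroredSpace:
    """Toy N-way mirror with one checksum per sector (hypothetical model)."""

    def __init__(self, copies=2):
        # Each "disk" maps sector number -> bytes, or None for an unreadable sector.
        self.disks = [dict() for _ in range(copies)]
        # Assumption: checksums live apart from the data they protect.
        self.checksums = {}

    def write(self, sector, data):
        self.checksums[sector] = zlib.crc32(data)
        for disk in self.disks:
            disk[sector] = data

    def _is_good(self, disk, sector):
        data = disk.get(sector)
        return data is not None and zlib.crc32(data) == self.checksums[sector]

    def _repair(self, sector):
        """Rewrite every bad copy of one sector from a good copy; return the count."""
        good = next((d[sector] for d in self.disks if self._is_good(d, sector)), None)
        if good is None:
            raise IOError(f"sector {sector}: no valid copy left")
        repaired = 0
        for disk in self.disks:
            if not self._is_good(disk, sector):
                disk[sector] = good  # only this sector is rebuilt, the disk stays in service
                repaired += 1
        return repaired

    def read(self, sector):
        """User read: a bad copy is repaired on the way, then the data is returned."""
        self._repair(sector)
        return self.disks[0][sector]

    def scrub(self):
        """Background scrub: check every sector on every copy and repair bad ones."""
        return sum(self._repair(s) for s in self.checksums)


space = MirroredSpace(copies=2)
space.write(7, b"payroll")
space.disks[0][7] = None           # simulate a URE on the first disk
repaired = space.scrub()           # the scrub finds and rewrites the bad copy
print(repaired, space.read(7))
```

The point of the sketch is the contrast with the hardware RAID horror stories: a read error costs one sector rewrite rather than a full-drive rebuild, so a second URE elsewhere only matters if it lands on the same sector of every remaining copy.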