The mysteriously disappearing drive: Are power outages killing your SSDs?

This site may earn affiliate commissions from the links on this page. Terms of use.

SSDs offer enormous benefits over traditional spinning disks. They’re up to an order of magnitude faster in certain operations, weigh less, and consume less power under load. They’ve become increasingly popular with enthusiasts and mainstream customers alike, but a paper presented early this year at the 11th Usenix Conference on File and Storage Technologies (FAST 13) suggests most models have a fundamental problem with sudden power loss. While the paper came out in mid-February, I only recently came across it, after a reader asked if I’d look into a rather puzzling recovery procedure recommended by Crucial for its M4 SSD line.

Crucial recommends that M4 owners whose drives suddenly vanish simply let the drive sit for 40-60 minutes with the SATA data cable disconnected but the power cable still connected. The company recommends that laptop owners let their systems sit at the BIOS screen, and there’s no word on whether this would also be better for desktop drives. USB 3.0 enclosures are considered sub-optimal. Baffled, I began to poke at this further, then stumbled across the aforementioned paper.

Researchers working with Ohio State University rounded up 15 SSDs from five different vendors, as well as a pair of HDDs, and put them through a series of tests designed to measure how they responded to sudden power failures. No vendors are identified, but the drives in question include both MLC and SLC NAND models. Some (the SLC versions) are explicitly enterprise drives. Some include supercapacitors, which are designed to give the drive enough reserve power to finish pending writes when external power fails.

The results were not encouraging. The group tested for six different kinds of failures:

Bit errors: Random, incorrectly written bits of data

Flying writes: Writes that were correctly written, but end up in the wrong location

Shorn writes: Writes that are below the expected size, due to the power failure

Metadata corruption: Corruption of the Flash Translation Layer (FTL), the mapping layer that sits between the flash hardware and the operating system

Bricked device: Self-explanatory

Unserializability: The blocks end up on the drive in a state that doesn’t match the order in which the writes were issued
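To make the first three categories a little more concrete, here is a minimal Python sketch of how a test harness might tell them apart when it reads a block back after a power cut. It is not the researchers’ actual tool. The 64-byte record layout, the names `make_record` and `classify`, and the use of CRC32 are all assumptions made for illustration, and a shorn write is modeled simply as a truncated record. The idea is that each record carries its own checksum and intended address, so a block can be checked on its own.

```python
import struct
import zlib
from enum import Enum

# Hypothetical record layout, invented for this sketch:
# 4-byte CRC32, 4-byte destination block address, 8-byte sequence number, payload.
BLOCK_SIZE = 64
HEADER = struct.Struct("<IIQ")
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size


class Verdict(Enum):
    OK = "ok"
    SHORN_WRITE = "shorn write"
    BIT_ERROR = "bit error"
    FLYING_WRITE = "flying write"


def make_record(address: int, seq: int, fill: int) -> bytes:
    """Build a self-describing record destined for block `address`."""
    body = struct.pack("<IQ", address, seq) + bytes([fill]) * PAYLOAD_SIZE
    # The checksum covers the address, the sequence number and the payload.
    return struct.pack("<I", zlib.crc32(body)) + body


def classify(block: bytes, read_address: int) -> Verdict:
    """Classify a block that was read back from `read_address`."""
    if len(block) < BLOCK_SIZE:
        # Simplification: a shorn write shows up here as a short record.
        return Verdict.SHORN_WRITE
    crc, address, _seq = HEADER.unpack_from(block)
    if zlib.crc32(block[4:]) != crc:
        # The contents no longer match the checksum written with them.
        return Verdict.BIT_ERROR
    if address != read_address:
        # Intact record, but it landed somewhere it was never sent.
        return Verdict.FLYING_WRITE
    return Verdict.OK


if __name__ == "__main__":
    record = make_record(address=7, seq=1, fill=0xAB)
    damaged = bytearray(record)
    damaged[20] ^= 0x01  # flip a single payload bit
    print(classify(record, 7))
    print(classify(record, 9))
    print(classify(record[:40], 7))
    print(classify(bytes(damaged), 7))
```

The other three categories can’t be judged from a single block. Spotting metadata corruption, a bricked device, or unserializable writes requires looking at the drive as a whole, for example by comparing the sequence numbers that survived against the order in which the writes were issued.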

Here’s the surprising part: Of the 15 drives (10 different models, from five vendors), only one drive model, from one vendor, had no failures of any sort. One device failed completely (SSD #1), while one-third of SSD #3 became unusable due to metadata corruption. The other SSDs all exhibited various types of data corruption when they unexpectedly lost power, including the high-end enterprise SSDs with SLC NAND and supercapacitors. According to the research team, part of the problem is that virtually none of the devices actually behave as expected under fault conditions. While all the drives claim to use ECC RAM, for example, many exhibited single-bit errors of exactly the kind that ECC is meant to prevent. While one of the two included hard drives also developed errors, the HDDs are both far cheaper and showed no sign of the disastrous failures that characterized the SSDs.

The impact of sudden power loss

The implications of this research are significant. It suggests that SSDs, including enterprise SSDs, should not be trusted to behave properly after a power fault, or to be as robust as HDDs. Indeed, a web search for the phrase “disappearing SSD” turns up a huge number of hits, and while many refer to the Crucial M4, that drive is not the only one listed. I myself have run into this problem in the past, with several drives unexpectedly dying after random power cycles. I never thought to check for a wider issue until now.

20nm NAND flash die

Without vendor information, there is little practical advice to be offered. The best thing a user can do is try to ensure that the power doesn’t unexpectedly cut out, even if that means buying a small uninterruptible power supply (UPS). Even 5-10 minutes of battery runtime is enough to give the user time to shut down. Laptop users obviously have less to fear on this front, as their systems have batteries built in. Manufacturers are unlikely to start talking about these issues honestly, since no one wants to admit that previous products have been anything less than ironclad.

It’s also not clear whether the problems can be avoided entirely without battery backups or additional power-protection circuitry. SSDs emphasize high performance and often buffer data in volatile RAM, which inevitably makes such protections more difficult to design. For now, we recommend SSD users be particularly careful not to risk unexpected power failures. There appears to be no way to fix the problem on the drive itself, no specific drive has been put forth as ironclad, and even SLC NAND and supercapacitors do not prevent corruption.
