Solid-state revolution: in-depth on how SSDs really work

SSDs use a huge grab bag of techniques to make a computer feel "snappy."

The inevitability of entropy

For all its speed and awesomeness, flash has one big, hairy problem: it can endure only a finite number of writes. This limitation has led SSD controller manufacturers to implement an amazing array of workarounds, all geared toward prolonging the life of the flash cells. However, many of the tricks SSD controllers employ to work well as fast random-access storage are directly at odds with prolonging cell life, putting the twin goals of quick storage and long life into an uneasy compromise.

We noted earlier that bits in a flash cell are read by varying the voltages on rows and columns of cells and then measuring the results, but we didn't get into how data is programmed into a NAND flash cell to begin with. Data can only be written to one page—one row of cells—at a time in NAND flash. Briefly, the SSD controller locates an empty page that is ready to be programmed, raises the voltage on that page's word line to a high level, and grounds the bit lines for each of the columns that need to change from a 1 (the default state, containing little or no charge) to a 0 (charged). This triggers a quantum tunneling effect wherein electrons migrate through the oxide layer into the floating gate, altering the cell's charge. Kerzap!
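The asymmetry described above—erased cells read as 1, and programming can only pull bits down to 0, a whole page at a time—can be sketched in a few lines. This is a toy model, not real firmware; the tiny page size is invented for illustration.

```python
# Toy model of NAND page programming. Erased cells hold a 1 (little or
# no charge); programming can only flip bits from 1 to 0, and a page
# must be written in one operation. Getting a 0 back to 1 requires
# erasing the whole block, which this sketch deliberately forbids.

PAGE_BITS = 8  # absurdly small page for illustration; real pages are 4-16 KB

def erased_page():
    """An erased page: every cell reads as 1."""
    return [1] * PAGE_BITS

def program_page(page, data):
    """Program a page: ground the bit lines of cells that must become 0.
    Raises if the new data would require flipping a 0 back to a 1."""
    if len(data) != len(page):
        raise ValueError("must program a whole page at once")
    for old, new in zip(page, data):
        if old == 0 and new == 1:
            raise ValueError("cannot flip 0 -> 1 without a block erase")
    return list(data)

page = erased_page()
page = program_page(page, [1, 0, 1, 1, 0, 0, 1, 1])  # legal: only 1 -> 0
```

Trying to program the same page back to all 1s raises an error, which is exactly why SSDs never modify a page in place—they write the new version to a fresh page instead.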

At some point, charged cells need to be erased and returned to their default state so they can be reused to hold other data. So, when it's necessary or when the SSD has free time, blocks of cells undergo an erasure cycle that removes the charge from the cells. However, each time the cells go through a program/erase cycle, some charge becomes trapped in the dielectric layer of material that makes up the floating gates; this trapped charge changes the resistance of the gates. As the resistance changes, the amount of current required to change a gate's state increases and the gate takes longer to flip. Over time, this change in resistance becomes significant enough that the voltage required to write a 0 into the cell grows too high, and the write takes too long, for the cell to remain useful as a component of a fast data storage device—SSDs stripe data across multiple blocks, so the cells all need to have roughly identical write characteristics.
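This is why controllers track wear per block and retire blocks that have been cycled too many times. A minimal sketch of that bookkeeping, assuming an illustrative 3,000-cycle endurance rating (in the right neighborhood for 25 nm MLC, but not a figure from any specific drive):

```python
# Hedged sketch of per-block wear tracking: every program/erase cycle
# traps a little more charge in the floating gates, so the controller
# counts cycles and retires a block once it hits its rated endurance.

RATED_PE_CYCLES = 3000  # illustrative endurance rating, not a real spec

class Block:
    def __init__(self):
        self.erase_count = 0
        self.retired = False

    def erase(self):
        if self.retired:
            raise RuntimeError("block already retired")
        self.erase_count += 1
        if self.erase_count >= RATED_PE_CYCLES:
            # writes to this block are now too slow/unreliable to keep using
            self.retired = True
```

Real controllers combine this counter with wear leveling, steering writes toward the least-worn blocks so the whole drive ages evenly.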

This degradation doesn't affect the ability to read data from an SSD, since reading is a mostly passive operation that involves checking the existing voltage in targeted cells. This behavior has long been observed on older and well-used USB thumb drives, which slowly "rot" into a read-only state. SSDs undergo the same type of degradation, and it's the job of the SSD controller to keep tabs on the write count of every cell and mark cells as unusable when they begin to degrade past the point of usability.

It turns out that even reading slowly degrades NAND flash—a phenomenon known as read disturb—and flash systems can eventually be forced to rewrite data after it has been read too many times. Fortunately, this effect seems less significant in practice than it sounds in theory.

MLC SSDs are much more susceptible to degradation than SLC SSDs, because each cell in an MLC drive has four possible states and stores two bits, and so each is more sensitive to changes in residual charge or difficulties adding new charge. Additionally, flash cells are continually decreasing in size as we push further and further down the semiconductor process size roadmap. The latest drop from a 25-nanometer to a 20-nanometer process—with the number referring to half of the typical distance between the individual cells—has also decreased the number of program/erase cycles the cells can endure, because they are physically smaller and can absorb less residual charge before they become too unresponsive to be useful. Progress gives—and progress takes away.
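The MLC sensitivity problem comes from having to distinguish four charge levels instead of two. A sketch of the decode step, with made-up voltage thresholds (real drives Gray-code the levels so a one-level sensing error flips only a single bit):

```python
# Illustrative mapping from an MLC cell's four charge states to two
# bits. The Gray coding is real practice; the threshold voltages here
# are invented for the example.

# charge level (0 = erased .. 3 = most charged) -> two bits, Gray-coded
MLC_GRAY = {0: (1, 1), 1: (1, 0), 2: (0, 0), 3: (0, 1)}

def read_cell(voltage, thresholds=(1.0, 2.0, 3.0)):
    """Compare the sensed voltage against three thresholds to pick one
    of four states, then decode the state into two bits."""
    level = sum(voltage >= t for t in thresholds)
    return MLC_GRAY[level]
```

With four states packed into the same voltage range that SLC splits in two, each threshold window is narrower—so the same amount of trapped residual charge is far more likely to push a read into the wrong window.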

The specter of limited cell life looms over all NAND-based solid-state storage, but it's not necessarily a specter most folks need to worry about. The good news is that even the reduced number of program/erase cycles a current-generation MLC SSD can bear is more than enough for most consumers. A solid-state drive purchased today should yield at least as much life as a spinning disk; because of the inevitable march of disk capacity, the majority of hard disk drives in consumer computers don't see much of a useful service life past five years, and numerous synthetic benchmarks give current-generation consumer SSDs at least five years of flawless service before any type of degradation sets in. Enterprise solid-state drives are a different matter, though—disks of any type in an enterprise setting are subjected to much tougher workloads than consumer disks, which is one major reason not to use consumer SSDs in the data center. Enterprise-grade SSDs, even ones based on MLC, are built to yield a much longer service life under much more stressful conditions than consumer drives.

More than a thumb drive to me

To make an SSD, you can't just plug a bare flash chip into your PC; the chip needs a controller of some kind. All flash controllers have to handle some of the same management of pages and blocks and the complexities of writing, but the controllers used in SSDs go much further than that.

On one hand, an SSD does look sort of like a big, fat thumb drive in that the flash memory is of the same type you'd find in a typical USB memory stick. However, there are obvious differences in speed—even a smoking-fast USB memory stick isn't particularly quick compared to a SATA 3 solid-state drive, which can read and write data at half a gigabyte per second. That kind of speed is accomplished by writing to more than one flash chip at a time.

The SSD's controller—a processor that provides the interface between the SSD and the computer and that handles all of the decisions about what gets written to which NAND chips and how—has multiple channels it can use to address its attached NAND chips. In a method similar to traditional multi-hard disk RAID, the SSD controller writes and reads data in stripes across the different NAND chips in the drive. In effect, the single solid-state drive is treated like a RAID array of NAND.
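The round-robin striping across channels can be sketched in a few lines. The channel count and page size below are invented for illustration; real controllers have anywhere from a handful to a dozen or more channels.

```python
# Minimal sketch of striping a write across NAND channels, RAID-0
# style: consecutive pages are dealt to consecutive channels so they
# can be programmed in parallel. Sizes are toy values.

NUM_CHANNELS = 4
PAGE_SIZE = 4  # bytes; absurdly small, purely for illustration

def stripe(data):
    """Split data into pages and deal them round-robin to channels."""
    channels = [[] for _ in range(NUM_CHANNELS)]
    pages = [data[i:i + PAGE_SIZE] for i in range(0, len(data), PAGE_SIZE)]
    for n, page in enumerate(pages):
        channels[n % NUM_CHANNELS].append(page)
    return channels

chans = stripe(b"0123456789abcdef")
# each of the four channels receives one 4-byte page; in a real drive
# all four pages would be programmed concurrently
```

Because each channel's NAND can be busy programming at the same time, a 16-byte write here finishes in roughly the time of a single page program rather than four.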

Briefly, RAID—which stands for Redundant Array of Inexpensive Disks (anyone who says the "I" stands for "Independent" needs to learn about RAID 0 data recovery)—is a method long employed with hard disks to increase the availability of data (by putting data blocks on more than one disk) and its speed (by reading and writing from and to multiple disks at the same time). The most common form of RAID seen today in large storage scenarios is RAID 5, which combines striping—drawing a "stripe" of data across multiple disks—with parity calculations. If a single disk in a RAID 5 array dies, everything on it can be recovered from the remaining disks in the array. For a more in-depth look at how RAID works and the different types of RAID, check out the classic Ars feature The Skinny on RAID.
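The parity math behind that recovery is just XOR: the parity element is the XOR of the data elements, so any single missing element can be rebuilt by XOR-ing everything that survives. A sketch with a 3+1 stripe:

```python
# The XOR parity idea behind RAID 5 (and controller schemes like RAIN
# and RAISE): parity = d0 ^ d1 ^ d2, so any one lost element can be
# reconstructed from the other three.

def xor_bytes(chunks):
    """XOR a list of equal-length byte strings together."""
    out = bytes(len(chunks[0]))  # all-zero accumulator
    for chunk in chunks:
        out = bytes(a ^ b for a, b in zip(out, chunk))
    return out

# a 3+1 stripe: three data elements plus one parity element
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_bytes(data)

# "lose" the second element, then rebuild it from the survivors + parity
rebuilt = xor_bytes([data[0], data[2], parity])
assert rebuilt == b"BBBB"
```

The same property is why a drive using a RAISE-style scheme can keep running after losing a whole NAND die: every page on the dead die is recomputable from its stripe-mates.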

Every SSD controller in every drive on the market today provides at least basic data striping with basic error correction, using the extra spare space in each page. Most controller manufacturers augment that with their own proprietary striping schemes, which typically include some level of parity-based data protection. Micron calls its scheme RAIN, for Redundant Array of Independent NAND, which offers several different levels of striping and parity protection; LSI/SandForce calls its method RAISE, for "Redundant Array of Independent Silicon Elements," and provides enough data protection that the drive can continue operating even if an entire NAND chip goes bad.

A 3+1 RAID 5 array of disks on top, with the orange and blue files distributed in three data elements and one parity element. On bottom, the same arrangement of data in two rows of NAND in an SSD.

Still, sometimes even striping across multiple I/O channels to multiple NAND chips isn't enough to keep up with the data coming in across the bus that the computer expects the SSD to accept, so quite a few consumer SSDs contain some amount of DDR2 or DDR3 SDRAM, usually between 128MB and 512MB. Having a chunk of cache sitting there lets the SSD quickly receive data it needs to write, even if it's too busy to actually write it at the moment; the data sits in the SDRAM cache until the controller finds time to send it down and actually commit it to NAND. All this happens transparently to the computer and to you, the end user—regardless of whether the data has actually been written, the SSD controller reports back to the operating system that the write completed successfully.

An SSD with a large DRAM cache gets sent a big bunch of data blocks to write. The controller temporarily holds the write in cache while the NAND flash is busy writing other things, and then when the NAND is ready, the whole bunch of blocks are written.
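The write-back behavior described above can be sketched as a simple queue. This is a hypothetical model, not any controller's actual firmware; the names are invented.

```python
# Hedged sketch of an SSD's write-back DRAM cache: writes are
# acknowledged as soon as they land in (volatile) DRAM and are
# committed to NAND later, when the flash has a free moment.

from collections import deque

class WriteCache:
    def __init__(self):
        self.pending = deque()  # blocks buffered in volatile DRAM
        self.nand = []          # blocks actually committed to flash

    def write(self, block):
        self.pending.append(block)
        return "OK"  # reported to the OS before the data reaches NAND

    def flush(self):
        """Runs when the controller finds idle time—or, on power loss,
        for as long as the cache-backup capacitors can keep the drive
        alive. Anything still in `pending` at lights-out is gone."""
        while self.pending:
            self.nand.append(self.pending.popleft())
```

The gap between `write()` returning "OK" and `flush()` actually running is precisely the window in which a power failure loses data—which is what the capacitors discussed below exist to close.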

This greatly decreases the effective latency and increases the throughput of the SSD, but there's an obvious problem. The SDRAM in an SSD's cache is the same kind of SDRAM used for main memory—the kind that erases itself if it loses power. If the computer were to suffer a power loss while the SSD has data in cache that hasn't yet been committed to NAND, then that data would be completely gone, with consequences that could range from annoying to catastrophic. The most common consequence of a loss of uncommitted write cache would be file system corruption, which might or might not be repairable with a chkdsk or fsck; depending on what was being held in cache, the entire file system on the SSD could be unrecoverable.

This is obviously a bad thing, and most SSD manufacturers who bolster their drives' performance with a large SDRAM cache also include some mechanism to supply power to the SSD long enough to dump its cache contents out to NAND, usually in the form of a large capacitor. Whatever the mechanism, it only needs to provide the drive with power for a short amount of time, since it doesn't take terribly long to write out even a full 512MB of data to an SSD. However, not every SDRAM-cached SSD has a set of cache-powering capacitors—some have nothing at all, so be aware of the specs when picking one out. Most have something, though, and the speed benefits of stuffing some RAM into an SSD more than balance out the risks, which are quite manageable.

As an aside, it's difficult to find an SSD in the enterprise space that doesn't use SDRAM caching, but enterprise SSD manufacturers can count on the drives being used in, well, an enterprise—which usually means that they're in a server or storage array in a data center with redundant UPS-backed power. On top of that, most enterprise disk arrays like those from companies like EMC, NetApp, IBM, and Hitachi have built-in independent battery units designed to keep their disks spinning long enough to perform a complete array-wide de-stage of cache, which can take a while because those kinds of systems can have a terabyte or more of SDRAM set aside just for cache.

Tricks of the trade

In addition to handling basic page management, striping, error correction, and caching, SSD controllers also have to stay on top of flash's twin problems: block-level erasure and finite lifetimes. There are several things SSD manufacturers do these days to ensure that their drives remain quick throughout their entire life.

The first is over-provisioning, which simply means stuffing more NAND into a drive than it says on the box. Over-provisioning is done on most consumer SSDs today and on all enterprise SSDs, with the amount varying by the type and model of drive. Consumer SSDs that list "120GB" of capacity generally have 128GB of actual NAND inside, for example; over-provisioning in the enterprise space is generally much larger, with some drives having as much as 100% additional capacity inside. Over-provisioning provides breathing room so that free pages can be available to receive writes even when the drive is nearing capacity; additionally, if some cells wear out prematurely or go bad, over-provisioning means those bad cells can be permanently marked as unusable without a visible decrease in the drive's capacity. It's also possible to emulate the beneficial effects of over-provisioning by simply using less than the stated capacity of an SSD—for example, by purchasing a 90GB SSD, creating a 30GB partition, and leaving the rest unallocated. The controller itself doesn't care about the logical constructs built by the operating system—it will happily continue to write to fresh pages as long as they're available.
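The amount of over-provisioning is conventionally quoted as spare capacity divided by user-visible capacity. Using the figures from the text:

```python
# The usual over-provisioning arithmetic:
#   OP% = (physical capacity - user-visible capacity) / user-visible capacity

def overprovisioning_pct(physical_gb, visible_gb):
    return 100 * (physical_gb - visible_gb) / visible_gb

# the consumer example above: 128GB of NAND sold as "120GB"
consumer = overprovisioning_pct(128, 120)     # about 6.7%
# an enterprise drive carrying 100% extra capacity
enterprise = overprovisioning_pct(200, 100)   # 100%
```

By the same formula, the partitioning trick in the text (30GB used out of 90GB) gives the controller a whopping 200% worth of effectively spare pages to play with.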

Garbage collection is another technique—or collection of techniques—to keep SSDs fast and fresh. As the contents of pages are modified—or more correctly as those pages are rewritten as new pages, since you can't overwrite pages—the SSD keeps track of which pages contain good data and which contain stale data. Those stale pages aren't doing anyone any good just sitting there, but they also can't be individually erased, so when the SSD controller has an opportune moment, it will garbage-collect those pages—that is, it will take an entire block that contains both good and stale pages, copy the good pages to a different block, and then erase the entire first block.
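The copy-then-erase move at the heart of garbage collection can be sketched directly. This is a deliberately simplified model—real controllers also pick *which* block to collect, usually the one with the most stale pages.

```python
# Sketch of one garbage-collection pass: take a block holding a mix of
# good and stale pages, copy the good pages out to a fresh block, then
# erase the entire original block so all of its pages become reusable.

def garbage_collect(block):
    """`block` is a list of (data, is_stale) pages. Returns the good
    pages (relocated elsewhere) and the freshly erased block."""
    survivors = [data for data, stale in block if not stale]
    erased = [(None, False)] * len(block)  # block erased as a unit
    return survivors, erased

block = [("a", False), ("b", True), ("c", False), ("d", True)]
good, freed = garbage_collect(block)
# good == ["a", "c"]; all four pages of the original block are now free
```

Note the cost: reclaiming two stale pages here required physically rewriting two good ones—that extra copying is the seed of the write amplification discussed below.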

Garbage collection can be augmented with TRIM, a specific command the operating system can send to an SSD to indicate that pages no longer contain valid data. SSDs know when pages need to be modified, but they have no understanding of when pages are deleted, because no modern operating system really deletes files. When Windows or OS X or your Linux distro of choice "deletes" a file, it simply adds a note in the file system saying that the clusters making up that file are free for use again. Unfortunately for an SSD, none of that information is visible down at the level the SSD operates—the drive doesn't know what operating system or file system is in use. All it understands are pages and blocks, not clusters and files. Because deleting a file doesn't mark its pages as stale, the drive can end up in the unfortunate circumstance where its garbage collection routines dutifully gather and relocate pages full of "data" that the file system has already deleted and can no longer access anyway. TRIM lets the operating system pass a note down to the SSD saying that the set of pages taken up by a deleted file can be considered stale and don't have to be relocated along with good data. This can greatly increase the amount of free working space on your SSD, especially if you regularly delete large numbers of files.
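What TRIM changes can be shown with a small sketch of the controller's page map. The class and method names here are invented for illustration; they don't correspond to any real firmware or to the ATA command's actual interface.

```python
# Hypothetical sketch of TRIM's effect on a controller's bookkeeping:
# without TRIM, every written page stays "good" until overwritten;
# with TRIM, the OS can mark a deleted file's pages stale so garbage
# collection no longer bothers relocating them.

class FlashMap:
    def __init__(self):
        self.mapping = {}   # logical page number -> data
        self.stale = set()  # pages the OS has told us hold dead data

    def write(self, page, data):
        self.mapping[page] = data
        self.stale.discard(page)  # fresh data is good again

    def trim(self, pages):
        """The OS's note: these pages belonged to a deleted file."""
        self.stale.update(pages)

    def pages_to_relocate(self):
        """What garbage collection must copy before erasing a block."""
        return [p for p in self.mapping if p not in self.stale]

fm = FlashMap()
fm.write(1, "file-A")
fm.write(2, "file-B")
fm.trim([2])  # the file system deleted file-B
# garbage collection now only needs to relocate page 1
```

Without the `trim()` call, page 2 would be copied forward forever—wasted writes spent preserving data no one can read.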

Exactly when and how often an SSD performs garbage collection varies according to the controller and its firmware. At first glance, it would seem desirable to have a drive constantly performing garbage collection at a lower priority than user-driven I/O; indeed, some SSDs do this, ensuring that there's always a big pool of free pages. However, things are never that simple in SSD land. Constant garbage collection can have significant implications for the life of the SSD when compared against less-aggressive garbage collection that strives only to maintain a minimum number of free pages. This is where write amplification comes into play.

Lee Hutchinson / Lee is the Senior Reviews Editor at Ars and is responsible for the product news and reviews section. He also knows stuff about enterprise storage, security, and manned space flight. Lee is based in Houston, TX.