Accelerated Testing

Controller:

With all of the development and design work out of the way, let's get into the good stuff. The first point to bring up is that feedback is inherent in this process. The early validation steps may well need to 'reach back' into the design process in order to find solutions. Here's a good example:

Above is a bit error caused by a 'cosmic ray'. No, I'm not kidding. The earth is constantly bombarded by cosmic rays, and the showers of particles they create in the atmosphere include plenty of neutrons. Most of those neutrons are filtered out by the atmosphere, but we still see some of them down here on earth. We also make plenty of radiation right here on earth, and not just in nuclear reactors: the potassium in a banana is slightly radioactive, although it emits beta particles and gamma rays rather than neutrons. While neutrons carry no electrical charge themselves, they 'excite' atoms they happen to run into along their path. Excited atoms don't like staying that way for long, so they quickly decay back down to a stable state. Part of that decay is in the form of an electron, which affects the charge of the surrounding atoms. If this event happens in a flash cell, the ECC mechanisms correct the error. The same ECC correction happens if a neutron happens to cause a bit to flip in the SRAM. Despite this, there are still places within an SSD where flipping a random bit of data can cause issues. Most of these are within the controller itself, where a flipped bit can cause data to be misrouted, or not routed at all, while the controller still reports to the host that the data has been written (i.e. the data is lost). In the worst cases, the controller might not be able to continue executing its firmware, resulting in a soft reboot or even a bricked device.

These cosmic ray events don't happen very often (we're talking billionths of a percent chance spread across thousands of devices), but they remain a possibility and do play into the design of the controller as a whole. Controllers tend to stick with the 'larger' lithography process nodes, so that the charge from a cosmic ray event has less of an effect on the overall voltage present at a given location. Extra checks are also added to the firmware as a means of catching incorrect operations caused by flipped bits.
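To make the idea of an 'extra check' concrete, here is a minimal sketch of one common way to catch a flipped bit: store a checksum next to a piece of internal bookkeeping and re-verify it before the value is trusted. This is not Intel's firmware, and the article does not say which checks Intel uses. The mapping-entry layout and the names `protect`, `verify`, and `flip_bit` are made up for illustration, and CRC-32 is simply a convenient stand-in.

```python
import zlib


def protect(entry: bytes) -> bytes:
    """Append a CRC-32 so later corruption of the entry can be detected."""
    return entry + zlib.crc32(entry).to_bytes(4, "little")


def verify(protected: bytes) -> bool:
    """Return True if the stored CRC-32 still matches the payload."""
    payload, stored = protected[:-4], protected[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == stored


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-event upset by inverting exactly one bit."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Hypothetical mapping entry: logical block 123456 lives in physical block 42.
entry = protect((123456).to_bytes(4, "little") + (42).to_bytes(4, "little"))
corrupted = flip_bit(entry, 5)

print(verify(entry))      # True
print(verify(corrupted))  # False
```

A check like this only detects the upset; what the firmware does next (retry, rebuild the entry from a redundant copy, or halt safely) is a separate design decision.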

A failed SSD being analyzed on a test bench at Intel.

Now with all of these corrections in place, and with the chances of a neutron flipping a bit so low, we can't exactly put hundreds of thousands of unreleased SSDs out in an open field in the hopes of seeing failures happen. The process needs to be accelerated. "How?" you might ask.

Just use an accelerator! Yes, Intel actually sends its prototype SSD controllers (among other things) out to Los Alamos to be bombarded with a neutron beam many orders of magnitude more intense than anything they will see in normal use. These are literally tests to the point of failure. Intel then goes back in to see what failed, how, and why. The results are fed back into the design loop, and the process is repeated if necessary after firmware (or even hardware) corrections have been made.
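The arithmetic behind beam testing is simple enough to sketch: the ratio of beam flux to natural flux is the acceleration factor, and each hour in the beam stands in for that many hours in the field. Both flux values below are assumptions for illustration only. The sea-level figure of roughly 13 neutrons per square centimeter per hour is a commonly quoted reference value, and the beam flux is a made-up number, not a Los Alamos or Intel specification.

```python
HOURS_PER_YEAR = 8766  # average year, including leap years


def acceleration_factor(beam_flux: float, field_flux: float) -> float:
    """How many times more intense the beam is than the natural background."""
    return beam_flux / field_flux


def equivalent_field_years(beam_hours: float, beam_flux: float, field_flux: float) -> float:
    """Years of natural exposure represented by a given time in the beam."""
    return beam_hours * acceleration_factor(beam_flux, field_flux) / HOURS_PER_YEAR


FIELD_FLUX = 13.0   # neutrons / cm^2 / hour, assumed sea-level reference value
BEAM_FLUX = 1.3e9   # neutrons / cm^2 / hour, hypothetical beam setting

factor = acceleration_factor(BEAM_FLUX, FIELD_FLUX)
years = equivalent_field_years(1.0, BEAM_FLUX, FIELD_FLUX)

print(f"acceleration factor: {factor:.0e}")
print(f"one beam hour is roughly {years:,.0f} device-years in the field")
```

With numbers in this range, a single hour in the beam covers more exposure than a large fleet of drives would collect over its whole service life, which is what makes such rare events testable at all.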

Flash:

Accelerating the testing of flash memory in a modern SSD is a tricky proposition. Thanks to advanced wear leveling techniques, writing via the normal method, at full speed, can take months or even years before flash blocks begin to wear to the point of noticeable failure. Tricks implemented from the outside really don't work. 'Short stroking' the SSD by writing to a smaller range of (external) LBA sectors does nothing, as the wear leveling algorithm will still spread those writes across the entire flash area (this is also why an SSD's random write performance improves with greater over-provisioning - there is more 'empty' flash to work with). Given the above, accelerating the wearout testing of flash requires a bit of a firmware tweak:

Now remember, we're trying to test the entire production unit here for any possible failures, not just flash failures. To accomplish this, Intel makes the smallest possible modification to the firmware, instructing it to address only a portion of each flash die within the SSD. All data channels are still used and all flash dies are still accessed, but the addressable area of each die is reduced to a fraction of its full surface. The diagram above depicts using the area at the edges of the dies, because this is where failures are more likely to occur (due to handling and packaging). This effectively gives the SSD a much smaller capacity, which means that writing at the same speed translates to increased wear on those focused areas. This is the same 'short stroking' mentioned above, but since it occurs at the die level, wear leveling is restricted to the same smaller area, and those smaller sections of flash can then be tested to failure within a reasonable amount of time (6 weeks in the example above).
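A rough back-of-the-envelope model shows why shrinking the addressable area helps so much. If wear leveling spreads writes evenly, the time to wear out scales directly with the amount of flash the firmware is allowed to use. Every number below (capacity, rated program/erase cycles, write speed, and the 5% fraction) is a hypothetical placeholder rather than a figure from Intel, and the function name `days_to_wear_out` is made up for this sketch.

```python
def days_to_wear_out(capacity_gb: float, pe_cycles: int, write_mb_per_s: float,
                     addressable_fraction: float = 1.0,
                     write_amplification: float = 1.0) -> float:
    """Estimate days of continuous writing until the addressable flash
    reaches its rated program/erase cycle count (assumes ideal wear leveling)."""
    # Host data that can be written before every addressable block hits pe_cycles.
    host_writes_gb = (capacity_gb * addressable_fraction * pe_cycles
                      / write_amplification)
    seconds = host_writes_gb * 1000 / write_mb_per_s  # 1 GB = 1000 MB
    return seconds / 86_400  # seconds per day


# Hypothetical drive: 800 GB, 10,000 rated cycles, 100 MB/s sustained writes.
full = days_to_wear_out(800, 10_000, 100)
restricted = days_to_wear_out(800, 10_000, 100, addressable_fraction=0.05)

print(f"full drive:     {full:.0f} days")
print(f"5% of each die: {restricted:.0f} days")
```

With these made-up inputs, the full drive takes about 926 days (roughly two and a half years) to wear out, while restricting the firmware to 5% of each die brings that down to about 46 days. The write speed and every channel and die stay in play; only the area being worn changes.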
