
Anyone who has investigated SSDs for server applications is aware that flash-based SSDs have a limited life span. NAND flash cells have a finite program/erase cycle limit, and though manufacturers use ever more sophisticated wear-leveling algorithms, the fact remains that the heavier your write activity to a device, the faster you will approach this limit. It would therefore be a good idea to monitor for wear indicators in much the same way that we monitor spinning disks for signs of impending failure (reallocated sectors, etc.).

There is currently no industry standard for exposing the remaining life of an SSD. Intel provides a Media Wear-out Indicator in the form of a SMART attribute (0xE9) that starts at 100 and counts down to 1 as the P/E cycles progress through the media. It’s a rough estimate of remaining life, but it’s better than nothing. Other vendors may do similar things, but a cursory Google search didn’t turn up anything.

Since we use Intel SSDs at $DAYJOB, I thought I’d put together a monitor of this attribute so we can track the endurance of our deployed drives. We monitor various local metrics on our systems with Resmon, which has a simple plugin architecture. After about a half-day’s work (most of which was testing on different systems), I had a Resmon module that produces output like:

The above is from an Intel 320 drive. The module simply runs the smartctl utility from the smartmontools project and extracts the desired information from the output.

As mentioned above, the attribute I care about is Media_Wearout_Indicator, and it’s the normalized value that counts down from 100. Now that this value is exposed via Resmon, I can monitor it in Circonus. Not only will I be able to watch the MWI value, I can also track the serial number and firmware strings for a given drive as text metrics for correlation with other trends.
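To make the text-metric idea concrete, here is a minimal Python sketch of pulling the serial number and firmware strings out of `smartctl -i` style output. It is not the Resmon module itself (that runs under Perl); the function name `parse_smartctl_info` and the sample text are made up for illustration, and the field values are placeholders rather than output from a real drive.

```python
# Hypothetical sample of `smartctl -i` style output; the values are placeholders.
SAMPLE_INFO = """\
=== START OF INFORMATION SECTION ===
Device Model:     EXAMPLE-MODEL
Serial Number:    EXAMPLE-SERIAL
Firmware Version: EXAMPLE-FW
"""


def parse_smartctl_info(text):
    """Collect 'Key: value' lines into a dict, skipping lines without a value."""
    info = {}
    for line in text.splitlines():
        # Split at the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


info = parse_smartctl_info(SAMPLE_INFO)
print(info["Serial Number"], info["Firmware Version"])
```

Keeping these strings next to the wear value should make it easier to tell, for example, whether a change in wear rate lines up with a firmware update or a drive swap.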

I made this a generic SMART module, so even if you don’t have any SSDs, you can monitor other important attributes such as reallocated sectors or temperature. The module pulls both the normalized and the raw values. SMART attributes are vendor-specific, so check your drives’ specs to see which value is most meaningful for a given attribute. Note that in its current form, the SMART module only knows how to interpret typical output for ATA devices, which is presented in a tabular format that is relatively easy to parse. Output for SCSI/SAS devices is much more free-form and will not be picked up by this module.
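For a sense of why the ATA table is easy to parse, here is a rough Python sketch that extracts the normalized and raw values from `smartctl -A` style output. This is an illustration under my own assumptions, not the module's actual code: `parse_smart_attributes` is a hypothetical name, and the sample rows contain invented numbers laid out in the usual ten-column format.

```python
# Hypothetical sample in the usual `smartctl -A` ATA layout; numbers are invented.
SAMPLE_ATTRS = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age   Always       -       0
"""


def parse_smart_attributes(text):
    """Map attribute name -> id, normalized value, and raw value string."""
    attrs = {}
    for line in text.splitlines():
        # Ten whitespace-separated columns; the raw value may contain spaces,
        # so limit the split and keep the remainder as the last field.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header line, blank line, or free-form output
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),   # normalized value (e.g. 100 counting down)
            "raw": fields[9].strip(),  # vendor-specific, kept as a string
        }
    return attrs


attrs = parse_smart_attributes(SAMPLE_ATTRS)
print(attrs["Media_Wearout_Indicator"]["value"])  # prints 99
```

Anything that does not start with a numeric attribute ID is skipped, which is also why free-form SCSI/SAS output would yield nothing with this approach.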

So far the module has been tested on OmniOS and Linux, but it should work on *BSD and other platforms supported by smartmontools. All you need is Resmon and a Perl interpreter!