Introduction to SMART

Did you know your drive was SMART? Actually: Self-Monitoring, Analysis, and Reporting Technology. It can be used to gather information about your hard drives and offers some additional information about the status of your storage devices. It can also be used with other tools to help predict drive failure.

SMART (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring system for storage devices, usually hard drives, that provides both information about the status of a drive and the ability to run self-tests. Storage administrators can use it to check on the status of their storage devices and to force self-tests that determine the state of a device. While some people advocate using the data to predict drive failure, giving you time to get data off the drive before it fails, other references say that SMART data may not be the best predictor of failure.

SMART and SMART – What is SMART?

Since you are reading this article it is likely that you understand that data is important. This also means that you have backups of your most important data and you do daily backups – right? The reason we have backups is that hard drives fail for various, and sometimes mysterious, reasons. From an administrative point of view it would be nice to be able to correlate drive failures with certain drive characteristics or with load, or even to track a batch of drives. Perhaps even better, it would be nice to be able to predict whether a drive is failing or when a drive might fail. Then you can make sure you have all of the data backed up, remove the failing drive, and replace it with a new one. To do any or all of this we need data about the storage devices.

IBM was the first company to add some monitoring and information capability to their drives in 1992. Other vendors followed suit and then Compaq led an effort to standardize on an approach to monitoring drive health and reporting it. This drive for standardization led to S.M.A.R.T. (this is the correct abbreviation rather than SMART but it’s not nearly as easy to type). Over time SMART capability has been added to many hard drives including PATA, SATA, the many varieties of SCSI, and SAS. The standard was based on the approach that the drives would measure the appropriate health parameters and the results would then be available for the OS or other monitoring tools. But, each drive vendor was free to decide which parameters were to be monitored and what their thresholds would be (a threshold is the point at which the drive has “failed”).

For a drive to be considered “SMART” all it needs is the ability to signal between the internal drive sensors and the host computer. There is nothing in the standard about what sensors are in the drive or how this data is exposed to the user. But at the lowest level SMART provides a simple binary bit of information – the drive is OK or the drive has failed. This bit of information is called the SMART status. Many times the “drive has failed” status doesn’t indicate that the drive has actually failed but that the drive may not meet its specifications.

But it is fairly safe to assume that all modern drives have, in addition to the SMART status, SMART attributes. These attributes are completely up to the drive manufacturers and consequently are not standard. So each type of drive has to be scanned for its SMART attributes and their possible values. In addition to SMART attributes, the drives may also support some self-tests, with the results stored in the self-test log. These logs may be scanned or read to track the state of the drive. Moreover, you can also tell the drives to run self-tests.

The difficulty in reading the SMART attributes is that each attribute has a threshold value beyond which the drive will not pass under ordinary conditions (sometimes lower values are better and sometimes larger values are better). But these threshold values are only known to the manufacturer and may not be published. In addition, each attribute returns a raw value, whose measurement is up to the drive manufacturer, and a normalized value that ranges from 1 to 253. What counts as a “normal” attribute value is completely up to the manufacturer as well. So you can see that it’s not always easy to get SMART attributes from various drives, nor is it easy to interpret the values.
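To make the relationship between the raw value, the normalized value, and the threshold a little more concrete, here is a minimal Python sketch. It is only an illustration: the class, the field names, and the sample numbers are all hypothetical, and it assumes the common convention that the normalized value drops toward the threshold as the drive degrades, which may not hold for every vendor or attribute.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmartAttribute:
    """One SMART attribute as a drive might report it (illustrative fields)."""
    name: str
    normalized: int  # vendor-scaled value, 1 to 253
    threshold: int   # vendor-chosen limit for the normalized value
    raw: int         # vendor-specific raw measurement

    def has_failed(self) -> bool:
        # Assumed convention: the normalized value falls toward the
        # threshold as the drive degrades, so at-or-below means "failed".
        return self.normalized <= self.threshold


def smart_status(attributes: List[SmartAttribute]) -> str:
    """Collapse the attributes into the binary SMART status."""
    return "FAILED" if any(a.has_failed() for a in attributes) else "OK"


# Hypothetical sample values, not taken from any real drive.
sample = [
    SmartAttribute("Reallocated_Sector_Ct", normalized=100, threshold=36, raw=0),
    SmartAttribute("Temperature_Celsius", normalized=64, threshold=0, raw=36),
]
print(smart_status(sample))
```

Notice that the raw values play no part in the pass/fail decision here. That is exactly why they are hard to interpret without knowing how a particular vendor encodes them.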

For most of the drives sold in the last few years there are a number of SMART attributes. The article about SMART on Wikipedia has a pretty good list of common attributes and their meaning. You will notice in the list that some attributes are better when the value is larger and some are better when the value is smaller.

Using many of the SMART attributes one would think that you could predict failure. For example, if the drive was running too hot, then it might be more susceptible to failure. Or if bad sectors were developing quickly one might think the drive was also failing. Perhaps you can use the attributes with some general models of drive failure to predict when drives might fail and then work to minimize the damage.

However, using SMART to predict drive failures has been a difficult proposition. Google reported a study that examined over 100,000 drives of various types for correlations between failure and SMART values. The disks were a combination of consumer-grade drives (SATA and PATA) with speeds from 5,400 rpm to 7,200 rpm and capacities ranging from 80GB to 400GB. The population included several drive manufacturers, with at least nine different models in total. The data in the study was collected over a nine-month window.

In the study they monitored the SMART attributes of the population of drives and also which drives failed. Google chose the word “fail” to mean that the drive is not suitable for use in production even if the drive tests good (sometimes a drive would test good but immediately fail in production). From their study the authors concluded:

Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

However, despite the overall message that they had difficulty developing correlations, they did find some interesting trends.

Google agrees with the common view that failure rates are known to be highly correlated with drive models, manufacturers, and age. However, when they normalized the SMART data by the drive model, none of the conclusions changed.

There was quite a bit of discussion about the correlation between SMART attributes and failure rates. One of the best summaries in the paper is, “Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives. …”.

Temperature effects are interesting in that high temperatures start affecting older drives (3-4 years or older). But lower temperatures can also increase the failure rate of drives regardless of age.

A section of the final paragraph of the paper bears repeating here. “We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART.”

The paper tried to combine all of the observed factors that contributed to drive failure, such as errors or temperature, but the authors still missed about 36% of the drive failures.
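The arithmetic behind these two figures is worth spelling out. If a given fraction of failed drives showed no signal at all, then no model built only on those signals can flag more than the remainder. Here is a small Python sketch of that bound; the function name is my own, and the 0.56 and 0.36 inputs are the rounded figures quoted above (the paper says “over 56%”, so the true ceiling is slightly lower).

```python
def max_detectable_fraction(silent_fraction: float) -> float:
    """Upper bound on the share of failures a model can flag when
    `silent_fraction` of failed drives showed none of its signals."""
    if not 0.0 <= silent_fraction <= 1.0:
        raise ValueError("silent_fraction must be between 0 and 1")
    return 1.0 - silent_fraction


# Rounded figures quoted from the Google paper in the text above.
four_signals = max_detectable_fraction(0.56)
all_factors = max_detectable_fraction(0.36)

print(f"four strong SMART signals: at most {four_signals:.0%} of failures")
print(f"all observed factors:      at most {all_factors:.0%} of failures")
```

So the four strong signals can catch at most about 44% of failures, and even adding every other factor the authors looked at only raises that ceiling to about 64%. Keep in mind that this is a ceiling on how many failures can be caught, not a statement about false alarms.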

The paper gives some good insight into the failure rate of a large population of drives. As mentioned previously, there is some correlation between drive failure and scan errors, but that doesn’t account for all failures, a large fraction of which did not show any SMART error signals. It’s also important to note the comment in the paper’s last paragraph that “… SMART models are more useful in predicting trends for large aggregate populations than for individual components. …”. However, this should not deter you from watching the SMART error signals and attributes to track the history of the drives in your systems. Again, there appears to be some correlation between scan errors and drive failure, and this might be useful in your environment.
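As a rough idea of what tracking that history could look like, here is a small Python sketch that scans per-drive samples of a scan error counter and reports when each drive logged its first error. Everything in it is an assumption for illustration: the data layout, the function names, and the sample fleet are hypothetical, and collecting the counts from real drives is a separate problem.

```python
from typing import Dict, List, Optional, Tuple

# Each history is a list of (day, scan_error_count) samples, oldest first.
History = List[Tuple[int, int]]


def first_error_day(history: History) -> Optional[int]:
    """Return the day of the first nonzero count, or None if there is none."""
    for day, count in history:
        if count > 0:
            return day
    return None


def drives_to_watch(fleet: Dict[str, History]) -> Dict[str, int]:
    """Map each drive that has logged an error to the day it first appeared."""
    flagged = {}
    for drive, history in fleet.items():
        day = first_error_day(history)
        if day is not None:
            flagged[drive] = day
    return flagged


# Hypothetical fleet data for illustration only.
fleet = {
    "sda": [(1, 0), (2, 0), (3, 0)],
    "sdb": [(1, 0), (2, 1), (3, 4)],
}
print(drives_to_watch(fleet))
```

A drive that shows up in this list is not certain to fail, and a drive that never shows up can still fail without warning. The list is just a prompt to check your backups for that drive first.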

Before finishing this section I wanted to give out bonus points to anyone who can tell me the origins of the title of this section. Think hard and post to the comments section.

Next Time

SMART has a great deal of potential for helping administrators. While it may not be the best predictor of drive failure, it can give you some indication that drives are having problems, and it can definitely give you the history of the drive. Good administrators can use this information and correlate it with workload history to better map the behavior of the drive.

In the next article we’ll explore smartmontools, which allows you to examine the SMART attributes of your drives and run self-tests.

Comments on "Introduction to SMART"

laray

“Google agrees with the common view that failure rates are known to be highly correlated with drive models, manufacturers, and age. However, when they normalized the SMART data by the drive model, none of the conclusions changed.”

Sooo – now list the manufacturers that had high failures and complete the information delivery for the article. This would be important since the SMART parameters themselves have been determined (by Google) to be useless!

Google never mentions the drive manufacturers. Their article also points out that in general failure rates follow drive batches. But tracking a batch and testing can be very difficult (if not impossible).

I also wouldn’t say that Google determined the SMART parameters are useless. Rather, the SMART parameters by themselves cannot be reliably used to predict _all_ drive failures. They can capture some drive failures but you need to use some heuristics to use the parameters for making predictions. Even if you use all of the heuristics and other metrics that Google used, they still couldn’t predict all of the drive failures (or close to them).

Please read the Google paper for more details. It’s a very very interesting and useful paper.

Ummm – I guess that’s just semantics on the usefulness of SMART.
It is interesting that Google dives into great detail about how they collect the SMART information and the robust nature of their data collection mechanisms, but yet one of the simplest data sets to collect would be the manufacturer of the drive. This is reported by the SMART technology. If this information is difficult or almost impossible to obtain, then how can they make the statement that failures do correlate to manufacturers in the first place? I have ideas on why Google decided not to report this critical piece of information.
Since you have reported this story, please follow up with Google and let us know either why Google did not report this critical and important information or what manufacturers had the highest drive failures. This would certainly help the IT folks, like me, and provide an incentive for drive manufacturers to improve their products.

The problem I have with SMART technology is that it is a one-way street. Once a drive has been flagged as “failing”, you cannot un-flag it (at least not legally – and if you could un-flag it, it would just get flagged again as soon as performance fell out of spec again). I have had drives that work perfectly fine for years after being flagged by SMART. The fact that SMART tests aren’t a 99.9% accurate predictor of failure is enough that permanently flagging a drive as “failed” becomes a very bad idea.

The argument on the drive manufacturer side is that if that drive was to be resold, the potential buyer should have some warning that a problem is imminent. But if we applied the same logic to cars, there wouldn’t be enough room in all the landfills on the planet to dispose of all the unsellable cars flagged as “failed” just because they have some minor mechanical issue.

Far too simplistic. It may be one particular model that’s a lemon. It’s probable that this didn’t become apparent until after the manufacturer had stopped selling it (i.e., it’s obsolescent). Or it may be a manufacturer who has fallen foul of a batch of dud components that could have been sold to anyone. Many other possibles.

The fact is, every new drive on the market is in many senses a prototype. By the time it’s been in service for long enough to prove reliable, it’s also obsolescent. The manufacturers do accelerated ageing tests, but these are a poor approximation to the passage of years of real time.

The moral – if you construct RAID arrays out of disks with near-consecutive serial numbers, you maximise your chances of a multi-drive failure! To eliminate several common-mode failures, build an array out of (say) one WD, one Seagate and one Hitachi disk. For a bigger array if you run out of manufacturers you can use different sizes or series from the same manufacturer. It may even reduce the risk to buy the same drive from two different suppliers – the serial numbers are less likely to be near-consecutive, you may dodge a batch problem that way.

Other morals – if a drive in a RAID array fails assume it’s a pointer to another being about to go the same way! And with drives as cheap as they are, consider RAID-6 rather than RAID-5, just in case.

Warning sales pitch: Please see my book “From Computer to Brain” where I use this timeless quote as follows:

The brain has long been mysterious in this way. It was purchased from an outside vendor. It fits neatly into a slot at the top of the case. It picks up signals from the rest of the system and provides output signals that are used to navigate and maneuver the entire device. In a classic “Star Trek” episode, a nasty alien stole Spock’s brain and plugged it into a central control station so that it would run the air conditioning or something. Complex plot twists led to the immortal line: “Brain, brain … what is brain?”
