Monitoring Hard Disks with SMART

One of your hard disks might be trying to tell you it's not long for this world. Install software that lets you know when to replace it.

Listing 4. Output of smartctl -l error /dev/hda

SMART Error Log Version: 1
No Errors Logged

The final part of the smartctl output (Listing 5) is a report of
the self-tests run on the disk. These show two types of
self-tests, short and long. (ATA-6/7 disks also may have conveyance and
selective self-tests.) These can be run with the commands
smartctl -t short /dev/hda and smartctl -t long
/dev/hda and do not corrupt data on the disk.
Typically, short tests take only a minute or two to complete, and long
tests take about an hour. These self-tests do not interfere with the
normal functioning of the disk, so the commands may be used for
mounted disks on a running system. On our computing cluster nodes, a
long self-test is run with a cron job early every Sunday morning. The
entries in Listing 5 all are self-tests that completed without errors; the
LifeTime column shows the power-on age of the disk when the self-test
was run. If a self-test finds an error, the Logical Block
Address (LBA) shows where the error occurred on the disk. The Remaining
column shows the percentage of the self-test remaining when the error
was found. If you suspect that something is wrong with a disk, I
strongly recommend running a long self-test to look for problems.
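
As a rough sketch of how the commands above fit together, the following shell function queues a self-test on several disks in turn. The function name run_selftest and the SMARTCTL override variable are made up for this example and are not part of smartmontools. The cron line in the comment assumes a 3:00 am start time, an /etc/crontab-style entry and an install path of /usr/sbin; none of these come from our cluster's actual setup.

```shell
# Sketch only: run_selftest and SMARTCTL are hypothetical names.
# Setting SMARTCTL="echo smartctl" prints the commands instead of running them.
SMARTCTL="${SMARTCTL:-smartctl}"

# Usage: run_selftest short|long DEVICE...
run_selftest() {
    kind="$1"
    shift
    for disk in "$@"; do
        # Queue the requested self-test on this disk.
        $SMARTCTL -t "$kind" "$disk"
    done
}

# Once a test has finished, its result appears in the self-test log:
#   smartctl -l selftest /dev/hda
#
# An /etc/crontab-style line for a long self-test early on Sunday
# (the 03:00 start time and the path are assumptions):
#   0 3 * * 0  root  /usr/sbin/smartctl -t long /dev/hda
```

Running the function for real requires root privileges, for example run_selftest long /dev/hda /dev/hdc.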

The smartctl -t offline command can be used to carry out
off-line tests. These off-line tests do not make entries in the
self-test log. They date back to the SFF-8035i standard, and update
values of the Attributes that are not updated automatically under
normal disk operation (see the UPDATED column in Listing 3). Some disks
support automatic off-line testing, enabled by smartctl -o
on, which automatically runs an off-line test every few hours.

The SMART standard provides a mechanism for running disk self-tests
and for monitoring aspects of disk performance. Its main shortcoming
is that it doesn't provide a direct mechanism for informing the OS or
user if problems are found. In fact, because disk SMART status frequently
is not monitored, many disk problems go undetected until they
lead to catastrophic failure. Of course, you can monitor disks on a
regular basis using the smartctl utility, as I've described, but this is a nuisance.

The remaining part of the smartmontools package is the smartd
dæmon that does regular monitoring for you. It monitors the disk's SMART data for
signs of problems. It can be configured to send e-mail to users or
system administrators or to run arbitrary scripts if problems are detected. By
default, when smartd is started, it registers the system's
disks. It then checks their status every 30 minutes for failing
Attributes, failing health status or increased numbers of ATA errors
or failed self-tests and logs this information with SYSLOG
in /var/log/messages by default.
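
To see what smartd has been reporting, one option is to filter the system log for its entries. The helper below is a minimal sketch: the name smartd_messages is made up for this example, and the default path assumes SYSLOG writes to /var/log/messages as described above.

```shell
# Sketch only: smartd_messages is a hypothetical helper name.
# Prints the lines that smartd logged in a syslog file
# (default: /var/log/messages, which usually needs root to read).
smartd_messages() {
    # smartd entries carry the tag "smartd[PID]:", so match "smartd[".
    grep 'smartd\[' "${1:-/var/log/messages}"
}
```

Called without arguments it reads the default log; a different file can be passed as the first argument.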

You can control and fine-tune the behavior of smartd using
the configuration file /etc/smartd.conf. This file is read
when smartd starts up, before it forks into the background.
Each line contains Directives pertaining to a different disk. The
configuration file on our computing cluster nodes looks like this:

The first column indicates the device to be monitored. The -o on
Directive enables the automatic off-line testing, and the -S on
Directive enables automatic Attribute autosave. The -m Directive is
followed by an e-mail address to which warning messages are
sent, and the -a Directive instructs smartd to monitor all
SMART features of the disk. In this configuration, smartd
logs changes in all normalized attribute values. The -I 194
Directive means ignore changes in Attribute #194, because disk
temperatures change often, and it's annoying to have such changes
logged on a regular basis.
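
A configuration that combines these Directives could look like the following sketch. The device names and the e-mail address are placeholders chosen for illustration, not our cluster's actual settings.

```text
/dev/hda -S on -o on -a -I 194 -m admin@example.com
/dev/hdc -S on -o on -a -I 194 -m admin@example.com
```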

Normally, smartd is started by the standard UNIX init
mechanism. For example, on Red Hat distributions,
/etc/rc.d/init.d/smartd start and /etc/rc.d/init.d/smartd
stop can be used to start and stop the dæmon.

Further information about smartd and its configuration file can
be found in the man page (man smartd), and summaries can be
found with the commands smartd -D and smartd -h.
For example, the -M test Directive sends a test e-mail warning
message to confirm that warning e-mail messages are
delivered correctly. Other Directives provide additional flexibility, such as
monitoring changes in raw Attribute values.
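
As an illustration, a smartd.conf line like the one below would ask smartd to send a single test message when it starts. The device name and the e-mail address are placeholders.

```text
/dev/hda -a -m admin@example.com -M test
```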

What should you do if a disk shows signs of problems? What if
a disk self-test fails or the disk's SMART health status fails?
Start by getting your data off the disk and onto another system as soon as
possible. Second, run some extended disk self-tests and see if the
problem is repeatable at the same LBA. If so, something probably is
wrong with the disk. If the disk has failing SMART health status and
is under warranty, the vendor usually will replace it. If the disk is
failing its self-tests, many manufacturers provide specialized disk
health programs, for example, Maxtor's PowerMax or IBM's Drive
Fitness Test. Sometimes these programs actually can repair a
disk by remapping bad sectors. Often, they report a special error
code that can be used to get a replacement disk.
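
One way to script the health-status check is to look at the exit status of smartctl -H. The sketch below relies on the smartctl man page's description of the exit status as a bit mask, in which bit 3 (value 8) means the disk reported a failing SMART status and bits 0 and 1 mean the command or the device could not be used. The function name disk_failing and the SMARTCTL override variable are hypothetical.

```shell
# Sketch only: disk_failing and SMARTCTL are hypothetical names.
SMARTCTL="${SMARTCTL:-smartctl}"

# Succeeds (returns 0) only when smartctl ran properly and the disk
# reported a failing SMART health status.
disk_failing() {
    $SMARTCTL -H "$1" >/dev/null 2>&1
    status=$?
    # Bits 0 and 1 set: bad command line or device could not be opened,
    # so the health result cannot be trusted.
    [ $(( status & 3 )) -eq 0 ] || return 1
    # Bit 3 set: the SMART status check returned "DISK FAILING".
    [ $(( status & 8 )) -ne 0 ]
}

# Example use (needs root and a real device):
#   if disk_failing /dev/hda; then echo "back up /dev/hda now"; fi
```

A passing result here is not a guarantee of health; it only means the disk did not flag itself as failing.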

This article has covered the basics of smartmontools. To learn
more, read the man pages and Web page, and then write to the
support mailing list if you need further help. Remember,
smartmontools is no substitute for backing up your data. SMART cannot and
does not predict all disk failures, but it often provides
clues that something is amiss and has helped to keep many computing
clusters operating reliably.

Several developers are porting smartmontools to FreeBSD, Darwin and
Solaris, and we recently have added extensions to allow smartmontools to monitor
and control the ATA disks behind 3ware RAID controllers. If you would
like to contribute to the development of smartmontools, write
to the support mailing list. We especially are interested in
information about the interpretation and meaning of vendor-specific
SMART Attribute and raw values.

Bruce Allen is a professor of Physics at the University of
Wisconsin - Milwaukee. He does research work on gravitational waves
and the very early universe, and he has built several large Linux clusters
for data analysis use.

Comments

Bruce's original answer worked for me...
Your disk has one or more unreadable sectors. This does NOT mean that the disk is failing, but it has lost some information on those sectors. Run an extended self test:

After the test is over, the self-test log (-l) will show what sector is unreadable. This will probably agree with what is shown in SYSLOG. Then look at BadBlockHowTo (linked from smartmontools home page) for instructions about how to identify if there is a file stored on that bad sector. If you have no data that you need, you can fix the problem by overwriting the bad partition with zeros using dd.

But be careful not to zero out regions of the disk that store data that you need!

It happened to me as well. I tried everything on the net, but nothing worked for me. Finally, I went to the repair shops and they said the HDD had to be replaced. It is replaced now and works well. Anyway, no complaints; it was an old HDD.

The 9000 system indicated our AE-35 unit had a high probability of failure within 72 hours. We replaced it and could find no indication of problems in the original unit. HAL said the replacement was about to fail, too. The ground recommendation was to leave the unit in service and let it fail. HAL now says it has failed, and it acts like it has, but I am wondering what I will find when I go out to replace it. Dave is monitoring the status from inside, and our link to the base is still down. I will report back when I learn more.

Some of us with SSD (solid state disk) units have been bitten by filesystem corruption after updating to the latest Ubuntu and Fedora. We have recently determined the corruption was (ironically) triggered by the SMART utilities.

I don't believe we know at this moment whether to expect a firmware update for the SSDs or some other remedy, but beware if you notice strange behavior in your device, especially immediately after probing its SMART status.

Note that SMART may be invoked without your direct knowledge -- Beginning with Ubuntu 9.10 libatasmart is being invoked automatically by DBUS. Also in Ubuntu, running GNU parted probes the disk's SMART status, though running fsck does not.

"Studies have shown that lowering disk temperatures by as little as 5°C significantly reduces failure rates, though this is less of an issue for the latest generation of fluid-drive bearing drives. One of the simplest and least expensive steps you can take to ensure disk reliability is to add a cooling fan that blows cooling air directly onto or past the system's disks."

Which studies are those? Google's study of over 100,000 drives found that disks failed MORE often when they were cooled, and ran better hot:

"In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."

-----------

The results from smartctl are very confusing and hard to understand. Wikipedia clarifies what some of the values mean, though there's still a lot of uncertainty:

You shouldn't quote articles without reading the whole piece. Further in that article, they said the failures were suspected not to come from the lower drive temperatures, but rather from the power fluctuations that were coming from the increased electrical load of the air conditioners...

In very simple terms, your disk has had problems in the past and has either repaired itself or been repaired by some external utility. However, this repair has left your disk vulnerable to further failure. Back up immediately.

Thank you very much for posting this! I already feared that my hard disk is dying because of the strange noises the PC made on start up. (I don't even know now where the noise came from. Could be the other things too, right?)

I miss SMART for USB disks too. I wonder why I cannot monitor my disk connected over USB in Linux. Is it a Linux driver limit? Or a hardware limit? My USB/ATA controller is based on Genesys Logic (05e3:0702), and my disk supports SMART; I know I can read SMART statistics when I connect my disk over a PATA cable.

Bruce Allen's reference to SMART's ability to "query the disk's health status, run disk self-tests..." suggests that you could at least get some kind of condition report by removing your hard disk from its USB housing, connecting it to an ATA cable in a desktop, then running the query utility. I use a cheap ($6) adapter to connect my 2.5 inch laptop drives to my desktop. Though this ritual does not allow continuous monitoring of the USB drive, it may give a clue as to its current status.

Sad to say, other distractions, such as picking a distribution, have prevented me from trying out my own suggestion.
cheers...

Hello,
why doesn't smartctl show a summary of bad or pending sectors? One such message can be found in /var/log/messages like "Aug 27 12:17:51 91-64-143-104-dynip smartd[4483]: Device: /dev/hdb, 7 Currently unreadable (pending) sectors", however, it would be more convenient to get this information directly from smartctl. How can I get this information?

Thank you for a nicely written and informative article.
I tried it on one of my SCSI drives, which tends to be busy.
It gave me some other information:
The overall status is OK. Should I worry about the errors?

with attrib 1,7,195
I find this with all my seagate drives
which worries me where I read that part about ata4 standard and drives not keeping the attributes anymore

i think my non-seagate drives are now just too dumb to realise they are failing.
if my seagate drives get that error value down in the 60's, they are going out soonish
not real quick today soon
but the system is just iffy and had to play with
if you get spinrite to run over the drive, it will improve, some, for a while

The problem is I always have to be around to press F1 whenever the system boots up. Other than that, the disk (and the OSes) seem to work fine. I tried disabling BIOS harddrive monitoring but that did not help. Also disabling smart through smartctl and rebooting but that did not help either. Somehow the disk always remembers the SMART error.

The disk is a Maxtor 91531U3.

Is there any way I can get rid of that SMART warning at startup? Any help would be much appreciated.

I get a message from the BIOS when I switch on the laptop: "HDD status bad, back up and replace." I want to stop this message from appearing so that Windows will load normally. I can't disable it via the BIOS, as it has no such option. Will this tool help me?

Yes, this probably means that your disk has a bad sector. Read the BadBlocksHowTo linked from the smartmontools home page, to see how to identify if there is a file being stored on that bad part of the disk, and how to force the drive to reallocate that sector.

I have one thing that I cannot quite understand. If I read it correctly, the value is the current snapshot of what smartctl sees. In the case below, that is 045. The funny thing is it states that the "worst" it has seen is 054.

194 Temperature_Celsius 0x0022 045 054 000

Is the temperature attribute the exception to the rule that the worst value is the "smallest value attained since SMART was enabled on the disk"?

I suppose this would make sense as the worst temperature in a real life system would be a high temperature in most cases. Either that or I am way off base!

This must be a SEAGATE disk. Seagate ignores the smart standard and just stores the temperature (in Celsius) in these variables. So your current disk temperature is 45C and the hottest it has ever been is 54C.

I am running CentOS release 4 with SATA drives on the digital video recorders we are building. I want to utilise the SMART suite, but I have found that the SMART daemon fails to start during bootup. Do SATA drives support SMART?

Here is the quote from developers of smartmontools:
"Smartmontools should work correctly with SATA drives under both Linux 2.4 and 2.6 kernels, if you use the standard IDE drivers in drivers/ide. If you use the new libata drivers, it won't work correctly because libata doesn't yet support the needed ATA-passthrough ioctl() calls. Jeff Garzik, the libata developer, says that this support will be added to libata in the future. When this happens, we'll add support to smartmontools for a new SATA/libata device type '-d sata'. Typically, to force an SATA disk to run using the standard (non-libata) drivers, you must use the BIOS to select "legacy mode" for the controller. If the IDE driver doesn't support your particular SATA controller, or the controller doesn't have a legacy interface, then only libata can be used. Unless the hard disk controller on the system motherboard is Intel, VIA or nVidia, standard IDE drivers may not work.

Note: an unofficial patch to libata that allows smartmontools to be used with the standard '-d ata' device type was posted to the linux kernel mailing list at the end of August 2004. The patch is included in the libata-dev patchset that can be applied to a recent Linux kernel (>= 2.6.9). With a SATA disk driven by a libata driver, smartmontools can now be used by specifying both the device type 'ata' and the SCSI device corresponding to this disk, for example, smartctl -i -d ata /dev/sda. The patch is still under development and it is probably best to make sure that the disk is idle before trying smartmontools."

Well well - I had some ECS-AMD-Mainboard and activated the S.M.A.R.T. ... but actually 2 Seagate-Harddisks died (the slow way - losing information) ... without SMART telling me that there is a Problem 8-)

I would like to know whether there is any direct correspondence between the kernel's I/O error reports and the SMART test report.

I had a hard disk in a Linux server for which the kernel reported an I/O Seek Complete Error nearly a year ago. I just left that partition unused, used another hard disk to replace the mount point for that partition and let the server continue running.

After I read this article, I ran a test with -a on that "kernel reported problematic" hard disk.
