Comments

Bruce's original answer worked for me...
Your disk has one or more unreadable sectors. This does NOT mean that the disk is failing, but it has lost some information on those sectors. Run an extended self test:

After the test is over, the self-test log (-l) will show what sector is unreadable. This will probably agree with what is shown in SYSLOG. Then look at BadBlockHowTo (linked from smartmontools home page) for instructions about how to identify if there is a file stored on that bad sector. If you have no data that you need, you can fix the problem by overwriting the bad partition with zeros using dd.

But be careful not to zero out regions of the disk that store data that you need!
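For anyone following along, the steps above could be sketched roughly like this. It is only a sketch under some assumptions: /dev/sdX and the sector number are placeholders, the helper names are made up for illustration, and it assumes the self-test log reports the bad sector as an LBA in 512-byte units. Nothing here runs the commands; it only builds the argument lists so you can look at them before doing anything destructive.

```python
import shlex

SECTOR_SIZE = 512  # assumption: the drive reports LBAs in 512-byte sectors


def extended_self_test_command(device):
    """Argument list that starts an extended (long) SMART self-test."""
    return ["smartctl", "-t", "long", device]


def self_test_log_command(device):
    """Argument list that prints the SMART self-test log."""
    return ["smartctl", "-l", "selftest", device]


def zero_sector_command(device, lba, sector_size=SECTOR_SIZE):
    """Argument list for dd that overwrites exactly one sector with zeros.

    DESTRUCTIVE if executed: whatever was stored in that sector is lost.
    """
    if lba < 0:
        raise ValueError("LBA must be non-negative")
    return [
        "dd",
        "if=/dev/zero",
        f"of={device}",
        f"bs={sector_size}",
        "count=1",
        f"seek={lba}",
    ]


if __name__ == "__main__":
    # Hypothetical device and sector number, for illustration only.
    device, bad_lba = "/dev/sdX", 123456
    for cmd in (
        extended_self_test_command(device),
        self_test_log_command(device),
        zero_sector_command(device, bad_lba),
    ):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them is deliberate: check with the BadBlockHowTo whether a file lives on that sector before you let dd anywhere near the disk.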

It happened to me as well. I tried everything on the net, but nothing worked for me. Finally, I went to the repair shops and they said the HDD had to be replaced. It is replaced now and works well. Anyway, no complaints; it was an old HDD.

The 9000 system indicated our AE-35 unit had a high probability of failure within 72 hours. We replaced it and could find no indication of problems in the original unit. HAL said the replacement was about to fail, too. The ground recommendation was to leave the unit in service and let it fail. HAL now says it has failed, and it acts like it has, but I am wondering what I will find when I go out to replace it. Dave is monitoring the status from inside, and our link to the base is still down. I will report back when I learn more.

Some of us with SSD (solid state disk) units have been bitten by filesystem corruption after updating to the latest Ubuntu and Fedora. We have recently determined the corruption was (ironically) triggered by the SMART utilities.

I don't believe we know at this moment whether to expect a firmware update for the SSDs or some other remedy, but beware if you notice strange behavior in your device, especially immediately after probing its SMART status.

Note that SMART may be invoked without your direct knowledge -- Beginning with Ubuntu 9.10 libatasmart is being invoked automatically by DBUS. Also in Ubuntu, running GNU parted probes the disk's SMART status, though running fsck does not.

"Studies have shown that lowering disk temperatures by as little as 5°C significantly reduces failure rates, though this is less of an issue for the latest generation of fluid-drive bearing drives. One of the simplest and least expensive steps you can take to ensure disk reliability is to add a cooling fan that blows cooling air directly onto or past the system's disks."

Which studies are those? Google's study of over 100,000 drives found that disks failed MORE often when they were cooled, and ran better hot:

"In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."

-----------

The results from smartctl are very confusing and hard to understand. Wikipedia clarifies what some of the values mean, though there's still a lot of uncertainty:

You shouldn't quote articles without reading the whole piece. Further in that article, they said the failures were suspected not to come from the lower drive temperatures, but rather from the power fluctuations that were coming from the increased electrical load of the air conditioners...

In very simple terms, your disk has had problems in the past and has either repaired itself or been repaired by some external utility. However, this repair has left your disk vulnerable to further failure. Back up immediately.

Thank you very much for posting this! I already feared that my hard disk is dying because of the strange noises the PC made on start up. (I don't even know now where the noise came from. Could be the other things too, right?)

I miss SMART for USB disks too. I wonder why I cannot monitor my disk connected over USB in Linux. Is it a Linux driver limit? Or a hardware limit? My USB/ATA controller is based on Genesys Logic (05e3:0702), and my disk supports SMART; I know I can read SMART statistics when I connect my disk over a PATA cable.

Bruce Allen's reference to SMART's ability to "query the disk's health status, run disk self-tests..." suggests that you could at least get some kind of condition report by removing your hard disk from its USB housing, connecting it to an ATA cable in a desktop, then running the query utility. I use a cheap ($6) adapter to connect my 2.5 inch laptop drives to my desktop. Though this ritual does not allow continuous monitoring of the USB drive, it may give a clue as to its current status.

Sad to say, other distractions, such as picking a distribution, have prevented me from trying out my own suggestion.
cheers...

Hello,
why doesn't smartctl show a summary of bad or pending sectors? One such message can be found in /var/log/messages like "Aug 27 12:17:51 91-64-143-104-dynip smartd[4483]: Device: /dev/hdb, 7 Currently unreadable (pending) sectors", however, it would be more convenient to get this information directly from smartctl. How can I get this information?
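Until someone points out a better way, one workaround is to pull the number out of the log yourself. Below is a minimal sketch that assumes the smartd message is worded exactly like the line quoted above (other smartd versions may phrase it differently); the function name is made up for illustration. On many ATA drives the same count also appears as the raw value of attribute 197 (Current_Pending_Sector) in the smartctl -A output, though that depends on the drive.

```python
import re

# Pattern for the smartd message quoted above; the wording is an assumption
# based on that single log line.
PENDING_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>\S+), "
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def pending_sectors(log_lines):
    """Return {device: most recent pending-sector count} from smartd log lines."""
    latest = {}
    for line in log_lines:
        match = PENDING_RE.search(line)
        if match:
            latest[match.group("device")] = int(match.group("count"))
    return latest


if __name__ == "__main__":
    sample = [
        "Aug 27 12:17:51 91-64-143-104-dynip smartd[4483]: "
        "Device: /dev/hdb, 7 Currently unreadable (pending) sectors",
    ]
    print(pending_sectors(sample))
```

To use it on a real system you would pass in the lines of /var/log/messages, for example via open("/var/log/messages").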

Thank you for a nicely written and informative article.
I tried it on one of my SCSI drives, which tends to be busy.
It gave me some other information:
The overall status is OK. Should I worry about the errors?

With attributes 1, 7, and 195, I find this with all my Seagate drives, which worries me, since I read that part about the ATA-4 standard and drives not keeping the attributes anymore.

I think my non-Seagate drives are now just too dumb to realise they are failing. If my Seagate drives get that error value down into the 60s, they are going out soonish. Not real-quick-today soon, but the system is just iffy and hard to play with. If you get SpinRite to run over the drive, it will improve, some, for a while.

The problem is I always have to be around to press F1 whenever the system boots up. Other than that, the disk (and the OSes) seem to work fine. I tried disabling BIOS hard drive monitoring, but that did not help. I also tried disabling SMART through smartctl and rebooting, but that did not help either. Somehow the disk always remembers the SMART error.

The disk is a Maxtor 91531U3.

Is there any way I can get rid of that SMART warning at startup? Any help would be much appreciated.

I get a message from the BIOS when I switch on the laptop: "HDD status bad, back up and replace." I want to stop this message appearing so that Windows will load normally. I can't disable it via the BIOS as it has no such option. Will this tool help me?

Yes, this probably means that your disk has a bad sector. Read the BadBlocksHowTo linked from the smartmontools home page, to see how to identify if there is a file being stored on that bad part of the disk, and how to force the drive to reallocate that sector.

I have one thing that I cannot quite understand. If I read it correctly, the value is the current snapshot of what smartctl sees. In the case below, that is 045. The funny thing is it states that the "worst" it has seen is 054.

194 Temperature_Celsius 0x0022 045 054 000

Is the temperature attribute the exception to the rule that the worst value is the "smallest value attained since SMART was enabled on the disk"?

I suppose this would make sense as the worst temperature in a real life system would be a high temperature in most cases. Either that or I am way off base!

This must be a Seagate disk. Seagate ignores the SMART standard and just stores the temperature (in Celsius) in these variables. So your current disk temperature is 45C and the hottest it has ever been is 54C.
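If you want to pick these columns apart in a script, something like the following could work. It is a minimal sketch that assumes the row has the column order shown in the question (ID, name, flag, value, worst, threshold); the class and function names are made up for illustration. The has_failed helper applies the usual rule that an attribute counts as failed when its normalized value is at or below the threshold, which for this temperature row (threshold 000) can never trigger.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    flag: int
    value: int
    worst: int
    thresh: int


def parse_attribute_line(line):
    """Parse the first six columns of a smartctl attribute row."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"not an attribute row: {line!r}")
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        flag=int(fields[2], 16),  # the flag column is hexadecimal, e.g. 0x0022
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
    )


def has_failed(attr):
    """Usual rule: an attribute has failed when value <= threshold."""
    return attr.value <= attr.thresh


if __name__ == "__main__":
    row = "194 Temperature_Celsius 0x0022 045 054 000"
    attr = parse_attribute_line(row)
    print(attr.name, attr.value, attr.worst, has_failed(attr))
```

For a drive that stores the raw temperature in these fields, as described above, value would be the current temperature and worst the highest one seen, rather than the lowest normalized value.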

I am running CentOS release 4 with SATA drives on the digital video recorders we are building. I want to utilise the SMART suite, but I have found that the SMART daemon fails to start during bootup. Do SATA drives support SMART?

Here is a quote from the developers of smartmontools:
"Smartmontools should work correctly with SATA drives under both Linux 2.4 and 2.6 kernels, if you use the standard IDE drivers in drivers/ide. If you use the new libata drivers, it won't work correctly because libata doesn't yet support the needed ATA-passthrough ioctl() calls. Jeff Garzik, the libata developer, says that this support will be added to libata in the future. When this happens, we'll add support to smartmontools for a new SATA/libata device type '-d sata'. Typically, to force an SATA disk to run using the standard (non-libata) drivers, you must use the BIOS to select "legacy mode" for the controller. If the IDE driver doesn't support your particular SATA controller, or the controller doesn't have a legacy interface, then only libata can be used. Unless the hard disk controller on the system motherboard is Intel, VIA or nVidia, standard IDE drivers may not work.

Note: an unofficial patch to libata that allows smartmontools to be used with the standard '-d ata' device type was posted to the linux kernel mailing list at the end of August 2004. The patch is included in the libata-dev patchset that can be applied to a recent Linux kernel (>= 2.6.9). With a SATA disk driven by a libata driver, smartmontools can now be used by specifying both the device type 'ata' and the SCSI device corresponding to this disk, for example, smartctl -i -d ata /dev/sda. The patch is still under development and it is probably best to make sure that the disk is idle before trying smartmontools. "

Well well - I had some ECS AMD mainboard and activated S.M.A.R.T. ... but actually 2 Seagate hard disks died (the slow way - losing information) ... without SMART telling me that there was a problem 8-)

I would like to know whether there is any direct correlation between the kernel I/O error reports and the SMART test report.

I had a hard disk in a Linux server for which the kernel reported an I/O Seek Complete Error nearly a year ago. I just left that partition unused, used another hard disk to replace the mount point for that partition, and let the server continue running.

After I read this article, I ran a test with -a on that "kernel-reported problematic" hard disk.

