The second part of the output (Listing 2) shows the results of the health
status inquiry. This is the one-line Executive Summary Report of
disk health; the disk shown here has passed. If your disk health
status is FAILING, back up your data immediately. The remainder of
this section of the output provides information about the disk's
capabilities and the estimated time to perform short and long disk
self-tests.
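
If you want to check the health status from a script, the one-line summary can be picked out of the saved output. The following is only a minimal sketch: it assumes the ATA wording "SMART overall-health self-assessment test result:" that smartctl prints for this line, which may differ for other device types or versions, and the function name health_passed is made up for illustration.

```python
import re

# Assumed wording of the ATA health line; other device types or
# smartctl versions may print something different.
HEALTH_RE = re.compile(
    r"^SMART overall-health self-assessment test result:\s*(\S+)",
    re.MULTILINE,
)


def health_passed(smartctl_output):
    """Return True if the health line says PASSED, False for any other
    result, and None if no health line is present in the text."""
    match = HEALTH_RE.search(smartctl_output)
    if match is None:
        return None
    return match.group(1) == "PASSED"


if __name__ == "__main__":
    sample = (
        "=== START OF READ SMART DATA SECTION ===\n"
        "SMART overall-health self-assessment test result: PASSED\n"
    )
    print(health_passed(sample))
```

Anything other than PASSED is treated as a reason to back up, which errs on the side of caution.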

The third part of the output (Listing 3) lists the disk's table of up to
30 Attributes (from a maximum set of 255). Remember that
Attributes are no longer part of the ATA standard, but most
manufacturers still support them. Although SFF-8035i doesn't define the
meaning or interpretation of Attributes, many have a de facto
standard interpretation. For example, this disk's 13th
Attribute (ID #194) tracks its internal temperature.

Studies have shown that lowering disk temperatures by as little as 5°C significantly reduces failure rates, though this is less of an
issue for the latest generation of fluid-drive bearing drives. One of
the simplest and least expensive steps you can take to ensure disk
reliability is to add a cooling fan that blows cooling air directly
onto or past the system's disks.

Each Attribute has a six-byte raw value (RAW_VALUE) and a one-byte
normalized value (VALUE). In this case, the raw value stores three
temperatures: the disk's temperature in Celsius (29), plus its lifetime
minimum (23) and maximum (33) values. The format of the raw
data is vendor-specific and not specified by any standard. To track
disk reliability, the disk's firmware converts the raw value to a
normalized value ranging from 1 to 253. If this normalized value
is less than or equal to the threshold (THRESH), the Attribute is
said to have failed, as indicated in the WHEN_FAILED column. The column is
empty because none of these Attributes has failed. The lowest
(WORST) normalized value also is shown; it is the smallest value attained
since SMART was enabled on the disk. The TYPE of the Attribute
indicates if Attribute failure means the device has reached the
end of its design life (Old_age) or it's an impending disk
failure (Pre-fail). For example, disk spin-up time (ID #3) is a
prefailure Attribute. If this (or any other prefail Attribute) fails,
disk failure is predicted in less than 24 hours.
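
The comparison described above can be sketched in a few lines of Python. This is an illustration, not how smartctl itself is written: the Attribute class, the function names and the sample numbers are all hypothetical, while the strings FAILING_NOW and In_the_past are the ones smartctl prints in the WHEN_FAILED column.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    id: int
    name: str
    value: int   # current normalized value (VALUE)
    worst: int   # lowest normalized value seen so far (WORST)
    thresh: int  # failure threshold (THRESH)
    type: str    # "Pre-fail" or "Old_age"


def when_failed(attr):
    """Apply the rule from the text: an Attribute fails when its
    normalized value is less than or equal to its threshold."""
    if attr.value <= attr.thresh:
        return "FAILING_NOW"
    if attr.worst <= attr.thresh:
        return "In_the_past"
    return "-"


def interpret(attr):
    """Turn a current failure into the meaning given by TYPE."""
    if when_failed(attr) != "FAILING_NOW":
        return "not failing now"
    if attr.type == "Pre-fail":
        return "disk failure predicted in less than 24 hours"
    return "device has reached the end of its design life"


if __name__ == "__main__":
    # Made-up sample values, not taken from Listing 3.
    spin_up = Attribute(3, "Spin_Up_Time", 20, 20, 33, "Pre-fail")
    print(when_failed(spin_up), "->", interpret(spin_up))
```

Note that a threshold of 0 can never be reached, because normalized values start at 1.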

The names and meanings of Attributes and the interpretation of their raw
values are not specified by any standard. Different manufacturers
sometimes use the same Attribute ID for different purposes. For
this reason, the interpretation of specific Attributes can be modified
using the -v option to smartctl; please see the man page for
details. For example, some disks use Attribute 9 to store the power-on
details. For example, some disks use Attribute 9 to store the power-on
time of the disk in minutes; the -v 9,minutes option to
smartctl correctly modifies the Attribute's
interpretation. If your disk model is in the smartmontools database,
these -v options are set automatically.
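
To see why the interpretation matters, here is a small sketch of the arithmetic behind the Attribute 9 example. The function name and the sample raw value are made up for illustration, and only the two units mentioned in the text are handled.

```python
# Minutes per unit for the two interpretations discussed in the text.
MINUTES_PER_UNIT = {"hours": 60, "minutes": 1}


def power_on_hours(raw_value, unit="hours"):
    """Convert the raw value of Attribute 9 to hours, given the unit
    the disk actually uses to count power-on time."""
    if unit not in MINUTES_PER_UNIT:
        raise ValueError(f"unsupported unit: {unit}")
    return raw_value * MINUTES_PER_UNIT[unit] / 60


if __name__ == "__main__":
    raw = 600000  # hypothetical raw value
    print(power_on_hours(raw, "minutes"))  # read as minutes
    print(power_on_hours(raw, "hours"))    # same number misread as hours
```

Read as minutes, the hypothetical raw value is a little over a year of power-on time; misread as hours, the same number would suggest a disk that has been running for decades.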

The next part of the smartctl -a output (Listing 4) is a log of
the disk errors. This particular disk has been error-free, and the log is empty.
Typically, one should worry only if disk errors start to appear in
large numbers. An occasional transient error that does not recur usually is
benign. The smartmontools Web page has a number of examples
of smartctl -a output showing some illustrative
error log entries. They are timestamped with the disk's power-on
lifetime in hours when the error occurred, and the individual ATA
commands leading up to the error are timestamped with the time in
milliseconds after the disk was powered on. This shows whether the errors
are recent or old.
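
Judging the age of an error comes down to subtracting its timestamp from the disk's current power-on lifetime. The sketch below assumes you have already read both numbers (in hours) from the smartctl -a output; the function names and the 24-hour window are hypothetical choices for illustration.

```python
def hours_since_error(current_power_on_hours, error_power_on_hours):
    """Power-on hours that have elapsed since an error was logged."""
    if error_power_on_hours > current_power_on_hours:
        raise ValueError("error timestamp is later than the current lifetime")
    return current_power_on_hours - error_power_on_hours


def is_recent(current_power_on_hours, error_power_on_hours, window_hours=24):
    """True if the error was logged within the last window_hours of
    power-on time. The default window is an arbitrary choice."""
    elapsed = hours_since_error(current_power_on_hours, error_power_on_hours)
    return elapsed <= window_hours


if __name__ == "__main__":
    # Made-up numbers: disk at 5,000 power-on hours, error at 4,990.
    print(hours_since_error(5000, 4990), is_recent(5000, 4990))
```

Keep in mind that this measures power-on time, not calendar time, so a disk that is rarely switched on can have "recent" errors that are months old.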

We are using smartmontools to test the 2.5" Fujitsu HD on one of our embedded cPCI dual P-III boards. We turned to it because of a problem we ran into: a frozen "black screen" after POST, during OS loading. It happened only with the XP OS. The "black screen" forced the user to reset the card manually and hope it would not happen again on the next power cycle.

The command line used to run the test was smartctl -t long -d ata /dev/hda. The test ran for 40 minutes and fixed the problem, but we have no idea what caused the problem or how this tool solved it, given that it should be only a TEST TOOL.

Can you assist with the following:
A. What tests does it run? Read only? Writes? Does it change the HD controller's working parameters?
Can you specify the locations of the strings written/read, or are they random?
B. Does it change anything in the operating system? If it does, then what?
C. Does it change the disk structure? If yes, how?

A. It runs the SMART extended self-test. No, it does not change drive parameters.
B. No, it does not change anything in the OS.
C. It does not change the disk structure. However, as the previous responder says, when the self-test is run, the disk firmware may find and correct some types of problems on the disk surface.

As far as I know, the given command line runs the diagnostic procedure *embedded* inside the HD firmware, so what is actually done is at the HD manufacturer's discretion. What may have happened in your case is that there was a bad sector on the disk that couldn't be read most of the time. When you ran the procedure, the sector may have been read successfully once and then immediately remapped (i.e., its data was moved to the disk's spare area and the sector itself marked as unavailable).

Other than that, the tool doesn't affect the HD's state, apart from the SMART monitoring enable/disable flag and the health attribute values.

And I can't enable SMART? I use the HD under Win2000. I have tried:
C:\s\bin>smartctl -s on /dev/hda
smartctl version 5.33 [i386-pc-mingw32] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

It's only a note. The HD works fine and I only want to be informed about the status...

I have a drive that I can't even get into with fdisk, but I can access its mounted file systems. Its temperature reads 53 Celsius, but there is no min/max. It's in my machine next to another drive, and I'm going to try moving them apart. But any thoughts on why fdisk can't even get into the partition table? It reads the other drive fine.

I don't have any idea why fdisk can't provide information. What command line did you use, and what error message did fdisk give?

The temperature min/max isn't provided by all vendors. IBM/Hitachi and Toshiba (recent disks) have this, but many other vendors and older disks either have no temperature information, or (as with your disk) just the current temperature.

(1) You can unmount the disk while SMART activity is taking place, with no harmful effects. In fact, you can get all SMART data (by running smartctl or smartd, for example) on an unmounted disk. This is because SMART operates at the level of the disk hardware, whereas 'mounting' refers to file systems on the disk and is at a much higher level. When you ask 'can it be terminated abruptly?' I assume that you mean, can you power down or unmount a disk which is carrying out SMART operations? The answer is yes.

(2) This question does not make sense. SMART does not 'access a disk'. A utility such as smartctl or smartd 'accesses a disk'. These utilities can only access a disk connected to the local machine, such as /dev/hda or /dev/sda. They cannot access a disk 'over the network': see my response (1) above. Linux only allows you to access FILE SYSTEMS over a network (e.g., using NFS), not 'raw disks'.

Today I finally found out why hda was always spinning despite a
hdparm -S 241 /dev/hda
command, and no file systems mounted on that disk.

It's the smartd that comes with Fedora Core 1. Whenever smartd checks
the disk, the disk spins up (in my case it's a Seagate ST360020A, but
this seems to be a general problem). Newer versions of smartd have
the -n option that allows suppressing the check for disks that are in
various power-saving modes, but not the version that comes with Fedora
Core 1. The workarounds are: use the newer smartd with the
appropriate option, don't use smartd for this particular disk, or
disable smartd altogether. Being lazy, I chose the last option, and now
hda spins down and stays that way.
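
For readers who would rather take the first workaround, here is a rough sketch of what the smartd.conf entry could look like, generated from Python so the power mode can be checked first. It assumes a smartd recent enough to support the -n directive with the power modes never, sleep, standby and idle; the helper name is made up, and -a is simply smartd's "monitor everything" directive.

```python
# Power modes assumed to be accepted by smartd's -n directive.
POWER_MODES = ("never", "sleep", "standby", "idle")


def smartd_conf_line(device, skip_mode="standby"):
    """Build a smartd.conf line that monitors the device but skips
    the check when the disk is in the given low-power mode."""
    if skip_mode not in POWER_MODES:
        raise ValueError(f"unknown power mode: {skip_mode}")
    return f"{device} -a -n {skip_mode}"


if __name__ == "__main__":
    print(smartd_conf_line("/dev/hda"))
```

With -n standby, smartd should leave a spun-down disk alone and check it the next time it is already spinning; see the smartd.conf man page for the exact behavior of your version.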

I read this article with great interest, then I went and tried it out, and voilà! One of my servers reported a problem. With the long diagnostics output in hand, the vendor replaced the disk, no questions asked. I then deployed this across all my machines, including a lab, plus client machines, and bingo! Found another one.

As before, the vendor ponied up the replacement disk without any difficulties.

It was a little awkward explaining to the client that I needed to replace the disk BEFORE it died. That's a new concept, one they are not familiar with, to say the least.

I now run the smartd daemon and feel much better knowing it's keeping tabs on my disks. Prior to this, ignorance was NOT bliss; it was just this nagging feeling lurking in the back of my mind, waiting for a disk failure.

Excellent article! Thanks for bringing this to our attention. This one article alone more than paid for my whole subscription. As is typical of every issue of LJ, I always find several articles I just have to try out. A real 'hands-on' magazine.

If your disks are failing their self-tests, but still have 'passing' SMART health status, it may just be that they have one or more sectors that are unreadable. Next time this happens, you can probably fix it by forcing the disk to reallocate the sectors. Two ways to do this are:

I'm not sure if your comment about smartmontools on one server reading the health of all disks on a network is serious or not! It might be possible if
the raw disk SMART data were made available in some exportable, file-system-like way, but I doubt that this will happen anytime soon. (I did post a patch to
the Linux kernel mailing list for this purpose, but didn't get any response.)