Hard Drive Burn-In Testing - Discussion Thread

FreeNAS Experienced

Mod note: This document has been ported over to the Resources section. To get to the document itself, just use the tabs above and click the "Overview" tab.

This thread remains as the discussion thread, as before. The original version of the document follows below, inside the spoiler tags.

- Ericloewe

jgreco did a nice system build/test/burn-in guide here, but I (and many others) found the details a bit lacking in the hard drive section. He mentions S.M.A.R.T. tests, but doesn't go over how to run them, or how to view the results, etc. and then just kinda throws around dd commands without a lot of explanation there either. Yes, this information is available elsewhere, but for somebody (such as myself) looking for a single cohesive guide to burn-in testing, I figured it'd be nice to have all of the info in one place to just follow, with relevant commands. So, having worked my way through reading around and doing my own testing, here's a little more n00b-friendly guide, written by a n00b, so please feel free to chime in with suggestions or criticisms if you have any. I'm basing this guide more off of cyberjock's post here than jgreco's guide.

UPDATE: Thanks to cyberjock, I've updated the section on badblocks to include instructions for using tmux to test all drives in parallel. Considering that badblocks with default settings takes over 24 hours for a 2TB drive, that should significantly decrease testing times, especially for large arrays.

First of all, the S.M.A.R.T. tests. The first thing that someone unfamiliar with S.M.A.R.T. tests might find strange is the fact that no results are shown when you run the test. The way these tests work is that you initiate the test, it goes off and does its thing, then it records the results for you to check later. So, if this is an initial burn-in test for your entire system, you can initiate tests on all of the drives simultaneously by simply issuing the test command for each drive one after another.

The first test to run is a short self-test:

Code:

smartctl -t short /dev/adaX

It should indicate that the test will take about 5 minutes. You can immediately begin the same test on the next drive, but you can only run one test on each drive at a time. Once it has completed, run a conveyance test:

Code:

smartctl -t conveyance /dev/adaX

Again, wait for the test to complete (about 2 minutes this time). Finally, a long test:

Code:

smartctl -t long /dev/adaX
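
Since each smartctl command returns immediately, starting the same test on every drive is easy to script. Here is a minimal sketch, assuming drives named /dev/ada0 through /dev/ada2 (placeholders, substitute your own). The `start_smart_tests` helper and the `SMARTCTL` override are made up for illustration and are not part of smartmontools:

```shell
#!/bin/sh
# Hypothetical helper: start one S.M.A.R.T. self-test type on several drives.
# Set SMARTCTL=echo to print the arguments instead of running smartctl.
SMARTCTL="${SMARTCTL:-smartctl}"

start_smart_tests() {
    test_type="$1"   # short, conveyance, or long
    shift
    for dev in "$@"; do
        # smartctl only starts the test; the drive runs it in the background.
        $SMARTCTL -t "$test_type" "$dev"
    done
}

# Example (placeholder device names):
#   start_smart_tests short /dev/ada0 /dev/ada1 /dev/ada2
```

Wait for one test type to finish on all drives before starting the next, since each drive can only run one self-test at a time.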

------
Note added by @wblock 2018-01-10: this section recommended enabling the kern.geom.debugflags sysctl. Many people still think it has something to do with allowing raw writes. It does not. Instead, it disables a safety system that is intended to prevent writes to disks that are in use (say, by having a mounted filesystem). From man 4 geom:

0x10 (allow foot shooting)
Allow writing to Rank 1 providers. This would, for example,
allow the super-user to overwrite the MBR on the root disk or
write random sectors elsewhere to a mounted disk. The
implications are obvious.

To summarize, this option should generally not be needed. It only makes it possible to harm data. Any disk you are going to overwrite with data should not be mounted or have anything you wish to keep. In fact, best practice is to not be erasing or stress-testing drives on a system that has actual data on it. Since those disks will not have mounted filesystems, this sysctl will not affect being able to write to them. In fact, it will only make it possible to blow away things that are in use.
------

Now, before we can perform raw disk I/O, we need to enable the kernel GEOM debug flags (kern.geom.debugflags).

This carries some inherent risk, and should probably not be done on a production system. This does not survive through a reboot, so when you're done, just reboot the machine to disable it:

Code:

sysctl kern.geom.debugflags=0x10

Now that we can execute raw I/O, run a badblocks r/w test.

Unlike the S.M.A.R.T. tests, badblocks runs in the foreground, so once you start it, you won't be able to use the console until the test completes. It also means that if you start it over SSH and lose your connection, the test will be canceled. The answer to this is to use a utility called tmux:

Code:

tmux

You should now see a green stripe at the bottom of the screen. Now, we can run badblocks. THIS TEST WILL DESTROY ANY DATA ON THE DISK SO ONLY RUN THIS ON A NEW DISK WITHOUT DATA ON IT OR BACK UP ANY DATA FIRST:

Code:

badblocks -ws /dev/adaX

badblocks also offers a non-destructive read-write test that (in theory) shouldn't damage any existing data, but if you do choose to run it on a production drive and suffer data loss, on your own head be it:

Code:

badblocks -ns /dev/adaX

It has been brought to my attention that badblocks has some limitations with larger drives >2TB. The easy workaround is to manually specify a larger block size for the test.

Code:

badblocks -b 4096 -ws /dev/adaX

or

Code:

badblocks -b 4096 -ns /dev/adaX
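
To see why a larger block size helps: for a given drive size, quadrupling the block size cuts the number of blocks badblocks has to count to a quarter. Here is a rough sketch of that arithmetic. The helper name is made up, the 4 TB figure is an assumed round number, and the exact block-count limit depends on the badblocks version, so none is hard-coded here:

```shell
# Hypothetical helper: number of blocks badblocks would have to address.
# Usage: block_count <drive_size_in_bytes> <block_size_in_bytes>
block_count() {
    echo $(( $1 / $2 ))
}

# A nominal 4 TB drive (4,000,000,000,000 bytes, an assumed round figure):
#   block_count 4000000000000 1024   -> 3906250000
#   block_count 4000000000000 4096   -> 976562500
```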

Once you've started the first test, press Ctrl+B, then " (the double-quote key, not the single quote twice). You should now see a half-white, half-green line through the screen (in PuTTY, it's q's instead of a line, but same thing) with the test continuing in the top half of the screen and a new shell prompt in the bottom. Run the badblocks command again on the next disk, then press Ctrl+B, " again to create another shell. Continue until you've started a test on each disk. If you are connecting over SSH and your session gets disconnected, all of the tests will continue running. When you reconnect, to resume the session and view the test status, simply type:

Code:

tmux attach

As with the S.M.A.R.T. tests, you can only run one test at a time per drive, but you can test all of your drives simultaneously. In my experience, the tests run just as fast with all drives testing as with a single drive, so for your initial burn-in, there's really no reason not to test all of the drives at once. Also, be prepared for this test to take a very long time, as it is basically the "meat and potatoes" of your burn-in process. For reference, the default 4-pass r/w test took a little over 24 hours on my WD Red 2TB drives, YMMV.
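
The pane-by-pane procedure above can also be scripted. This is only a sketch, assuming tmux is installed, a session name of `burnin`, and placeholder device names. The helper functions and the log file naming are made up for illustration. THE COMMAND IT STARTS IS THE DESTRUCTIVE -w TEST, SO EVERY DRIVE YOU PASS IN WILL BE WIPED:

```shell
#!/bin/sh
# Hypothetical helpers: run a destructive badblocks test on each drive,
# one tmux pane per drive. THIS WIPES EVERY DRIVE YOU PASS IN.

# Print the badblocks command line for one device.
# -o writes the list of bad blocks to a file, since a pane closes
# (and its output disappears) when its command exits.
badblocks_cmd() {
    dev="$1"
    printf 'badblocks -b 4096 -ws -o badblocks-%s.log %s' "${dev##*/}" "$dev"
}

# Start a detached session named "burnin" with one pane per device.
start_burnin() {
    session="burnin"
    first="$1"
    shift
    tmux new-session -d -s "$session" "$(badblocks_cmd "$first")"
    for dev in "$@"; do
        tmux split-window -t "$session" "$(badblocks_cmd "$dev")"
        # Re-tile after every split so there is room for the next pane.
        tmux select-layout -t "$session" tiled
    done
}

# Example (placeholder device names):
#   start_burnin /dev/ada0 /dev/ada1 /dev/ada2
#   tmux attach -t burnin
```

Starting the session detached means the tests keep running whether or not you ever attach, which is the same protection against a dropped SSH connection described above.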

Because S.M.A.R.T. tests only passively detect errors after you've actually attempted to read or write a bad sector, you should run the S.M.A.R.T. long test again after badblocks completes:

Code:

smartctl -t long /dev/adaX

At this point, you have fully tested all of your drives, and now it's time to view the results of the various S.M.A.R.T. tests:

Code:

smartctl -A /dev/adaX

This should produce something like this (sorry for the formatting fail):

Some of the more important fields right now include the Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable lines. All of these should have a RAW_VALUE of 0. I'm not sure why the VALUE field is listed as 200, but as long as the RAW_VALUE for each of these fields is 0, that means there are currently no bad sectors. Any result greater than 0 on a new drive should be cause for an immediate RMA.
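
Checking those three lines by eye on every drive gets tedious, so here is a small sketch that does it for you. It assumes the usual `smartctl -A` table layout, where the attribute name is the 2nd column and RAW_VALUE is the 10th. The `check_smart_attrs` helper is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and report any of
# the three bad-sector attributes whose RAW_VALUE (10th column) is not 0.
# Exit status is 0 if all are zero, 1 otherwise.
check_smart_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($10 != 0) { print $2 "=" $10; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Example (placeholder device name):
#   smartctl -A /dev/ada0 | check_smart_attrs && echo "no bad sectors reported"
```

Note that `smartctl -A` only shows the attribute table. The log of the self-tests themselves is printed by `smartctl -l selftest /dev/adaX`, so it's worth looking at that too.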

Once all of your tests have completed, you should reboot your system to disable the kernel GEOM debug flags.

Inactive Account

iozone is for benchmarking. It's not useful for diagnostics at all unless you are trying to stress the disk hard into breaking. You can run multiple badblocks and SMART tests simultaneously (but not more than one on any disk at a time) using something like tmux or screen.

FreeNAS Experienced

Thanks for the tip on tmux, I had started a test on my first disk earlier today, and now I just started simultaneous tests on the remaining 5 drives, so I'll run it for a bit and compare it to the speed on my initial test to see if there's any performance hit for parallel tests. S.M.A.R.T. tests are asynchronous, so you don't need it for them, but for badblocks, you do.

Edit: only 5% into the first pass, but the speed seems to be right on par with the first test, so that's nice. Glad I was able to get this started now, because my initial estimates of the test duration didn't take into account the fact that the readback pass occurs separately, or that the status bar only displays the status of the current pass, so it's looking like a little over 8 hours *per pass* on a 2TB WD Red, which is going to end up taking more than 24 hours for a complete 4-pass test. That's fine, since I'll be gone for the weekend, but again, REALLY glad you gave me the tip on tmux when you did, since that means I'll be able to have all 6 drives tested by the time I get back.

Newbie

Now I don't claim to have read all the posts on this forum, but I have read this one, the ones it points to, and several other related ones, and I have used the search box. They are a great help at explaining things, and I feel that I follow everything in this guide (Thumbs UP!), but it's the very simple beginning that I do not understand. For example, to run these tests, do I install FreeNAS and then SSH into the machine? Do I run my box off a USB stick with a Linux or FreeDOS distro? How do I get to the point where I can type in these commands? There seems to be some kind of common understanding that is just beyond me. Or maybe I'm the wrong kind of noob (20+ years of experience, but exclusively with embedded systems and DIY PCs under Windows).

I've been running memtest86 (&+) as well as Mprime23 and a couple of other CPU stressers for a couple of days now off Ultimate Boot CD and would really like to move on to hard drive testing, before the window my store allows for quick hardware returns closes. I don't want to wait for a couple of weeks while Western Digital processes my RMA (been there, done that).

FreeNAS Experienced

I ran all of this from the FreeNAS shell, so yes, you'll need to set up FreeNAS on a flash drive and boot into it to run these tests. You could probably run them from some other BSD or *NIX environment, but for the sake of this forum, I'll just suggest using FreeNAS. If you're not sure how to install FreeNAS to the USB stick, that's a bit out of scope for this guide, but thankfully, the process is already documented here

Not-very-passive-but-aggressive

tmux is very unintuitive at first. My recommendation to get 6 nicely distributed screens is to first carelessly open 6 of them. Then toggle between display options until you reach tiled (you absolutely need the man page for tmux).

FreeNAS Experienced

I had zero problems with tmux even when I first started using it. Then again, I read the man page extensively before I actually started using it.

On a very high level, I find it rather simple actually. You really only ever need to know two commands: tmux (to start), tmux attach (to existing session).
Once you're attached to a session, it's all ctrl-b (or whatever you rebind this to) and some_key. That's all there is to it, really.

Not sure where the OP got reattach from. You can find "tmux attach" barely one page into the man page without much reading at all.

FreeNAS Guru

I just finished running badblocks on two WD40EFRX WD RED 4TB drives and it took slightly over 72 hours to run 4 passes per disk.
Each pass consists of a write sequence (9 hours writing to the disk) followed by a read-and-compare sequence (another 9 hours reading from the disk), for a total of about 18 hours per pass.
It is unfortunate there is no estimated time for the test completion.
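
In the absence of a built-in estimate, the total can be worked out from one timed pass. This is a back-of-the-envelope sketch using the figures from this post; the helper name is made up:

```shell
# Hypothetical helper: total hours = passes x (write hours + read hours).
# Usage: badblocks_total_hours <passes> <write_hours> <read_hours>
badblocks_total_hours() {
    echo $(( $1 * ($2 + $3) ))
}

# The figures above: 4 passes, 9 h write and 9 h read-back each.
#   badblocks_total_hours 4 9 9   -> 72
```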

Newbie

So I started doing the badblocks tests and had to log out of the GUI view. Now when I log back in, I can't access the shell anymore. I can see that the tests are running since the drive activity light is lit and if I go to reporting, I can see solid drive activity over the last 40 hrs.

How do I get back in to view progress, and how long should it take to do 6 x 3TB WD Red hard drives?

FreeNAS Guru

Did you start the 6 concurrent badblocks tests inside tmux? If so, you can get back to them with the "tmux attach" command in the shell.

On my setup, it took 72 hours to run 4 passes on my 4TB drives. A 3TB drive has 75% of that capacity, so I would expect about 54 hours; give it a few extra hours to be safe. One of my drives finished sooner than the other, maybe a 10% speed gain, so your mileage may vary.
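
That estimate is just the known runtime scaled by capacity. Here is a quick sketch of the arithmetic, assuming the drives sustain roughly the same throughput (which may not hold across models); the helper name is made up:

```shell
# Hypothetical helper: scale a known runtime by drive capacity.
# Usage: scale_hours <known_hours> <known_tb> <target_tb>
scale_hours() {
    # Integer arithmetic, so the result is rounded down to whole hours.
    echo $(( $1 * $3 / $2 ))
}

# 72 hours on a 4TB drive, scaled to a 3TB drive:
#   scale_hours 72 4 3   -> 54
```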

If you can't attach back to tmux, I would suggest running "top" to see which processes are running.