I have been doing some research into which drives might be compatible with my Adaptec 5805 RAID adapter, or any hardware RAID adapter for that matter.

In this search, and in my experience over the last few years, I've found that RAID-type disks do not exist for nothing. These disks are often picked from better batches, more solidly constructed, heavier and more robust, to withstand 24-hour duty and constant thrashing of the disk. They also often have tuned firmware to deliver the best possible results in a RAID, and not so much in a desktop. And they sometimes have a different warranty period compared to a desktop disk.

Desktop disks have traditionally been a lot cheaper, but not always safe to use in a (hardware) RAID array. Sometimes problems would surface immediately, and sometimes only after a while. This led me to find out what the main cause could be. In my opinion it has to do with TLER / CCTL; nothing else is really needed to make your RAID controller and the drive understand each other.

Let me try to explain:

Which action should an HDD take in a situation where it has a hard time reading or writing something?

Desktop disk on its own. The disk develops a fault and keeps retrying, sometimes for minutes, in the hope of still recovering the data. For a single drive this is the right behaviour: there is no redundant copy, so every last attempt counts.

Desktop disk on a RAID controller. The disk develops a fault and keeps retrying, saying nothing to the controller. The RAID controller expects an answer within roughly 8 seconds; when none comes, it assumes the disk has died and drops it from the array, even though nothing much is wrong with it.

RAID controller with RAID disks. A disk develops a fault, tries for 7 seconds, then tells the RAID controller that it failed to read the sector. The RAID controller acknowledges this, rebuilds the sector from its RAID redundancy, writes the data to a different sector on the disks and marks the original sector as bad. RAID controller happy, disk happy, no dropped disks!

As you can see, mixing a desktop disk with a RAID controller can produce unwanted results: disks get dropped while nothing is really wrong with them, and in the end it can become almost impossible to rebuild the RAID array at all. Over time, as disks inevitably start to develop minor faults, this problem will surface and gradually become worse.

How it became known

It first came to light that any disk can do either, and that it's just a value in the firmware, when WD leaked a tool (WDTLER) with which this value could be set. They quickly disabled the functionality in newer firmware revisions and drives, to keep selling their much more expensive RAID editions (a 50% increase in price for nothing in cost; a good deal for them). There is a good topic over at the HardOCP forum about which drives work and which don't. WDTLER changed the setting in such a way that it would survive a reboot, actually altering the firmware itself.

Breakthrough 1

But since the newest patches/SVN builds of smartmontools (smartctl), it's now possible to use SCT commands to change the value that SCT ERC (Error Recovery Control) is set to. Since this is done through SMART, any modern disk should support it. A big thank-you goes out to r.gregory, who made this possible!

Breakthrough 2

It's now even possible, using most modern RAID controllers, to issue these commands to disks inside a hardware RAID. And that's where it gets interesting: if we can set the error recovery values manually, we can ensure that at least this problem will not bite us again in the future!

There is only one slight problem: the set value is reset on every power cycle (reboots mostly do NOT affect it). So under Windows or Linux this is no problem; just create a script which sets it upon boot. I myself run VMware ESX or ESXi on my servers, which posed a slight problem. My solution is a custom Fedora 12 USB boot stick, which boots in about 30 seconds, sets the value and automatically reboots. Pull the stick out during the reboot and the system boots from the HDD with VMware on it. Problem fixed.
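For Windows or Linux, such a boot-time script can be very simple. Here's a minimal sketch in Python; the device list and the 7-second value are assumptions to adjust for your own system, and it needs a smartctl build with SCT ERC support on the PATH:

```python
import subprocess
import sys

# Assumption: adjust this list to match the drives in your system.
DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]

def scterc_args(device, read_ds=70, write_ds=70):
    """Build the smartctl command that sets SCT ERC on one drive.
    Values are in deciseconds: 70 = 7.0 seconds, the usual
    RAID-edition default."""
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]

def set_all(devices):
    for dev in devices:
        # check=False: carry on even if one drive rejects the command
        subprocess.run(scterc_args(dev), check=False)

# Only touch the drives when explicitly asked to.
if __name__ == "__main__" and "--apply" in sys.argv:
    set_all(DEVICES)
```

Run it with `--apply` from an init script or scheduled task at every boot, and the value is back before the array sees any load.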

So the only thing left now is to find out which disks are compatible with setting this command, and which are not.

So, how do we test this?

Your controller does not matter (disks behind hardware RAID controllers are a bit more tricky and need a different approach though; see the smartmontools website).

Windows users: download the Windows installer; this is the newest 5.40 build. It gives you the smartctl executable, which we'll need. Users of other OS flavors will need to build their version from SVN, since the current binary releases do not have this functionality in them yet.

Smartctl can be used for many other things besides this: it can tell you a lot about the life of your disk, report the current temperature for monitoring purposes, and so on.
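As an aside, pulling such values out of smartctl's output is easy to script. A hypothetical sketch; the sample line is what `smartctl -A` typically prints for attribute 194, but the exact columns vary per drive and smartctl version:

```python
import re

# Sample of what `smartctl -A` typically prints for attribute 194;
# the exact columns vary per drive and smartctl version.
SAMPLE = ("194 Temperature_Celsius     0x0022   113   103   000    "
          "Old_age   Always       -       34")

def drive_temperature(smartctl_a_output):
    """Return the raw Temperature_Celsius value, or None if absent."""
    m = re.search(r"Temperature_Celsius.*?(\d+)\s*$",
                  smartctl_a_output, re.MULTILINE)
    return int(m.group(1)) if m else None
```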

None of these tests below will touch your data.

When you have that installed, and the disk you wish to check is, for example, your "d:" drive, execute the following:

The Tests

smartctl -l scterc d:

If correct, this will give you the following feedback on a desktop disk:

SCT Error Recovery Control:

Read: Disabled

Write: Disabled

Now we are going to try and change that value:

smartctl -l scterc,70,70 d:

If that works, you will see the following feedback:

Read: 70 (7.0 seconds)

Write: 70 (7.0 seconds)

To put it back to original values again, either turn off the power of your system, or run "smartctl -l scterc,0,0 d:"
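If you want to script these checks, the output above is easy to parse. A minimal sketch, assuming smartctl's usual "Read:/Write:" lines (the exact whitespace may differ between versions):

```python
import re

# Example of `smartctl -l scterc` output: Read disabled, Write set to 7.0 s.
SAMPLE = """\
SCT Error Recovery Control:
           Read: Disabled
          Write:     70 (7.0 seconds)
"""

def parse_scterc(text):
    """Return {'Read': deciseconds or None, 'Write': ...} from
    `smartctl -l scterc` output; None means ERC is disabled."""
    values = {}
    for m in re.finditer(r"(Read|Write):\s+(Disabled|\d+)", text):
        values[m.group(1)] = (None if m.group(2) == "Disabled"
                              else int(m.group(2)))
    return values
```

A monitoring script could call this after boot to verify the drives still hold the value you set.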

Interesting tests are also whether the value survives a reboot or even a power cycle (from what I understand, it should survive a reboot but not a power cycle).

Since I think this information is needed for anyone trying to build a home RAID array, we should keep a sort of database in this topic, maybe something like this:

But, doing closer research, this is only because of one single setting: TLER / CCTL / etc.

Just to clarify... it's more than that. Much more, in terms of internal firmware tweaks, and some drives also have physical changes to permit better vibration tolerance. However, you are partially correct in that the error recovery period is one of the more important reasons.

Also edited the part where it stated that there is no other difference; that was a bit too harsh a statement. Setting this value does not give you a RAID drive in firmware or hardware (more durable, specifically tuned), it only resolves not being able to use the drive in a RAID. Quite right you are.

I'm also after the information regarding the 2T Samsung drives. I will probably just get them in a week or so and give it a go.

I'm sort of in the same boat as I'm using a hardware RAID card with a Windows OS. This means I will have to run a Linux version of the tool (for hardware RAID support) and then reboot before loading Windows.

I'd be forever grateful if you'd share a copy of your USB stick, if you don't mind. I could then edit the command to suit my setup.

Thanks

My 7x 2TB Samsung disks should arrive tomorrow; sadly, I did not want to wait any longer for anyone else to test them first.

I will certainly share my experience with them, maybe post some benchmarks, etc. (I'll make a different topic for that).

Sharing my boot stick... hmm, I'll have a go at that. I'll make a backup using "dd" and then zip it to see how big it becomes. Of course, it will only be useful if you're using an Adaptec card (2 and 5 series); otherwise it's almost easier to create a new one from scratch and integrate your own drivers and admin console (mine boots the Adaptec agent so I can connect to it from a different host to manage the Adaptec itself, which cannot be done from VMware).

Don't expect too much from it; it's quite a crude installation. Just a full install, which I set to runlevel 3 so it doesn't boot the graphical UI, then it runs some scripts in the background and reboots again. Oh, and I also compiled the newest SVN of smartctl on there. That worked for me, so I stuck with it!

I'm afraid I have some catching up to do in the Linux space, and would have issues doing half of what you mentioned above :S

I'm using a Dell PERC 5/i card (a rebranded LSI MegaRAID card) that should work fine with smartmontools. I tried one of the Linux live CDs yesterday, and it seems to recognize the card, but I'm having issues getting the commands to work. I know the ERC commands won't work, but I kept getting errors just trying to query the HDDs connected to the card. I guess I'll have to look into that a bit more.

I'm thinking that as long as Greg's ERC version of smartmontools is compiled and loaded onto a bootable live distro, that should suffice. I might be wrong, as I'm only speculating.

Anyway, I would appreciate it if you could upload an image of the USB stick somewhere (or PM me for my e-mail) so I could give it a go.

Thanks!

Hahaha, my Linux skills are noobish at best! I just try until it works.

Again, I can put the zipped dd files of the stick on my FTP server for you if you really wish. But you will need an identical USB stick (exactly identical; in my case 8 GB), and the stuff I made only works for Adaptec cards; your PERC does not work in that way, so it would be completely useless for you... The stick I have now is also a very crude, big install.

My best bet for you would be: download a Debian ISO (Debian is easier than Fedora in my opinion, but Fedora is best for Adaptec) and install that to your USB stick. Compile the SVN of smartmontools (apt-get install subversion gcc automake, and, ehm, the C++ modules for gcc), then follow the instructions on the smartmontools homepage. Then try to use smartctl with the megaraid switch (the Dell PERC 5/i is an OEM version of an LSI card; the exact type escapes me right now).
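For reference, smartctl addresses a disk behind a MegaRAID-family controller (which the PERC 5/i is) with its `-d megaraid,N` switch, where N is the controller's physical disk number. A small sketch of building that command; the device path and disk numbers here are assumptions for your own setup:

```python
def megaraid_scterc_args(host_dev, disk_no, read_ds=70, write_ds=70):
    """smartctl argv for a physical disk behind an LSI/MegaRAID
    (or rebranded, e.g. Dell PERC) controller.  host_dev is the
    block device the OS shows for the array, e.g. /dev/sda;
    disk_no is the controller's physical disk number."""
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}",
            "-d", f"megaraid,{disk_no}", host_dev]
```

For example, `megaraid_scterc_args("/dev/sda", 0)` builds the command for the first disk behind the card; loop over the disk numbers to hit the whole array.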

Sorry to disappoint you on this. It also took me 2 days to get it working, but believe me when I say my stick will not work for you. Once you figure it out, it's not that hard though...

To everybody else: come on people, submit those specs! Single disks connected to motherboard controllers are fine too.

I am unsure that RAID controllers keep an 8-second timeout at rebuild time...

==> That would still mean there is a problem with HDDs not timing out after 7 seconds, but that would not lead the whole array into an unrecoverable situation.

Sorry to say, I am not quite clear on what you mean exactly...

The 7 seconds is chosen because most RAID-edition HDDs from various manufacturers are also set to 7 seconds, and their documentation mentions that RAID controllers wait for 8 seconds. So this value isn't guesswork. Sure, some RAID controllers will wait 20 or 30 seconds; that still shouldn't pose a problem. If a RAID controller waits less than the stated 8 seconds, you will still run into problems; but since you can change the value to whatever you'd like, you can set it to the lower value needed.

The whole reason we are looking at TLER is that if the RAID array evicts a disk because of an error (which may or may not have been preventable with TLER), it will want to rebuild onto a new disk. If during this process another disk encounters an error, and that disk does not report it in time, the rebuild stops and kills your array altogether. But if TLER catches the error and tells the RAID controller, the controller can choose what to do.

In conclusion: we want the disk to try to fix/reallocate the sector for up to 7 seconds; if it cannot do so in that time, it should report the sector as bad and let the RAID controller decide what to do. In the rebuild situation you describe, the RAID controller will mostly tell the disk to try again and again until it succeeds, because there is nothing else to rebuild from.

So presumably this tweak is only useful when hardware RAID controllers are used? Does it make a difference in software RAID systems, like BYOD NAS servers, which usually run Linux softraid over ICH controllers?

Really, how important is this? How often does a drive hang on a read/write? Can someone explain the nitty-gritty of this? Is this something simple like: "oh, fragmented file, or corrupted data, let me fix it, oops, took too long, RAID controller booted me"?

I hear all the time about how drives get booted, and then when a detailed diagnostic is undertaken the drive is fine. I'm guessing that for many of the booted drives it isn't an actual error, but an impatient controller. How many times does this happen during day-to-day computing without us realizing it? Or is this hardly ever happening, but we never want it to?

Could anyone give a detailed explanation of what exactly is happening when the drive times out, and what precautions, if any, we can take to avoid it?

I know I have tried accessing files where the drive took longer than normal to read, or copied a file from one directory to another on the same drive and it took the better part of forever. Are these scenarios where, had the drive been in a RAID, it would have been booted?

I will give you a reply soon. It's all in the starting post, but I'll try to explain with a realistic situation.

Critical. It is not acceptable for otherwise-good drives to drop from arrays at random, especially in production use. Seeing a 10x to 100x increase in failure rate among what are good drives is utterly unacceptable, especially in a product you expect to be 99.9% reliable on initial power-up.

When you have to fail twenty drives a day due to poor firmware on the HDD maker's part, something isn't right.

And keep in mind, especially with RAID 5, you can only tolerate a single disk failure. What if an actual bad disk happens? Then while you rebuild (which can take hours to days in heavy-load scenarios) you are intensely vulnerable to a second disk failure.

The point of RAID is generally to maximize uptime. Disks being kicked out of the array under heavy use every few hours, when it should be days, weeks or months between actual failed disks, is a one-to-two-order-of-magnitude increase in failure rate. Utterly unacceptable.

The problem with these SCT values is that just because a drive reports it supports it and accepts a value doesn't mean it actually *works*. Some drives report they support it and will happily accept and keep values you set but not actually abort recovery once the set timer expires.

I too would like to know where you found this information, or whether you have experience with it. All my research seems to suggest that these values work when set. It's also mandatory since the ATA-8 specification, so not actually doing the work would make the drives non-compliant with the spec. WD is the only one that I know of that has disabled it; or rather, they have disabled their WDTLER tool. I don't have the drives, so I cannot test with smartctl.

Any information welcome!

Why would a drive maker do that? That doesn't make any sense. Do you have any proof that this has happened?

Why would they do that? I have no idea. Same reason they do anything else that doesn't make sense: password protection that doesn't work, 1.5/3 Gbps jumpers that don't work, or utilities that set 28-bit LBA mode that the drive doesn't obey. It's not an advertised feature, nor is it compulsory. Since it's not a feature that's advertised, I don't think it's unusual that some manufacturers implement it while others don't, and others still partially implement it in a way that doesn't work. Par for the course, really.

I too would like to know where you found this information or have experience with it. All my research seems to suggest that these values work, when set. It's also mandatory since the ATA-8 specifications, so not actually doing the work invalidates your drives to the specifications. WD is the only one that I know of that has disabled it. Or rather, they have disabled their WDTLER tool, don't have the drives so I cannot test with smartctl.

Any information welcome!

I found this information by setting various SCT error recovery timer values then telling the drive to read known bad sectors and observing how long the drive took before returning an error.

Most drives that "support" SCT don't state anywhere that they are ATA-8 compliant. Nowhere on the drive or its packaging does it say it should or does comply with the ATA-8 specification. Again, as an optional feature that doesn't even exist as far as the manufacturer is concerned, there's no obligation for them to implement it correctly. Aside from enterprise drives, which are specifically sold on the basis that they have variable ERC, one cannot rely on simply being able to query and set this value as indicative of it actually working. Unfortunately, the only way to reliably test whether it works is on a drive with known bad sectors. As a matter of fact, I've tested several Toshiba drives with bad sectors, and they behave in this way. I'll test a few more when I get time. None of my Samsung HD154UIs have any bad sectors, unfortunately, so I can't say whether they obey the value or not.
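The timing test described above can be scripted. A rough sketch, assuming Linux, dd, and that you already know the LBA of a bad sector from the drive's SMART error log; with a working 7-second ERC setting the read should error out in roughly 7 seconds, while a drive that ignores the setting can grind away for minutes:

```python
import subprocess
import time

def time_sector_read(device, lba):
    """Time a single 512-byte read at the given LBA.
    iflag=direct bypasses the page cache so the disk really
    has to read the sector from the platters."""
    start = time.monotonic()
    subprocess.run(
        ["dd", f"if={device}", "of=/dev/null", "bs=512",
         f"skip={lba}", "count=1", "iflag=direct"],
        check=False,            # dd is *expected* to fail on a bad sector
        capture_output=True)
    return time.monotonic() - start
```

Set the ERC value, time the read, set a different value, time it again: if the elapsed time tracks the value you set, the drive really obeys it.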

All I'm saying is just because you can set it doesn't mean it'll actually work.

All I'm saying is just because you can set it doesn't mean it'll actually work.

Ok good info. Your information seems plausible.

Right now, though, there is no information on whether it will or will not actually work for the drives currently in the topic.

But as this thread is about collecting information about SCT, this is very good info! Sadly, as you say, testing it would be quite hard.

I'll monitor my drives, and if one develops a bad sector I will try to test it as you describe. For now, I am going to keep advising people to set this value on their drives: if it works, it's beneficial, and if it does not, no harm is done. For the current drives it's unknown whether it actually does something in reality, per your findings. I believe it does work, but as said, your information seems very plausible, and you have done tests, which I have not, because I have no drives with bad sectors.

Would you mind sharing your exact testing procedure, so people who might have bad sectors can try it?