3Ware tw_cli getting DEGRADED and ECC-ERROR on rebuild

I have some older servers still running 3Ware RAID cards. They work great, and tw_cli gives them a nice command line interface for managing things. I recently had a drive fail, and when I went in to do the rebuild it stopped with an ECC error and never finished.

Versions:
CentOS 6.9
3Ware 9600S-8 Card
tw_cli

These are the steps I took to resolve the issue and get everything back into working order. First we’re going to remove the DEGRADED or FAILED disk.

Find the failed RAID drive

Each of my servers has a different cX card number, so I always issue a show first to find the RAID card, and then find the failed drive.
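
As a rough sketch of that lookup and the removal step, the commands look something like the following. Controller 0 is an assumption, port 5 matches the disk used later in this post, and the run helper with its DRY_RUN switch is purely illustrative: it is there so you can see what would be executed before touching the controller.

```shell
# Sketch only. Assumptions: controller c0 and port p5; check your own show output.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List the controllers, then the units and ports on one controller.
find_failed_drive() {
    run tw_cli show
    run tw_cli "/c$1" show
}

# Remove a port from the controller so the drive can be pulled.
remove_drive() {
    run tw_cli "/c$1/p$2" remove
}

find_failed_drive 0
remove_drive 0 5
```

Set DRY_RUN=0 only once the printed commands point at the controller and port you expect.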

Once the drive has been removed, it shows as not present. This is the point where you would pull the bad drive and replace it with a new one. In my case, I had already installed a new drive, and it was the rebuild that failed. So I know the drive is good; the rebuild failed because of the ECC error.

Add new spare drive for rebuild

After you have inserted the replacement disk, we need the controller to rescan for drives.

[root@host3 log]# tw_cli /c0 rescan

You can double-check that the drive is now showing up by issuing a show command. Now let’s add the new drive as type spare.

[root@host3 log]# tw_cli /c0 add type=spare disk=5

If you run show after adding the spare, you’ll see that we just added u1, which is type spare. The RAID will use it when performing a rebuild.

Rebuild the RAID and Ignore ECC

Now that we’ve got the new drive in and ready to go, we’ll need to issue a command to rebuild the RAID.

We’re going to pass ignoreecc to get the rebuild to work. This will get us past the problem, but be aware that there could be undetected corruption on disk at the spot where the ECC error occurred, since that is the error we’re ignoring.
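
A minimal sketch of the rebuild step follows, assuming controller 0, a degraded unit u0, and the new drive on port 5. The unit number is a guess based on the spare showing up as u1, so confirm it against your own show output first. The run helper and DRY_RUN switch are illustrative only and print the commands by default rather than running them.

```shell
# Sketch only. Assumptions: controller c0, degraded unit u0, new drive on port 5.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start a rebuild of the unit onto the given port, ignoring ECC errors.
rebuild_ignore_ecc() {
    run tw_cli "/c$1/u$2" start rebuild "disk=$3" ignoreECC
}

# Show unit and port status; rerun it to watch the rebuild progress.
check_status() {
    run tw_cli "/c$1" show
}

rebuild_ignore_ecc 0 0 5
check_status 0
```

Depending on the controller and firmware, a unit with a spare available may start rebuilding on its own, so it is worth checking the status before issuing the rebuild by hand.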

I’ve done this a few times and have lucked out in that it didn’t affect any data, but you never know.