3ware 9690: problems building RAID6 (but RAID0 is ok)

I've got a strange problem building a RAID6. This is a somewhat long-winded story, but I'll try to be concise.

The hardware is a Supermicro X7SB4/E with a 3ware 9690SA-4I controller (firmware FH9X 4.10.00.027, driver 2.26.02.011). An LSILOGIC SASX36 A.0 backplane with 24 bays is connected to the controller via a single cable. 12 bays are filled with WD 2TB drives in a RAID6, and this RAID has been running for ~2 years with no problems. When more space was needed, I purchased 12 Hitachi Deskstar 7K3000 3TB drives to fill the remaining 12 bays as a second RAID6, and that is where the trouble started.

When I put in the drives, all were correctly identified by the controller. A few moments after I started initialising a RAID6, however, several drives failed, each throwing a "degraded" error, and the initialisation aborted. So I deleted the RAID6 and started again, and this time one or two _other_ drives failed after a few moments. Swapping drives from bay to bay makes no difference. Every time I build a RAID5/6, some drives fail after a few moments and the initialisation is aborted.

So then I tried to build a RAID0, and lo and behold, it worked. The server is running Openfiler, the RAID0 is recognised as a new block device in the system, and I can export it via iSCSI, format it and use it.

I contacted LSI support, and they instructed me to run a diagnostic tool and send them the output. They came back to me with the following issues:

-the old 2TB drives are connected at 3.0 Gbps, the new ones at 1.5 Gbps. Why that is, I don't understand. I tried to enforce 3.0 Gbps without any effect. When I enforce 1.5 Gbps, the WD drives are no longer visible to the controller, and since they hold the OS, the machine won't boot. So I've set the connection speed back to Auto again. The new Hitachi drives can do 6 Gbps, so I don't know why they connect at just 1.5 Gbps.

-the controller logs report "Cable CRC errors". Of course that's what LSI's support rep homed in on. So I opened the case, removed the cable from the controller and backplane (the connectors sat really snugly), cleaned the connectors and re-attached the cable. This didn't change anything. Also, if there were serious issues with the cable, why would a RAID0 work when a RAID6 doesn't? I have to note, however, that there are 4 so-called "phys" on the controller (a phy seems to be some sort of transmitter for the signals between controller and backplane). Apparently, each of them controls 6 bays, and I can see that the cable contains four sub-cables. So in principle it is possible that phys 0-1 deal with the old drives and phys 2-3 with the new ones. But that still doesn't explain why I can make a RAID0 with the new drives.

So that's where I'm stuck. Does anyone have ideas what I could try? The obvious thing, of course, is to order a new cable (and I will probably do that), but I have my doubts that it will change anything.

Can you post the SMART data of each harddrive? In particular, bad sectors and cabling errors are interesting.
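For reference, a minimal sketch of pulling those counters out of smartctl's attribute table. It assumes the usual ATA attribute layout printed by `smartctl -A` and the commonly used attribute IDs (5, 197 and 198 for bad sectors, 199 for interface CRC errors); the function name is made up for illustration. Behind a 3ware controller the drives are normally queried through the controller device, for example `smartctl -A -d 3ware,N /dev/twa0`, where the port number N and the device node are assumptions about this particular setup.

```python
import re

# Attribute IDs commonly associated with bad sectors and cabling problems.
# The names are the labels smartctl usually prints; vendors may differ.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} for the watched attributes.

    `text` is expected to be the output of `smartctl -A`. Attribute rows
    have ten columns:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    results = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED:
            continue
        # The raw value may carry trailing annotations; keep the leading integer.
        match = re.match(r"\d+", fields[9])
        if match:
            results[attr_id] = int(match.group())
    return results
```

A non-zero and growing raw value for attribute 199 usually points at the link (cable, connector, backplane) rather than at the platters, while 5, 197 and 198 point at the media.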

You have used OCE (Online Capacity Expansion). This procedure is not without risk; in particular, bad sectors are a high risk during the rebuild. You should always do a rebuild before expanding, to minimise the risk of making your array inaccessible due to an aborted expansion attempt.

Can you post the SMART data of each harddrive? In particular, bad sectors and cabling errors are interesting.

I have included the smartctl output for two drives below: one is from an older WD 2TB drive in the RAID6 that has been working well for years, the other from one of the 3TB drives I recently added.

You have used OCE (Online Capacity Expansion). This procedure is not without risk; in particular, bad sectors are a high risk during the rebuild. You should always do a rebuild before expanding, to minimise the risk of making your array inaccessible due to an aborted expansion attempt.

I'm not sure I understand what you mean. There was an existing 18TB RAID6 consisting of 12 WD 2TB drives, and I haven't modified that system. I have simply filled 12 empty bays with new 3TB drives to create a new, separate RAID6 with ~28TB capacity, and this doesn't work. From the controller's point of view there are two separate units which don't have anything to do with one another.
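As a quick check on those capacities, here is a small sketch of the RAID6 arithmetic: two drives' worth of space goes to parity, so the usable size is (n - 2) times the drive size. The function names are made up for illustration, and the sketch assumes that drive sizes are quoted in decimal terabytes while the controller or OS may report binary units (TiB).

```python
TB = 10**12   # decimal terabyte, as printed on drive labels
TIB = 2**40   # binary tebibyte, as many tools report capacity


def raid6_usable_bytes(n_drives, drive_bytes):
    """Usable capacity of a RAID6 unit: two drives' worth is parity."""
    if n_drives <= 2:
        raise ValueError("RAID6 needs more than two drives")
    return (n_drives - 2) * drive_bytes


def to_tib(n_bytes):
    """Convert a byte count to binary tebibytes."""
    return n_bytes / TIB


old_unit = raid6_usable_bytes(12, 2 * TB)   # 12 x WD 2TB
new_unit = raid6_usable_bytes(12, 3 * TB)   # 12 x Hitachi 3TB

print(f"old unit: {old_unit / TB:.0f} TB = {to_tib(old_unit):.1f} TiB")
print(f"new unit: {new_unit / TB:.0f} TB = {to_tib(new_unit):.1f} TiB")
```

Under these assumptions the existing unit comes out at 20 TB, or about 18.2 TiB, which matches the 18TB figure, and the new unit at 30 TB, or about 27.3 TiB, close to the ~28TB quoted.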

Any comments are much appreciated.

Regards,

Enno

Here is the smartctl output of one of the new 3TB Hitachi drives. The number of errors varies from drive to drive, but is on the order of 50-100:

It appears that I've found the culprit: it's the backplane (SAS846EL1). It's only specified for SATA-2, so SATA-3 drives attached to it are not expected to work flawlessly (which the manufacturer confirms).

So I'll have to work my way around that. Anyway, many thanks for your replies; they are much appreciated.