Re: RaidFrame errors

On Sun, 2 Jun 2013 17:14:17 +0100
William Ross <williamrossmbsw%gmail.com@localhost> wrote:
> I have a NetBSD box running 6.0.1 i386. It has four 3TB HDs with two
> raidframe raid arrays configured.
> The first raid array is a raid0 for / (currently over wd0 and wd3,
raid0 (the device) is a RAID 1 set, yes?
> using 5GB on each disk), the second a raid5 for a data partition
> (over wd0, wd1, wd2, wd3 using all remaining space (reporting 8TB in
> total)).
>
> A week ago the system became unresponsive with many errors like the
> following in /var/log/messages:
>
> Jun 2 14:50:31 ex-fl-sr-03 /netbsd: wd1d: error reading fsbn
> 5804545856 of 5804545856-5804545983 (wd1 bn 5804545856; cn 5758478 tn
> 0 sn 32), retrying
> Jun 2 14:50:31 ex-fl-sr-03 /netbsd: wd1: (uncorrectable data error)
> Jun 2 14:50:31 ex-fl-sr-03 /netbsd: ahcisata0 port 1: device present,
> speed: 3.0Gb/s
>
> At that point the / raid1 was running on wd0 and wd1 and had the
By this you mean a RAID 1 set for / ?
> component running on wd1 listed as failed. I added a preprepared
> partition on wd3 to that mirror and rebuilt it. At present both the
> part on wd0 and wd3 are reporting as optimal.
>
> The odd part was that raidframe had listed the part of the raid5 data
> partition on wd0 as failed (the errors in /var/log/messages only ever
> referred to wd1) and the part on wd1 as optimal.
If I had to guess, I'd bet that at some point a component label didn't
get written to wd0 for the RAID 5 set, and that caused it to be marked
as 'failed' on a subsequent boot.
> I reseated the drives, rebooted the system and all the drives seemed
> OK. As there were no errors reported for wd0, and raidframe seemed
> happy with the part of the raid5 on wd1 I set the array rebuilding on
> wd0.
Well... at this point you've had errors with wd1 while wd0 was marked
as failed, which is (technically) a potential 2-drive failure on the
RAID 5 :(
> Today (5 days later - these are 3TB drives) the rebuild failed at 99%.
> Again there are errors in /var/log/messages about wd1 (see above).
> Again the raid5 has failed on the section on wd0 (although in this
> case it never completed rebuilding). The rebuild failed 17 seconds
> after these errors started being printed to the log:
>
> Jun 2 14:50:48 ex-fl-sr-03 /netbsd: raid1: Recon read failed: 5
> Jun 2 14:50:48 ex-fl-sr-03 /netbsd: raid1: reconstruction failed.
> Jun 2 14:50:48 ex-fl-sr-03 /netbsd: ahcisata0 port 1: device present,
> speed: 3.0Gb/s
This shouldn't be wd0 -- the above should be for wd1. It's having
problems reading wd1, and thus can't fix wd0 (which remains failed,
because it couldn't get it back to 100% rebuilt).
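
To illustrate why a read error on wd1 stops wd0 from being rebuilt,
here is a minimal Python sketch of parity reconstruction. It is not
RAIDframe code: ReadError, xor_blocks and rebuild_stripe_unit are
made-up names, and the only assumption is the usual RAID 5 rule that
a missing unit is the XOR of every surviving unit in the same stripe.

```python
from functools import reduce


class ReadError(Exception):
    """Hypothetical stand-in for an uncorrectable read error from a disk."""


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe_unit(surviving_reads):
    """Rebuild the missing unit of one stripe.

    surviving_reads holds one callable per surviving component; each
    returns that component's bytes for the stripe or raises ReadError.
    Every surviving unit is needed, so one failed read aborts the rebuild.
    """
    return xor_blocks([read() for read in surviving_reads])
```

With one component already failed there is no spare redundancy, so a
single unreadable sector on any other component is enough to make the
reconstruction of that stripe impossible.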
> My reading of the situation is that raidframe is incorrectly failing
> the part of the raid5 on wd0 due to read errors on wd1.
No.... wd0 had failed before, and hadn't been in use. For how long, I
have no idea, but my guess would be for at least a few reboots. Read
errors on wd1 are causing issues with getting wd0 to rebuild.
> As there are
> read errors on the part of the raid array on wd1 (with no redundancy
> as one member of raid has been failed) I need to get as much of the
> data off the raid as possible and rebuild from scratch, probably
> after replacing wd1 as a failed drive.
>
> Do you agree?
Yes.
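
For getting data off, the general idea is to copy what can be read and
keep going past the blocks that cannot, similar in spirit to
dd conv=noerror,sync. A rough Python sketch for regular files on the
mounted file system follows; salvage_copy and its parameters are
hypothetical names for illustration, not an existing tool, and this
assumes a failed read surfaces as OSError.

```python
import os


def salvage_copy(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst block by block, zero-filling unreadable blocks.

    Returns the byte offsets of the blocks that could not be read, so
    the damaged regions of the copy are known afterwards.
    """
    bad_offsets = []
    size = os.path.getsize(src_path)
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                data = src.read(want)
            except OSError:
                # Unreadable block: remember where it was and move on.
                data = b""
                bad_offsets.append(offset)
            # Pad short or failed reads so later offsets stay aligned.
            dst.write(data.ljust(want, b"\0"))
            offset += want
    return bad_offsets
```

Copying the most important files first makes sense, since every extra
read puts more load on a drive that is already reporting errors.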
> Any idea why raidframe seems to be failing the wrong member of the
> raid5 thus invalidating the whole thing?
See above. (It's leaving wd0 as 'failed' because it can't rebuild it,
and it was marked as 'failed' (for whatever reason) before the rebuild
started. It's not marking wd1 as 'failed' because that would leave you
with a RAID set that is completely useless. At least right now you can
attempt to get some data from it...)
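
Put as a tiny Python sketch (illustrative only; raid5_usable and the
state strings are assumptions, not RAIDframe's internals): a RAID 5
set can still serve data with at most one component out of service.

```python
def raid5_usable(states):
    """Return True if a RAID 5 set with these component states can serve data.

    states maps a component name to "optimal" or "failed". Single parity
    covers the loss of exactly one component, never two.
    """
    failed = sum(1 for state in states.values() if state != "optimal")
    return failed <= 1


# The set as it stands now: wd0 failed, wd1 flaky but still in service.
current = {"wd0": "failed", "wd1": "optimal", "wd2": "optimal", "wd3": "optimal"}

# What it would look like if wd1 were marked failed as well.
both_failed = dict(current, wd1="failed")
```

Here raid5_usable(current) is True and raid5_usable(both_failed) is
False, which is why keeping wd1 in service, errors and all, is the
only state that leaves anything readable.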
> Thanks in advance,
Good luck!
Later...
Greg Oster