md RAID reconstruction hints & tips?

This may sound like quite a trivial question, but could someone please confirm whether I need to unmount the hard drives before performing RAID reconstruction with the mdadm tool? Should I boot into single-user mode first, or from a live CD such as Knoppix? Which is the accepted or safest method, and how would I accomplish it?

Here's the deal: I've been experimenting with Linux RAID (md) and wanted to improve the availability of my server - the idea being that while everyone's asleep, I can swap a faulty drive for a new one.

I have /dev/sda and /dev/sdb mirrored (RAID1). Both sda and sdb have a "/" [md0] and "swap" [md1] partition, and both drives' MBRs have GRUB installed. Now, when sdb gets unplugged (after powering off, of course!), the system still boots into Linux. Reconnecting sdb and disconnecting sda yields the same result - flawless booting!
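For anyone wondering how GRUB ends up on both MBRs: with legacy GRUB you have to install it onto the second disk yourself. This is a sketch, not my exact commands - the device names and partition numbers are assumptions based on the layout described above:

```shell
# Install legacy GRUB onto sdb's MBR so the box can still boot if sda dies.
# (hd0,0) here assumes /boot lives on the first partition of each disk.
grub --batch <<EOF
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
quit
EOF
```

The `device (hd0) /dev/sdb` line is the trick: it tells GRUB to treat sdb as the first BIOS disk, so the stage1 it writes will still find its files when sda is gone.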

All is good so far. Now, let's plug sda & sdb back in and boot into Linux. Running "cat /proc/mdstat" and "mdadm --detail /dev/md0" reveals a degraded RAID array (with sdb flagged as faulty/foreign). So, using mdadm again, we can perform a "hot insert" of sdb. After a few moments, we can confirm (through /proc/mdstat and mdadm) that the rebuild completed. OK, on to rebooting the system. The first reboot after reconstruction seems flawless. So we shut the system down again, and unplug sda again.
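For reference, the "hot insert" sequence above boils down to a handful of mdadm commands. A minimal sketch, assuming the sda/sdb and md0/md1 layout from this thread:

```shell
# remove the stale sdb partitions from their arrays (if still listed)
mdadm /dev/md0 --remove /dev/sdb1
mdadm /dev/md1 --remove /dev/sdb2

# hot-add them back; md starts rebuilding onto sdb automatically
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2

# watch the resync until it finishes, then double-check the details
watch -n 5 cat /proc/mdstat
mdadm --detail /dev/md0
```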

Bad news this time around. Powering on the system, we immediately see GRUB's failed attempt to load the Linux kernel, citing a CRC error. And I thought this RAID1 mirror was perfect - even after reconstruction. So what went wrong? Here's GRUB's error message FYI:

Booting 'Debian GNU/Linux, kernel 2.6.8-2-386'

[..snip..]

Uncompressing Linux...

crc error

-- System halted

Am I correct in suspecting that mounted drives can't be mirrored (especially the boot blocks)? Is it possible that mounted drives have "locked" files or open "handles" which cannot be mirrored? Or am I way off the mark? Let me know - looking forward to your responses!

edit: While installing Debian 3.1 "Sarge", should you wait for the RAID to finish syncing during the partitioning phase (by switching to a console with Alt+F2 and running "cat /proc/mdstat"), or is it OK to go right ahead and format the arrays and install the base system files while they sync in the background?

After md0 was resynced with sda & sdb plugged in together, I rebooted and started getting errors after the kernel loaded, such as missing files and segmentation faults.

So I unplugged sdb again, and guess what? I booted fine! So does this prove that RAID arrays cannot be reconstructed while one of the drives is mounted? Or is there something more sinister going on here?

edit: Yes, I did wait until md0 completed resyncing before I hit reboot...

Just a quick question as I am thinking of doing this... I take it that the right way of doing the RAID is through Linux, as you have done: create duplicate partitions on both drives and then RAID them together using software. It's just that I thought if your motherboard has onboard RAID, can't you install everything to the first hard drive and then let the onboard RAID mirror it to the other hard drive? Or does it not work like that... Any advice you can give would be helpful.

In your situation it sounds like you have two RAID partitions working - swap and "/". When you repaired the RAID by syncing, did you do both RAIDs (swap and "/") or just one of them? Just thinking out loud that if you did not do both, then perhaps that caused your problem... I don't know much about this yet; I am only trying it today.

If you have any pointers please share....

Thanks

Hi mphayesuk! Yes, I've set mine up using the md software RAID provided by Linux. Even though I do have a motherboard with onboard RAID (aka fakeraid; see http://linuxmafia.com/faq/Hardware/sata.html), the Debian Sarge 2.6 stock kernel does not support dmraid natively (or at least not during installation). Therefore, you will always see two hard drives regardless of how they are set up in the onboard RAID's BIOS. FYI, I'm using a Silicon Image 3112 onboard RAID controller.

During the Debian install, I did exactly what you've described: /dev/sda1 and /dev/sdb1 combine to become md0, and /dev/sda2 and /dev/sdb2 combine to become md1. In turn, md0 becomes "/" and md1 becomes "swap".
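Outside the installer, the same layout could be built by hand. This is a sketch assuming the partitioning just described - obviously don't run it against disks that already hold data:

```shell
# build two RAID1 arrays from matching partitions on each disk
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2

# md0 becomes "/" (ext3 here), md1 becomes swap
mke2fs -j /dev/md0
mkswap /dev/md1
```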

Good question about md1 (swap). From memory, it seems as though md1 was automagically synced after plugging in both sda & sdb. Maybe that's because it's designated as swap; maybe it's a bug. Hopefully someone can shed some light on this. I say "from memory" because since then, I've reformatted the drives and only have a "/" - no swap.

This second install means that I now have only md0. Unfortunately, the segmentation faults still occur after simulating a hardware fault (as explained in the posts above) - I suspect sdb in md0 is corrupted. Given this, I think it's safe to say swap hasn't caused the problem here.

What I did was leave the motherboard RAID alone - no setup at all (i.e. no onboard RAID activated) - and then started the install. I set it up as both our posts said, duplicate partitions etc., but after the initial install the machine reboots as it should... then I get a "kernel panic - can't sync" message and that's when it dies. At this point I am stuck and don't know what else to try... any suggestions?

I think I have a solution, for RAID on SuSE anyway... you buy an Adaptec RAID card which has SuSE as a supported operating system - well, SuSE 8 and 8.1, but I would hope that SuSE 10 has the same drivers. For £30 it's worth a go.

So was your onboard RAID BIOS activated or disabled? Or did you mean that you did not set up the drive mapping for RAID within your onboard RAID BIOS?

I've never touched SuSE before, but I hear good things about it. SuSE is owned by Novell, right? There should be plenty of support on their website (or supporting sites). Which kernel does SuSE 10 use - 2.6.??, or the 2.4 series? You could try to set up RAID1 in SuSE with only one SATA drive and see what happens - do you still get kernel sync errors?

I haven't touched Adaptec RAID cards, but I'm assuming this cheaper version you are referring to may be using third-party chipsets (e.g. Silicon Image), and not their in-house Adaptec ASICs.

edit: BTW, if SuSE still doesn't work, you can try Debian and see if that works for you! I know, I know, but I don't mean to make one change sides ;-)

I tried onboard RAID disabled and enabled, and both with SuSE RAID on and off - if you know what I mean, all combinations have been tried - and I either get no boot or sync errors. I think the kernel is 2.4... I suppose a last resort might be to recompile with 2.6, but I am not that good with Linux and don't think I could do it.

Not sure what will be on the new RAID card, but on their website it did say SuSE is supported, so fingers crossed.

Yeah, I got Debian downloaded before the weekend, so I might give that a bash and see what I can do with it.

So far it's not worked... I have found drivers for my onboard SATA RAID that will work for SuSE 9, so I am assuming they should also work with 10. Also, with the new Adaptec card I thought about trying the 32-bit version, seeing as the drivers that came with the card are for SuSE 8 and 8.1 and are 32-bit drivers. I know it should not make a difference, but you never know.

I have also got Fedora Core 4 64-bit to try, as well as Red Hat Linux 9.1, and I'm just in the middle of burning Debian Sarge 3.1 - the whole 15 discs - so I am sure I should be able to get one of them working by the end of the week.

If anyone has any more pointers on anything to do with RAID, let me know... thanks for all your help so far.

Just a quick question: am I right in thinking that if I get RAID1 working, when I come to install SuSE or Debian etc. I should only see one hard drive in the partitioning table? And if I unplug the first hard drive, the second one would just take over - so all I need to do to test is shut down the machine, unplug hard drive 1, and power back up, and the OS should load, right?

I know that's the way Windows would do it, but I am not sure about Linux, with it using GRUB/LILO etc.

I've finally figured out what was going wrong. First, let me clear up any misconceptions: Linux software RAID (provided by md) works well - it allows you to rebuild a RAID1 array while the drives are mounted. Yes, it is perfectly safe to rebuild/sync an array while the drives are mounted.

The problems I experienced with CRC errors, segfaults, missing files, etc. were caused by data corruption! falko, you were on the right track with your link about the CRC errors. The corruption was NOT caused by old/aged hardware, though; it was actually a "bug" in the implementation of SATA on my motherboard.

Basically, if your motherboard is based on the nForce2 chipset and it has an onboard Silicon Image 3112 (3114 too?) SATA controller (and I'm sure there are many people out there using this combination!), then you will experience data corruption on your SATA drives. The fix? Update your motherboard BIOS to the latest version and apply this BIOS setting (located under Integrated Peripherals):

EXT-P2P's Discard Time = 1ms

And that's it. No more data corruption, and no more errors after resyncing/rebuilding the RAID arrays! So md was not the cause of the problem after all - and it took me more than a week to figure this out. I wish the motherboard manual had been more informative about this BIOS setting (the motherboard was released months after the problem was supposedly resolved!).
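One way to catch this kind of silent corruption earlier: recent 2.6 kernels let you ask md to read and compare both halves of a mirror via sysfs. A sketch, assuming md0 and a kernel new enough to expose sync_action:

```shell
# trigger a full read-and-compare pass over the mirror
echo check > /sys/block/md0/md/sync_action

# progress appears in /proc/mdstat while the check runs
cat /proc/mdstat

# afterwards, a non-zero value here means the two halves disagreed
cat /sys/block/md0/md/mismatch_cnt
```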

mphayesuk, you may see multiple md devices (e.g. md0, md1, etc.) depending on how you set it up. Also, unless the kernel (and/or dmraid, if it is available) has support for your RAID controller, you will still see your hard drives individually (regardless of how you set things up in the RAID BIOS).

Thanks again for everyone's help! I hope my description above will help others with similar problems in the future!

ryoken, can you explain step by step what you did? I am still having trouble with this. This is what I am assuming you did:

DID NOT - set up RAID on the motherboard, i.e. did not use Ctrl+A to access the RAID controller and configure RAID1.

YOU DID - use the partition options in the Linux install to mirror/copy the partitions on both drives.

After Linux was installed, you installed other software to make RAID work.

Thanks

Yes, I DID NOT set up RAID on the motherboard.

Yes, I DID use the partition options in the Linux install to mirror/copy the partitions on both drives.

BUT after Linux was installed, I DID NOT install other software to make RAID work. Software RAID support (md) was already provided during the Debian installation, hence RAID was working immediately after I finished installing - nothing to reconfigure or fiddle around with afterwards.
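To verify the installer-built arrays after the first boot, something like the following is enough (the array name is assumed from this thread, and the config path assumes Debian's mdadm packaging):

```shell
# a healthy two-disk mirror shows "[UU]" here
cat /proc/mdstat

# full status for the root array
mdadm --detail /dev/md0

# optionally record the arrays so they assemble by UUID at boot
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
```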

This howto pretty much sums up what I did to get md software RAID working: