Background: I had an Ubuntu Bionic system set up and running on 3 × 1 TB disks. Each disk was partitioned into one ~15 GB partition, with the remaining ~983 GB in a second partition.

The 15 GB partitions from two disks formed md0, a RAID 1 array used for swap.
The 983 GB partitions from all three disks formed md10, a RAID 10 far-2 array used for / and totalling around 1.4 TB.

What happened: One hard drive failed. The RAID 10 array carried on regardless. md0 required me to add the remaining unused 15 GB partition, but the system then booted. No worries \o/

What happened next: A couple of hours after I ordered a new drive, the filesystem went read-only and the system then failed to reboot. Essentially a second disk failed, though SMART reported a clutch of CRC errors rather than bad blocks.

It should be noted that I also had issues with a bad RAM stick prior to this, and system stability issues with the replacement RAM while this was happening. (Now resolved.)

Where I am now: I imaged the two disks with ddrescue and have been examining and attempting to repair the filesystem with TestDisk. (I'm currently re-imaging the disks for a fresh start.) Essentially one disk appears fine; the other shows no bad blocks or unreadable sectors while ddrescue is copying, but does appear to have filesystem issues.
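For reference, an imaging run like the one described might look roughly like this. The device and output paths are hypothetical, and the commands are printed via a dry-run helper rather than executed, since the real thing needs root and the actual disks:

```shell
# Dry-run helper: prints each command instead of executing it.
# Replace the echo with "$@" to actually run (as root, against real devices).
run() { echo "+ $*"; }

# First pass: copy everything readable, skipping slow per-sector retries (-n).
# The map file lets ddrescue resume and track bad areas across passes.
run ddrescue -n /dev/sdd /mnt/rescue/sdd.img /mnt/rescue/sdd.map

# Second pass: revisit the bad areas with direct disc access (-d),
# retrying each up to three times (-r3).
run ddrescue -d -r3 /dev/sdd /mnt/rescue/sdd.img /mnt/rescue/sdd.map
```

The map file is the important part: it means an interrupted or multi-pass copy never re-reads sectors already recovered.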

What I think happened is that, rather than a second disk failing in hardware, the bad RAM caused filesystem errors which have made that disk unreadable.

Two things are visible here: firstly, the sdd partitions have reverted to plain Linux (83) partition types rather than Linux RAID (fd); secondly, sdd1 seems to have gained 4096 sectors from sdd2 (unless I created the partitions that way... but I doubt that).

I haven't been able to get TestDisk to correct this: the version on Parted Magic doesn't seem to support Linux RAID partitions, and its suggestions to use fsck result in "bad magic number in superblock" errors even when using alternate superblocks.

Here are the results of mdadm --examine on the images mounted as loop devices, again with the good sdf first and the bad sdd second:
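Attaching the images as loop devices for examination can be done along these lines (image paths and loop names are hypothetical, and the commands are shown via a dry-run helper since losetup and mdadm need root):

```shell
# Dry-run helper: prints each command instead of executing it.
run() { echo "+ $*"; }

# Attach each rescued image read-only (-r) so nothing can modify the copy;
# --partscan (-P) creates /dev/loopNpM nodes for partitions inside a
# whole-disk image. --find --show picks and prints the next free loop device.
run losetup --find --show -r -P /mnt/rescue/sdf.img
run losetup --find --show -r -P /mnt/rescue/sdd.img

# Then inspect the md superblocks on the member devices:
run mdadm --examine /dev/loop1 /dev/loop2
```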

Again it's notable that sdd1 (aka loop2) has issues: no 'Used Dev Size' is listed.
I've tried recreating the array using the images, and while that seems to work, the array is unmountable (bad magic number in superblock again).

Questions
Does it look like I'm right in thinking that a corrupted partition map on sdd is the root of the problem?
Is it possible to fix that, and if so, with what? fdisk?

Desired outcome: make the array mountable so I can dump as much as possible to a different disk. I have a backup of /etc and /home (in theory — I haven't tried to restore it yet), but it would be helpful, and give peace of mind, if I could resurrect this array temporarily. A brief run of PhotoRec suggests a hell of a lot of files are recoverable too, but sorting through a nearly one-terabyte haystack of files without directory structure or filenames...

[SOLVED]
I put in place fresh copies of the disk images I'd made so none of my previous fiddling could mess things up. In fact one was a partition image and one a whole-disk image, so mounting them:

The important fact I missed in my research at the beginning of my own attempts is that you need to specify the devices for --assemble if you're reassembling the array on a new system. I didn't even realise that was possible to begin with.
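In other words, something along these lines did the trick. The array and loop device names are placeholders, and the commands are printed via a dry-run helper rather than executed:

```shell
# Dry-run helper: prints each command instead of executing it.
run() { echo "+ $*"; }

# On a rescue system there's no mdadm.conf to consult (mine lived on the
# broken array itself), so each member device must be named explicitly.
run mdadm --assemble /dev/md10 /dev/loop1 /dev/loop2

# Mount read-only so nothing gets written while copying data off.
run mount -o ro /dev/md10 /mnt/recovered
```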

What do you mean "recreated the array"? You should just have to assemble it. If you actually started fresh and created a new array with those drives, that would be bad.
– psusi Sep 12 '18 at 16:15

I'm not doing anything with the drives except imaging them with ddrescue, @psusi. I think my initial attempts at assemble failed because I didn't realise I needed to specify the devices (the OS, and thus mdadm.conf, was on the array).
– adrinux Sep 12 '18 at 17:32

1 Answer

It sounds like you've made copies and only worked on the copies; that's good!

"Used Dev Size" missing from the examine output isn't a problem, I think. Rather, I think it means the array is using the entire device. The other device shows a used size 4096 sectors less than its device size, which is consistent with one partition being 4096 sectors smaller. (When you created the array, mdadm used the smallest partition size for all devices; otherwise it wouldn't have been possible to build the array.)

I doubt anything corrupted your partition table. It'd be pretty rare for a sector you're not writing to get corrupted yet still appear mostly valid. There's nothing wrong with 83 as a partition type for mdraid; the other type (fd) is actually obsolete and shouldn't be used. Non-FS data (da, if I remember right) is also a good choice.

I think all you should need is mdadm --assemble --force /dev/md«WHATEVER» /dev/loop1 /dev/loop2. You should get a message about forcing in a not-up-to-date device, then it should assemble the array (degraded). You can then try fsck.ext4 (or whichever) on /dev/md«WHATEVER». If that works, you can probably do all of that from your system's initramfs to recover it, then just mdadm -a the new disk and let it rebuild.
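Sketched as a command sequence — the array name, loop devices, and the replacement partition are all placeholders, and the commands are printed via a dry-run helper because the real thing needs root and the actual devices:

```shell
# Dry-run helper: prints each command instead of executing it.
run() { echo "+ $*"; }

# Force-assemble from the surviving members; mdadm warns that it is forcing
# in a not-up-to-date device, then starts the array degraded.
run mdadm --assemble --force /dev/md10 /dev/loop1 /dev/loop2

# Check the filesystem; -n first (report only, no changes) before letting
# fsck actually repair anything.
run fsck.ext4 -n /dev/md10

# Once healthy, hot-add the replacement disk's partition and let md rebuild.
run mdadm /dev/md10 -a /dev/sdX2
```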

Thinking back, the three-disk array was created as 2 + 1 missing in the Ubuntu installer, the OS installed, and the third disk added later once booted. That would explain the discrepancy in size and the different partition-type designation. Good point about the size. That still leaves TestDisk's 'Bad relative sector' to be puzzled over.
– adrinux Sep 12 '18 at 17:19

@adrinux Your partition table is not lost; there is no reason to be using TestDisk to attempt to rediscover it. (And honestly, I have no idea what that TestDisk message means... but it could easily be confused by having ⅔ of a filesystem on the disk.)
– derobert Sep 12 '18 at 17:25

Thanks. My attempts at assemble failed, but I might have tried that after running create (my notes are inaccessible right now). I'll give it another shot once I have fresh disk images copied into place.
– adrinux Sep 12 '18 at 17:28