RAID Disappeared - need help to rebuild

I've woken up this morning and can't access any media on my server. Logged into OMV GUI, rebooted and the drives are all still there, but the RAID is missing (I believe it was a RAID 5 array). Have looked at a few threads on the forum, and tried to start a self-diagnosis, but I was hoping one of you would kindly offer me some guidance.

I found some responses to other issues that guided the user to force a rebuild, but because (for an unknown reason) the 6 drives seem to be split across 2 mds I wasn't really sure what my next step should be. All 6 drives were, as of last night, in the same single RAID array.

The command will be something like:
mdadm --assemble /dev/mdX /dev/sd[abcdef]
(change the X to 0, 126 or 127 ... whatever you want)
If that fails, try to force it:
mdadm --assemble --force /dev/mdX /dev/sd[abcdef]
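If /proc/mdstat shows the drives split across two half-assembled arrays (md126/md127 as in this thread), those usually have to be stopped before a forced assemble can take all six members. A sketch - the array names are assumptions from the posted mdstat, and the commands are only echoed here so nothing runs by accident:

```shell
# Sketch: stop the two half-assembled arrays, then force-assemble all
# six members into one array. md126/md127 are assumed from the thread's
# /proc/mdstat; commands are echoed for review, not executed.
plan="mdadm --stop /dev/md126
mdadm --stop /dev/md127
mdadm --assemble --force /dev/md127 /dev/sd[abcdef]"
echo "$plan"
```

Review the printed commands against your own /proc/mdstat before running any of them for real.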

BUT:
You should find the root cause for this behaviour first - just check the logs for any error messages!
cat /var/log/messages | grep KEYWORD
cat /var/log/syslog | grep KEYWORD
(KEYWORD is something like md, sda, sdb, sd..., raid, ...)

Just for my personal understanding: the above /proc/mdstat output talking about two RAIDs with different members can be ignored?

I hope so ... since the state for both /dev/sdf and /dev/sdc in the md126 array is "spare", I hope the backup superblock is intact for assembling with the remaining disks. Btw. the "/dev/mdX" numbering is more OS-related than array-related ... md works here more like a hardware controller.

You can issue the command:
mdadm --examine /dev/sd[abcdef]
to get clarity on the status of all drives - I hope there are no larger "event mismatches" on the array ...
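To eyeball the event counters quickly, something like the following awk filter can help. The sample text below is made up for illustration - on the real system you would pipe `mdadm --examine /dev/sd[abcdef]` into the awk instead of the sample:

```shell
# Sketch: pull the per-drive event counters out of `mdadm --examine` output.
# The sample is fabricated; replace the printf with the real mdadm pipe.
sample='/dev/sda:
         Events : 108414
/dev/sdb:
         Events : 108390'
events=$(printf '%s\n' "$sample" | awk '
  /^\/dev\// { dev = $1 }       # remember which drive this stanza belongs to
  /Events/   { print dev, $3 }  # "Events : N" -> print drive and counter
')
echo "$events"
```

Counters that differ only slightly usually mean a forced assemble has a good chance; large gaps mean more data on that member is stale.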

And the troubles started at '/var/log/syslog.1:Nov 23 22:46:31' on sdb. Everything else is beyond my mdraid knowledge (since I hate it wholeheartedly), but I would now check at least SMART attribute 199 of sdb, and also check sdc and sdf (since they were also reported as missing today).

According to the output, there is good news ... and bad news:
- good: the "Magic" is the same on all drives
- bad: the event counter is different (but it seems they are close enough to reassemble)

Any conclusion about the root cause? Finding that would be highly necessary ...

Maybe you can "copy" (read: back up) the log files for later searching ... first create the subdir "20171124-raid-issue" under root's home, then copy:
mkdir /root/20171124-raid-issue
cp -v /var/log/messages* /root/20171124-raid-issue
cp -v /var/log/syslog* /root/20171124-raid-issue

That means:
- sdb was disabled due to massive errors on the device (read errors) ... and with that, your array went from "degraded" to "dead"
- the 2nd line states that there was one missing drive before that ... and with that, your redundancy was gone

Don't you have email notification enabled?
Which drives (vendor/model) do you use in this setup?
Which power supply do you use?

Sc0rp

EDIT/ps: you should also check the backlogs ...
ls -la /var/log | grep syslog (shows the rotated syslog files)
ls -la /var/log | grep messages (shows the rotated messages files)
Do both commands and note the numbers, then run:
zcat /var/log/syslog.X.gz | grep sdY
zcat /var/log/messages.X.gz | grep sdY
(X is a number between 1 and 7 - look at the lists from the commands above; Y is a, b, c, d, e or f)
for each drive, one by one (start with Y=a), since the drive naming between mdstat and the provided logs looks weird ...

After that, you have to check the SMART status of all drives, as @tkaiser mentioned already!
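A quick way to cover all six members is a small loop over smartctl (part of the smartmontools package). Here the commands are only printed so they can be reviewed first; the attribute names are the common ones, and vendors sometimes label them differently:

```shell
# Sketch: generate one smartctl check per member drive, filtering for the
# attributes that matter for a failing-disk diagnosis. Commands are printed,
# not executed; run them individually as root on the real system.
cmds=""
for d in a b c d e f; do
  cmds="$cmds
smartctl -A /dev/sd$d | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable|UDMA_CRC'"
done
echo "$cmds"
```

Non-zero raw values in Reallocated_Sector_Ct or Current_Pending_Sector are the ones to watch; a rising UDMA_CRC_Error_Count usually points at cabling rather than the platters.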

I can see in the logs that sdb encountered a number of "read error not correctable" errors at that time last night.

sdb: SMART attribute 199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0

That drive has some pending sectors too, which have increased slightly in the last week. Was planning to replace the drive (clearly should've done it sooner), but that's not one of the ones that was kicked out, right?

sdc and sdf both have no pending or reallocated sectors, but ALL 6 drives are showing the same CRC Error Count as sdb:
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0

Yes, I do have email notification ... I'd received emails about the pending sectors, but nothing to the effect of any failed disc.
I'm using Seagate Barracuda 3TB discs (which I recently learned were probably not the best ones to use).
Not sure about the power supply specifically - it is the standard one that came in my HP MicroServer.

I can't understand about the missing drive ... I'm sure all 6 were in the array, previously.

You can always try the reassemble with force. The chance is 50:50 ... md is safe in this, but you have to expect data loss, since the fs layer (XFS) is damaged too ...

And as always: RAID is not backup ... I hope you have a working backup.
For the future, you should keep in mind that you ought to think about changing from RAID5 to ZFS RAID-Z1, or moving to SnapRAID/mergerfs ...

PSU: the standard ATX one is good, I was only afraid of the next PicoPSU setup ...

HDD: Barracudas are not problematic at all, my "old" 2TB ones are working flawlessly 24/7 ... md-RAID5 @ OMV3 (of course with continuous rsync backup ... and UPS ... and email notifications ... and other scripts)

Yes, I do have email notification ... I'd received emails about the pending sectors, but not to the effect of any failed disc.

And that's because I only had the notifications for SMART events turned on . I have to go out for a few hours, but will follow up with the log copies when I get back in ... but is there any point? Is there still any hope of salvaging any data?

Uhm ... just issue an
ls -la /root
to see what's going on in this dir. If there is a file called "20171124-raid-issue" (the line then starts with "-rwx"), delete it with:
rm /root/20171124-raid-issue
and try again ... maybe you have to create the subdir first:
mkdir /root/20171124-raid-issue

So, I finally got chance to save the logs as suggested (I had to manually create the subdirectory to do so). Then, force reassemble worked and the array is visible and mounted again. All files appear to be 'visible', but as you said I'm expecting some to have become corrupted and to be missing data.
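Since the fs layer was mentioned as possibly damaged, a read-only XFS check is a sensible next step before trusting the data. A sketch - the device name /dev/md127 and the mount point are assumptions (check /proc/mdstat and `mount` on the real system), and the commands are only echoed here since xfs_repair needs the filesystem unmounted:

```shell
# Sketch: read-only XFS consistency check on the reassembled array.
# /dev/md127 is an assumption; commands are printed for review only.
check="umount /dev/md127
xfs_repair -n /dev/md127"
echo "$check"
```

The -n flag makes xfs_repair report problems without modifying anything; only after reviewing its output (and ideally after taking a backup) would you consider a real repair run.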

For now, it's a big "phew" and thanks for your help so far!

To answer your question, no .. I foolishly don't have a backup, and that will be the next step once I've got this array as stable as I can for now, before I look at alternative filesystems like you said.

What would you suggest the sequence of steps should be to minimise risk of causing more problems? There are 3 drives in the array that are reporting SMART issues:

Drive   Reallocated Sectors   Pending Sectors   Offline Uncorrectable   CRC Error Rate
sdb     640                   0                 0                       0
sdc     1136                  32                32                      0
sdf     16                    1680              1680                    0

Which one of the drives above should I swap out and replace first, second and third?

This looks bad - do any drives have SATA errors too? From the plain data given, I would change "sdf" first, then "sdc", then "sdb" ... and maybe use another vendor/type of hard drive ... btw. which drives are you actually using?
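For reference, the usual one-drive-at-a-time replacement sequence in that order would look roughly like this. /dev/md127 is an assumption, the commands are only printed here, and each rebuild must finish completely (watch /proc/mdstat) before the next drive is touched:

```shell
# Sketch: generate the mdadm fail/remove/add sequence for replacing the
# drives in the suggested order sdf -> sdc -> sdb. /dev/md127 is assumed;
# commands are printed for review, never executed here.
seq=$(for bad in sdf sdc sdb; do
  echo "mdadm /dev/md127 --fail /dev/$bad"
  echo "mdadm /dev/md127 --remove /dev/$bad"
  echo "# power down, swap the physical drive, boot, then:"
  echo "mdadm /dev/md127 --add /dev/$bad"
  echo "cat /proc/mdstat   # wait for the resync to finish before the next drive"
done)
echo "$seq"
```

With no backup and two other weak members in the array, the rebuild itself is the riskiest moment - getting a backup of the important data first would be the safer order of operations.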