[solved] Linux software RAID-1 fallover problem.

Note: this is a partial post. I may fill in more information at a later time. However, the solution is stated at the bottom and may be helpful already.

The Problem history

A few days ago I set up an rsnapshot backup solution, and when I ran it for the first time, my Linux software raid-1 (2x 3 TB, SATA 6 Gb/s) fell over. It wasn’t tragic, since after a reboot it simply re-synced. However, this happened reliably every time I ran a backup, and then the re-syncs started to fail as well.

Turns out my motherboard (Asus P7H57D-V EVO) uses a Marvell 88SE6111 chip for the two SATA 6 Gb/s connectors it provides, which are the ones I am using for the two disks in question. All my other disks are attached to an Intel chip that only offers SATA 3 Gb/s connections. This Marvell chip seems to have poor support under Linux. However, the controller is set in the BIOS to work via AHCI, and the disks do work without a problem under normal conditions.

The current kernel is stated below:

Linux 3.10.7-1-ARCH x86_64 GNU/Linux

I checked the disks (both are new) using smartctl and found no issues. While the raid was in sync, everything worked fine and smartctl reported two healthy new disks. Once the backup had toppled the raid or the re-sync had failed, smartctl failed as well, and the following messages showed up in dmesg:

[ 7483.951154] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[ 7483.951179] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Checking dmesg for a hint on why the disks/raid failed, I found the following errors. The relevant hard drives are /dev/sdf and /dev/sde, and … marks sections where I cut repetitions:

The way I read this is as follows. The physical symptom is that the raid stops responding for about 30 seconds and then works again. These freezes then lead to a timeout in the rsync backup or the re-sync action. The root cause of the freeze may ultimately be the Marvell controller (my software raid-5 with 3 disks on the Intel chip has no such problems), but the trigger may well be an overload during backup or sync_action.

My solution to the problem

Since I could not find a better driver for the Marvell chip, and since hours of research on the web did not yield a better solution, I decided to avoid triggering the problem rather than cure its cause. It seems that the software raid only falls over when it is under immense load, such as during a re-sync or when the rsnapshot tool is continuously spitting data at the raid. By setting bandwidth limits for the mdadm sync_action and for the rsync tool underlying rsnapshot, I was able to prevent the problem from occurring again.

I set a bandwidth limit on the mdadm re-sync action like this:

sysctl -w dev.raid.speed_limit_max=value

I chose a value of 124000 (the limit is given in KiB/s), which is slightly lower than the speed reported by mdstat during successful syncing, i.e. before the re-sync action overloaded the raid. To see that speed, run the following while the raid is happily re-syncing:

$ cat /proc/mdstat
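
If you would rather not read the number off by eye, the speed can be pulled out of the mdstat output with a small helper. This is only a sketch: the function names mdstat_speed and suggest_limit are made up for illustration, it assumes the usual "speed=...K/sec" field that mdstat prints during a re-sync, and the default 5% safety margin is an arbitrary choice, not a recommendation.

```shell
#!/bin/sh
# Hypothetical helpers, not part of mdadm or any other package.

# Read mdstat-style text on stdin and print the re-sync speed in K/sec,
# one number per array that is currently syncing.
mdstat_speed() {
    sed -n 's/.*speed=\([0-9][0-9]*\)K\/sec.*/\1/p'
}

# Print a limit that lies a given percentage below an observed speed.
# $1 = observed speed in K/sec, $2 = margin in percent (assumed default: 5).
suggest_limit() {
    speed=$1
    margin=${2:-5}
    echo $(( speed * (100 - margin) / 100 ))
}

# Typical use while a re-sync is running:
#   mdstat_speed < /proc/mdstat
#   suggest_limit "$(mdstat_speed < /proc/mdstat | head -n 1)"
```

The result of suggest_limit is only a starting point. What matters is that the limit ends up somewhat below the speed at which the raid still syncs without freezing.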

Check the current limits for the resync action:

# sysctl dev.raid.speed_limit_min
# sysctl dev.raid.speed_limit_max

To permanently override the default, I added this line to /etc/sysctl.conf:
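
The line itself did not survive in this copy of the post. Going by the sysctl key and the value given above, it would presumably look like the following, which is a reconstruction rather than the original text:

```conf
# /etc/sysctl.conf
# Upper bound for md re-sync speed, in KiB/s (reconstructed from the value chosen above)
dev.raid.speed_limit_max = 124000
```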

To prevent rsnapshot from overloading the raid during a backup, I set rsync's bwlimit parameter in the rsnapshot config file at /etc/rsnapshot.conf. As you can see, I also set a 3-minute timeout to give the raid some time to unfreeze, should that happen again:
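
The config excerpt is not included in this copy of the post, so here is a minimal sketch of what such a setting could look like. It assumes the options are passed through rsnapshot's rsync_long_args parameter. The --bwlimit value below is a placeholder, not the value actually used, while --timeout=180 corresponds to the 3 minutes mentioned above. Setting rsync_long_args replaces rsnapshot's default rsync arguments, so those are repeated first.

```conf
# /etc/rsnapshot.conf (parameter and value must be separated by a tab, not spaces)
# --bwlimit is in KiB/s; 100000 is a placeholder, pick something below what the raid sustains.
# --timeout is in seconds; 180 = 3 minutes.
rsync_long_args	--delete --numeric-ids --relative --delete-excluded --bwlimit=100000 --timeout=180
```

Running rsnapshot configtest afterwards checks that the file still parses.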

4 Comments

Hi.
I had problems with my new 5-disk raid 6 when moving over from an old raid 1. Three disks of the raid 6 and one disk of the raid 1 were on the same Marvell 88SE9230. When rsyncing from the raid 1 to the raid 6, the kernel panicked every time.
I thought it was a disk problem until I saw your fix.
Thanks 😉

Just wanted to let you know that this page was a big help for me! I had just set up a raid 1 and copying data from an old disk to it proved to be pretty problematic. My old max value was 200000, and during re-sync it was around 195000, so I only backed my max value off to 190000. Now everything is running without a hitch!