Road map for md/raid driver - sort of

29 January 2009, 23:46 UTC

In mid-December 2008 I wrote a bit of a "road-map" containing some of my thoughts about development work that could usefully be done on the MD/RAID driver in the Linux kernel. Some of it might get done. Some of it might not. It is not a promise at all, more of a discussion starter in case people want to encourage features or suggest different features.

But I really should put this stuff in my blog so, 6 weeks later, here it is.

Bad block list

The idea here is to maintain and store on each device a list of
blocks that are known to be 'bad'. This effectively allows us to
fail a single block rather than a whole device when we get a media
write error. Of course if updating the bad-block-list gives an
error we then have to fail the device.

We would also record a bad block if we get a read error on a degraded
array. This would e.g. allow recovery for a degraded raid1 where the
sole remaining device has a bad block.

An array could have multiple errors on different devices and just
those stripes would be considered to be "degraded". As long as no
single stripe had too many bad blocks, the data would still be safe.
Naturally as soon as you get one bad block, the array becomes
susceptible to data loss on a single device failure, so it wouldn't
be advisable to run with non-empty badblock lists for an extended
length of time. However it might provide breathing space until
drive replacement can be achieved.
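The idea can be sketched in a few lines. This is an illustrative model only (the names and structures are mine, not the kernel's): each device carries a set of known-bad stripe numbers, and a stripe's data survives as long as no more devices than the redundancy allows have a bad block in it.

```python
# Illustrative sketch of the bad-block-list idea; not kernel code.

def stripe_is_readable(bad_blocks, stripe, redundancy):
    """bad_blocks: one set of bad stripe numbers per device.
    The stripe's data is safe while at most `redundancy` devices
    have a bad block in that stripe."""
    failures = sum(1 for dev in bad_blocks if stripe in dev)
    return failures <= redundancy

# A 4-device RAID5 (one redundant block per stripe):
bad = [{7}, set(), {7, 9}, set()]      # devices 0 and 2 both bad in stripe 7
print(stripe_is_readable(bad, 9, 1))   # True: only device 2 is bad there
print(stripe_is_readable(bad, 7, 1))   # False: two bad blocks in one stripe
```

Only the stripes that appear in some device's list are degraded; everything else keeps full redundancy.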

hot-device-replace

This is probably the most asked for feature of late. It would allow
a device to be 'recovered' while the original was still in service.
So instead of failing out a device and adding a spare, you can add
the spare, build the data onto it, then fail out the device.

This meshes well with the bad block list. When we find a bad block,
we start a hot-replace onto a spare (if one exists). If sleeping
bad blocks are discovered during the hot-replace process, we don't
lose the data unless we find two bad blocks in the same stripe.
And then we just lose data in that stripe.

Recording in the metadata that a hot-replace was happening might be
a little tricky, so it could be that if you reboot in the middle,
you would have to restart from the beginning. Similarly there would
be no 'intent' bitmap involved for this resync.

Each personality would have to implement much of this independently,
effectively providing a mini raid1 implementation. It would be very
minimal without e.g. read balancing or write-behind etc.

There would be no point implementing this in raid1. Just
raid456 and raid10.
It could conceivably make sense for raid0 and linear, but that is
very unlikely to be implemented.
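The "mini raid1" behaviour during a hot-replace can be modelled very roughly as follows. This is a toy sketch under my own assumptions (all names are illustrative): while the replacement is being built, new writes go to both the old device and the spare, a background cursor copies existing data across, and only once the copy completes would the old device be failed out.

```python
# Toy model of hot-device-replace; illustrative only, not kernel code.

class HotReplace:
    def __init__(self, old, spare):
        self.old, self.spare = old, spare   # dicts: block number -> data
        self.copied = 0                     # background recovery cursor

    def write(self, block, data):
        self.old[block] = data
        self.spare[block] = data            # mirror new writes to both

    def step(self):
        # background task: copy one block of existing data to the spare
        if self.copied in self.old:
            self.spare[self.copied] = self.old[self.copied]
        self.copied += 1

hr = HotReplace({0: "a", 1: "b"}, {})
hr.write(1, "B")                 # a new write lands on both devices
hr.step(); hr.step()             # background copy of blocks 0 and 1
print(hr.spare == {0: "a", 1: "B"})   # True: spare now holds everything
```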

split-mirror

This is really a function of mdadm rather than md. It is already
quite possible to break a mirror into two separate single-device
arrays. However it is a sufficiently common operation that it is
probably worth making very easy to do with mdadm.
I'm thinking something like

mdadm --create /dev/md/new --split /dev/md/old

will create a new raid1 by taking one device off /dev/md/old (which
must be a raid1) and making an array with exactly the right metadata
and size.

raid5->raid6 conversion.

This is also a fairly commonly asked for feature.
The first step would be to define a raid6 layout where the Q block
was not rotated around the devices but was always on the last
device. Then we could change a raid5 to a singly-degraded raid6
without moving any data.
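The key observation can be shown with a small mapping function. This is an illustrative sketch, not md's actual layout code (the rotation scheme here is simplified): if Q is pinned to the last device, the P-and-data portion of every stripe is laid out exactly as in the original RAID5, so conversion needs no data movement.

```python
# Illustrative sketch of a "Q always on the last device" RAID6 layout.

def raid5_parity_dev(stripe, ndevs):
    # simplified rotating-parity placement, for illustration only
    return (ndevs - 1 - stripe) % ndevs

def converted_raid6_layout(stripe, ndevs):
    """ndevs includes the newly added Q device."""
    q = ndevs - 1                              # Q fixed on the last device
    p = raid5_parity_dev(stripe, ndevs - 1)    # P rotates as in the old RAID5
    return p, q

# Converting a 4-disk RAID5 by adding a 5th disk for Q:
print(converted_raid6_layout(0, 5))   # (3, 4): P as before, Q on new disk
print(converted_raid6_layout(3, 5))   # (0, 4)
```

Until the array is restriped to a normal rotated-Q layout, it behaves like a singly-degraded RAID6 with the Q device being rebuilt.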

The next step would be to implement in-place restriping.
This involves:

1/ freezing a section of the array (all IO blocks)

2/ copying the data out to a safe backup

3/ copying it back in with the new layout

4/ updating the metadata to indicate that the restripe has progressed

5/ repeating.

This would probably be quite slow but it would achieve the desired
result.
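The backup/copy-back/record loop can be sketched as below. This is a toy model under my own naming (the real code works on block devices, and "freezing" means blocking IO to the section being moved):

```python
# Toy model of in-place restriping; illustrative only.

def restripe_in_place(array, relayout, section_size, metadata):
    pos = metadata.get("restripe_pos", 0)   # resume point after a crash
    while pos < len(array):
        section = array[pos:pos + section_size]
        backup = list(section)              # copy the data out to a safe backup
        array[pos:pos + section_size] = relayout(backup)   # copy back, new layout
        pos += section_size
        metadata["restripe_pos"] = pos      # record progress, then repeat

meta = {}
data = [1, 2, 3, 4, 5, 6]
restripe_in_place(data, lambda s: s[::-1], 2, meta)   # "relayout" = reverse
print(data)          # [2, 1, 4, 3, 6, 5]
```

Because progress is recorded in the metadata after every section, a crash mid-restripe loses at most the section whose backup still exists.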

Once we have in-place restriping we could change chunksize as
well.

raid5 reduce number of devices.

We can currently restripe a raid5 (or 6) over a larger number of
devices but not over a smaller number of devices. That means you
cannot undo an increase that you didn't want.

It might be nice to allow this to happen at the same time as
increasing --size (if the devices are big enough) to allow the
array to be restriped without changing the available space.
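The arithmetic is simple: a RAID5 gives up one device's worth of space to parity, so shrinking the device count while growing --size can leave the available space unchanged (numbers below are illustrative).

```python
# RAID5 capacity: one device's worth of space goes to parity.
def raid5_capacity(ndevs, dev_size):
    return (ndevs - 1) * dev_size

before = raid5_capacity(5, 300)   # five 300G devices -> 1200G available
after  = raid5_capacity(4, 400)   # four devices, --size grown to 400G
print(before == after)            # True: available space unchanged
```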

cluster raid1

Allow a raid1 to be assembled on multiple hosts that share some
drives, so a cluster filesystem (e.g. ocfs2) can be run over it.
It requires co-ordination to handle failure events and
resync/recovery. Most of this would probably be done in userspace.

Support for 'discard' commands

Flash-based devices and some high-end storage devices that provide "thin provisioning" like to know
when parts of the device are not needed any more so they can optimise their behaviour.
File systems are starting to add the functionality of sending this information to the block device
using a "discard" command. It might be useful for md to make use of this.

md would need to keep a data structure (bitmap?) listing sections of the array that have been
discarded. Initially this might be the whole array. When there is a write to a discarded section,
it would need to be resynced and then the write allowed to complete. A read from a discarded
section is probably an error - maybe just return nulls.
When md decides it can discard part of the array, it tells the component devices that they can discard some data too.
When the filesystem tells md it can discard part of the array, we might have a problem.
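The bookkeeping described above can be modelled with a small class. This is a toy sketch with names of my own invention: initially the whole array is discarded, a write to a discarded chunk first takes it out of the discarded set (standing in for the resync), and a read from a discarded chunk returns nulls.

```python
# Toy model of md tracking discarded regions; illustrative only.

class DiscardTracker:
    def __init__(self, nchunks):
        self.discarded = set(range(nchunks))   # initially the whole array
        self.data = {}

    def write(self, chunk, value):
        # a write to a discarded chunk would need a resync first;
        # here we just mark the chunk as in use
        self.discarded.discard(chunk)
        self.data[chunk] = value

    def read(self, chunk):
        if chunk in self.discarded:
            return 0        # read from a discarded section: return nulls
        return self.data[chunk]

t = DiscardTracker(4)
t.write(2, 42)
print(t.read(2))   # 42
print(t.read(3))   # 0 - still discarded
```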

If the discard request is smaller than the granularity that md is using, then we would need to ignore it.
So this would only really work if there was some guaranteed lower bound on granularity that was
small enough for md to work with, or the filesystem would need to do aggregating and always send the largest
'discard' that it can. I'm not sure if they will do that, or to what extent.

If md was to maintain a granularity of sectors then a 16TB array (not at all unrealistic these days) would require
32 billion bits to map what is in use. That is 4 billion bytes or 4 gigabytes - maybe a bit excessive.
If we could rely on the filesystem rounding up to at least 1 megabyte (aligned) then only 2Meg of bitmap would be
needed, which is a little more realistic.
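The sizes above follow directly from the array size (decimal units here, for illustration):

```python
# Bitmap-size arithmetic for a 16TB array.
array_bytes = 16 * 10**12

sector_bits = array_bytes // 512     # one bit per 512-byte sector
sector_bytes = sector_bits // 8
print(sector_bits)                   # ~31 billion bits
print(sector_bytes)                  # ~3.9GB of bitmap - excessive

chunk_bits = array_bytes // 2**20    # one bit per 1MB (aligned) chunk
print(chunk_bits // 8)               # ~1.9MB of bitmap - workable
```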

Comment (13 April 2009, 15:10 UTC)
Today's drives already automagically remap bad blocks, so I do not see any benefit to implementing this in SoftRaid.

What I'd love to see: A background task which, in periods of low disk usage, automatically checks and eventually repairs inconsistencies of raid-1, and in case of irreparable damage starts a configurable task to alert the sysadmin.

I don't think I'm proposing the remapping of bad blocks. Just recording their location. If you get a read error on
a degraded array, you need to either record the location of that bad block, or fail the whole array. Some people
prefer the first option.

The background task you suggest is already quite possible. If you

echo repair > /sys/block/mdX/md/sync_action

it will read the whole array and attempt to fix any read errors. By default it will limit itself to 1 Megabyte
per second if there is other activity on the devices. If you think this is not "background" enough, you can
set a lower minimum with e.g.

echo 100 > /sys/block/mdX/md/sync_speed_min

Irreparable damage will currently kick the bad drive from the array as there is no other real option. When we get
support for storing a bad-block list, irreparable damage will be recorded in the bad block list. It seems likely
that we will then get "mdadm --monitor" to report that in whatever way it has been configured to report things.

I like the bad block list that you suggested. What I am just curious about is why everybody relies on the bad block relocation feature of modern drives and is reluctant to implement a real bad block relocation feature in mdadm or LVM2 (which from my point of view would have many advantages)?

I just had the experience that in a RAID5 array with 4 old 250 GB drives one had a bad block and was actually running out of spare blocks to reallocate that bad block. So far so good; since mdadm could not do a software bad block relocation I changed the drive with a newly bought one. When changing the drive I must somehow have touched one of the other hitherto functioning drives a little bit too roughly, so when I had put in the new drive and tried to start reassembling the degraded array, that failed too, since another drive then also showed a bad block and failed as well. Afterwards getting my data back was a heck of an effort, with two drives failing just because of two stupid bad blocks.

I am currently considering reverting to evms with debian etch since this has a bad block relocation feature. But honestly I would be happy to use mdadm with a real bad block relocation feature since evms is not maintained anymore.

Re: Road map for md/raid driver - sort of (17 May 2009, 19:49 UTC)
I would love to see the source code for mdadm put up on Google Code or any other similar code sharing site. I love the work you've done so far, and am grateful for my Raid6 array I've been able to build with it. I just think it would be neat if, in my spare time, I could start playing with the source code trying to expand it myself (however unsuccessful I will be at it). Looking around your site here, it doesn't look like the source code for the development version(s?) you're working on is readily available, and I would just like to humbly suggest that perhaps other people could pick up mdadm development as well if the source was available via git or something similar.

Would it be possible to force a bad block relocation on the drive itself? By this I mean: if during a sync/read we see a bad block, and we can make sure the HDD recognises this LBA as bad, then it will re-map it - possibly meaning that MD would simply need to write that data (recreated from parity etc.) again, and the drive would store it elsewhere appropriately.

I seem to recall entering bad blocks on Seagates many years ago via some tool, so although it may be proprietary I am guessing it could be possible over scsi.... just an idea

PS Just for the sake of saying it whilst I am here, it has been good to watch MD/mdadm morph from what used to be considered a free and okay software RAID to a mainstream raid solution that out-benefits many of the mid-level and above hardware raids which I test. I am sure you hear this all the time, but you and all those who have contributed deserve a big thank you from a lot of us users!

Sorry for so many identical comments... the submission script errors when I used a [single_quote] sign; it took several attempts to work out what the error was, and of course I presumed that it was not posting on error!

Comment (02 December 2009, 16:40 UTC)
A Question about the bitmap implementation:

Situation: Bitmap feature is enabled for a mirroring array with three members.

Is there one global bitmap marking areas of the array yet to be resynced or one bitmap per member?

I.e. if one member is on an internal (laptop) hdd and the other two are on separate external SATA drives, will the external drives resync only what is needed when they are attached one at a time?

Or will, if one of the external drives isn't resynced that often, the other drive always resync more than needed when it is attached? (also everything that the second mirror is still missing)

I am using a HotplugRaid and found out there might be some little bugs in the md driver that prevent true hotplugging of (complete, not degraded) raids between machines: https://wiki.ubuntu.com/ReliableRaid

The bitmap is per-array, not per-device. When you re-add a device, it will resync everything that has changed since the array was last completely in-sync and not degraded.

So you would need to plug both backup drives in at the same time.

If you create e.g. md0 from 'internal' and 'external-1', and then create md1 from 'md0' and 'external-2', and make the filesystem on 'md1', then re-adding either external device to the appropriate array will cause a resync of only the things that have changed since the last resync with the device. However that will require twice as many bitmap updates so writes can be expected to be slower (though making the bitmap chunk size larger might reduce this pain somewhat).

Comment (04 December 2009, 21:07 UTC)
Thank you very much for the answer. I have tried removing the third member (2nd external) from the array now, and the now lonely external second disk syncs beautifully - only the small unclean part of the array. Performance is absolutely satisfying. The array holds the /home partition and I don't notice any lag with write-behind. What a cool idea with the stacking of overlapping bitmaps. Looks like you produced the perfect solution I was looking for!

Hotplugging works nicely now, with some fixes to the distribution (only with arrays created on the particular system though). A good upstream improvement might only be an option not to restrict hotplugging to matches with hostname/UUIDs listed in mdadm.conf, and this one, if it is valid:
A "mdadm --incremental --run /dev/mdX" command to start only *selected* arrays (i.e. the rootfs after a timeout) degraded in "auto-read" mode.
https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/251646

Comment (16 March 2010, 12:12 UTC)
I'm *very* curious about the cluster raid bulletpoint. We're running into exactly this: two storage boxes, and loads of nodes that need to read/write files from a central location. OCFS2 would be good for the multinode part, but we haven't got a solution yet to mirror this over both storage boxes. Cluster raid functionality would remove a big single point of failure.

Just a quick question if I may on the future of raid10 (of the "non-nested" variety, i.e. set up as one device rather than a nested raid of first a RAID1 and then a RAID0). I currently use this setup on my main server, and I have noticed that it is gaining popularity fast in the Linux community.

Currently though, my understanding is that mdadm doesn't support growing a raid10 array. I presume this is tricky, as it would require adding at least two disks to the set, but I was wondering if this is something that you have on your TODO?

I currently get around it by using LVM on top of my now two RAID10 sets, but it would be a very welcome feature, particularly in light of the growing popularity of RAID10.

No, growing a RAID10 is not something that is currently on my TODO list.

There are 2 reasons for this.

Firstly, it is a daunting task. Due to the multiple layout options (near, far, offset) and
varying numbers of copies of data, there are lots of different growth options. I don't know where to start :-)

Secondly, RAID10 doesn't have anything like the stripe-cache that RAID5 has. This stripe cache is central to the reshape strategy in RAID5. We load data into the cache, then write it out from the cache, and use the cache to make sure everything is synchronised. With RAID10 I would have to build the necessary infrastructure to provide the necessary synchronisation. It all sounds too much like hard work.

But maybe it just "sounds" like hard work, and if I actually tried it would be easier than I expect. Let's see:

There are 3 dimensions for growth: making individual devices larger, adding new devices, and changing the layout.

For 'near' and 'offset' layouts, changing the size of the devices should be fairly straightforward. I could probably implement that tomorrow (if I wasn't already busy). For 'far', changing the device size would require
moving the 2nd (and subsequent) copies further up the device. Probably do-able, but messy.
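A simplified model shows why 'far' is the awkward one. This sketch is my own approximation of a far-2 layout, not md's exact placement: the first copy fills the first half of each device and the second copy fills the second half, rotated by one device, so the second copy's offset depends on the device size.

```python
# Simplified model of a 'far 2' RAID10 layout; illustrative only.

def far2_copies(block, ndevs, dev_size):
    stripe, dev = divmod(block, ndevs)
    first  = (dev, stripe)                               # (device, offset)
    second = ((dev + 1) % ndevs, dev_size // 2 + stripe) # rotated, 2nd half
    return first, second

# Growing the devices moves every second copy further up the disk:
print(far2_copies(0, 4, 1000))   # ((0, 0), (1, 500))
print(far2_copies(0, 4, 2000))   # ((0, 0), (1, 1000))
```

For 'near' and 'offset', neither copy's location depends on `dev_size`, which is why those layouts can grow their devices without moving anything.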

When adding new devices I need to read-a-stripe, write-a-stripe, and always record where I am up to.
It would probably be easiest to suspend other IO while a stripe was being relocated, or at least all
IO in a window around the relocation.

Changing the number of near or offset copies would be similar to changing the number of devices. You could
probably change from 2 near copies to 2 offset copies quite easily. Changing the number of far copies ... makes my head spin. There are some cases when it would be do-able, but probably it isn't worth it.

So maybe it isn't as horrible as I imagined. I might give it a go one day, but it won't be soon. Of course if someone else would like to try their hand at it, I would be very happy to help....

I've been idly curious for some time about the notion of supporting write-back in the `md' driver. Safe write-back requires somewhere fast and persistent across power loss/OS reset to put the cache data ... but consumer battery-backed SATA "RAM Drives" offer such a place at an astonishingly reasonable price.

While I'm a mediocre-at-best C coder and am not overly familiar with the kernel, I'm wondering if this is an insane thing for me to try to actually prototype. If you haven't already considered and discarded the notion I'd be interested in your thoughts. Unworkable? Viable but really, really complicated to implement? Or vaguely sensible?

Write caching alone would be wonderful, but it'd also be a stepping stone toward coalescing writes into fewer bigger operations and finally getting rid of any need for hardware RAID controllers...

The easiest way to make use of some sort of fast-but-small storage (such as RAM drives) is to configure your filesystem with an external journal on the RAM drive and turn on data journalling.

That way everything gets written to the RAM drive first, which has very low latency, and then gets written to the large/slow storage (e.g. RAID6) at a more sedate pace but with no-one waiting for it (unless memory gets tight etc).

I have used this to get good performance on an NFS server (which needs very low latency in its storage subsystem) and it worked quite well. There was at the time (8 years ago?) room for improvement in ext3's handling of data journalling but I managed to tune it to do quite well. The issues I hit might be fixed now.

This addresses most of what you want from a RAM drive but not quite everything. The one thing it doesn't do is close the 'write hole'. If you crash and come up with a degraded array (or the array goes degraded before a resync finishes) you can get silent corruption. This is virtually never a problem in real life, but it would be nice to fix it and the only credible fix is to journal to a v.fast device. And if you are going to journal to a v.fast device, you get all the other low-latency benefits for free. (This of course assumes that your low-latency device has very good ECC codes so the chance of data loss in there is virtually zero).

Adding this to md/raid5 would certainly not be trivial, but it would be quite possible. If I had a RAM drive to experiment with I might even try it....

Every incoming write would be queued for writing to the journal with a header describing the block. You would also need to get the data into the stripe cache, and doing that efficiently would probably be the tricky bit. It might even be easiest to write out to the RAM drive, then read back into the stripe cache in order to calculate parity. After you have a stripe ready to write and have calculated the new parity, you need to log that to the journal before writing anything out to the array.

I would probably want to make the stripe cache a lot larger, but not allocate pages permanently to every entry. Rather, the stripe cache is used to record where all the data that is not yet safe on storage is, and as memory becomes available we connect the memory to a stripe, read in the required data from the log or from the array, calculate the parity, write it to the journal, then write out the stripe to the array.

So there are three main parts to this:

1/ allow entries in the stripe cache to not have any pages attached. When a stripe is ready for pages it waits for them to be available, attaches them, and uses them. So we have a v.large pool of 'struct stripe_head' which
can even grow dynamically, and a more restricted pool of buffers.

2/ A journal management module that queues blocks, creates a header, writes out the blocks and header to the RAM drive, and keeps track of when data is safely in the array so that it can be dropped from the journal.

3/ Tie it all together with appropriate extensions to the metadata and mdadm so that the journal is found and attached properly.

I think I would only do this for RAID5/6. There is no 'write hole' problem for RAID1 or RAID10, so the fs-journal approach should be completely sufficient.
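The journal management in part 2/ can be sketched as a small log of unsafe blocks. This is a toy model under my own naming, not a proposed implementation: writes are appended with a header, blocks are dropped once they are safely in the array, and after a crash anything still in the log must be replayed.

```python
# Toy model of the write-journal bookkeeping; illustrative only.

class Journal:
    def __init__(self):
        self.entries = {}       # block number -> data, in log order

    def log_write(self, block, data):
        # header (the block number) plus data appended to the fast device
        self.entries[block] = data

    def block_safe(self, block):
        # the data has reached the real array; reclaim journal space
        self.entries.pop(block, None)

    def replay(self):
        # after a crash, anything still journalled must be rewritten
        return list(self.entries.items())

j = Journal()
j.log_write(10, "data-A")
j.log_write(11, "data-B")
j.block_safe(10)              # block 10 hit the array before the crash
print(j.replay())             # [(11, 'data-B')]
```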

I just read something that makes a lot of what I read about Hardware and Software RAID seem to be mostly conjecture... so I was hoping that I could ask your opinion.

Adaptec notes on their website:

'... RAID controllers are designed to timeout a given command if the disk drive becomes unresponsive within a given time frame. The result will be that the drive will appear off line or will be marked bad and an alert will be given to the customer. Enterprise class drives (or drives designed for RAID environments), have a retry limit before a sector is marked bad. This retry limit enables the drive to respond to the RAID controller within the expected time frame. While desktop drives may work with a RAID controller, the array will progressively go off line as the disk drive ages and may result in data loss.'

This is certainly my experience running RAID on large (500GB+) SATA desktop disks.

Could the vendors advising this really sell that many enterprise class disks at 6x the price just by knobbling the 'time-out' period in the controller firmwares?

I read that Google did a whitepaper (can't find it yet), showing there is no difference in drive reliability between these two 'classes' of drive. I doubt Google use hardware raid controllers in their beige box fleet of course (maybe mdadm?), but surely will know all about it.

Really my dream is to use mdadm to build a RAID 5 or 6 (though RAID 1+0 would be nicer ;) ZFS array with a hot-spare, using 5 or more desktop SATA 2TB disks. It has failed in the past with similar symptoms to the ones Adaptec describes.

Does one have a 'practical size limit' when using mdadm and cheap disks? I remember reading that <320GB disks helps to avoid I/O errors during array rebuilds.

discard support (26 April 2011, 22:09 UTC)
When I saw you considered implementing discard support, I thought "Hurray, that's just what is missing!" - and then I saw you did not mean the passing through of block discards from the filesystem to the device, but some sort of "MD managing its own discarded region record" for thin provisioning.

While I don't like the very idea of "thin provisioning" too much, I'm sure a lot of people (including me) would love to see MD just pass along the discard requests, which to my knowledge it doesn't do, yet.

We run a number of filesystems on SSDs and would like to use MD RAID10 on top of them - which is currently not possible because we would lose the very important feature of being able to keep up the SSDs' performance by periodically wiping the unused sectors.

Passing along the discards should be much easier to implement than what you suggested, "DM" already does it. Please consider this, too :-)