On Fri, Jun 19, 2009 at 3:33 AM, Heinz Mauelshagen<heinzm redhat com> wrote:
> On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote:
>> - You have your own 'xor' code against which you do a run-time test of
>> the 'xor_block' code which md/raid5 uses - then choose the fasted.
>> This really should not be necessary. If you have xor code that runs
>> faster than anything in xor_block, it really would be best to submit
>> it for inclusion in the common xor code base.
>
> This is in because it actually shows better performance regularly by
> utilizing cache lines etc. more efficiently (tested on Intel, AMD and
> Sparc).
>
> If xor_block would always have been performed best, I'd dropped that
> optimization already.
Data please. Especially given the claim that it improves upon the
existing per-architecture hand-tuned assembly routines. If it is
faster great! Please send patches to improve the existing code.
>>
>> - You have introduced "dm-message" which is a frame work for sending
>> messages to dm targets.
>
> dm-message is a parser for messages sent to a device-mapper targets
> message interface. It aims at allowing common message parsing functions
> to be shared between targets.
>
> The message interface in general allows for runtime state changes of
> targets like those you found below.
>
> See "dmsetup message ..."
>
>> It is used for adjusting the cache size,
>> changing the bandwidth used by resync, and doing things with
>> statistics. Linux already has a framework for sending messages to
>> kernel objects. It is called "sysfs". It would be much nicer if
>> you could use that.
>
> Looks to me like heavier to program (I counted ~100 LOC of sysfs defines
> and calls in md.c) for 3 files used in md.
>
> That needs some more discussion.
Ironically the same argument can be applied to this module. ~5000 LOC
to re-implement raid5 seems much "heavier to program" than to write
some dmraid ABI compatibility wrappers around the existing md-raid5
code.
[..]
>> The problem is that when a drive fails you allow writes to continue
>> without waiting for the metadata to be updated. I.e. metadata
>> updates are asynchronous.
>> The problem scenario for raid5 is:
>> - a drive fails while a write is in process
>> - the drive is marked as faulty and user space is notified
>> - that write, and possibly several others, complete and that
>> status is potentially returned to user-space (e.g. for an fsync)
>> - user space acts on that status (e.g. sends an acknowledgement over
>> the network)
>> - the host crashes before the metadata update completes
>> - on reboot, the drive, which just had a transient failure, works
>> fine, and the metadata reports that all the drives are good, but
>> that the array was active so a parity-resync is needed.
>> - All parity blocks are recalculated. However that write we
>> mentioned above is only recorded in the parity block. So that
>> data is lost.
>>
>
> Yes, like in the dm raid1 scenario, blocking of writes would have to
> occur until metadata has been updated in order to detect the failing
> drive persistently, hence not updating parity (which contains data
> mandatory to preserve) but recalculating chunks on the transiently
> failing drive utilizing intact parity.
Another problem with dm along these lines is the mechanism for
notifying userspace, I am assuming this is still via uevents? The
problem is that each notification requires a memory allocation which
might fail under load, especially if the blocked array is in the dirty
page write out path. The userspace notifications and failed device
handling in md are designed to not require memory allocations.
[..]
>> Also it does not take account of IO to other partitions on a device.
>> It is common to have a number of different arrays on the same device
>> set. You really want to resync to back off based on IO to all of
>> the device, not just to the array.
>
> Does MD RAID > 0 cope with that in multi-partition RAID configs ?
> E.g.:
> disk1:part1+disk2:part1 -> md0 and disk1:part2+disk2:part2 -> md1 with
> global iostats for disk[12] ?
By default it will hold off resync on other arrays when it notices
that multiple arrays share disks. A sysfs knob allows parallel
resync.
[..]
>> As I said, I would rather this didn't exist.
>
> I think its better to offer options to be utilized for different
> purposes. Tradeoff is always (some) overlap.
>
> We don't have one filesystem for all use cases ;-)
That argument does not fly in this case. md is already suitable for
the dmraid use case (i.e kernel raid/userspace metadata).
>> But it could be
>> emulated (except for the statistics) if needed - md can already
>> adjust resync speed and cache size.
>>
>>
>> What is missing is agreement on how to handle synchronous metadata
>> updates. md waits for the drive failure to be acknowledged. It is
>> automatically marked as 'Blocked' on an error, and user-space must
>> clear that flag.
>
> Yes, Mikulas already put thought into that.
> Let's allow him /and others) to join in.
>
>
>>
>> I suspect, using your model, the approach would be a 'dm-message'
>> acknowledging that a given device is faulty. I would, of course,
>> rather this be done via sysfs.
>>
>> I will try to look at factoring out the md specific part of md/raid5
>> over the next few days (maybe next week) and see how that looks.
>
> That'll be interesting to see how it fits into the device-mapper
> framework
>
Yes, kernel ABI compatibility is the better [1] way to go. I will
also revive the ideas/patches I had along these lines.
Regards,
Dan
[1]: ...than dm2md in userspace