On Fri, 2013-09-27 at 00:22 +0100, Alasdair G Kergon wrote:
> On Thu, Sep 26, 2013 at 10:47:13AM -0700, Frank Mayhar wrote:
> > Launching it from ramdisk won't help, particularly, since it still goes
> > through the block layer. The other stuff won't help if a (potentially
> > unrelated) bug in the daemon happens to be being tickled at the same
> > time, or if some dependency happens to be broken and _that's_ what's
> > preventing the daemon from making progress.
>
> Then put more effort into debugging your daemon so it doesn't have
> bugs that make it die? Implement the timeout in a robust independent
> daemon if it's other code there that's unreliable?

I'm not sure how to respond to this. Some fifty years of people
programming computers appears to show unequivocally that you can't rely
on code not having bugs no matter _how_ much effort you put into it.
It's just the nature of the beast.

> > And as far as lvm2 and multipath-tools, yeah, they cope okay in the kind
> > of environments most people have, but that's not the kind of environment
> > (or scale) we have to deal with.
>
> In what way are your requirements so different that a locked-into-memory
> monitoring daemon cannot implement this timeout?

If we could _have_ an independent, locked-into-memory monitoring daemon
just for this purpose, we might be able to get by. It would still be
iffy, for the reason I mention above; at our scale anything that _can_
fail _will_ fail, at least occasionally and often many times per day.
Unfortunately, a dedicated daemon is a non-starter for a number of
reasons, including but not limited to the fact that the environment it
would run in is memory-constrained. Add to that the fact that the daemon we
actually have depends on stuff that my team has no direct control over
and we end up really needing an in-kernel way to deal with this.

--
Frank Mayhar
310-460-4042