On Tue, 2006-05-02 at 07:12 +0200, Andi Kleen wrote:
> On Monday 01 May 2006 21:56, Vivek Goyal wrote:
> > On Fri, Apr 28, 2006 at 06:19:24PM -0400, Don Zickus wrote:
> > > When kexec goes to issue an nmi it uses set_nmi_callback() to have the
> > > other cpus execute the proper shutdown code. Unfortunately, under certain
> > > situations set_nmi_callback will fail (ie oprofile has it reserved
> > > already). This will cause kexec/kdump to hang and do nothing. :(
> > >
> >
> > Looking at the set_nmi_callback(), there does not seem to be anything
> > which will make it fail. I think enabling profiling support will only
> > disable any regular NMI generation from LAPIC for watchdog purposes because
> > performance registers being used for NMI generation are claimed back.
> >
> > So even if profiling is enabled, kexec/kdump should not fail.
>
> profiling just registers a lower priority callback. Also with Don's
> changes profiling will only trigger when there are profile events
> anyways - so all the interactions will be much cleaner.
>
> >
> > > After talking to Andi, he mentioned that subsystems should be using the
> > > notifier callback on the die chain instead. The included patch
> > > incorporates that. The priority is set to 0, hopefully causing the
> > > notifier to be the first one called.
> > >
> >
> > Ok if the goal is to force the subsystems to rely on die notifier chain
> > instead of nmi_callback and getting rid of set_nmi_callback() interfaces,
> > then it spells some problems for kdump, as kdump is different for other
> > subsystems. You rightly pointed out that what if chain is corrupted
> > or if some die notifier function hangs.
>
> All NMI handlers think they are different and more special than everybody
> else. Otherwise they wouldn't be NMI. kdump is really in no way special.

If what we want is a reliable crash dumping solution, kdump should be
treated as a special case (see discussion below).

> > Looks like that notifiers are called in increasing priority order. Looking
> > at the code, it looks like notifier with priority 0x7fffffff will be called
> > first. But still there is no guarantee. People registering first with
> > this priority will be called first. Kdump registers in the end hence
> > will be called last, so liable to fail.
>
> Sorry, but that's just a dumb argument. All kernel code needs to cooperate
> with others - if there is a problem it's just fixed. But having multiple
> callbacks just because you don't trust someone else doesn't make sense.

kdump constitutes a special case because it is executed after a system
crash. Notifiers called before kdump's may make assumptions (about the
stacks, for example) that do not hold in the event of a crash, thus
compromising the crash dumping process. Invoking notify_die does not
necessarily mean that the system is going to crash right afterward (the
kernel can recover gracefully from certain oopses, for example). As you
mentioned, code needs to cooperate, and the offending callbacks could be
modified to handle the crash case, but in some cases this may be
overkill. Besides, by using the notifier infrastructure we are
introducing extra complexity in a critical path, which may cause
problems (if the die_chain is corrupted, for example).

Moreover, the default NMI handler and the notify_die function itself use
the stack profusely without checking the validity of the stack pointer
or the state of the stacks (of course this applies to the current
implementation too). After a crash the state of the system is unknown,
and we may end up overflowing the stack, or bloating it further if it is
already bloated. For this reason kdump is very likely to fail in stack
overflow scenarios. I will elaborate on this in the next email.
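
As a footnote to the ordering concern quoted above, here is a minimal
user-space sketch of a priority-ordered callback chain. It is not kernel
code: struct notifier, chain_register(), chain_call() and the SKETCH_*
return values are hypothetical stand-ins, and the insertion rule
(descending priority, registration order among equal priorities) is an
assumption taken from Vivek's description of the chain, not a copy of
the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a notifier block; not the kernel structure. */
struct notifier {
	const char *name;
	int priority;
	int (*call)(struct notifier *self);
	struct notifier *next;
};

#define SKETCH_DONE 0	/* keep walking the chain */
#define SKETCH_STOP 1	/* stop the walk, similar in spirit to NOTIFY_STOP */

/*
 * Assumed rule: keep the list sorted by descending priority. A new entry
 * goes after every existing entry of equal or higher priority, so among
 * equals the earliest registration is called first.
 */
static void chain_register(struct notifier **head, struct notifier *n)
{
	assert(n != NULL);
	while (*head != NULL && n->priority <= (*head)->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Walk the chain, appending each callee's name to trace until one stops it. */
static void chain_call(struct notifier *head, char *trace, size_t len)
{
	assert(len > 0);
	trace[0] = '\0';
	for (struct notifier *n = head; n != NULL; n = n->next) {
		size_t used = strlen(trace);

		snprintf(trace + used, len - used, "%s ", n->name);
		if (n->call(n) == SKETCH_STOP)
			break;
	}
}

static int handler_ok(struct notifier *self)
{
	(void)self;
	return SKETCH_DONE;
}

static int handler_stop(struct notifier *self)
{
	(void)self;
	return SKETCH_STOP;
}

/* Another subsystem claims the top priority first; kdump registers last. */
static void demo_order(char *trace, size_t len)
{
	struct notifier low = { "low", 0, handler_ok, NULL };
	struct notifier other = { "other", 0x7fffffff, handler_ok, NULL };
	struct notifier kdump = { "kdump", 0x7fffffff, handler_stop, NULL };
	struct notifier *head = NULL;

	chain_register(&head, &low);
	chain_register(&head, &other);
	chain_register(&head, &kdump);
	chain_call(head, trace, len);
}
```

Calling demo_order() with a 64-byte buffer leaves "other kdump " in it:
even at the highest priority, the kdump entry in this sketch runs only
after the callback that registered earlier with the same priority, so
anything that callback does wrong happens before kdump gets control.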

Regards,
Fernando