x86-64 sporadic hang in 2.6.23rc7 and 2.6.22 - Kernel

x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

The two kernels mentioned hang occasionally.
Typically when I compile something and pass the time
by surfing the web.

A few minutes and then I notice that the mouse (and everything else in X)
stops. The kbd LEDs do not react to numlock/capslock.
The only thing that still works is sysrq+B.
So far this has happened while running X, so no messages.

Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

On Mon, 2007-09-24 at 23:08 +0200, Helge Hafting wrote:
> The two kernels mentioned hang occasionally.
> Typically when I compile something and pass the time
> by surfing the web.
>
> A few minutes and then I notice that the mouse (and everything else in X)
> stops. The kbd LEDs do not react to numlock/capslock.
> The only thing that still works is sysrq+B
> So far this has happened while running X, so no messages.
>
> I have gone back to 2.6.22rc4, which seems to work.
>
> This is a single opteron, although on a dual-slot board.

Can you switch to serial console, so we can get some information out of
that box? Sysrq-B is working, so we can get info from other sysrq
functions as well.
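
Before waiting for the next hang, it may help to confirm that SysRq output really reaches the serial console. The following is a minimal sketch, not a tested procedure for this box: it assumes a kernel built with CONFIG_MAGIC_SYSRQ, root privileges, and a serial console set up with boot parameters along the lines of console=ttyS0,115200n8 console=tty0. The helper name trigger_sysrq and the action names are made up for illustration; only the single-letter command keys and /proc/sysrq-trigger come from the kernel.

```python
from pathlib import Path

# Subset of SysRq command keys; the descriptive names on the left are
# hypothetical labels chosen for this sketch.
SYSRQ_KEYS = {
    "show-registers": "p",  # register dump of the current CPU
    "show-tasks": "t",      # task list with stack traces
    "show-memory": "m",     # memory usage summary
    "sync": "s",            # emergency sync of mounted filesystems
    "reboot": "b",          # immediate reboot, no sync or unmount
}


def trigger_sysrq(action, trigger_path="/proc/sysrq-trigger"):
    """Write the SysRq key for `action` to the trigger file and return it."""
    key = SYSRQ_KEYS[action]  # raises KeyError for an unknown action
    Path(trigger_path).write_text(key)
    return key
```

Running trigger_sysrq("show-tasks") as root on a healthy system should produce a task dump on the serial console. Once the machine is hung this script cannot run, so the same commands would have to be issued from the keyboard (Alt+SysRq+T, Alt+SysRq+P and so on).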

Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Thomas Gleixner wrote:
> On Mon, 2007-09-24 at 23:08 +0200, Helge Hafting wrote:
>
>> The two kernels mentioned hang occasionally.
>> Typically when I compile something and pass the time
>> by surfing the web.
>>
>> A few minutes and then I notice that the mouse (and everything else in X)
>> stops. The kbd LEDs do not react to numlock/capslock.
>> The only thing that still works is sysrq+B
>> So far this has happened while running X, so no messages.
>>
>> I have gone back to 2.6.22rc4, which seems to work.
>>
>> This is a single opteron, although on a dual-slot board.
>>
>
> Can you switch to serial console, so we can get some information out of
> that box? Sysrq-B is working, so we can get info from other sysrq
> functions as well.
>
I didn't need the serial - it crashes during console work too.
I think a "make clean" was in progress at the time. There must be work
going on in order to crash.

This time 2.6.22rc4 died on me with a general protection fault.

I got two reports, the first one scrolled partially off screen but
the whole trace was there:

Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

On Sat, 29 Sep 2007, Helge Hafting wrote:
> Thomas Gleixner wrote:
> > > I have gone back to 2.6.22rc4, which seems to work.
> > >
> > > This is a single opteron, although on a dual-slot board.
> > >
> >
> > Can you switch to serial console, so we can get some information out of
> > that box? Sysrq-B is working, so we can get info from other sysrq
> > functions as well.
> >
> I didn't need the serial - it crashes during console work too.
> I think a "make clean" was in progress at the time. There must be work going
> on in order to crash.
>
> This time 2.6.22rc4 died on me with a general protection fault
>
> I got two reports, the first one scrolled partially off screen but
> the whole trace was there:

That's why I asked for a serial console. That way we can get all the
information from the reports including the register dumps ....
> Then I got:
> spinlock lockup on cpu #0, kswapd0/212

Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Thomas Gleixner wrote:
> On Sat, 29 Sep 2007, Helge Hafting wrote:
>
>> Thomas Gleixner wrote:
>>
>>>> I have gone back to 2.6.22rc4, which seems to work.
>>>>
>>>> This is a single opteron, although on a dual-slot board.
>>>>
>>>>
>>> Can you switch to serial console, so we can get some information out of
>>> that box? Sysrq-B is working, so we can get info from other sysrq
>>> functions as well.
>>>
>>>
>> I didn't need the serial - it crashes during console work too.
>> I think a "make clean" was in progress at the time. There must be work going
>> on in order to crash.
>>
>> This time 2.6.22rc4 died on me with a general protection fault
>>
>> I got two reports, the first one scrolled partially off screen but
>> the whole trace was there:
>>
>
> That's why I asked for a serial console. That way we can get all the
> information from the reports including the register dumps ...
>
Sure. But I can't get a cable right now. Were the registers necessary
in this case? Often, the trace turns out to be enough.

Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Andi Kleen wrote:
> Helge Hafting writes:
>
>> shrink_dcache_memory
>>
>
> That usually means random memory corruption from somewhere -- dcache
> tends to use a lot of memory and when it is corrupted anywhere these
> functions tend to crash while walking the lists.
>
> Unfortunately memory corruption is hard to track down because
> the messenger is usually not the one to blame.
>
> Perhaps enable slab debugging and see if it turns
> something up. It could also be broken hardware. Does an older kernel
> run stable? If yes, and if it can be reproduced, bisecting would
> be good.
>
2.6.18 had no problem compiling stuff without crashing.
Looks like I have some work to do then.

Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Andi Kleen wrote:
> Helge Hafting writes:
>
>> shrink_dcache_memory
>>
>
> That usually means random memory corruption from somewhere -- dcache
> tends to use a lot of memory and when it is corrupted anywhere these
> functions tend to crash while walking the lists.
>
> Unfortunately memory corruption is hard to track down because
> the messenger is usually not the one to blame.
>
> Perhaps enable slab debugging and see if it turns
> something up. It could also be broken hardware. Does an older kernel
> run stable? If yes, and if it can be reproduced, bisecting would
> be good.
>
I attempted bisecting - and failed. The problem is that
2.6.23rc7 seems very unstable, while 2.6.22rc4 is much better
but not perfect, so I have no reliably good kernel to bisect from.
2.6.22rc4 only crashed once - it can compile for
hours and swap lots and keep running. But it died at least once.

I'll try running recent kernels with more debugging instead.
I think I used SLUB instead of SLAB - either way I can switch
that over to see if it changes things.
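
The bisection difficulty described above can be put in numbers. With an intermittent crash, a kernel that survives a test run is not necessarily good, so each bisection step needs several clean runs before it is marked good. The sketch below is a rough estimate only: it assumes every test run (say, one long compile) crashes a bad kernel independently with the same probability, and both that probability and the function name clean_runs_needed are hypothetical, not measured on this machine.

```python
import math


def clean_runs_needed(crash_prob_per_run, max_false_good):
    """Number of consecutive crash-free runs needed before a kernel that is
    actually bad would pass them all with probability <= max_false_good.

    Solves (1 - p)**n <= alpha for the smallest integer n.
    """
    if not 0.0 < crash_prob_per_run < 1.0:
        raise ValueError("crash_prob_per_run must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    return math.ceil(
        math.log(max_false_good) / math.log(1.0 - crash_prob_per_run)
    )
```

For example, if a bad kernel crashed in half of all runs, clean_runs_needed(0.5, 0.05) gives 5 clean runs per bisection step. The estimate only helps when a truly good kernel exists; if the older kernel also crashes now and then, as reported here, a hardware cause becomes more likely than a regression and bisection may never converge.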

Re: x86-64 sporadic hang in 2.6.23rc7 and 2.6.22

Thomas Gleixner wrote:
> On Sat, 29 Sep 2007, Helge Hafting wrote:
>
>> Thomas Gleixner wrote:
>>
>>>> I have gone back to 2.6.22rc4, which seems to work.
>>>>
>>>> This is a single opteron, although on a dual-slot board.
>>>>
>>>>
>>> Can you switch to serial console, so we can get some information out of
>>> that box? Sysrq-B is working, so we can get info from other sysrq
>>> functions as well.
>>>
>>>
>> I didn't need the serial - it crashes during console work too.
>> I think a "make clean" was in progress at the time. There must be work going
>> on in order to crash.
>>
>> This time 2.6.22rc4 died on me with a general protection fault
>>
>> I got two reports, the first one scrolled partially off screen but
>> the whole trace was there:
>>
>
> That's why I asked for a serial console. That way we can get all the
> information from the reports including the register dumps ....
>
I got another crash - with a full dump. I have also discovered
files with lots of single-bit errors, so this is probably just some kind
of hw problem. :-(
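
One way to check whether a damaged file fits the single-bit pattern is to compare it byte by byte with a known-good copy. This is a minimal sketch that assumes such a copy exists and has the same length; the function name classify_differences is made up for illustration.

```python
def classify_differences(reference, suspect):
    """Compare two equal-length byte strings.

    Returns (single_bit, multi_bit):
      single_bit -- list of (offset, bit_index) where exactly one bit differs
      multi_bit  -- list of offsets where more than one bit differs
    """
    if len(reference) != len(suspect):
        raise ValueError("inputs must have the same length")
    single_bit, multi_bit = [], []
    for offset, (good, bad) in enumerate(zip(reference, suspect)):
        diff = good ^ bad
        if diff == 0:
            continue  # bytes are identical
        if diff & (diff - 1) == 0:
            # Exactly one bit set in the XOR: a single flipped bit.
            single_bit.append((offset, diff.bit_length() - 1))
        else:
            multi_bit.append(offset)
    return single_bit, multi_bit
```

If nearly all differences land in single_bit, and especially if the same bit index keeps recurring, that would be consistent with faulty RAM or a bad data path rather than a kernel bug. A memory tester such as memtest86+ would be the usual next step.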