thermal events (or lack thereof) - Debian

This is a discussion on thermal events (or lack thereof) - Debian ; Hi,
I've been working steadily over the past few weeks to get my new HP nx6125
working under Debian (amd64 port) and have made significant progress.
However, there is one considerable problem: thermal events don't seem to be
recognised or ...

thermal events (or lack thereof)

Hi,

I've been working steadily over the past few weeks to get my new HP nx6125
working under Debian (amd64 port) and have made significant progress.
However, there is one considerable problem: thermal events don't seem to be
recognised or processed by the kernel (until I do a
cat /proc/acpi/thermal_zone/TZ?/temperature). As soon as I do anything CPU
intensive I really run the risk of frying my laptop :-(

To be more specific, I am running kernel 2.6.14.3
(www.kernel.org vanilla) with the double timer patch applied (seehttp://bugzilla.kernel.org/attachmen...61&action=view). When I boot
up, my thermal trip points get set nicely. The first is at 58 *C, then 65*C,
then 75 *C, and 80*C (S5 = 95 *C). When I was testing things I was doing
frequent executions of

cat /proc/acpi/thermal_zone/TZ?/temperature

and I would observe the temp rise to 58 C, then the fan would kick in, the
first trip point would then (automatically) re-set to 50 C and the CPU would
cool through 8 C before the fan turned off (nice, I thought, and very clever
this re-setting of trip points--sorry I'm very new to ACPI). When the fan
turned off, the trip point would again re-set to 58 C. So, I thought all was
working well. However, subsequent tests done by running glxgears and not
executing the above cat command allowed the CPU temp to rise above several
trip points without the fans kicking in! Only when I ran the above cat
command did the fans start!?

So, I stopped acpid and did a

cat /proc/acpi/event

while running glxgears. I waited a while and then did a
cat /proc/acpi/thermal_zone/TZ?/temperature to see that indeed the temp of
TZ1 had exceeded 58C---and immediately /proc/acpi/event received a thermal
event (note: the temp had already exceeded 58 C, my first thermal trip point;
the thermal event only occurred when I did the 'cat'). So, in order for
thermal events to "get through/processed" I need to keep doing
cat /proc/acpi/thermal_zone/TZ?/temperature!!!!

Can anybody shed some light on this behaviour. I don't know much about ACPI,
but it seems (?) like the linux kernel is not processing the thermal events
properly. Incidentally, I am also seeing spurious syslog errors that read
APIC error on CPU0: 40(40) meaning that some interrupts presumably are not
being correctly identified by the interrupt controller (thermal ones? could
there be some correlation here?).

Other info: the HP nx6125 is a Turion 64 based laptop with ATI chipset (yes, I
know). I am running acpid and have just installed powernowd (doesn't fix it).
I have also observed the above behaviour running the standard Debian
2.6.12-1-amd64 kernel (booting with no_timer_check to avoid double timer
interrupts). Any help, suggestions would be greatly appreciated. If anyone
has any ideas why catting /proc/acpi/thermal_zone/TZ?/temperature gets
things to work, I'd be very happy to hear an explanation, too.

Re: thermal events (or lack thereof)

Richard Mace wrote:
> I've been working steadily over the past few weeks to get my new HP nx6125
> working under Debian (amd64 port) and have made significant progress.
> However, there is one considerable problem: thermal events don't seem to be
> recognised or processed by the kernel (until I do a
> cat /proc/acpi/thermal_zone/TZ?/temperature). As soon as I do anything CPU
> intensive I really run the risk of frying my laptop :-(

For what it's worth, I used to have that problem but that got fixed
somewhere around 2.6.11 . This is on a regular amd laptop, not amd64.
> To be more specific, I am running kernel 2.6.14.3

Strange. Could it be a regression ? A good workaround for me was
to echo a 1 into /proc/acpi/thermal_zone/THRM/polling_frequency from
a bootscript. This had the same side-effect as periodically reading
the temperature and it didn't consume much CPU.

The problem had to do with delayed ACPI events, not just thermal
events. It was easiest to diagnose with a non-preemptible kernel.
There was an 8--10 event delay in processing. For example it would
take several keypresses to change the screen brightness, and
the brightness would overshoot for the first few keypresses when
I tried to bring it back.

With a preemptible kernel the pattern was not so clear. The
delay queue would tend to empty itself with no intervention,
but not reliably. It just made the problem harder to observe.
The polling_frequency hack still worked so that's what I did.