Friday, September 19, 2014

Thermal Zones

Today we're going to be taking a look into thermal zones, which are essentially different physical regions that the hardware platform is partitioned into. This partitioning is done so that when a sensor detects that a thermal zone is overheating, the system can use either passive or active cooling to cool the devices in that specific thermal zone. I'm still learning about them in depth, as they're relatively confusing.

Having a look at the above event data, this is the thermal zone data for the TZS0 sensor. Without more detailed event logs than this, I can only assume this is the sensor for the CPU. The reason for this is that the _PSL child object is used to list the processors in the thermal zone. If we saw _TZD instead of _PSL, it would list the non-processor devices in the thermal zone. For the moment however, we'll just assume the entire system was overheating, as this was a really hot laptop. Let's go down the line:

PSV

_PSV - Indicates the temperature at which the operating system starts passive cooling control. In our case here, this was 371 K (K = Kelvin). 371 Kelvin is 97.85 Celsius. The system we're dealing with in this post is an Acer Aspire 5740G, which according to tech specs houses an i5-430M for its processor. If we consult the manual:

We can see the max temperature is 105C regarding the CPU core, and 100C for the integrated graphics + IMC. With this said, the CPU started throttling (passive countermeasure to overheating) at 97.85C. How did the CPU know when to start throttling? PROCHOT#!

PROCHOT# is Intel's thermal throttle activity bit:

Note it states - The TCC will remain active until the system deasserts PROCHOT#. In fancy software engineer talk, this simply means that so long as the temperature continues to rise (or doesn't drop), it will continue to lower the CPU's power consumption until it's in a safe spot. The downside to this is that in extreme situations of overheating (like this one here), the system trips the critical threshold because it hung around at an unsafe operating temperature for too long (or got hotter), and it shuts down to prevent permanent damage.

TC1/2

_TC1/2 - These are both known as the Thermal Constants. Each is an object that evaluates to a constant for use in the passive cooling formula:

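ΔP [%] = _TC1 * (Tn - Tn-1) + _TC2 * (Tn - Tt)

Here ΔP is the change in performance, Tn is the current temperature, Tn-1 is the temperature at the previous sample, and Tt is the target temperature (_PSV).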

The return value is an integer containing Thermal Constant #'s 1 or 2. In our case, our return value was 50 for _TC2.
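To make the formula a bit more concrete, here's a minimal sketch in Python. Only the _TC2 value of 50 comes from the event data; the _TC1 value and the sample temperatures are made up for illustration, and the function name is my own.

```python
def performance_change(tc1, tc2, t_now, t_prev, t_target):
    """Evaluate the passive cooling formula and return delta-P in percent.

    t_now    -> Tn   (current temperature)
    t_prev   -> Tn-1 (temperature at the previous sample)
    t_target -> Tt   (target temperature, i.e. _PSV)
    All three temperatures must use the same unit.
    """
    return tc1 * (t_now - t_prev) + tc2 * (t_now - t_target)


# _TC2 = 50 is from the event data; _TC1 = 1 and the temperatures
# (in Kelvin) are hypothetical values chosen for the example.
delta_p = performance_change(tc1=1, tc2=50, t_now=372, t_prev=371, t_target=371)
print(delta_p)
```

With these numbers, one Kelvin over the target and one Kelvin of rise since the last sample give 1 * 1 + 50 * 1. As far as I understand it, a positive result means OSPM should reduce performance, but how that percentage maps onto actual throttling is hardware-specific.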

TSP

_TSP - Evaluates to a thermal sampling period (in tenths of seconds) used by Operating System-directed configuration and Power Management (OSPM) to implement the passive cooling equation. This value, along with _TC1 and _TC2, enables OSPM to provide the hysteresis the system needs for effective passive cooling.

In our case, this was 0ms.

AC(x)

_AC(x) - This optional object, if present under a thermal zone, returns the temperature trip point at which Operating System-directed configuration and Power Management (OSPM) must start or stop active cooling, where (x) is a value between 0 and 9 that designates multiple active cooling levels of the thermal zone.

In our case, we can see it does go 0-9, and all are listed as 0 Kelvin.
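As a rough sketch of how a set of _AC(x) trip points could be read, here's a small Python function that returns which active cooling levels would be engaged at a given temperature. Treating a 0 Kelvin entry as "not set" is purely my assumption based on this log, and the non-zero trip points in the example are hypothetical.

```python
def engaged_active_levels(temp_k, ac_trip_points_k):
    """Return the indices x whose _AC(x) trip point has been reached.

    ac_trip_points_k is a list indexed by x (0 through 9).
    Assumption: an entry of 0 means the trip point is not set, so it is skipped.
    """
    return [
        x
        for x, trip in enumerate(ac_trip_points_k)
        if trip != 0 and temp_k >= trip
    ]


# What this laptop reported: every _AC(x) listed as 0 Kelvin.
this_laptop = [0] * 10

# Hypothetical zone with three active levels, _AC0 being the hottest.
hypothetical = [360, 350, 340] + [0] * 7

print(engaged_active_levels(372, this_laptop))
print(engaged_active_levels(355, hypothetical))
```

Under that assumption, this laptop's zone never engages an active level no matter how hot it gets, which would fit with passive throttling being the only countermeasure we see in the data.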

CRT

_CRT - Indicates the temperature at which the operating system will shut down, as it simply cannot throttle (passive) or use fans (active) to bring temperatures down in time to prevent permanent damage. In this case, we can see it was 373 K (373 Kelvin) when this occurred, which is 99.85 Celsius. I was unsure as to why this is 99.85 and not 100, but the likely answer is that it was rounded down: 100 Celsius is 373.15 Kelvin, which shows as 373 K once the fraction is dropped.
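The Kelvin/Celsius arithmetic used for _PSV and _CRT can be checked with a few lines of Python. This is just a sketch of the conversion; the helper names are my own, and the idea that the log drops the fractional part of the Kelvin value is an assumption.

```python
KELVIN_OFFSET = 273.15


def kelvin_to_celsius(kelvin):
    """Convert Kelvin to Celsius, rounded to two decimal places."""
    return round(kelvin - KELVIN_OFFSET, 2)


def celsius_to_whole_kelvin(celsius):
    """Convert Celsius to Kelvin and drop the fraction (assumed log behaviour)."""
    return int(celsius + KELVIN_OFFSET)


print(kelvin_to_celsius(371))        # _PSV from the event data
print(kelvin_to_celsius(373))        # _CRT from the event data
print(celsius_to_whole_kelvin(98))   # whole-degree Celsius guess for _PSV
print(celsius_to_whole_kelvin(100))  # whole-degree Celsius guess for _CRT
```

Going the other way, 98 C and 100 C come out as 371 K and 373 K once the .15 is dropped, which would be consistent with the trip points being set at whole Celsius values, though that part is a guess on my end.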

HOT

_HOT - Indicates the return value (temperature) at which Operating System-directed configuration and Power Management (OSPM) may choose to transition the system into the S4 sleeping state.
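Putting the trip points together, here's a minimal Python sketch of which response applies at a given temperature. The _PSV (371 K) and _CRT (373 K) values are from the event data; the _HOT value of 372 K is hypothetical, and the function is only my simplified reading of the descriptions above, not how any real OSPM is implemented.

```python
def thermal_response(temp_k, psv_k, crt_k, hot_k=None):
    """Return a label for the response expected at temp_k (all values in Kelvin).

    Checks the most severe trip point first. hot_k is optional because
    _HOT may not be present, and OSPM only *may* choose S4 when it is.
    """
    if temp_k >= crt_k:
        return "critical shutdown (_CRT)"
    if hot_k is not None and temp_k >= hot_k:
        return "possible S4 transition (_HOT)"
    if temp_k >= psv_k:
        return "passive cooling (_PSV)"
    return "no action"


# _PSV and _CRT are from the post; hot_k=372 is a made-up value.
for temp in (370, 371, 372, 373):
    print(temp, thermal_response(temp, psv_k=371, crt_k=373, hot_k=372))
```

What stands out is how narrow the window is: with only 2 K between _PSV and _CRT, there isn't much room for throttling to work before the shutdown threshold is hit.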

--------------------

In any case, all we know is that this laptop was having some serious overheating problems. It was occurring mostly overnight when the system was being awoken from sleep to perform a defrag. The defrag was enough to push it to throttle/shutdown temperatures, which was also throwing bug checks as Windows wasn't happy it couldn't defrag.

Thanks for reading, and hopefully more info on thermal zones in the near future.