From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461)
Description of problem:
On a dual-CPU HP lp2000r with kernel 2.4.18-17.7.xsmp, we have twice experienced
a mysterious partial hang of the kernel (details below). Network and console
logins are usually still possible, but the resulting sessions are mostly unusable.
Version-Release number of selected component (if applicable):
How reproducible:
Didn't try
Steps to Reproduce:
Don't know of any systematic way to reproduce, except for leaving the system
on and waiting... It has, however, happened twice, about two weeks apart.
Additional info:
These hangs are almost too weird to describe :) On the machine there is an
apache and tomcat process running. Both crashes were noticed by the fact that
the web service was not responding anymore. The machine still responded to
ping, and I was actually able to ssh into it too. The connection (ssh login)
was a bit sluggish, and when I was logged in, typing something was a bit hard;
it seemed that the display always lagged one character (maybe network packet)
behind. So, when I entered 'ls', only 'l' was displayed, then pressed
enter, 'ls' was displayed, then pressed 'enter' again, and ls listing was
shown, etc etc. The same thing happened on the console too, so this wasn't a
network error.
Another thing I noticed was that the system clock seemed to be hung. That is,
'date' always returned the same time. From the logs we saw that cron had ceased
to execute tasks between 00:40 and 00:50 (sar was no longer run), and when we
gave 'shutdown -r 0' to the system, the logs showed that the reboot time was
about 00:43, even though it was actually already the following day. And no, the
shutdown also hung, and we were forced to use the main switch...
The whole system was very sluggish overall, and I couldn't run top, for
example. The ps listing completed ok, but showed nothing too unusual (sorry, no
saved data this time). free showed that the machine wasn't swapping, so that's
not the problem. According to 'w', the machine was also idle, so no processes
were running wild. There were also no panics or other kernel messages in dmesg
or the messages file, so there is no further info I can provide at this time.
So, overall this seemed like a partial kernel crash, leaving the
kernel unable to function properly. I know the description above most likely
isn't enough to isolate the problem, so is there something that I should
especially try or write down the next time it happens?
BTW, the system had been running kernel 2.4.18-10 for a couple of months, and
there were no crashes as far as I can recall. There seems to be a newer errata
kernel, 2.4.18-18.7.x. I might upgrade to that too, even if the bugs fixed don't
seem to match this problem.
About the system :
HP netserver lp2000r
Dual PIII 1GHz
HP NetRAID 2M with 6 disks (raid1)
RedHat 7.3, with the latest patches up to 11-Nov-2002.

Ok, more similar problems on two different HP lp2000r machines. The same
symptoms, sluggish machine etc. One machine was running 2.4.18-18.7.x and
the other was running 2.4.18-19.7.x.
This time, however, I noticed one very interesting thing. On both machines,
grepping from /proc/interrupts, the timer interrupts (#0) seemed to be happening
way too infrequently. I rebooted one of the machines (after which it seems to
run just fine) and calculated the timer rate from a 10 second sample (timed
externally :), and got about 512 Hz. On the other machine (the misbehaving one),
using the same method, I got about 9 Hz... That is, a total of 92 interrupts
(46 + 46) in ten seconds... I guess this explains why "sleep 1" lasts over
10 seconds and "vmstat 1" and "top" seem to freeze.
And one more note: on the other machine, the following appeared in dmesg.
It may or may not be relevant:
Jan 1 14:12:10 extra1 kernel: set_rtc_mmss: can't update from 59 to 12
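
The timer-rate measurement described above (two readings of the IRQ 0 row in /proc/interrupts, divided by an externally timed interval) can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and it assumes the usual /proc/interrupts layout of an "N:" label followed by one count per CPU. The elapsed time is passed in rather than measured locally, because the local clock is exactly what cannot be trusted on an affected machine.

```python
def timer_interrupt_count(proc_interrupts_text, irq="0"):
    """Sum the per-CPU counts on the /proc/interrupts row for one IRQ."""
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == irq + ":":
            total = 0
            for field in fields[1:]:
                # Per-CPU counts come first; stop at the controller name.
                if not field.isdigit():
                    break
                total += int(field)
            return total
    raise ValueError("IRQ %s not found" % irq)


def interrupt_rate_hz(before_text, after_text, elapsed_seconds, irq="0"):
    """Interrupts per second between two snapshots of /proc/interrupts.

    elapsed_seconds must come from an external clock (a watch, or a
    second machine), not from the system being diagnosed.
    """
    delta = (timer_interrupt_count(after_text, irq)
             - timer_interrupt_count(before_text, irq))
    return delta / elapsed_seconds
```

To use it, save the contents of /proc/interrupts twice (for example with cat into two files), time the gap with a stopwatch, and pass both texts plus the measured seconds to interrupt_rate_hz. With the numbers reported here (46 + 46 interrupts in ten seconds) it gives 9.2 Hz.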

One more observation: the timer interrupt rate seems to be decreasing all the
time. Yesterday it was at about 9 Hz, six hours later at 8 Hz, and now (the
following day) it seems to be around 6 Hz...
Is there anything to check to determine the reason for the timer interrupt
slowdown? Is it possible to verify timer circuit settings and/or reset the
timer frequency? Or could there be some other reason for the interrupts not
to be delivered, in case the timer works ok?
BTW, I checked the release notes for the latest BIOS update, and there was
nothing related to this kind of problem. As for the question of whether this
could be a HW problem: it could, of course, but it would have to be a generic
lp2000r HW bug. Also, we had no problems with some earlier kernels
(<= 2.4.18-10 ?). Were those (RH) kernels running with CONFIG_HZ=100 or 512?
Could this be a result of a timer wraparound or something like that?
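
As a rough aid to the wraparound question above, the following sketch computes how long a free-running tick counter takes to overflow at a given tick rate. The 32-bit counter width and the helper name are assumptions made for illustration only; nothing in this report establishes that a counter wraparound is the cause.

```python
SECONDS_PER_DAY = 86400


def days_until_wrap(hz, counter_bits=32):
    """Days until a counter incremented hz times per second overflows.

    counter_bits=32 is an assumption for illustration, not a statement
    about how these particular kernels count ticks.
    """
    return (2 ** counter_bits) / hz / SECONDS_PER_DAY


for hz in (100, 512):
    print("HZ=%d: about %.1f days" % (hz, days_until_wrap(hz)))
```

Under that assumption the counter would wrap after roughly 497 days at 100 Hz and roughly 97 days at 512 Hz, so the tick rate changes the time to overflow by about a factor of five.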

I can confirm this issue. I have had four HP lp2000r servers get stuck looping
the clock. I am also unable to shutdown the boxes without a force flag.
All of the servers were running the 2.4.18-18.7xsmp Redhat errata kernel. I
think we had a similar case to this on a previous kernel release as well. A
reboot of the box does restore the correct clock operation.

We've also seen this behavior on a dozen lp2000r machines as well as one lh3r.
All of them had dual procs. This happened on kernels from 2.4.18-17.7xsmp
through 2.4.18-19.7xsmp. Also, these machines won't boot if you install an SMP
kernel on a machine with only one CPU. It hangs right after the line:
Configuring 256 Unix98 ttys
This wasn't the case with the 2.4.9 series.

I have reproduced this problem consistently now 5 times in addition to watching
several servers fail with this problem in the wild.
It affects only SMP boxes with any i386 kernel over 2.4.10 and does not affect
RHAS. It seems particularly problematic with HP servers.
To reproduce the problem, the server need only be installed (we install over
the network using kickstart) with 7.x and left alone while still on the network.
Idleness is the common denominator. Also, going from high demand to low demand
or disuse speeds up the appearance of this issue. I can get a hang with these
symptoms in 1.5 to 4 days.
I can find no evidence in my research that indicates Asus CUR-DLS or CUR-DLSR
servers (which are identical to HP LP2000r and LP1000r servers in nearly every
way right down to the case design) are affected in the same way as HP
hardware. The primary difference between the two boxes is Asus uses Award BIOS
while HP uses a modified PhoenixBIOS.

I have also experienced the same problem. The server is an HP LH4R with
4 CPUs.
I am running kernel 2.4.18-24.7.x.
The server was previously running Red Hat 6.2 for a year and half with no
problems. It was recently upgraded to 7.3, and ran fine for a couple of
months with no issues. It has now experienced this problem twice in the last
fortnight.
I have disabled NTP, and configured the system to fire up in run-level 3, and
the server has been running fine for about a week. I am waiting to see how
this goes.

Just for the record, as a workaround, I have recompiled the kernel with
CONFIG_HZ=100, and haven't had any problems since. The problem most likely is
still there, but at least it seems to be much less frequent.

I have rebuilt the Kernel with 'CONFIG_HZ=100', and the server has been up for
almost 3 weeks with no problems, although I am experiencing some slight pain
from keeping my fingers and toes crossed.
The server has about 50 users on it during the day, so it is getting a decent
workout.
Thanks Ville (I hope that's right) for the suggestion, it has definitely
helped. Any ideas if this is the long term solution, or am I likely to see
the 'hang' again?

Same problem on 3 out of about 20 lp2000r machines.
Machines are 1.4 and 1.133 GHz SMP / 1 GB RAM / NetRAID 1M
BIOS versions are spread from 4.6.06 to 4.6.16.
Kernel is 2.4.20-18.7 and all errata installed.
I can reproduce the problem by starting a stress test procedure, running 1000
processes of 3 threads each. I kill all of them and soon the problem arises.
In addition, I get a continuous sound; maybe the machine has flatlined:
"BIIIIIIIIIIIIIIII" :-)
?!?
As long as the machine is RH certified on 7.2, I'll go on with that distro.

We are calling this phenomenon "TimeWarp."
It is not fully understood but I have spent a good while exploring and
experimenting with affected servers.
Here's what my group knows:
1) The problem is indeed a skewing problem between the two CPUs.
2) CONFIG_HZ = 100 is just a delaying tactic. TimeWarp still occurs at about
330 days on AS 2.1 with CONFIG_HZ = 100 kernels.
3) Typical failure time is 30 days.
4) The "trigger" is heavy or sustained activity followed by an abrupt cessation
of activity. Within three days of idleness, TimeWarp occurs. The above
prescribed method for reproducing failure is correct.
5) BIOS alterations using the F11 method are not fruitful.
6) Building/installing a custom kernel that turns off ALL elements of power
management (APM and ACPI) and other superfluous functionality results in at
least 180 days (and counting) of uptime even with CONFIG_HZ = 512.

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.
The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases,
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
