From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7)
Gecko/20040803 Firefox/0.9.3
Description of problem:
The new kernel from Update 3 panics during boot. The attached panic is
from a Dell PE2650 using latest BIOS (A18) and RAID firmware (2.8-0).
A similar PE2650 with the same RAID firmware, and another with
previous firmware (2.7-1), don't have this problem, but we did
experience this on a PE2550 using 2.7-1 firmware.
This is somewhat strange, as the problem isn't reproducible everywhere.
On some boxes we don't see this problem, but on the boxes that have
this problem it is always reproducible.
The latest kernel from Update 2 (2.4.21-15.0.4.ELsmp) works fine on
either firmware release.
Version-Release number of selected component (if applicable):
kernel-2.4.21-20.ELsmp
How reproducible:
Sometimes
Steps to Reproduce:
1. Upgrade to latest firmware on a Dell PE2650 running RHEL3
2. Install latest kernel from Red Hat
3. Reboot the new kernel
Actual Results: The system panics.
Expected Results: Normal bootup.
Additional info:

I receive a very similar error. A PowerEdge 2650 was updated with
up2date and rebooted, and the system worked fine. Another reboot of
the system brought up the error message as above, and the system no
longer boots with the kernel-2.4.21-20.ELsmp kernel.

We are investigating.
In the meantime, the previous version of the aacraid driver is
preserved in U3 as aacraid_00909.o. I expect that the older driver
will not have this problem. If you are able to get the system up, you
can change aacraid to aacraid_00909 in modules.conf, re-make the
initrd, then boot the U3 kernel. We will develop a more complete
solution once we determine the cause of the problem.

Additional information:
Seems to be SMP related.
2.4.21-20.EL-smp on Dell PowerEdge 2550 (BIOS A09) with PERC3/Di
(firmware 2.7-0 Build 3546) panics, but 2.4.21-20.EL works.
Confirmed - aacraid_10102 does work in 2.4.21-20.EL-smp on this machine.
Also note that a Dell PowerEdge 2650 (BIOS A18) with PERC/Di (firmware
V2.8-0 Build 6089) does NOT have this issue.

Also get this on 2650, Phoenix BIOS Revision A15, Dell PowerEdge
Expandable RAID controller 3/Di BIOS v2.7-1, Dell PowerEdge
Expandable RAID controller BIOS v3.31. Occurs 50% of the time using
the 2.4.20-20ELsmp kernel, not on the 2.4.20-20EL kernel or the
2.4.20-15ELsmp kernel. See attachment #103861.

Count me as another Dell 2650 user.. Thanks for the heads-up regarding
aacraid_10102.
I was able to boot the -20 non-SMP kernel, and use that to copy the
aacraid_10102.o from /lib/modules into the initrd, and the system
was able to boot the SMP kernel.
Quick summary of workaround:
zcat /boot/initrd-2.4.21-20.ELsmp.img > /tmp/smp.initrd
mount -o loop /tmp/smp.initrd /mnt/loop
cp /lib/modules/2.4.21-20.ELsmp/kernel/drivers/addon/aacraid_10102/aacraid_10102.o \
   /mnt/loop/lib/aacraid.o
umount /mnt/loop
gzip /tmp/smp.initrd
cp /tmp/smp.initrd.gz /boot/initrd-2.4.21-20.ELsmp-test.img
and then point grub at that initrd.

That will work, but an easier way to do the workaround is:
edit /etc/modules.conf and replace "aacraid" with "aacraid_10102"
/sbin/mkinitrd /boot/initrd-2.4.21-20.ELsmp.img 2.4.21-20.ELsmp
Just remember to revert /etc/modules.conf when Red Hat releases a fixed kernel.
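For anyone scripting the modules.conf edit above, here is a minimal sketch of it as a shell function. The name swap_aacraid_alias and its arguments are made up for illustration and are not part of any Red Hat tool. It assumes the alias line has the usual "alias scsi_hostadapter <module>" form, and it does nothing if no such line names the old module, so rerunning it should be harmless. The same function can switch the alias back once a fixed kernel is out.

```sh
# Hypothetical helper (an assumption, not from Red Hat): switch the
# scsi_hostadapter alias in a modules.conf-style file from one module
# name to another.
# Usage: swap_aacraid_alias FILE FROM TO
swap_aacraid_alias() {
    conf=$1; from=$2; to=$3
    # Matches "alias scsi_hostadapter" (optionally numbered) plus spacing.
    prefix='^\(alias[[:space:]][[:space:]]*scsi_hostadapter[0-9]*[[:space:]][[:space:]]*\)'
    # Nothing to do if no alias line names exactly $from.
    grep -q "${prefix}${from}[[:space:]]*\$" "$conf" || return 0
    # Keep a timestamped backup before editing.
    cp -p "$conf" "$conf.$(date +%y%m%d%H%M%S)" || return 1
    # Rewrite only the matching alias line(s); other lines pass through.
    sed "s/${prefix}${from}[[:space:]]*\$/\1${to}/" "$conf" > "$conf.new" || return 1
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$conf.new" > "$conf" && rm -f "$conf.new"
}

# Apply the workaround:
#   swap_aacraid_alias /etc/modules.conf aacraid aacraid_10102
# Revert after installing a fixed kernel:
#   swap_aacraid_alias /etc/modules.conf aacraid_10102 aacraid
```

The initrd still has to be rebuilt with mkinitrd after either change, since the driver is loaded from the initrd at boot.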

We also experience this exact issue after going from 2.4.21-15.0.4 to
2.4.21-20. System won't boot; kernel panic relating to aacraid
module as described in comment #1. PE2650 with 3-disk RAID5 on
PERC3/di (older firmware, don't have version handy).

Tom,
that is not a good idea, unless you preserve both aacraid_10102.o and
the buggy version in U4. Otherwise there might be no stable aacraid
driver in the U4 kernel.
Personally I would prefer a patched kernel for U3.

Whoa. Not acceptable. This is a bug that prevents booting of a machine, not a FE.
Many people *will* boot this config and get burned. Save yourself and us the headache of
3-4 months of dealing with workarounds and having to build unsupported driver modules
and just fix it. Please?
FWIW, the Dell PERC3/Di, an aacraid-based card, has some serious issues, so Dell is
saying: "Please upgrade to the most recent driver to try to fix your random lockups and
freezes!" which people do because RHEL3-UPD3 has a newer version of the driver, plus
support, and lo and behold it breaks your machine in a total and complete way.
I know that respinning the kernel is a PITA, but it needs to be done.

Getting this fixed sooner rather than later is the right answer. My group
has taken a bit of a black eye after updating our server to Update 3
and running into this problem head on. I'm not pleased to have to
integrate a workaround into production systems.
I would very much like to see this bug fixed ASAP.

Ernie, does this mean that a fix won't be around until U4 is released?
I wouldn't think this an acceptable solution, the PE2650 being one of
the most mainstream server boxes out there...
BTW, has anyone else experienced that kudzu hangs indefinitely when
booting boxes that have the 2.4.20-20.ELsmp with workaround applied?

It is slated to be included in any security errata that comes out
before Update 4, but there is none pending right now and with luck
there won't be.
If you can't use the older included driver for some reason, you can
use the 1.1.4.2302.1 driver in DKMS format on linux.dell.com.

I have two PE2650s (identical firmware setup); one of them panics
with 2.4.21-20smp but is okay with the 2.4.21-20 non-SMP kernel. It appears
to me that the system will panic if it has an IMPERFECT RAID status;
otherwise, it will boot into 2.4.21-20smp just fine.
Here are the situations that I encountered (I have mirror 0 for the
boot disk, disk0 and disk1):
1. Disk1 was flashing orange-green light, booting failed for 2.4.21-20smp
2. Re-insert disk1, no flashing orange-green light, booting 2.4.21-20smp
succeeded
3. Disk1 flashed orange-green light again after some time, rebooting
2.4.21-20 failed again.
4. Pull out disk1, booting 2.4.21-20smp failed.
5. Put in the new disk1 from DELL (raid container rebuilding), booting
2.4.21-20smp worked.
6. Shut down the system while the RAID was rebuilding (RAID is
imperfect), rebooting to 2.4.21-20smp failed.
7. Booting into a working kernel and letting the RAID rebuild finish,
rebooting into 2.4.21-20smp worked again.
When will we have the kernel patch? I hate having to do those workarounds.

> When will we have the kernel patch?
As stated earlier, the patch in comment 12 will be in U4, and it is
also proposed to be in the first U3 errata, if there is one.
> I hate having to do those workarounds.
There are some better workarounds in comment 7/8, and 28.

Sorry I'm being unclear.
#33 comes from a system using 1.1.5 (eg: RHEL 3 UPD 3 default driver)
The crash mentioned in #34 is the same system using aacraid_10102. The crash is a hard
deadlock of the machine (no panic or OOPS). 1.1.5 + 6092 is supposed to fix the
deadlock. Maybe.
Both were same hardware, so why was it running 2 drivers today you ask? It was
upgraded from UPD2 to UPD3 finally this morning and is unhappy with all the drivers. I
tried on a fluke to see if 1.1.5 + 6092 would boot (it does for another identical machine)
and it did not so I backed down to aacraid_10102 to get the machine to boot but it's still
crashing.

Nathan,
Try this pre-release kernel:
http://people.redhat.com/coughlan/RHEL3-perf-test/
Warning: this is a pre-beta U4 test kernel. It has not been through
QA. It must not be used in production. It is only to be used for
early testing and feedback.
This kernel has the 1.1.5 driver with the patch in comment 12. If you
still have the deadlock with latest firmware, then please open a new
BZ. You have a different problem.
Tom

Based on Stefan Hudson's Additional Comment #16, I put together this
script. I've run it on my existing 2650's and added it to my kickstart
for the servers I'm building. HTH...
#!/bin/sh
# The aacraid driver released with Red Hat Enterprise
# Linux 3, Update 3 has problems that can prevent a Dell
# PowerEdge 2650 server from booting. The workaround is to
# use the older aacraid_10102 driver. Two changes are
# needed to implement this. The /etc/modules.conf file
# should specify the aacraid_10102 module. An initrd file
# containing the other driver needs to be in place to
# make the correct driver available at bootup.
# Modify the modules.conf
timestamp=$( date "+%y%m%d%H%M%S" )
cp /etc/modules.conf /etc/modules.conf.${timestamp}
patch /etc/modules.conf <<EOPATCH
3c3,6
< alias scsi_hostadapter aacraid
---
> # For RHEL 3 EL U3, there is a bug with the aacraid driver.
> # The workaround is to use the aacraid_10102 driver. Be sure
> # to change this back with future RHEL version upgrades.
> alias scsi_hostadapter aacraid_10102
EOPATCH
# Rebuild the initrd file
mv /boot/initrd-2.4.21-20.ELsmp.img \
   /boot/initrd-2.4.21-20.ELsmp.img.${timestamp}
mkinitrd /boot/initrd-2.4.21-20.ELsmp.img 2.4.21-20.ELsmp
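The patch in the script above assumes the alias sits on line 3 of modules.conf, which may not hold on every box. A possible pre-check for kickstart use is sketched below. The function name needs_aacraid_workaround is hypothetical. It only reports whether a scsi_hostadapter alias still points at plain aacraid, so the patch and initrd rebuild can be skipped on machines that don't need them or have already been fixed.

```sh
# Hypothetical guard (an assumption, not part of the script above):
# succeeds (exit 0) only if some "alias scsi_hostadapter[N]" line in the
# given file names exactly "aacraid". Commented-out lines are ignored
# because their first field is "#", not "alias".
# Usage: needs_aacraid_workaround [FILE]   (defaults to /etc/modules.conf)
needs_aacraid_workaround() {
    awk '$1 == "alias" && $2 ~ /^scsi_hostadapter[0-9]*$/ && $3 == "aacraid" { found = 1 }
         END { exit !found }' "${1:-/etc/modules.conf}"
}

# Example use in a kickstart %post section:
#   if needs_aacraid_workaround; then
#       ... apply the modules.conf change and rebuild the initrd ...
#   fi
```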

I see that the fix hasn't been included in 2.4.20-20.0.1.ELsmp, as
suggested in comment #28 and comment #31. Whether this is a slip-up or
a thought-through decision is unknown, but anyway it's a disappointment
that Red Hat doesn't take this problem more seriously.

Just a note that I recently installed RHEL 3 on a Fujitsu-Siemens
Primergy RX600 server (2 CPUs, Adaptec AIC-7902 U320 hardware RAID)
and downloaded 222 (!) RPM updates, including the 2.4.21-20.0.1.ELsmp
kernel, and am seeing the same reboot crash that people here are, so
it's not just restricted to Dell PowerEdges.
The crash seems to be intermittent and the latest one I got was during
an "insmod" of the aacraid driver according to the console output
(i.e. pretty well identical to Nathan's crash output in comment #33).
I must say that releasing a new kernel on 2nd December without this
problem fixed is very poor when the fix has been on this thread for
over 2 months. Priority "high" and Severity "high" apparently aren't
good enough to get this crucial fix in the kernel :-(

Ah... so the reason that the fix wasn't included in the latest errata
is that U4 is just around the corner? Then I withdraw my criticism and
look forward to the arrival of U4. Good work guys :)
(Since we're counting.. I have 80+ pe2650s. I think I'm in the lead ;)

Just to add a "me too"
I am having the same problem with an IBM x306 that has an Adaptec raid
controller and is using the aacraid module. I am now able to boot
using the workaround in comment #16 above.
I am also looking forward to the release of U4 next week.

Calm down, calm down. U4 for RHEL 2.1 *did* release this week (I got
all the e-mails). One can only assume that either U4 for 3.0 is
imminent, or some information came to light during the 2.1 release
that is delaying 3.0.
Sux, since I could use the kernel patch, too, and also sux since we
don't know why it's not out yet, but hey. That's life. Programming's
hard. If U4 comes out next week, I'll be happy. If all software (and
construction) projects were only a week late, I'd be f***ing ecstatic...
--J
P.S. But I *want* that kernel update... :-/

An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2004-550.html

I don't think the problem is in the aacraid driver!
I have a PE2250 with BIOS version A09 and PERC 3/Di firmware version
2.8.0 build 6092.
After a kernel update to 2.4.21-27ELsmp the system won't boot.
I tried both the aacraid and aacraid_10102 drivers, but the result was a
series of "Segmentation fault" errors on every attempted filesystem
operation.
In the end the system froze.
aacraid_10102 is a preserved version of the old aacraid driver, and if
the problem were in the driver the system should be able to boot with
it, but it won't. I think the problem is in the interaction between
the kernel and the driver.
We had identical problems with a PE1650 server, which has just a SCSI
controller (not RAID).
The only way to bring the machines back was to use the old
kernel, 2.4.21-15.0.4.

Vlady,
The problem you are describing is not the same as the problem reported
in this bugzilla. Please open a new bugzilla. When you do, provide
the console output showing the driver being loaded and the device
configuration messages, and the subsequent error messages.
Also, which driver is being used in the non-RAID PE1650 system?
Tom
