#7458: Intermitent suspend/resume lockup
----------------------------+-----------------------------------------------
Reporter: dsaxena | Owner: dsaxena
Type: defect | Status: assigned
Priority: blocker | Milestone: 8.2.0 (was Update.2)
Component: not assigned | Version: Development build as of this date
Resolution: | Keywords: joyride-2131:-
Next_action: diagnose | Verified: 0
Blockedby: | Blocking: 7393
----------------------------+-----------------------------------------------
Changes (by dsaxena):
* cc: mbletsas (added)
Comment:
I've spent some time running many many test iterations to find a pattern
and get
a better understanding of where the issue might be. I started by adding
debug code
to various paths in the kernel to trace the suspend path as we go down the
kernel stack
the main issue I was seeing is a lockup in the Libertas driver's suspend
path where
the driver submits a sleep request and the the device never responds.
I wanted to separate out the EC timeouts from the wireless lockup so
looked at the few changes in the EC handling code from 2.6.22 to 2.6.25,
backed them out, and ran the tests again. No luck.
Next I started looking at the driver itself and trying to see if there
are any major changes in the suspend/resume path. Since the code has been
massively updated between stable and 2.6.25, I ended up just replacing the
full 2.6.25 driver with the 2.6.22 driver. Ran tests again, no luck.
I was using 2.6.22 as a baseline b/c it is a know quantity and I
assumed that it did not have the same issues since we're shipping
703 and 708 based on it. This turned out to be an incorrect assumption.
When I tested with the 2.6.22 kernel, I was able to reproduce a
suspend/resume lockup under stress, but somewhat different from
what I saw with 2.6.25.
I also wanted to rule out the firmware as a source of the issue, so
I backed out the firmware that we are including in Joyride (5.110.22.p14)
and dropped in the firmware from build 703 (5.110.22p1). Still got a lock
up, but again, somewhat different.
Following is a summary of the various failure cases I saw in my extensive
testing. All tests
were performed with ec_timeout set to 40ms, using XO firmware q2e11. Test
consists of XO looping continuously and going to sleep with a host
continuously pinging the XO.
----
=== 2.6.25 kernel + 2.6.25 driver + old fw ===
* Sometimes see long (>> 10000ms) ping delays once we start sleep loop
* After a few sleep cycles, kernel stops at olpc_do_sleep(), and does
not wake back up from WLAN ("Destination Host Unreachable" messages on
host).
* ctrl-c on console does not do anything while stuck
* Moving mouse, power button, or other input wakes system back up but
goes back to waiting state on next sleep cycle.
* Can stop the sleep loop by hitting ctrl-c in the small window between
wake and sleep in above case and system immediately starts responding to
pings.
=== 2.6.25 kernel + 2.6.25 driver + new fw ===
''Failure Mode 1:''
* Sporadic ping response times after sleep looping beings, up to 20000ms
* After several cycles (anywhere from 30 to several 100s), system gets
stuck waiting for libertas to respond to sleep command.
* Even though the wireless card did not respond back (or the kernel did
not receive the response), we are receiving WLAN wake up events from the
EC:
{{{
olpc-ec: running cmd 0x84
olpc-ec: received 0x20
olpc-pm: SCI 0x20 received
olpc-ec: running cmd 0x84
olpc-ec: received 0x0
olpc-pm: SCI 0x0 received
}}}
Javier, Richard, is it possible to receive a 0x20 SCI if the WLAN chip
is not actually in suspend mode?
* Hitting ctrl-c on serial console causes sleep cycle to complete but
upon waking up we get a list corruption bug in libertas driver's
lbs_queue_cmd() function. What this points to is that the driver is still
processing when it should not. At this point the system is completely
dead.
''Failure Mode 2:''
* System hits olpc_do_sleep(), but we see an EC error when disabling
EC_SCI_SRC_BATSOC && EC_SCI_SRC_ACPWR events (Not sure why we do
this...Dilinger?):
{{{
/* We don't want to wake up on superfluous events */
olpc_ec_mask_unset(EC_SCI_SRC_BATSOC | EC_SCI_SRC_ACPWR);
}}}
{{{
[ 295.122289] olpc-ec: running cmd 0x1c
[ 295.122289] olpc-ec: timeout waiting for EC to read command!
[ 295.122289] olpc-pm: error getting SCI mask: -5
}}}
* Kernel tells EC to sleep:
{{{
[ 295.122289] olpc-ec: running cmd 0x32
[ 295.122289] olpc_do_sleep!
}}}
* We wake up from wlan event but kernel hangs on trying to wake up the
wireless device:
{{{
+r[ 295.122289] olpc-ec: running cmd 0x34
[ 295.122289] olpc-ec: running cmd 0x24
[ 295.122289] olpc-ec: running cmd 0x24
}}}
=== 2.6.25 kernel + 2.6.22 driver + new fw ===
* Sleeps once, wakes up, does not sleep again. Driver gets stuck waiting
for libertas to respond to sleep command
* Hitting ctrl-c on console causes sleep cycle to complete.
Touchpad/keybd/power button does not.
* Waking system back up casues list corruption bug in lbs_queue_cmd(),
but is not fatal. System still running but network interface is dead at
this point.
=== 2.6.25 kernel + 2.6.22 driver + old fw ===
* Sleeps once, wakes up, does not sleep again. Gets stuck waiting for
libertas to respond to sleep command
* Hitting ctrl-c on console causes sleep cycle to complete.
* Touchpad/keybd/power button does not.
* System does not have network connectivity at this point.
* Hitting ctrl-c on console at this point does nothing
* Touchpad/keybd/power causes system to wakeup and crash with a spinlock
error in the khub kernel driver.
=== 2.6.22 kernel + old fw ===
* Hangs at olpc_do_sleep() after a number of cycles.
* Keybd/mouse/pwr wakes system up
* Can wake system up and hit ctrl-c to stop sleep loop and system
immediately starts responding to pings.
=== 2.6.22 kernel + new fw ===
* Hangs at suspend of libertas driver
* ctrl-c completes suspend, stops at olpc_do_sleep()
* Waking up with touchpad/keyboard leads to list corruption in
lbs_queue_cmd()
* System completely hung at this point.
----
Note that even 2.6.22 kernel + old FW, which is what we are shipping in
703, does not always successfuly wake-on-lan, it just happens to not be
fatal.
I'm fairly certain the root cause of this is a race somewhere in the
kernel/EC/Libertas suspend stack. The kernel is sending a command to the
device and it may or not be succeeding but if it is, we're not getting
notified in time. Due to the race, we're getting out of sync, either
driver <-> libertas, or driver <-> EC. How this ends up impacting the
system can drastically change based on timing of the system. For example,
if I enable full debug output from the libertas driver ({{{echo 0xffffffff
> /sys/module/libertas/libertas_debug) I get a very different set of
errors from the driver compared to having it just set to 0.
I'm not sure what role the EC timeouts play in this whole mess. With the
40ms timeout, I'm barely seeing them but still running into the same
crashes so they might be orthogonal to the issues we're seeing here;
however, they still need to be looked into.
To really root cause this I think we're going to have to look at debug
output of all three
components of the system (kernel, EC, wireless) to see what they are
sending and how each component is reacting. Our goal is to ship latest
kernel + firmware, so we need to
focus on that specific combination.
Next steps:
* Javier, can you do some debug on your end to see what's happening in
the firmware when we hit the condition where the kernel does not get a
response to sleep? I can post my debug bits for kernel side.
* I will try to track down the lbs_queue_cmd() list corruption in the
kernel and look for general racyness in the generic olpc and driver
suspend/resume path.
* Richard/Dilinger/me should look into the EC timeouts.
--
Ticket URL: <http://dev.laptop.org/ticket/7458#comment:33>
One Laptop Per Child <http://laptop.org/>
OLPC bug tracking system