Description

dmesg attached, with loglevel=9 and libertas_debug=0x4000. To reproduce, on joyride-2386:

suspend (and turn off wireless) by closing lid

resume by opening lid

while driver resume is still happening, close lid again

I think the hang happens because we power down the device while it was being brought up, and khubd refuses to freeze (because it's still trying to send URBs?). Marking as blocks?:8.2.0, and we have some options:

back out turning off the wifi power on suspend

make that code quiesce the wifi somehow (how?) before powering it off

make that code delay performing a suspend that is less than a few seconds after a resume

Maybe there are other options here. What do you think, wireless/kernel people?

Something's screwy with command_timer here. I can physically remove the harware (libertas usb dongle) or remove the usb8xxx module, and rather than lbs functionality immediately stopping... I can watch ping hang for a few seconds, or get "usbcore: deregistering interface driver usb8xxx" followed by (a few seconds later) "libertas: Command 6 timed out". That should _not_ be happening...

lbs_add_card (called from if_usb_probe) inits command_timer; lbs_remove_card should be deleting the timer (synchronously!). Instead, it appears to call kthread_stop on lbs_thread, which then deletes command_timer before exiting. Problems include:
1) it deletes the timer asynchronously. we should be calling del_timer_sync.
2) kthread_stop doesn't immediately stop the thread, so we're hanging for a few seconds somewhere before checking kthread_should_stop.

I also don't see URB errors (which the host controller should be setting, if the libertas driver is in the middle of an URB submission), so.. my theory is that we're not currently waiting on an URB submission, but elsewhere in lbs_thread.

No, much the same, with a "command 50 timed out" in the logs. rmmod continues to be hung after that timeout. I guess a similar thing happens as with the suspend case -- the rmmod hang is due to us continuing to wait for the command response after the device goes away.

Greg, heads up that #7879 is jeopardized here, and might end up waiting for 8.2.1.

Thanks for the heads up. I don't think this one is that bad since it requires a close then quick open (or a short press of power key then another short press within 1 sec, right?) and it only affects users connected to APs.

That said, it will come up in the field, including in Uruguay and Peru. Also, I don't think you can avoid this one by not turning on power save in the SCP (right?).

We should take another stab at documenting where we are in terms of power save mode in SCP and power save via quick press of power key/lid close.

Let me know if you are aware of any other open bugs in power save that should be on that list. Let's collect all the power save relevant ones in to a single section. Then write up some release notes explanation of what will work and what wont in the various power save/suspend/idles modes?

I can look for someone to work with you on this if that helps, but I think you have all the main info. This will help us see where we are and therefore help decide what (if any) bugs we should try to address in 8.2 and what to focus on in 8.2.1.

Javier/libertas people, do you know if there's a way for me to quiesce the radio and command queue such that it becomes safe to rmmod? rmmod isn't that way, as above, but maybe one of the libertas commands for turning off the radio might be..?

Can somebody add 0x100000 to lbs_debug and reproduce the issue? we need to see what the main thread is doing at the time that it's told to quit, that might expose the source of the hang in lbs_thread() and allow us to figure out how to fix it.

can I get another log with 0x124007? LBS_DEB_MAIN would also be useful here. What I'm getting at is that I'd like to see if the USB code gets the disconnect event. The EC is clearly cutting the power and the 8388 drops off the bus, but that should trigger if_usb.c::if_usb_disconnect() which should clean stuff up. 0x124007 should show that. If thats too much output, just use 0x104007 instead which should show the enter/leave for if_usb_disconnect().

Javier/libertas people, do you know if there's a way for me to quiesce the radio and command queue such that it becomes safe to rmmod? rmmod isn't that way, as above, but maybe one of the libertas commands for turning off the radio might be..?

Would a simple "ifconfig eth0 down" do the trick? If the driver is properly behaved, it should quiesce the queue and wait for all outstanding commands to timeout or cancel them.

So the lbs_cmd_async() in lbs_stop_card() is probably coming from lbs_set_mcast_worker(), which is scheduled from lbs_eth_stop() which is called via netif_carrier_off(). Not a problem, but shouldn't be happening. We should be checking priv->surprise_removed before sending any new commands to the card from stop/remove paths.

The return value of lbs_stop_card() can be ignored because nothing actually touches it, and in fact it now returns void on 2.6.27.

I think the lbs_set_mac_control() and lbs_cmd_async() bits during lbs_remove_card() are the same thing since that code also does netif_carrier_off().

We can ignore the lack of a 'leave' for lbs_remove_rtap(), which is also fixed in 2.6.27.

The core problem looks like kthread_should_stop() isn't returning true in lbs_thread(), otherwise it would break out before the priv->surpriseremoved check in lbs_thread(), or it would set shouldsleep to 0 after and break out in the next iteration.

Can you put something like this in, rebuild and reproduce? We need to see if it's getting block in one of the timer/queue cancellation calls in lbs_remove_card() before it's able to hit the kthread_stop() near the bottom.

The command we are seeing timeout is 1d, which is an RF_CHANNEL, and in the previous logs it was 6, which is a SCAN. Are we also possibly still accepting commands from network manager once we get the USB disonnect? Or are those leftovers from the insmod/resume path that did not get processed before we shutdown power?

The command we are seeing timeout is 1d, which is an RF_CHANNEL, and in the previous logs it was 6, which is a SCAN. Are we also possibly still accepting commands from network manager once we get the USB disonnect? Or are those leftovers from the insmod/resume path that did not get processed before we shutdown power?

SCAN commands are definitely left around from before; the driver isn't clearing the command queue correctly. There's work to be done with command handling; priv->cur_cmd_retcode and cmdnode->result are all muddled together and used interchangably in some places. I'll try to fix that up.

RF_CHANNEL (set) only ever gets called from the association worker code as a result of WEXT calls. RF_CHANNEL (get) only gets called from WEXT code, or from the mesh autostart code when the firmware sends MACREG_INT_CODE_MESH_AUTO_STARTED, to ensure the driver and firmware agree on a channel.

I think that (besides the async MAC_CONTROL commands) all the commands we're seeing here are in-flight commands that aren't properly canceled while tearing down the driver.

cjb: any chance you can reproduce the failed suspend with the ~20s hang?

I tried adding a flush_workqueue() before the destroy_workqueue(), with the result that we hung in the flush_workqueue() instead -- then I noticed that destroy_ calls the same internal function as flush_. So, I think it's safe to say that we're hanging in flush_cpu_workqueue(), and the reason is that we're successfully clearing scheduled tasks from the runqueue, but we're not successfully clearing the active thread, and we're instead blocking on it to return, which it never does.

I don't know enough about the workqueue API yet to know what the right way for a kernel thread to know when to tear itself down immediately is.

lbs_stop_card() should be clearing out any in-flight commands, and the driver should also be failing command prep if priv->surpriseremoved == 1, which is done in lbs_cmd_async() for direct commands and lbs_prepare_and_send_command() for old-style commands. So there definitely shouldn't be any commands getting through after if_usb_disconnect().

But the command 6 (SCAN) surely is coming from earlier since nothing in the disconnect path will call scan, since there isn't any indication of a WEXT scan request coming in, and since there isn't any indication of an association request (which triggers a scan) coming in either.

scan.c::lbs_do_scan() calls cmd.c::lbs_cmd(CMD_802_11_SCAN), which then blocks on cmdnode->cmd_waitq. When lbs_stop_card() cycles through all the commands in the queue and stops them,

Argh. I forgot about LBS_DEB_HOST, which is what most of the command code uses when queuing commands. I hate to ask, but could you get another log with 0x126007 (which adds LBS_DEB_HOST) and also see if you can retrieve about 15 - 20 seconds of earlier logging before the stuff that you've attached so far? That should be enough to show us where the actual command that's in-flight at the time of disconnect is getting queued, and if it's successfully blocking on its cmdnode waitq. Thanks!

So here's the problem... priv->work_queue is a generic worker thread that executes commands, including SCAN. In newlibertas5.cap, the worker thread is blocked in lbs_do_scan() waiting for the queued scan command to finish. When lbs_remove_card() is called, it calls destroy_workqueue() which blocks on the worker finishing, but priv->work_queue is blocked waiting for the command to finish, so there's the hang.

What's not clear yet is why the bits in lbs_stop_card() aren't clearing the command and waking up the priv->work_queue and making the lbs_prepare_and_send_command() in lbs_do_scan() return with an error so that the scan work completes.

My current theory is that killing the current in-flight command is mishandled.

The current command actually gets removed from the pending command queue when it's in-flight, and thus the code in lbs_stop_card() doesn't look like it could possibly kill that command too. Can you try the following patch (completely untested, against stable)?