rt73 kernelbug:sleeping function called from invalid context
Hi. I've got a USB wifi interface that gives connectivity fine when the computer is first booted, but then drops connectivity at seemingly random intervals and emits this kernel bug
BUG sleeping function called from invalid context at kernel/sched.c3752
[<f89c9619>] RTUSB_VendorRequest+0xc5/0xd2 [rt73]
[<f89c97f1>] RTUSBReadMACRegister+0x31/0x35 [rt73]
[<f89c986e>] RTUSBWriteBBPRegister+0x1d/0x89 [rt73]
[<f89c3ce8>] AsicSwitchChannel+0x30a/0x3c0 [rt73]
[<f89c4114>] MlmePostRestartStateMachine+0x26/0x2e [rt73]
[<f89c5ed3>] MlmeHandler+0x5c/0x113 [rt73]
[<f89c17ca>] MlmeThread+0x68/0xab [rt73]
[<f89c1762>] MlmeThread+0x0/0xab [rt73]
I've tried with recent CVS builds as well, and seem to have the same problem.
My card and version
Bus 003 Device 003 ID 148f2573 Ralink Technology, Corp.
rt73 driver version - 184.108.40.206 CVS
Linux vid 220.127.116.11-72.fc6 #1 SMP Wed Nov 21 151259 EST 2007 i686 i686 i386 GNU/Linux
I can't use networking unless I remove the USB stick or reboot.
Should I recompile with debug and post that output?
Should I recompile with debug and post that output?[/quote3ezqib8z]If you could do that, it would help.
Here's the debug output.
Note that the issue happened at the following timestamps
In /var/log/messages it was this error
Apr 6 182454 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752
The debug shows
Apr 6 182441 vid kernel rt73 rt73_get_ether_stats --->
Apr 6 182453 vid last message repeated 14 times
each time before the crash. ?
description Ralink RT73 802.11abg WLAN Driver 18.104.22.168 CVS 2008040214
vermagic 22.214.171.124-72.fc6 SMP mod_unload 686 4KSTACKS
parm debugDebug mask n selects filter, 0 for none (int)
parm ifnameNetwork device name (default wlan%d) (charp)
parm firmNamePermit to load a different firmware (default rt73.bin) (charp)
Thanks for the log. Would it be possible to egrep out the time period 181524 - 182424 from /var/log/kern.log and attach a gzipped copy of that segment to a posting here?
Hi. Let me know if this was what you wanted. Thanks!
Was that file the one that was requested? I'm on Fedora, so I don't have a kernel.log...
Was that file the one that was requested?[/quote2bx3egbb]Close enough. Often /var/log/kern.log is set up to catch all kernel messages; so I'd been hoping to catch the BUG message surrounded by the driver debug messages.
Regardless, it looks like I overlooked some housekeeping on close, which left a little hole on a subsequent reopen. I think the attached patch - atom1.patch.gz - fixes that. It doesn't break anything in my environment. Could you see if it cures the symptoms in yours?
The patch applied cleanly and the module was rebuilt okay. So far there have not been any crashes, but I'll continue to monitor and let you know.
Unfortunately it crashed again the same way.
I'll add debugging back and try to get the crash logs.
Here's the BUG message. I've also attached the debug and messages portions.
Let me know if there's anything else I can do to help.
...oh, and it seems like it might be triggered by the network interface coming up and down, although I haven't been able to reliably reproduce this...
Apr 9 144407 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752
Apr 9 144407 vid openvpn SIGUSR1[soft,ping-restart] received, process restarting
Apr 9 144407 vid kernel in_atomic()0, irqs_disabled()1
Apr 9 144407 vid kernel [<c062e5f0>] wait_for_completion_timeout+0x1c/0xae
Apr 9 144407 vid kernel [<c0592c5e>] usb_start_wait_urb+0x52/0x87
Apr 9 144407 vid kernel [<c0592e6f>] usb_control_msg+0xb9/0xd4
Apr 9 144407 vid kernel [<f89e5088>] RTUSB_VendorRequest+0x107/0x17a [rt73]
Apr 9 144407 vid kernel [<f89e52c6>] RTUSBReadMACRegister+0x31/0x35 [rt73]
Apr 9 144407 vid kernel [<f89e5362>] RTUSBWriteBBPRegister+0x1f/0xa9 [rt73]
Apr 9 144407 vid kernel [<f89de002>] AsicSwitchChannel+0x30a/0x420 [rt73]
Apr 9 144407 vid kernel [<c042983a>] printk+0x1f/0x92
Apr 9 144407 vid kernel [<f89de552>] MlmePostRestartStateMachine+0x3e/0x47 [rt73]
Apr 9 144407 vid kernel [<f89e0849>] MlmeHandler+0x79/0x15a [rt73]
Apr 9 144407 vid kernel [<f89daf00>] MlmeThread+0x8f/0xe7 [rt73]
Apr 9 144407 vid kernel [<f89dae71>] MlmeThread+0x0/0xe7 [rt73]
Apr 9 144407 vid kernel [<c0405b6b>] kernel_thread_helper+0x7/0x10
Apr 9 144407 vid kernel =======================
Thanks for the debug info. It looks like the first patch did result in moving the symptom timing around a bit.
Unfortunately it crashed again the same way.[/quote2m3zeyna]Tiny comment here. Crash is when the keyboard lights blink, nothing happens when you type, and you're really thankful for the magic SysRq. What we're dealing with here is a kernel warning message.
Anyway, examining the execution path, I've modified the patch - attached as atom2.patch.gz - to sleep instead of doing program delay loops for 1/10th to 1/2 second. That *may* give the kernel a more favorable opinion of our execution context.
Could you try it and see what happens?
The kernel warning happened again with the new patch applied.
I recompiled with debugging and am trying to get debug info now.
One other thing I noticed is this warning message on iwlist. Coudl that be a clue?
# iwlist wlan0 scan
Warning Driver for device wlan0 has been compiled with version 22
of Wireless Extension, while this program supports up to version 20.
Some things may be broken...
wlan0 No scan results
To avoid this message, you should update wireless tools to version 29.
If you don't find some packages for your distribution, the tarball is available here[/url1tx59te7].
I recompiled with debugging and am trying to get debug info now.[/quote263454ue]Hope you haven't gotten too far into that. I think I've found the problem. Pretty obvious, too (looking back in retrospect).
Could you apply the attached atom3.patch.gz to the latest CVS and see if indeed we do have a fix?
I applied that patch to the latest CVS, and so far so good! Thanks!
I'll let it go a little bit more, and if it looks okay will deploy to 10 more computers with similar hardware setups.
It looks like that last patch worked! It hasn't crashed in over 12 hours. Thanks!
I'll deploy to more machines and let you know if there are any issues. Thanks for your help.
Will this fix make it into cvs soon?
Glad to see you're on the air. You should see the patch start to show up in the hourly CVS tarballs Real Soon Now. Thanks for your efforts.
BTW, your logs show there's evidently a *lot* of roaming going on. What are you doing? Deploy on another 10 machines? It all sounds like a very interesting environment. Anything you can share with us?
It is an interesting environment--they are used on taxicabs roaming around New York City...so the number of access points available at any given moment is huge. So such an environment has proven be useful for stress testing in a real-world environment.
I have preliminary results from the last upgrade. Statistically, the units on which the last patch was deployed have higher uptimes than those that do not/the same units before the upgrade. So I think it's definitely better...and I haven't been able to "catch" one of them misbehaving.
But, I do still have a kernel BUG in /var/log/messages. Nothing in dmesg so far that I've found.
As an example, this is what it looked like before the last upgrade
Apr 12 114716 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752
Apr 12 114749 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752
Apr 12 114805 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752
And now we have
Apr 12 130351 vid kernel kernel BUG at include/linux/timer.h153!
Apr 12 203909 vid kernel kernel BUG at include/linux/timer.h153!
Apr 13 154232 vid kernel kernel BUG at include/linux/timer.h153!
Apr 14 095730 vid kernel kernel BUG at include/linux/timer.h153!
Maybe it's a non-fatal flaw now?
Let me know if you'd like debug information on this.
Thanks for the info on your environment. Its probably the stressiest stress test the driver has gotten.
If you can work it in, can you arrange for /var/log/messages to hold debug messages, too, then run with debug enabled? That may help me get a better idea where in the processing the "kernel BUG at include ..." messages are being caused. If you can do that, no need to include /var/log/debug. Otherwise, I'll need it.
Here's the debug and messages files. I wasn't able to put them in the same file.
Let me know if I missed something.
The debug info was good. Thanks.
I'm not too sure I've thought about it enough, but I think the attached patch may be a fix for the "BUG at timer" message. Could you try it and see what happens?
I think that might have done it. It ran for several days without going down/having a BUG report.
I'll push it out to more machines and see if that fixed it!
Thanks for your help!
Indeed, that looks like it's stable. No new BUG messages, and the interface seems to stay up all the time now. I'd say it's ready for a commit.
Glad to see things are working. The change has been submitted to CVS, and you should see it show up in the hourly tarballs soon.
Thanks for the problem discovery, and all the testing you've done.