rt73 kernelbug:sleeping function called from invalid context

Live forum: http://rt2x00.serialmonkey.com/viewtopic.php?t=4717

vert

05-04-2008 18:14:10

Hi. I've got a USB wifi interface that gives connectivity fine when the computer is first booted, but then drops connectivity at seemingly random intervals and emits this kernel bug

BUG sleeping function called from invalid context at kernel/sched.c3752
in_atomic()0, irqs_disabled()1
[<c062e5f0>] wait_for_completion_timeout+0x1c/0xae
[<c0421228>] update_curr+0x13d/0x167
[<c0592c5e>] usb_start_wait_urb+0x52/0x87
[<c0592e6f>] usb_control_msg+0xb9/0xd4
[<f89c9619>] RTUSB_VendorRequest+0xc5/0xd2 [rt73]
[<f89c97f1>] RTUSBReadMACRegister+0x31/0x35 [rt73]
[<f89c986e>] RTUSBWriteBBPRegister+0x1d/0x89 [rt73]
[<f89c3ce8>] AsicSwitchChannel+0x30a/0x3c0 [rt73]
[<f89c4114>] MlmePostRestartStateMachine+0x26/0x2e [rt73]
[<f89c5ed3>] MlmeHandler+0x5c/0x113 [rt73]
[<f89c17ca>] MlmeThread+0x68/0xab [rt73]
[<f89c1762>] MlmeThread+0x0/0xab [rt73]
[<c0405b6b>] kernel_thread_helper+0x7/0x10
=======================

I've tried with recent CVS builds as well, and seem to have the same problem.

My card and version

Bus 003 Device 003 ID 148f2573 Ralink Technology, Corp.
rt73 driver version - 1.0.3.6 CVS

Kernel

Linux vid 2.6.22.14-72.fc6 #1 SMP Wed Nov 21 151259 EST 2007 i686 i686 i386 GNU/Linux


I can't use networking unless I remove the USB stick or reboot.


Should I recompile with debug and post that output?

Thanks!

Vern

06-04-2008 21:15:49

Should I recompile with debug and post that output?[/quote3ezqib8z]If you could do that, it would help.

Thanks,

vert

06-04-2008 22:37:41

Here's the debug output.

Note that the issue happened at the following timestamps

181524
182454

In /var/log/messages it was this error

Apr 6 182454 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752

The debug shows

Apr 6 182441 vid kernel rt73 rt73_get_ether_stats --->
Apr 6 182453 vid last message repeated 14 times

each time before the crash. ?

modinfo

filename rt73.ko
license GPL
description Ralink RT73 802.11abg WLAN Driver 1.0.3.6 CVS 2008040214
author http//rt2x00.serialmonkey.com
srcversion A5D87B62C040BD45DB167FA
alias usbv1740p7100d*dc*dsc*dp*ic*isc*ip*
alias usbv2019pAB50d*dc*dsc*dp*ic*isc*ip*
alias usbv2019pAB01d*dc*dsc*dp*ic*isc*ip*
.
.
.
depends
vermagic 2.6.22.14-72.fc6 SMP mod_unload 686 4KSTACKS
parm debugDebug mask n selects filter, 0 for none (int)
parm ifnameNetwork device name (default wlan%d) (charp)
parm firmNamePermit to load a different firmware (default rt73.bin) (charp)

Vern

07-04-2008 00:09:32

Hi vert,

Thanks for the log. Would it be possible to egrep out the time period 181524 - 182424 from /var/log/kern.log and attach a gzipped copy of that segment to a posting here?

Thanks,

vert

07-04-2008 01:53:55

Hi. Let me know if this was what you wanted. Thanks!

vert

08-04-2008 21:46:15

Was that file the one that was requested? I'm on Fedora, so I don't have a kernel.log...

Thanks!

Vern

09-04-2008 01:49:24

Was that file the one that was requested?[/quote2bx3egbb]Close enough. Often /var/log/kern.log is set up to catch all kernel messages; so I'd been hoping to catch the BUG message surrounded by the driver debug messages.

Regardless, it looks like I overlooked some housekeeping on close, which left a little hole on a subsequent reopen. I think the attached patch - atom1.patch.gz - fixes that. It doesn't break anything in my environment. Could you see if it cures the symptoms in yours?

Thanks,

vert

09-04-2008 02:47:17

The patch applied cleanly and the module was rebuilt okay. So far there have not been any crashes, but I'll continue to monitor and let you know.

Thanks!

vert

09-04-2008 16:29:06

Unfortunately it crashed again the same way.

I'll add debugging back and try to get the crash logs.

vert

09-04-2008 19:08:03

Here's the BUG message. I've also attached the debug and messages portions.

Let me know if there's anything else I can do to help.

...oh, and it seems like it might be triggered by the network interface coming up and down, although I haven't been able to reliably reproduce this...

Apr 9 144407 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752
Apr 9 144407 vid openvpn[2169] SIGUSR1[soft,ping-restart] received, process restarting
Apr 9 144407 vid kernel in_atomic()0, irqs_disabled()1
Apr 9 144407 vid kernel [<c062e5f0>] wait_for_completion_timeout+0x1c/0xae
Apr 9 144407 vid kernel [<c0592c5e>] usb_start_wait_urb+0x52/0x87
Apr 9 144407 vid kernel [<c0592e6f>] usb_control_msg+0xb9/0xd4
Apr 9 144407 vid kernel [<f89e5088>] RTUSB_VendorRequest+0x107/0x17a [rt73]
Apr 9 144407 vid kernel [<f89e52c6>] RTUSBReadMACRegister+0x31/0x35 [rt73]
Apr 9 144407 vid kernel [<f89e5362>] RTUSBWriteBBPRegister+0x1f/0xa9 [rt73]
Apr 9 144407 vid kernel [<f89de002>] AsicSwitchChannel+0x30a/0x420 [rt73]
Apr 9 144407 vid kernel [<c042983a>] printk+0x1f/0x92
Apr 9 144407 vid kernel [<f89de552>] MlmePostRestartStateMachine+0x3e/0x47 [rt73]
Apr 9 144407 vid kernel [<f89e0849>] MlmeHandler+0x79/0x15a [rt73]
Apr 9 144407 vid kernel [<f89daf00>] MlmeThread+0x8f/0xe7 [rt73]
Apr 9 144407 vid kernel [<f89dae71>] MlmeThread+0x0/0xe7 [rt73]
Apr 9 144407 vid kernel [<c0405b6b>] kernel_thread_helper+0x7/0x10
Apr 9 144407 vid kernel =======================

Vern

10-04-2008 17:10:50

Hi vert,

Thanks for the debug info. It looks like the first patch did result in moving the symptom timing around a bit.Unfortunately it crashed again the same way.[/quote2m3zeyna]Tiny comment here. Crash is when the keyboard lights blink, nothing happens when you type, and you're really thankful for the magic SysRq. What we're dealing with here is a kernel warning message.

Anyway, examining the execution path, I've modified the patch - attached as atom2.patch.gz - to sleep instead of doing program delay loops for 1/10th to 1/2 second. That *may* give the kernel a more favorable opinion of our execution context.

Could you try it and see what happens?

Thanks,

vert

10-04-2008 18:50:50

The kernel warning happened again with the new patch applied.

I recompiled with debugging and am trying to get debug info now.

One other thing I noticed is this warning message on iwlist. Coudl that be a clue?

# iwlist wlan0 scan

Warning Driver for device wlan0 has been compiled with version 22
of Wireless Extension, while this program supports up to version 20.
Some things may be broken...

wlan0 No scan results

Starcrasher

11-04-2008 12:30:28

To avoid this message, you should update wireless tools to version 29.
If you don't find some packages for your distribution, the tarball is available here[/url1tx59te7].

Vern

11-04-2008 20:00:35

I recompiled with debugging and am trying to get debug info now.[/quote263454ue]Hope you haven't gotten too far into that. I think I've found the problem. Pretty obvious, too (looking back in retrospect).

Could you apply the attached atom3.patch.gz to the latest CVS and see if indeed we do have a fix?

Thanks,

vert

11-04-2008 23:53:40

I applied that patch to the latest CVS, and so far so good! Thanks!

I'll let it go a little bit more, and if it looks okay will deploy to 10 more computers with similar hardware setups.

vert

12-04-2008 20:39:39

It looks like that last patch worked! It hasn't crashed in over 12 hours. Thanks!

I'll deploy to more machines and let you know if there are any issues. Thanks for your help.

Will this fix make it into cvs soon?

Vern

13-04-2008 17:30:27

Hi vert,

Glad to see you're on the air. You should see the patch start to show up in the hourly CVS tarballs Real Soon Now. Thanks for your efforts.

BTW, your logs show there's evidently a *lot* of roaming going on. What are you doing? Deploy on another 10 machines? It all sounds like a very interesting environment. Anything you can share with us?

Thanks again,

vert

14-04-2008 16:30:51

It is an interesting environment--they are used on taxicabs roaming around New York City...so the number of access points available at any given moment is huge. So such an environment has proven be useful for stress testing in a real-world environment.

I have preliminary results from the last upgrade. Statistically, the units on which the last patch was deployed have higher uptimes than those that do not/the same units before the upgrade. So I think it's definitely better...and I haven't been able to "catch" one of them misbehaving.

But, I do still have a kernel BUG in /var/log/messages. Nothing in dmesg so far that I've found.

As an example, this is what it looked like before the last upgrade

Apr 12 114716 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752
Apr 12 114749 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752
Apr 12 114805 vid kernel BUG sleeping function called from invalid context at kernel/sched.c3752

And now we have

Apr 12 130351 vid kernel kernel BUG at include/linux/timer.h153!
Apr 12 203909 vid kernel kernel BUG at include/linux/timer.h153!
Apr 13 154232 vid kernel kernel BUG at include/linux/timer.h153!
Apr 14 095730 vid kernel kernel BUG at include/linux/timer.h153!

Maybe it's a non-fatal flaw now?

Let me know if you'd like debug information on this.

Vern

14-04-2008 20:34:44

Hey vert,

Thanks for the info on your environment. Its probably the stressiest stress test the driver has gotten.

If you can work it in, can you arrange for /var/log/messages to hold debug messages, too, then run with debug enabled? That may help me get a better idea where in the processing the "kernel BUG at include ..." messages are being caused. If you can do that, no need to include /var/log/debug. Otherwise, I'll need it.

Thanks,

vert

15-04-2008 19:02:37

Here's the debug and messages files. I wasn't able to put them in the same file.

Let me know if I missed something.

Vern

18-04-2008 03:43:53

Hi vert,

The debug info was good. Thanks.

I'm not too sure I've thought about it enough, but I think the attached patch may be a fix for the "BUG at timer" message. Could you try it and see what happens?

Thanks,

vert

21-04-2008 16:06:58

I think that might have done it. It ran for several days without going down/having a BUG report.

I'll push it out to more machines and see if that fixed it!

Thanks for your help!

vert

23-04-2008 14:34:32

Indeed, that looks like it's stable. No new BUG messages, and the interface seems to stay up all the time now. I'd say it's ready for a commit.

Thanks again.

Vern

23-04-2008 18:08:17

Hi vert,

Glad to see things are working. The change has been submitted to CVS, and you should see it show up in the hourly tarballs soon.

Thanks for the problem discovery, and all the testing you've done.