Kernel Ooops with rt73 legacy driver

Live forum: http://rt2x00.serialmonkey.com/viewtopic.php?t=4851

laforge

14-06-2008 02:59:10

Good morning all,
i have a big problem with the module rt73 legacy.
I use this module for a dlink G-122 and a Linksys WUSB54GR-EU dongle.
With some traffic with the legacy driver i get an ooops with my kernel.
Sometimes the problem came with a simple dhclient wlan2 or after the start of simple download.

dubhe-thII kernel Oops 0000 [#1] PREEMPT SMP
dubhe-thII kernel Process rt73Mlme (pid 6204, ti=f52c4000 task=f5a7d840 task.ti=f52c4000)
dubhe-thII kernel Stack f8ffdb08 ef2c0000 00000292 f9001be0 ef2c2624 ef2c0000 00000000 00000000
dubhe-thII kernel 00000000 f8ffd313 f8ffd2bc c040550b ef2c0000 00000000 00000000 00000000
dubhe-thII kernel 7363616d 6f6f742a
dubhe-thII kernel Call Trace
dubhe-thII kernel [<f8ffdb08>] StateMachinePerformAction+0x19/0x1c [rt73]
dubhe-thII kernel [<f9001be0>] MlmeHandler+0xc8/0x114 [rt73]
dubhe-thII kernel [<f8ffd313>] MlmeThread+0x57/0x86 [rt73]
dubhe-thII kernel [<f8ffd2bc>] MlmeThread+0x0/0x86 [rt73]
dubhe-thII kernel [<c040550b>] kernel_thread_helper+0x7/0x10
dubhe-thII kernel =======================
dubhe-thII kernel Code Bad EIP value.
dubhe-thII kernel EIP [<00007f7f>] 0x7f7f SSESP 0068f52c5fb8
Jun 14 044622 dubhe-thII kernel BUG unable to handle kernel paging request at 00007f7f
Jun 14 044622 dubhe-thII kernel IP [<00007f7f>]
Jun 14 044622 dubhe-thII kernel *pde = 00000000
Jun 14 044622 dubhe-thII kernel Oops 0000 [#1] PREEMPT SMP
Jun 14 044622 dubhe-thII kernel Modules linked in rt73 ppp_deflate zlib_deflate ppp_async crc_ccitt ppp_generic slhc rfcomm l2cap autofs4 vmnet(P) parport_pc parport vmblock vmmon(P) cpufreq_ondemand acpi_cpufreq freq_table sha256_generic aes_i586 aes_generic cbc dm_crypt ipv6 reiserfs xfs dm_multipath loop sbp2 firewire_sbp2 snd_hda_intel ohci1394 ieee1394 snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device uvcvideo rfkill snd_pcm_oss snd_mixer_oss rtc_cmos arc4 snd_pcm compat_ioctl32 sdhci ecb rtc_core firewire_ohci videodev i2c_i801 iTCO_wdt crypto_blkcipher mmc_core pcspkr serio_raw input_polldev firewire_core rtc_lib iTCO_vendor_support ricoh_mmc crc_itu_t snd_timer psmouse hci_usb snd_page_alloc v4l1_compat iwl3945 snd_hwdep bluetooth snd mac80211 usb_storage soundcore usblp cfg80211 video tg3 nvidia(P) output battery button ac i2c_core sr_mod cdrom joydev sg acer_wmi wmi ata_piix dm_snapshot dm_zero dm_mirror dm_mod ahci libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded rt2x00lib]
Jun 14 044622 dubhe-thII kernel
Jun 14 044622 dubhe-thII kernel Pid 6204, comm rt73Mlme Tainted P (2.6.25.4-UFP-DEVEL-ALPHA-1.0 #1)
Jun 14 044622 dubhe-thII kernel EIP 0060[<00007f7f>] EFLAGS 00010202 CPU 1
Jun 14 044622 dubhe-thII kernel EIP is at 0x7f7f
Jun 14 044622 dubhe-thII kernel EAX ef2c0000 EBX 00000029 ECX ef2c2624 EDX ef2c2624
Jun 14 044622 dubhe-thII dhclient DHCPDISCOVER on wlan2 to 255.255.255.255 port 67 interval 11
Jun 14 044622 dubhe-thII kernel ESI ef2c050c EDI ef2c05b0 EBP ef2c05a4 ESP f52c5fb8
Jun 14 044622 dubhe-thII kernel DS 007b ES 007b FS 00d8 GS 0000 SS 0068
Jun 14 044622 dubhe-thII kernel Process rt73Mlme (pid 6204, ti=f52c4000 task=f5a7d840 task.ti=f52c4000)
Jun 14 044622 dubhe-thII kernel Stack f8ffdb08 ef2c0000 00000292 f9001be0 ef2c2624 ef2c0000 00000000 00000000
Jun 14 044622 dubhe-thII kernel 00000000 f8ffd313 f8ffd2bc c040550b ef2c0000 00000000 00000000 00000000
Jun 14 044622 dubhe-thII kernel 7363616d 6f6f742a
Jun 14 044622 dubhe-thII kernel Call Trace
Jun 14 044622 dubhe-thII kernel [<f8ffdb08>] StateMachinePerformAction+0x19/0x1c [rt73]
Jun 14 044622 dubhe-thII kernel [<f9001be0>] MlmeHandler+0xc8/0x114 [rt73]
Jun 14 044622 dubhe-thII kernel [<f8ffd313>] MlmeThread+0x57/0x86 [rt73]
Jun 14 044622 dubhe-thII kernel [<f8ffd2bc>] MlmeThread+0x0/0x86 [rt73]
Jun 14 044622 dubhe-thII kernel [<c040550b>] kernel_thread_helper+0x7/0x10
Jun 14 044622 dubhe-thII kernel =======================
Jun 14 044622 dubhe-thII kernel Code Bad EIP value.
Jun 14 044622 dubhe-thII kernel EIP [<00007f7f>] 0x7f7f SSESP 0068f52c5fb8
Jun 14 044622 dubhe-thII kernel ---[ end trace dbef1ec95bdd0da5 ]---



My configuration is this
Fedora 8 up to date

Kernel vanilla 2.6.25.4

Hostdubhe-thII.mobile.int.ipv7.org - 0436-06/14/08 - Path~
[root21]$modinfo rt73
filename /lib/modules/2.6.25.4-UFP-DEVEL-ALPHA-1.0/extra/rt73.ko
license GPL
description Ralink RT73 802.11abg WLAN Driver 1.0.3.6 CVS 2008061321
author http//rt2x00.serialmonkey.com


Hostdubhe-thII.mobile.int.ipv7.org - 0436-06/14/08 - Path~
[root27]$lsusb -d 13b10023 -v

Bus 002 Device 014 ID 13b10023 Linksys WUSB54GR
Device Descriptor
bLength 18
bDescriptorType 1
bcdUSB 2.00
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 64
idVendor 0x13b1 Linksys
idProduct 0x0023 WUSB54GR
bcdDevice 0.01
iManufacturer 1 Cisco-Linksys
iProduct 2 Wireless-G USB Network Adapter
iSerial 0
bNumConfigurations 1
Configuration Descriptor
bLength 9
bDescriptorType 2
wTotalLength 32
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0x80
MaxPower 300mA
Interface Descriptor
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 2
bInterfaceClass 255 Vendor Specific Class
bInterfaceSubClass 255 Vendor Specific Subclass
bInterfaceProtocol 255 Vendor Specific Protocol
iInterface 0
Endpoint Descriptor
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Endpoint Descriptor
bLength 7
bDescriptorType 5
bEndpointAddress 0x01 EP 1 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0200 1x 512 bytes
bInterval 0
Device Qualifier (for other device speed)
bLength 10
bDescriptorType 6
bcdUSB 2.00
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 64
bNumConfigurations 1

Hostdubhe-thII.mobile.int.ipv7.org - 0436-06/14/08 - Path~
[root28]$uname -a
Linux dubhe-thII.mobile.int.ipv7.org 2.6.25.4-UFP-DEVEL-ALPHA-1.0 #1 SMP PREEMPT Mon May 19 125254 CEST 2008 i686 i686 i386 GNU/Linux

Vern

14-06-2008 17:45:05

Hi laforge,

Could you build the driver with debug enabled and run with debug=15 until the oops? Then please grep out the content of /var/log/kern.log for the interval covering modprobe up through the oops, gzip it, and attach it to a posting in this thread. I'll take a look at it. A gzipped copy of /var/log/debug would also help.

Thanks,

laforge

14-06-2008 22:38:06

Hi laforge,

Could you build the driver with debug enabled and run with debug=15 until the oops? Then please grep out the content of /var/log/kern.log for the interval covering modprobe up through the oops, gzip it, and attach it to a posting in this thread. I'll take a look at it. A gzipped copy of /var/log/debug would also help.

Thanks,[/quotes0xyxf4p]

Ok in attachment you can find the expert of the messages related the module rt73, and the log file with the debug of the module.
Additional information about the situation, the AP have no encryption (wlan0 is a wifi card in the notebook, intel 3945ABG)
Hostdubhe-thII.mobile.int.ipv7.org - 0017-06/15/08 - Path/var/log
[root5]$iwlist wlan0 scan
wlan0 Scan completed
Cell 01 - Address 0004ED6ADD1E
ESSID"A02-RA141-W54"
ModeMaster
Channel6
Frequency2.437 GHz (Channel 6)
Quality=54/100 Signal level=-76 dBm Noise level=-127 dBm
Encryption keyoff
Bit Rates1 Mb/s; 2 Mb/s; 5.5 Mb/s; 11 Mb/s; 6 Mb/s
9 Mb/s; 12 Mb/s; 18 Mb/s; 24 Mb/s; 36 Mb/s
48 Mb/s; 54 Mb/s
Extratsf=0000000154eb90dc

I think that this is the the problem, when this messages is log, the default gateway become unreachable and no traffic from wlan1.
After sometime boot, kernel ooops.

Jun 15 001126 dubhe-thII kernel rt73 ==> MlmePeriodicExec
Jun 15 001126 dubhe-thII kernel rt73 - MlmePeriodicExec call STAExec
Jun 15 001126 dubhe-thII kernel rt73 MMCHK - PortCfg.Ssid=A02-RA141-W54 ... MlmeAux.Ssid=A02-RA141-W54
Jun 15 001126 dubhe-thII kernel rt73 === RTUSB_VendorRequest Non-recoverable err = -71
Jun 15 001126 dubhe-thII kernel rt73 device disconnected
Jun 15 001126 dubhe-thII kernel rt73 device disconnected
Jun 15 001126 dubhe-thII kernel rt73 device disconnected
Jun 15 001126 dubhe-thII kernel rt73 device disconnected
Jun 15 001126 dubhe-thII kernel rt73 device disconnected
Jun 15 001126 dubhe-thII kernel rt73 3 strong RSSI=-58, CCA=63765, fixed R17 at 0x40
Jun 15 001126 dubhe-thII kernel rt73 MMCHK - CQI= 34 (Tx Fail=49217/Retry=125544/Total=176326, Rx Fail=1565/Total=1824, RSSI=-58 dbm)
Jun 15 001126 dubhe-thII kernel rt73 device disconnected
Jun 15 001126 dubhe-thII kernel rt73 48 NDIS push BE=123, BK=0, VI=0, VO=0, TX/RX AGGR=<0,0>, p-NDIS=0, RSSI=-58, ACKbmap=15f, PER=99%
Jun 15 001126 dubhe-thII kernel rt73 DRS --TX rate from 48 to 36 Mbps
Jun 15 001126 dubhe-thII kernel rt73 AsicAdjustTxPower = 17, AvgRssi = -58
Jun 15 001126 dubhe-thII kernel rt73 device disconnected
Jun 15 001126 dubhe-thII kernel rt73 Retry count exhausted or device removed!!!
Jun 15 001126 dubhe-thII kernel rt73 STAMlmePeriodicExec - LastRssi=63, BbpRssiToDbmDelta=121
Jun 15 001126 dubhe-thII kernel rt73 <== MlmePeriodicExec

The message tell that the device is disconnected, why?

I have a log file from the plug of the dongle to another kernel oops with debug level set to 31, is it useful to you? It is a 3MB compressed file with 7zip.

Vern

15-06-2008 18:00:29

Hi laforge,

Thanks for the ole' college try. Unfortunately, your copy of /var/log/messages is missing the stuff that follows the Oops line. I had actually asked for /var/log/kern.log. Is your /etc/syslog.conf set up OK? There is some info about this in the TESTING file.

Thanks,

laforge

15-06-2008 22:06:44

Hi laforge,

Thanks for the ole' college try. Unfortunately, your copy of /var/log/messages is missing the stuff that follows the Oops line. I had actually asked for /var/log/kern.log. Is your /etc/syslog.conf set up OK? There is some info about this in the TESTING file.

Thanks,[/quote2tlmsjca]


Strange...i use syslog-ng and i put a simple match for rt73 messages...
Now i change the configuration, is this
filter f_debug { facility(kern) and level(debug); }
destination d_rt73 { file("/var/log/rt73-debug.log" owner(laforge) perm(0400) );};
log { source(s_sys); filter(f_messages); destination(d_messages); };

I replicate the ooops (it's simple...) and attach the log file here. Messages and the debug file.
Here there are all informations in the log from the insmod to kernel ooops an the usb remove.

During some test to generate the ooops i verified some things

- The start of the problem is a wifi network lost, and after this (if there is the Network Manager) the ooops will come.
- Without any type of network or wifi manager, after the wifi network lost, a simple iwlist wlan1 scan generate the ooops.
- Without the vmware layer is more difficult to get the ooops after the network lost.
- After an ooops the system remain up but stable (i think...). Is it possible to remove the driver module, the usb dongle, and re-insert it to restart the wifi network connectivity.
- With a low rate (i see near 200KB/s during a transfert) is more difficult to reach the situation that generate a network lost and then the kernel ooops



PS The file rt73-debug.log.gz is not a gzip compressed file but a bzip2. The blog not allow bz2 extension....I used the bzip2 because the debug file is very large .

Vern

15-06-2008 23:56:46

Hi Geordie,

Thanks for your efforts.
[quote1a0opo88]Strange...i use syslog-ng and i put a simple match for rt73 messages...[/quote1a0opo88]Its best to grep on the timestamp - e.g. [code1a0opo88]egrep 'Jun 15 2[0-3]' /var/log/messages|gzip >messages.gz[/code1a0opo88]or whatever the modprobe to Oops interval actually occupies. The example extracts everything in the four minutes of 20 o'clock to 23 o'clock.
We also still have other problems. Your messages file is substantially hosed e.g.[code1a0opo88]Jun 15 23:00:53 dubhe-thII kernel: rt73: <---RTUSBRxPacket
Jun 15 21:00:55 dubhe-thII kernel: unparseable log message: "<7---RTUSBKickBulkOut"[/code1a0opo88]Time is supposed to go forward. I don't know from syslog-ng, so you're pretty much on your own there. The result we need is that all debug messages be routed to the same file that the Oops message goes to. Also, we're missing part of the Oops info that provides the code in the neighborhood of the failing IP.

If you don't have /var/log/kern.log, then /var/log messages will do. What I'm trying to do is establish context; the trace info that is available simply shows the dispatcher (MlmeThread &c), which is going to run whenever there's anything to do - not helpful.

Also, if you're running a VM, its probably best to not use it while working this problem.

... but hey, no pressure ...

Thanks,

laforge

16-06-2008 00:23:41

Hi Geordie,

Thanks for your efforts.
Its best to grep on the timestamp - e.g. [code2aje0ywl]egrep 'Jun 15 2[0-3]' /var/log/messages|gzip >messages.gz[/code2aje0ywl]or whatever the modprobe to Oops interval actually occupies. The example extracts everything in the four minutes of 20 o'clock to 23 o'clock.[/quote2aje0ywl]

I open the messages log file and i copy all the line from the manual insmod (for debug option) and the last line after the ooops.
No other line was removed.


We also still have other problems. Your messages file is substantially hosed e.g.
[code2aje0ywl]Jun 15 23:00:53 dubhe-thII kernel: rt73: <---RTUSBRxPacket
Jun 15 21:00:55 dubhe-thII kernel: unparseable log message: "<7---RTUSBKickBulkOut"[/code2aje0ywl]

Time is supposed to go forward. I don't know from syslog-ng, so you're pretty much on your own there. The result we need is that all debug messages be routed to the same file that the Oops message goes to.
Also, we're missing part of the Oops info that provides the code in the neighborhood of the failing IP.[/quote2aje0ywl]

The messages don't contain debug information, you can find all kernel-debug in the file rt73-debug.
I have only this line about the kernel ooops in the messages.


If you don't have /var/log/kern.log, then /var/log messages will do. What I'm trying to do is establish context; the trace info that is available simply shows the dispatcher (MlmeThread &c), which is going to run whenever there's anything to do - not helpful.
[/quote2aje0ywl]

You need all the kernel messages in one file? I can change the configuration of syslog-ng and put all message with facility kern in the rt73-debug...let me try this configuration to see if there are some lost messages.


Also, if you're running a VM, its probably best to not use it while working this problem.
[/quote2aje0ywl]

No, is the VmWare Workstation layer (vmmon, vmnet,vmblock) that i use to start some virtual os.
The problem isn't on a guest linux but on the main host.
I see only if i put down the vmware service (module, dhcp server) to get the oops is required a lot of time.


... but hey, no pressure ...
Thanks,[/quote2aje0ywl]

If i can give you the best information, so you can give us a better driver; for me is a pleasure. ;)

laforge

22-06-2008 14:31:04

Hello,
now i put all kernel messages and all debug message in one file.
In attach you can find the log from the insmod to the rmmod.
I attach the messages file during the ooops.
At the 160801 you can find the oops.
Jun 22 160801 dubhe-thII kernel BUG unable to handle kernel paging request at 00009898

On screen

dubhe-thII kernel Oops 0000 [#1] PREEMPT SMP
dubhe-thII kernel Process rt73Mlme (pid 398, ti=f5010000 task=f5ce1230 task.ti=f5010000)
dubhe-thII kernel Stack f908d004 e69e0000 00000292 f909214c e69e5ee8 e69e0000 00000000 00000000
dubhe-thII kernel 00000000 f908c871 f908c800 c0405513 e69e0000 00000000 00000000 00000000
dubhe-thII kernel 00000000 00000000
dubhe-thII kernel Call Trace
dubhe-thII kernel [<f908d004>] StateMachinePerformAction+0x19/0x1c [rt73]
dubhe-thII kernel [<f909214c>] MlmeHandler+0xbb/0x144 [rt73]
dubhe-thII kernel [<f908c871>] MlmeThread+0x71/0xb4 [rt73]
dubhe-thII kernel [<f908c800>] MlmeThread+0x0/0xb4 [rt73]
dubhe-thII kernel [<c0405513>] kernel_thread_helper+0x7/0x10
dubhe-thII kernel =======================
dubhe-thII kernel Code Bad EIP value.
dubhe-thII kernel EIP [<00009898>] 0x9898 SSESP 0068f5011fb8

Vern

22-06-2008 18:52:12

Hi laforge,

Thanks for the log info. I'll see if I can make anything of it.

The basic problem seems to be that the driver thinks the USB adapter is unplugged. That happens, but it is a bug to cause an Oops because of it.

In the meantime, posts in rt73usb - test results and questions[/urlm31obg57] indicate cable length may be a problem. If you are using a fairly long cable, what happens if you use a shorter one?

Thanks,

laforge

22-06-2008 22:53:41

Hi laforge,

Thanks for the log info. I'll see if I can make anything of it.

The basic problem seems to be that the driver thinks the USB adapter is unplugged. That happens, but it is a bug to cause an Oops because of it.
[/quotehuxtwghs]

Ok, but the usb layer in the linux kernel not think this.
No message about the loosing of the usb device.


In the meantime, posts in rt73usb - test results and questions[/urlhuxtwghs] indicate cable length may be a problem. If you are using a fairly long cable, what happens if you use a shorter one?

Thanks,[/quotehuxtwghs]

Ok, i start some test with a short usb cable.

laforge

24-06-2008 11:57:27

Ok, i think that your idea about the cable is right.
I use a 3mt cable with the dongle and the dlink small cable with the base. With this configuration i have the disconection, but
if i use only the 3mt cable, no problem, no "device disconection"...
After this "cut" i can transfer for an hour up to 400KB without any problem.

The question is, the standard usb 2.0 define a max length of a cable up to 5 mt, i use one cable of 3mt and the extension
with the base (half meter), and there is a problem with the dongle, why? Poor quality dongle?

Vern

24-06-2008 18:02:13

Hi Geordie,

Glad to see the workaround works. As to why you have the problem I think it's basically insufficient power; but whether that's poor design in the dongle, or whether the host PC is marginal, I don't know.

In the meantime, when you use the long cable, your system does uncover a bug in the driver's exception handling. When I develop a fix, could you test it here if I upload a patch to a posting on this thread? It won't keep the driver from thinking the device is disconnected, because that's a hardware problem, but it will keep an Oops from happening. That will save everyone who may have the same problem a lot of grief.

Thanks,

laforge

24-06-2008 18:40:13

Hi Geordie,

Glad to see the workaround works. As to why you have the problem I think it's basically insufficient power; but whether that's poor design in the dongle, or whether the host PC is marginal, I don't know.
[/quote3mp2zo8j]

I think that usb standard have a lot of "words" without a real and valid implementation.... ;)


In the meantime, when you use the long cable, your system does uncover a bug in the driver's exception handling. When I develop a fix, could you test it here if I upload a patch to a posting on this thread? It won't keep the driver from thinking the device is disconnected, because that's a hardware problem, but it will keep an Oops from happening. That will save everyone who may have the same problem a lot of grief.
Thanks,[/quote3mp2zo8j]

I can test your patch when you want.
Now i have a RT73 802.11abg WLAN Driver 1.0.3.6 CVS 2008062209, need you the latest cvs source ?

Starcrasher

25-06-2008 23:53:08

Hi,

I remember a time where VIA chipsets (and probably others) did not deliver enough power to mainboard USB ports and things plugged were sometimes working, sometimes not depending how much power they need.
At this moment, a lot of people bought an additional card to manage USB and solve the problem.
Geordi, I don't think this can solve your problem because it already seems to work with a quite long cable. But, if you already have such card somewhere you can give it a try.

laforge

26-06-2008 01:59:35

Hi,

I remember a time where VIA chipsets (and probably others) did not deliver enough power to mainboard USB ports and things plugged were sometimes working, sometimes not depending how much power they need.
At this moment, a lot of people bought an additional card to manage USB and solve the problem.
Geordi, I don't think this can solve your problem because it already seems to work with a quite long cable. But, if you already have such card somewhere you can give it a try.[/quote2hekozsd]

I have an Acer notebook, the problem there is when is attached directly on the notebook or by an usb powered hub.
I define the problem, if the cable is longer up to 3 mt the dongle works very well, with another extension, within the 5 mt, the dongle have the problem.

Vern

02-07-2008 03:40:24

Lt.Cmdr. Geordi LaForge

Could you hook up your long cable and try the attached patch? It won't keep the driver from thinking the device is disconnected, but it *should* keep the driver from causing a kernel Oops when that happens.

If it turns out to actually work, please let me know.

If you still do get an Oops, please compile and run with debug enabled, then attach gzipped copies of /var/log/debug and /var/log/kern.log to a posting here,

Thanks, and sorry for the delay,

Vern

20-07-2008 16:49:09

Looks like Lt. Cmdr. LaForge has buggered out on me.

This thing has been up for three weeks, and downloaded once. Accordingly, I'm treating it just like it was working and have put it into CVS. Ya'll had yer chance.

But if you do get an Oops on error -71, post here with a gzipped copy of /var/log/kern.log attached - if possible.

Thanks,