rt73 goes bonkers during reassociation; CSR16 corrupt?

Live forum: http://rt2x00.serialmonkey.com/viewtopic.php?t=4364

Spoon

17-10-2007 18:33:08

I don't really know what's going on here, but I guess I'll post this anyways.

While running wpa_supplicant and going in and out of the AP's range for some time (lots of deassociation, reassociations), my rt73 will start going REALLY bonkers -- USB traffic is high and I'm seeing what look like corrupted frames and bogus RX descriptors come in. Data still works, but is a good 100ms slower for pings.

Then I made a registry dump (attached) and it's obvious what the problem is -- look at CSR16. This is [i2g6pef9r]not[/i2g6pef9r] a fluke thing, I've seen this happen many times now. It goes from 0x027eb162 to 0x00003000 after reassociating sometimes.

This problem also has an effect of clobbering the avg_rssi and link tuner because you get frames with an RSSI of 0 and other strange things.

Manually resetting CSR16 via debugfs makes everything sane again, without needing to even reassociate.

I'm not sure if the driver is messing something up, or if the device is resetting itself somehow.

EDIT actually I can't seem to attach anything at the moment (??)

lfcorreia

17-10-2007 19:06:38

Spoon

the attachment quota was reached again... it's now fixed, thanks for the warning

IvD

17-10-2007 21:04:28

CSR16? You mean TXRX_CSR0 (the register after CSR15) or a different register?

Spoon

17-10-2007 21:21:53

Yeah that's the one I mean )
I attached the register diff. Still don't know what to think about it. But I can get it to occur pretty easily if given 15 minutes. Could be more PPC weirdness AFAIK though.

I'm going to attempt to catch via my USB sniffer if the driver ever pushes out this bogus TXRX_CSR0 to the device. That's a painful process though.

IvD

17-10-2007 21:30:17

After a quick glance over the register it almost looks like it is shutting down the radio. Because the TSF registers are also cleared and TXRX_CSR9 also indicates the TSF sync has been disabled.

What does the dev_flags debugfs field indicate?

Spoon

17-10-2007 22:00:23

I confirmed that these bad values are going out over USB so the driver must be sending them. But on this particular run I actually got 0x00010000 instead.

What I would expect
[quote3syar095][ 92796] 205.035.214.437 / DATA0
Duration = 7840 ns
Data start = 560 ns
Length = 11 bytes

0000 C3 40 06[/color3syar095] 00 00 40 30[/color3syar095] 04
0008 00 CA CB


[ 92809] 205.035.294.708 / DATA1
Duration = 5340 ns
Data start = 560 ns
Length = 7 bytes

0000 4B 62 B1 7E 02[/color3syar095] 10 25
[/quote3syar095]

Something happened just 8ms later that caused crap to be written to this register. This was during (re)association
[quote3syar095][ 92895] 205.043.359.854 / DATA0
Duration = 7820 ns
Data start = 560 ns
Length = 11 bytes

0000 C3 40 06[/color3syar095] 00 00 40 30[/color3syar095] 04
0008 00 CA CB


[ 92908] 205.043.439.520 / DATA1
Duration = 5360 ns
Data start = 560 ns
Length = 7 bytes

0000 4B 00 00 01 00[/color3syar095] FE 4B
[/quote3syar095]

USB command[/color3syar095] (06 = multi write)
Index[/color3syar095]
Data[/color3syar095]

Since I'm running on older version (for 2.6.22 compat.), I don't see any reason for you all to spend time on this unless the problem may still exsist on the current version.
I'm guessing that maybe is_interface_present() is returning false -- that function seems to control a lot of those DROP bits as well as some TSF stuff. Probably some sort of race when wpa_supplicant is up/downing an interface or something. I really need to upgrade to the git version )

Spoon

18-10-2007 14:23:12

What I've found is that sometimes when rt73usb_toggle_rx() or rt73usb_config_duration() reads TXRX_CSR0, it reads 0x00000000. But this is wrong, I clearly can see via the USB sniffer that the 'usual' value is sent over the wire, but rt73 somehow sees 0x00000000. I'm not sure yet if it's getting lost in the rt2x00usb wrappers or if the USB layer is actually breaking somewhere.

Spoon

18-10-2007 14:55:59

Most of the time the order of operations over the USB wire when dealing with 0x3040 are
Read 0x3040
Write 0x3040

But whenever I get this problem, I see something like this
Read 0x3040
Write 0x3074 (with 0x00000000)
Write 0x3040 (with a bad value)

Can rt73usb_reset_tsf() be called while inside of rt73usb_toggle_rx() or rt73usb_config_duration()? If so, I'm thinking this could whack the CSR buffer in the driver..?

Spoon

18-10-2007 19:28:57

I believe this is the problem

ieee80211_sta_req_auth() is called
This sets IEEE80211_STA_REQ_AUTH flag
The IEEE80211_STA_REQ_AUTH is picked up by the mac80211 [i3bpwcbki]workqueue[/i3bpwcbki]
Workqueue calls ieee80211_sta_config_auth()
which calls ieee80211_sta_reset_auth()
which calls rt73_reset_tsf()

While USB will serialize all accesses to the actual CSR registers, the CSR buffer implemented in rt2x00usb will not handle this situation correctly. I'm pretty sure this is what is broken, but I admit I'm not 100% sure how workqueues are done in the kernel.

I disabled the rt73_reset_tsf() function and I don't have the problem anymore.

Thoughts?

IvD

18-10-2007 20:59:40

Doe you mean there is concurrent access to the CSR register?
This would imply we need a locking mechanism around the CSR register.. (

Spoon

18-10-2007 21:18:07

Actual hardware access to the registers is serialized due to USB. The problem is the buffer that is used by the driver. Remember the driver was changed to use scratch buffer space from kalloc (csr_cache)

What could (and apparently is) happening is that 2 usb_control_msg() calls could be outstanding at once -- one in the 'normal' rt73 code and one in the mac80211 workqueue thread. These will be sharing that same scratch buffer space so no surprise that leads to some spookyness.

I think a "simple" fix to this is to simply create a second scratch buffer that strictly rt73_reset_tsf() uses (I think this is the only rt73 function that can execute under mac80211's workqueue and touches registers but this should be double checked).
This solution would only take 4 extra bytes of space... certainly this would be perferred over locking ) I don't see any reason to over complicate things to save a few bytes.

What do you think? I will test this second buffer idea.

AdamBaker

18-10-2007 21:27:57

It does seem likely that register accesses could occur both triggered by ioctl's from user space and from the mac80211 workqueue so it seems we either need locking or at least thread local storage for the csr_cache (not so easy in kernel space). Somehow I convinced myself this couldn't happen when you first introduced the csr_cache but I can't see why now.

What would be even more complex to protect against would be if BBP or RF accesses can come from multiple threads as in those cases the register busy tests could fail if, for example 2 threads attempt a bbp write, they both read the status as not busy and then they both attempt the write operation.

This is of course where locking gets tricky as you can't use the same mutex for the register write as you use for bbp_write and rf_write unless you introduce a special version of register write for when the lock is already held. The alternative would be to have multiple mutexes but then great care is needed to avoid deadlocks.

AdamBaker

18-10-2007 21:32:04

Spoon,

At least in ad-hoc mode I believe scanning can occur from the work queue which potentially touches lots of registers, even the rf ones. I think it is also possible given a sufficiently determined user to get multiple user space threads accessing the driver simultaneously.

Spoon

19-10-2007 00:26:52

I think it is also possible given a sufficiently determined user to get multiple user space threads accessing the driver simultaneously.[/quote3nwxdnz0]Don't all ioctls from user space come through mac80211 -- shouldn't mac80211 serialize all those in its work queue?

IvD

19-10-2007 18:30:59

There are several workqueues operating, we also have the link tuner, the packet filter, and whatever mac80211 throws at us in scheduled context.
Looking at this it is almost strange that this wasn't spotted earlier because of other reports. S

Anyway about a possible solution. Having 2 caches won't work, because what would happen on quad-core systems? Even worse, what will happen during preemption? That last point alone is also a important factor here.

percpu values also won't work since the kernel API demans lvalues, which means we cannot use pointers. Creating our own version will undoubtly not be appreciated.

Another option would be removing the cache and let it be allocated on the fly in rt2x00usb_vendor_request_buff() this however will not be the fastest solution. But neither are mutexes. (

This means that we are probably going to need mutexes. And indeed we also need to lock around the BBP and RF register access. I'm not sure what the cleanest approach on doing that would be.

Perhaps looking on how other drivers handle this will help.

AdamBaker

19-10-2007 22:32:22

I haven't finished it yet but I'm writing a patch to try the locking approach. Progress so far is attached. It is written but not tested for RT73 and doesn't yet protect the BBP / RF registers for rt2500usb. There are also some compiler warnings because the lock lives in rt2x00dev and the places that need to lock used to only need a const version of that and I haven't finished tracking down all the side effects of making that non const.

Spoon

19-10-2007 23:02:54

There are several workqueues operating, we also have the link tuner, the packet filter, and whatever mac80211 throws at us in scheduled context.

...

Having 2 caches won't work, because what would happen on quad-core systems? Even worse, what will happen during preemption? That last point alone is also a important factor here.[/quote3gpgackn] I guess what I was originally thinking was something like a buffer per workqueue. In other words, functions called from the mac80211 workqueue (like rt73_reset_tsf()) always use csr_cache_mac80211, functions called from the link tuner always use csr_cache_linktuner, etc. There may need to be 3 or more csr_cache_*s, depending on how many potential workqueues there are that have the chance of touching registers.

Of course the problem with that is some functions could get called from different workqueues and how do they know to use csr_cache_mac80211 or csr_cache_linktuner. I guess this is why it would be nice to have thread-local storage.

Seeing how scans can be launched from the mac80211 workqueue, it's a more complicated problem than I originally thought -- I thought it was just the _tsf() function that created this problem and my idea looks woefully inadequate now.

IvD

20-10-2007 08:33:00

Per workqueue caches aren't a nice solution, I fear that is too sensitive to bugs. And preemption might still be a problem as well.

Adam, the patch looks good, but please call the mutex usb_cache_sem or usb_cache_mutex.
Btw I assume the
[code9pcojuzy]ERROR(rt2x00dev,"Got interrupt acquiring lock\n");[/code9pcojuzy]
is only there for debug purposes now?
When submitting the patch, could you also make sure you updated the other drivers for the non-const rt2x00dev?
I am not sure why I made it const, but it should be safe to remove it.

AdamBaker

20-10-2007 22:54:51

Per workqueue caches aren't a nice solution, I fear that is too sensitive to bugs. And preemption might still be a problem as well.[/quote3voqkgaf]

They also don't address the possibility of different sources accessing BBP registers that require multiple USB operations and would fail if they were interleaved.



Adam, the patch looks good, but please call the mutex usb_cache_sem or usb_cache_mutex.
[/quote3voqkgaf]

OK


Btw I assume the
[code3voqkgaf]ERROR(rt2x00dev,"Got interrupt acquiring lock\n");[/code3voqkgaf]
is only there for debug purposes now?
[/quote3voqkgaf]

I suspect that case could happen if someone pressed ctrl-c at the wrong time during iwconfig. If I used the uninterruptible form then if the driver fails for some reason with the lock held such an instance of iwconfig would hang irrecoverably. My current thought is that leaving a possibily inconsistent sate during an aborted iwconfig isn't too bad a solution. Another option might be to put the down_interruptible in a loop so it retries unless the radio disabled flag gets set.


When submitting the patch, could you also make sure you updated the other drivers for the non-const rt2x00dev?
I am not sure why I made it const, but it should be safe to remove it.[/quote3voqkgaf]

Yes, that was one of the reasons I considered it unfinished. You shouldn't need a reason to make something const - if the compiler doesn't object then it is probably better being const, that's useful information for optimising the caller.