driver gets no Rx interrupts for 8 - 13 seconds?
I'm using the rt2500 card/driver to communicate with several embedded devices that are downloading their boot images. I've been debugging a problem where one of the clients gets a blue screen of death during the download and then all clients time out. I noticed in the driver debug logs that when this happens, the driver doesn't process any Rx buffer interrupts for 8 to 13 *seconds*. I don't know what rev of the card/firmware I'm using, and I'm not sure how to find out. I'm also not sure how to find out what is going on in the hardware when this occurs. I haven't been successful in locating errata for the card or firmware on RaLink's website.
Does anybody have any suggestions for determining if this is a driver issue, or an ASIC issue?
I've attached a debug log of the driver output for a test run where this occurs.
Any and all suggestions would be appreciated.
Uh, you're in the queue.
Finally got around to taking a look at your log. Thanks for your patience.
It looks like you're running in monitor mode, using 11B only phy capabilities. Can you try using infrastructure or adhoc mode - depending on your environment needs - instead of monitor mode, and leave the phy setting at 11B/G mixed? Also, turn - or leave - AdhocOfdm on.
The interrupt rates do look lackadaisical, but *may* be showing the result of flow control applied elsewhere, or of congestion.
Hi Vern -
Thanks so much for looking at my log!
I think I need to be running in monitor mode. Perhaps you can disabuse me of that notion.
I'm running a boot server that serves up boot images to several embedded devices. These devices have some support for 802.11b but with max rates at 1-2 Mb/s, and they look for beacons to find a boot server. I'm not sure how to send out my own beacon content and handle the replies for images other than to use monitor mode and inject(). Can I do that with Adhoc or infrastructure modes? I also don't want this image server associating with other 802.11 APs in my network. Also, if I leave the driver in b/g mixed, will it appropriately scale down the rates to talk to my slow little devices? I've been hard setting the rate to ensure that. And, what will be the benefit of leaving AdhocOfdm on? I don't believe my devices support Ofdm encodings.
I'll give the configuration a try and see what happens.
I'm impressed - nay, stupefied - at what you've wrung out of this driver. Just to provide a little technical feedback:
Setting AdhocOfdm does not require, but simply allows, 54 Mbps operation in Adhoc mode.
When operating in adhoc mode, the driver beacons if there is no other beacon source with the same SSID on the same channel.
If the adapter is left in b/g mixed mode, it should scale its rate to conform to 11b only device capabilities. If it does not, that's a bug in the driver. However, configuring a specific rate shouldn't harm anything.
Is the client BLOD something that *started* to happen after things worked OK for a while, or has it been like this "forever"?
Could you provide some more information on your use case? Specifically, what is the procedure used on the server machine to download images? That info might help me get a better idea of what's going on.
Thanks, and congratulations on your accomplishment,
Greetings Vern -
Hmmm. Glad to hear that you are impressed with how the driver has performed.
I've been trying to scale it up to booting 15 devices and almost had it working, except for the little blue screen of death issue, which has happened "forever" as far as I can tell, but it took me a while to figure out if it was the driver or my server app. Once I could see it was the driver, I tried instrumenting the driver with debug and poking around the code a bit to see where things might be going wrong. When I saw that it wasn't handling or getting receive interrupts, though, I didn't have enough information to figure out where to go next (asic? kernel? eeeek!) or how to diagnose it.
The critical issue for me is not speed, but that when it stopped processing receive interrupts, all the clients that are in process fail the boot and one or two blue screen and require hard powercycle to get back to life. It looked to me that the driver hung somehow, as I would usually see transmit requests from my app fail to go out as well - the send() would return without error but nothing went over the air. I would really like to have things fail in a non-catastrophic manner so that it just degrades performance as the load increases.
So here's some more information about my use case for this. I have a small application that has the boot image to serve up. It starts up, connects to the driver, and starts sending regular beacons to advertise the presence of the boot server. The embedded devices power on and look for the server's beacons with appropriate boot information. They go through the AUTH/ASSOC exchange with the server, and then exchange a few small (< 50 bytes) packets to give the server identification information and exchange meta information about how many blocks to expect the boot image in. Then the server starts serving up blocks of boot image. After the AUTH/ASSOC, most of this exchange occurs over a quasi-multicast address, which, besides sending beacons, is the other reason for using monitor mode. The embedded devices know enough so that if there are several requesting the same image at the same time, they can all receive boot image packets over the multicast address simultaneously and potentially with blocks out of order. Boot image packets carry blocks of about 500 bytes at a time. The devices will generally ACK boot packets. Once the device has the full image, it sends a final goodbye and then boots itself.
I've been taking wireshark traces when this happens, and it doesn't look like anything weird goes over the air - it just looks like everything stops. My server logs show that it stops getting anything when it polls on select(), but no error information is returned. All calls to inject return successfully, but packets just don't go over the air for a while.
Thanks so much for your assistance!
Oh, and one more thing -
once I hit a dead-end with this for a bit, I tried the latest rt2500pci driver for a while. This little driver may be pokey, but it works 4-6X better than the non-legacy driver for my application. I ended up adding some crude timing measurements to my application to measure how long it takes to select(), handle an incoming packet, and reply (basically to run the polling loop I have running), and found that this driver took significantly less time to send a beacon and process a packet and reply than the new driver/non-integrated stack. I saw differences of 700 usec for this driver vs. 4 msec for the new one...
So it's doing some things right!
Greetings Vern -
I ran a bunch of scenario tests today based on your suggestions. I'm running 2.6.27-rc6-wl-smp kernel built 1/3 from the rt2x00 git sources, though I have all of the new rt2x00 stuff blacklisted (rt2500pci, rt61pci, rt2x00lib, rt2x00pci). The rt2500 driver was freshly built this morning from yesterday's cvs tarball, and I run a bash script to insmod and set all the iwconfig/iwpriv settings (monitor, rates, etc) and start up the driver before I start my app.
1) Adhoc mode doesn't work for me. I am unable to inject any packets unless I'm in monitor mode, it looks like. And, I was unable to get either driver into Infrastructure mode - I kept getting an invalid argument error on iwconfig.
2) I am able to use 11B/G mode rather than B only. It doesn't fix the problem though. I did try not setting the rate down to 2M and letting it auto-detect the rate, but my devices never see or acquire the server.
3) AdhocOfdm was set on default in the log I sent. Explicitly setting it seems to work fine, but also doesn't fix the problem. I did notice that the iwconfig display did not show a Nickname field until AdhocOfdm was explicitly set, but once it was set, it suddenly showed up. (?)
One more bit of observation: when my device whitescreens, it happens when the beacon has been found and it first tries to communicate with the server. I never see any AUTH/ASSOC packets from that device in my wireshark traces. I half think this points to preamble handling, but I'm not positive.
Is any of that useful or helpful? If not, do you have other suggestions?
>It looks like you're running in monitor mode, using 11B only phy capabilities. Can you try using infrastructure or adhoc mode - depending on your environment needs - instead of monitor mode, and leave the phy setting at 11B/G mixed? Also, turn - or leave - AdhocOfdm on.
>The interrupt rates do look lackadaisical, but *may* be showing the result of flow control applied elsewhere, or of congestion.
Greetings again rt2500 Gurus -
I was wondering why the TX ring buffer is set to 48 entries, but the RX ring buffer only has 32?
I haven't been able to find any hw docs for the chip, but is there any reason they couldn't be set equal in size?
>1) Adhoc mode doesn't work for me. I am unable to inject any packets unless I'm in monitor mode,
We observe that if you're not in monitor mode, you only supply the payload, not the whole frame.
It looks like you've basically implemented a user space AP. True? If so, here's a couple of observations regarding how inherently noisy wireless links are dealt with. If you're already doing this, never mind.
First, you should be prepared to try your side of the authentication/association frames up to four times, with retries spaced 1 to 3 tenths of a second apart.
Second, you should be prepared to receive an indeterminate number of duplicate authentication/association frames from your STA(s), and be prepared to drop back to that stage in the proceedings. I.e., if you've just sent the final authentication frame, be able to handle the case where you receive the preceding one (either seq. 1 or 3, depending on authentication scheme) and respond to that - again. The driver itself keeps doing this until either things progress or a timer it has previously set goes off.
My own observations are that over the course of an hour or so, one or more of these duplications are observed.
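For what it's worth, here is a minimal sketch of those two points in C. Everything in it is made up for illustration (the enum names, handle_mgmt_frame, send_with_retry, and the 100/200/300 ms schedule are assumptions, not anything taken from the rt2500 driver), and it assumes a simple open-system handshake with one auth request followed by one assoc request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-client handshake state for a userspace AP. */
enum sta_state { STA_IDLE, STA_AUTHENTICATED, STA_ASSOCIATED };
enum frame_kind { FRAME_AUTH_REQ, FRAME_ASSOC_REQ, FRAME_OTHER };
enum reply_kind { REPLY_NONE, REPLY_AUTH_RESP, REPLY_ASSOC_RESP };

#define MAX_TRIES 4 /* "up to four times" */

/* Wait before giving up on try `attempt` (0-based): 100, 200, 300, 300 ms. */
unsigned retry_delay_ms(unsigned attempt)
{
    unsigned ms = 100u * (attempt + 1u);
    return ms > 300u ? 300u : ms;
}

/*
 * Decide how to answer a management frame. A duplicate request is answered
 * again, and a repeated auth request drops the client back to the
 * authenticated stage even if it had already associated.
 */
enum reply_kind handle_mgmt_frame(enum sta_state *state, enum frame_kind kind)
{
    assert(state != NULL);
    switch (kind) {
    case FRAME_AUTH_REQ:
        *state = STA_AUTHENTICATED;
        return REPLY_AUTH_RESP;
    case FRAME_ASSOC_REQ:
        if (*state == STA_IDLE)
            return REPLY_NONE; /* not authenticated yet: ignore */
        *state = STA_ASSOCIATED;
        return REPLY_ASSOC_RESP;
    default:
        return REPLY_NONE;
    }
}

/*
 * Send our side of the exchange up to MAX_TRIES times. The caller supplies
 * send_and_wait, which transmits one frame and waits up to timeout_ms for
 * the peer's answer. Returns the number of tries used, or 0 on failure.
 */
unsigned send_with_retry(bool (*send_and_wait)(void *ctx, unsigned timeout_ms),
                         void *ctx)
{
    assert(send_and_wait != NULL);
    for (unsigned attempt = 0; attempt < MAX_TRIES; attempt++)
        if (send_and_wait(ctx, retry_delay_ms(attempt)))
            return attempt + 1u;
    return 0;
}
```

The point of handle_mgmt_frame is that it never treats a repeated request as an error; it just answers again and resets the stage.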
The ring sizes - I think - are just what we got from Ralink. They work, so we leave them alone.
Hi Vern -
Thanks for the suggestions. I suppose you could see it as a userspace AP, albeit a very simplified one with no encryption, very little authentication, and a simple client-driven state machine. I see it as using the infrastructure of 802.11 to support my proprietary boot protocol without having to design and code everything from scratch.
My app does have packet resending capabilities with timeouts built into it, but it's fairly crude.
Just for fun today, I changed the number of RX ring buffer entries in rtmp_def.h from 32 to 48 to match the TX ring buffer size and recompiled. It's just a #define, so it was pretty quick. I ran my tests again, and couldn't get the driver to exhibit the problem (that doesn't mean it's gone, only that I can't force it to happen in my current environment). So it looks to me like the ring sizes are somewhat configurable, may not be set to optimal values for every environment, and can give some more performance when tweaked (basically I think the driver is rate-limiting the receive side by having too small an RX ring). So the next question I have is: how big can they be?
Do you have a copy of RaLink's hardware interface documentation somewhere? I have not been able to locate one anywhere. I would assume there is a hard limit on what DMA space the card can physically address. I would also think that the maximum size would not necessarily be optimal and may require some tweaking. You have much more experience with this driver than I do. Do you have a sense of how to set the values more appropriately?
Note that the queues consume quite a lot of DMA memory, so increasing the size will seriously increase memory use (for little benefit). So far, testing with rt2x00 has shown that using a queue size of 12 does not affect the number of times the RX queue becomes full; on the other hand, the hardware might indeed limit the number of frames it receives. It might be interesting for rt2x00 to see what happens if the queue sizes are set higher again.
I just checked: the maximum number of queue entries is probably 255 (the largest value that fits in an unsigned char). I doubt this is an advisable size.
This register might be interesting
* PCICSR: PCI control register.
* BIG_ENDIAN: 1: big endian, 0: little endian.
* RX_TRESHOLD: Rx threshold in dw to start pci access
* 0: 16dw (default), 1: 8dw, 2: 4dw, 3: 32dw.
* TX_TRESHOLD: Tx threshold in dw to start pci access
* 0: 0dw (default), 1: 1dw, 2: 4dw, 3: forward.
* BURST_LENTH: Pci burst length 0: 4dw (default), 1: 8dw, 2: 16dw, 3: 32dw.
* ENABLE_CLK: Enable clk_run, pci clock can't going down to non-operational.
* READ_MULTIPLE: Enable memory read multiple.
* WRITE_INVALID: Enable memory write & invalid.
#define PCICSR 0x008c
#define PCICSR_BIG_ENDIAN FIELD32(0x00000001)
#define PCICSR_RX_TRESHOLD FIELD32(0x00000006)
#define PCICSR_TX_TRESHOLD FIELD32(0x00000018)
#define PCICSR_BURST_LENTH FIELD32(0x00000060)
#define PCICSR_ENABLE_CLK FIELD32(0x00000080)
#define PCICSR_READ_MULTIPLE FIELD32(0x00000100)
#define PCICSR_WRITE_INVALID FIELD32(0x00000200)
By default this register is initialized as
rt2x00pci_register_read(rt2x00dev, PCICSR, &reg);
rt2x00_set_field32(&reg, PCICSR_BIG_ENDIAN, 0);
rt2x00_set_field32(&reg, PCICSR_RX_TRESHOLD, 0);
rt2x00_set_field32(&reg, PCICSR_TX_TRESHOLD, 3);
rt2x00_set_field32(&reg, PCICSR_BURST_LENTH, 1);
rt2x00_set_field32(&reg, PCICSR_ENABLE_CLK, 1);
rt2x00_set_field32(&reg, PCICSR_READ_MULTIPLE, 1);
rt2x00_set_field32(&reg, PCICSR_WRITE_INVALID, 1);
rt2x00pci_register_write(rt2x00dev, PCICSR, reg);
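To make the field layout a bit more concrete, here is a standalone sketch of what that sequence does to the register bits. The masks are copied from the defines above; set_field32 and pcicsr_default are hypothetical stand-ins written for this post, not the real rt2x00 helpers, and the sketch runs on a plain integer instead of touching hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks copied from the PCICSR field definitions above. */
#define BIG_ENDIAN_MASK    0x00000001u
#define RX_THRESHOLD_MASK  0x00000006u
#define TX_THRESHOLD_MASK  0x00000018u
#define BURST_LENGTH_MASK  0x00000060u
#define ENABLE_CLK_MASK    0x00000080u
#define READ_MULTIPLE_MASK 0x00000100u
#define WRITE_INVALID_MASK 0x00000200u

/*
 * Hypothetical helper: replace the bits selected by `mask` with `value`,
 * shifted so that its lowest bit lands on the lowest set bit of the mask.
 * All other bits of `reg` are left untouched.
 */
uint32_t set_field32(uint32_t reg, uint32_t mask, uint32_t value)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (((mask >> shift) & 1u) == 0)
        shift++;
    assert((value & ~(mask >> shift)) == 0); /* value must fit the field */
    return (reg & ~mask) | (value << shift);
}

/* Apply the same field values as the initialization sequence above. */
uint32_t pcicsr_default(uint32_t reg)
{
    reg = set_field32(reg, BIG_ENDIAN_MASK, 0);
    reg = set_field32(reg, RX_THRESHOLD_MASK, 0);  /* 16dw */
    reg = set_field32(reg, TX_THRESHOLD_MASK, 3);  /* forward */
    reg = set_field32(reg, BURST_LENGTH_MASK, 1);  /* 8dw */
    reg = set_field32(reg, ENABLE_CLK_MASK, 1);
    reg = set_field32(reg, READ_MULTIPLE_MASK, 1);
    reg = set_field32(reg, WRITE_INVALID_MASK, 1);
    return reg;
}
```

Starting from a register value of zero, pcicsr_default(0) comes out as 0x3b8. Since the real sequence reads the register first, any bits outside these fields keep whatever value the hardware reported.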
Yer havin' too much fun, here. It ain't right, I say.
I'm afraid Ralink has asked that the team keep what little hardware info we have private.
That you can avoid timeouts and BLODs by increasing the rx queue size is interesting. Basically, it looks like you can declare victory and go home.
Assuming a relatively constant arrival rate, increasing the rx queue size means that somewhat more time is available for your application to respond to an arriving packet, at which time it can pick up all that have accumulated. This may be a hint that the response by your process is being deferred due to activity by another process of equal or better priority. Unless you've already done so, maybe giving your server process the best possible "nice" value could also help things.
There is a function called "MlmePeriodicExec", in the file mlme.c, that runs once a second. Its mission is to detect and respond to changes in the runtime environment. It may be interesting to put a gauge in there to track rx queue occupancy: run around the queue and see who owns what. If the host owns all the entries, we're WFO. If the adapter owns them all, we're snoozin'. Maybe track highwater mark, most recent, and a running average of - say - the last four samples? Results could be emitted to the log file.
- or maybe not,
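In case it helps, a rough sketch of such a gauge follows. It is written as standalone C so it can be tried outside the driver; the owner enum, struct rx_gauge, and the function names are all invented here, and how ownership is actually read from an rt2500 RX descriptor is not shown and would have to come from the driver source.

```c
#include <assert.h>
#include <stddef.h>

#define RX_RING_SIZE 32 /* ring size mentioned in this thread */
#define GAUGE_WINDOW 4  /* "the last four samples" */

/* Hypothetical ownership flag for one RX ring entry. */
enum rx_owner { OWNER_NIC, OWNER_HOST };

struct rx_gauge {
    unsigned last;                 /* most recent sample */
    unsigned high_water;           /* largest sample seen so far */
    unsigned window[GAUGE_WINDOW]; /* circular buffer of recent samples */
    unsigned count;                /* total samples taken */
};

/* Run around the ring and count the entries currently owned by the host. */
unsigned rx_host_owned(const enum rx_owner *ring, size_t n)
{
    unsigned owned = 0;

    assert(ring != NULL);
    for (size_t i = 0; i < n; i++)
        if (ring[i] == OWNER_HOST)
            owned++;
    return owned;
}

/* Record one sample; intended to be called once per periodic tick. */
void rx_gauge_sample(struct rx_gauge *g, unsigned host_owned)
{
    assert(g != NULL);
    g->last = host_owned;
    if (host_owned > g->high_water)
        g->high_water = host_owned;
    g->window[g->count % GAUGE_WINDOW] = host_owned;
    g->count++;
}

/* Integer average over the last GAUGE_WINDOW samples (fewer at startup). */
unsigned rx_gauge_average(const struct rx_gauge *g)
{
    unsigned n, sum = 0;

    assert(g != NULL);
    n = g->count < GAUGE_WINDOW ? g->count : GAUGE_WINDOW;
    if (n == 0)
        return 0;
    for (unsigned i = 0; i < n; i++)
        sum += g->window[i];
    return sum / n;
}
```

A high_water equal to the ring size would suggest the host fell behind at some point, while a last value of zero on every tick would suggest the adapter is holding all the entries.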