Thursday, April 5, 2012

The initial introduction into "it's the NIC, stupid!"

In my case, an IBM/Lenovo Thinkpad T60 has been modified (not by me) to take an Atheros AR9280 NIC. Unfortunately, the NIC was proving to be very unstable when doing 802.11n throughput testing. The investigations did show I was doing something slightly incorrect with TX descriptors (and I've since fixed that) but the stability issues remained.

The Atheros NICs can expose some host interface error conditions via the AR_INTR_SYNC_CAUSE register. These include PCI(e) transaction timeouts, illegal chip access (e.g. whilst the MAC is asleep), parity errors, and other rather nice things. FreeBSD's HAL and Linux ath9k both have the register definitions for what the bits mean - but unfortunately we don't keep statistics on how often each one fires.

In my particular case:

I'd see AR_INTR_SYNC_LOCAL_TIMEOUT occur. This is because a PCI(e) transaction didn't complete in time. I can tune these timeouts via a local register but that's not the point - I was seeing these errors when receiving only beacons from the access point. That's a bit silly.

I'd also see AR_INTR_SYNC_RADM_CPL_DLLP_ABORT, which is an indication that the PCIe layer isn't behaving well.

I swapped it out for another AR9280 based NIC and suddenly all the instabilities went away. No TX hangs, no missed TX interrupts. Everything looks great.

So as an open source developer, I want to try and put some tools into the hands of the community to be able to debug what's going on - or, if that's not possible, at least get some indication that things are going wrong. Right now the only thing people see is "I see TX timeouts, it must be the driver's/chip's fault." There's too much going on to be able to conclude that from TX timeouts alone.

My game plan is this:

Implement statistics keeping for each of the SYNC interrupts and expose those via a diagnostic interface. Ben Greear has done something similar for Linux ath9k after a private email discussion. He's also seeing MAC sleep accesses, so it's quite likely we'll start finding/squishing these.

Take the offending laptop/NIC to the office and attach it to a very expensive and fancy looking PCIe analyser. I'm hoping we'll find something really silly occurring - like lots of sleep state transitions, or a high number of parity errors.

Try documenting this a lot better so users are able to understand what's going on when their NIC is misbehaving.
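
To make the first item of that plan a bit more concrete, here's a minimal C sketch of what per-cause statistics could look like. It is not the actual FreeBSD HAL or ath9k code: the struct, the function names and the two EX_* bit masks are hypothetical placeholders I've made up for illustration (the real bit definitions live in the driver's register headers). The idea is simply to bump one counter per bit set in the value read from AR_INTR_SYNC_CAUSE, so a diagnostic tool can later show which host interface errors are actually firing.

```c
#include <stdint.h>
#include <stdio.h>

#define SYNC_CAUSE_NBITS 32

/*
 * Hypothetical per-device statistics: one counter for each bit
 * of the sync cause register.
 */
struct sync_intr_stats {
	uint64_t count[SYNC_CAUSE_NBITS];
};

/*
 * Placeholder masks for illustration only. The bit positions here
 * are assumptions; check the real register header before use.
 */
#define EX_SYNC_RADM_CPL_DLLP_ABORT	(UINT32_C(1) << 9)
#define EX_SYNC_LOCAL_TIMEOUT		(UINT32_C(1) << 13)

/*
 * Called from the interrupt path with the raw sync cause value.
 * Increments the counter of every bit that is set.
 */
void
sync_intr_stats_update(struct sync_intr_stats *st, uint32_t cause)
{
	for (int bit = 0; bit < SYNC_CAUSE_NBITS; bit++) {
		if (cause & (UINT32_C(1) << bit))
			st->count[bit]++;
	}
}

/*
 * Print the non-zero counters; a real driver would export these
 * through its diagnostic interface instead.
 */
void
sync_intr_stats_dump(const struct sync_intr_stats *st, FILE *fp)
{
	for (int bit = 0; bit < SYNC_CAUSE_NBITS; bit++) {
		if (st->count[bit] != 0)
			fprintf(fp, "sync cause bit %d: %llu\n", bit,
			    (unsigned long long)st->count[bit]);
	}
}
```

Counting by bit index rather than by name keeps the sketch independent of any particular chip's definitions; a userland tool can map the indexes back to names like AR_INTR_SYNC_LOCAL_TIMEOUT when displaying them.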