Call for beta testers: FIQ_FSM USB driver rewrite

The latest rpi-update kernels now include fiq_fsm, my USB driver rewrite. This is a comprehensive rewrite that pushes more of the time-critical work into the FIQ, in order to get reliable communication with problematic USB devices.

The dwc_otg driver has been rewritten to include a much more fully-featured FIQ. This FIQ now handles select types of transaction by itself, without any involvement from the dwc_otg host controller driver (HCD).

Two modes of operation are provided:

"NOP" FIQ which emulates the base FIQ implemented by gsh some time ago. This is an extremely fast interrupt handler designed solely to hold off SOF interrupts until USB transactions need queueing by the driver. All other driver functionality is unchanged when using this FIQ - there have been some minor tweaks that result in a small reduction in interrupt rate.

"FSM" FIQ - Performs the entirety of all split transactions without driver intervention via a state machine design.- A bitmask module parameter selects which features are enabled at boot-time.- Implements its own stateless microframe "pipelining" allowing optimal use of a full-speed frame bandwidth.- Performs an entire URB's worth (32 or 64 transactions total) of high-speed Isochronous transactions to an endpoint (webcam or DVB dongle)

Advantages of pushing more work into the FIQ:
- When handling a host channel interrupt, the FIQ takes approximately 0.1x the CPU time that the HCD takes to service the interrupt and perform the next transaction. This results in an increase in CPU cycles available for everything else: the reduction typically increases the available cycle count by 20% under worst-case conditions.
- For SOF interrupts, the CPU time is approximately 0.03x the time the driver takes. A 490ns interrupt handler is important when the interrupt generation rate is 8000 per second (a rough calculation follows this list).
- The memory footprint of the FSM FIQ is tiny compared to the HCD. This dramatically reduces both the effect and likelihood of L1 cache evictions when servicing a FIQ interrupt, which should tangentially increase responsiveness.
- The FIQ is unaffected by system interrupt latency. The FIQ is only ever disabled in minimal critical sections where the HCD is reading or writing FIQ state information.
- There's a nice side-effect of performing transactions in lock-step in the FIQ: under heavy load, there's an interrupt-aggregation effect. This means that the total time spent in the HCD interrupt handlers levels off as workload increases, rather than increasing until the Pi grinds to a halt.
- The OTG hardware has a number of bugs that cause scheduling problems for periodic transactions. By precisely tracking interrupt status and host channel state, these scheduling errors can be worked around or masked.
- High-speed isochronous transactions on dwc_otg are especially vulnerable to interrupt latency: by giving control of a whole batch (usually 32 or 64) of transactions to the FIQ, they can be precisely scheduled and also reap the benefit of 0.1x the CPU time.
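To put the SOF figures in perspective, here is a back-of-envelope Python sketch. The 490ns handler time, the ~0.03x ratio and the 8000 interrupts-per-second rate come from the list above; everything else is simple arithmetic, not a measurement.

```python
# Rough estimate of SOF interrupt overhead, using the figures quoted above.
# Illustrative only: real handler times vary with cache state and load.

SOF_RATE_HZ = 8000                       # SOF interrupts per second
FIQ_HANDLER_S = 490e-9                   # ~490 ns per SOF in the FIQ
HCD_HANDLER_S = FIQ_HANDLER_S / 0.03     # implied HCD time (~16 us) from the 0.03x ratio

fiq_load = SOF_RATE_HZ * FIQ_HANDLER_S   # fraction of one CPU spent in the FIQ
hcd_load = SOF_RATE_HZ * HCD_HANDLER_S   # fraction of one CPU spent in the HCD path

print(f"FIQ SOF handling: {fiq_load * 100:.2f}% of CPU")   # ~0.39%
print(f"HCD SOF handling: {hcd_load * 100:.1f}% of CPU")   # ~13%
```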

3. Known issues / bug list
- There is a bug affecting interrupt processing with fiq_enable=0. Use fiq_enable=1.
- If the root port is disconnected on boot, or if the root port is disconnected while the FIQ is running, USB becomes unresponsive. This is because the core changes mode from host to device, which changes the interrupt register meanings.

GitHub issues have been logged for:

dwc_otg.nak_holdoff: default 8. For split transactions to full-speed bulk endpoints, this adjustable parameter specifies the hold-off time, in microframes, before a NAKed transaction is retried. Useful for throttling interrupt frequency. Set to 0 to disable. This can be used with either the NOP FIQ or the FSM FIQ.

The NAK holdoff can be used to dramatically reduce the CPU interrupt frequency when polling full-speed bulk endpoints. The default value of 8 should be used in most cases. However, if all of the following are true:

- The only full-speed devices attached with bulk endpoints are serial UART adapters or similar, with documented tx/rx FIFO sizes
- The data source accumulation rate is "slow" compared to USB 1.1 bandwidth (slow means <0.1Mbit/s)
- Latency of returned data is not an issue (latency in ms = nak_holdoff/8)

then it will be of benefit to increase this figure. Most serial port adapters use quite large FIFOs which can be left for quite some time to accumulate data before polling.

As an example for a device with 128-byte FIFOs and a slow baud rate, the maximum values listed below can be used without risking data loss:

Note: the default value of 8 may cause overflows if the input baud rate is >400,000 baud. Set this to 4 or even 2 if the source data rate exceeds 1Mbit/s.

Control endpoints are throttled to a fixed interval of 8 microframes.
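As a rough illustration of the trade-off, the sketch below estimates how many bytes accumulate in an adapter's FIFO during one hold-off interval, using the relation above (latency in ms = nak_holdoff/8) and the 128-byte FIFO example. It assumes 8N1 framing (10 bits per byte) and treats the hold-off as a lower bound on the polling interval, so real setups need extra headroom for scheduling latency; the numbers are illustrative, not measurements.

```python
# Bytes accumulated in a serial adapter's FIFO during one nak_holdoff interval.
# Uses the relation above: latency in ms = nak_holdoff / 8.
# Assumes 8N1 framing (10 bits on the wire per byte); illustrative only.

def bytes_per_poll(baud: int, nak_holdoff: int) -> float:
    """Bytes accumulated during one hold-off interval."""
    latency_s = (nak_holdoff / 8) / 1000    # nak_holdoff/8 milliseconds
    return (baud / 10) * latency_s          # 10 bits per byte with 8N1

FIFO_BYTES = 128  # example FIFO size from the paragraph above

for baud in (115_200, 400_000, 1_000_000):
    for holdoff in (2, 4, 8):
        fill = bytes_per_poll(baud, holdoff)
        print(f"{baud:>9} baud, nak_holdoff={holdoff}: "
              f"~{fill:5.1f} of {FIFO_BYTES} FIFO bytes per interval")
```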

6. Limitations

The FIQ is still dependent on the HCD to queue periodic transactions in a timely fashion. While each individual stage of a split transaction or high-speed isochronous transaction can be performed perfectly, endpoints can still get a longer service interval if there is a long interrupt hold-off time. With typical strenuous usage of subsystems known to cause interrupt latency (heavy write activity to filesystems/SD card, heavy ethernet use) it is possible for periodic transactions to be queued too late in a full-speed frame to be performed. The FIQ will automatically time out any transaction that could not be started in the correct frame. In most cases, this results in the transaction being re-queued for the subsequent frame (in the case of interrupt transactions), but for Isochronous transports this will cause data loss.

There is an upper bound to the amount of USB 1.1 bus bandwidth that can be used per TT. This restriction is borne of the limitations of the hardware, which conspire to reduce the throughput for all types of split transaction. In effect, we can only make use of:

- Approx 45% of a downstream TT's non-periodic bandwidth
- Maximum 3 periodic transactions per frame per TT, inclusive of Isochronous
- Maximum of 752 periodic bytes per frame for an Isochronous IN or OUT endpoint (a rough conversion to throughput follows this list)
- Using large-bandwidth Isochronous transport will reduce the number of other types of transactions that can be completed in a frame.
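For a sense of scale, the sketch below converts the 752-byte per-frame limit into throughput. The 1000 frames-per-second figure is the standard full-speed frame rate; the rest is arithmetic derived from the list above, not a measurement.

```python
# Convert the per-frame Isochronous limit above into throughput figures.

FRAMES_PER_SECOND = 1000          # full-speed (USB 1.1) frame rate
MAX_ISOC_BYTES_PER_FRAME = 752    # periodic byte limit per frame (from the list above)

isoc_bytes_per_second = MAX_ISOC_BYTES_PER_FRAME * FRAMES_PER_SECOND
isoc_mbit_per_second = isoc_bytes_per_second * 8 / 1_000_000

print(f"Isochronous ceiling per TT: ~{isoc_bytes_per_second // 1000} kB/s "
      f"(~{isoc_mbit_per_second:.1f} Mbit/s of the 12 Mbit/s full-speed bus)")
```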

In the vast majority of cases, with a suitable multiple-TT hub these issues will not be noticeable. The only time contention will be an issue is with single-TT hubs.

The driver rewrite has resulted in a much more aggressive reservation of host channels for transactions performed by the FIQ. Each host channel can theoretically be recycled for another transaction after each transfer complete interrupt (and this is what the HCD does at the cost of significant CPU time) but for a FIQ-enabled transfer the host channel is reserved for the duration of the transfer. This imposes a greater constraint on the maximum number of active endpoints that it is possible to communicate simultaneously with: typical effects would be that bulk transport endpoints start to slow down in throughput as contention for host channels occurs. In extreme cases, Isochronous will start to lose out on host channel contention and thus miss frames.

The microframe scheduler is currently unaware of the increased reservation period necessary for host channels used by the FIQ. It also does not accurately track the frame bandwidth required when considering a full-speed transaction and the associated guard interval.

For what it's worth, the new FIQ_FSM code seems to be working here normally so far, but no real testing; just booted up and checked network connections on a R-Pi model A with 4-port USB hub, keyboard, mouse, and two wifi devices working at the same time. To check things I started up X and Midori which I have not done in a while (normally I use the R-Pi headless, just interacting via ssh). For whatever reason, performance seems faster than I remember.

Some observations on FIQ_FSM (mostly stream of consciousness), running basic USB and network transfer tests, monitoring with bcmstat.sh (xgcd10), with and without FIQ_FSM enabled, also without the entire FIQ_FSM patch (ie. pre-FIQ_FSM). Testing with OpenELEC (kernel 3.13.5) with XBMC disabled (add a very long sleep to autostart.sh).

1) USB write[1] (USB3 Flash storage, ext2), USB read[2] and NFS (udp,rsize=32768,wsize=32768) network read[3] performance seems to be the same (or very slightly reduced) when FIQ_FSM is enabled. Tests were repeated 5 times each, min-max values shown where there is a range of values (peak values for IRQ/s).

Given the results above, I'm not seeing any tangible benefit from FIQ_FSM - if anything performance may be slightly reduced (though statistically it's probably not significant).

That's not to say FIQ_FSM isn't a better solution, maybe it's too early to tell, it's just that USB or network transfers don't seem to show any obvious benefit. Obviously the network performance is constrained by the 100Mbit/s PHY, but CPU is similarly maxed out with/without FIQ_FSM, so again I'm not sure where or what the benefit is with USB/network storage as the "0.1x CPU time" improvements don't seem to have any practical benefit.

2) IRQ/s (as measured by bcmstat.sh) never dips below 8120/s when FIQ_FSM is disabled. With builds that do not include the FIQ_FSM patch, IRQ/s would usually run at about 750/s while the system is idle, so this is a significant difference. When performing the network test with FIQ_FSM disabled, IRQ/s peaked at 16835/s! It appears that when FIQ_FSM is disabled there is now a baseline of 8100 IRQ/s, not including USB or network activity - is this attributable to (and expected with) the FIQ NOP behaviour?

When FIQ_FSM is enabled, idle IRQ/s peaks at 750/s which is about the same as builds where FIQ_FSM is not present. When performing the network test with FIQ_FSM enabled, IRQ/s peaked at 9590 IRQ/s, which is again comparable with a system without FIQ_FSM present (10816 IRQ/s).

Interestingly, once the system without FIQ_FSM had returned to an idle state, IRQ/s on this pre-FIQ_FSM kernel remained at 1010 IRQ/s following the network transfer, whereas the FIQ_FSM enabled kernel would return to the original 750 IRQ/s for an idle system. This is a win for FIQ_FSM, as there seems to be no way to explain the extra ~300 IRQ/s that remain active on non-FIQ_FSM kernels following a network transfer.

So there does appear to be a slight difference in IRQ/s behaviour when FIQ_FSM is added to the kernel and then left disabled, which is seen as significantly increased IRQ/s even while idle. I'm not entirely sure what effect this has on CPU performance - presumably it's costing a few more cycles to service all those extra IRQs as the CPU never went below 5.00% load with FIQ_FSM disabled, but with FIQ_FSM enabled the CPU would idle at 1.85%, and with no FIQ_FSM present in the kernel the CPU would idle at 1.7%.

Conclusion:
FIQ_NOP (i.e. FIQ_FSM disabled) incurs more CPU overhead than pre-FIQ_FSM and significantly more IRQ/s, but strangely appears to just shade pre-FIQ_FSM and FIQ_FSM enabled in terms of performance on USB/network transfers.

A kernel with FIQ_FSM enabled seems to perform roughly on par with pre-FIQ_FSM kernels, however FIQ_FSM does appear to demonstrate superior IRQ control following large network transfers.

The FSM FIQ should have a negligible impact on USB2.0 transfers - its job is to be transparent to high-speed bulk or interrupt traffic - but the larger code size of the FIQ handler may incur a slight overhead. Code paths through the FSM FIQ are not yet completely optimised: there are a number of early returns that could be added in the passthrough states.

2) IRQ/s (as measured by bcmstat.sh) never dips below 8120/s when FIQ_FSM is disabled. With builds that do not include the FIQ_FSM patch, IRQ/s would usually run at about 750/s while the system is idle so this is a significant difference. When performing the network test with FIQ_FSM disabled, IRQ/s peaked at 16835/s! It appears that when FIQ_FSM is disabled there is now a baseline of 8100 IRQ/s, not including USB or network activity - is this attributable (and expected) with the FIQ NOP behaviour?

The NOP FIQ shouldn't behave like that: this is probably a bug. You should see similar interrupt numbers as the original FIQ fix.

Hi,

Uptime since applying the patch: 21:47 hours.

Setup: async DAC NAD D 3020 connected to a powered USB hub; also connected to the hub is my media hard drive with a Samba share. The powered USB hub is connected to the USB hub of the raspi. I usually play music via mpd, accessing the files on the media hard drive.

So far I have been unable to reproduce jitter and crackling, including in situations where, before applying the patch, I could reproduce it: playing music through mpd from an internet music stream while starting pyload on the raspi. Mind, I got regular crackles without stressing the raspi, but if I wanted the raspi to "crackle now", that's what I had to do. But since this patch: no crackles whatsoever. I see no noticeable performance loss. Just for sport I started to stream a movie to my PC through Samba from the raspi media hard drive; again, no crackles on the USB DAC. Me likes this patch

Usage
NAS, torrent downloader and wifi router. Using transmission-daemon to download torrents to the hard drive and Samba to share them over the wifi.

Usual issue
Because of overloading the USB with the Ethernet and the hard drive, I was getting a lot of these messages even when playing with vm.min_free_kbytes. It was especially true when downloading big torrent files.

Test
Downloading a big torrent file (9 GB) at about 3.5Mb/s + copying another big file at 5.5Mb/s using Samba over the wifi + doing a speedtest.net (on the Raspberry directly, using Python) + watching a movie over Samba.

No kevent drops at all, no loss of connection.

Conclusion
From my test (just a couple of hours now), my Raspberry Pi is now clearly more stable and seems to handle the load on the USB ports better. I know it's not a "technical test", but well, I don't have that much knowledge about how the USB is handled; I just understood that my issues with smsc95xx were linked to USB load and system memory. It seems this patch is helping.

I'll test again later this evening, but I didn't really see much difference between the different modes when I tested last time. I suspect my tests don't really benefit a great deal (in terms of performance) from these changes.

milhouse wrote:
I'll test again later this evening, but I didn't really see much difference between the different modes when I tested last time. I suspect my tests don't really benefit a great deal (in terms of performance) from these changes.

Typically, USB mass storage and the LAN95xx devices (along with most ethernet-USB adapters) use USB2.0 bulk transports. There's non-trivial support for this in the DWC core: the hardware will perform a bulk transfer of length N in as many packets as it takes (and as long as it takes) without software intervention.

The remaining overhead is a function of:
- Length of transfer requested (this directly translates into the frequency of transfer-complete interrupts; see the sketch after this list)
- How much software processing is required for the incoming/outgoing data (e.g. Ethernet framing via USB: there's vendor-specific padding applied to packets to delineate Ethernet packet boundaries)
- Whether the device drivers are efficient in moving data around in kernel space
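To illustrate the first point, here is a quick sketch of how the requested transfer length maps to the transfer-complete interrupt rate. The throughput and transfer sizes are illustrative assumptions, not figures taken from this thread.

```python
# Transfer-complete interrupts per second for a given sustained throughput
# and requested transfer (URB) length. Illustrative numbers only.

def completion_irq_rate(throughput_bytes_per_s: float, transfer_len_bytes: int) -> float:
    """One transfer-complete interrupt is raised per completed transfer."""
    return throughput_bytes_per_s / transfer_len_bytes

throughput = 30 * 1024 * 1024   # e.g. ~30 MB/s from a mass-storage device

for transfer_len in (16 * 1024, 64 * 1024, 120 * 1024):
    rate = completion_irq_rate(throughput, transfer_len)
    print(f"{transfer_len // 1024:>4} KiB transfers: ~{rate:.0f} completion IRQs/s")
```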

Of course, CPU cycles that would otherwise go to kernel/userspace work are stolen by USB interrupt handling (which for the most part does nothing) - at 8000 interrupts per second, you will see a non-negligible difference in throughput if there is a significant fractional difference in the number of cycles it takes to handle a USB SOF interrupt, for example.

Each successive test results in higher peak IRQ/s during the test, and lower read rate. These are the results from 10 consecutive tests immediately after booting, showing *peak* IRQ/s and average read rate for each test:

I'm running RuneAudio 0.2 on a Rev A Raspberry Pi with an Ayre QB-9 USB DAC connected (it uses asynchronous USB). With the stock build I'd always get the occasional pops and clicks, so I thought I'd give this new driver a try since it seems to be getting good results from some people when playing audio. As soon as I update the driver and modify /boot/cmdline.txt as instructed in the first post, I get a kernel panic related to the USB driver at bootup.
- If I remove the DAC and reboot the Pi, there's no kernel panic.
- If I remove dwc_otg.fiq_fsm_mask=0x3 from cmdline.txt and plug the DAC back in, then after a reboot there's no kernel panic and audio is streamed to the DAC, but it's full of clicks and pops.

Here's the result of my FIQ FSM testing. After 766 iterations, IRQ/s had increased from 4868 (straight after booting) to a peak of 7221 (+2353), and the read rate had dropped from 30.8614MB/s to 27.8587MB/s (-3.0026MB/s).

Here's a graph of my results, showing read-rate declining rapidly then tailing off, while IRQ/s increases rapidly then stabilises at a high level over time:

I've now started the test again, this time with FIQ NOP, but early indications suggest that read rate is trending downwards, and IRQ/s is increasing.

I don't know if this is expected behaviour, but I would have thought that IRQ/s and USB read rate would remain fairly constant on a system that is doing nothing else (idle CPU is ~1% when not running the tests, idle IRQ/s 650).

The mass storage driver will queue a variable number of URBs using scatter-gather helpers. If memory is fragmented then it's likely that fewer URBs (possibly of shorter length, meaning higher interrupt rate) get created as the amount of contiguous memory drops.

One way to see if this is linked to the slowdown is to monitor /proc/buddyinfo. Large numbers on the left hand side point to increasing fragmentation.
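If anyone wants to log this alongside the transfer tests, here is a small Python sketch that samples /proc/buddyinfo. Each column after the zone name is the count of free contiguous blocks of order 0, 1, 2, ... (i.e. 2^n pages), so shrinking numbers in the higher orders indicate growing fragmentation.

```python
# Quick sampler for /proc/buddyinfo: each column after the zone name is the
# number of free contiguous blocks of order 0, 1, 2, ... (2^order pages).

def read_buddyinfo(path="/proc/buddyinfo"):
    zones = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # e.g. ['Node', '0,', 'zone', 'Normal', '1046', '527', ...]
            zone = fields[3]
            zones[zone] = [int(n) for n in fields[4:]]
    return zones

if __name__ == "__main__":
    for zone, counts in read_buddyinfo().items():
        summary = " ".join(f"{c:>5}" for c in counts)
        print(f"{zone:>8}: {summary}   (order 0 .. {len(counts) - 1})")
```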

As I suspected, it seems to be broadly correlated. The columns in /proc/buddyinfo correspond to free contiguous 2^n page slabs (starting at 0 in the leftmost column, not including cached or buffer pages). The leftmost two or three show a substantial level of thrash which means there's a lot of reclaim and compaction going on.

It'd be interesting if you do an echo 1 > /proc/sys/vm/drop_caches while running a test and see if the rate improves.

I'm unfamiliar with kernel coding rules - can you use a different/home grown allocator to try and avoid fragmentation? Or is the 'problem' not fixable in that way? Is it even a problem - do these increasing numbers actually cause a slowdown?
