Faster SoC and MCU communication

Hi, I figured I should open a topic here first. In brief, I'm looking to implement changes in the Tessel 2 firmware and OpenWrt image to speed up communication between the SoC and MCU. Before getting too deep into that, I want to write out what I'm looking to do and get any thoughts and advice on it.

I'm currently planning two phases of work (and one optional phase):

(optional) Replace spid's usleep calls with busy waits.

Write kmod-tessel-spi, a kernel module that sends SPI messages to and receives them from the MCU on behalf of the SoC, in place of spid and the spidev kernel module. Rewrite spid as a convenience layer for existing Tessel apps that read and write directly to the current unix sockets.

Write an mt7620-spi kernel module based on the rt2880 module currently in use, and use either the hardware FIFO or DMA to send SPI messages truly asynchronously, freeing the CPU to do real work instead of the current busy waiting.

From best I can tell, and I'll write more on it with numbers and such soon, spid and spidev are where most of the time goes. One of the simpler things to track down is spid's use of usleep. Replacing that with any form of busy wait speeds up this work. It can also make better use of the CPU, since switching to another process and back seems to take longer than we really need to wait. This first optional step provides a small boost that could benefit users transparently.
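To make the sleep versus busy wait comparison concrete, here is a minimal userland sketch in Rust. It is not spid's actual code; the function names are made up for illustration, and the 10us figure in the usage note is the sleep length discussed later in this thread.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Spin until `wait` has elapsed, never yielding to the scheduler.
/// Returns the time actually spent waiting.
fn busy_wait(wait: Duration) -> Duration {
    let start = Instant::now();
    while start.elapsed() < wait {
        // CPU hint that we are in a spin loop; this does not sleep.
        std::hint::spin_loop();
    }
    start.elapsed()
}

/// Sleep for `wait`, letting the scheduler run something else.
/// Returns the time actually spent away, which includes the cost of
/// switching out and back in.
fn sleep_wait(wait: Duration) -> Duration {
    let start = Instant::now();
    thread::sleep(wait);
    start.elapsed()
}

/// Average observed wait over `iterations` calls to `waiter`.
fn average_wait(iterations: u32, wait: Duration, waiter: fn(Duration) -> Duration) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let mut total = Duration::ZERO;
    for _ in 0..iterations {
        total += waiter(wait);
    }
    total / iterations
}
```

Calling `average_wait(1000, Duration::from_micros(10), busy_wait)` and the same with `sleep_wait` gives a rough idea of how far past the requested 10us each approach overshoots on a given board.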

Beyond usleep, there are the ioctl calls in spid that hand off to spidev, which lead to a spi_async call. Like usleep, that sleeps the thread until a kthread can perform the underlying work. Sometimes this does the SPI work immediately, but often only for the header message. The payload message often waits, I'm guessing because the Linux scheduler figures the SPI thread just got to do work and something else deserves a chance. (This is a wild assumption. I have benchmark timing that supports this, with the header giving times too small to be a context switch, and the payload regularly giving times too large not to be one. I need to learn more about the scheduler, and the one in use in OpenWrt/Tessel, to better confirm.)

I believe writing a kernel module to do the work of spid and spidev is the best way to remove this regular spot for context switching, leaving source that sends and receives the SPI messages without sleeping. This also has the potential for other speedups: removing the kernel context switching time for the other IO calls, faster IRQ response to async, possibly quicker interaction with gpio calls, removing extra memory copying, and removing cache thrashing due to existing sleeps.

Being a kernel module this would have greater potential for causing a tessel device to malfunction (with a kernel panic) than the userland spid program. But I think we can offset that with plenty of community testing and review before making it a normal update through t2. (Having not read or written a kernel module before doing this research I was really surprised how "normal" the code and APIs are. I was expecting a whole mountain of dragons to be inside.)

The last phase easily seems the biggest piece of work to me. I read the issue relating to Tessel 2 boot time and agree with @kevin that the info on the SPI and DMA hardware in the MT7620 programming guide and datasheet is lacking. When I take a swing at this phase, I don't know if I'll be able to finish figuring out this mystery, though the goal is very tempting to reach. Seeing faster boot times and being able to work on data and communicate at roughly the same time would be great.

I'll come back with some of my rough benchmark data. I look forward to everyone's thoughts. Cheers!

It's not a complete answer, but as one bit of info: kernel development includes a function, udelay, that internally is a busy wait for kernel modules to use. Kernel modules still sleep and block in other ways.

I guess the main way to think about it is to prove that you can do more work on the CPU by busy waiting than by sleeping. I have a few small Rust binaries I'm running on the Tessel to benchmark some of these thoughts. The numbers I'm getting are rough, and I'm trying to figure out whether I'm missing anything that would make them better reflect performance on the Tessel.

Two of the binaries, baseline_unixstream and process_unix_streamed, time 1000 iterations of unix streams passing buffers back and forth. The first is one process using a stream pair, writing into one end and reading from the other. The second is two processes writing to and reading from each other. The first gives us a measure of the time to write to and read from a stream pair. The second gives us the time of two write, read cycles plus the time it takes to switch between the processes to do that.
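For readers who want to reproduce the single-process measurement, here is a rough sketch of what such a baseline could look like. It is not the author's baseline_unixstream source, which is not shown in this thread; the function name and the payload length parameter are assumptions.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::time::{Duration, Instant};

/// Time `iterations` write, read cycles over a unix stream pair inside a
/// single process and return the average time per cycle.
///
/// `payload_len` is a hypothetical parameter; keep it well below the
/// socket buffer size, because the write and the read happen on the same
/// thread and an oversized write would block forever.
fn baseline_cycle_time(iterations: u32, payload_len: usize) -> std::io::Result<Duration> {
    assert!(iterations > 0, "need at least one iteration");
    let (mut tx, mut rx) = UnixStream::pair()?;
    let out = vec![0xA5u8; payload_len];
    let mut back = vec![0u8; payload_len];

    let start = Instant::now();
    for _ in 0..iterations {
        tx.write_all(&out)?; // write into one end of the pair
        rx.read_exact(&mut back)?; // read the same bytes from the other end
    }
    let total = start.elapsed();

    // Sanity check that the data really made the trip.
    assert_eq!(out, back);
    Ok(total / iterations)
}
```

The two-process variant would hand one end of the pair to a second process that echoes each buffer back, so every cycle also pays for switching to that process and back.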

One of baseline_unixstream's runs, after averaging, used 7463ns per write, read cycle.

One of process_unix_streamed's runs, after averaging, used 24079ns per write, read, process switch cycle.

So if my logic is right in those runs that would say this cpu takes 4576ns ((24079 - 7463 * 2) / 2) to switch to another process.

usleep implies two process switches (I'm ignoring other cases here): switching to another process and back, which we can say takes 9153ns. So if we busy wait instead of sleeping the process for 10us, we're only wasting 847ns of the other process's opportunity to operate. With the IO work spid needs to do, I think you could say that is an acceptable margin of error. And spid also wins the benefit of not waiting longer in case that other process has a lot of work to do. Or in other words: what kind of priority do we want to give spid talking to the MCU?

One thing I'm curious about is how long we need to give the MCU after writing the sync gpio. The further that is below 10us, the more time will be available to the CPU to do actual work, since spid would no longer be spending time on those context switches.
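The arithmetic behind those figures can be written out as a small Rust sketch. The function names are made up for illustration; the inputs are the two averaged measurements quoted above and the 10us sleep.

```rust
/// Time spent switching out and back in, in nanoseconds.
/// Model: a two-process cycle = 2 write, read cycles + 2 process switches.
fn round_trip_switch_ns(single_cycle_ns: u64, two_process_cycle_ns: u64) -> u64 {
    two_process_cycle_ns.saturating_sub(2 * single_cycle_ns)
}

/// Cost of a single process switch (half the round trip, rounded down).
fn one_switch_ns(single_cycle_ns: u64, two_process_cycle_ns: u64) -> u64 {
    round_trip_switch_ns(single_cycle_ns, two_process_cycle_ns) / 2
}

/// Of a `sleep_ns` sleep, how much is left for another process to do
/// useful work once the two switches are paid for.
fn useful_time_for_others_ns(sleep_ns: u64, round_trip_ns: u64) -> u64 {
    sleep_ns.saturating_sub(round_trip_ns)
}
```

With the numbers above, `round_trip_switch_ns(7463, 24079)` is 9153, `one_switch_ns(7463, 24079)` is 4576, and `useful_time_for_others_ns(10_000, 9153)` is 847.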

This version only supports usb communication. It doesn't support the port channels yet (need to hook up the socket communication for them, not a lot of work just wanted to post about progress before I do that). For the usb communication it does show a ~2.2x performance boost. The repo needs a lot of commenting and clean up. spid and spidev (and a lot of other resources) were a lot of help in getting this working.

One of the reasons I wanted to post progress now: the last thing I was tinkering with is boosting the SPI max frequency. While researching, I played with boosting the max frequency and didn't encounter any issues. On review, I realized I was doing all my testing over LAN and was avoiding USB to focus on the communication times for port A and B. tesseldev, as of this writing, boosts the max frequency but limits the USB channel's frequency to 10MHz. From best I can tell, when it goes any higher it hits some issue or race condition in what I am assuming is some part of the MCU middleware. While the USB SPI packets are limited to 10MHz, the repo doesn't limit the header packets, and so far those appear to have no issue with that.

If anyone knows what may be blocking the USB performance, that'd be really helpful. Otherwise, I'm hopeful to add Port A and B communication pretty soon, which should gain both from the existing work and likely from an SPI frequency boost.
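As a rough sketch of that workaround, the clock choice per transfer could look like the following. This is not tesseldev's code: the `Packet` type, the function name, and modelling the payload as belonging to a single channel are all assumptions made for illustration, and the boosted frequency is left as a parameter because the thread does not state its value. Channel 0 being USB comes from the reply further down the thread.

```rust
/// Channel number carrying USB data (channels 1 and 2 are the ports).
const USB_CHANNEL: u8 = 0;
/// Cap applied to USB payload transfers, per the workaround above.
const USB_PAYLOAD_LIMIT_HZ: u32 = 10_000_000;

/// Hypothetical view of one SPI transfer.
#[derive(Clone, Copy)]
enum Packet {
    Header,
    Payload { channel: u8 },
}

/// Pick the SPI clock for a transfer: everything runs at `boosted_hz`
/// except USB payloads, which never exceed the 10MHz cap.
fn speed_hz(packet: Packet, boosted_hz: u32) -> u32 {
    match packet {
        Packet::Payload { channel } if channel == USB_CHANNEL => {
            boosted_hz.min(USB_PAYLOAD_LIMIT_HZ)
        }
        // Headers and port payloads are not limited.
        _ => boosted_hz,
    }
}
```

Keeping the cap in one function like this would make it easy to lift once the cause of the USB issue is found.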

Are you saying that higher SPI clock frequencies work when packets contain data for port channels 1 and 2, but fail when they carry data on channel 0 for USB? I don't have a good explanation for that -- the channels are not handled any differently within the section of code with tight timing constraints.

When writing that code originally, I found it very helpful to have a logic analyzer on the SPI, sync, and IRQ pins (that's what the 0.05in pitch diagonal header is for), as well as adding code to set and clear some module port pins when entering/leaving the various interrupt handlers, so I could watch the timing.

I'll take a closer look this weekend. I've also been meaning to review your Rust proposal.

kevin:

Are you saying that higher SPI clock frequencies work when packets contain data for port channels 1 and 2, but fail when they carry data on channel 0 for USB?

I haven't dug deep enough into it yet, since my focus at this stage is making the port 1 and 2 channel communication faster. My current belief is that the higher SPI clock successfully transfers the USB packets themselves to and from the MCU, but that some other code in the MCU or in t2-cli is experiencing a race condition. Something else isn't able to send and receive the USB data that fast. So my workaround at this moment is to limit the USB packets to 10MHz. I figure the time the USB packets take at 10MHz is long enough to avoid whatever race condition shows up at a faster rate.

kevin:

When writing that code originally, I found it very helpful to have a logic analyzer on the SPI, sync, and IRQ pins (that's what the 0.05in pitch diagonal header is for), as well as adding code to set and clear some module port pins when entering/leaving the various interrupt handlers, so I could watch the timing.

I'm new to this level of hardware and need to look at getting a logic analyzer and other stuff to learn these things at that kind of level. But for this I'm not worried about the SPI stuff. That was some of the easier work to get running for the module. (In fact, it worked on the first try once I worked out non-SPI bugs, like misunderstanding the allocation of a struct kfifo, and MODULE_LICENSE determining which function symbols are available to the module.) Currently, and probably once the work is done, the module shares much of spid's source for tracking state and constructing the packets, since the spidev interface design mirrors the kernel SPI interface for transmitting messages.

kevin:

I'll take a closer look this weekend. I've also been meaning to review your Rust proposal.