Saturday, 11 February 2017

It's a bright, sunny Saturday morning!

So how come I'm bashing my head against this embedded-system Linux issue instead of watching cartoons?

Update: Ever get the feeling the C compiler is trying to be too smart? I suspect that the difference in behavior between my workstation and the SOM may involve optimization issues with the ARM target. Like, I set a nonzero value in a global structure, launch a thread, and the thread sees the value as zero. Though... how could that be optimization? The value was set before the thread started running. But there are other strange things happening that seem to defy the structure of my code, suggesting that the code that's running doesn't match the code that I wrote.

Maybe I need to print out some addresses, to see if the two functions are even talking to the same object in memory.

Update 2: Misleading debugging messages! Turns out the confusing output resulted from running the program with a short configuration (only n/8 serial ports being enabled at the moment).

Which leaves the cause of the total system wedgie that happens after the program has been running for a couple of seconds a complete mystery. Maybe any attempt to read modem-control status causes chaos?

Update 3: Cut the program down to where it opens one (1) UART, launches a receive thread, and sits in a loop accepting user commands from stdin. This works, for a while. Then I leave it sitting idle for a few seconds, switch to another ssh session to launch top... and WEDGIE!

But, after a while, I get a dump on the console. Starts with:

rcu_sched kthread starved for 2600 jiffies

Which is apparently a thing. More exploration needed. A jiffy is a third of a microsecond, right? Or is that the wrong kind of jiffy?

[Hm. Can't seem to find that particular usage on the 'Net. I have a vague recollection that, in a certain historical context, all the bridgewires had to pop within a jiffy or the implosion would not be sufficiently symmetrical. 1/3 of a microsecond is a bit less than 3 RDX-millimeters, so I guess that's a reasonable ballpark value.]

Update 4: Built the kernel, again, with some debugging features turned on - mainly hang detection.

In an effort to eliminate my code from the picture, I simply opened one of the FTDI ports with minicom, and... well, with one instance of minicom and one of top running, thing looked OK for a fair many seconds, but then the WEDGIE! occurred again, and shortly thereafter there came a "soft lockup" message from the kernel, and some kind of dump that I guess would make sense to a kernel guru.

Anyway: definitely a problem with the kernel and/or the ftdi_sio driver and/or the use thereof with this particular TI processor and/or its interfaces. Which doesn't narrow things down very far, really, but at least I'm not trying to find the bug in my code. Though, again, if it were in my code, I'd have a much easier time tracking it down. Yeah, I have the kernel source tree. There's a lot of it, and I haven't spent a lot of time studying it.

Update 5: It happens with a standard FT232 plugged into the eval board's hub, and none of my custom hardware involved at all.

It doesn't happen when I plug in a USB Ethernet dongle and run traffic over that.

So! The USB hardware isn't to blame, nor are the drivers for everything up to the edge of the eval board.

This pretty much leaves the FTDI driver, or some abstraction layer that's invoked when the FTDI driver is used (which I guess would be usbserial). Or some arcane interaction.

Um. The 4.4 kernel for the TI processors does ship with USB serial-port support turned off by default. Maybe there's a reason for this? (That would be bad. The product is designed around this SOM.)

Update 6: It doesn't happen with the cdc_acm driver, but that doesn't use usbserial. I was having different trouble with ACM, but at least that wasn't causing system hangs.

I found a dongle with a PL2303 chip - knew I had one around here somewhere. Now to build a kernel with PL2303 support configured, and test with that. May gain some enlightenment thereby.

Update 7: No wedgie with the PL2303. No wedgie with the ACM device, though there's still a different weird issue with that driver (under some conditions, reading the device causes fragments of the data from the device to be sent back out to the device, and maybe sometimes the read is returning previous data).

Still wedgie with the FTDI. Looks like somebody had a system-hang problem on the RasPi back in '12, but it looks like that was with all manner of serial dongles. Still, it could be related, despite my target system having a different brand of processor chip. Hm. More RasPi here. I'm not (on the spur of the moment) finding an older (not high-speed) USB hub around here to test with... which might be informative, or might not, though the actual gadget under development needs to run its USB at high speed (and the FTDI quad UARTs are high-speed-only in any case).

Update 8: Now it's a warm, sunny Sunday afternoon... and I'm at my desk again. It seems that when the timer ticks and the watchdog barks, what the system was doing before the tick was this:

So, hm. Something in the host controller driver framework is hanging? The bit where it... recycles?... a USB request block?

The function being interrupted doesn't appear to have any loops in it (though there may be macros that expand to loops?), but the calling function does. So maybe usb_giveback_urb_bh is buzzing in a loop, giving back URBs until a list is empty, and the list got somehow corrupted and became infinite?

Someone proposed a patch relating to ARM, giving back URBs, and tasklet, back in 2013. Ah! That's in fact where these two layers of functions got added. So I can't expect to find the cause of this hang in the explanation for that patch.

Gronk. Best guess is that this code is running amok because some other code wrote where it shouldn't have, and messed with its data. Maybe it is something at the HCD level, and it manifests itself in different ways depending on the device and random circumstances?

Update 9: Oho! Having left the PL2303 plugged in overnight with cat /dev/ttyUSB0 and top running, with no visible problems, and having been awakened at 0300 by a playful feline, I interrupted the test, tried running a more demanding test, was reminded that the test in question (as the code currently stands) needs at last four UARTs or it won't actually do anything, restarted the test that had been running overnight... and got error messages on the console. Things like:

Huh. So the problem is in some underlying layer, and just manifests itself differently depending on the device and the circumstances. Probably it's also the cause of the inconsistent strange behavior of the cdc_acm device.

Since most of the USB protocol stack is in widespread use elsewhere, I have to suspect that the problem is in one of the layers specific to this TI processor's onboard USB host controller.

Which, hm. The product is kind of designed around this particular SOM, but for evaluation purposes we could maybe get away with shoving a RasPi or something inside the box with a mounting bracket and a kludge cable. Wouldn't meet the environmental specs - not even close - but at least it would show off the concept. That could be the short-term backup plan, if I can't soon get the USB UARTs working with this one.

Update 10: With a fresh download of the kernel source, the incantation to use 4.9.y instead of 4.4.y, and a certain amount of manual intervention because Olimex's patch script doesn't work on 4.9, the mysterious system demise goes away.

Which brings me back to the problem I set out to fix by going to the 4.x kernel and compiling from source: there seems to be some sort of resource exhaustion at work when I open more than about 9 USB serial ports (I haven't tested to see if opening built-in serial ports counts against the limit, but I'm pretty sure it's USB-related, and in something common to usbserial and cdc_acm).

Um. Also, a bunch of USB serial ports ends up putting somewhat of a load on a dinky embedded CPU. Something about all those layers of protocol. Still, I think it'll be OK with a bit of tweaking... as long as I can get past that silly resource issue.