USB Testing

Comments

My apologies for being absent for so long, but I finally had a good stretch of days to make some progress in P2-ES evaluation.

But I've run into a bit of a snag that may involve the smartpin USB transmitter state machine.

When I did my initial quick/dirty tests with the P2-ES board I just modified the code to configure the sysclock to use the 20MHz crystal to match the 80MHz used by the FPGA. This gave the same results I had been getting on the FPGA -- most devices enumerated cleanly, some 3.x parts did not. Then I cranked the sysclock up to 180MHz and the IN transactions broke, which I was expecting because I had been cutting some timing corners due to the 80MHz constraints. Once those were cleaned up, device enumeration was working better than ever, including those pesky 3.x thumbdrives that were being so flaky at 80MHz!

But then I ran my media read "stress" routine that scans the first FAT region and calculates the count of free cluster entries so the "DIR" output can show the volume bytes remaining -- and it broke.

On USB, the SCSI command parameters are packaged into a CBW (Command Block Wrapper) that is sent from host to device using the bulk-out transfer protocol. If the CBW is "valid and meaningful", the device ACKs it and waits for the host to issue consecutive bulk-in transfer requests; the device responds with IN data packets (up to the full-speed data packet limit of 64 bytes) until the requested sector count is met. When the sector count is fulfilled, the device returns the SCSI command success/fail result in a CSW (Command Status Wrapper) via a final bulk-in transfer. To exercise the out/in process I was requesting only one sector at a time, so when things went smoothly the sequence was 1 OUT and 9 IN. But when the eventual fail happened, it was on the OUT transfer which, given the 1:9 ratio, was an eyebrow-raising moment.
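For reference, the CBW layout comes from the USB Mass Storage Bulk-Only Transport spec. Below is a minimal Python sketch (not the author's Spin2 code) that packs the 31-byte wrapper around a SCSI READ(10) command; requesting one 512-byte sector this way yields 8 data packets plus the CSW, matching the 1-OUT/9-IN pattern described above.

```python
import struct

def build_cbw(tag: int, data_len: int, lun: int, cdb: bytes, data_in: bool = True) -> bytes:
    """Pack a 31-byte Command Block Wrapper (USB Mass Storage, Bulk-Only Transport)."""
    assert 1 <= len(cdb) <= 16
    flags = 0x80 if data_in else 0x00      # bmCBWFlags bit 7: direction of the data phase
    return struct.pack("<IIIBBB16s",
                       0x43425355,         # dCBWSignature = "USBC" (little-endian)
                       tag,                # dCBWTag, echoed back in the CSW
                       data_len,           # dCBWDataTransferLength, in bytes
                       flags, lun, len(cdb),
                       cdb)                # command block, zero-padded to 16 bytes

# SCSI READ(10): one 512-byte sector at LBA 0 (multi-byte fields are big-endian per SCSI)
cdb = struct.pack(">BBIBHB", 0x28, 0, 0, 0, 1, 0)
cbw = build_cbw(tag=1, data_len=512, lun=0, cdb=cdb)
```

With a 64-byte full-speed packet limit, the 512-byte data phase takes 8 IN packets, and the CSW is the 9th.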

The device was not responding to the OUT transfer because the out data was seen as corrupt. The USB analyzer diagnosed the fail as a bad CRC (see image CRCFailRecover). At the time I had the transaction-fail retry limit set to six, so I increased it to 20 to see if it could recover, and a pattern emerged. Sometimes the first error encountered would recover, but farther into the transfer it would fail again, and the retries increased exponentially until whatever retry cap I set was hit.

The bad packet byte number was always one of #6, 7 or 8. If the device deems a packet invalid due to corruption, it may ignore it and wait for the host to send another, or cancel the transfer by issuing an endpoint reset. The offending byte in this packet was #7 (0x91) in all these fails (it has differed in others), and the 0xBF that makes the CRC work is in the recovery packet. The analyzer always shows the CRC that was transmitted, and all failed packets AND the accepted packet had the same 0x3F29 CRC, so it was definitely a bus glitch. It looks like just four bits or fewer get scrambled whenever there's a fail.
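For anyone following along: full-speed USB data packets carry a CRC-16 (polynomial 0x8005, bits sent LSB-first, inverted init and output). A minimal reference implementation illustrates why any flipped bits in the payload force the receiver to drop the packet; this is the standard USB data CRC, not the author's code, and the 0x91/0xBF byte pair is taken from the post above.

```python
def crc16_usb(data: bytes) -> int:
    """CRC-16 as used on USB data packets: poly 0x8005, LSB-first, init/xorout 0xFFFF."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF

# Flipping bits in one payload byte changes the CRC, so the receiver
# drops the packet and the host must retransmit it verbatim.
good = bytes(range(64))
bad = bytearray(good)
bad[7] ^= 0x91 ^ 0xBF            # the observed corrupt vs. recovered byte values
assert crc16_usb(good) != crc16_usb(bytes(bad))
```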

My subroutine to output a byte to the USB has always been a clone of the example in the smartpin_usb_turnaround.spin2 program in the FPGA version drops:

I added a RDQPIN after the return of the output sub to check status, but the error bit (#6) was never set. In the end, that doesn't really matter, because the only way to "recover" from a bad CRC is to re-transmit the packet. But I was stuck: once the OUT CRC fail got locked in, the transfer was doomed.

I re-read the smartpin docs regarding WRPIN/WXPIN/WYPIN/RDPIN/AKPIN behavior and everything looked good. The USB transmit is somewhat unique in that the "feed me a byte" flag is raised on the D+ (upper/odd) pin, but the WRPIN is done on the D- (lower/even) pin, so an AKPIN is necessary on the upper pin. The doc mentions that it takes at least two clocks for the flag pin to drop following AKPIN, so I added a NOP, which didn't work. I added another, which sometimes worked, so I added another; at 180MHz there were still OUT CRC fails, but they would recover after two or three retries. So I kept adding NOPs until the OUT CRC fails vanished completely, then replaced the NOPs with a WAITX #10.

Then I started testing at higher sysclocks, bumping up +12MHz each time. But around 216MHz the OUT CRC errors came back, so I increased the delay and they went away again. Eventually I ended up with these results, settling on a "mostly effective compromise tweak" setting:

But I have no answer as to why this is, other than it doesn't seem to me that a clock-based state machine should behave in this manner. All I know for sure is that the delay must be between AKPIN and WRPIN, or it all falls apart.

I've got a little more cleanup to do, but hope to get some code posted sometime today...

Glad to hear you figured out a way to make it work!
I guess that's the important part.
Seems we could just bit-bang the USB with the kind of clock speeds we have now, right?
I've been afraid to go to 300 MHz... Wonder how safe that is...
Can we just divide the clock freq by some number to get a good waitx value?
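On the waitx question: if the WAITX #10 that worked at 180MHz represents a fixed settle *time* (roughly 55ns) rather than a fixed clock count, then yes, it could be derived from the clock frequency. A sketch of that idea follows; the 55ns figure is back-figured from the post's data points and is a guess, not a characterized silicon parameter.

```python
from math import ceil

def akpin_delay_clocks(sysclk_hz: int, settle_ns: float = 55.0) -> int:
    """Hypothetical: convert a fixed settle time into a WAITX clock count.

    55 ns is inferred from the WAITX #10 that worked at 180 MHz in the
    post above; it is an assumption, not a verified hardware spec.
    """
    return ceil(sysclk_hz * settle_ns / 1e9)

# 10 clocks at 180 MHz scales to 12 at 216 MHz, consistent with the
# observation that 216 MHz needed a longer delay than 180 MHz did.
```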

I suppose "at least two clocks" is still technically accurate... Wonder if it's an amount of time, rather than fixed # of clocks...

But I have no answer as to why this is, other than it doesn't seem to me that a clock-based state machine should behave in this manner. All I know for sure is that the delay must be between AKPIN and WRPIN, or it all falls apart.

A state machine should be either correct, or not, and subsequent delays should, of course, not matter.
However, I think the P2 does not quite do proper USB clock locking/extraction. Instead you WAIT on some event, which gives you a lock reference, and then run a 12MHz NCO to try to keep phase with the following data, based on that initial sample. Waits there may be shifting the following sample points? Could that apply to your code?

How often do you resync to a USB edge ?

As data packets get larger, that gets more fragile.
I think you are saying, this is ok with smaller packets, but fails above some size ? (6,7,8?)
Did you test sizes all the way up to the max 64 ?
If you are going to lock on an edge, and then remain 1/4 USB time aligned, over ~64 bytes I get

64*8*1.333*4 = one part in 2729.984
or, in ppm = 366ppm

Appx time is 64*8*1.333*(1/12M) = 56.87us, so those numbers are ballpark similar to VGA jitter.
If you can hold one part in 2730, over 57us, you are likely to get a clean VGA display.
Those are 64-byte estimates; packets as small as 6, 7, 8 bytes should be proportionally more tolerant.
If you hope to resync to a frame marker only, that then needs ~21ppm timebase, but you can auto-cal to the USB frame marker.
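The back-of-envelope arithmetic above can be reproduced as follows; the 1.333 factor is taken from the post as-is (presumably the worst-case NRZI/bit-stuffing run between edges), and the 4x is the quarter-bit sample alignment.

```python
BITS = 64 * 8            # largest full-speed data payload, in bits
OVERSAMPLE = 4           # holding alignment to 1/4 of a USB bit time
EDGE_FACTOR = 1.333      # factor used in the post (worst-case run between edges)

parts = BITS * EDGE_FACTOR * OVERSAMPLE            # parts the timebase must hold
ppm = 1e6 / parts                                  # required timebase accuracy
packet_time_us = BITS * EDGE_FACTOR / 12e6 * 1e6   # duration of a max-size packet
```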

What you could also try, is capture a counter every frame marker, and check the variation in sysclks - that will show jitter over 1ms windows.

But maybe it has to do with how the receiver is working... If the receiver is syncing its clock to the packet itself, maybe it is forced to resync when there is a big enough gap between packets...

I think this particular issue may only be related to the transmitter. The smartpin tx/rx state machines are separate, but the baud generator is common to both. The receiver has been working spectacularly at all sysclocks I've tried. So far, if something has gone awry regarding the receiver, it has been due to stupid programmer tricks.

The start-of-packet byte is what the receiver syncs to and, on a larger scale, the 1ms frame timing must be 1ms +-500ns from SOF to SOF. If either is violated, the analyzer will report it.

Regarding the 1ms frame counter, in my testing of different sysclock rates, I tried 144.5MHz to see how the smartpin baud generator would do, and the analyzer flagged every 1ms frame as having jitter. I didn't go any deeper than this one quick test because I have a ton of other issues at a higher priority.

That's good news, at least. Receiver issues are likely to be lower level, and harder to sort.

Is this test a USB host, with flash drive slaves as test devices ?
Maybe some slaves expect TX edges to be 12MHz aligned within the frame ?
USB 3 devices likely have a fancier PLL, and more assumptions about timing that creep into their 12MHz side.

12MHz only designs may use a simpler digital-locked loop edge extractor locking, which is more tolerant of unit to unit timing ?

Yep. My test device pile is a dozen "thumb drives" of which four are 3.x parts, seven are 2.x and one is 1.1. I also have an old SDCard reader, a uSD reader and a uSD/SD reader (2.x). Capacities range from 64MB to 64GB.

Maybe some slaves expect TX edges to be 12MHz aligned within the frame ?

I don't think so. The host/hub must schedule the bus transfer(s) so they are guaranteed to complete within the 1ms frame window. At least that's the only rule I've been following and the analyzer hasn't complained, yet.

USB 3 devices likely have a fancier PLL, and more assumptions about timing that creep into their 12MHz side.

Yeah. It's a whole new world above 12MHz (micro frames and such).

12MHz only designs may use a simpler digital-locked loop edge extractor locking, which is more tolerant of unit to unit timing ?

In general, I've found 12MHz devices to be very tolerant. When I first started this project, the oldest devices I had were the first to talk to me.

It's currently a RAM pig -- ~15KB code, 74KB data and buffers.
You must use PNut_v32j as USBMS.spin2 breaks the symbol limit of earlier versions. It's in the .zip file if you don't have it.

Edit: The latest version of p2asm should work also. Doh! I forgot I'm working with my own patched version of p2asm.exe that adds support for the if_00, if_01, etc. instruction prefixes. I've submitted a pull request to @Dave Hein, but it hasn't been picked up yet.

Default sysclock is 180MHz.
Serial default is 2MBaud.
On the P2-ES I recommend using the auxiliary "P2 USB" connector for power. If you only use power from the PC, connecting a USB device may trigger a brown-out detection reset.
Remember that the USB smartpins have their own pull-up/pull-down resistors, according to the mode that is set (device or host). I use 27-ohm terminating resistors on D-/D+.
With one exception, all media accesses are read.
See the Readme.txt file in the .zip for more information.

Then I started testing at higher sysclocks, bumping up +12MHz each time. But around 216MHz the OUT CRC errors came back, so I increased the delay and they went away again. Eventually I ended up with these results, settling on a "mostly effective compromise tweak" setting

Is that working from a 4MHz PFD? (20M/5)
You could try 120MHz and 240MHz, as those allow 20MHz/1 into the PFD, which is known to be least-jitter case.
If that makes little difference, then PLL/VCO jitter might not be an issue here.

Below are the frequency settings I've been testing at. Now, before you gasp in horror, remember that I am not an EE :crazy:
I modified _FRQ_180 to use 20M/5. It does indeed look like it could be self-induced jitter.

I would sure appreciate you reviewing the below and making changes as needed to what you think might be the best values, and I will test.

And maybe you could put together a brief "PLL for dummies" that explains each values effect for us non-EEs? I know I would sure appreciate it.

That's a 1MHz PFD (20M/20 = 1), and that can certainly be easily improved to be 20M/5 (more flexible) or 20M/1 (best)

The known issue with the P2 VCO/PLL is that high values of _XDIV are worse than lower ones.
This can also have temperature zones, where jitter worsens, then improves as the P2 warms.

i.e. _XDIV = 1 is the best, giving the lowest jitter, and it should be used whenever possible, or whenever you suspect jitter.
_XDIV = 40 is very high, and that will contribute jitter.

The PFD is the Phase Frequency Detector, which is Xi/_XDIV. That is the speed at which the PLL checks and corrects for lock.
If you wanted to measure jitter, you could take the 1ms USB frame marker, and capture a counter difference.
(that gives average time movement over a 1ms window, so is a little indirect, but it is easy to do)
In an ideal world, that would be +/- 1 sysclk. If we assume 0.25/12MHz is tolerable, that's 2.5 sysclks at 120MHz, or 5 sysclks at 240MHz of variation.
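The PFD-rate and jitter-budget numbers from the last two posts, as a sketch (the 20MHz crystal and the quarter-bit tolerance are as stated above):

```python
XTAL_HZ = 20_000_000

def pfd_hz(xdiv: int) -> int:
    """Phase Frequency Detector rate: crystal frequency divided by _XDIV.
    A lower _XDIV means faster lock corrections and hence less jitter."""
    return XTAL_HZ // xdiv

def tolerable_jitter_clocks(sysclk_hz: float, usb_bit_fraction: float = 0.25) -> float:
    """Sysclks of drift that fit inside a quarter of a 12 MHz USB bit time."""
    return usb_bit_fraction * sysclk_hz / 12e6
```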