The USB mass storage performance issue

I’ve had a look at this with FIQprof over the past couple of days, and I think I’ve found what the problem is (or one problem, at least). SCSISoftUSB uses the 100Hz TickerV to initiate all its USB transfers, but, for my test case of saving a 64MB sprite to an SD card, each individual transfer was completing in around 3.3-3.5ms – meaning there’s over 6ms of wasted time in between the transfer completing and glue_Tick() realising that the DeviceFS buffer is empty. So the obvious solution to me (at least for improving the speed of large transfers) would be to find some way of triggering a callback on completion of the transfer, so that SCSISoftUSB can wake up and tell DeviceFS to refill the buffer.

Although I’m now fairly familiar with the innards of the NetBSD driver, I know practically nothing about how it’s presented to RISC OS via DeviceFS/BufferManager. I can see in the USB driver source that it’s possible for a C/APCS function to be called as a callback inside usb_transfer_complete(), but it doesn’t look like that callback interface is exposed via the DeviceFS API (or maybe DeviceFS relies on the callback itself?)

By looking at my profile results I can see that a “buffer emptying” upcall is triggered by the buffer manager when the USB driver receives the “transfer complete” interrupt – would this be an appropriate upcall for SCSISoftUSB to listen out for? SCSISoftUSB already knows the buffer ID associated with the DeviceFS pipe, and it already uses upcalls to trigger glue_ReopenStream(), so it might be OK for it to listen out for buffer emptying/filling upcalls as well. It could then find some safe way of entering glue_Tick() in order to refill the buffer (just use OS_AddCallback, so it gets entered once the USB driver finishes servicing the interrupt?)

Or is there some other callback method that I’m missing? I can see in glue_Tick() that there are some comments about using Event_OutputEmpty, which looks like it’s a more ‘official’ way of detecting the buffer being empty. But it also mentions UpCall_DeviceThresAbove/Below, which I haven’t seen triggered (not that I can say for certain that they’re not triggered – or is there a SWI that needs calling first in order to set the threshold?)

And also there’s the comment in glue_Tick warning about reentrancy problems if events & upcalls are used – but if I just make sure interrupts are off before calling glue_Tick then everything should be OK due to the semaphore check at the start, right?

After staring at the code a bit I came to the conclusion that the buffer emptying & buffer filling upcalls are there for the user to make use of (after all, why would the USB driver care that a buffer it’s reading from is becoming empty? The driver knows how much data is available when it starts to read from it). I’ve also confirmed that the BSD USB callback facility is used internally by the USB driver, for the purpose of notifying the buffer manager that the data has been added/removed.

So I’ve now modified SCSISoftUSB to listen out for buffer filling/emptying upcalls and check if it’s a buffer it owns. If it is, it registers glue_Tick with OS_AddCallback (calling glue_Tick directly seemed to only result in bad things). However for the callback to actually be triggered I’ve also had to add code to SCSISwitch to trigger callbacks in the loop that waits for background transfers to complete. I’ve used the same code that the USB drivers use; i.e. the rather ugly method of looking at IRQsema in the kernel’s workspace and only triggering callbacks if we’re not in an interrupt handler (This should really be added as a SWI!). Failing to check IRQsema only resulted in bad things happening. But with the callback in place, USB performance for bulk transfers improved significantly – the 64MB sprite now saves in 9 seconds instead of 23.

I’ll do some more tests tomorrow, and see if I have a faster device to test with (the SD card doesn’t have any class rating printed on it, so I’m not sure what max speed it’s capable of). I’ll also do some checks to make sure we’re double-buffering the data properly (at the moment the profile graph looked like about half the time was spent filling the buffer, and half the time was spent waiting for the next USB completion interrupt – although that may just be because I’ve gone overkill and my 800kHz profiling timer is eating all the CPU time). Some tweaking of buffer sizes, and a longer term goal of using DMA to copy the data into the buffer, may also be appropriate.

And I still need to take a look and see if there’s anything wrong with transfers of small files, and to take a look at what the hell’s going on with mounting filecore-formatted devices (An 8GB SDHC card literally takes 2 minutes to mount if it’s filecore formatted, but FAT formatted devices seem to take the same amount of time no matter what size they are)

Assuming these driver modifications don’t suddenly fail, and no-one can come up with a reason why SCSISwitch shouldn’t be allowed to trigger callbacks, I’ll check them into CVS in a few days’ time once I’ve had more time to test them.

I’m a little hesitant about the idea of triggering callbacks in the middle of filing system operations. As I recall, the USB DeviceFS files need you to use a filing system special field to enable that behaviour (which is disabled by default) because various engineers at Pace had serious doubts about it being added.

I think possibly a better approach would be to use the callback facility provided by the RTSupport module instead. It is specifically designed for serialising time-sensitive low-level routines – and although what finally prompted me to write it was audiovisual decoding, it’s equally applicable to things like data transfer routines. It’s similar in principle to WSS’s CBAI module, though I was able to address some of the CBAI module’s shortcomings by tweaking the kernel interrupt dispatcher to accommodate it. No doubt if I’d written the RTSupport module earlier, I might have used it for SCSISoftUSB in the first place…

I’ve now rewritten SCSISoftUSB’s glue_Tick code to use RTSupport instead of TickerV. The code seems to work OK, giving the same performance as my TickerV+callback code, but it looks like there’s a bug in RTSupport that I was hoping you’d be able to provide some insight on.

The RTSupport implementation follows the same basic rules as the TickerV one, i.e. it deregisters glue_Tick when there’s no work to be performed. But after registering & deregistering with RTSupport 127 times, the next registration attempt will result in a “Too many RTSupport routines” error, because RT_Register can’t find any free space in its dynamic area for the SVC stack. Initially I thought it was just a logic bug with the stack pointer wrapping around (since 127 is the same number as the #define ROUTINES in the RTSupport source), but it looks like it’s actually caused by OS_Memory failing to return an error for the pages that had been unmapped from the dynamic area when the previous instances of glue_Tick were deregistered. Do you know if this is a bug in the kernel or a bug in your code? (Before I spend the next few hours trying to decipher the OS_DynamicArea/ChangeDynamicArea/Memory source code!)

I’ll first check that you’ve realised that a RTSupport callback routine doesn’t automatically get deregistered once it has returned. In this respect, it’s more like a non-transient callback handler (OS_SetCallBack) than a transient callback handler (OS_AddCallBack). This saves on heap management calls (data blocks relating to each transient callback are allocated and freed in the system heap for each call to OS_AddCallBack) which you generally don’t want in time-critical code. But it does have the implication that we have to set an arbitrary limit on the number of RTSupport routines that we preallocate. In this case, I’d anticipate one RTSupport routine per USB pipe, so I’d be surprised if you were really getting anywhere near 127 of them.

Of course, RTSupport is relatively new as RISC OS modules go, and has received relatively little use outside of the STB design it was originally written for, which wouldn’t have fully exercised it. As a result, it would be the first place I’d go looking for bugs.

Yeah, I’m definitely using RTSupport properly and deregistering the callback before attempting to re-register it. If you’re not sure off the top of your head whether there could be a problem then I guess I’ll just have a dig around and see what I can find.

For the past few days I’ve been using a modified version of RTSupport where I’ve rewritten the SVC stack allocation in order to avoid the above bug. I think I’ve eliminated all the unnecessary stalls in SCSISoftUSB – performance for large and small file operations has now improved considerably. To give you all some juicy stats:

Saving/loading a 66MB sprite now takes 9 seconds instead of 23 (Yes – the sprite is 66MB, not 64MB – I forgot about the space used by the mask channel in my earlier post!)

Using the WIMP filer to copy the sprite from one folder to another, on the same device, with a 4MB ‘next’ slot and ‘Faster’ enabled – now takes 20 seconds instead of 225 seconds (With the current driver, the delays with starting transfers really hurt the performance of the small transfers the filer does for interactive file copies – and I think it confuses the transfer rate adjustment algorithm a bit too)

And finally, for fans of small files – a single-tasking BASIC program which does nothing but create 1000 small files now runs in 63 seconds instead of 1151 seconds! For reference, the same program ran in around 53 seconds when writing to my Iyonix’s hard disc, so I’d say that the performance of USB is now pretty damn good.

Also, another fun fact – I tested the performance of the SD card I’m using on my PC, by writing a 1GB file to it. It finished in 190 seconds, giving a 5.38MB/s write speed – which, if you believe my numbers, is slower than the 7.3MB/s RISC OS was able to achieve when saving the 66MB sprite.

Unfortunately there are still a few bugs I need to work out before I can release the updated driver. At the moment it looks like it’s just some issues with my MUSBDriver module (and possibly the EHCI hardware lockup/failure bug on the BeagleBoard), but some more testing will be needed to make sure.

Random question: does this affect the time it takes to mount a device? ISTR that you don’t want to make your partition too big when formatting with SCSIForm because it takes ages for RISC OS to mount. I can’t actually remember the finer details of what that issue was, or even if it was related to SCSI or USB so forgive me if it’s a nonsense question. ;)

I haven’t taken a look at the long mount times yet. It wouldn’t surprise me if my changes have improved them, but I suspect there’ll be a deeper issue somewhere that needs fixing before we’re free of long mount times.

Are the long mount times peculiar to Filecore, or do they apply for example to FAT formatted devices too?

I know this is really a wider issue and a lot of work, but filecore may not be the best choice here, especially for things like SD cards. In that case, we might even consider things like YAFFS, JFFS2, etc.

Are the long mount times peculiar to Filecore, or do they apply for example to FAT formatted devices too?

I think it’s just FileCore. A 2GB FAT device mounts in a few seconds, but an 8GB FileCore one took over 2 minutes when I tried it.

Anyway, after a bit of hunting I’ve tracked down a bug in RTSupport. The only problem is that I’m not really sure what needs doing to fix it!

The bug is quite simple – if the foreground process yields to allow a callback to run, and the callback then attempts to kill itself with RT_Deregister, the ‘Die’ routine in s.scheduler gets confused and decides that no context switch is needed – which means that RT_Deregister returns control back to the dead callback, and bad things happen soon after depending on exactly how the callback is written.

Looking at Die, there are the three branch instructions that attempt to work out what to do:

BHI %FT99 ; must be in interrupt context
BLNE SomethingsGoneWrong
BCC %FT99 ; we were actually in the foreground

It looks to me like the first branch will trigger if IRQSema != LastKnownIRQSema, and IRQSema != 0. Which I guess is something to do with detecting nested interrupt handlers, and only gets checked here to help with the SomethingsGoneWrong check?

The second branch will trigger if IRQSema != LastKnownIRQSema, and IRQSema = 0. Which I guess means that we’ve left an IRQ handler without RTSupport being aware (Is this the “In the foreground and didn’t know it” case that the comment at the start of the function talks about?)

The third branch will trigger if IRQSema = 0. Which means that whenever we’re not in an interrupt handler, RT_Deregister won’t switch to a new context.

To me it looks like the obvious fix would be to switch the BCC for BCS, so that if we’re in an interrupt handler, and IRQSema hasn’t changed, we don’t switch contexts. However I’m not really sure how this relates to the “we were actually in the foreground”/”Only return if in an interrupt routine or in the foreground and didn’t know it” comment. Presumably the foreground you’re talking about here is the foreground Context – in which case don’t we already know that we’re not in the foreground, since Die only gets called by Deregister(), and it’s impossible to deregister the foreground?

To me it looks like the obvious fix would be to switch the BCC for BCS

Or maybe not, since I just tried that and it now fails to deregister properly if the callback is entered as we are leaving an interrupt handler.

(Time passes)

I’ve just tried with the BCC check deleted, and it seems to work OK - Does this sound like a safe thing to do? It doesn’t look like it should affect the SomethingsGoneWrong code, since SomethingsGoneWrong preserves the flags.

And is SomethingsGoneWrong safe anyway? If we entered the foreground Context without RTSupport knowing, then we could end up killing the foreground without knowing – so surely the code should be:

If IRQSema != LastKnownIRQSema, and IRQSema != 0, do nothing

If we haven’t unexpectedly left an interrupt handler, kill the thread

Else (we have unexpectedly left an interrupt handler), do nothing (since we must now be in Context 0)

Just wanted to say well done on those USB file transfer results. I think almost everyone’s using USB memory sticks nowadays, so that’s something that every RISC OS (5) user will really benefit from. Great work!

I had no idea about the RTSupport module. Looks like a useful module, and it seems like your work with it should end up fixing a few bugs there, so all to the good!

I had a look at the “Too many RTSupport routines” bug today. It turns out that there are two bugs:

Because the RTSupport routine in SCSISoftUSB deregisters itself, it often gets called when IRQSema is set. This then causes OS_ChangeDynamicArea to fail because it isn’t allowed to move memory from within an interrupt handler. So RTSupport’s memory allocation strategy is flawed and will need modification (unless we want to disallow deregistration of routines from within IRQ handlers, or from IRQ-triggered RTSupport routines)

There’s also a bug in the kernel that stops OS_ChangeDynamicArea from reporting the error – failure_IRQgoing in s.ChangeDyn hasn’t been updated to be 32bit safe, and tries to poke the V bit in the return address on the stack. That looks obviously wrong, although as it happens the ‘lr’ on the stack actually is the PSR, as with all other kernel SWIs, so the poke works. But after calling OS_ChangeDynamicArea the batcall code in OS_DynamicArea clobbers the V flag before it checks it, so it never reports any errors returned by OS_ChangeDynamicArea.

I’ve now got fixes ready for all the abovementioned bugs, and it doesn’t look like there are any more bugs to fix before I can release the new SCSISoftUSB. However, there’s one more issue that needs resolving – How should I allow RTSupport to detect versions of the kernel which have dodgy error handling for sparse dynamic areas?

At the moment the Cortex kernel is on version 5.14, and the development HAL kernel on 5.15. We’ll obviously want the dynamic area fix to be applied to both branches of the kernel, but should the new version be 5.15 or 5.16? (I’m assuming the version number usually only changes when an official release is made – so 5.15 will remain the development version until the official 5.15 release is made, at which point 5.16 becomes the development version. Also IIRC the Cortex branch was created only a short while before the official 5.14 release, so it shouldn’t be particularly difficult to implement any missing changes to bring it in line with the 5.15 development version, thus allowing us to safely change the version number from 5.14 to 5.15)

My basic plan for when RTSupport does detect a bad kernel is to make it use a regular dynamic area, pre-allocated to contain enough space for all the stacks. This is the simplest approach to take for the moment, until we start getting complaints from Iyonix owners who want their 1MB of RAM back!

Changing the RTSupport makefile to use ’-cpu 3’ to work around the horrible mess that is Norcroft’s broken support for packed structs

Rewriting RTSupport’s SVC stack allocation code to not rely on OS_DynamicArea being able to successfully unmap pages. Instead it uses a basic linked list of free & allocated stacks. It also keeps a list of 8 ‘emergency’ stacks for situations where new stacks can’t be allocated (e.g. when the WIMP isn’t running and something like BASIC has all of the application slot claimed). Although SCSISoftUSB will cope with RT_Register failing (or RTSupport not being present at all), I’d be much happier if the module never failed, since it has the potential to be used in many different areas of RISC OS. Spending 64K on a buffer of 8 emergency stacks seems like a good idea.

Fixing RTSupport’s “Die” code so that it copes properly with routines killing themselves while IRQSema is zero. For this I’ve made the code follow the rules I listed a couple of posts above.

And finally the bad news – a while ago I spotted that my “make 1000 files” test was now taking 83 seconds instead of 63 seconds. I can’t think of anything I’ve changed that could make it slower than it was, and I can’t spot anything obviously wrong after a brief look at the FIQprof output. So at the moment I’m just going to ignore it and blame it on my SD card getting worn out, or slightly different format settings used after I had to reformat the card. But on the bright side, I guess it’s still a good 13 times faster than the original code!

I’ve just tried with the BCC check deleted, and it seems to work OK - Does this sound like a safe thing to do?

Well, I finally found a few hours to stare at the code, and I independently came up with basically the same fix as you emailed to me privately, although scheduled slightly better for the common case where there hasn’t been an exception. (I think you may have already refined the fix from what you described above – simply removing the BCC instruction is not ideal.)

Having checked the complete source file, I’m fairly confident this is the only bug of this type relating to LastKnownIRQsema.

I also noticed a couple of other faults just browsing the code – in s.scheduler, the first line of the comment about PreEmptionRecovery is out of date and should be deleted, and in h.scheduler the #define of Pollword_PreEmpted has obviously been copied and pasted from the neighbouring definition and hasn’t been properly updated. Maybe you can roll these in before you next commit the module?

And is SomethingsGoneWrong safe anyway? If we entered the foreground Context without RTSupport knowing, then we could end up killing the foreground without knowing

The address of the thread_t block for the foreground is unlikely to be hanging around to be accidentally passed in as a handle – its scope is limited to the block in which it is declared. So the chances of the (original_Context == index) test in Deregister() succeeding are minimal and you can be pretty certain Die() won’t be called for the foreground.

However, consider the possibility that RTSupport calls a routine which aborts, so the OS flattens the IRQ stack. One of the actions taken rapidly afterwards by the error handler might well be to deregister the aborting routine – but RTSupport’s state hasn’t been updated and it thinks it is still executing the routine, so Deregister() will call Die() erroneously. So it is definitely necessary for Die() to call SomethingsGoneWrong() if it sees that the IRQ stack has been flattened.

Having said that, this scheme isn’t 100% foolproof because it won’t be able to detect the case where the foreground yields in favour of a routine which then aborts, and which is then deregistered! Fortunately, the aim of all this is to help developers when writing new software, in particular to prevent the case where you get aborts triggered from interrupts, which can easily get stuck in an infinite loop and make debugging very difficult. Aborts generated from a foreground process are rather less likely to fall into that category.

RTSupport’s memory allocation strategy is flawed and will need modification (unless we want to disallow deregistration of routines from within IRQ handlers, or from IRQ-triggered RTSupport routines)

Very true. I don’t know how I missed that one before.

It may help to explain things if I say that the CBAI module upon which the RTSupport module is conceptually based doesn’t allow you to do very much at all from within its callback routines – very few SWIs are safe, certainly not things like memory allocation (so you can’t even do things like call OS_AddCallBack). When I realised I couldn’t use CBAI for an STB project I was working on (because it also broke the Internet stack) and I had to write a replacement, I wanted to create something which would make it easier to port media stacks designed for other platforms, hence the allowance for threads, variable prioritisation and dynamic memory allocation. I also expect that adding and removing threads from within other threads is likely to be a requirement of such code, which is why I specified it the way I did. You will see there are attempts to allow for registration and deregistration of routines from any context, even from within interrupt handlers, but I obviously missed the biggie about OS_(Change)DynamicArea not being reentrant.

It probably won’t surprise you to hear that in reality, the only software that has made heavy use of RTSupport to date registers and deregisters its routines exclusively from the foreground and relies on setting a pollword from an interrupt handler to trigger execution of the routine. There’s obviously a balancing act here – registering and deregistering a routine is a fairly heavyweight operation which you don’t want to be doing all that often, but likewise you don’t want many tens of routines to be registered at once, else checking all the pollwords will result in effectively flushing the data cache on every interrupt.

We actually had a similar problem on a smaller scale with routines failing once in a blue moon – it turned out that if a routine needed a stack extension when the RMA was nearly full, it would fail because OS_ChangeDynamicArea refused to function during heap extension. As a result there’s a utility floating around somewhere to ensure there’s always a page or two free in the RMA – much like your ‘emergency’ stacks.

However, there’s one more issue that needs resolving – How should I allow RTSupport to detect versions of the kernel which have dodgy error handling for sparse dynamic areas?

It’s a little regrettable that the version numbers of the HAL and Cortex branch OSes have diverged. I’d like to think that by the time the OMAP port is “production ready” we could try merging them back together properly so there’s one less branch to maintain – in the meantime I suggest at least moving the Cortex branch to 5.15 if only so RTSupport has a chance of figuring out what kernel facilities are available. By all means make your checks on the assumption that you’re on the latest development version of 5.15, since whatever it is that you require should also be present in the release version of 5.15 as and when that becomes finalised. Anyone using an old development version snapshot should be aware that things are liable to change until the formal release.

Some people have suggested an approach whereby even-numbered versions are always stable and odd-numbered versions are under development. Obviously that’s not been true of the Cortex branch so far, but I don’t think it affects the principle above – that if you’re testing for 5.15 then you behave as though the only version of 5.15 in existence is the latest one.

Which brings me on to another thing – it’s been a while now since the last stable release, and assuming your USB mass storage improvements continue to prove reliable, I imagine a lot of more timid users would appreciate the speed boost. As such, perhaps we should be thinking about putting together release candidates and doing some formal testing in preparation for a new formal release?

This is the simplest approach to take for the moment, until we start getting complaints from Iyonix owners who want their 1MB of RAM back!

1MB was still quite a lot of RAM, especially for embedded uses, until fairly recently. There was method in the madness!

Well, I finally found a few hours to stare at the code, and I independently came up with the basically the same fix as you emailed to me privately, although scheduled slightly better for the common case where there hasn’t been an exception.

Oh well, nobody’s perfect ;) I’ll be sure to submit your version of the fix instead of mine.

(I think you may have already refined the fix from what you described above – simply removing the BCC instruction is not ideal.)

It was the behaviour described by the three bullet points that I had in mind – by that point I’d worked out that just removing the BCC would be a bad thing.

Having checked the complete source file, I’m fairly confident this is the only bug of this type relating to LastKnownIRQsema.

Yeah, I did have a quick look at similar bits of code myself and didn’t spot anything that looked like it would suffer from the same problem.

Maybe you can roll these in before you next commit the module?

Sure.

Which brings me on to another thing – it’s been a while now since the last stable release, and assuming your USB mass storage improvements continue to prove reliable, I imagine a lot of more timid users would appreciate the speed boost. As such, perhaps we should be thinking about putting together release candidates and doing some formal testing in preparation for a new formal release?

A formal release could be a good idea, yes. It would certainly be worth it if we could fix any outstanding issues with the !Boot sequence/install disc so that people can safely use the latest !Boot sequence if they wish. (Although this is obviously something that’s best discussed in a separate thread if ROOL decide to go forward with a new release!)

Also there are the long mount times which I still need to look at – at the least I’d want to do a quick investigation and find out what the cause is before we make an official release.

In particular, after a quick check of the components files it looks like the SCSI modules aren’t part of the Iyonix ROM image – so we’d definitely want to get the disc image updated to include the soft-load for those modules (something that I obviously missed when I redid its components file!)