RISCOS Hypervisor

I’d like to gauge how much interest there is for a Hypervisor in the community, and how willing the RISCOS devs are to assist in extending RISCOS where appropriate.

From the experience I’ve gained over the past 15 months coding the JIT and partial VMM/Hypervisor that manages software running under it within ADFFS, I believe it will be possible to code a full type 2 Hypervisor that works from ARMv3 (ARM610) upwards.

On ARMv3 (ARM610) thru ARMv5 (80321) there’s no hardware virtualization, but that’s not really a limiting factor. The CPU mode can be paravirtualized (i.e. all guest code runs in USER mode) to ensure the Hypervisor is running at a higher security level, with separate user/privileged page tables maintained to ensure page security is correct in the appropriate guest CPU mode.

ARMv6 (ARM11) adds the Fast Context Switch Extension (FCSE), which is ideally suited to hosting 32mb machines without the hit of flushing the TLB on every Virtual Machine Manager (VMM) switch. 32mb would be fine for hosting Archimedes class VM’s (A3xx thru A5xxx), but may be a limiting factor for A7000/RiscPC VM’s. However, as the limit on application space is ~27.5mb, there’s nothing to stop you firing up multiple VM’s to run every app on independent machines. Unfortunately ARM have changed the implementation several times and there’s no backward compatibility, so it may be simpler to take the hit of flushing the TLB and not use it.
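For reference, the FCSE relocation works as follows: virtual addresses below 32mb are relocated by the 7-bit process ID, so each VM sees addresses 0–32mb while the TLB holds distinct entries per PID. A sketch in C (illustrative only – the real mapping is done in hardware via CP15):

```c
#include <stdint.h>

/* FCSE "modified virtual address" calculation: addresses below 32MB
 * are relocated into the window selected by the 7-bit process ID;
 * addresses at or above 32MB pass through unchanged. */
static uint32_t fcse_mva(uint32_t va, uint32_t pid)
{
    if (va < 0x02000000u)            /* below the 32MB boundary? */
        return va | (pid << 25);     /* relocate into PID's 32MB window */
    return va;                       /* >= 32MB: untranslated */
}
```

So a guest at PID 3 accessing &8000 actually hits 3 × 32mb + &8000, which is why no TLB flush is needed when switching between sub-32mb VM’s.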

ARMv7-A adds the Virtualization Extensions (alongside the Large Physical Address Extension), which are getting closer to Intel/AMD virtualization extensions, but aren’t quite where they need to be, so are probably not worth pursuing. I can also see two issues with them around getting the CPU into HYP mode initially (which has to be done at reset) and the way they’ve implemented hardware vectors, which are offered to the guest before the Hypervisor.

I’m proposing to code a Hypervisor that will run on ARMv3 and sits between RISCOS5 and the VMM’s. It will take over the hardware vectors, steer IRQ’s to the host OS and, where appropriate, trigger guest IRQ’s (eg VSync/Timers etc). This is already implemented in the upcoming release of ADFFS and just needs extending to trigger appropriate Mouse/Keyboard IRQ’s on the Guest when it has focus.

On top of the Hypervisor will sit various selectable VMM’s allowing you to host ARM3/IOC up to StrongARM/IOMD. These will provide both the virtual hardware and Hypercall code to resolve incompatible CPU instructions.

Although ADFFS has a fast performing JIT (50%+ host speed), it’s done with codelets. These have the benefit that all code ends up running natively on the host CPU and the JIT eventually stops processing instructions, but it does mean maintaining a heap of codelets and removing them as the original instructions are overwritten.

I’m proposing to start from scratch for the Hypervisor and use Hypercalls instead of codelets for sensitive instructions and either emulate them or use hardcoded code blocks where appropriate inside the Hypervisor. MOVS PC, R14 for example could have a fixed Hypercall code block, but LDR R0, [PC, #24] may be emulated.
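To make concrete what the hypercall for MOVS PC, R14 has to reproduce: in 26-bit mode R14 holds the combined PC+PSR word, so the return both branches and restores the condition flags. A sketch in C of the USER-mode case, where only NZCV is restored (illustrative only – the struct and names are mine, not the proposed implementation):

```c
#include <stdint.h>

#define PC26_MASK  0x03FFFFFCu  /* word-aligned PC, bits 2-25 of R15 */
#define NZCV_MASK  0xF0000000u  /* N,Z,C,V condition flags, bits 28-31 */

typedef struct { uint32_t pc; uint32_t flags; } guest_state;

/* Emulate "MOVS PC, R14" for a 26-bit guest running in USER mode:
 * split the combined PC+PSR word in R14 into the branch target and
 * the restored flags (I/F and mode are unchanged from USER mode). */
static void emulate_movs_pc_r14(guest_state *g, uint32_t r14)
{
    g->pc    = r14 & PC26_MASK;
    g->flags = r14 & NZCV_MASK;
}
```

A privileged-mode return would additionally restore the I/F and mode bits (bits 26–27 and 0–1), which is why the hypercall block has to know which guest mode it’s servicing.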

ADFFS already handles self-modifying code and I’m proposing to take that across as is; it’s done through judicious use of memory access protection and an Abort handler that either proxies memory writes or cleans the cache as appropriate. The downside is it has to split the code and data into separate memory areas; however, switching to Hypercalls means the code doesn’t have to sit within the 1st 32mb, so can run outside of the VM’s memory.

Performance is currently an unknown; the biggest constraint on what ADFFS can achieve is the hit of flushing the TLB when switching between VM’s and out to the host. This could be mitigated to a certain degree by allowing a VM to take priority where appropriate. If you’re playing a game full screen, for example, you’re probably not worried about other VM’s running in the background, so the only context switching is going to be when passing IRQ’s back to the host OS.

Initially I’ll start with an A440/1 VMM and with community support build other VMM’s for RiscPC / Iyonix etc. I’m not proposing to go beyond an A440/1 myself as emulating a RiscPC is a fair chunk of work, but the Hypervisor will be designed to allow other VMM’s to be loaded as Modules, allowing it to be easily extended and future proofed. It will be the responsibility of the module to provide both the virtual hardware and CPU instruction interpretation into Hypercalls with code appropriate for ARMv7.

There are however some outstanding questions that need to be resolved:

1. Timers. If HAL_Timers is to be used, an appropriate API would need adding to RISCOS to share timers. ADFFS currently works around this limitation by using the blitter to trigger T1; that’s fine for games that palette swap, but means timer IRQ’s don’t fire outside of the VSync triggered blit. Assuming we’re not worried about Scene demos etc that probably use more than just T1: T2 may not be required as it’s providing the serial port; T3 provides the keyboard, so could trigger on events only; leaving T1 to be handled by the blitter and T0 triggered via RTSupport. This will work provided T1 is only used for palette swapping.

2. Memory switching. RISCOS doesn’t currently have an appropriate legal means to switch memory outside of appspace in the way required by a Hypervisor. The Hypervisor will need to switch out page zero and mirror the screen memory to match MEMC. It would be helpful if RISCOS provided a means to do this on either a task switch (i.e. the Task registers the pages it wants swapped in/out in addition to appspace), or an API that allows memory to be mapped in/out without the limitations inherent in OS_Memory. The ability to forcibly switch tasks also needs to be made public, so VM’s can run as separate tasks.

3. Page Zero (implemented). Moving vectors high would be advantageous, allowing the guest and host complete separation. This could be worked around by swapping the page in/out on context switch and preventing the guest from writing to them. The guest’s hardware vectors would then point to the Hypervisor. The drawback of this method is that reads from page zero then have to be trapped so the Hypervisor can return the values the guests believe are at the hardware vectors. This is how the next release of ADFFS deals with it, which seems okay for most games (there are a few that trigger 40k+ Aborts/sec), but it’s not ideal and I’m not sure what the performance impact would be on Arthur thru RISCOS 4.x running under the VM – possibly quite substantial, just to catch the odd read from the hardware vectors.
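A minimal sketch of that trap-and-substitute path, handling only the simple immediate-offset LDR encoding (all names here are my own invention, not ADFFS code):

```c
#include <stdint.h>

/* Shadow copy of what the guest believes occupies 0..&4000. */
static uint32_t shadow_page_zero[0x4000 / 4];

/* On an Abort caused by a guest read of page zero, decode the faulting
 * instruction; if it's "LDR Rd, [Rn, #imm]" (P=1, U=1, W=0), fetch the
 * value from the shadow page and complete the load ourselves.
 * Returns 1 if the instruction was emulated, 0 if not handled. */
static int emulate_page_zero_ldr(uint32_t instr, uint32_t *regs)
{
    if ((instr & 0x0FF00000u) != 0x05900000u)  /* not LDR Rd,[Rn,#imm] */
        return 0;
    uint32_t rn   = (instr >> 16) & 0xF;
    uint32_t rd   = (instr >> 12) & 0xF;
    uint32_t imm  = instr & 0xFFFu;
    uint32_t addr = regs[rn] + imm;
    if (addr >= 0x4000u)                       /* outside page zero */
        return 0;
    regs[rd] = shadow_page_zero[addr / 4];     /* substitute guest value */
    return 1;
}
```

A real handler obviously has to cover the other addressing modes (and LDRB, LDM etc), which is part of why 40k+ Aborts/sec hurts so much.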

4. Cache flushing (implemented). Issuing SWI OS_SynchroniseCodeAreas isn’t the most optimal way to flush the cache when it’s happening tens of thousands of times a second. A means to get the OS_SynchroniseCodeAreas entry point directly and bypass the SWI handler would improve performance.

MOVS PC, R14 for example could have a fixed Hypercall code block, but LDR R0, [PC, #24] may be emulated.

Another problem, which affects a recent discussion regarding the supposed relative ease of porting 26 bit software to 32 bit is when non-obvious things are done to registers. There was one program that went a little like this:

...function entry
...do stuff...
push bit 28 in R9
...do other stuff...
conditionally or together R8 and R14, result in R14
...do even more stuff...
MOVS PC, R14

Making this 32 bit by getting rid of the MOVS meant the program then failed in two ways; the most notable was that PC was some bogus value, so the thing crashed.
I had to go back through the code replacing the MOV PC,R14 with OS_Exit to see which one was causing the problem (as DDT is a tad unreliable for stuff like this on the Pi, unfortunately). When I narrowed it down to app-bomb at normal crash point, I then had to walk the code backwards to spot that flags were being set in R9 and pushed into R14. I couldn’t just NOP out that instruction as something later on checked the flag to determine how to behave. This required more instructions. Thankfully a screenful or so down there was a NOP so I was able to manually shift everything down one (and patch up branches in the entire program) so I had space to insert an MSR command. Otherwise it would have required automated disassembly and then being reassembled; but this is itself a bit of a challenge at times…
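For anyone following along: bit 28 of the 26-bit combined PC+PSR word is the V flag, which is why the flags ORRed into R14 survived the MOVS return but vanish under a plain MOV PC,R14. The inserted MSR has to transfer those bits into the CPSR instead. A sketch in C of the flag transfer an "MSR CPSR_f, R14" performs (function name is mine):

```c
#include <stdint.h>

/* MSR CPSR_f, R14 equivalent: only the flags field (N,Z,C,V in bits
 * 28-31) of the CPSR is updated; mode, I/F and everything else in
 * bits 0-27 is left alone. */
static uint32_t msr_cpsr_f(uint32_t cpsr, uint32_t r14)
{
    return (cpsr & 0x0FFFFFFFu) | (r14 & 0xF0000000u);
}
```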

1. Timers. If HAL_Timers is to be used, an appropriate API would need adding to RISCOS to share timers.

Very much so. I think we need not only a standardised high resolution timer API, but also something akin to CallBack that can work in millisecond units, not centisecond ones.

One area that should be scrutinized is the SharedTimer interface, which creates a common usage one-shot timer. This code has caused problems in every port the Crank Software Embedded Development Consulting Services team has ever been involved in.

Assuming we’re not worried about Scene demos etc that probably use more than just T1: T2 may not be required as it’s providing the serial port; T3 provides the keyboard, so could trigger on events only,

If by “T” you mean timer: are you looking at it from a host or an emulated point of view? I think most hosts these days have fewer available timers (the Pi has four, but it seems only two are available to the OS and RISC OS itself uses one…).

2. Memory switching. RISCOS doesn’t currently have an appropriate legal means to switch memory outside of appspace in the way required by a Hypervisor.

This is something that may need to be investigated along the way as well. For example, my suggestion for cheap’n’cheerful PMT absolutely requires the Wimp to do this natively (not a third party hack like Wimp2) because that appears to be about the only place where it is possible to associate tasks and task switching. As far as I can determine, none of this mechanism has been documented anywhere, so it is fun trying to figure out WTF is going on. I think it is the AMB stuff in the kernel that does the actual memory fudging (under direction of the Wimp) but I have only had a cursory examination of the code.

Another problem, which affects a recent discussion regarding the supposed relative ease of porting 26 bit software to 32 bit is when non-obvious things are done to registers.

Thankfully, the JIT in ADFFS and the new one I’m proposing remove the need to manually patch 26bit apps. You’re right though, sometimes it does take a few passes to figure out exactly what code is doing when it’s manipulating the PSR.

If by “T” you mean timer

Sorry, I should have made myself a bit clearer. I was referring to triggering the 4 IOC Timers on a guest VM emulating an IOC based chipset – nothing to do with the host OS.

Realistically, taking the Pi as an example, there’s only one free timer, so without some API for sharing it, it’s not much use. I can write my own internal API to share the timer, but that will only work provided there are no HAL_Timer subscribers on the host OS. Perhaps RTSupport can be extended to support higher resolutions; the framework is already there, but it would need modifying to use a HAL_Timer instead of TickerV (I think it’s currently based on TickerV) and to deal with overlapping and closely triggered events.

my suggestion for cheap’n’cheerful PMT absolutely requires the Wimp to do this natively

It was your PMT suggestion that got me thinking in more detail about a Hypervisor, as the requirements are much the same. My thought was along the lines of running every app in its own VM and paravirtualizing the Wimp SWI’s back to the host OS to seamlessly integrate them into the host. Akin to XP Mode on Windows, if you like.

I’ll implement Wimp support in ADFFS along those lines, with ADFFS being the Wimp task and it isolating the actual tasks running under the JIT so they’re hypervised. At this point ADFFS could break into the task at an appropriate point and provide PMT outside of the guest app calling Wimp_Poll.

I’ve not looked at Wimp2, but from the discussions so far it sounds like it needs perfectly behaved apps for it to work, so isn’t ideal. ADFFS is far more integrated into the host as it takes over all hardware vectors whilst the JIT is running; in this context I believe all of the issues mentioned with Wimp2 can be avoided.

ADFFS has more technical challenges than a Hypervised VM though, as it has to make the host OS think it’s seeing a RO5 app and the app think it’s running on a RO3 OS/machine. With a VMM it’s running the original OS, so compatibility issues are no longer a concern from the host’s perspective. The Wimp Module could then simply be replaced with a shim that passes everything back to the host OS via Hypercalls.

Initially I’ll start with an A440/1 VMM and with community support build other VMM’s for RiscPC / Iyonix etc.

Damned… RISC PC emulation is really needed :)

Could I suggest something that’s less work than a complete RPC emulation, but more useful for modern uses? A QEMU-like emulation, for Linux, NetBSD and possibly even RISC OS 5. It could also be a Pi emulation.

That would solve all our problems of browsers (by running Linux), security (by isolation), etc. Of course, 32 MB is not a lot, but for containers, it could be enough. To get more memory, hardware virtualization is an option too, on Pi2. Or secure mode on Cortex?

I’ll implement Wimp support in ADFFS along those lines, with ADFFS being the Wimp task and it isolating the actual tasks running under the JIT so they’re hypervised. At this point ADFFS could break into the task at an appropriate point and provide PMT outside of the guest app calling Wimp_Poll.

Could solve many problems with a ROS5 > ROS5 translation (just to add PMT).
Not sure I’m clear :)

I hear you, but there aren’t enough hours in the day; I’ve already put in around 400 hours on ADFFS in the past month, and 17 hours today alone.

I’ve been slowly adding RiscPC support to ADFFS over the past year, the VIDC translation for example was all internally done as VIDC20 from the outset so the blitter is already emulating a VIDC20. In the upcoming release of ADFFS, I’ve started adding extensions to get RiscPC games to work but it’s still nowhere near where a VMM would need to be.

I still need to look at adding ARMv4 extensions to the JIT, not to mention IOMD emulation. The latter is the time-consuming part, as the documentation is patchy at best.

For a VMM it would need accurate MMU and cache emulation, which hasn’t been done to date on any emulator. Red Squirrel is the closest so far, but is closed source so not much help. It would probably be a case of a lot of trial and error to get a working StrongARM.

Could solve many problems with a ROS5 > ROS5 translation (just to add PMT).

I did consider RO5 > RO5, with a RiscPC/StrongARM VMM it would certainly work.

Actually, FCSE was introduced in ARMv5. In ARMv6 it’s deprecated, and in ARMv7 support for it is optional.

ARMv7 adds a much better way of doing things – dual translation table base registers. You can configure the system so that all virtual addresses below N use TTBR0 while all addresses above N use TTBR1 (where N is a power of two between 32MB and 2GB inclusive). Using the short descriptor format (i.e. not using LPAE) TTBR1 is designed to be used for global memory while TTBR0 is designed for process-specific memory, using an 8 bit address space identifier (ASID) to identify the process. I’ve long had a goal of using this to replace the use of page table manipulation during Wimp task swapping, as it reduces the page table manipulation required for task swapping to just a couple of CP15 register writes and some sync instructions.
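The selection rule can be sketched as follows (illustrative C for the architectural behaviour, not OS code; TTBCR.N sets the split, with the TTBR0 region covering addresses below 2^(32-N)):

```c
#include <stdint.h>

/* Short-descriptor TTBR selection: with TTBCR.N = n (1..7), virtual
 * addresses below 2^(32-n) translate via TTBR0 (process-specific,
 * tagged with the ASID) and everything above via TTBR1 (global).
 * n = 0 disables the split and everything uses TTBR0. */
static int uses_ttbr0(uint32_t va, unsigned n)
{
    if (n == 0)
        return 1;                    /* TTBR1 not in use */
    return va < (1u << (32 - n));    /* below the split boundary? */
}
```

With N=7 the TTBR0 window is the minimum 32MB – conveniently the same size as a FCSE process slot, but with a proper ASID instead of address relocation.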

If we combined it with a version of the OS which has high processor vectors enabled (and if we also relocated/removed scratch space), and were to implement support for sparse application slots then it would allow us to give each Wimp task complete freedom in how it maps and manages its memory.

Incidentally, if anyone feels like modifying FPEmulator to not require a word of workspace in zero page then that would be appreciated – at the moment that’s the only module, apart from the kernel itself, which needs to know at compile time whether zero page is high or low. Also feel free to start lobbying for ROOL to host ROM downloads with high processor vectors enabled (or to make high processor vectors the default), the OS has supported it for over 3 years now but probably hasn’t had much exposure to all the nasty third-party apps which are full of null pointer dereferences or which deliberately access zero page locations. If I was to start a new hardware port I’d definitely make it so that high processor vectors were enabled by default, forcing programmers to fix their badly-behaved software if they want it to work, but so far it seems that everyone else is jumping on all the new hardware before I get chance!

Actually, FCSE was introduced in ARMv5. In ARMv6 it’s deprecated, and in ARMv7 support for it is optional.

I did struggle to establish exactly when it was introduced. ARM should really improve their documentation!

The immediate deprecation of it just highlights ARM’s somewhat scattergun approach to virtualization. ARMv7’s way of doing it, which on the face of it does look like a workable solution, is badly let down by their implementation of the hardware vectors. As far as I can establish, they’re passed to the guest VM before the Hypervisor – which is a major mistake. I believe they’ve learned the error of their ways and are correcting it in ARMv8.1?

I’ve gleaned all this from reading numerous attempts to code VMM’s on ARM from the likes of VMware etc, so don’t quote me on this. The ARM detail is hard to find via ARM’s site.

As with cache flushing, virtualization is now an utter mess on ARM and somewhat in flux, so we have to fall back to worst case scenario…hence my plan to code all this for ARMv4 and ignore ARMv5+ extensions. At least until they settle on a final design and we’re on the Pi3..4..5 etc. We could use FCSE etc where appropriate to reduce the hit on the TLB but just need to be mindful of the limitations, which are pretty big in some cases….32mb limits being an example.

Incidentally, if anyone feels like modifying FPEmulator to not require a word of workspace in zero page then that would be appreciated

I can probably do this…dip my toe into RISCOS development so to speak. Might need some hand holding to figure out how to use the tools/compiler though.
Where would it go? In the RMA? Why is it in page zero at the minute – doesn’t make any sense to me.

Also feel free to start lobbying for ROOL to host ROM downloads with high processor vectors enabled

Who controls this kind of decision? Why do we not have a regular steering committee to cover these sorts of decisions? Even meeting once every six months would cover most of these big ticket changes.

OS has supported it for over 3 years now but probably hasn’t had much exposure to all the nasty third-party apps which are full of null pointer dereferences or which deliberately access zero page locations.

From the work I’ve done in the next ADFFS release to handle LDR’s in page zero, this is a big can of worms. If you look at this post on the JASPP site, I’ve detailed the games (out of most of the supported ones on the Pi) that read from page zero inadvertently due to bugs.

I opted to fix the bugs, but in reality we’d need to handle them on-the-fly. My plan here requires vectors to go high so we get Aborts, I even considered switching ADFFS to require high vectors on the Pi and forcibly move them whilst it’s running. With vectors high, ADFFS’ JIT would see a massive improvement in speed as it wouldn’t need to emulate LDR’s on 1st pass to establish if they were reading from page zero. Writes are dealt with via an Abort as I restrict the pages whilst ADFFS’ JIT is running – all very clean and simple to fix up in real-time.

I can code an Abort handler for this if you like, to allow vectors to go high and proxy any read/writes to page zero. We’d have to write off 0…4000 as unusable space though – which I think is probably the intention anyhow. What’s your opinion on this?

If I was to start a new hardware port I’d definitely make it so that high processor vectors were enabled by default, forcing programmers to fix their badly-behaved software if they want it to work

From what I’ve seen to date in legacy games (ref. link above), the bulk of the issues are caused by:

Acorn’s C compiler using pointers before they’re set up. It seems to set some up after 1st use in some situations. I suspect this was fixed in later C revisions, but it was certainly prevalent back in 87..93

A possible bug in SoundDMA. I’ve seen quite a few Voices that are getting NULL pointers from the SCCB the 1st time they’re called. Acorn did change the way Voices were initialised, so it may or may not be related to the change. It’s certainly come up often enough for me to consider writing initial values to the SCCB to avoid having to fix the bugs

so far it seems that everyone else is jumping on all the new hardware before I get chance!

Yes, I’ve noted this as well. Were you involved in the Pi2 process? Was there any kind of peer review of the build before it was publicly released? The way it magically appeared and was then quickly fixed to resolve some minor issues certainly raised some questions about the current process.

That’s not to denigrate any of the work the somewhat elusive elite team of developers do to bring these builds to release, but it doesn’t seem like a very “Open” process. Having said that, I can understand there may be commercial secrecy around the process – the Pi2 was certainly sprung on the world out of the blue. Obviously a decision was made between Eben and ROOL to bring a RISCOS build out for day 1, which I wouldn’t expect to be made public. It’s a tricky one, I grant you.

Where would it go? In the RMA? Why is it in page zero at the minute – doesn’t make any sense to me.

FPEmulator’s undefined instruction handler needs to be able to find FPEmulator’s workspace pointer. Because of the way the OS-agnostic core code (which was written by ARM, AIUI) operates I don’t think there’s a spare register which can be used to hold the value – instead it uses the AdrWS macro to look it up on-demand. Presumably Acorn just stuck it in zero page because that was the easiest solution, or maybe they were still in BBC mode, thinking that statically allocated workspace is a good thing.

The solution I was thinking of was to move the initial undef entry point into the RMA, using a PC-relative LDR to get the workspace pointer (or maybe an ADR if the entry point lives in the workspace itself). Then work out some way of plumbing the value through the core code so that all of the places which need it can still access it (making r12 or some other register the workspace pointer would be the obvious solution). It looked like it was possible last time I looked at the code, but I hadn’t actually tried doing it yet, so I clearly still have some reservations ;-)

Also feel free to start lobbying for ROOL to host ROM downloads with high processor vectors enabled

Who controls this kind of decision? Why do we not have a regular steering committee to cover these sorts of decisions? Even meeting once every six months would cover most of these big ticket changes.

I suspect you’ll need ROOL to answer those questions!

I can code an Abort handler for this if you like, to allow vectors to go high and proxy any read/writes to page zero. We’d have to write off 0…4000 as unusable space though – which I think is probably the intention anyhow. What’s your opinion on this?

Yeah, some kind of abort handler to provide compatibility for old/buggy software would be good. It’s something I was planning on doing myself but evidently never got around to. And making the first 16K of memory, and eventually the first 32K, completely unmapped by default would be the eventual goal of the changes.

With zero page relocation enabled, the first 16k of workspace and the processor vectors get moved to &ffff0000. However the low 16K of address space isn’t completely empty – there’s one page (at &1000 IIRC) for the Debugger to use as its workspace. The reason for this is that the Debugger wants to be able to use “MOV PC,#xxx” in order to jump from any breakpoints into its code. We’d probably want to fix that by changing it to use the BKPT instruction (upside: No more static workspace needed, downside: Corrupts some registers in ABT mode)

Also note that rather than have your abort handler assume the location of the relocated workspace values, you’d want to look up the locations using OS_ReadSysInfo 6. At the moment that SWI only lists the values that are used internally by RISC OS, so if there are other values which have leaked out over time then we’d probably need to extend it to expose those as well.

so far it seems that everyone else is jumping on all the new hardware before I get chance!

Yes, I’ve noted this as well.

Well, I’m not really complaining. There are still plenty of things left to do before we come close to using the full potential of a BeagleBoard or a Pi 1, let alone all the multi-core machines that have come after them.

Were you involved in the Pi2 process?

Nope. The hardware + software release was as much of a surprise to me as it was to (almost) everyone else here.

FPEmulator’s undefined instruction handler needs to be able to find FPEmulator’s workspace pointer.

Ah. I avoided that issue by not using one in my Undefined handler ;)

The solution I was thinking of was to move the initial undef entry point into the RMA, using a PC-relative LDR to get the workspace pointer (or maybe an ADR if the entry point lives in the workspace itself). Then work out some way of plumbing the value through the core code so that all of the places which need it can still access it

Is the core code not in the Module then?

Yeah, some kind of abort handler to provide compatibility for old/buggy software would be good. It’s something I was planning on doing myself but evidently never got around to. And making the first 16K of memory, and eventually the first 32K, completely unmapped by default would be the eventual goal of the changes.

I’d need to code it for ADFFS if we’re going to put vectors high, so can spin it out into a dedicated Module easily enough.

With zero page relocation enabled, the first 16k of workspace and the processor vectors get moved to &ffff0000. However the low 16K of address space isn’t completely empty – there’s one page (at &1000 IIRC) for the Debugger to use as its workspace. The reason for this is that the Debugger wants to be able to use “MOV PC,#xxx” in order to jump from any breakpoints into its code. We’d probably want to fix that by changing it to use the BKPT instruction (upside: No more static workspace needed, downside: Corrupts some registers in ABT mode)

This would need resolving; we’d need page zero completely clean for any fix-up Module to work. I can look at rewriting the Debugger so it doesn’t have that requirement.

The solution I was thinking of was to move the initial undef entry point into the RMA, using a PC-relative LDR to get the workspace pointer (or maybe an ADR if the entry point lives in the workspace itself). Then work out some way of plumbing the value through the core code so that all of the places which need it can still access it

Is the core code not in the Module then?

Yes. The undef entry point would be in the RMA (so it can find the workspace pointer), then it would call into the core code held in the module.

From a quick look at the Debug exceptions page it looks like ABT_r14 is the only register altered? I’d expect that though, is there more going on?

I think that page may be talking specifically about when CPU debugging is enabled (e.g. via JTAG). If the system is running normally then it looks like BKPT operates by generating a prefetch abort (I’m not overly familiar with the behaviour myself – I vaguely remember having a couple of issues when I tried to use BKPT for JTAG debugging once)

I’m guessing the IFSR will show the abort cause as being an ‘instruction debug abort’, so I’d expect to see the following registers changed:

* IFSR
* SPSR_abt will be set to the CPSR (and CPSR set to ABT mode, obviously)
* R14_abt

It might be desirable to make some changes to the prefetch abort handler in the kernel so that it can cleanly pass the abort onto the debugger module, otherwise it will go straight to the prefetch abort environment handler, which will most likely go straight to triggering a crash dump from the C runtime or whatever.

Let’s say you have a branch instruction at &8008. If your program is entered by a branch to &8000 then the CPU might actually start off by prefetching &8000-&8020 into the pipeline. Until the branch instruction at &8008 reaches the execute stage of the pipeline, the CPU might not know whether the instructions at &800C-&801C are actually needed (especially if it’s a conditional branch). If the branch is taken, the CPU will flush the instructions from &800C-&801C out of the pipeline and start fetching instructions from the new location (if it hasn’t already started fetching them). If the branch isn’t taken, those instructions will remain in the pipeline, and if &800C happened to be a breakpoint then it will now find itself in the execute stage of the pipeline and the prefetch abort will be triggered.

So what the description is saying is that although the BKPT instruction causes a prefetch abort, the prefetch abort only occurs when the instruction has made it far enough through the pipeline that the CPU will try to execute it. This is the same as with the most common type of prefetch abort, where you’re trying to execute unmapped memory or memory for which you don’t have read permissions – the abort only occurs when you try executing the location, not when the CPU first tries accessing the memory and fails.

So what the description is saying is that although the BKPT instruction causes a prefetch abort, the prefetch abort only occurs when the instruction has made it far enough through the pipeline that the CPU will try to execute it.

Thanks for the lucid explanation. That makes perfect sense. Why couldn’t ARM have just said it like that?

I’m guessing the IFSR will show the abort cause as being an ‘instruction debug abort’, so I’d expect to see the following registers changed:

* IFSR
* SPSR_abt will be set to the CPSR (and CPSR set to ABT mode, obviously)
* R14_abt

Normal Abort behaviour then, that’s fine – we’re not expecting Debugger to handle Aborts in Abort mode, are we? It’s aimed at general programmer debugging for USER/SVC/IRQ? I did implement re-entrant Aborts in my Abort handler, but it was a bit messy and I did eventually disable it.

It might be desirable to make some changes to the prefetch abort handler in the kernel so that it can cleanly pass the abort onto the debugger module, otherwise it will go straight to the prefetch abort environment handler, which will most likely go straight to triggering a crash dump from the C runtime or whatever.

We can either have a permanent hook in the Prefetch Abort Handler to pass to the Debugger Module – or have the Debugger Module independent and add itself to the Prefetch Abort Handler as required. Considering the amount of use it’s likely to get, the latter may be a better option as it keeps Debugger self-contained – the obvious caveat being that there’s currently no method to cleanly insert/remove1 from the hardware vectors if they’ve since been taken over by someone else.

I’m thinking keep Debugger self-contained.

Where’s the FPEmulator source? I can’t seem to find it in the CVS tarball – it’s not obvious, at any rate.

1 Can’t we fix that problem with a jump table that acts as a middle-man, and have it pass the jump table address instead of the actual handler address when handing out handler addresses to new handlers? You just shuffle the table entries up/down to insert/remove handlers then.

We can either have a permanent hook in the Prefetch Abort Handler to pass to the Debugger Module – or have the Debugger Module independent and add itself to the Prefetch Abort Handler as required.

The way I see it, there are three approaches:

Make the debugger register a prefetch abort environment handler. This will make it hard to set a breakpoint and then start an application, because the application may replace the abort handler with its own (although I guess most applications wouldn’t bother and would just rely on the abort being promoted to an error)

Make the debugger hook into the hardware vector via OS_ClaimProcessorVector. This will slow down lazy task swapping, although that could be mitigated by only claiming the vector when a breakpoint is set (but then you’d have an increased chance of hitting the vector claim/release ordering issues)

Come up with a new way of handling data/prefetch aborts. This is the technique I was using for the unaligned load fixup module – the kernel would build an abort descriptor record containing the main registers, DFSR/IFSR, etc. and then call a vector. The return code from the vector claimants would indicate whether the instruction should be restarted, skipped, or whether an error should be raised. Since the kernel would only read the fault status registers the once, it would solve lots of issues with the registers being changed by recursive aborts. In fact this approach may get resurrected at some point if/when we implement support for ROL-style abortable DAs (which we’ll definitely want at some point in order to support cacheable screen memory, or things like tracking screen writes in order to implement software translation layers like BPP conversion, scaling, rotation, etc. Or even stuff like ADFFS tracking reads/writes to its emulated memory.)
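For illustration, the abort descriptor approach might look something like this in C – a minimal sketch with invented field and enum names, since the API hasn’t been defined yet. The key property is that the kernel snapshots the fault state once, so recursive aborts can’t change what claimants see:

```c
#include <stdint.h>

/* Hypothetical abort descriptor built by the kernel before calling the
 * vector; names are illustrative, not a defined RISC OS API. */
typedef struct {
    uint32_t r[16];   /* R0-R15 at the point of the abort */
    uint32_t spsr;    /* CPSR of the aborted mode */
    uint32_t fsr;     /* DFSR for data aborts, IFSR for prefetch aborts */
    uint32_t far;     /* faulting address, where the CPU provides one */
    uint32_t flags;   /* e.g. bit 0 set = data abort, clear = prefetch */
} abort_desc_t;

/* Return codes a claimant hands back to the kernel. */
enum abort_action {
    ABORT_RESTART,  /* re-execute the aborting instruction */
    ABORT_SKIP,     /* step past it and continue */
    ABORT_ERROR,    /* raise an error via the environment handler */
    ABORT_PASS_ON   /* not mine: offer to the next claimant */
};

typedef enum abort_action (*abort_claimant_t)(abort_desc_t *desc);

/* Kernel side: walk the chain until somebody claims the abort. */
static enum abort_action deliver_abort(abort_desc_t *desc,
                                       abort_claimant_t *chain, int n)
{
    for (int i = 0; i < n; i++) {
        enum abort_action a = chain[i](desc);
        if (a != ABORT_PASS_ON)
            return a;
    }
    return ABORT_ERROR;  /* unclaimed: fall back to the default path */
}

/* Example claimant: a sparse-DA-style handler that would map the page
 * in and ask for the instruction to be restarted. Range is made up. */
static enum abort_action sparse_da_claimant(abort_desc_t *desc)
{
    if (desc->far >= 0x10000000u && desc->far < 0x20000000u)
        return ABORT_RESTART;
    return ABORT_PASS_ON;
}
```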

The third option is obviously the one I’m leaning towards, although since one of the main users would be the abortable DAs we’d probably want to wait until we’re ready to implement that before we try changing the abort handling.

Where’s the FPEmulator source? I can’t seem to find it in the CVS tarball – it’s not obvious as any rate.

Make the debugger hook into the hardware vector via OS_ClaimProcessorVector.

This was my thinking, wasn’t aware of the third option although can see the advantages. Sounds like we’ll have to go with the hardware vector approach initially and revisit at a later date once your abort descriptor method is in place.

Might be worth starting a separate thread on this if you’ve not already defined the API.

From what I’ve seen to date in legacy games (ref. link above), the bulk of the issues are caused by:

# Acorn’s C compiler using pointers before they’re set up. It seems to set some up after 1st use in some situations. I suspect this was fixed in later C revisions, but it was certainly prevalent back in 87..93
# A possible bug in SoundDMA. I’ve seen quite a few Voices that are getting NULL pointers from the SCCB the 1st time they’re called. Acorn did change the way Voices were initialised, so it may or may not be related to the change. It’s certainly cropped up often enough for me to consider writing initial values to the SCCB to avoid having to fix the bugs

I’ve been looking into issue 2, more specifically how it’s causing Conqueror and Pac-mania to read from page zero. I’ve tracked the problem down to the GateOn entry not always being called when the sound is first used, and as a consequence the Voices are not initialising the SCCB with their working variables.

Unless I’m misunderstanding how it should work, I believe GateOn should always get called once when the sound is first used and again whenever the envelope changes (which may not have been implemented in RISCOS?)

The problem doesn’t occur on Arthur, but does from RISCOS 2 onward, so it’s an issue that’s been around a while.

The Level 1 instantiate code doesn’t appear to set the GateOn flag when the sound is instantiated, but instead initialises the flags to ForceFlush, so instantiating doesn’t force a call to GateOn.

The only place I can see that sets the GateOn flag is in the SoundControl code: it checks R1 for bit 7 and sets the GateOn flag if it’s clear. Further up the code, R1 appears to be “emulation of amp/(env !)” or the amplitude, depending on the result of R4 here

There’s also a comment here suggesting GateOn is only set if R1 is &101-&17F, so it may be possible to make a sound without GateOn being called if R1 is &181-&1FF.
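If I’m reading it right, the condition boils down to a single bit test on R1. A sketch in C of what the SoundControl path appears to do – flag names and values are mine, not the actual SoundChannels source:

```c
#include <stdint.h>

/* The flags start as ForceFlush, and GateOn is only added when bit 7 of
 * R1 is clear – so values &181-&1FF never trigger GateOn. */
#define FLAG_FORCEFLUSH (1u << 0)
#define FLAG_GATEON     (1u << 1)

static uint32_t sound_control_flags(uint32_t r1)
{
    uint32_t flags = FLAG_FORCEFLUSH;
    if ((r1 & 0x80u) == 0)   /* bit 7 clear: &101-&17F */
        flags |= FLAG_GATEON;
    return flags;
}
```

Which would explain why a Voice whose Fill entry relies on GateOn having initialised the SCCB falls over when a sound starts with bit 7 of R1 set.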

Most game Voice handlers that use working variables in the SCCB suffer from this issue, which makes me believe that either the documentation wasn’t clear at the time – or the behaviour changed in RISCOS 2 such that it’s possible for the Fill entry to be called before the GateOn entry.

I’m not sure how this should be fixed though; possibly by changing the instantiation code to force GateOn as well as ForceFlush? GateOn should really be called when a sound is used for the first time, regardless of the amplitude – if I’m understanding the purpose of GateOn correctly.

The PRM statement for GateOn says:

The GateOn entry is used whenever a sound command is issued that requires a new
envelope. Normally any previous synthesis is aborted and the algorithm restarted.

I’ve no documentation from the Arthur days to see if that contained more detail on it. I vaguely remember speaking to Acorn directly back in 1987 to find out how to code Voice handlers, but I can’t find the documentation, it’s probably long been lost.

Come up with a new way of handling data/prefetch aborts. This is the technique I was using for the unaligned load fixup module – the kernel would build an abort descriptor record containing the main registers, DFSR/IFSR, etc. and then call a vector.

Seconded. I really think that any other approach is asking for trouble, given the number of system extensions/features that need to handle aborts. There are now a large number of instructions (including coprocessors, remember) that may access memory, and duplicating the decode (quite possibly incompletely/incorrectly) in each of these extensions seems ridiculous. I think the interface between the kernel’s abort handler and the vector claimant (which should specify one or more address ranges when registering itself) should be quite high level, transferring a number of bytes/halfwords/words/doublewords without regard for the signedness/interpretation of the raw data. Akin, if anybody else has hardware experience, to the interface presented on the AMBA AXI bus.

It’d also be expedient, when specifying the interface, to consider the possibility of the kernel caching information so that repeated hits to a particular address range can be processed as quickly as possible with minimal decoding work.
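To make the shape of such an interface concrete, here’s a rough C sketch – every name is an assumption for illustration, since the real API is yet to be defined. The kernel decodes the instruction once and presents the claimant with a plain transfer request, much as a slave sees on the AMBA AXI bus:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef enum { ACC_READ, ACC_WRITE } acc_dir_t;

typedef struct {
    uint32_t  address;  /* faulting address within a registered range */
    size_t    size;     /* element size: 1, 2, 4 or 8 bytes */
    size_t    count;    /* >1 for LDM/STM-style bursts */
    acc_dir_t dir;
    void     *data;     /* raw bytes; signedness/rotation stays kernel-side */
} access_req_t;

/* Claimant callback: 0 on success, nonzero to raise an error. */
typedef int (*range_handler_t)(access_req_t *req);

/* Example claimant backing an 'emulated' 4K page with ordinary RAM. */
static uint8_t fake_page[4096];

static int emulated_range_handler(access_req_t *req)
{
    size_t   n   = req->size * req->count;
    uint32_t off = req->address & 0xFFFu;
    uint8_t *p   = fake_page + off;

    if (off + n > sizeof fake_page)
        return 1;                 /* burst crosses the page: error */
    if (req->dir == ACC_WRITE)
        memcpy(p, req->data, n);
    else
        memcpy(req->data, p, n);
    return 0;
}
```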

Bonus points for considering the impact upon DMA transfers :) What happens if I decide to call OS_File to load data from a DMA-using filing system into the ‘emulated’ address range?

Seconded. I really think that any other approach is asking for trouble, with the number of system extensions/features that need to handle aborts.

I think all three of us agree this is the better route to take. It needs defining in detail, and we need to consider the impact and work required on the incumbent Abort handler – or simply decide to replace it and start from scratch.

I presume we’d do the instruction decode in the Abort handler and pass the Vector the instruction type (LDR/STR/SWP/LDM/STM + modern variants etc). In my Abort handler I do the initial decode and pass the instruction as well, so LDM/STM can do post-correction. We could have the Vector pass back a parameter if it doesn’t want post register fix-up to occur, and have the Abort handler deal with all post-corrections. This would make sense, centralising as much code as possible.
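As a rough illustration of that initial decode (classic ARM encodings only – the halfword, modern and coprocessor variants would all need covering too, and this isn’t the actual handler source):

```c
#include <stdint.h>

typedef enum {
    INSN_LDR_STR,   /* single word/byte transfer */
    INSN_LDM_STM,   /* block transfer */
    INSN_SWP,       /* swap */
    INSN_LDC_STC,   /* coprocessor transfer */
    INSN_UNKNOWN
} insn_class_t;

/* Classify the faulting ARM instruction once, in the Abort handler, so
 * each Vector claimant doesn't need its own decoder. SWP is matched
 * first as its encoding is the most specific. */
static insn_class_t classify_arm(uint32_t insn)
{
    if ((insn & 0x0FB00FF0u) == 0x01000090u) return INSN_SWP;
    if ((insn & 0x0C000000u) == 0x04000000u) return INSN_LDR_STR;
    if ((insn & 0x0E000000u) == 0x08000000u) return INSN_LDM_STM;
    if ((insn & 0x0E000000u) == 0x0C000000u) return INSN_LDC_STC;
    return INSN_UNKNOWN;
}
```

The class (and the raw instruction word, for LDM/STM register-list work) would then go into the descriptor passed to the Vector.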

Another thing to be aware of is that emulators don’t always accurately emulate Early Abort Mode, so you can’t rely on OS_PlatformFeatures 0 bit 4. I perform the test on LDR and LDM separately and treat them independently in the Abort handler; the bit in OS_PlatformFeatures assumes both instructions behave the same, which isn’t always the case under emulation.

Register-wise, do we store all 14? Or try to avoid the banked registers if possible, to avoid switching CPU states? I don’t think the overhead is that great on newer CPUs; SA probably takes a hit though.

As well as DFSR/IFSR we’d want to pass FAR, possibly caching info as well. We need to consider the scenarios it’s going to be used for and ensure we cover them all, for example: Chocolate, virtual memory, Sparse DA, alignment faults, rotated load faults, protected access, page zero read/write etc.

What happens if I decide to call OS_File to load data from a DMA-using filing system into the ‘emulated’ address range?

Wouldn’t it break? Won’t DMA routines need to validate the memory before issuing the DMA?

What happens if I decide to call OS_File to load data from a DMA-using filing system into the ‘emulated’ address range?

Wouldn’t it break? Won’t DMA routines need to validate the memory before issuing the DMA?

We’d probably want to make it so that DMAManager can detect that the memory isn’t ‘real’ and have it use a bounce buffer instead. Or DMAManager could negotiate with the owner of the memory in order to lock it in place while the DMA operation is in progress – e.g. if an abort handler is being used to track writes to memory (for something like BPP conversion in video drivers, where apps read/write to a fake screen buffer and the driver translates the data to a second buffer used by the video hardware), then it could DMA to the memory as normal and then notify the owner of the memory that the contents have changed so that it can do a bulk conversion of the entire block.
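A toy sketch of that bounce-buffer completion path in C – all names invented, DMAManager has no such interface today:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the pieces involved: an abort-trapped 'fake' buffer whose
 * owner translates it elsewhere, and a notification so the owner can
 * bulk-convert the block the DMA just filled. */
static uint8_t  fake_buffer[256];
static uint32_t notified_off, notified_len;

static void owner_notify(uint32_t off, size_t len)
{
    notified_off = off;
    notified_len = (uint32_t)len;
}

/* Completion path: the hardware DMA'd into a real bounce buffer; copy
 * into the trapped range with ordinary stores (so the abort handler sees
 * the writes), then tell the owner what changed. */
static void dma_finish_bounced(uint32_t off, const uint8_t *bounce,
                               size_t len)
{
    memcpy(fake_buffer + off, bounce, len);
    owner_notify(off, len);
}
```

The negotiate-and-lock alternative would skip the copy entirely and just deliver the owner_notify at the end, at the cost of the owner having to pin the pages for the duration of the transfer.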