However you end up doing this, can I suggest that it be submitted to ROOL, as 16 colour mode emulation would be a very worthwhile addition to the OS generally. Indeed, I suspect some of the other routines you're developing for ADFFS might make worthwhile additions to the operating system as a whole.

I appreciate the code also needs to be present in ADFFS for compatibility with older systems, but handling 16 colour screen modes is going to be an issue for pretty much every port of RISC OS going forward.

Indeed, one suggestion would be to look into using the second core of a dual-core CPU to run the mode conversion code (one proposed solution to the infamous RGB<->BGR RISC OS "issue"), although I suspect it'd bottleneck because the conversion can't start until the 16 colour buffer has been calculated...

Message #122861, posted by sirbod at 19:43, 27/11/2013, in reply to message #122860

Member
Posts: 563

I was pondering that very question today. The solution is so deceptively simple it could easily be added into the core OS to provide legacy MODE support.

There's no hackery involved: it's all done with documented RISC OS SWIs and leaves the OS to handle just about everything, with the screen buffer sitting in DA2 instead of on the GPU.

I may well knock up a stripped-down stand-alone module at some point, once I've added 1 bpp and 2 bpp translation.

The only botch I had to do was figuring out the logical address of the GPU screen buffer. The OS could really do with an SWI to get that info... or OS_Memory could be extended to handle IO physical addresses, which seems the more sensible thing to do.

That's 8 pixels per 24 instructions, compared to 40+ for your version. But using AND to extract a pixel and then ORRing it in at the correct offset is faster, as it's two instructions per pixel instead of three.
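For illustration only, here's the shape of that AND/ORR extraction in C (my sketch of the idea, not the ADFFS routine; it assumes an 8bpp destination whose palette entries 0-15 mirror the 16-colour palette, so a nibble can be widened straight to a byte):

```c
#include <stdint.h>

/* Illustrative sketch only. Expand one word of 4bpp source
 * (8 pixels, lowest nibble first) into two words of 8bpp output.
 * Each line isolates one nibble with a mask and merges it at its
 * destination byte offset. */
static void expand8(uint32_t src, uint32_t out[2])
{
    uint32_t lo = 0, hi = 0;

    lo |=  src        & 0x0000000Fu;   /* pixel 0 -> byte 0 */
    lo |= (src <<  4) & 0x00000F00u;   /* pixel 1 -> byte 1 */
    lo |= (src <<  8) & 0x000F0000u;   /* pixel 2 -> byte 2 */
    lo |= (src << 12) & 0x0F000000u;   /* pixel 3 -> byte 3 */
    hi |= (src >> 16) & 0x0000000Fu;   /* pixel 4 */
    hi |= (src >> 12) & 0x00000F00u;   /* pixel 5 */
    hi |= (src >>  8) & 0x000F0000u;   /* pixel 6 */
    hi |= (src >>  4) & 0x0F000000u;   /* pixel 7 */

    out[0] = lo;
    out[1] = hi;
}
```

On ARM each line maps to an AND with an immediate mask plus an ORR of the shifted result into the accumulator, which is where the two-instructions-per-pixel figure comes from.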

8 pixels in 10 instructions using the full palette hack, compared to 13 instructions for your approach. And since the inner loop is so short, you could easily boost it further by unrolling it a few times.
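For anyone following along, a table-driven inner loop with a bit of unrolling might look something like this in C (my own sketch with made-up names, not the actual code): a 256-entry table maps each source byte straight to two 8bpp pixels, so the loop does one load, one lookup and one store per pixel pair.

```c
#include <stdint.h>

/* Illustrative sketch only. Each 4bpp source byte holds two
 * pixels; the table expands it to a 16-bit pair of 8bpp pixels
 * in one lookup. Assumes a little-endian destination and an
 * 8bpp palette whose entries 0-15 match the 16-colour palette. */
static uint16_t expand_tab[256];

static void build_tab(void)
{
    for (int i = 0; i < 256; i++)
        expand_tab[i] = (uint16_t)((i & 0x0F) | ((i & 0xF0) << 4));
}

/* len = number of source bytes, assumed a multiple of 4;
 * the loop is unrolled to handle 4 source bytes (8 pixels)
 * per iteration. */
static void expand_row(const uint8_t *src, uint16_t *dst, int len)
{
    for (int i = 0; i < len; i += 4) {
        dst[i + 0] = expand_tab[src[i + 0]];
        dst[i + 1] = expand_tab[src[i + 1]];
        dst[i + 2] = expand_tab[src[i + 2]];
        dst[i + 3] = expand_tab[src[i + 3]];
    }
}
```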

The PLD instruction should also come in useful. The cache line size on the Pi is 32 bytes, so I'd suggest unrolling the loop to the point where each iteration processes 32 source bytes, with a PLD somewhere in the loop fetching a future cache line. I'm not sure off the top of my head how far ahead the data should be preloaded, but 128 bytes should give the hardware plenty of time to fetch it before you need it.

It's also worth noting that this research suggests that the optimum write size is 4 words.

I'll leave the production of cycle-timing-optimised routines to someone with more spare time than myself.