Well, the prefetch feature of the 68000 is no secret. My theory was that it causes an internal stage plus either another internal stage or a word memory access (2 x 4 cycles in total) to be executed before a blit starts. I never tested the theory beyond DIVSing perspective while drawing lines 23 years ago, basically cos I couldn't find anything else useful that wasn't a normal sub-12-cycle instruction. Sometimes the lines would be shorter than the DIVS cycle time, of course, but that was rare. The point was that the DIVS was started immediately and thus calculated internally in parallel, finishing sooner.

The other one is aligning table lookups (usually 14 cycles) or a taken branch (10 cycles) and similar with the alternating 4-cycle chip-memory access (CMA) / DMA slots. In the vblank period, when no DMA is active, it's "as written": just sum up the cycles. But while actually displaying something, some of them would just be out of luck and have their CMA execute in the NEXT 4-cycle slot the bitplane DMA wasn't hogging.

Normally this is too much work really (really!) since you can't really go "oh, I'll halve the number of colors on screen and I'll be able to fit one or two CMAs between bitplane accesses" — you'd have ruined the original idea (by making it look shit), and you'd already have gained much more by removing a bitplane's DMA, for both blitter and CPU.

So I haven't tried this (unless I got some routine right by accident), and I expect someone is going to debunk it instantly (and thunderously!)

But it's basically the last straw to grasp at when a frameful of effect is a sequence of perfectly and godlikely optimized instructions (according to yourself, of course). Considering you'd likely have to sync with the raster at the start of something, you'd probably lose more to that sync than you gained! But there might be a situation. Not one that wouldn't be completely 'surpassed' by precalc or infinite bobs or whatever, of course. Hah.

"Trying to optimize often leads to a few lines of strange new code that does it faster or shorter but looks irrelevant to the task."

Even though this looks more like good code for implementing variable typing in a higher-level language, I liked it.

"Trying to optimize often leads to a few lines of strange new code that does it faster or shorter but looks irrelevant to the task."

So true. Leading to the weird experience of looking at some of your very own code and thinking: huh? what the hell was I doing that for?

Followed by a few minutes of groping through hazy memories and realising: oh, yeah, that's why.

It's another reason why I personally find it difficult, nigh on impossible, to figure out other people's demo code. They did so many weird little things and optimisations that only they understood the reason for that I've got no chance.

Much easier to code your own effects from scratch than figure out how some other coder did it their way.

This works for other numbers too — try it and see what works. Obviously there'll be a cut-off somewhere where all the shifting and adding mounts up, and it might be quicker, or no different, to use a mulu.w instead.

Also, you do have to watch out for shifting bits "off the end" of whatever size the registers in use can hold.

On 68000, instead of doing shifts/adds, you could also use a multiplication table. The disadvantage is that you might need a spare address register (depending on where in memory your table is) and that the table costs some memory, of course. The advantage is that it's faster than lots of shifts+adds.