Anima wrote:The only "proof" I have is that you can modify any Blitter register after starting it. That shouldn't work either if any memory access is blocked.

Nice idea, I only did measurements of CPU/BLiTTER vs. video raster. Can you please post some example code?

leonard wrote:Anyway AMIGA blitter is still even better. It can run on 3 operands at once (mask + or in a single pass). I did an amiga sprite test a long time ago, far from being optimized to death, but I just checked I put 36 sprites (32*31, 3 bitplanes).

Yep, the Amiga blitter has a nice cookie-cut function. It needs 4 memory accesses per sprite word, where the ST needs 6 memory accesses (3 for masking and 3 for the sprite data).
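To put those per-word figures into totals for the sprite size discussed here, a rough sketch follows. It only multiplies the quoted access counts by the number of sprite words; the function names are made up for illustration, and it assumes an unshifted sprite with no extra word per line and no setup overhead.

```python
# Per-word access counts as quoted in the post above.
AMIGA_ACCESSES_PER_WORD = 4  # cookie cut in one pass
ST_ACCESSES_PER_WORD = 6     # 3 for the mask pass + 3 for the sprite pass


def sprite_words(width_px, height, planes):
    """Number of 16-bit words in a sprite (assumption: no extra shift word)."""
    words_per_line = (width_px + 15) // 16
    return words_per_line * height * planes


def total_accesses(width_px, height, planes, accesses_per_word):
    """Total memory accesses needed to draw one sprite."""
    return sprite_words(width_px, height, planes) * accesses_per_word


amiga = total_accesses(32, 31, 3, AMIGA_ACCESSES_PER_WORD)
st = total_accesses(32, 31, 3, ST_ACCESSES_PER_WORD)
print(amiga, st, st / amiga)
```

For a 32*31, 3 bitplane sprite this comes to 186 words, so the ratio between the two machines stays at 1.5 in raw accesses regardless of sprite size.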

leonard wrote:But it was on an amiga 1200 (don't know if the blitter is running at the same speed on a plain amiga 500)

Could be interesting to know how many 32*31, 3pl sprites can run on a normal amiga 500.

The A1200 blitter is exactly the same as the A500's from a performance point of view. The only difference is that on the A1200 the video chip steals fewer memory accesses thanks to the 64-bit video memory path.

leonard wrote:Since I wrote the "We Were @" demo, my opinion about blitter has changed a bit. Blitter is efficient because you don't have instruction prefetching penalties. For 32*32 sprites, because of the "mask set each scanline" trick, I realize blitter is faster than CPU. Maybe that's not the case for smaller sprites.

Correct. Unfortunately the setup costs are higher for smaller sprites and fewer bitplanes. Btw.: "We Were @" is an awesome piece of work.

leonard wrote:Could be interesting to know how many 32*31, 3pl sprites can run on a normal amiga 500.

Please keep in mind that the Atari Blitter is quite efficient where the mask is $ffff. So in this case it copies the data instead of doing expensive RMW operations (with the new approach).

Edit:

Some numbers: in general the Amiga Blitter is exactly two times faster in clock cycles for each memory operation. So the "cookie cut" method uses a total of 8 cycles. Compared to the new blitting method above this speed can only be achieved on the Atari on sprite areas where the ENDMASK is $ffff.

It would be nice to see what's left when we add the Blitter setup overhead with the speed penalty where ENDMASK != $ffff and keep the faster CPU (8 MHz vs. ~7.1 MHz) in mind. In other words: does the CPU speed advantage compensate the setup and mask penalty costs?

2nd edit:

I forgot that the Atari Blitter also has an NFSR option. It seems that could be helpful as well.

However, compared to the higher system clock of the Atari the Amiga Blitter timing would result in having 3072 x 8 / 7.1 ~= 3461 Atari CPU cycles "wasted".

So on a rough estimate the Amiga 500 is still about 46% faster. This number is probably lower due to the DMA bandwidth usage by other devices!?

Interestingly without the setup cost for each line the number for the Atari would be: 2560 + 1229 = 3789 CPU cycles. Compared to the equivalent 3461 CPU cycles for the Amiga version that's not bad at all!

Did I miss something? Comments?

Edit: the Blitter setup numbers are wrong. Forget everything in this post. Argh...

alexh wrote:32-bit. And I think Chip RAM is 2x the speed (14MHz instead of 7MHz)

AGA has 64-bit data access, the 68020 32-bit and the blitter 16-bit access. A1200 chip RAM works in exactly the same manner as on the A500 - it has ~224 memory access cycles per PAL scanline. 14MHz with a 32-bit bus should offer a bandwidth of 56MB/s, but on the A1200 you can get at most 4.5MB/s for reads and 6.9MB/s for writes.
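The gap between those figures is easier to see as a ratio. A small sketch, using only the numbers quoted above (the helper names are hypothetical, and MB is taken as 10^6 bytes):

```python
def theoretical_bandwidth_mb_s(clock_mhz, bus_bits):
    """Peak bandwidth if every clock cycle transferred one full bus width."""
    return clock_mhz * bus_bits / 8


def efficiency(measured_mb_s, theoretical_mb_s):
    """Fraction of the peak bandwidth that is actually reached."""
    return measured_mb_s / theoretical_mb_s


peak = theoretical_bandwidth_mb_s(14, 32)   # 14MHz, 32-bit bus
read_eff = efficiency(4.5, peak)            # quoted max read speed
write_eff = efficiency(6.9, peak)           # quoted max write speed
print(peak, round(read_eff * 100, 1), round(write_eff * 100, 1))
```

So the quoted chip RAM speeds are roughly 8% (read) and 12% (write) of what a naive clock x bus width calculation would suggest.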

When all Amiga DMAs are off, the cookie cut takes 8 clock cycles per word. In 4 bitplane mode it is 16 cycles. IMO it is better to base the calculations on available memory slots. The A500/A1200 has 224 available memory slots per PAL scanline when the screen and sound are off. For a 320px line, 5 bitplanes need 100 memory slots and 4 bitplanes need 80 memory slots per line.
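The slot-based calculation can be sketched like this. The 224 slots per scanline and the bitplane costs are the figures from this post; the assumption that the blitter can grab every remaining slot and needs 4 slots per cookie-cut word (no CPU, sprite, disk or sound DMA in between) is a simplification for illustration only, and the function names are made up.

```python
SLOTS_PER_SCANLINE = 224  # figure from the post: screen off, sound off


def bitplane_slots(width_px, planes):
    """Slots used by bitplane DMA: one slot per 16-pixel word per plane."""
    return (width_px // 16) * planes


def free_slots(width_px, planes):
    """Slots left over on a displayed scanline."""
    return SLOTS_PER_SCANLINE - bitplane_slots(width_px, planes)


def cookie_cut_words_per_line(width_px, planes, accesses_per_word=4):
    """Upper bound on cookie-cut words per scanline (assumes the blitter
    gets every free slot, which is optimistic)."""
    return free_slots(width_px, planes) // accesses_per_word


for planes in (4, 5):
    print(planes, bitplane_slots(320, planes), free_slots(320, planes),
          cookie_cut_words_per_line(320, planes))
```

Under these assumptions a 5 bitplane 320px screen leaves 124 slots per displayed line and a 4 bitplane screen leaves 144.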

Good point. I saw that some coders use the term "64-bit bitplane fetch", and I also saw notes about "8 byte alignment" for AGA data. Now I see that "64-bit bitplane fetch" refers to the double CAS / 32-bit mode in FMODE.

leonard wrote:I did an amiga sprite test a long time ago, far from being optimized to death, but I just checked I put 36 sprites (32*31, 3 bitplanes). But it was on an amiga 1200 (don't know if the blitter is running at the same speed on a plain amiga 500)

Could be interesting to know how many 32*31, 3pl sprites can run on a normal amiga 500.

Can you share your test program? We can check it in WinUAE, which has cycle-exact A500 emulation.

I had a shot at this myself (the code-generating variant) and will have some conclusions soon. Haven't quite tested it properly - just internal debug-simulation so far and checking code quality by eye. However, the output does look fairly interesting. Will report something here when I know for sure it's all working ok...

Have been too busy to post much - but in summary, the line-reordering optimization seems to help a lot with the number of loads/stores required and consequently the registers allocated. Just need to make sure the line order is optimized for all mask preshifts simultaneously (and not one at a time) since it requires reordering the colour data at the same time - and preferably only one copy!

I'll have a lot more notes to add later when I have more time - for other types of optimizations - but this part was worth a mention since it definitely has an obvious impact in my tests.

[EDIT]

Forgot to clarify - in this case I'm generating custom code per preshift, where they share an optimized line order. So it takes more space than the version you describe. The only difference is that the line order is optimized for delta minimization per individual endmask register, since it's different code per preshift per line. Other than that it's basically the same sort of thing...

Oh, forgot one other detail on this part - you can also add a small extra penalty for line orders which unnecessarily separate deltas on subsequent lines, i.e. where two EM stores on two output lines are not adjacent. This increases the number of cases where you can use postinc/predec addressing to update the EMs with a single register / using fewer offsets.
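To make the line-reordering idea a bit more concrete, here is a rough sketch. It is not the actual generator: it uses a plain greedy nearest-neighbour pass as a stand-in for whatever search is really used, it leaves out the adjacency penalty mentioned above, and the data layout (`masks[preshift][line]` holding a tuple of three endmask values) and all names are assumptions made up for the example. The cost of placing one line after another is the number of endmask registers that change, summed over all preshifts so that a single shared line order is found.

```python
def delta_cost(a, b):
    """Number of endmask registers that differ between two lines."""
    return sum(x != y for x, y in zip(a, b))


def transition_cost(masks, i, j):
    """Cost of drawing line j right after line i, summed over all preshifts."""
    return sum(delta_cost(shift[i], shift[j]) for shift in masks)


def greedy_line_order(masks):
    """Pick a shared line order by always taking the cheapest next line.

    masks[preshift][line] is a tuple of endmask values for that line.
    Ties are broken by the lower line number to keep the result stable.
    """
    line_count = len(masks[0])
    order = [0]
    remaining = set(range(1, line_count))
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: (transition_cost(masks, last, j), j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def total_stores(masks, order):
    """Total endmask stores needed when lines are drawn in the given order."""
    return sum(transition_cost(masks, a, b) for a, b in zip(order, order[1:]))


# Tiny made-up example: one preshift, four lines alternating between two shapes.
wide = (0xFFFF, 0xFFFF, 0x0000)
narrow = (0x0FFF, 0xFFFF, 0xF000)
example = [[wide, narrow, wide, narrow]]

order = greedy_line_order(example)
print(order, total_stores(example, list(range(4))), total_stores(example, order))
```

In the toy example the natural top-to-bottom order needs 6 endmask stores, while grouping identical lines together brings that down to 2. The colour data would of course have to be stored in the same reordered sequence.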

As far as I understand you are optimising the masks for each shifting step, right? So do you have a universal routine to draw the sprites or do you have a different (precompiled) code for each sprite and shifting step?

Yes exactly - precompiled code per preshift - so it does consume a lot more space than the version you describe, and is probably less appropriate for something like large animated sprites. But it should be faster for equivalently sized sprites which don't need many frames (like a single bob). I estimate around 9k per 32x32 sprite image including colour data. I guess yours will be closer to 1.5-2k including colour data - something like that.

So basically your method looks like the best *general case* method for something like games - which I think was your main criterion. I'm just playing around with alternatives to see what else is worth trying for the most optimized case, and what may lie in between.

BTW I do see a way to make a sort of hybrid which has some benefits of both (midway between speed and cost) but it's a bit more complicated and I haven't really sorted out all the details yet. It involves building a more conservative 'dedicated routine' which serves more than one preshift at a time, while wasting some redundant stores on each one. You'd then be able to control the speed vs storage cost in meaningful steps. Not sure if it's worth the hassle to write but it's a valid thought experiment at least.