Frank B wrote: Possibly, but we're comparing general-purpose code against demo code. I bet there is some serious cheating going on.

I agree we are comparing two very different things. But my point is that you already compare generic BLITTER code with a PRE-SHIFTED CPU routine. To me, "preshift" is already not so "generic" (because of the amount of memory needed, especially with sprite animation). So since your benchmark compares blitter vs. "preshift" CPU, I thought it would be interesting to add a comparison with generated code, which in my opinion is just another form of "preshift".
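
To put a number on the memory cost of pre-shifting, here is a minimal sketch (not from any of the demos discussed). It assumes a single-plane row is stored as 16-bit words and that each of the 16 shifted copies carries one extra word per row; the function names are made up for illustration.

```python
def preshift_row(word, shift):
    """Shift one 16-bit sprite word right by `shift` pixels (0-15).

    Returns the two 16-bit screen words the shifted row occupies.
    """
    wide = (word & 0xFFFF) << (16 - shift)   # place the row in a 32-bit window
    return (wide >> 16) & 0xFFFF, wide & 0xFFFF


def preshift_table(rows):
    """Build all 16 pre-shifted copies of a single-plane, 16-pixel-wide sprite."""
    return [[preshift_row(w, s) for w in rows] for s in range(16)]


def preshift_bytes(width_px, height, planes, frames=1):
    """Bytes of image data for 16 pre-shifted copies (one extra word per row)."""
    words_per_row = width_px // 16 + 1
    return 16 * words_per_row * 2 * height * planes * frames
```

Under these assumptions one 32x32, 4-plane frame needs 12288 bytes of pre-shifted image data before any masks are counted, and the figure scales linearly with the number of animation frames.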

Another interesting thing to add (but it takes time to code) is the "generic" case in a game: save and restore the background. In that case, and with 4 bitplanes, maybe the CPU is slightly better, just because the background read is combined with the register load used to apply the mask (1 read instead of 2 for the blitter).

Anyway, I just want to say that the blitter is faster than the CPU for many things. But people have to be aware of one thing: if you have to write a sprite demo, then depending on many parameters (sprite size, shape, etc.) you HAVE to choose carefully between CPU and blitter, because the blitter won't be the best in all cases!

To get a real comparison with 4 bitplanes, there is the nice reset demo in the Decade Demo, showing 11 32*32 4-bitplane sprites, using pre-shifted data (but no generated code) and a real-time waveform. I drew 18 of exactly the same sprite using generated code in the genius demo, back in 1990.

I'll maybe do another one with a 3 or 4 plane sprite, including restoring the background. First of all I want to add a Falcon 030 renderer, proper blitter detection, a fix for the delayed write of the video counters and a HOP filtering trick (5 bitplane scroller). I'd also like to compare a CPU-shifted scroller vs the blitter and port both to the Amiga. It might be interesting to have a "restart method" blitter renderer too.

This is in addition to optimising the code. I'm still resistant to adding certain optimisations at the expense of the API though. I'll publish the code for others to play with.

I started work on a benchmark based on my own ideas: something close to a game's sprite drawing system. So, full background restore, low res ... It will not display the VBL count for a given number of sprites drawn, but the time needed to draw (and undraw) a certain number of sprites, at any screen location. To start with there will be no clipping support, so all sprites will be drawn fully, which means positions will not go near the edges. But a real test should include drawing with clipping too. I don't see that we need special code for the Falcon when ST low res is used.

Regarding the CPU-based fine scroller: that's fast enough only on the TT; on the Falcon it is slower than a blitter-based scroll on an ST(E). So I guess it will be OK on an Amiga 1200, although I don't see the point. I used pretty long code, with separate routines for all shifts 1-15.
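
The idea of one specialised routine per shift amount can be sketched roughly as below. This is an illustration only, not the poster's 68000 code: it assumes a single-plane row held as a list of 16-bit words scrolling left, and `make_shifter` and `SHIFTERS` are hypothetical names.

```python
def make_shifter(n):
    """Return a routine that scrolls one plane row left by n pixels (1-15)."""
    def shift_row(words, incoming=0):
        out = []
        for i, w in enumerate(words):
            # Bits leaving the next word on the left enter this word on the right.
            nxt = words[i + 1] if i + 1 < len(words) else incoming
            out.append(((w << n) | (nxt >> (16 - n))) & 0xFFFF)
        return out
    return shift_row


# One routine per shift amount, picked by table lookup instead of a runtime branch.
SHIFTERS = {n: make_shifter(n) for n in range(1, 16)}

row = SHIFTERS[4]([0x1234, 0x5678], incoming=0xABCD)   # [0x2345, 0x678A]
```

On the real machine each entry would be an unrolled routine with the shift count fixed in the instructions, which is where the "pretty long code" comes from.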

The famous Schrodinger's cat thought experiment says that the cat is dead or alive until we open the box and see the condition of the poor animal, which deserved better logic. The cat is always in some definite state, regardless of whether an observer is able to see what that state is.

AtariZoll wrote: I started work on a benchmark based on my own ideas: something close to a game's sprite drawing system. So, full background restore, low res ... It will not display the VBL count for a given number of sprites drawn, but the time needed to draw (and undraw) a certain number of sprites, at any screen location. To start with there will be no clipping support, so all sprites will be drawn fully, which means positions will not go near the edges. But a real test should include drawing with clipping too. I don't see that we need special code for the Falcon when ST low res is used.

Regarding the CPU-based fine scroller: that's fast enough only on the TT; on the Falcon it is slower than a blitter-based scroll on an ST(E). So I guess it will be OK on an Amiga 1200, although I don't see the point. I used pretty long code, with separate routines for all shifts 1-15.

Cool. My CPU real-time shifter can draw 9 32 * 30 2-plane objects a frame, along with the teletype and frame counter draw. Reasonably fast for the CPU. It'll slow down if I have to restore rather than clear the background.

Mine also has a raster timer btw. You can enable it with the 6 key and see how much time each component needs.

If we're going to compare mine to yours we should set some constraints first, i.e. the API and the format of the input data.

I don't see why the API is relevant. Regarding input data: 4 bpp. ST low res is what is used most where sprites are concerned, and with the blitter in games and demos generally.

The most interesting way of drawing sprites for games is the following: first save the background for the later sprite undraw, then apply the mask to the background using an AND operation and the mask data, then draw the sprite itself with OR. Undraw is a simple copy of the saved background data. A little speed can be gained if the complete background is stored separately, so there is no need to save it before the sprite draw, at the price of some RAM.

There is a way that needs no background save and restore: constantly updating the whole background, so sprites are "automatically" removed. But that costs a lot of CPU or blitter time. However, it is necessary if we want a scrolling background. It may be good with many sprites too, because it saves a lot by not doing the background save and restore.

Because of masking, I concluded that in the sprite data colour index 0 should be transparent, i.e. it will show the background colour. Then masking can be done fastest. I did some basic speed comparison by simply drawing 32x32 px sprites using the CPU or the blitter. The blitter way is some 20% faster, but it needs 4x more space for mask data, because the blitter works only in 2 dimensions, so it cannot use the same mask for all 4 bitplanes. On a Mega STE set to 16 MHz the CPU way is faster, which is no surprise. My test program is still at a very early stage, so I will not post it yet.

Tests on the Falcon are a little strange: with CPU and blitter set to 16 MHz, the CPU way is some 30% faster.
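
The save / AND mask / OR sprite / restore sequence described above can be sketched as follows. This is a simplified model, not the poster's test program: it assumes the screen is a list of 16-bit words with 4 interleaved planes (as in ST low res), a word-aligned 16-pixel-wide sprite, and a stride given in words; all names are hypothetical.

```python
def make_mask(planes):
    """Mask word with a bit set wherever the pixel is colour index 0 (transparent)."""
    p0, p1, p2, p3 = planes
    return ~(p0 | p1 | p2 | p3) & 0xFFFF


def draw_sprite(screen, offset, sprite_rows, stride):
    """Draw word-aligned 16-pixel-wide rows and return the saved background."""
    saved = []
    for y, planes in enumerate(sprite_rows):
        base = offset + y * stride
        mask = make_mask(planes)                 # one mask serves all 4 planes
        saved.append(screen[base:base + 4])      # save for the later undraw
        for p in range(4):
            screen[base + p] = (screen[base + p] & mask) | planes[p]
    return saved


def undraw_sprite(screen, offset, saved, stride):
    """Restore the background words saved by draw_sprite."""
    for y, words in enumerate(saved):
        base = offset + y * stride
        screen[base:base + 4] = words
```

In this model the CPU path reads each background word once and uses it both for the save and for the masked draw, and a single mask word covers all four planes of a pixel group.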

Next version of Steem SSE will correctly clear the "hog" bit, but there's also a timing problem: Steem is too quick (up to $14 blobs/VBL). I found a bug where Steem didn't count the reading cycles on "FXSR", but that's not enough. This program is happy only when we do count cycles for "NFSR", which you shouldn't ... further investigation required.

So see how useful your little bug-ridden program turns out to be.

When not using skew and the source reread registers, blitter speed is the same as on real HW. I will soon check with skew too.

However, I noticed a strange thing in 3.2: it says 14 T-states for move.l d0,(a1)+, but that's 12 for sure. And 14 will be rounded up to 16 in the end.
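
The "14 becomes 16" remark can be written out as a one-line calculation. This is only a rule-of-thumb sketch, assuming instruction timings on the ST end up rounded to a multiple of 4 cycles; the function name is made up.

```python
def round_to_bus_slot(cycles, slot=4):
    """Round a cycle count up to the next multiple of `slot`."""
    return -(-cycles // slot) * slot


reported = round_to_bus_slot(14)   # 16: the figure the debugger's 14 would cost
expected = round_to_bus_slot(12)   # 12: already a multiple of 4
```

So a wrongly reported 14 does not just cost 2 extra cycles in an emulated timing, it costs 4.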

Because I want the code to be usable and not a hacky mess. The way the code is now it's trivial to adapt it to any number of planes. It's also trivial to add background saving and restoral.

AtariZoll wrote: Blitter way is faster some 20%, but it needs 4x more space for mask data, because blitter works only with 2 dimensions, so can not use same mask for all 4 bit planes.

Why do you need to duplicate the mask? You can simply blit the same mask (NOT SRC AND DST) 4 times to the screen. The setup time is negligible. The only reason for duplicating the masks is to save reloading the source address.

I'd also advise being careful about drawing conclusions about the speed on the Mega STe. Remember that you're not likely to be blitting the same sprite frame repeatedly in a game. Try cycling through various frames to reduce the cache hits on the source.
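
The "same mask, four passes" suggestion can be modelled roughly as below. This is a software sketch of the idea, not real blitter register programming: it assumes a word-addressed screen with 4 interleaved planes, a mask with bits set where the sprite is opaque, and hypothetical function names.

```python
PLANES = 4


def not_src_and_dst(src, dst):
    """Logic op that clears destination bits wherever the source bit is set."""
    return ~src & dst & 0xFFFF


def blit_plane(screen, start, mask, words_per_row, rows, line_stride, op):
    """One pass over a single bitplane, stepping over the other interleaved planes."""
    for y in range(rows):
        for x in range(words_per_row):
            i = start + y * line_stride + x * PLANES
            screen[i] = op(mask[y * words_per_row + x], screen[i])


def cut_hole(screen, start, mask, words_per_row, rows, line_stride):
    """Apply the single stored mask to each plane in turn (4 passes in total)."""
    for plane in range(PLANES):
        blit_plane(screen, start + plane, mask, words_per_row, rows,
                   line_stride, not_src_and_dst)
```

Only the destination start changes between passes, so the mask is stored once; the price is four blitter setups per sprite instead of one.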

Steven Seagal wrote: Next version of Steem SSE will correctly clear "hog" bit, but there's also a timing problem, Steem is too quick (up to $14 blobs/VBL). I found a bug where Steem didn't count the reading cycles on "FXSR", but that's not enough. This program is happy only when we do count cycles for "NFSR", which you shouldn't ... further investigations required. So see how useful your little bug-ridden program turns out to be.

It wouldn't have had those bugs if I'd tested on real hardware. In any case, that behaviour of the hog bit wasn't documented. We all got it wrong. The positive thing is we found out something new about the hw behaviour. I'm relying on NFSR as an optimisation on the src data.

Frank B wrote: It wouldn't have had those bugs if I'd developed on real hardware. In any case that behaviour of the hog bit wasn't documented. We all got it wrong! The positive thing is we found out something new about the hw behaviour. I'm relying on NFSR as an optimisation on the src data.

What I wrote about it was just the opposite. I said that coders just set the HOG bit at every blitter start on a "why not, it costs nothing" basis.

Frank B wrote: Because I want the code to be usable and not a hacky mess. The way the code is now it's trivial to adapt it to any number of planes. It's also trivial to add background saving and restoral.

My idea was to do things my way, and then we can compare results, or at least speed ratios between different ways of drawing sprites. I don't think anything is trivial if you want to get the maximum possible speed. And of course, I don't have a clue about your code, and actually I don't want to know. It's always better if several people do it, then we can see which code is faster. I'm not sure it's the best idea to make some universal code that is "trivial" to adapt to any number of planes. My goal is the fastest possible code, not by using dirty tricks, just efficient code. So I do it in pure ASM, using Devpac 3. As a test "platform" I use the Steem Debugger, where it is easy to trace and spot bugs. After that I may test on real HW. And as expected, it all works fine on the Falcon, because the code is not messy or hacky. For me C is a mess, and I really don't see the need for it in such software, where besides the test code you only need a few simple tasks such as going into supervisor mode, loading a file, setting the screen ...

Frank B wrote: Why do you need to duplicate the mask? I'd also advise being careful about drawing conclusions about the speed on the Mega STe. Remember that you're not likely to be blitting the same sprite frame repeatedly in a game. Try cycling through various frames to reduce the cache hits on the source.

Just to get the maximum possible speed. Setting up and starting the blitter for only 8 bytes of sprite data seems a pretty bad idea, which would cost some 5-10% in speed. In the case of wide sprites, over 60 px wide, it may be worth it.

On the Mega STE the roughly 15% speed gain is just because of faster loading of opcodes and internal CPU operations in this case. The 16 KB cache can hold a certain number of sprites too. So I would say that a CPU with some aid is good, especially because the poor blitter was never helped by a cache.

Anyway, if you don't like my approach I can start a new thread in the coding section.

Here's a much older build from 1993 or so. It's not a benchmark, just a little intro. Does 30 in a frame though. There's one wee cheat I'm using which I'm not using in the benchmark rewrite. Remember my goals aren't necessarily the same as Pepera's or anyone else's.

I'm interested in putting several styles of renderer up against the blitter and benching the result.

1) It will run on machines with no blitter present. If there is no blitter you will be restricted to CPU modes. I've used the OS for this rather than trap a bus access. I might add that later.

2) It has a new blitter "op" draw mode. This can be activated by pressing the 7 key. Key 8 will cycle through each draw mode. It's fun to see xor and or draw modes.

3) It supports the VDI. You can access VDI renderer mode with the 9 key. In this mode you can press b on the keyboard to toggle the blitter on and off.

4) The teletype no longer requires the blitter. It'll run on a normal ST.

The VDI renderer is hideously slow. The reason for this is that the number of planes has to match on the source and destination. I had to move to a 4 plane object in this mode and use "no-op" planes to leave the text as is.

I.e. it has to do a logical and with 0,0,-1,-1 rather than a clear. It has to use planes 3 and 4 with zeros for the or. Might be useful to benchmark different screen accelerators.
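
The "no-op planes" trick can be illustrated with plain word arithmetic. The sketch below is not VDI code: it assumes one 16-pixel group represented as a tuple of four plane words, with the 2-plane object in planes 1 and 2 and the text in planes 3 and 4; all names are hypothetical.

```python
NOOP_AND = 0xFFFF   # AND with all ones (-1) leaves a plane unchanged
NOOP_OR = 0x0000    # OR with zero leaves a plane unchanged


def clear_two_planes(dest):
    """Clear planes 1 and 2 with an AND against 0,0,-1,-1, keeping planes 3 and 4."""
    src = (0x0000, 0x0000, NOOP_AND, NOOP_AND)
    return tuple(d & s for d, s in zip(dest, src))


def or_two_plane_object(dest, obj):
    """OR a 2-plane object in, padded with zero planes so planes 3 and 4 are untouched."""
    src = (obj[0], obj[1], NOOP_OR, NOOP_OR)
    return tuple(d | s for d, s in zip(dest, src))
```

All four planes are still read and written in both steps, which is consistent with this mode being slow compared with a renderer that touches only two planes.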