So, after some profiling I've realized that a lot of my precious CPU time is being spent on calculating sprites. Not just the reads and writes, but also all the meta stuff: picking the right palette, doing offsets to read out animation data, flipping the sprite if it's turned around, etc. I've got a neat animation system that I'm very happy with, but it's a little costly.

I've been optimizing all of this to be as fast and clever as possible, but the one thing that really irks me is that most objects' sprites end up being exactly the same each frame, so I'm constantly recalculating the same results over and over.

So I've been trying to implement some sort of caching mechanism, so that I don't have to recalculate everything if nothing has changed. I usually know when things have changed (changed object state, scrolled the background, etc) so I know exactly when to reuse the cache and when to refresh it.

But building an efficient and lightweight cache has proved difficult.

The simplest and fastest way would be to simply reuse my DMA's Sprite RAM, and not clear it every frame, maybe adjusting some x and y positions if the game has scrolled. But all the sprite flickering techniques I know involve scrambling the order of the sprites every frame, which means no object ever gets the same Sprite RAM position twice in a row. This ruins everything.

Next up I tried allocating some more RAM as a temporary buffer, so that objects could put their sprite data there, and then it could be copy-scrambled over to the real Sprite RAM right before the DMA. But, to allow for all 64 sprites that's 256 bytes of RAM down the drain. Ouch. Not sure I want to spend that much memory.

According to the wiki, a "simple OAM cycling technique" can be implemented by using a write to OAMADDR before the DMA transfer. However, due to OAMADDR writes also having a "corruption" effect this technique is not recommended. Also, if the technique works like how I think it does, the OAM cycling would be very crude and might leave objects invisible for several frames.

So, is my quest impossible, or are there some other ideas or techniques?

Metasprite rendering continues to be the bottleneck for me, too. There are some improvements I could make, but the biggest I got so far was to simply move to 8x16 sprites, halving the number of iterations the meta sprite drawing routines must do. That's been good enough for now and has given me the performance I want for the game I'm building.

Quote:

But all the sprite flickering techniques I know involve scrambling the order of the sprites every frame, which means no object ever gets the same Sprite RAM position twice in a row. This ruins everything.

Well, as usual in computing and in particular in retro-computing, you have to sacrifice things in order to get the desired features. You should just have two OAM pages: one where the sprites are not shuffled, which is your cache, and one where you shuffle the sprites from the cache so they're re-ordered and flicker properly when there are more than 8 per line instead of disappearing. That sounds rather simple to do.
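For what it's worth, the two-page idea might look something like the sketch below. All names here are assumptions for illustration and not anyone's actual code: `oam_cache` is the unshuffled page that objects write into, `oam_shadow` is the page sent by OAM DMA, and `shuffle_start` is a zero page byte that is always a multiple of 4. It only rotates the starting slot each frame, which is a fairly crude form of cycling, but it shows the copy step.

```asm
; Sketch only. Assumed layout: oam_cache = $0300, oam_shadow = $0200,
; both page aligned. shuffle_start is a hypothetical zero page byte.
shuffle_copy:
  ldx #0                ; X = read index into the cache
  ldy <shuffle_start    ; Y = write index, starts at this frame's offset
.loop:
  lda oam_cache,x
  sta oam_shadow,y
  iny                   ; Y wraps from $FF to $00 on its own
  inx
  bne .loop             ; X is back at 0 after all 256 bytes

  lda <shuffle_start
  clc
  adc #68               ; 17 slots * 4 bytes; 17 is coprime with 64,
  sta <shuffle_start    ; so every slot gets used as a starting point
  rts
```

By my count the loop costs about 16 cycles per byte, so roughly 4,000 cycles for the whole page as written. Unrolling it, or copying only the slots that are actually in use, would bring that down.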

Quote:

Ouch. Not sure I want to spend that much memory.

Well, that's the price for your sprite caching system. You can save memory by caching only some of the 4 parameters if RAM usage is really that much of a problem.

Sprite updates are indeed pretty expensive. Some switch to 8x16 sprites purely so less time is spent rendering an object.

If your game has larger objects, having a separate render routine when you know the object is entirely onscreen can skip a lot of extra logic for checking offscreen per sprite. (Alternatively... don't check for offscreen per sprite.)

reserved1 is high X, reserved3 is high Y, and reserved6 is how many sprites are left.

But I made this note as an optimization (untested, so I'm not including it with the code I know works)

Quote:

Indivisible's rolled drawmetasprite loop can probably be made faster. They end with

dec <reserved6
bne dms.o.loop

But:

cpy <reserved6 ; (or some other zero page variable, since reserved6 can't really be changed
bne dms.o.loop ; without making the other code slower)

cpy is 2 cycles faster than dec, but it also ensures a clear carry when the loop begins again.

Basically the setup code should just add <reserved6 *4 to y and store it somewhere. I imagine the reason I didn't is because reserved6 is technically variable (due to the greater than 64 sprite stuff), but it wouldn't really affect this if the loop were set up properly.
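A rough sketch of that setup, under some assumptions: `dms_end` is a made-up zero page byte, Y advances by 4 per sprite through the metasprite data, and Y plus the sprite count times 4 never wraps past 255. The loop body here is just a plain 4-byte copy to keep the example short; a real routine would do its offset and palette work in the same place.

```asm
; Sketch only. dms_end is a hypothetical zero page variable.
; Run once per metasprite, before the loop:
  lda <reserved6        ; number of sprites to draw
  asl a
  asl a                 ; * 4 bytes of data per sprite
  sta <dms_end
  tya
  clc
  adc <dms_end          ; end = current Y + count * 4
  sta <dms_end          ; (assumed not to wrap, so carry is clear here)

dms.o.loop:
  lda [reserved4],y     ; plain copy of one 4-byte entry
  sta OAM,x
  iny
  lda [reserved4],y
  sta OAM+1,x
  iny
  lda [reserved4],y
  sta OAM+2,x
  iny
  lda [reserved4],y
  sta OAM+3,x
  iny
  inx
  inx
  inx
  inx
  cpy <dms_end          ; 3 cycles, versus 5 for dec <reserved6
  bne dms.o.loop        ; Y < dms_end whenever this branches, so carry is clear
```

Since Y is still below `dms_end` each time the branch is taken, the compare leaves the carry clear at the top of the loop, which means the first adc in the body doesn't need a clc in front of it.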

I also have a separate subroutine when I want to do "versatile" things like dynamically changing the palette of every sprite in the object.

Edit: Oh wait, no it's just the one above. That's what the reserved8 thing is. So basically I have a fast and a slow one.

Basically I recommend having different routines for every case. Usually you don't want to do anything advanced, so at least have one for the fastest possible case. (Guaranteed on screen, no dynamic anything.)

The use of OAMADDR with values other than zero is heavily discouraged, since that can result in sprite corruption.

The best method for caching sprites I can think of is indeed using another 256 bytes for a second OAM shadow, so you can alternate between them every frame and copy data from one to the other if the sprites are known to not have changed.

What kind of sprite cycling method are you currently using? Are you willing to change that to accommodate the sprite caching? Maybe you can come up with a solution that swaps individual OAM entries when they need to be kept, and simply overwrites the ones that don't.

Quote:

Well, that's the price for your sprite caching system. You can save memory by caching only some of the 4 parameters if RAM usage is really that much of a problem.

That's fair enough, and it's probably what I will fall back to if no other secret technique pops up. I've been playing around with doing an "in place" shuffle of the original Sprite RAM so that objects have a new position in the buffer every frame, yet still retain their old values. The shuffling process is fairly expensive though, since you gotta shuffle 256 different values.

Quote:

Post some code? It's surprising to me that sprites would be so expensive.

Now, I haven't posted any code in this post. It's not like my code is secret, but I'd much rather explain what it does (and why it's slow) instead of posting a big blob of asm and forcing everybody to decrypt what's going on. Also, I haven't commented it yet.

Some games like Super Mario Bros 3 have a lot of restrictions on what game objects can exist where, doing tricks like hardcoding the available palettes and CHR banks to the level. So if this is a Goomba+Koopa Troopa level, you simply can't use the Boo or Thwomp enemies or they will look strange and miscolored, and vice versa.

I've been working on a system to defeat such restrictions, by dynamically loading and unloading CHR and palettes as they are needed. The way things work, when an ingame object is created, it attempts to grab an 8k sprite CHR page and a palette for itself. I use a lot of techniques to maximize their potential like reusing as much as possible, having optional alternative graphics and color schemes, and even splitting palettes in two (and if it's just utterly impossible to fit in, the object simply despawns before it's seen).

But all of this only happens when the object is created, not every frame, so it's not the expensive part. The point is, any object might end up with any of the CHR pages or any of the palettes. So, while SMB3 can optimize its Koopa Troopa drawing routine by always referring to palette 2, my system has to do a lookup to see which of the four palettes my object was assigned.

Then there is animation data, and some extra goodies I've baked in there like allowing small x/y offsets on the sprites or vflip/hflip flags on individual sprites in the meta-sprite. Despite having so much more stuff than SMB3, my system is still faster due to better coding.

Still, I can see the potential for massive gains by reusing those values rather than having to recalculate them every frame.

Last edited by Drakim on Fri Jan 12, 2018 11:40 am, edited 1 time in total.

Quote:

Then there is animation data, and some extra goodies I've baked in there like allowing small x/y offsets on the sprites or vflip/hflip flags on individual sprites in the meta-sprite.

That shouldn't really affect rendering at all. Whether an individual sprite is flipped or not doesn't matter to the block copy, whether an individual sprite is offset a little doesn't matter to the block copy.

Is the issue that you're also trying to save space? I stored every frame twice, once flipped and once not, rather than flipping it at runtime and I don't feel bad about it.

Quote:

That shouldn't really affect rendering at all. Whether an individual sprite is flipped or not doesn't matter to the block copy, whether an individual sprite is offset a little doesn't matter to the block copy.

The thing is, I don't block copy my animation data right to the sprite buffer, I do stuff like XOR the global flip flags with the individual flip flags, and add the global x/y coordinate with the sprite's local x/y offset. I also have to fish out the correct palette since it's not hardcoded.

Quote:

Is the issue that you're also trying to save space? I stored every frame twice, once flipped and once not, rather than flipping it at runtime and I don't feel bad about it.

Huh....I hadn't thought about that. That's genius! It totally saves me the XOR of the flip flags. I could even use a macro, and potentially do it for other things too. Thanks mate!

The palette thing is easy, if the object only uses one palette. (Which, if you're dynamically allocating palettes, is probably the common case.) It's one extra instruction, the ora:

Code:

lda [reserved4],y
ora SPRpalette
sta OAM+2,x

You store the palette the object wants to SPRpalette before rendering and lose 3 cycles per sprite, oh well.

In a case where palette 0 is like... a shared palette (reserved for player one or something, that enemies can also use)... you could maybe get cute with the data and store it shifted right one bit.

Now the highest bit is free to use as a flag for that. Essentially:

Bit 7: Use Palette 0
Bit 6: Flip Sprite Vertically
Bit 5: Flip Sprite Horizontally
etc.

Code:

lda [reserved4],y
asl a ;Whether to use palette 0 is now in the carry, flip sprite vertically is in bit 7 where OAM expects it, etc.
bcs storepalette ;If the high bit was set, we use palette zero
ora SPRpalette
storepalette:
sta OAM+2,x

Admittedly that's still a bit heavy 64 times, but well...

Edit: Just to say it, I'm not sure how much help caching sprite data will be in a game that scrolls. But I'll think about it.

Quote:

The palette thing is easy, if the object only uses one palette. (Which, if you're dynamically allocating palettes is probably the common case) It's one instruction.

I am saving the palette per sprite per frame, for the entire metasprite, but in hindsight it's just as you say, maybe 90% of all metasprites use only one palette. I'll make a faster drawing routine they can use instead of the full one, that doesn't need to look up the palette byte each time.

Quote:

Edit: Just to say it, I'm not sure how much help caching sprite data will be in a game that scrolls. But I'll think about it.

That's a good point, but I think it could be worked around, since scrolling most of the time only happens on one axis even for games with multi-directional scrolling like SMB3. You'd simply have to loop over every 4th byte and do an addition. The carry flag will tell you if the sprite is now off-screen.
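As a rough sketch of that fix-up pass for horizontal scrolling: the names `oam_cache` (the cached sprite page) and `scroll_dx` (a zero page byte holding how many pixels the camera moved right this frame) are made up for the example. Because the camera moving right makes sprites move left, this version subtracts rather than adds, and a cleared carry after the sbc signals that the sprite went past the left edge.

```asm
; Sketch only. oam_cache and scroll_dx are hypothetical names.
scroll_cache_x:
  ldx #0
.loop:
  lda oam_cache+3,x     ; byte 3 of each entry is the X position
  sec
  sbc <scroll_dx        ; camera moved right, so the sprite moves left
  bcs .store            ; carry still set: no underflow, still on screen
  lda #$FF
  sta oam_cache,x       ; byte 0 is Y; $EF-$FF puts the sprite off screen
.store:
  sta oam_cache+3,x     ; hidden sprites get X = $FF, which is harmless
  inx
  inx
  inx
  inx
  bne .loop             ; X wraps back to 0 after 64 entries
  rts
```

Scrolling left would be the mirror image with clc/adc, where a set carry means the sprite left the right edge. A sprite hidden this way stays hidden until its object redraws it, so the owning object would still need to refresh its cache entry when it comes back on screen.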

But even if it's not viable, just being able to reuse the pattern and attribute bytes would still be a boon.
