Disclaimer: I've only tested this on mapper MMC5, but it should work with any mapper that has some manner of WRAM bankswitching.

So, I've found a neat little technique, which I'm sure isn't new or anything, but I haven't seen much discussion about it. I figured I'd post it here for comments and just to gather my thoughts on the matter. The TLDR explanation of this technique is that you use WRAM bankswitching to create a sort of pseudo hardware-accelerated array, saving cycles and freeing up registers. Just be aware, this technique has some limitations, so it's not necessarily something you want to use for everything.

The MMC5 can have up to 64K of WRAM, divided into 8K pages. Since 64/8 = 8, I'll be working on the assumption that we have 8 separate WRAM banks we can utilize. The technique becomes weaker the fewer banks we have.

The most straightforward way to utilize this might be to organize your gameobjects/entities. Now, to compare, the traditional way to do gameobjects is usually to dedicate some RAM to them, like so "Object_Health: DBS 8", organized as a whole bunch of parallel arrays. Then, we use indexed absolute access, like so "LDA Object_Health,X", to read/write the various values that make up the gameobject, where register X decides which of the 8 gameobjects we want to access (by being a value from 0 to 7).

But what we do instead is only allocate one byte of RAM "Object_Health: DBS 1" but make sure that it's identically allocated across all 8 WRAM banks, in the same exact location. Then, to switch between accessing the health of the 8 different gameobjects, we bankswitch the WRAM bank from 0 to 7 in lieu of using the X register, and use a vanilla "LDA Object_Health" to load out the health value. That should give you the idea of how this works and why.
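To make the idea concrete, here is a toy model in Python (not NES code). It assumes an 8K window at $6000-$7FFF and a made-up address for Object_Health; the BankedWram class and its method names are purely illustrative. The point is only that one fixed address yields a different byte depending on the selected bank.

```python
NUM_BANKS = 8
BANK_SIZE = 0x2000       # 8K per bank
WINDOW_BASE = 0x6000     # assumed CPU address of the WRAM window
OBJECT_HEALTH = 0x6000   # hypothetical fixed address, identical in every bank


class BankedWram:
    """Toy model: one 8K window that shows one of eight banks at a time."""

    def __init__(self):
        self.banks = [bytearray(BANK_SIZE) for _ in range(NUM_BANKS)]
        self.current = 0

    def select(self, bank):
        # Stands in for the mapper register write that picks the bank.
        self.current = bank & 0x07

    def read(self, addr):
        # Stands in for a plain "LDA Object_Health" (no index register).
        return self.banks[self.current][addr - WINDOW_BASE]

    def write(self, addr, value):
        self.banks[self.current][addr - WINDOW_BASE] = value & 0xFF


wram = BankedWram()

# Give each of the 8 gameobjects its own health value.
for obj in range(NUM_BANKS):
    wram.select(obj)
    wram.write(OBJECT_HEALTH, 10 + obj)

# The selected bank plays the role of the X register.
wram.select(3)
health_of_object_3 = wram.read(OBJECT_HEALTH)
```

Here the call to select(3) does the job that "LDX #3" would do in the parallel-array version.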

Now, at this point the technique might seem like overkill. Why go through so much trouble? So, let's dig into the advantages!

When updating, animating, and moving a gameobject, you frequently access that gameobject's values. You'll probably be accessing values like Object_XPosition,X, Object_YPosition,X, Object_Width,X, Object_Height,X, Object_XVelocity,X, Object_YVelocity,X, Object_Attributes,X, Object_Direction,X, and you'll probably be accessing them quite a lot. Normally to do this, you have to either ensure that X stays unmolested as the object index, or you have to copy out all those values to the Zero Page RAM before working on them. But, with the WRAM bankswitching you free up the X register, as the currently active WRAM Bank takes up the role of being your "index". That means you are free to write better and faster code utilizing the A, X and Y registers freely.

A common technique for gameobject behavior is to have pointers/addresses to subroutines stored in RAM, ready to be called regularly or under certain conditions. For instance, you might have "Object_DeathAddress: DSW 8" that you use the Indirect Jumping or the Reverse RTS trick to call when the object dies. Or you might have one for "update" that you call every frame. With the WRAM bankswitching, you can do it a lot more efficiently. Just call "JSR Object_DeathAddress" and you are done. Since we don't need to use absolute indexed mode (which JSR lacks) we don't need to jump through all those hoops to call the subroutine.

While switching RAM banks to select between gameobjects might sound annoying and inefficient, you actually do this a lot less than you'd think. When looping over and updating the 8 gameobjects every frame, we start by switching the RAM bank at the top of the loop, and that's it. We can now invoke all manner of code, for moving the object, checking object collisions against tiles, drawing out sprites, and all of the code simply assumes that the various Object_* labels are pointing to the one object they should be working on. It actually makes your code a lot more clean and straightforward.

One thing you might get hung up on is when two gameobjects need to interact, for example to check collision. There are a number of easy ways to solve this. If your mapper allows for more than one place to bankswitch WRAM (like MMC5 does), simply bankswitch in two banks of WRAM at the same time in two separate locations. You can have gameobject 1 at $6000-$7FFF and gameobject 2 at $8000-$DFFF, to easily compare them. You haven't seen a fast and clean collision check routine until you have seen one that doesn't need register X for gameobject 1 and register Y for gameobject 2. If your mapper can only bankswitch WRAM in one location, you can quickly copy gameobject 2's values temporarily into Zero Page RAM for the same experience.
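A rough Python model of the two-window idea follows, with the same caveat that it is a sketch and not NES code. It assumes two 8K windows at $6000 and $8000, each with its own bank selection. The field addresses, the windows dictionary and the overlaps_on_x helper are all made up for illustration.

```python
BANK_SIZE = 0x2000
banks = [bytearray(BANK_SIZE) for _ in range(8)]

# Window base address -> bank currently visible there (assumed layout).
windows = {0x6000: 0, 0x8000: 0}


def read(addr):
    base = addr & 0xE000  # which 8K window the address falls in
    return banks[windows[base]][addr - base]


# Hypothetical field addresses: same offsets, different windows.
OBJECT_X, OBJECT_WIDTH = 0x6000, 0x6001
OBJECT2_X, OBJECT2_WIDTH = 0x8000, 0x8001


def overlaps_on_x():
    # One-axis overlap test with no index register in sight.
    return (read(OBJECT_X) < read(OBJECT2_X) + read(OBJECT2_WIDTH)
            and read(OBJECT2_X) < read(OBJECT_X) + read(OBJECT_WIDTH))


# Three objects stored as (x, width) at offsets 0 and 1 of banks 1 to 3.
for bank, (x, width) in {1: (10, 8), 2: (15, 8), 3: (40, 8)}.items():
    banks[bank][0] = x
    banks[bank][1] = width

windows[0x6000] = 1          # gameobject 1 in the first window
windows[0x8000] = 2          # gameobject 2 in the second window
hit_1_vs_2 = overlaps_on_x()

windows[0x8000] = 3          # swap only the second window
hit_1_vs_3 = overlaps_on_x()
```

Only the second window's selection changes between the two tests, which mirrors looping over "the other object" while the current object stays put.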

With "LDA MyAddress,X" you can access 256 bytes indexed by X. Sometimes 256 is clearly not enough though, such as the tiledata for your level. In cases like that, you gotta use some other trick like "LDA (CalculatedAddress),Y" Indirect addressing, which is expensive and clunky to set up for each access. However, with WRAM bankswitching, you can stack the 256 bytes in WRAM parallel for up to 256*8 = 2048 bytes while still using "LDA MyAddress,X", almost at the same cost. It's not enough for everything but it's still neat.
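The index arithmetic behind the 256*8 = 2048 figure can be sketched as below, again as a Python model and not real code for the console. The high bits of a flat index pick the bank and the low 8 bits become X. The offset of MyAddress inside the window and the helper names are assumptions.

```python
PAGE = 256
NUM_BANKS = 8
MY_ADDRESS_OFFSET = 0x0100  # hypothetical offset of MyAddress in the window

banks = [bytearray(0x2000) for _ in range(NUM_BANKS)]


def split(index):
    """Split a flat index (0..2047) into (bank, X register value)."""
    if not 0 <= index < PAGE * NUM_BANKS:
        raise IndexError(index)
    return index >> 8, index & 0xFF


def store(index, value):
    bank, x = split(index)
    banks[bank][MY_ADDRESS_OFFSET + x] = value & 0xFF


def load(index):
    bank, x = split(index)   # bank switch, then "LDA MyAddress,X"
    return banks[bank][MY_ADDRESS_OFFSET + x]


store(1000, 0xAB)
value_at_1000 = load(1000)
```

Consecutive accesses within the same 256-byte page need no bank switch at all, which is where the "almost at the same cost" comes from.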

In Super Mario Bros 3, the tiledata buffer is 6480 bytes (27 rows * 16 columns * 15 screens). Let's imagine that for whatever reason, we need to increase the data size of each tile from 1 byte to 2 bytes, so that each tile has an additional byte of metadata. That's going to be extremely hard: even if we adjust all our code so that the tile calculations take this new double offset into account, the fact is that we don't have 12960 contiguous WRAM bytes. But you guessed it, just put up two 6480-byte tiledata buffers in WRAM parallel so that to access the second metadata byte we just switch the bank after accessing the first byte.
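The numbers above can be checked with a short Python sketch. The bank assignments and the screen-major offset formula are assumptions made for illustration and are not taken from SMB3 itself.

```python
ROWS, COLUMNS, SCREENS = 27, 16, 15
BANK_SIZE = 8 * 1024
TILE_BANK, META_BANK = 0, 1   # hypothetical bank assignments

tile_count = ROWS * COLUMNS * SCREENS   # size of one buffer in bytes
interleaved_size = 2 * tile_count       # size if both bytes sat side by side

banks = [bytearray(BANK_SIZE) for _ in range(2)]


def tile_offset(screen, row, column):
    # Assumed layout: screen-major, then row, then column.
    return (screen * ROWS + row) * COLUMNS + column


def read_tile_and_meta(screen, row, column):
    # Same offset in both banks; only the selected bank differs.
    offset = tile_offset(screen, row, column)
    return banks[TILE_BANK][offset], banks[META_BANK][offset]


banks[TILE_BANK][tile_offset(2, 5, 7)] = 0x31
banks[META_BANK][tile_offset(2, 5, 7)] = 0x80
tile, meta = read_tile_and_meta(2, 5, 7)
```

One 6480-byte buffer fits in a single 8K bank, while the 12960-byte interleaved version does not, so the offset calculation stays exactly as it was for the 1-byte-per-tile case.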

Now, for the disadvantages!

Obviously, this doesn't work on most mappers. Therefore, it's always going to be a niche technique.

While absolute indexed LDA can index up to 256 different gameobjects, this technique can only do 8. That might not be enough for your needs. Super Mario Bros 3 only allows for 4 enemies onscreen at once, but it obviously depends on each individual project. Luckily you can do the trick more than once, and have 8 important gameobjects, 8 projectile objects, 8 sfx objects, etc.

Maybe you have other important things in WRAM that you need to switch occasionally to, such as tiledata. In such cases you have to back up which gameobject was currently "active" in your code (akin to how you'd back up register X in the vanilla setup).
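The save-and-restore step might look roughly like this in a Python model. It assumes the bank register cannot be read back, so the program keeps its own shadow copy in current_bank. TILE_BANK and the function names are hypothetical.

```python
NUM_BANKS = 8
BANK_SIZE = 0x2000
TILE_BANK = 7   # hypothetical bank holding tiledata instead of a gameobject

banks = [bytearray(BANK_SIZE) for _ in range(NUM_BANKS)]
current_bank = 0   # shadow copy of the bank register


def select(bank):
    global current_bank
    current_bank = bank & 0x07


def read_tile(offset):
    saved = current_bank      # back up the "active gameobject"
    select(TILE_BANK)
    value = banks[current_bank][offset]
    select(saved)             # restore it before returning
    return value


banks[TILE_BANK][0x0123] = 0x42
select(5)                     # gameobject 5 is being updated
tile = read_tile(0x0123)
```

After read_tile returns, the object code carries on as if nothing happened, much like restoring X from a temporary in the indexed version.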

This reminds me of the trick on some 8080/Z80/LR35902 machines (ZX Spectrum, MSX, Game Boy, and Master System/Game Gear, but not ColecoVision/SG-1000) to treat RAM as a 2D array, with the array index in H, the field offset in L, and different arrays using non-overlapping field offset ranges of the same pages. But I don't see it as quite as useful on 6502 because zero page access is only one cycle slower than transfers in and out of X and Y.
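For readers who haven't seen the H/L trick, a small Python model of the address math is below. The base page and the field offsets are invented for the example; only the (H << 8) | L composition comes from how a 16-bit HL register pair works.

```python
memory = bytearray(0x10000)

OBJECT_PAGE_BASE = 0xC0   # hypothetical first page used for entries

# Two different arrays share the same pages but use
# non-overlapping ranges of the low byte (the field offset).
ENEMY_HEALTH = 0x00       # enemy fields: 0x00-0x0F (assumed)
BULLET_X = 0x10           # bullet fields: 0x10-0x1F (assumed)


def address(h, l):
    """Compose a 16-bit address the way the HL register pair does."""
    return ((h & 0xFF) << 8) | (l & 0xFF)


def field_address(index, field):
    # Array index goes in H, field offset goes in L.
    return address(OBJECT_PAGE_BASE + index, field)


memory[field_address(2, ENEMY_HEALTH)] = 9
memory[field_address(2, BULLET_X)] = 77
```

Moving to another field of the same entry only changes L, and moving to the same field of another entry only changes H.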

The 8 object limit sounds very restrictive to me... And having different sets of 8 objects separated by type isn't a very good solution because several subroutines have to work with all types of objects (sprite drawing, collision checking, etc.), so you'd end up needing multiple copies/variations of those routines for each type, and for each combination of types that can interact, not to mention the extra logic to pick the right copy/variation to use in each case.

A smaller window to increase the total number of objects, like FrankenGraphics said, would be much more useful. That would also solve the object interaction issue, since an 8KB range would fit up to 4 2KB objects at a time.

As it is now, with only 8 objects, out of which only one is accessible at any given time, I consider this technique worthless. Sure you save a few cycles here and there, but end up wasting those cycles elsewhere, basically negating any performance gains, and at the expense of extra hardware in your cartridge.

Quote:

When updating, animating, and moving a gameobject, you frequently access that gameobject's values. You'll probably be accessing values like Object_XPosition,X, Object_YPosition,X, Object_Width,X, Object_Height,X, Object_XVelocity,X, Object_YVelocity,X, Object_Attributes,X, Object_Direction,X, and you'll probably be accessing them quite a lot. Normally to do this, you have to either ensure that X stays unmolested as the object index, or you have to copy out all those values to the Zero Page RAM before working on them. But, with the WRAM bankswitching you free up the X register, as the currently active WRAM Bank takes up the role of being your "index". That means you are free to write better and faster code utilizing the A, X and Y registers freely.

You can save X in a "CurrentObject" variable in ZP right before calling the object's code, so that if you do need the X register for something else, you can quickly restore it by loading from that variable. If you design your code well, you may not need to do this very often.

Quote:

A common technique for gameobject behavior is to have pointers/addresses to subroutines stored in RAM, ready to be called regularly or under certain conditions. For instance, you might have "Object_DeathAddress: DSW 8" that you use the Indirect Jumping or the Reverse RTS trick to call when the object dies. Or you might have one for "update" that you call every frame. With the WRAM bankswitching, you can do it a lot more efficiently. Just call "JSR Object_DeathAddress" and you are done. Since we don't need to use absolute indexed mode (which JSR lacks) we don't need to jump through all those hoops to call the subroutine.

I don't think it's common for an object's subroutines to be accessed from the outside, so JSR'ing in one object's logic to a subroutine in the same object is not a problem. Object communication is better done via a messaging system ("hey object 17, you've been hit by the player's sword - deal with it"), rather than by direct modification of another object's state.

Quote:

While switching RAM banks to select between gameobjects might sound annoying and inefficient, you actually do this a lot less than you'd think. When looping over and updating the 8 gameobjects every frame, we start by switching the RAM bank at the top of the loop, and that's it. We can now invoke all manner of code, for moving the object, checking object collisions against tiles, drawing out sprites, and all of the code simply assumes that the various Object_* labels are pointing to the one object they should be working on. It actually makes your code a lot more clean and straightforward.

If you need more than 8 objects though, you'll still be doing a lot of extra checks and indexing in order to access the different groups of 8 objects.

Quote:

One thing you might get hung up on is when two gameobjects need to interact, for example to check collision. There are a number of easy ways to solve this. If your mapper allows for more than one place to bankswitch WRAM (like MMC5 does), simply bankswitch in two banks of WRAM at the same time in two separate locations. You can have gameobject 1 at $6000-$7FFF and gameobject 2 at $8000-$DFFF, to easily compare them. You haven't seen a fast and clean collision check routine until you have seen one that doesn't need register X for gameobject 1 and register Y for gameobject 2. If your mapper can only bankswitch WRAM in one location, you can quickly copy gameobject 2's values temporarily into Zero Page RAM for the same experience.

Not being able to see more than one object at a time is completely unacceptable. Copying the values of one of the objects to RAM completely negates the benefits of not using indexing, IMO.

Quote:

With "LDA MyAddress,X" you can access 256 bytes indexed by X. Sometimes 256 is clearly not enough though, such as the tiledata for your level. In cases like that, you gotta use some other trick like "LDA (CalculatedAddress),Y" Indirect addressing, which is expensive and clunky to set up for each access. However, with WRAM bankswitching, you can stack the 256 bytes in WRAM parallel for up to 256*8 = 2048 bytes while still using "LDA MyAddress,X", almost at the same cost. It's not enough for everything but it's still neat.

You can do that with partially unrolled code, without the overhead of switching banks.

Quote:

In Super Mario Bros 3, the tiledata buffer is 6480 bytes (27 rows * 16 columns * 15 screens). Let's imagine that for whatever reason, we need to increase the data size of each tile from 1 byte to 2 bytes, so that each tile has an additional byte of metadata. That's going to be extremely hard: even if we adjust all our code so that the tile calculations take this new double offset into account, the fact is that we don't have 12960 contiguous WRAM bytes. But you guessed it, just put up two 6480-byte tiledata buffers in WRAM parallel so that to access the second metadata byte we just switch the bank after accessing the first byte.

If you need 12960 bytes of tile data in an NES game, you're probably doing it wrong. Also, having to switch banks twice for every entry you need to access is a hell of an overhead! But if you are in fact managing a large dynamic world (e.g. Sim City), then yeah, there's no way around using bank switchable RAM.

Thanks for the brutally honest reply. I much prefer honesty over "good job kiddo".

tokumaru wrote:

The 8 object limit sounds very restrictive to me... And having different sets of 8 objects separated by type isn't a very good solution because several subroutines have to work with all types of objects (sprite drawing, collision checking, etc.), so you'd end up needing multiple copies/variations of those routines for each type, and for each combination of types that can interact, not to mention the extra logic to pick the right copy/variation to use in each case.

I agree, if you want more than 8 objects I'd not use this technique. The separate stuff was more about how you can use this technique for separate systems, for instance, if you have tiny particle objects, you can have a separate array of 8 of them, as opposed to using gameobjects to do tiny dust particles and pain stars, since they wouldn't be sharing any subroutines anyways.

tokumaru wrote:

You can save X in a "CurrentObject" variable in ZP right before calling the object's code, so that if you do need the X register for something else, you can quickly restore it by loading from that variable. If you design your code well, you may not need to do this very often.

Sure, but sometimes you'll find yourself in an annoying situation where you want to use both the X and Y register for something in a loop, and you'll be forced to constantly restore X from your CurrentObject variable in each iteration. It's just a neat little bonus, nothing groundbreaking. After switching from X indexing to bank indexing I managed to shave off a few cycles here and there and was happy with that.

tokumaru wrote:

I don't think it's common for an object's subroutines to be accessed from the outside, so JSR'ing in one object's logic to a subroutine in the same object is not a problem. Object communication is better done via a messaging system ("hey object 17, you've been hit by the player's sword - deal with it"), rather than by direct modification of another object's state.

There are a few things that get accessed from the outside, the general purpose "update object" subroutine being the most obvious one. JSR Object_Update costs 6 cycles. Unless I'm mistaken, executing a jump-table jump while still returning to our current location costs 21 cycles fully optimized. With 8 game objects, that's 120 cycles saved.
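The arithmetic behind the 120-cycle figure, as a quick Python check. The 21-cycle cost is the estimate given in the post and is not verified here. The 6 cycles for JSR and the 3 cycles for an absolute JMP are standard 6502 timings. The trampoline variant is an assumption: it covers the case where the bytes at Object_Update are themselves a JMP to the real routine, which the post does not spell out.

```python
JSR_CYCLES = 6              # 6502 JSR absolute
JMP_ABS_CYCLES = 3          # 6502 JMP absolute
JUMP_TABLE_CYCLES = 21      # estimate quoted in the post, not verified
OBJECTS = 8
NTSC_CYCLES_PER_FRAME = 29780.5   # approximate CPU cycles per NTSC frame

# Direct JSR into the banked label, as described in the post.
saved_per_frame = (JUMP_TABLE_CYCLES - JSR_CYCLES) * OBJECTS

# Assumed variant: the label holds a JMP trampoline to the real routine.
saved_with_trampoline = (
    JUMP_TABLE_CYCLES - (JSR_CYCLES + JMP_ABS_CYCLES)
) * OBJECTS

share_of_frame = saved_per_frame / NTSC_CYCLES_PER_FRAME
```

Either way the saving is a fraction of a percent of the frame, which fits the "neat little bonus" framing used earlier in the thread.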

The messaging system would also save a little, but it's not as important as it doesn't happen every frame.

tokumaru wrote:

If you need more than 8 objects though, you'll still be doing a lot of extra checks and indexing in order to access the different groups of 8 objects.

Yeah, agreed. As I mentioned, if you need more than 8 objects, I just wouldn't go for this technique at all. But a lot of games simply don't. SMB3 can have 4 enemies, and I can't recall ever seeing a Megaman, Ducktales or Castlevania game with more than 8 enemies. Contra would definitely run into trouble though!

Usually, on a platform like the NES, you don't make stuff like special effects and bullets the same kind of object as the enemy or the player. Code reuse isn't that great when their roles are so utterly different.

tokumaru wrote:

Not being able to see more than one object at a time is completely unacceptable. Copying the values of one of the objects to RAM completely negates the benefits of not using indexing, IMO.

To be fair, you would only need to copy the values relevant for the collision test. I wouldn't be surprised if a lot of developers already do that to free up one of the registers when constantly comparing gameobject 1 with gameobject 2.

But with the MMC5, you can "see more than one object", you can load your WRAM into the pages you'd usually have PRG ROM. That means I get to compare Object_XPosition with Object2_XPosition, which is clean, fast, and straightforward.

tokumaru wrote:

You can do that with partially unrolled code, without the overhead of switching banks.

Oh, tell me! I want to know how.

tokumaru wrote:

If you need 12960 bytes of tile data in an NES game, you're probably doing it wrong.

I mean, sure, it's a lot, but it's the exact amount of bytes SMB3 uses for tile data, times two. Is that really so forbidden? Is SMB3 just on the borderline of legality? If you want SMB3 levels but with some extra metadata for tiles you go right to jail, don't collect 200$ if you pass go?

tokumaru wrote:

I don't think it's common for an object's subroutines to be accessed from the outside, so JSR'ing in one object's logic to a subroutine in the same object is not a problem. Object communication is better done via a messaging system ("hey object 17, you've been hit by the player's sword - deal with it"), rather than by direct modification of another object's state.

The Smalltalk programming language uses the word "message" to mean a virtual method call, and so does Objective-C because it shoehorns a lot of approximations of Smalltalk concepts into C. The 6502 equivalent of this sort of messaging would involve pushing the object index of the sender, loading the object index of the receiver into X, calling the receiver's method, and restoring the sender's object index.

Or did you mean to associate a queue of asynchronously handled messages with each object? If so, you'll need storage for that queue and some sort of handling for when an object's queue of incoming messages fills. How do you propose to arrange that storage?

As for the 12960 byte map issue, if your map is that big, it's probably much more dynamic than that of Super Mario Bros. 3. That's the world data size I'd more often expect to see associated with something like Animal Crossing, where any of a few thousand objects can be left in any outdoor or indoor grid space. And yes, it's beyond the capability of an unmodified MMC3, though I've in the past proposed an extension to MMC3 to put a WRAM bank number in the unused bits of $A001.

That's an interesting concept. I've mostly been relying on having a few standard "public" subroutines that gameobjects expose to the outside, like update, collide, hurt, and death. My game engine calls them when the conditions are right (or every frame for update) along with some additional parameters about the situation.

Quote:

Thanks for the brutally honest reply. I much prefer honesty over "good job kiddo".

I didn't mean to be rude, sorry if I went too far! Anyway, I was judging this as a general solution for handling objects, and considering the need for not so common/cheap hardware configurations and all the other cons, I didn't consider the pros significant enough to justify this. But if you feel like your own game is benefiting from this setup, that's great, and it's nice of you to share the idea in case others find it useful for them as well.

Quote:

sometimes you'll find yourself in an annoying situation where you want to use both the X and Y register for something in a loop, and you'll be forced to constantly restore X from your CurrentObject variable in each iteration. It's just a neat little bonus, nothing groundbreaking. After switching from X indexing to bank indexing I managed to shave off a few cycles here and there and was happy with that.

Yeah, I guess. In the end it will depend on which order and how frequently you interleave complex object logic with object attribute manipulation.

Quote:

There are a few things that get accessed from the outside, the general purpose "update object" subroutine being the most obvious one. JSR Object_Update costs 6 cycles. Unless I'm mistaken, executing a jump-table jump while still returning to our current location costs 21 cycles fully optimized. With 8 game objects, that's 120 cycles saved.

Yeah, you can save a bit of time, but there are cases where you'd want the update address to be dynamic (i.e. different update addresses depending on the state of the object), in which case you'd want to use an indirect JMP or RTS trick anyway, so again, it depends on how you do things.

Quote:

Yeah, agreed. As I mentioned, if you need more than 8 objects, I just wouldn't go for this technique at all. But a lot of games simply don't. SMB3 can have 4 enemies, and I can't recall ever seeing a Megaman, Ducktales or Castlevania game with more than 8 enemies. Contra would definitely run into trouble though!

NES games should target a low number of on-screen objects for obvious reasons, but there's the fact that objects live a little bit outside of the screen too, maybe even in both axes depending on the type of scrolling you have. And there may also be invisible objects for special purposes, such as triggers for changing palettes, paths, patterns, etc., that could add up to the number of active objects even if they don't have any visual representation on the screen.

Quote:

Code reuse isn't that great when their roles are so utterly different.

Not logic code, but collision checking and drawing for example are supposed to work the same across all object types.

Quote:

To be fair, you would only need to copy the values relevant for the collision test. I wouldn't be surprised if a lot of developers already do that to free up one of the registers when constantly comparing gameobject 1 with gameobject 2.

I'll have to check my own collision code to be sure, but if I'm not mistaken, I came up with a pretty fast collision system using X and Y to point to the two objects.

Quote:

But with the MMC5, you can "see more than one object", you can load your WRAM into the pages you'd usually have PRG ROM. That means I get to compare Object_XPosition with Object2_XPosition, which is clean, fast, and straightforward.

That's much better, the only problem is that the MMC5 is not a very accessible mapper.

Quote:

Oh, tell me! I want to know how.

It's nothing fancy, you just access data in chunks of a few bytes each loop iteration instead of a single byte. For example:
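One way to read "chunks of a few bytes each loop iteration" is sketched below in Python. This is not necessarily the code tokumaru had in mind. It assumes a 2048-byte buffer walked with a single 8-bit index, where each unrolled line stands in for something like "LDA Data+$100,X" with a different constant base.

```python
PAGE = 256
CHUNKS = 8

# Arbitrary test data: 2048 bytes, as in the 256*8 example.
data = bytearray((i * 7) & 0xFF for i in range(PAGE * CHUNKS))


def sum_unrolled(buf):
    """Visit all 2048 bytes with one 8-bit index and 8 fixed base offsets."""
    total = 0
    for x in range(PAGE):          # X counts 0..255, like the X register
        total += buf[0 * PAGE + x]
        total += buf[1 * PAGE + x]
        total += buf[2 * PAGE + x]
        total += buf[3 * PAGE + x]
        total += buf[4 * PAGE + x]
        total += buf[5 * PAGE + x]
        total += buf[6 * PAGE + x]
        total += buf[7 * PAGE + x]
    return total


checksum = sum_unrolled(data)
```

The trade-off is code size: every extra 256-byte chunk costs another copy of the loop body, but no bank switch and no indirect pointer setup.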

I personally consider SMB3 very wasteful of RAM, and often defend that very similar, if not identical, level layouts and destructibility could be achieved with just the 2KB of built-in RAM.

This is probably a case where commercial "rules" of development stepped in and they figured that they could get the job done quick and help the project fit the budget, or get the job done elegantly (slower) and risk blowing the budget. If other design decisions pointed towards justifying the use of MMC3, they might as well...

In homebrew, the stakes are a little different. It's more about if your game will see a release this year or not and how good you are at enduring progress (or visual evidence thereof) being slow. Then again, if blowing up proportions of RAM requirements invents new problems to solve, that's no good.
