To me, the biggest advantage of using the stack is free indexing. This makes it easy to transfer variable amounts of bytes, without having to waste time setting up indices. The fact that destination addresses and byte counts (in the form of RTSable pointers) also come from the stack makes things even faster.

One way things could be faster than this is if you used ZP for the buffer, but without indexing it for reading, or you'd lose the speed boost, so I don't see how that would work. If you used self-modifiable code to read the correct amount of bytes from the correct memory positions (something that would most likely require extra RAM on the cartridge), you'd actually be better off loading immediate values in the self-modifiable code instead, which would result in even faster transfers.

If anyone can come up with a practical way to read arbitrary amounts of bytes from arbitrary ZP locations without using indexing, I'd love to hear about it.

That only takes care of the byte count, not the position of the data. Sure, I can have an unrolled loop that copies 128 bytes from ZP to $2007, and I can JMP or RTS to start near the end of it and copy only 4 bytes, but it will ALWAYS be the SAME 4 bytes, because there's no indexing whatsoever: the addresses are hardcoded! That's useless for transferring variable amounts of data, because you need to transfer different blocks from the middle of the buffer, not always the last N bytes.

Is hardcoding the addresses really a problem? Exactly how many different combinations of PPU string lengths do you need to handle? (This is rhetorical though; if your problem is really that specific, you need a specific solution, obviously.)

Of course! Say that I have an 8-way scrolling engine, using 4 name tables, which I update by drawing 17-metatile rows and 16-metatile columns as necessary. Depending on the position of the camera, the 17 metatiles of a row will be distributed differently across 2 name tables... It might be 1 on the left and 16 on the right, 16 on the left and 1 on the right, or anything in between. The same goes for columns, with all combinations between 1 + 15 and 15 + 1.

So, for each row of metatiles, I need 4 transfers:

- left part, top half;
- left part, bottom half;
- right part, top half;
- right part, bottom half;

And for columns:

- top part, left half;
- top part, right half;
- bottom part, left half;
- bottom part, right half;

All those blocks of data vary in size depending on the position of the camera. I fail to see how I could read an arbitrary number of bytes from an arbitrary section of the buffer, without using indices.
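To make the size variation concrete, here is a rough Python sketch (a model of the arithmetic, not NES code). It assumes 2x2-tile metatiles, so a name table holds 16 metatile columns and 15 metatile rows; the function names are hypothetical and not taken from anyone's engine.

```python
# Sketch only: assumes 2x2-tile metatiles, 16 metatile columns and
# 15 metatile rows per name table. All names here are hypothetical.

def split(start, per_table, total):
    """Split `total` metatiles, starting at index `start` of the first
    name table, into (count in first table, count in second table)."""
    first = per_table - start
    return first, total - first

def row_transfers(start_col):
    """Tile counts for the 4 transfers of one 17-metatile row."""
    left, right = split(start_col, 16, 17)
    # Each metatile contributes 2 tiles to its top half and 2 to its bottom half.
    return [("left, top", 2 * left), ("left, bottom", 2 * left),
            ("right, top", 2 * right), ("right, bottom", 2 * right)]

def column_transfers(start_row):
    """Tile counts for the 4 transfers of one 16-metatile column."""
    top, bottom = split(start_row, 15, 16)
    # Each metatile contributes 2 tiles to its left half and 2 to its right half.
    return [("top, left", 2 * top), ("top, right", 2 * top),
            ("bottom, left", 2 * bottom), ("bottom, right", 2 * bottom)]
```

Under these assumptions, every camera position gives a different set of four lengths, which is why a fixed-address unrolled loop can't cover them all with a single routine.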

A video update buffer is supposed to be dynamic, you don't know what's going to be put there or in what order... And the same type of data could have different lengths each time (like happens with rows and columns of metatiles, which can be distributed differently across the 4 name tables).

Sure, I could have separate buffers for each task (palette updates, metatiles updates, etc.), each with its own transfer routine reading from hardcoded addresses. In a game that only scrolls in 1 direction, even rows and columns could be handled this way, since the number of bytes to copy would always be the same. But this is not versatile. If you don't need one type of update in a frame, you can't reuse the memory for another type of update, meaning you'd waste more of the precious ZP memory than you'll actually need each frame.

And the 8-way scrolling issue I presented above is practically impossible to solve using hardcoded addresses unless you code a routine for each possible distribution of metatiles, times 2 (row + column) times 2 halves, which would mean an insane waste of ROM.

I guess I can see a simpler engine settling for the hardcoded address approach, though. If you only scroll in one direction, 8 pixels at a time, there's nothing too bad about an unrolled loop that copies 30 or 32 tiles from a constant location. Or if you need to update up to 4 patterns in CHR-RAM per frame, you could have an unrolled loop to copy 64 bytes from another fixed location, which could be filled backwards, and different entry points in the unrolled loop would allow copying 4, 3, 2 or 1 pattern(s).
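As an illustration of the CHR-RAM case, the Python sketch below models where the entry points would fall. It assumes each unrolled copy is an LDA zero page / STA $2007 pair (5 bytes of code) and that the loop reads a 64-byte buffer front to back; both the constants and the function name are hypothetical.

```python
# Hypothetical model of the unrolled loop. Assumes each copy is
# LDA zp / STA $2007 (2 + 3 = 5 bytes of code) reading a 64-byte buffer.
BYTES_PER_COPY = 5
PATTERN_SIZE = 16   # one 8x8 pattern takes 16 bytes of CHR data
MAX_PATTERNS = 4

def entry_for_patterns(n):
    """Return (code offset of the entry point, first buffer index read)
    when only the last `n` patterns of the loop are copied."""
    if not 1 <= n <= MAX_PATTERNS:
        raise ValueError("n must be between 1 and 4")
    skipped = (MAX_PATTERNS - n) * PATTERN_SIZE
    return skipped * BYTES_PER_COPY, skipped
```

Because a late entry point always reads the tail of the buffer, the data has to be placed at the end first, which is what filling it backwards achieves.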

The possible settings are far from arbitrary though, and the use of memory is not very flexible. For games with heavier VRAM usage (an extreme example would be Battletoads), I don't think this would be feasible.

To me that sounds like a paradox really, considering that a simple game is less likely to need a speed boost in VRAM transfers than a complex game with lots of dynamic updates.

I think one of the advantages of the PLA/STA $2007 loop is that you can determine the point to jump in by shifting the negative number of bytes to the left by 2.

Can you elaborate on this? I'm not sure I understand, but I'd like to.

tokumaru wrote:

To me, the biggest advantage of using the stack is free indexing. This makes it easy to transfer variable amounts of bytes, without having to waste time setting up indices. The fact that destination addresses and byte counts (in the form of RTSable pointers) also come from the stack makes things even faster.

I'm not sure I understand. If you have a buffer in ram somewhere, you should be able to read it using lda $0X00,y at a cost of 4 cycles. Oh, I think I see what you're getting at, that you don't have to deal with y thus eliminating an ldy and a dey? Yeah, okay, I didn't think of that lol. That is pretty cool. I wonder if I have enough stack to use this now lol.

Quote:

If anyone can come up with a practical way to read arbitrary amounts of bytes from arbitrary ZP locations without using indexing, I'd love to hear about it.

Nope, just the annoying way of an unrolled loop.

However, if your copies are large, you could have two separate lists: one with AddressHi, AddressLo, and Length, the other would be in zero page with data. NMI would read from the first list, but copy data from the second list using an unrolled loop from zero page. That reads each byte in 3 cycles (LDA zero page) rather than 4. Or you could just pull from the stack or use a buffer in ram. I like the easy solutions.

I think the paradox comes from thinking about generic problems rather than specific problems. There's like a billion different ways to do multidirectional scrolling, or the myriad of other tasks you're rifling through here. You're trying to solve too many problems with the same code, and you're over-constrained. You can always add constraints until a problem becomes impossible. There's a lot of ways to make the ZP work, if you want it.

Personally, I don't think you really need ZP for this; how much ZP space is worth devoting to the NMI? The answer for most games would likely be smaller than what you can already push to the PPU by other means.

I think one of the advantages of the PLA/STA $2007 loop is that you can determine the point to jump in by shifting the negative number of bytes to the left by 2.

Can you elaborate on this? I'm not sure I understand, but I'd like to.

He means that since each instance of PLA/STA $2007 assembles to 4 bytes of ROM, you can use bit shifting to calculate the address to jump to in order to copy a specific number of bytes. Want to copy 5 bytes? Jump to EndOfCopyCode - 5 * 4. Personally, I don't see this as much of an advantage: a lookup table of entry points would be just as easy and probably just as fast.
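The address arithmetic can be sketched in Python as follows. This is only a model of the calculation: the address passed as `end_of_copy_code` is whatever the assembler assigns, and the helper names are hypothetical. The second function shows the adjustment needed for an RTS-able pointer, since RTS adds 1 to the address it pulls.

```python
INSTANCE_SIZE = 4   # PLA (1 byte) + STA $2007 (3 bytes)

def entry_point(end_of_copy_code, count):
    """Address to jump to in order to copy `count` bytes."""
    neg = (-count) & 0xFFFF        # negative byte count, 16-bit two's complement
    offset = (neg << 2) & 0xFFFF   # shifting left by 2 multiplies by 4
    return (end_of_copy_code + offset) & 0xFFFF

def rts_pointer(entry):
    """(low, high) bytes to store so that RTS lands on `entry`.
    RTS pulls the low byte first and adds 1 to the pulled address."""
    target = (entry - 1) & 0xFFFF
    return target & 0xFF, target >> 8
```

For example, with the unrolled code ending at a made-up address of $C400, `entry_point(0xC400, 5)` gives $C3EC, which is $C400 - 5 * 4.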

Quote:

Oh, I think I see what you're getting at, that you don't have to deal with y thus eliminating an ldy and a dey? Yeah, okay, I didn't think of that lol.

Yeah, that's the idea. Technically, the dey could be eliminated anyway if you used sequential addresses (i.e. LDA Buffer+0, y/LDA Buffer+1, y/LDA Buffer+2, y/etc.) in the unrolled loop, but Y would still need to be updated for each transfer, so that's something we can avoid by using the stack method. ROM usage is lower too, since LDA $XXXX, y/STA $2007 is 6 bytes while PLA/STA $2007 is only 4 bytes.
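For reference, the per-byte size and cycle costs of the variants discussed in this thread can be tallied with a small Python table. The figures are the standard 6502 instruction sizes and base cycle counts; the table layout and function name are just an illustration.

```python
# (size in bytes, base cycles) for each instruction form
COSTS = {
    "PLA":       (1, 4),
    "LDA zp":    (2, 3),
    "LDA abs,Y": (3, 4),   # +1 cycle when the index crosses a page
    "LDA #imm":  (2, 2),
    "STA abs":   (3, 4),   # STA $2007
}

def per_byte(load):
    """(ROM bytes, cycles) for one load + STA $2007 pair."""
    size, cycles = COSTS[load]
    sta_size, sta_cycles = COSTS["STA abs"]
    return size + sta_size, cycles + sta_cycles
```

PLA and LDA abs,Y both come out at 8 cycles per byte, so the gain from the stack is the smaller code and not having to set up Y, while the zero page and immediate forms save 1 and 2 cycles per byte respectively.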

I think the paradox comes from thinking about generic problems rather than specific problems. There's like a billion different ways to do multidirectional scrolling, or the myriad of other tasks you're rifling through here. You're trying to solve too many problems with the same code, and you're over-constrained.

Well, this is part of my recent methodology of development, which is actually working out pretty well for me. I've been trying to come up with the most generic solutions possible, and my programs are way less confusing than when I had a lot of specific routines for each little task. The previous iteration of my scrolling routine did in fact have separate buffers for everything, and dedicated transfer routines, but there was practically no flexibility in how the time was used (e.g. the routine that updated up to 8 arbitrary metatiles would always claim an entire update slot for itself even if there was only 1 metatile to update).

Now, since I have a completely dynamic buffer, I can stuff whatever type of update I want in it, no matter how small or how large, and as long as it fits in the buffer, I know it will be handled during the next vblank.

My current solution is good enough for my needs: my buffer is 212 bytes long, so if I were to use it all in a single update I could upload 208 bytes to VRAM. This is hardly ever the case though, because each transfer has an overhead of 4 bytes: 2 for the target address and 2 for the RTSable entry point. It's not like I'm desperately chasing after a faster method for updating VRAM; I was just curious if someone was able to come up with a fast generic solution using ZP.
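The capacity arithmetic works out as in this small Python sketch. It uses only the figures given above (212-byte buffer, 4 bytes of overhead per transfer); the function names are hypothetical.

```python
BUFFER_SIZE = 212   # bytes available in the update buffer
OVERHEAD = 4        # 2 bytes target address + 2 bytes RTS-able entry point

def max_payload(transfers):
    """Most data bytes that fit when split across `transfers` transfers."""
    return BUFFER_SIZE - OVERHEAD * transfers

def fits(payload_sizes):
    """True if transfers with these payload sizes fit in one buffer."""
    total = sum(payload_sizes) + OVERHEAD * len(payload_sizes)
    return total <= BUFFER_SIZE
```

A single transfer leaves room for 208 data bytes, and every additional transfer in the same frame costs another 4.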

Quote:

Personally, I don't think you really need ZP for this; how much ZP space is worth devoting to the NMI? The answer for most games would likely be smaller than what you can already push to the PPU by other means.
