Apparently the blender did have an additive mode, but it didn't clamp the output. Assuming this is true, it explains a lot. The lack of additive transparency had a profound effect on the visual character of the system; if I could change one thing about the N64, this might be it.

Is this behaviour easy to change in microcode, or would you have to go the long way (load the framebuffer as a texture, do the addition and clamping in the render units) and kill your fillrate? Or is it outright impossible?

From what I've heard, Tepples is correct. If I'm not mistaken, the RSP gives the RDP different parameters for drawing something, (does the geometry) and the RDP renders it. Getting the RSP to actually draw anything (if it's possible at all) would probably be prohibitively slow. (It would have to be done in software.)

I'm surprised the lack of additive blending is what you would first change about the N64. Most people complain about the low resolution textures (small texture cache, although I think cartridge space had just as much to do with it) or the "Vaseline covered" video. (I have no idea what this is caused by, but it seems more prevalent is some games than others.)

The texture cache was twice the size of the one in the PSX, and as Rare showed it was certainly possible to work around it. Nintendo/SGI were just stingy with their tools. The RDRAM was high-latency, sure, but it was very fast and there weren't a lot of better options available at the time. I'm sure the CPU/RAM interface could have been improved, but I'm not a hardware engineer so I don't know what you'd have to sacrifice to do that. It'd be a major change for sure. The cartridge format had advantages and disadvantages vs. CDs, and the lack of FMV never bothered me; IIRC the N64 basically pioneered the "in-engine cutscene" for this very reason. And I'm pretty sure you could turn off both antialiasing and texture filtering if you really wanted to.

But the blender not clamping its output is just a straight-up dumb move. Take a look at this:

Attachment:

bodyharvest1.jpg [ 7.2 KiB | Viewed 2849 times ]

Or this:

Attachment:

naboo1.jpg [ 17.78 KiB | Viewed 2849 times ]

Notice how despite the fact that they look like they're from different generations, they both have the same problem. Faking it with alpha just doesn't cut it. The only thing that looks decent is pure white.

From the looks of things, the color combiner clamps its output (or at least it clamps the chroma key operation, and the description of the blender - which is immediately downstream of the color combiner - asserts that all the inputs are clamped), and the formula does look flexible enough to do additive transparency. Unfortunately it doesn't operate on framebuffer pixels, meaning you would indeed have to do this the long way. Maybe the really long way, if you can't convince the RDP to combine transformed and untransformed texture data...

Last edited by 93143 on Sun Aug 27, 2017 1:41 am, edited 2 times in total.

What I mean is that you can't use the current framebuffer pixel directly as a source for the color combiner. That's the blender's job. It seems like if you want to do additive transparency with the color combiner, you have to get the framebuffer in the front end of the pipeline somehow, and the easiest way to do that is probably through the texture cache.

Now the question is: can you combine a transformed, perspective-corrected trilinear-interpolated texture map with a linear bitmap this way? If not, you might have to draw each additive object to a blank buffer and load it in together with the relevant section of the framebuffer to do the combination properly. (Maybe I should look into how environment mapping works.)

You could also just convert the scene to 24-bit (without expanding the RGB values) to give yourself some headroom for the blender, and then just clamp it back to 15-bit after drawing all the additive objects. But that only gives you enough room for about 7 whites stacked on top of a light background before it overflows...

The point is, all of this extra maneuvering costs fillrate, which is the bottleneck on the N64. It'd probably be okay for a modest number of objects, but a busy battle scene full of lasers and explosions could be a problem. As a non-expert, I'm not sure how much of a problem exactly, but if Nintendo/SGI had just clamped the output of the blender this wouldn't be an issue.

I'm sorry if this seemed like a bit of a rhetorical question. It wasn't. The responses actually resulted in me getting [what I think is] a better understanding of how the RCP works, once I cross-checked what people were saying with what I thought I remembered from the manual.

So I guess I got what I asked for, if not exactly what I wanted.

...

The N64 is a bit frustrating for me, since I'm reasonably sure I will never be able to produce a game that maxes out the hardware. It's too much work. The SNES is on the border, but this... I maintain that BotW is probably possible on N64. The idea that I might be able to prove it some day - even with a team behind me - is laughable. That's just an example, but it's illustrative.

It's fun to read about the N64's internals, but I think I'll end up sticking to SNES. Just finishing my current project is a big enough challenge...

I'm just saying that there was a surprising amount of overlap given the power difference, which may be attributable to the N64 being hard to program, compounded by the inefficient official microcode holding the system back

I would say for the vast majority of developers any inefficiencies in the official microcode would not have been problem, because they would have been memory bandwidth bound well before they were at the stage to push RSP hard. Rather, the issue was that the official microcode was built around the use of a z-buffer, which guzzled away at precious memory bandwidth (as noted in my previous post, the N64 did not have significantly higher "peak" memory bandwidth than its console competition anyway).

The more technically advanced N64 games, with memory bandwidth under a certain level of control, could certainly have put pressure on RSP. For example, Conker with its multiple light sources, or the Factor 5 games with their relatively complex terrain generation from heightmaps. I believe the consensus among developers was that SGI did not get the balance right in the official microcode between vertex/lighting quality and performance (probably owing to their workstation background), which is why the top-level developers had to tweak provided microcodes or write their own. SGI did actually provide updated versions of the official microcode throughout the N64's life which would have made life a fair bit easier. As for the official audio microcode, it seems to have been unnecessarily resource hungry, maybe because SGI did not have the experience of an audio hardware company. For developers like Factor 5 who needed to squeeze more performance out of RSP, they wrote their own audio 'driver'.

93143 wrote:

Comparing bad N64 games with good PlayStation games can make it look like the PlayStation was more powerful, which is not something you tend to get with (say) the NES vs. the SNES

I don't think I quite understand your point here. It has always been true that the best programmed games on weaker hardware in a console generation look better than the worst programmed games on stronger hardware in the same generation. For example, the Xbox was unequivocally stronger than the PS2 (blending aside, thanks to eDRAM) but Xbox game Bruce Lee: Quest of the Dragon looked worse than the older PS2 game The Bouncer.

The comparison between NES and SNES is not legitimate because the technical difference between the two consoles was several times that of PS1 and N64, or PS2 and Xbox. A sufficiently large generational difference in technology and tools will usually rescue even poorly programmed games from looking less graphically advanced than older titles.

93143 wrote:

Speaking of RAM, what was the latency like on the RCP side? I've heard figures as high as 600 ns for a CPU cache miss, but I can't imagine that's representative of RCP random access

600ns for a CPU cache miss sounds right. Of course that timing also accounts for the memory controller being built into RCP and being physically external to the CPU. That means the latency for RCP's random access will definitely be lower but I'm not sure exactly by how much.

93143 wrote:

Also, I believe the N64's RAM was divided into four banks; did this have any relevance to latency?

It is divided in two banks, with color intended to go into one and z into the other (though not mandatory), but the banks are not simultaneously accessible. Yes, it's supposed to greatly decrease latency through address pipelining. Because RDRAM does row and column addressing separately, you should theoretically be able to change banks without incurring significant latency even though the access is going outside the current page (which should be useful when switching between color and z buffers). Unfortunately, like most things related to RDRAM, in practice it didn't work nearly that well so you still got much of the latency. Though I'm unsure if the fault was with the RDRAM design itself or the memory controller in RCP.

EDIT: Oops. Checked into it. The RDRAM really is divided into four banks.

And yet, it seems that certain late games, such as World Driver Championship, managed to match or exceed the PlayStation's advertised capabilities (180,000 textured tris per second) while maintaining the additional features that made N64 polygons so power-intensive...

World Driver Championship didn't use a z-buffer, so that meant it was far less memory bound than most other N64 games. As RSP is both wider and clock-faster than GTE, and RDP has almost double the pixel-fill rate of PS1's GPU (even with most features turned on) it's unsurprising that N64 can put out considerably more and better polygons than Playstation when memory bandwidth is out of the way. Not having a z-buffer won't be good in every situation though, so that's why SGI would have considered z-buffering the sensible "default option". EDIT: More choice for developers would have been even better...

93143 wrote:

I'm sorry; I can't accept that without more detail. What's wrong with the methods I proposed? I could guess, but that's not a path to edification

My apologies, I was simply describing what would have been SGI's intended use case for RCP's additive blending. I have the following points to make about the theoretical workaround described in your earlier post.

1) Yes, the color combiner does clamp its output.

2) Copy mode won't increase the speed of copying the framebuffer into TMEM (that would continue to be memory bus bound anyway). All copy mode does is make pixels go through RDP's pipeline faster by skipping almost all processing. So rather than making framebuffer -> TMEM go more quickly, it would make TMEM -> framebuffer go faster. The purpose is to make 2D tiling really fast.

3) I don't think you can have reliable pixel-level control over primitives already dispatched to RDP (not without insane cycle counting across the whole console, anyway). So you can't change RDP attributes or modes on individual primitive pixels and get expected results every time. Reliable per-primitive changes are possible through the synchronization function though.

But I think what you are describing is definitely possible with the color combiner in two cycle mode. The color combiner maths is newcolor = (A - B) x C + D. So you could write your own color combiner function like this newcolor = (1 - 0) x TEXEL0 + TEXEL1, with TEXEL0 being the texture for blending, and TEXEL1 being the framebuffer chunk both loaded into TMEM. You could still use the result of this in a second color combiner operation. The N64's little 'pixel shader' may yet save the day.

I haven't given sufficient thought as to what this would mean for the blender (including anti-aliasing or fog) so I may get back to you on that. But as you can imagine, this would be hell on TMEM. I still think it may all be madness in practice, but maybe it is not.

There's no way to safely pre-clamp the pixels if you want to use the blender, except as I stated before, be really conservative and cautious with the hardware additive blending feature as SGI intended. I suppose you could do the additive blending with the CPU if you wanted for some software rendering. Since the N64 has a unified memory architecture, the CPU can just read/write straight to the framebuffer. Hardly an efficient use of resources, but it's there.

Last edited by realityengine on Sun Apr 08, 2018 1:21 pm, edited 4 times in total.

Comparing bad N64 games with good PlayStation games can make it look like the PlayStation was more powerful, which is not something you tend to get with (say) the NES vs. the SNES

I don't think I quite understand your point here. It has always been true that best programmed games on weaker hardware in a console generation looks better than the worst programmed games on stronger hardware in the same generation. [...]

The comparison between NES and SNES is not legitimate because the technical difference between the two consoles was several times that of PS1 and N64, or PS2 and Xbox. A sufficiently large generational difference in technology and tools will usually rescue even poorly programmed games from looking less graphically advanced than older titles.

Could comparison of graphical production values between inter-generation consoles and both the best of the previous gen and the worst of the next gen be valid? Examples could include good NES games vs. bad TG16 games, or good TG16 games vs. bad Super NES games. Or ColecoVision/MSX between Atari 2600/Intellivision and NES, or Jaguar and 3DO between Genesis/Super NES and Saturn/PlayStation.

realityengine wrote:

600ns for a CPU cache miss sounds right. Of course that also accounts for the latency of the memory controller being built into RCP and existing externally to the CPU.

GPU-on-the-northbridge is also seen in Xbox 360 and Intel GMA. How do those overcome it? Is the embedded VRAM in the Xbox 360's Xenos northbridge/GPU the key?

realityengine wrote:

World Driver Championship didn't use a z-buffer, so that meant it was far less memory bound than most other N64 games. As RSP is both wider and clock-faster than GTE, and RDP has almost double the pixel-fill rate of PS1's GPU (even with most features turned on) it's unsurprising that N64 can put out considerably more and better polygons than Playstation when memory bandwidth is out of the way. Not having a z-buffer won't be good in every situation though, so that's why SGI would have considered z-buffering the sensible "default option".

What's the standard workaround for lack of depth buffering? I've heard a few, such as running bin sort on triangles or precomputing each object's ordering from each of the 8 corners of the bounding cube. Is there a way to write Z but not read Z, in order to have geometry intersect a transparent water plane?

Could comparison of graphical production values between inter-generation consoles and both the best of the previous gen and the worst of the next gen be valid? Examples could include good NES games vs. bad TG16 games, or good TG16 games vs. bad Super NES games. Or ColecoVision/MSX between Atari 2600/Intellivision and NES, or Jaguar and 3DO between Genesis/Super NES and Saturn/PlayStation.

I'm not setting a strict rule. Of course it is possible to make even a PS4 game have less advanced 3D visuals than a Sega Saturn game! It just gets less and less likely as tech and tools get better. A comparison IMO is really only valid when the specs are sufficiently close. 'Sufficiently' is subjective.

tepples wrote:

GPU-on-the-northbridge is also seen in Xbox 360 and Intel GMA. How do those overcome it? Is the embedded VRAM in the Xbox 360's Xenos northbridge/GPU the key?

The GPU being on the same die as the memory controller is actually an advantage to the GPU, since it can access RAM with less latency than the CPU. So I'm not sure what exactly you mean by those GPUs needing to overcome something. Either way, it's not an absolutely massive difference. Intel CPUs only stopped having external memory controllers with Core i7 in 2008. IIRC AMD moved to internal memory controllers with Athlon 64 in 2003. Yet, AMD's Phenom demonstrated no substantial memory advantages to Intel Core 2 Duo. The Athlon 64 demonstrated memory advantages over the Pentium 4, but maybe that was just because the Pentium 4's architecture wasn't particularly good.

tepples wrote:

What's the standard workaround for lack of depth buffering? I've heard a few, such as running bin sort on triangles or precomputing each object's ordering from each of the 8 corners of the bounding cube.

The Z-Sort microcode, which was an official (yet feature incomplete) N64 microcode that formed the basis for Boss Studios titles (e.g. World Driver Champion) allowed developers to provide any suitable triangle sorting algorithm which would then be used to generate back-to-front display lists.

BSP trees were also very popular. There was also the Crash Bandicoot approach of just pre-computing the depth buffer on workstations (meant no user camera control though).

tepples wrote:

Is there a way to write Z but not read Z, in order to have geometry intersect a transparent water plane?

Comparing bad N64 games with good PlayStation games can make it look like the PlayStation was more powerful, which is not something you tend to get with (say) the NES vs. the SNES

I don't think I quite understand your point here. It has always been true that the best programmed games on weaker hardware in a console generation look better than the worst programmed games on stronger hardware in the same generation.

Part of the problem is that it's difficult to be precise when I'm trying to express my interpretation of my impression of other people's opinions. What I mean is that the N64 (for debatable reasons) failed to convince a certain subset of gamers that it was even more powerful than the PSX at all. There's a popular perception that PSX games tended to be higher-poly than N64 games, and have higher-res textures; even if this is partly or largely false, there's a reason people think this stuff. And there are definitely people who prefer the PlayStation's image quality. Maybe there's fanboyism mixed in there, as there tends to be with Mega Drive audio vs. SNES audio, but as in that case there's something to be said for it. PSX polygons plus high-level artistry could look very impressive, particularly in screenshots, or to someone who'd decided he didn't care about affine swim or vertex jitter, or preferred jaggies to blur. And yes, I believe additive blending helped. Prerendered backgrounds and FMV absolutely helped...

It was possible for the N64 to trounce the PSX pretty much on all counts in terms of real-time graphics, as demonstrated by a handful of high-level games. But it seems to have been unusually difficult to get anywhere near that level, leading to the comparison being much more of a matter of taste than it should have been given the power gap on paper.

The fact that some people think the PS2 was massively more powerful than the GameCube is much more bewildering. Maybe they're just going on published performance numbers (which were extremely apples-to-oranges that gen)...

Quote:

The comparison between NES and SNES is not legitimate because the technical difference between the two consoles was several times that of PS1 and N64, or PS2 and Xbox.

The SNES wasn't all that much more powerful than the NES. Twice the CPU clock speed, twice the word length (but an 8-bit bus that nerfed it a bit, not to mention slow ROM and RAM), similar resolution and video architecture but with higher-quality pixels (kinda like the N64 vs. the PSX) and, uh... six times the video memory bandwidth, with the PPU beefed up to handle it. Yeah, that could be important...

Okay, it's not a great analogy...

Quote:

It is divided in two banks, with color intended to go into one and z into the other (though not mandatory), but the banks are not simultaneously accessible.

Are you sure? I've seen a reference to it having four banks of 1 MB each, and a reference to it having two banks of 1 MB each.

Quote:

Copy mode won't increase the speed of copying the framebuffer into TMEM (that would continue to be memory bus bound anyway).

I was kinda thinking out loud there. I had absentmindedly assumed that copying the framebuffer into TMEM would take the same amount of time as drawing a pixel, which was never true, but that's what I was comparing copy mode to. (I believe copy mode can be used for loading TMEM, so you wouldn't have to do the transfer entirely manually, but I suppose whether using copy mode is a good idea would depend on the implementation.)

Quote:

I don't think you can have reliable pixel-level control over primitives already dispatched to RDP (not without insane cycle counting across the whole console, anyway)

That's what I was afraid of. I don't know how mipmapping works, but I assume it just uses the same coordinate transform on a different texture tile. Changing the transform would appear to be out of scope.

Quote:

But I think what you are describing is definitely possible with the color combiner in two cycle mode. The color combiner maths is newcolor = (A - B) x C + D. So you could write your own color combiner function like this newcolor = (1 - 0) x TEXEL0 + TEXEL1, with TEXEL0 being the texture for blending, and TEXEL1 being the framebuffer chunk both loaded into TMEM.

But where would you get TEXEL1? It has to change per-pixel, but not based on the transform you used to get TEXEL0. Can you get the RDP to load one of the constant colour registers from TMEM every other cycle?

It would work fine going the long way, by pre-rendering the additive graphic and reloading it as a texture so you can just step through the pixels one by one (and maybe you could speed that up by pre-rendering and reloading in 8bpp indexed, if the colour profile is sufficiently one-dimensional or if you don't mind nearest neighbour), but I don't see how you get this in one step unless the additive graphic is to be rendered untransformed (like in a 2D sidescroller or something).

...

Wait a minute. The rasterizer generates an RGBA pixel value in addition to the texture coordinates and LOD level. Where does that come from? Is it just the vertex shader value?

Quote:

There's no way to safely pre-clamp the pixels if you want to use the blender

The method I described should work mathematically. It's not an attempt to avoid TMEM thrash; it's just a modification of the color combiner method to maybe improve framebuffer metadata handling. Use the first color combiner cycle to do the clamped add, then use the second cycle to subtract the framebuffer texel from the added and clamped sum (the result of this subtraction cannot overflow, so it won't be re-clamped). The output of this process should be blender-safe by simple arithmetic, since it's exactly the result you want minus the thing you're about to add it to.

It may seem like a waste of math, and it means you can't actually shade the pixels because it uses both color combiner cycles just to pre-clamp them, but if it helps with antialiasing and/or Z-buffering it might be worth it.

Who is online

You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot post attachments in this forum