Yes, I am trying to maximize the number of floating-point operations done per pixel. I have color and alpha gradients on each triangle (I've stopped using quads) and I draw transparent objects from back to front to force overdraw every single time. On level 2 of my test, each pixel should be redrawn about 30 times; on level 3 this is nearer 50 redraws, and so on. The redraws go up linearly, but the drop in performance is now closer to cubic, so the number of vertices seems to be relevant.
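The back-to-front ordering part of this can be sketched roughly like so; Triangle and its depth field are made-up names for illustration, not from my actual test code:

```java
import java.util.Arrays;
import java.util.Comparator;

public class PainterSort {
    public static class Triangle {
        public final float depth; // view-space distance from the camera
        public Triangle(float depth) { this.depth = depth; }
    }

    // Farthest first, so nearer triangles blend on top of farther ones.
    public static void sortBackToFront(Triangle[] tris) {
        Arrays.sort(tris, Comparator.comparingDouble((Triangle t) -> t.depth).reversed());
    }

    public static void main(String[] args) {
        Triangle[] tris = { new Triangle(1f), new Triangle(9f), new Triangle(4f) };
        sortBackToFront(tris);
        for (Triangle t : tris) System.out.print(t.depth + " "); // 9.0 4.0 1.0
    }
}
```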

Blending is handled by the ROPs, so it shouldn't have a very high per-pixel cost in practice. The ROPs are bandwidth limited, and they are built to handle deferred shading and HDR rendering, which use a lot more bandwidth, so blending is almost free, but you might want to test that for your specific card. Color gradients should be free too, since per-pixel interpolation is hardware accelerated and is most likely done per pixel regardless of whether the colors are constant or vary per vertex. The most expensive operation you can add without shaders is texturing, which introduces some bandwidth and math cost for sampling the texture and multiplying in its color.
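For reference, the standard glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) setup the ROPs apply boils down to this per-channel math (a plain-Java sketch, not GPU code):

```java
public class BlendMath {
    // glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA), per channel:
    // out = src * srcAlpha + dst * (1 - srcAlpha)
    public static float blend(float src, float dst, float srcAlpha) {
        return src * srcAlpha + dst * (1.0f - srcAlpha);
    }

    public static void main(String[] args) {
        // 50% opaque white over a black background -> mid grey
        System.out.println(blend(1.0f, 0.0f, 0.5f)); // 0.5
    }
}
```

The ROP does this read-modify-write against the framebuffer for every blended fragment, which is why heavy blending turns into a bandwidth/ROP workload.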

ROPs (Render OutPut units, also called raster operations pipelines) are the hardware units that handle combining the pixel shader output with the framebuffer. This is why we don't have blending shaders but still have the fixed-functionality blending of glBlendFunc() and glBlendEquation().

I'm glad we don't have any shaders for that part of the GPU. Fixed hardware is much faster than shaders, and at this point I can't think of any other functionality needed, except for the already existing blend functions and the abilities of the fragment shader.

Well I must say I am appalled. My new GT430 does not perform significantly better than the 3450 - 10% better at most. It may be that my test is ROP-intensive and the GT430 does not have a lot of ROPs.

The GLProfile tells me the code is being hardware accelerated. I will try downloading a new driver tomorrow and see if that fixes the issue.

I had a suspicion that my newbie GL code might be preventing hardware acceleration from actually happening, but my daughter's PC (a gaming PC from three years ago; she has no idea what GPU it has, she only knows it plays Warcraft...) gets more believable results:

Test level   GT430   Gaming PC
1            80      59
2            13      59
3            x       25
4            x       15

I presume she has some setting in the driver that is stopping her card from rendering faster than the screen refresh rate.

That's V-sync limiting the FPS to 60. I could try it on my laptop for you. If you're not getting good scaling with a better graphics card, you might be CPU limited. A GT430 (assuming the desktop version) can do 268.8 GFLOPS compared to your old laptop card's 40, so it should be at least a few times faster. Not necessarily linearly though; GFLOPS is a very inaccurate way of measuring actual game performance.

UPDATE: disabling blending of transparent layers gives a totally different picture, with an order of magnitude difference between the performance of the GT430 and 3450. So if I cut out the fifty-plus alpha calculations per pixel the GT430 gives the performance differential that I expected over the 3450, exceeding my expectations in fact.

I knew the GT430 only has 4 ROPs, so the result makes a lot of sense (the 3450 also has four ROPs, and seemingly performs only a bit - about 33% - slower). But when the test is to draw a high volume of polygons with no blending, the 430 pulls far ahead of the 3450.
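Just as a sanity check, a back-of-the-envelope fill-rate ceiling from those 4 ROPs looks like this. The numbers are assumptions (one pixel per ROP per clock, the GT430's ~700 MHz reference core clock, a 1024x768 window), not measurements:

```java
public class FillRate {
    // fps ceiling = (ROPs * core clock) / (pixels written per frame)
    public static double fpsCeiling(long rops, long clockHz, long pixelsPerFrame) {
        return (double) (rops * clockHz) / pixelsPerFrame;
    }

    public static void main(String[] args) {
        long pixelsPerFrame = 1024L * 768L * 50L; // 1024x768 with 50x overdraw
        double fps = fpsCeiling(4, 700_000_000L, pixelsPerFrame);
        System.out.printf("~%.0f FPS ceiling%n", fps); // roughly 71 FPS
    }
}
```

So with 50x blended overdraw, even a perfect run would cap out around 70 FPS on 4 ROPs, which fits the low numbers seen in the tests.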

TL;DR: This is a ramble about how a GPU works and why GPU performance is so hard to predict for different GPUs and OpenGL settings. It's not meant to be read by people who aren't interested in 3D or how GPUs work on a lower level than OpenGL.

It's worth noting that vertex performance does not scale linearly with GPU power. A better GPU is of course faster, but not linearly. Pixel-filling performance scales a bit better, but what scales best is probably shader complexity. (Note: that comes from my own personal experience, so it might not be accurate across the board.) Most games don't want to draw 10 million triangles per frame or have 50x overdraw over the whole screen. They want expensive render targets (usually 3 or 4 x 16-bit float RGBA), complex lighting calculations, more texture bandwidth and the like, while MINIMIZING overdraw. BF3 even uses compute to do the lighting, which has proved a lot more effective than rendering lighting geometry with OpenGL for deferred shading. GPUs are obviously made for the games that use them, so a pathological case with lots of blending, lots of vertices or lots of cheap pixels is not going to perform as well as a more realistic workload. We pretty much run into something similar to the microbenchmarking problem, but for GPUs. We might also bottleneck one part of the GPU while leaving other units idle, since the hardware is a lot more specialized.

The GPU also does lots of optimizations based on what you enable. Enabling the depth test actually increases performance for 3D games (it may be as high as 2-3x, depending on how many pixels get rejected by the depth test), since the depth test can be done before the color of the pixel is calculated. Face culling can also improve performance a lot. However, some of these optimizations might be unusable with some combinations of settings. For example, for blending to be accurate the polygons need to be processed in the order they are submitted, while without blending we're only interested in the closest pixel, so the GPU can in theory process them in any order it wants to (this is speculation). Other things, like enabling alpha testing or modifying the depth of a pixel in a shader, also force the shader/fixed functionality to run before the depth/alpha test. It's easy to accidentally produce a case where the GPU cannot use such optimizations. Even worse, the flexibility of the GPU varies between GPU generations and even more between vendors, so what works for you might crawl on another GPU, or vice versa.

I don't even know a fraction of what my GPU does, but at least I know that I don't know much about it, and I take that into account. Ensuring that you make it as easy as possible for the GPU to use its optimizations is important for real-world performance. Ever heard of a z pre-pass? It's when you draw everything in the game twice to increase performance. Makes sense, doesn't it? By first drawing only the depth of the scene to the depth buffer, we can then enable the depth test to run the shader only on the pixels that are actually visible. We might double the vertex cost of the game to reduce the amount of SHADED overdraw to 0, which can be a perfectly valid tradeoff if your pixel shaders are expensive enough.
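Here's a toy software model of that idea: one contested pixel, opaque triangles submitted back to front (the worst case), counting how many times the "fragment shader" runs with and without the depth-only pre-pass. All names are made up for illustration:

```java
public class ZPrepass {
    public static int shadeCount;

    // One-pixel framebuffer: zbuf[0] holds the current closest depth.
    public static void drawNaive(float[] depths, float[] zbuf) {
        for (float z : depths) {
            if (z < zbuf[0]) {   // depth test against current contents
                shadeCount++;    // shader runs, then depth is written
                zbuf[0] = z;
            }
        }
    }

    public static void drawWithPrepass(float[] depths, float[] zbuf) {
        for (float z : depths)   // pass 1: depth only, shader never runs
            if (z < zbuf[0]) zbuf[0] = z;
        for (float z : depths)   // pass 2: GL_EQUAL-style test, so only
            if (z == zbuf[0]) shadeCount++; // the visible fragment shades
    }

    public static void main(String[] args) {
        float[] depths = {0.9f, 0.8f, 0.7f, 0.6f, 0.5f}; // back to front: worst case
        drawNaive(depths, new float[]{1.0f});
        System.out.println("shader runs, naive: " + shadeCount);    // 5
        shadeCount = 0;
        drawWithPrepass(depths, new float[]{1.0f});
        System.out.println("shader runs, pre-pass: " + shadeCount); // 1
    }
}
```

Five shader runs drop to one; the pre-pass pays a second pass of vertex work to get that.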

GPUs are massively parallel processors. A GT430 "only" has:

- 96 unified shader processors. They can switch between processing vertices and pixels to adapt, to some extent, to an uneven workload. This was also introduced around when deferred shading became big: when doing deferred shading you first have a very vertex-heavy workload, but then switch to lighting, which is 100% pixel limited instead. Ancient cards with separate vertex shaders and pixel shaders would have half their shaders stalled when doing deferred shading.

- 16 texture mapping units. I'm not sure about everything they do, but they handle bilinear filtering and spatial caching of texture samples. Bilinear filtering is free for 8-bit RGBA textures on today's GPUs thanks to these. The texture cache also helps hugely when we are sampling a small local part of a texture. In this thread I made a program that benefited a lot from this cache: by zooming out too far, I could see what happens when we start to sample the tile map more or less randomly, and the frame rate dropped from 1350 to 450 FPS. The shader workload remains identical, but we get a texture bottleneck! Wooh! That means we could do math in the shader for free as long as it isn't dependent on the texture samples. Confused yet?

- 4 ROPs. You made me a bit curious, and it seems like "The ROPs perform the transactions between the relevant buffers in the local memory - this includes writing or reading values, as well as blending them together." (Wikipedia). That would mean that the ROPs also handle the depth test and stencil test too in addition to blending. You learn something new everyday!
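The bilinear filtering the TMUs give you for free is just a weighted average of the four nearest texels; in plain Java it would look something like this (toy single-channel texture, no wrapping or edge clamping):

```java
public class Bilinear {
    // Weighted average of the four nearest texels; u and v are in texel
    // space. No wrap/clamp handling, so keep samples inside the texture.
    public static float sample(float[][] tex, float u, float v) {
        int x = (int) Math.floor(u), y = (int) Math.floor(v);
        float fu = u - x, fv = v - y;                  // fractional position
        float top = tex[y][x] * (1 - fu) + tex[y][x + 1] * fu;
        float bot = tex[y + 1][x] * (1 - fu) + tex[y + 1][x + 1] * fu;
        return top * (1 - fv) + bot * fv;
    }

    public static void main(String[] args) {
        float[][] tex = { {0f, 1f}, {0f, 1f} }; // single-channel 2x2 texture
        System.out.println(sample(tex, 0.5f, 0.5f)); // 0.5, halfway between 0 and 1
    }
}
```

Four texture reads and a handful of multiply-adds per sample, which is exactly the kind of fixed pattern that is cheap to bake into hardware.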

That's usually written as 96:16:4. A Radeon HD 7970, on the other hand, has a core configuration of 2048:128:32: 21.3x the number of shaders, 8x the texture mapping units and 8x the ROPs. What's up with the number of shaders?! Well, Radeon cards have traditionally had a higher number of shaders. NVidia countered this with a separate shader clock running at double the clock of everything else on the card, meaning the HD 7970 "only" has 10.6x the shader power in practice. (With the recently released 700 series, NVidia has ditched the separate shader clock and tripled the number of shaders to match AMD's setup, which is more power efficient.) On top of that, a ROP on one card may not equal a ROP on another card; for example, NVidia's ROPs are famous for pumping out more pixels per clock than AMD's. Memory bandwidth also affects texture performance, ROP performance, how well the game scales with higher resolutions, etc. On top of THAT, we also have a GPU clock and a memory clock, which affect performance the same way they do for CPUs. Yes, it's common to overclock your GPU if you have the cooling for it. NVidia's newest cards even have built-in overclocking that kicks in when the card isn't using its full power budget, similar to how a quad-core CPU raises its clock rate when not all cores are used. This was a great idea, since GPUs have so many hardware features that may not be used to their full potential in a given game. Here's a diagram of a GTX 680, NVidia's most powerful GPU at the moment:

Only the small green boxes are unified shader units. What're all those other things?! Here's a zoomed-in picture of one SMX:

Polymorph Engines (yellow): DX11-generation cards (your GT430 included) have hardware tessellators which can effectively cut a triangle up into smaller triangles, which can then be displaced to create genuinely uneven surfaces from a single triangle. NVidia does this in its Polymorph Engines, among other things as you can see (the 400 and 500 series use the first generation, the GTX 600 series the second, hence the 2.0). Does your game use tessellation? If not, we have idle hardware on your card. Too-heavy tessellation of triangles can easily turn into a bottleneck too.

SFUs (dark green): Special Function Units, which handle special floating-point math, presumably things like trigonometric functions and maybe even square roots. They're shared by a few shader units each, so they could be a bottleneck too.

Raster Engines (yellow): I believe these take in vertices (and optionally indices), assemble triangles, and output which pixels are covered by each triangle (= rasterizing). Could these be the bottleneck when we have so much overdraw of cheap pixels?

Conclusion: I could go on and on, but I think you get it by now - or rather, you get that you don't get it. xD What's bottlenecking your program? I don't know. =D I wasn't saying that the ROPs were bottlenecking your program; I meant that it could be anything, and that you're not utilizing your GPU in the way its makers expected.

Great information there. I agree the GPU architecture is too complex to make simplistic comparisons. And that's leaving aside the drivers, which introduce another level of uncertainty. The only way to be sure how the app performs relative to the architecture would be to run it on lots of different cards.

... which is why Nvidia and AMD spend lots of money supporting game development. Ever seen "The Way It's Meant To Be Played" and Nvidia's logo when starting a game? That's because Nvidia has helped them out, and obviously added some optimizations specific to their GeForce cards. Game makers obviously want even performance from both Nvidia and AMD cards, but those companies are competing and each wants to look faster than the other.

While writing my bachelor thesis about CUDA I also learned a lot about how GPUs work. For everyone who is interested in a more in-depth view of how the GPU works, I really recommend taking a look at the CUDA documentation from NVIDIA, which describes the architecture in a really accessible way.

I have continued my experiments, this time trying to see the relative effects of retained mode versus immediate mode. I built a vertex buffer of 20k vertices and drew it using either retained or immediate mode, in a loop that increased the number of times the vertices were drawn each cycle by the square of the iteration. To exclude pixel rendering as an issue I set the camera very far from the scene (thus reducing the number of pixels being drawn).

Anyway, here are my results, if you've ever wondered if there is much difference between these two modes. (My loops terminate when the frame rate goes below 40).

The overhead of immediate mode can be offset by having a good CPU. A less balanced computer (or one better balanced for gaming?) might have a better GPU and a worse CPU, so it might suffer a lot more from immediate mode (assuming you are comparing glBegin()/glEnd() to glDrawArrays()). Filling a vertex buffer can also be done on multiple cores, something that is impossible with immediate mode. All in all, there's no reason to use more CPU power than you need, since your game can most likely use any spare cycles for AI, physics, more entities, etc.
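Here's a sketch of what I mean by filling vertex data on multiple cores: each thread writes its own disjoint slice of the array, which you'd then hand to glBufferData() or copy through glMapBuffer(). The positions written are placeholders, not a real mesh:

```java
import java.nio.FloatBuffer;

public class ParallelFill {
    // Each worker writes a disjoint slice of the array, so no locking
    // is needed. The xyz values here are placeholders, not a real mesh.
    public static float[] fill(int vertexCount, int threads) {
        float[] data = new float[vertexCount * 3];
        Thread[] workers = new Thread[threads];
        int chunk = vertexCount / threads;
        for (int t = 0; t < threads; t++) {
            final int from = t * chunk;
            final int to = (t == threads - 1) ? vertexCount : from + chunk;
            workers[t] = new Thread(() -> {
                for (int v = from; v < to; v++) {
                    data[v * 3]     = v;      // x
                    data[v * 3 + 1] = v * 2;  // y
                    data[v * 3 + 2] = 0f;     // z
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return data;
    }

    public static void main(String[] args) {
        FloatBuffer buf = FloatBuffer.wrap(fill(20_000, 4));
        System.out.println(buf.capacity()); // 60000 floats, ready to upload
    }
}
```

With glBegin()/glEnd() this is impossible, since every vertex has to pass through one sequence of GL calls on the rendering thread.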

Heh, I had an AMD Athlon dual core paired with a GTX 295 for a few months...

Sorry, I've got my terminology a bit mixed up - I was using glDrawArrays for both modes. The only difference was whether the buffer was a VBO located on the graphics card or a FloatBuffer located in the PC's RAM. I am sure if I did true immediate mode with lots of calls to glBegin/End there would have been an even larger difference. It would be way too much bother to change my current test code to compare VBOs to glBegin/End code, although I will update the thread when I have a new test showing PC RAM buffers vs VBOs vs display lists vs glBegin/End. I've heard display lists are even faster than VBOs, so it would be interesting to test this out, and if I'm building display lists then I can easily test raw glBegin/End mode too.

My PC is definitely CPU limited - a Pentium Dual Core 1.8GHz matched with a GT430. The PC might be a bit slow, but I got it for 15 euros! I had to buy the graphics card separately and it cost three times as much as the PC.

I'm writing these tests because it is not immediately apparent whether this or that change in code results in a performance improvement. As a noob it is really easy to write something that looks OK but actually degrades the performance.

Ah, okay. There's a pretty good reason why immediate mode and the gl*Pointer functions that take a Buffer object were removed in OpenGL 3.2 (or was it 3.1?). Even if you use VBOs and update them each frame, it should be faster if you use glMapBuffer(). You should try that; it's pretty simple, and it should pretty much be the fastest way of doing it. Theoretically your program shouldn't be bottlenecked by the memory transfer, since it should happen in parallel with the rendering, though such an optimal case where the copy overlaps the rendering completely is of course rare...

Display lists were deprecated too! VBOs should be just as fast for rendering, but I do agree that display lists have some use since you can also store state-change commands in them, though that doesn't mean those state changes become free or should be issued more often than they otherwise would. Try it out; the driver is usually able to optimize the data a lot for display lists. One thing to note is that texture binds are NOT stored in the display list, though this might be a driver bug (not likely though), since the OpenGL specs say they should be stored.

Technically there is no such thing as "retained mode" anymore. That was a feature from Iris GL, DirectX didn't maintain it after its first public release (and dropped it in DX10) and OpenGL never had it. Obviously, using vertex arrays blurs the distinction somewhat, so most people know what you mean, but if you use the term in a DX forum, you'll get a lot of sideways looks from people wondering why you're using such an ancient deprecated API.
