I can't fathom it either. It proved quite a distraction from other things I should be doing; it was nearly 3am by the time I made my post last night, by which time I was starting to doubt the validity of my judgement... I haven't been totally rigorous but I can't see a major flaw in my methodology, and now jezek2 seems to be finding the same.

I was running 10,000 particles in some of my tests, 30,000 in others. Either is enough to make particles easily the most expensive thing in the app; I get framerates around 30fps (much faster with particles off).

Interesting point about the pipeline... I've been planning to make some changes to the way my rendering works that would make implementing something along the lines of jezek2's suggestion very easy.

At some point, I might do my particle animation on the GPU instead... that really should be faster. Also, I've certainly seen big gains going away from immediate mode in other parts of the program.

For the record, I'm using it for quite a small number of vertices (much smaller than the amounts xinaesthetic mentioned), and it's not primarily for particles, though I currently render my few particles using immediate mode too.

Err... you should be surprised, because using plain old fashioned vertex arrays, he's got 2x the performance of VBOs. Which is exactly the opposite of what just happened to me in my sprite engine. Admittedly Riven's code is pretty much a microbenchmark and my sprite engine is a real-world application doing real things, so possibly my results are more relevant, though I need to test on the Mac, ATI cards, Intel cards, and various PC configurations before I draw any firm conclusions.

Riven, could you try to submit 10 times (or more) the vertex data per iteration and post another comparison?

Btw, we can't really compare display lists here; this is rendering of dynamic geometry submitted to the GPU each frame (each iteration in Riven's code represents a rendered frame).

Yes, that's why I'd expect display lists to be faster for static geometry (way faster, since it's only one JNI call for thousands of draws) - everything has its own use depending on the application. NB: mixing different display lists (and recompiling them on the fly), and reordering them, can also work pretty well for dynamic scene rendering.

But a comparison can still be made, in a way, by drawing the same thing to the screen.

Yeah, but you're never ever supposed to use immediate mode anymore. So if you're using a combination of VBOs and immediate mode, it makes sense that it would be slower than glDrawArrays or glDrawElements.

Perhaps the valuable lesson for now is that VAs are still worthwhile for very dynamic geometry, since there a VBO's rendering benefits don't outweigh its slower updates. In my personal experience VBOs offer a very consistent speed boost when they're not constantly being updated, but that is to be expected.
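
For reference, a rough sketch of the two paths being compared - the draw call is identical, and only where the vertex data lives each frame differs (the setup and names here are mine, not from Riven's benchmark):

import java.nio.FloatBuffer;
import org.lwjgl.opengl.GL11;
import org.lwjgl.opengl.GL15;

// Plain client-side vertex array: the driver reads straight from our
// FloatBuffer on every draw, with no upload step.
static void drawWithVertexArray(FloatBuffer verts, int vertexCount) {
    GL11.glEnableClientState(GL11.GL_VERTEX_ARRAY);
    GL11.glVertexPointer(3, 0, verts); // 3 floats per vertex, tightly packed
    GL11.glDrawArrays(GL11.GL_TRIANGLES, 0, vertexCount);
    GL11.glDisableClientState(GL11.GL_VERTEX_ARRAY);
}

// Streamed VBO: same data, but re-uploaded into a buffer object each frame
// before drawing - that upload is what can eat the VBO's advantage.
static void drawWithStreamedVbo(int vboId, FloatBuffer verts, int vertexCount) {
    GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vboId);
    GL15.glBufferData(GL15.GL_ARRAY_BUFFER, verts, GL15.GL_STREAM_DRAW);
    GL11.glEnableClientState(GL11.GL_VERTEX_ARRAY);
    GL11.glVertexPointer(3, GL11.GL_FLOAT, 0, 0); // offset 0 into the bound VBO
    GL11.glDrawArrays(GL11.GL_TRIANGLES, 0, vertexCount);
    GL11.glDisableClientState(GL11.GL_VERTEX_ARRAY);
    GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, 0);
}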

Also, for people who encourage the use of display lists: I've had troubling issues with them on Mac hardware. I've seen cases where rendering with them is significantly slower than VBOs or VAs, and cases where it's much faster. It's also been the cause of (or coincidentally correlated with) odd graphical glitches in the Mac windowing manager/compositor.

The internal format of the data in DLs is also, in some cases, very slightly different from arrays or immediate mode, which leads to rendering artifacts. I forget where I read this - it was a long time ago - but it was the final nail in the coffin for me.

I concur: I've seen a program of mine that makes quite heavy use of display lists run much, much worse on a PowerBook with (AFAIK) semi-decent graphics than on a pretty basic older Windows laptop with integrated graphics... it was OK on a PowerMac with, I think, an 8600GT (as one would hope).

I bet this only shows up when rendering identical VA/VBO and DL geometry, (maybe) causing z-fighting and slightly different edges in the rasterization step. Minecraft is built entirely using DLs, and from what I can see it's 'good enough'. Maybe you should ask Markus Persson what mysterious bug reports he gets from his players.

The current bottleneck is glCheckError(), at 35% of native time. Obviously when performance testing (and even in a released game) I don't want to check for errors - but LWJGL inconveniently, and definitely wrongly, forces a call to glCheckError() on every display update. Unfortunately this causes a pipeline flush for some reason (hence the unreasonably long time spent in this method). I hacked it out of LWJGL so that it now only happens in LWJGL debug mode.
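
The guard itself is tiny - something along these lines inside LWJGL's display update path (paraphrased from memory, not the exact LWJGL source):

import org.lwjgl.LWJGLUtil;
import org.lwjgl.opengl.Util;

// Only pay for the error check (and the pipeline flush it triggers)
// when LWJGL is actually running in debug mode.
if (LWJGLUtil.DEBUG) {
    Util.checkGLError(); // calls glGetError(), throws OpenGLException on error
}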

The next bottleneck: glMapBufferARB() was making a call to the driver to get the current size of the mapped buffer - again causing a pipeline flush/stall, and now taking 35% of my native time. So I switched to the latest LWJGL nightly (and reapplied the check-error hack) and used the new glMapBuffer() method that takes a size argument. Why the method doesn't just take the capacity() of the buffer is a bit odd, but there we go; capacity() is the only safe value to pass in at this point anyway, as the limit() can change after the mapping is made. That gave a small improvement in framerate - good. I'm definitely on the right track here.
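
In code terms, the change is just passing the size up front instead of letting LWJGL query the driver for it. Roughly like this - vboSizeInBytes and oldMapped are my placeholder names; the size is whatever was passed to glBufferData, and the old buffer lets LWJGL reuse the previously returned ByteBuffer if the mapping address hasn't moved:

import java.nio.ByteBuffer;
import org.lwjgl.opengl.GL15;

// Old path: LWJGL calls back into the driver to ask for the buffer size,
// which stalls the pipeline:
// ByteBuffer mapped = GL15.glMapBuffer(GL15.GL_ARRAY_BUFFER, GL15.GL_WRITE_ONLY, oldMapped);

// New path: we supply the size ourselves and skip the driver query.
static ByteBuffer mapForWrite(long vboSizeInBytes, ByteBuffer oldMapped) {
    return GL15.glMapBuffer(GL15.GL_ARRAY_BUFFER, GL15.GL_WRITE_ONLY,
                            vboSizeInBytes, oldMapped);
}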

Now glMapBufferARB() itself is the actual bottleneck. Hmm. Why should this be taking 20% of my native time? Ahh of course - because it's probably locked by the GPU. The solution is very simple - double buffer it. So I now use two identically sized VBOs, and swap them each frame. The GPU reads from one while I write to the other.
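
A minimal sketch of what I mean - names are mine, and writeVertices() stands in for whatever fills a frame's geometry:

import java.nio.ByteBuffer;
import org.lwjgl.opengl.GL11;
import org.lwjgl.opengl.GL15;

static final int[] vbos = new int[2]; // two identically sized VBOs
static int writeIndex = 0;            // the one we fill this frame

static void init(long sizeInBytes) {
    for (int i = 0; i < 2; i++) {
        vbos[i] = GL15.glGenBuffers();
        GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vbos[i]);
        GL15.glBufferData(GL15.GL_ARRAY_BUFFER, sizeInBytes, GL15.GL_STREAM_DRAW); // allocate only
    }
}

static void renderFrame(int vertexCount) {
    // Map and fill this frame's VBO; the GPU may still be reading the
    // *other* one from last frame, so this map shouldn't block.
    GL15.glBindBuffer(GL15.GL_ARRAY_BUFFER, vbos[writeIndex]);
    ByteBuffer mapped = GL15.glMapBuffer(GL15.GL_ARRAY_BUFFER, GL15.GL_WRITE_ONLY, null);
    writeVertices(mapped);
    GL15.glUnmapBuffer(GL15.GL_ARRAY_BUFFER);

    // Draw from the buffer we just filled.
    GL11.glEnableClientState(GL11.GL_VERTEX_ARRAY);
    GL11.glVertexPointer(3, GL11.GL_FLOAT, 0, 0);
    GL11.glDrawArrays(GL11.GL_TRIANGLES, 0, vertexCount);
    GL11.glDisableClientState(GL11.GL_VERTEX_ARRAY);

    writeIndex = 1 - writeIndex; // alternate VBOs each frame
}

static void writeVertices(ByteBuffer buf) {
    // ... app-specific: put this frame's vertex data into 'buf' ...
}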

Suddenly I'm getting a 50% increase in frame rate. There may be a bit more to come if I try triple buffering the VBOs as well but I'm not quite sure if that's actually going to make any difference (even if my display is triple buffered).

Now StrictMath.floor() is the native bottleneck - grr - using a surprisingly large 5% of my native time for what I thought was a trivially intrinsified operation (turns out it's not - at least, not on my Turion). Anybody got a quickie workaround hack to avoid using floor()?

from Ken Perlin's simplex noise:

// This method is a *lot* faster than using (int)Math.floor(x)
private static int fastfloor(double x) {
    return x > 0 ? (int) x : (int) x - 1;
}
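
(The cast truncates toward zero, hence the -1 for negative inputs. It only disagrees with Math.floor() for exactly integral negative values - e.g. fastfloor(-2.0) gives -3 - which is harmless for noise grid lookups.)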

I went ahead and ran my benchmarker with XProf so I could compare it to your findings. I'm using an LWJGL nightly build from a few days ago (right after the ATI driver issue was fixed). I don't even get glCheckError() as a blip on the radar. The big one for me is MacOSXContextImplementation.nSwapBuffers (which kind of makes sense) and then glDrawArrays. Am I just missing something?

That's weird, you shouldn't need any hack for that. Since 2.2.0 glCheckError() is only called during display update when org.lwjgl.util.Debug is set to true. See this post.

Hm, I'm almost certain I had to put an if (LWJGLUtil.DEBUG) {} check around the call last night to stop it from checking. I'll report back later when I get home from work.

@4x4: if you're blocked in swapBuffers, that just means that the GPU still has some rendering to do to finish the current frame. Triple buffering can help a bit here I think, but that's buried in the drivers/OS and beyond LWJGL's direct control.
