So, having more or less doubled the speed of my sprite engine merely by switching to VBO-based rendering instead of using traditional heap-based DirectByteBuffers, I find now that as soon as I perform any immediate mode rendering in OpenGL, performance rapidly plummets back down to terrible.

It would appear that VBOs are kind of an all-or-nothing approach; any immediate mode rendering pretty much buggers up the whole advantage of VBO usage. So now I have to port all my text rendering, background rendering, capacitor-zap rendering, building-under-attack rendering, and powerup-beam rendering to use VBOs instead of immediate mode. Bah.

> So, having more or less doubled the speed of my sprite engine merely by switching to VBO-based rendering instead of using traditional heap-based DirectByteBuffers, I find now that as soon as I perform any immediate mode rendering in OpenGL, performance rapidly plummets back down to terrible.

Could you explain what you mean by a 'heap-based DirectByteBuffer'? A DirectByteBuffer has its pointer outside of the heap. Naturally, the DirectByteBuffer instance itself will be on the heap, but that's also the case with the object returned from glMapBuffer().

glMapBuffer() will also stall the GPU until glUnmapBuffer() is called. So when you are filling your ByteBuffer, which might take some time, your GPU will be idling. No such problem with glBufferDataARB.
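To make the contrast concrete, here's a hedged sketch of the two upload paths using LWJGL's GL15 bindings; the method and variable names are made up for illustration, and a current GL context plus a generated VBO are assumed:

```java
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import static org.lwjgl.opengl.GL15.*;

class VboUpload {
    // Path 1: glBufferData - the driver takes a copy of the data and the call
    // returns; it doesn't have to wait for the GPU to finish with the old store.
    void uploadWithBufferData(int vbo, FloatBuffer vertexData) {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, vertexData, GL_STREAM_DRAW);
    }

    // Path 2: glMapBuffer - the map call may stall until the GPU is done
    // reading the buffer, and the GPU can sit idle while we fill the mapping.
    void uploadWithMapBuffer(int vbo, int sizeInBytes, FloatBuffer vertexData) {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        ByteBuffer mapped = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY, sizeInBytes, null);
        mapped.asFloatBuffer().put(vertexData);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```

Whether the map actually stalls depends on the driver and usage flags, as discussed below.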

> Could you explain what you mean by a 'heap-based DirectByteBuffer'? A DirectByteBuffer has its pointer outside of the heap. Naturally, the DirectByteBuffer instance itself will be on the heap, but that's also the case with the object returned from glMapBuffer().

I think he means a direct ByteBuffer that points to JVM-allocated memory, versus a direct ByteBuffer that points to driver-allocated memory (that's returned from glMapBuffer), which may be used for faster GPU transfers.

> glMapBuffer() will also stall the GPU until glUnmapBuffer() is called. So when you are filling your ByteBuffer, which might take some time, your GPU will be idling. No such problem with glBufferDataARB.

That's implementation and usage dependent. It may stall or it may not stall. Cas may be lucky and his driver is clever enough to queue a copy instead of stalling the GPU. That's why I suggested in his other topic that he should explore an algorithmic solution first (using shaders, instancing etc), before trying a simple switch to VBOs (which he should do anyway).

> I think he means a direct ByteBuffer that points to JVM-allocated memory, versus a direct ByteBuffer that points to driver-allocated memory (that's returned from glMapBuffer), which may be used for faster GPU transfers.

1. Filling both has exactly the same performance.
2. I wonder how the CPU=>GPU copy could be faster for the driver-allocated memory. It might be guaranteed to be page-aligned, but we can guarantee the same with a JVM-allocated ByteBuffer. The check for alignment should be negligible compared to the data copy.

A heap-based direct byte buffer is what's returned by ByteBuffer.allocateDirect() - it's still on the C heap (not the Java object heap). It's possible to create direct ByteBuffers outside of the C heap in native code, but not in pure Java code. This is how the old NV_fence and AGP RAM allocation used to work.
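For reference, this is what that "heap-based" allocation looks like in plain Java, no GL required - allocateDirect hands back memory malloc'd on the C heap, with only the wrapper object on the Java object heap:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // The backing memory lives on the C heap; only the small ByteBuffer
        // wrapper object itself lives on the Java object heap.
        ByteBuffer bb = ByteBuffer.allocateDirect(6 * 4)           // room for 6 floats
                                  .order(ByteOrder.nativeOrder()); // match native endianness for GL
        FloatBuffer vertices = bb.asFloatBuffer();
        vertices.put(new float[] { 0f, 0f, 1f, 0f, 1f, 1f }).flip();

        System.out.println(bb.isDirect());        // true
        System.out.println(vertices.remaining()); // 6
    }
}
```

The nativeOrder() call matters: a direct buffer defaults to big-endian, and GL expects native byte order.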

glMapBuffer() will return a pointer to an address in the process's address space, not on the heap either - it'll be some weirdy location provided by the driver. Now, if I were calling glMapBuffer() on a buffer that overlapped some bit of memory the driver was currently trying to render, then I'd most likely cause a GPU stall. However, because of the way I do rendering - I map, rapidly fill the buffer with data, unmap, and then start rendering from it - I'm unlikely to cause any GPU stalling. In fact, if the driver's worth its salt it'll be batching my state change calls and processing them asynchronously with the buffer DMA, effectively making nearly all the calls return immediately.

If I find that filling the byte buffer is taking too long I could double buffer my geometry data - that is, alternately swap between two VBOs. I might yet do this anyway.
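That double-buffering idea could look something like the sketch below (hypothetical class and field names, LWJGL GL15 bindings, a valid GL context assumed):

```java
import static org.lwjgl.opengl.GL15.*;

class DoubleBufferedGeometry {
    // Two VBO names: the CPU fills one while the GPU may still be
    // reading vertices out of the other, avoiding a map/render collision.
    private final int[] vbos = { glGenBuffers(), glGenBuffers() };
    private int frame = 0;

    void renderFrame() {
        int current = vbos[frame & 1]; // ping-pong between the two buffers
        glBindBuffer(GL_ARRAY_BUFFER, current);
        // ... map, fill with this frame's geometry, unmap, then draw from it ...
        frame++;
    }
}
```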

Not so; a driver-provided pointer to a strange address over the bus somewhere can completely bypass client-side memory caches and thus eliminate cache pollution, a major factor in slowdown when copying data up to the card. Even the piddly 125KB of vertex data I copy up per frame buggers most CPUs' L1 caches.

At least that's my understanding of it, and seeing as everything's twice as fast since I made the change, I assume something good is happening.

> 2. I wonder how the CPU=>GPU copy could be faster for the driver-allocated memory. It might be guaranteed to be page-aligned, but we can guarantee the same with a JVM-allocated ByteBuffer. The check for alignment should be negligible compared to the data copy.

Wouldn't a copy (malloc) by the driver be much slower than providing your own ByteBuffer?

Again, this depends on the usage flags and how clever the driver is, but in my instance I'm using GL_STREAM_DRAW and GL_WRITE_ONLY, one of the optimally easy cases for allocation: the driver never needs to return the same pointer twice. Or, even more cleverly, it can provide the same client-side address space pointer but point it at a completely different server-side memory location, so that I can write to it unhindered.
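Under GL_STREAM_DRAW that "different server-side location" trick is commonly triggered by orphaning the buffer store before mapping. A hedged sketch with LWJGL's GL15 bindings (BUFFER_SIZE and the fill step are placeholders, GL context assumed):

```java
import java.nio.ByteBuffer;
import static org.lwjgl.opengl.GL15.*;

class StreamingUpload {
    static final int BUFFER_SIZE = 128 * 1024; // placeholder size in bytes

    void refill(int vbo) {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        // "Orphan" the old store: the driver may hand out fresh memory for the
        // same buffer name while the GPU keeps reading the previous contents.
        glBufferData(GL_ARRAY_BUFFER, BUFFER_SIZE, GL_STREAM_DRAW);
        ByteBuffer mapped = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY, BUFFER_SIZE, null);
        // ... write this frame's vertex data into 'mapped' ...
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```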

> 2. I wonder how the CPU=>GPU copy could be faster for the driver-allocated memory. It might be guaranteed to be page-aligned, but we can guarantee the same with a JVM-allocated ByteBuffer. The check for alignment should be negligible compared to the data copy.

I'm not sure, but I think the GPU cannot perform DMA transfers from arbitrary memory locations. For example, on AGP cards, you need to use memory that's been reserved and allocated from the GART for the card to be able to use DMA. The JVM can't do that for you, but glMapBuffer can. The GART memory linearization has been moved to the GPU/driver on PCIe cards, but I think the same issue remains.

> Wouldn't a copy (malloc) by the driver be much slower than providing your own ByteBuffer?

The GPU may have allocated a local copy of the VBO in GPU memory and may do some kind of double buffering. Instead of stalling on glMapBuffer, it may continue rendering from the local copy, then update the VBO when rendering is done. It can't do that with user-allocated memory, not without having 3 copies of the VBO data. For example:

1) glBufferSubData is called, a user-allocated buffer is supplied.
2) The driver copies the user-allocated buffer somewhere else in system memory (possibly DMAable).
3) The system-memory => GPU-memory copy is performed to update the GPU-allocated VBO.

That's 3 copies of the VBO data. The driver cannot avoid #2, because the user may modify the user-allocated buffer data before the GPU transfer is performed. In the glMapBuffer case, data modification is controlled, you can't change anything outside a map/unmap pair or without a glBuffer(Sub)Data call. So, glMapBuffer gives you direct access to the #2 memory.

That's all assuming the driver does the double-buffer copy. If not, the GPU stalls and you get the same performance - possibly slightly better with glMapBuffer, in case what I wrote above about DMA isn't bullshit. Then again, the data transfer shouldn't be the bottleneck (very little data according to Cas, very high bandwidth available on modern cards), but stalling the GPU can be. glMapBuffer is supposed to help with avoiding such stalls (that's why the WRITE_ONLY flag exists).

No, sorry. I may be talking out of my ass here; it's all based on random stuff I've read around and how I think GPUs/drivers work. As I've said before on the LWJGL forums, I haven't used glMapBuffer more than once, and I even replaced it at some point with a better solution. My other warning also applies: VBOs are very GPU/vendor/driver sensitive, so you cannot be sure that a rendering setup that performs great on your machine will be anywhere close to optimal on other machines. On the other hand, I haven't done any VBO tests lately; drivers with VBO support have matured and things may be much better now.

Could you try reusing the driverSideBuffer (store the reference and pass as the last argument to glMapBuffer)? Also, could you download a fresh LWJGL build and try the new glMapBuffer API (with an explicit length argument)?
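For what it's worth, the suggested reuse would look roughly like this (hypothetical class and field names; LWJGL lets you pass the previously returned ByteBuffer back so it can reuse the wrapper when the mapped address and size are unchanged):

```java
import java.nio.ByteBuffer;
import static org.lwjgl.opengl.GL15.*;

class MappedVboWriter {
    private ByteBuffer driverSideBuffer; // cached across frames

    void fill(int sizeInBytes) {
        // Passing the old ByteBuffer back as the last argument avoids
        // constructing a new wrapper object every frame when the driver
        // returns the same mapping.
        driverSideBuffer = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY,
                                       sizeInBytes, driverSideBuffer);
        // ... write vertex data into driverSideBuffer ...
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```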

It's been dead a long time. Immediate mode OpenGL rendering was when GL calls were used between glBegin() and glEnd() to render/specify each of the triangles' vertices one by one. It was the original rendering mode of OpenGL (back in 1993, if I remember).
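For anyone who hasn't seen it, classic immediate mode looks like this (LWJGL GL11 bindings, GL context assumed) - one driver call per vertex attribute, which is exactly why it's slow, especially from Java:

```java
import static org.lwjgl.opengl.GL11.*;

class ImmediateModeSprite {
    // Draws one textured quad the old-fashioned way: every glTexCoord2f/
    // glVertex2f pair is a separate call into the driver.
    void draw(float x, float y, float w, float h) {
        glBegin(GL_QUADS);
        glTexCoord2f(0f, 0f); glVertex2f(x,     y);
        glTexCoord2f(1f, 0f); glVertex2f(x + w, y);
        glTexCoord2f(1f, 1f); glVertex2f(x + w, y + h);
        glTexCoord2f(0f, 1f); glVertex2f(x,     y + h);
        glEnd();
    }
}
```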

Indeed. Up until now, at least, vertex array rendering was fast enough for my purposes, but now it seems I've hit the absolute limits, and the drivers are definitely pushing everybody to use VBOs or face drastically shit performance.

This is kinda significant because a) all my legacy code has a lot of immediate mode in it (I don't mean a lot of rendering, just a lot of places), and b) all the legacy tutorial code on the internets is immediate mode, and it's all completely the wrong thing to be teaching people now.

Anyway, back to major hackery to convert all my immediate mode stuff into VBO rendering... without breaking any of my legacy code... whilst still being integrated into the sprite engine layering system.... argh

> It's been dead a long time. Immediate mode OpenGL rendering was when GL calls were used between glBegin() and glEnd() to render/specify each of the triangles' vertices one by one. It was the original rendering mode of OpenGL (back in 1993, if I remember).

I must admit, I recommended that someone use display lists on here just a few days ago without warning them that they're actually already deprecated (which I knew full well)... they were using the JOGL TextRenderer and I wasn't really sure how that would interact. Well, I felt guilty, so I've at least now let them know.

In my own code, I have VBOs for the major stuff, with some immediate mode, the odd display list and one or two vertex arrays... I don't notice any particular difference in performance whether I have my various trivial immediate mode bits rendering or only VBO stuff - I am largely CPU bound, though. In fact, I've just been experimenting with rendering particles in immediate mode vs VA vs VBO and am seeing absolutely negligible differences... if anything, immediate mode is maybe slightly ahead. WTF?

I'm still surprised that I fairly consistently see an increase of a few fps changing my particle system to immediate mode instead of VBO... never the other way around, it seems. Maybe I'm confused somewhere down the line.

> I'm still surprised that I fairly consistently see an increase of a few fps changing my particle system to immediate mode instead of VBO... never the other way around, it seems. Maybe I'm confused somewhere down the line.

I observe the same thing (at least on a 6600GT). It seems that small dynamic geometry renders faster using immediate mode than using a VBO. But I haven't tried to, e.g., fill the VBO at the start of the frame, render other stuff, and then use that VBO for rendering, which would be more GPU/pipeline friendly.

I can't fathom any way a particle system could be faster using immediate mode rendering, especially from Java. Unless you draw so few particles that it amounts to a microbenchmark, which renders the results a bit suspect.

java-gaming.org is not responsible for the content posted by its members, including references to external websites and other references that may or may not have a relation with our primarily gaming and game production oriented community. Inquiries and complaints can be sent via email to the info-account of the company managing the website of java-gaming.org.