Re: VBOs strangely slow?

The question is, why do they push the use of a new feature if it's not implemented well?

Conclusion: even with traditional glBegin/glEnd you can get outstanding performance, as long as you optimize algorithmically rather than at the instruction/pixel/hardware level.

Did you read the thread? He said, "In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding its contents), 10% faster than the vertex array." In short, VBOs worked better for him once he was using the correct API. So your "conclusion" is arrant nonsense.

On topic, you should use glMapBufferRange if that extension is available. Using the invalidation flag, you don't even need the glBufferData(NULL) part.
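For anyone who wants to try that, here is a rough sketch of the map-with-invalidate upload. The glMapBufferRange/glUnmapBuffer signatures and the enum values are the real ones; everything else (the MapApi table, stream_vertices, and the fake_* "driver" that lets it run without a GL context) is made up for illustration. With real GL headers you would drop the stand-in typedefs and point the table at the real entry points.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins so the sketch builds without GL headers (drop them if you
   include the real ones). Enum values are taken from the GL spec. */
typedef unsigned int  GLenum;
typedef unsigned int  GLbitfield;
typedef unsigned char GLboolean;
typedef ptrdiff_t     GLintptr;
typedef ptrdiff_t     GLsizeiptr;
#define GL_ARRAY_BUFFER              0x8892
#define GL_MAP_WRITE_BIT             0x0002
#define GL_MAP_INVALIDATE_BUFFER_BIT 0x0008

/* Hypothetical table of entry points, filled in by whatever loader you use. */
typedef struct {
    void     *(*MapBufferRange)(GLenum, GLintptr, GLsizeiptr, GLbitfield);
    GLboolean (*UnmapBuffer)(GLenum);
} MapApi;

/* Copy `size` bytes into the VBO currently bound to GL_ARRAY_BUFFER.
   The invalidate flag tells the driver the old contents are not needed,
   so no separate glBufferData(NULL) call is made first.
   Returns 1 on success, 0 if the map failed or the unmap reported a lost store. */
int stream_vertices(const MapApi *gl, const void *src, GLsizeiptr size)
{
    assert(gl != NULL && src != NULL && size > 0);
    void *dst = gl->MapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                   GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    if (dst == NULL)
        return 0;
    memcpy(dst, src, (size_t)size);
    return gl->UnmapBuffer(GL_ARRAY_BUFFER) ? 1 : 0;
}

/* Fake "driver" (illustration only) so the sketch runs without a context. */
static unsigned char fake_store[64];
static GLbitfield    fake_last_access;

static void *fake_map(GLenum target, GLintptr offset, GLsizeiptr length,
                      GLbitfield access)
{
    (void)target;
    fake_last_access = access;
    if (offset < 0 || length <= 0 ||
        offset + length > (GLsizeiptr)sizeof fake_store)
        return NULL;                      /* request does not fit the store */
    return fake_store + offset;
}

static GLboolean fake_unmap(GLenum target)
{
    (void)target;
    return 1;                             /* store was not lost */
}
```

Whether this beats glBufferData(NULL) + glMapBuffer on a given card and driver is something you would still have to measure.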

Re: VBOs strangely slow?

....
Try shoving 10 million triangles to the GPU per frame at 60 fps without VBOs, while needing flexibility that display lists do not give (and without wasting VRAM on the different permutations that display lists would otherwise require).

Re: VBOs strangely slow?

Use GL_STATIC_DRAW and don't update your data pointer (via glBufferData or glBufferSubData) in your for loop. I would think these calls cause the geometry to be sent over to the graphics adapter every time.

Re: VBOs strangely slow?

In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding its contents), 10% faster than the vertex array.

All's well that ends well? I guess, but it's still not obvious to me why the other way of using them is in this case /slower/.

That's interesting. When I've tried Map vs. Sub, Sub was faster (with invalidate of course, so multiple buffers are in-flight in the driver [allegedly], and fixing the VBO max size -- no resizing).
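To be concrete about what I mean by "Sub with invalidate", here is a minimal sketch. I'm assuming the invalidate is done by re-specifying the buffer with a NULL pointer at a fixed capacity before each upload. The glBufferData/glBufferSubData signatures and enum values are the real ones; SubApi, orphan_and_upload and the fake_* recorders are hypothetical names so the thing can be run without a GL context.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch builds without GL headers (drop them if you
   include the real ones). Enum values are taken from the GL spec. */
typedef unsigned int GLenum;
typedef ptrdiff_t    GLintptr;
typedef ptrdiff_t    GLsizeiptr;
#define GL_ARRAY_BUFFER 0x8892
#define GL_STREAM_DRAW  0x88E0

/* Hypothetical table of entry points, filled in by whatever loader you use. */
typedef struct {
    void (*BufferData)(GLenum, GLsizeiptr, const void *, GLenum);
    void (*BufferSubData)(GLenum, GLintptr, GLsizeiptr, const void *);
} SubApi;

/* Re-specify the bound VBO at its fixed capacity with no data (discarding
   the old contents), then upload this frame's bytes at offset 0.
   The capacity never changes, so the buffer is never resized.
   Returns 1 on success, 0 if the data does not fit. */
int orphan_and_upload(const SubApi *gl, GLsizeiptr capacity,
                      const void *src, GLsizeiptr size)
{
    assert(gl != NULL && src != NULL);
    if (size <= 0 || size > capacity)
        return 0;
    gl->BufferData(GL_ARRAY_BUFFER, capacity, NULL, GL_STREAM_DRAW);
    gl->BufferSubData(GL_ARRAY_BUFFER, 0, size, src);
    return 1;
}

/* Fake entry points (illustration only) that just record what was asked. */
static GLsizeiptr fake_alloc_size;   /* size passed to the last BufferData */
static int        fake_orphans;      /* number of BufferData(NULL) calls   */
static GLsizeiptr fake_uploaded;     /* end offset of the last SubData     */

static void fake_buffer_data(GLenum target, GLsizeiptr size,
                             const void *data, GLenum usage)
{
    (void)target; (void)usage;
    fake_alloc_size = size;
    if (data == NULL)
        fake_orphans++;
}

static void fake_buffer_sub_data(GLenum target, GLintptr offset,
                                 GLsizeiptr size, const void *data)
{
    (void)target; (void)data;
    fake_uploaded = offset + size;
}
```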

But yeah, pure VBOs are odd. You'd think they'd always be faster, but some of the time they're slower (most of the time on pre-SM4 cards). Unless you play the "Ouija board" correctly per card per driver rev.

Map vs. Sub.

Invalidate vs. not.

Sync vs. not.

Static vs. stream vs. dynamic.

Dynamic max VBO size or not.

Interleaved attributes vs. separate.

Multiple batches per VBO vs. not.

Mixing index and vertex arrays in one buffer or not.

Max VBO size X or Y.

32-byte-aligned verts or not.

Ring of N buffers or one.

Vtx fmts X or Y for colors, normals, texcoords, etc.

Latency between upload and use X or Y.

Call glVertexPointer first, last, or in between.
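As one concrete point in that space, here is a sketch of an interleaved 32-byte vertex. The Vertex struct and the offset names are just an illustration (assuming 4-byte floats), not a layout anyone has said is fastest on a particular card.

```c
#include <assert.h>
#include <stddef.h>

/* One interleaved vertex: 3 + 3 + 2 floats = 32 bytes with 4-byte floats. */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
} Vertex;

/* Fail the build if the compiler pads the struct to something else. */
static_assert(sizeof(Vertex) == 32, "Vertex should pack to 32 bytes");

/* Stride and byte offsets to hand to the gl*Pointer calls. */
enum {
    VERTEX_STRIDE   = sizeof(Vertex),
    POSITION_OFFSET = offsetof(Vertex, position),
    NORMAL_OFFSET   = offsetof(Vertex, normal),
    TEXCOORD_OFFSET = offsetof(Vertex, texcoord)
};

/* With a VBO bound to GL_ARRAY_BUFFER, the pointer argument is a byte
   offset into the buffer, e.g.:

     glVertexPointer  (3, GL_FLOAT, VERTEX_STRIDE, (const void *)POSITION_OFFSET);
     glNormalPointer  (   GL_FLOAT, VERTEX_STRIDE, (const void *)NORMAL_OFFSET);
     glTexCoordPointer(2, GL_FLOAT, VERTEX_STRIDE, (const void *)TEXCOORD_OFFSET);
*/
```

The "separate" alternative would be one tightly packed array per attribute, with a stride of 0 on each call.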

Heck, one of our devs even found it can be faster to use a CPU-side index list with VBO vertex attributes on some cards, when the index list changes frequently.

On pre-SM4 cards VBO perf used to be a total crapshoot, with it more likely to be slower than client arrays than faster, and that's without any dynamic VBO updates (you laugh, but we still have customers in the field with these and thus have to support them; these cards are only ~3yrs old and our customers use lots of GPUs). For recent gen cards, it's getting easier to be faster with VBOs, though still possible to find cases where VBOs lag. Batch setup seems more expensive with them than client arrays.

VBO updates aside though, I will say I am pleased with VBO performance on recent cards, particularly using NVidia's bindless batch data extension. With that, I can get very near to the performance of their legendary display lists (it's ~2X slower without bindless). So no doubt NVidia display lists use bindless internally (of course). VBOs+bindless is definitely the future (unless they come up with something even faster).

The question is, why do they push the use of a new feature if it's not implemented well?

That's a very good question. VBOs would have been a much easier sell if they didn't positively suck when they were first introduced, which lasted for several generations of cards. They're still a Ouija board, but the Ouija board has gotten much smaller on recent cards.

Another reason VBOs weren't such a slam-dunk sell is that the vendors did not provide guidance saying specifically "this is how you get the fastest VBO performance on our cards: use permutation A,B,C,F,M,P,R". And when a tip was dropped, half the time trying it gave worse performance.

Re: VBOs strangely slow?

Originally Posted by Baughn

In the end, glMapBuffer was (much) faster; preceding it with a glBufferData with a null pointer (discarding its contents), 10% faster than the vertex array.

Could you post your exact map code example in a follow-up? I think it'd be useful/informative for a number of folks to run all 3, and let you/us verify that everyone's seeing similar results on varying GPUs/vendors/drivers on exactly the same code.

Re: VBOs strangely slow?

Just for kicks, and to come to some VBO upload performance conclusions on modern hardware (at least with this 8MB/upload example), I thought I'd take the original two permutations (same VBO sizes/contents/rendering), and try a few variations for comparison: