Is it better to have one big buffer or several smaller ones?

Hi.
I am currently trying to find the right way to use OpenGL. Right now I am using several VBOs, each with about 1-3 MB of data.
But somebody recently told me that was a bad idea and I should try to use only 1 VBO in the entire application. This way the performance would be much better.
However, using only 1 VBO would make for a huge overhead in my application because then I would have to sort the contents by texture and shader to minimize the number of times I had to change either of those.

Would that be worth it? Would all this overhead in the application pay off in terms of performance? Is switching VBOs really such a costly operation?

But somebody recently told me that was a bad idea and I should try to use only 1 VBO in the entire application. This way the performance would be much better.

It is very unlikely. To get better performance you have to find and remove the bottleneck.
If binding several VBOs is the bottleneck (which I really doubt), then yes, you'll see some performance boost.
Another important aspect is data locality and alignment within the buffer. If locality is poor, execution time will be longer because of cache misses. Alignment is always at least 32 bits. That's why we don't have a 3-byte internal texture representation.
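
To make the alignment point concrete, here is a rough sketch. It assumes a hypothetical interleaved vertex (three floats for position, three bytes for color) and simply pads the stride up to the 4-byte boundary mentioned above. The names align_up, VertexLayout and make_layout are made up for illustration, not part of any API.

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `alignment`.
// `alignment` must be a power of two.
constexpr std::size_t align_up(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Byte offsets for a hypothetical interleaved vertex.
struct VertexLayout {
    std::size_t position_offset;
    std::size_t color_offset;
    std::size_t stride;
};

// Position: 3 floats. Color: 3 bytes (RGB).
// The raw size is 15 bytes on platforms with 4-byte floats,
// so the stride is padded up to 16.
inline VertexLayout make_layout() {
    const std::size_t position_size = 3 * sizeof(float);
    const std::size_t color_size = 3;
    VertexLayout layout;
    layout.position_offset = 0;
    layout.color_offset = position_size;
    layout.stride = align_up(position_size + color_size, 4);
    return layout;
}
```

The offsets and stride computed this way are the kind of values you would hand to your vertex attribute setup; the wasted byte per vertex is the price for keeping every vertex on a 32-bit boundary.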

Originally Posted by Cornix

However, using only 1 VBO would make for a huge overhead in my application because then I would have to sort the contents by texture and shader to minimize the number of times I had to change either of those.

That's the third aspect: the complexity of manipulating such a huge buffer.

The fourth aspect is memory management. If OpenGL for some reason cannot store a huge buffer in GPU memory, it won't be rendered (NV) or performance will be very poor (AMD). The memory eviction count could also be higher with huge buffers.

I'm amazed how some basic concepts are misinterpreted. Yes, it is generally better to have fewer VBOs, for plenty of reasons. But the generalization that an application should have just one huge VBO for everything is silly.

My current approach is to have one VBO per texture per shader. However, I don't have many different shaders and I don't have many different textures either. I guess I will have no more than 15 VBOs in total at any point in time. This makes it incredibly easy for me to fill these VBOs because I don't have to worry about sorting the data (there is no translucency and all meshes are of the same size).
In my render method I just go from shader to shader and from texture to texture, and render one VBO for each combination. So there is only a single draw call per VBO.
Would you say that is a good approach or should I try to do something different?
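
For reference, the layout described above could be sketched roughly like this. It is CPU-side bookkeeping only: ShaderId, TextureId, Mesh, build_buckets and count_state_changes are hypothetical names, and the comments only mark where the real GL calls would go.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical IDs; in a real program these would be GL object names.
using ShaderId = std::uint32_t;
using TextureId = std::uint32_t;

struct Mesh {
    ShaderId shader;
    TextureId texture;
    std::vector<float> vertices;  // already in the layout the shader expects
};

// One bucket per (shader, texture) pair; each bucket would back one VBO.
using BucketKey = std::pair<ShaderId, TextureId>;
using Buckets = std::map<BucketKey, std::vector<float>>;

// Append every mesh to the bucket for its shader/texture pair.
// No sorting of individual meshes is needed.
inline Buckets build_buckets(const std::vector<Mesh>& meshes) {
    Buckets buckets;
    for (const Mesh& m : meshes) {
        std::vector<float>& dst = buckets[{m.shader, m.texture}];
        dst.insert(dst.end(), m.vertices.begin(), m.vertices.end());
    }
    return buckets;
}

struct FrameStats {
    int shader_switches;
    int texture_switches;
    int draw_calls;
};

// Walk the buckets in key order (std::map iterates sorted by shader,
// then texture) and count the state changes one frame would need.
inline FrameStats count_state_changes(const Buckets& buckets) {
    FrameStats stats{0, 0, 0};
    bool first = true;
    BucketKey prev{};
    for (const auto& entry : buckets) {
        const BucketKey& key = entry.first;
        if (first || key.first != prev.first) {
            ++stats.shader_switches;   // program switch would go here
        }
        if (first || key.second != prev.second) {
            ++stats.texture_switches;  // texture bind would go here
        }
        ++stats.draw_calls;            // VBO bind + one draw call would go here
        prev = key;
        first = false;
    }
    return stats;
}
```

With around 15 buckets this gives at most 15 buffer binds and 15 draw calls per frame, which is easy to measure against other layouts later.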

I am not far enough into the development process to run into any performance issues. I just want to make sure I do it right before I start. It would be a pain to learn later that I have to redo everything. Besides, I want to do it right. I want to learn something here, not just finish some useless project.

[...]
For best results in my experience, either use NVidia bindless for vtx attrib/index list binds/enables (particularly in the case where you have a lot of static VBOs with smallish batches), OR use a streaming VBO approach (similar to what client arrays probably do under the hood), as those two approaches avoid nearly all of the overhead of binding many VBOs to render a frame.

Otherwise, lots of VBO binds can kill your performance.

So you say, I should only use a single big VBO and change its data dynamically depending on what I want to render? Do I understand you correctly?

Let's say we have a single huge VBO with about 640 MB of data, and 1 GB of RAM. The application is demanding and requires more than 512 MB of textures. Does anyone think this scenario will work with a single VBO? Who guarantees the driver won't try to evict such a huge buffer whenever it needs additional data?

There are applications where the number of batches is huge. That's why NV made the bindless extension. Several years ago I managed to achieve an almost interactive frame rate with a modest card and 64K VBOs using bindless, just to prove the concept.

In short, YES, it is better to have few VBOs rather than many, but would drivers do their memory management efficiently if there is a single huge VBO?

So you say, I should only use a single big VBO and change its data dynamically depending on what I want to render? Do I understand you correctly?

Not quite. What I'm saying is that in my experience, if you have a bunch of static batches that live in a bunch of VBOs, I haven't been able to beat NVidia bindless (on an NVidia card at least) for launching these batches with maximum throughput.

If, on the other hand, you have so many potential batches that they can't all fit on the GPU, the static batches option is a total non-starter, and you need a different approach. There, a streaming VBO approach with batch reuse works well. And since it's only one VBO, it doesn't have much bind overhead associated with it. However, you pay the run-time cost of uploading (and potentially re-uploading) your batch data when you need it.
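
To give an idea of the bookkeeping behind a streaming VBO with batch reuse, here is a minimal sketch of one possible scheme, not necessarily the one I use in production. StreamAllocator and its members are hypothetical names. It makes no GL calls, and it assumes a given batch id always has the same size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// CPU-side bookkeeping for one streaming VBO of fixed capacity.
class StreamAllocator {
public:
    struct Slot {
        std::size_t offset;  // byte offset of the batch inside the VBO
        bool needs_upload;   // true if the caller must copy the data in
    };

    explicit StreamAllocator(std::size_t capacity) : capacity_(capacity) {}

    // Returns where batch `id` (of `size` bytes) lives in the buffer.
    // Assumes a given id always comes with the same size.
    Slot acquire(std::uint64_t id, std::size_t size) {
        assert(size <= capacity_);
        auto it = resident_.find(id);
        if (it != resident_.end()) {
            // Batch reuse: the data is already in the buffer.
            return Slot{it->second, false};
        }
        if (head_ + size > capacity_) {
            // Out of space: this is where the VBO would be orphaned
            // (e.g. glBufferData with a NULL data pointer). All
            // previously uploaded batches are forgotten.
            resident_.clear();
            head_ = 0;
            ++orphan_count_;
        }
        const std::size_t offset = head_;
        head_ += size;
        resident_[id] = offset;
        // The caller would now upload `size` bytes at `offset`
        // (e.g. glBufferSubData or glMapBufferRange).
        return Slot{offset, true};
    }

    std::size_t orphan_count() const { return orphan_count_; }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;
    std::size_t orphan_count_ = 0;
    std::unordered_map<std::uint64_t, std::size_t> resident_;
};
```

Each frame you would call acquire for every visible batch, upload only when needs_upload is set, and draw from the returned offset. How often orphan_count goes up tells you whether the buffer is sized sensibly for your working set.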

That said, I don't know everything and would love to hear from other users that have found other ways to load/launch batches that might perform even better.

So I'd encourage you to try multiple techniques, bench them, and pick the best for your use case(s).

Let's say we have a single huge VBO with about 640 MB of data, and 1 GB of RAM. The application is demanding and requires more than 512 MB of textures. Does anyone think this scenario will work with a single VBO? Who guarantees the driver won't try to evict such a huge buffer whenever it needs additional data?

Batch and texture streaming are somewhat different, so I wouldn't advocate using a shared BO for that. But to your point about GPU memory overruns...

Unfortunately, a cardinal rule of GPU development is to make sure that your total GPU/driver memory consumption doesn't exceed your GPU memory capacity. Otherwise, "bad things" will happen which you don't have direct control over (frame rate halves or drops into the toilet, massive frame spikage as the driver shuffles data on/off the board, etc.).

GL tries to abstract GPU memory, but in practice if you care about performance, you have to know how much you have and engineer your usage to fit within it.
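
As a trivial illustration of "engineer your usage to fit", here is a sketch of a budget check using the numbers from the question above. Budget, total_bytes and fits are hypothetical names, and both the capacity and the reserved amount are values you would have to find out or estimate for your target hardware; core GL does not hand them to you.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t MiB = 1024ull * 1024ull;

struct Budget {
    std::uint64_t capacity_bytes;  // GPU memory on the target card (assumed known)
    std::uint64_t reserved_bytes;  // framebuffers, driver overhead, etc. (an estimate)
};

// Sum of all buffer and texture allocations the application plans to make.
inline std::uint64_t total_bytes(const std::vector<std::uint64_t>& allocations) {
    std::uint64_t sum = 0;
    for (std::uint64_t a : allocations) {
        sum += a;
    }
    return sum;
}

// True if the planned allocations plus the reserve fit in GPU memory.
inline bool fits(const Budget& budget, const std::vector<std::uint64_t>& allocations) {
    return total_bytes(allocations) + budget.reserved_bytes <= budget.capacity_bytes;
}
```

With a 1 GB card, a 640 MB VBO plus 512 MB of textures already fails this check before any reserve is counted, regardless of whether the vertex data lives in one buffer or in many.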

Another unfortunate (and related) cardinal rule of GPU development seems to be "do not run-time delete and recreate textures", because GPU memory reorganization is expensive. I'd really like for that one to go away. And with a different memory alloc/free abstraction, I think both of these could be dealt with.