Uniform Buffer Objects performance issues

First off, I read this thread http://www.opengl.org/discussion_boa...29-help-needed, but since the last post there dates from 2.5 years ago, I thought I could add something here. Basically, I am experiencing the same problems as the author of the aforementioned thread. I have a GF 240 GT with fairly recent drivers.

I render 625 meshes and, obviously, need a world transform matrix for each. There is also a view-projection matrix passed to the shader (it is set only once because it is constant for all objects, so only the world transform needs to be updated per mesh). Using the traditional uniform variables approach I manage to render everything in less than 2ms, which is a little over 500 FPS (I call glFinish before recording the time).

Now I have switched to a constant buffer. When I update the buffer's data with glMapBufferRange, performance suffers immensely: a frame takes around 120ms to render. On the other hand, when I update the buffer's data with glBufferSubData, the CPU time needed to execute the API calls is less than 1ms (!) *but* that is before calling glFinish. After calling glFinish the measured time is around 9ms, which gives 120 FPS or so.
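For reference, the frame times and FPS figures quoted here are related by FPS = 1000 / (frame time in ms). A minimal sketch of that conversion follows; the function name framesPerSecond is made up for illustration and is not part of any API.

```cpp
#include <cassert>

// Convert a measured frame time in milliseconds to frames per second.
// Hypothetical helper, only meant to make the arithmetic explicit.
double framesPerSecond(double frameTimeMs) {
    assert(frameTimeMs > 0.0);  // a zero or negative frame time is meaningless
    return 1000.0 / frameTimeMs;
}
```

With this, 2ms maps to exactly 500 FPS, while 9ms maps to roughly 111 FPS; 120 FPS would correspond to about 8.3ms per frame.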

The thing that bothers me most is the difference between the timings taken before and after calling glFinish. If submitting all objects takes less than 1ms and the glFinish call is that expensive, I guess OpenGL is simply buffering all the commands. If so, I think that is quite a lot of data to buffer.

Has anyone ever decided to abandon the good old uniform variables and switch completely to uniform buffers?

Sorry for not being specific. By "switched to a constant buffer" I mean I have a constant buffer which holds only two matrices, world and viewProj. This constant buffer is updated before each draw call. I'm doing it this way to be consistent with DX10/11.
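To make the layout concrete, here is a rough sketch of what the CPU-side mirror of such a buffer could look like. It assumes a std140 uniform block declared as `layout(std140) uniform PerDraw { mat4 world; mat4 viewProj; };`. The block name, the struct name and the helper are illustrative and not taken from the actual code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical CPU-side mirror of the uniform block (names are illustrative).
// Under std140 a mat4 is stored as four vec4 columns, i.e. 64 bytes,
// so the two matrices sit at byte offsets 0 and 64.
struct PerDrawConstants {
    float world[16];     // column-major 4x4 matrix, 64 bytes
    float viewProj[16];  // column-major 4x4 matrix, 64 bytes
};

static_assert(sizeof(PerDrawConstants) == 128,
              "two mat4s should occupy 128 bytes");
static_assert(offsetof(PerDrawConstants, viewProj) == 64,
              "viewProj should start at byte 64");

// Number of bytes one full update of the block would upload,
// e.g. as the size argument of glBufferSubData.
constexpr std::size_t perDrawUpdateSize() {
    return sizeof(PerDrawConstants);
}
```

So each per-mesh update uploads 128 bytes, of which only the first 64 (the world matrix) actually change between draw calls.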

Has anyone ever decided to abandon the good old uniform variables and switch completely to uniform buffers?

No, but I will be most interested in your results. From what I have read, uniform buffers have to be copied to registers (uniforms) prior to use, so they do have an overhead. That was from older articles and may be out of date.

I have been caught out by command buffering when benchmarking. My assumption is that OpenGL does not actually buffer that much but sends commands to the GPU, where they get stuck in queues. Certain OpenGL commands require a response from the GPU, and that is where the driver suspends, waiting for the GPU to execute that command. (Certainly that is how channel control programs worked for mainframe front-end processes when I used to write that code many eons ago.)

Sorry for not being specific. By "switched to a constant buffer" I mean I have a constant buffer which holds only two matrices, world and viewProj. This constant buffer is updated before each draw call. I'm doing it this way to be consistent with DX10/11.

One last question that actually matters. Do you use a single UBO for all the meshes or one per mesh?

One UBO for all meshes looks like this, I guess:

Code :

for mesh in meshes do
    update UBO 0
    draw mesh
endfor

Yes, I have only one constant buffer. Moreover, it is set only once so the code should not suffer any redundant API overhead. The only extra function I call for each iteration is glBufferSubData to update data in the constant buffer under slot 0. Basically GL Intercept logs this for each mesh:

Draw calls are not executed at the time you issue them. In most implementations they are queued in a command buffer and the driver decides when to submit them for execution. When you update the buffer, the previous draw call still depends on its old contents, so the driver cannot simply overwrite it without affecting that draw call. What the driver can do is either wait for the dependency to be resolved (the previous draw call has finished) or create a copy (copy-on-write) for the new draw call to use. Both solutions are somewhat expensive.

An easy way to test this theory is to use one UBO per draw call. I bet you will see an improvement.
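A closely related variant (not literally one UBO per draw call) is a single large buffer holding one slice per mesh, filled once per frame, with each draw selecting its slice through glBindBufferRange. The sketch below only computes the slice offsets. The helper names are made up, and the alignment of 256 used in the note after the code is an example value; the real one has to be queried with glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...).

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `alignment`.
std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}

// Byte offset of the slice belonging to draw call `drawIndex`.
// `blockSize` is the size of one uniform block (128 bytes for two mat4s).
std::size_t sliceOffset(std::size_t drawIndex, std::size_t blockSize,
                        std::size_t alignment) {
    return drawIndex * alignUp(blockSize, alignment);
}

// Total size of a buffer that holds one slice per draw call.
std::size_t totalBufferSize(std::size_t drawCount, std::size_t blockSize,
                            std::size_t alignment) {
    return drawCount * alignUp(blockSize, alignment);
}
```

For 625 meshes, a 128-byte block and an assumed alignment of 256, the buffer would be 160000 bytes and the last mesh's slice would start at byte 159744. Each draw would then bind its slice with glBindBufferRange(GL_UNIFORM_BUFFER, 0, buffer, offset, 128), so no buffer region is rewritten while an earlier draw call in the same frame still reads from it.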

What I am trying to say is that by using one buffer you are not using OpenGL in an optimal way. The sequence you describe has a read/write dependency problem, because every update to the buffer has to wait on the previous draw call that reads from it.