Inferior OpenGL performance in mapping uniform buffers

I'm porting my engine to different platforms and working on GL performance, which is far inferior to my d3d implementation.

Inspecting more, I have found that the main performance lags are in uniform buffer map/unmap; check out the screenshot of the PerfStudio result below.
The same D3D app (with a simple scene) is about 50% faster than GL, and as scene complexity (and so the number of uniform buffer maps) gets higher, I get exponentially lower performance for GL.

I've found the same; the only reasonable way I've been able to update UBOs is to create a single large UBO at startup (big enough for my max number of scene objects), do one big update via glBufferSubData once at the start of each frame, then use glBindBufferRange per object. Even then it was still slower than the common D3D method (outlined in the next paragraph), but only slightly (on the order of 5% to 10%), so I decided to just live with it.
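For anyone trying the single-large-UBO approach, here is a rough sketch of the offset bookkeeping it needs. The names (alignUp, UboLayout, makeUboLayout) are made up for illustration and are not from anyone's engine. The one hard requirement it encodes is that the offset passed to glBindBufferRange for a uniform buffer must be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, so each object's block is padded up to that alignment. The GL calls are only shown as comments since they need a live context.

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `alignment`.
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment != 0);
    return ((size + alignment - 1) / alignment) * alignment;
}

// Hypothetical helper (not from the thread): describes where each
// object's block lives inside the one big UBO.
struct UboLayout {
    std::size_t stride;     // bytes between consecutive objects
    std::size_t totalSize;  // bytes to allocate once at startup

    std::size_t offsetOf(std::size_t objectIndex) const {
        return objectIndex * stride;
    }
};

// `offsetAlignment` is assumed to come from
// glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &value).
inline UboLayout makeUboLayout(std::size_t bytesPerObject,
                               std::size_t maxObjects,
                               std::size_t offsetAlignment) {
    const std::size_t stride = alignUp(bytesPerObject, offsetAlignment);
    return UboLayout{stride, stride * maxObjects};
}

// Per frame, with a CPU-side staging copy laid out using the same stride:
//   glBindBuffer(GL_UNIFORM_BUFFER, ubo);
//   glBufferSubData(GL_UNIFORM_BUFFER, 0, objectCount * layout.stride, staging);
//   for each object i:
//     glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo,
//                       layout.offsetOf(i), bytesPerObject);
//     ... draw ...
```

With 192 bytes per object and an alignment of 256 (a value some drivers report, used here only as an example), each object ends up occupying a 256-byte slot.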

Obviously this doesn't map well to the D3D code you're likely using (a single small constant buffer which you can map with discard per-object) and means that you'd need to start having divergent code paths which sucks somewhat. Maybe others can chime in with more info.

If there are a lot of these every frame, especially if they're of varying sizes, I wouldn't expect that to be blazingly fast.

Try allocating a single large buffer, streaming results into it with glMapBufferRange using (GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT), and then, only when it fills up, orphaning it with (GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT).
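A minimal sketch of the bookkeeping behind that streaming scheme, in case it helps. StreamCursor, StreamSlot and MapMode are hypothetical names; the class only decides the offset and which of the two flag combinations to use, and the actual glMapBufferRange call is left as a comment because it needs a context. The alignment parameter is an assumption for the UBO case, where it would be GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.

```cpp
#include <cassert>
#include <cstddef>

// Which flag combination the caller would pass to glMapBufferRange.
enum class MapMode {
    Append,  // GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT
    Orphan   // GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT
};

struct StreamSlot {
    std::size_t offset;  // byte offset to map at
    MapMode mode;        // flags to use for this map
};

class StreamCursor {
public:
    StreamCursor(std::size_t capacity, std::size_t alignment)
        : capacity_(capacity), alignment_(alignment), head_(0) {
        assert(alignment_ != 0 && capacity_ >= alignment_);
    }

    // Reserve `size` bytes. While the request fits, hand out the next
    // unused range; once it no longer fits, restart at offset 0 and
    // ask the caller to orphan the buffer.
    StreamSlot reserve(std::size_t size) {
        const std::size_t padded =
            ((size + alignment_ - 1) / alignment_) * alignment_;
        assert(padded <= capacity_);
        MapMode mode = MapMode::Append;
        if (head_ + padded > capacity_) {
            head_ = 0;
            mode = MapMode::Orphan;
        }
        const StreamSlot slot{head_, mode};
        head_ += padded;
        return slot;
    }

private:
    std::size_t capacity_;
    std::size_t alignment_;
    std::size_t head_;
};

// Per update:
//   StreamSlot s = cursor.reserve(bytes);
//   void* p = glMapBufferRange(GL_UNIFORM_BUFFER, s.offset, bytes, flagsFor(s.mode));
//   memcpy(p, data, bytes);
//   glUnmapBuffer(GL_UNIFORM_BUFFER);
//   glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo, s.offset, bytes);
```

The unsynchronized appends are only safe because each one writes a range that nothing submitted since the last orphan has used; that is the whole point of keeping the cursor.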

Yes, I'm orphaning on every update. This is exactly what I'm doing in the D3D version.
But the D3D version is orders of magnitude faster than GL, with the same method.

Is this because of inferior drivers?

It makes sense that we should be able to lock the buffers whenever we want without worrying about these issues; the drivers are there to optimize this stuff under the hood, aren't they?
Unfortunately, making a large buffer for the whole scene, as mhagain said, sucks, because I'll have to implement two different code paths for D3D and GL.

Unfortunately the GL buffer object API has always been kinda crap like this. Usage hints instead of explicit behaviours, drivers doing what they want anyway, shuffling buffers around between different storage based on heuristics, inconsistent behaviour for different buffer object types, and synchronization all over the place despite being told to not do so.

It's not really the drivers at fault; it's that buffer objects were specified fairly loose and woolly to begin with and the drivers are just trying to make the best of a bad specification. With D3D buffers you're used to being able to say "do this" and the driver does what you want (or gives you a nice big error if it can't); with GL buffers things are regrettably not so straightforward.

For UBOs, and as I indicated, the best performance I've personally had was from glBufferSubData (roughly equivalent to UpdateSubresource) rather than glMapBufferRange. Yes, that means having to do an extra memory copy, but at least the driver can copy off your data and do the update in a more orderly manner, managing resource contention itself.

Using glBufferSubData on AMD Catalyst 13.9 causes severe spikes in the frame time (and a higher frame time overall).
glMapBufferRange, on the other hand, doesn't produce any spikes, so it was the best solution for me.

As for the API problem, the API is definitely less consistent than D3D, but how is this particular mapping scheme different from D3D?
I didn't get what you mean. D3D also has more or less the same map/unmap API, so the driver could do what it does for D3D buffers when it detects that we are using the same kind of APIs, usage hints and flags.

Another question: do you work with NVIDIA or AMD drivers?
I'm eager to test it on NVIDIA drivers too, but I don't have the hardware.

My best guess is to have two large UBOs and, like mhagain mentioned, update everything with one glBufferSubData before you render anything with that UBO. Two UBOs instead of one are for manual double-buffering; otherwise you may end up waiting for the previous one to no longer be in use by the FIFO, or orphaning the buffer.

Orphaning is easiest to do with glBufferData or glBufferSubData(..,0,fullsize), but with orphaning you end up wasting time allocating and CPU-mapping those large buffers again and again. If you have two or more buffers (swapping between them on every intermediate render), chances are the current buffer is no longer in use by the FIFO, and the driver will detect that it doesn't need to orphan it. Thus it will probably reuse the allocation and the CPU mapping. To make best use of such cases, I would call glMapBuffer(.., GL_WRITE_ONLY) at the start of the render, fill in all the data needed for the render, call glUnmapBuffer, and then call glBindBufferRange/glBindBufferBase multiple times (once per drawcall). This should have the same effect as glBufferSubData(..,0,fullsize), except that you avoid an unnecessary memcpy().
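The rotation itself is trivial, but for completeness here is a small sketch of it. BufferRotation is a hypothetical name, and the handles are assumed to be buffer names already created with glGenBuffers and sized with glBufferData at startup. The map/fill/unmap/bind sequence is shown as comments only, since it needs a live context.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Cycles through N pre-allocated buffer objects so that the buffer
// being written this render is (hopefully) not the one the GPU is
// still reading from.
template <std::size_t N>
class BufferRotation {
    static_assert(N >= 2, "need at least two buffers to rotate");

public:
    explicit BufferRotation(const std::array<unsigned, N>& handles)
        : handles_(handles), index_(0) {
        // 0 is never a valid name for a generated buffer object.
        for (unsigned h : handles_) {
            assert(h != 0);
        }
    }

    unsigned current() const { return handles_[index_]; }

    // Call once per render, after the draw calls have been submitted.
    void advance() { index_ = (index_ + 1) % N; }

private:
    std::array<unsigned, N> handles_;
    std::size_t index_;
};

// Per render:
//   glBindBuffer(GL_UNIFORM_BUFFER, rotation.current());
//   void* p = glMapBuffer(GL_UNIFORM_BUFFER, GL_WRITE_ONLY);
//   ... fill in the data for every draw call of this render ...
//   glUnmapBuffer(GL_UNIFORM_BUFFER);
//   for each draw call:
//     glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint,
//                       rotation.current(), offset, size);
//     ... draw ...
//   rotation.advance();
```

Note that nothing here guarantees the GPU has finished with the buffer you rotate back to; two buffers just make it likely, which is the bet described above.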

In your case, you are uploading 192 bytes per drawcall. For such small updates, there's a perfectly suited circular buffer that's guaranteed full speed and is persistently mapped to the CPU: the non-UBO uniforms (the glUniform*fv calls, e.g. glUniform4fv). The trouble with them is that, for maximum throughput, you should have only one uniform, along the lines of "uniform mat4 vvv[10];" or "uniform vec4 fff[10];", and a bunch of unwieldy "#define uni_myAmbientColor fff[3].xyz" macros littering your shader.
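To make the single-uniform idea a bit more concrete, here is a sketch of the CPU side. Everything in it (PackedUniforms, setVec4, setMat4, the slot numbers) is hypothetical; the array is sized at 12 vec4 slots only because 12 x 16 bytes matches the 192 bytes per drawcall mentioned above, so the shader would declare "uniform vec4 fff[12];". The upload is a single glUniform4fv call, shown as a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kSlotCount = 12;  // 12 vec4 slots = 192 bytes

// CPU-side mirror of the shader's "uniform vec4 fff[12];".
struct PackedUniforms {
    float v[kSlotCount * 4];
};

// Write one vec4 into the given slot.
inline void setVec4(PackedUniforms& p, std::size_t slot,
                    float x, float y, float z, float w) {
    assert(slot < kSlotCount);
    float* dst = p.v + slot * 4;
    dst[0] = x;
    dst[1] = y;
    dst[2] = z;
    dst[3] = w;
}

// Copy a 4x4 matrix (16 floats, column-major) into four consecutive slots.
inline void setMat4(PackedUniforms& p, std::size_t firstSlot,
                    const float* m16) {
    assert(firstSlot + 4 <= kSlotCount);
    std::memcpy(p.v + firstSlot * 4, m16, 16 * sizeof(float));
}

// Shader side (GLSL), using the macro trick from the post:
//   uniform vec4 fff[12];
//   #define uni_myAmbientColor fff[3].xyz
//
// Per draw call, one upload for everything:
//   glUniform4fv(location, 12, packed.v);
```

The slot numbers have to be kept in sync with the #define macros in the shader by hand, which is exactly the unwieldy part.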