Uniform Buffer Object Performance

I just added uniform block support, but the performance is terrible. I'm sharing a UBO between multiple programs. On the plus side, if a uniform changes infrequently, I only need to update it once and every program sees it. The problem is that some uniforms (e.g. the modelview matrix) need to be updated in the UBO for every draw call. Each update creates an implicit sync, and the driver ends up blocking quite a bit.

I tried glMapNamedBufferRangeEXT with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT, but the performance is still in the toilet.

Is there a way to make shared uniform blocks fast for frequently changing uniforms?

I figured that might have been the problem, so I moved all the frequently changing uniforms out, but the performance is still worse with the uniform block. I'm using the default "shared" layout. This is on an NVIDIA 780.

Also, don't use Map/Unmap to update the UBO. Try using glBufferData() to replace the whole UBO (which should be rather small, maybe a few hundred bytes, right?). This works better on multithreaded drivers, and it also tells the driver that it can orphan the previous buffer contents and queue the update instead of stalling.

I can't find a path that makes UBOs faster than regular uniforms. I moved the frequently changing uniforms out of the uniform block, and I upload the entire UBO only when something in it changes, which should be rare. For my scene, regular uniforms run at ~7.7 ms and UBOs run at 8.4 ms. Maybe a shared UBO would come out ahead if I had 1,000 different shader programs and the uniforms changed only once per frame.
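
For anyone trying the same thing, here is a minimal sketch of the "re-upload the whole block only when something changed" approach described above. It is not the code from this thread: SharedBlock, SharedBlockCache, and the cache_* functions are hypothetical names, and the block contents (one mat4 and one vec4) are assumed for illustration. The upload is passed in as a callback so the logic runs without a GL context; in a real renderer the callback would do the glBufferData() replacement suggested earlier.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical block of rarely changing uniforms shared by every program.
 * With only mat4 and vec4 members this C struct lines up with a std140
 * block; with the default "shared" layout the offsets would have to be
 * queried from the driver instead. */
typedef struct {
    float projection[16];  /* mat4 */
    float light_dir[4];    /* vec4 */
} SharedBlock;

/* Upload callback. In a real renderer this would be roughly:
 *     glBindBuffer(GL_UNIFORM_BUFFER, ubo);
 *     glBufferData(GL_UNIFORM_BUFFER, size, data, GL_DYNAMIC_DRAW);
 * i.e. replace the whole buffer so the driver can orphan the old one. */
typedef void (*UploadFn)(const void *data, size_t size, void *user);

typedef struct {
    SharedBlock cpu_copy;  /* CPU-side shadow of the UBO contents */
    int dirty;             /* nonzero when cpu_copy differs from the GPU copy */
} SharedBlockCache;

void cache_init(SharedBlockCache *c) {
    memset(c, 0, sizeof *c);
    c->dirty = 1;  /* force one initial upload so the GPU copy starts in sync */
}

/* Only mark the block dirty if the value actually changed. */
void cache_set_light_dir(SharedBlockCache *c, const float dir[4]) {
    if (memcmp(c->cpu_copy.light_dir, dir, sizeof c->cpu_copy.light_dir) != 0) {
        memcpy(c->cpu_copy.light_dir, dir, sizeof c->cpu_copy.light_dir);
        c->dirty = 1;
    }
}

/* Call once per frame before drawing. Returns 1 if an upload happened. */
int cache_flush(SharedBlockCache *c, UploadFn upload, void *user) {
    if (!c->dirty) {
        return 0;
    }
    upload(&c->cpu_copy, sizeof c->cpu_copy, user);
    c->dirty = 0;
    return 1;
}

/* Stand-in upload so the sketch runs without a GL context: counts calls. */
void count_upload(const void *data, size_t size, void *user) {
    (void)data;
    (void)size;
    ++*(int *)user;
}
```

With this in place, a frame where nothing in the block changed costs one integer check and no GL calls for the shared uniforms. Per-draw values such as the modelview matrix stay outside the block as regular uniforms, as in the posts above.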