BufferData vs BufferSubData for view matrix UBO

I have read in many places that BufferSubData may block if the GPU is currently using the buffer it is called on. I am using a UBO for my view matrix, which gets updated for every object I draw, since I pre-multiply the world-space to camera-space matrix with the model-space to world-space matrix. Would calling BufferData instead actually increase performance? Since it allocates new storage on every call, wouldn't it avoid the blocking, so that the GPU keeps using the previous buffer, which gets deleted when it's done? Or does the allocation nullify any performance benefit?

glBufferSubData may not block - the driver may copy off the data to temp storage and only commit it to the buffer object when draw (or any other) calls that depend on it come around. That would be implementation-dependent behaviour though.

In general, however, updating a UBO per object is going to be a much bigger bottleneck. There have been a few threads about this before, and the finding is that it's most performant to have a single big UBO, sized large enough for all of your objects, which is updated only once per frame, and then to use glBindBufferRange to select the portion of the UBO that's used for each object.
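A minimal sketch of the offset arithmetic behind that approach, assuming the alignment value has already been queried with glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...). The names align_up, UboRange and object_range are made up for illustration, and the GL calls appear only in comments because they need a live context.

```cpp
#include <cassert>
#include <cstddef>

// Round `value` up to the next multiple of `alignment`.
inline std::size_t align_up(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Hypothetical helper type: a byte range inside the big UBO.
struct UboRange {
    std::size_t offset;
    std::size_t size;
};

// Byte range of object `index` when every object owns one block of
// `block_size` bytes, placed at a multiple of the aligned stride.
inline UboRange object_range(std::size_t index, std::size_t block_size,
                             std::size_t alignment) {
    const std::size_t stride = align_up(block_size, alignment);
    return UboRange{index * stride, block_size};
}

// Per frame, with a live GL context (not executed here):
//   glBufferSubData(GL_UNIFORM_BUFFER, 0, total_bytes, staging);  // one upload
//   for each object i:
//       UboRange r = object_range(i, block_size, alignment);
//       glBindBufferRange(GL_UNIFORM_BUFFER, binding_point, ubo, r.offset, r.size);
//       ... draw object i ...
```

With a 64-byte matrix and an alignment of 256, for example, object 3 starts at byte 768 and only 64 of each 256-byte slot are used.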

Because of UBO offset-alignment requirements you're going to have a lot of empty space in there if you just use it for storing matrices, so there are probably a few other per-object uniforms that you'll also want to include (lighting values, for example). It makes sense here to define a common struct that all of your objects can use and to make two passes through your list of objects: the first pass fills in an array of these structs, which you then upload with glBuffer(Sub)Data, and the second pass draws.
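The first pass could look roughly like this. It is only a sketch: the PerObject struct and its fields are hypothetical, chosen so that the C++ layout lines up with a std140 block containing a mat4 followed by a vec4, and pack_objects is a made-up helper name. The upload and the draw pass are left as comments since they need a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical per-object block: a mat4 followed by a vec4 (80 bytes).
struct PerObject {
    float model_view[16];  // pre-multiplied view * model matrix
    float light_color[4];  // example of an extra per-object uniform
};

// Pass 1: copy every object's data into one CPU-side buffer, giving each
// object its own slot that starts at a multiple of `offset_alignment`
// (the queried GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT).
inline std::vector<unsigned char> pack_objects(
        const std::vector<PerObject>& objects, std::size_t offset_alignment) {
    assert(offset_alignment > 0);
    const std::size_t stride =
        (sizeof(PerObject) + offset_alignment - 1) / offset_alignment
        * offset_alignment;
    std::vector<unsigned char> staging(objects.size() * stride, 0);
    for (std::size_t i = 0; i < objects.size(); ++i) {
        std::memcpy(staging.data() + i * stride, &objects[i], sizeof(PerObject));
    }
    return staging;
}

// Then, once per frame (not executed here):
//   glBufferSubData(GL_UNIFORM_BUFFER, 0, staging.size(), staging.data());
// Pass 2: bind each object's slot with glBindBufferRange and draw it.
```

The bytes between sizeof(PerObject) and the stride are the padding the alignment forces; they are zero-filled here and never read by the shader.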

Even with this you're still likely to find a lot of empty space in your UBO (offset-alignment again) and it may be tempting to view this as "wasted memory". In reality that's a tradeoff you're going to have to accept in exchange for getting a much faster once-only-per-frame update.

OK, that makes sense. That fits in well with my scene graph anyway: currently, I have each node holding a transform matrix relative to its parent, so I could do that by simply traversing the graph twice every frame to calculate the matrices first and render second. Not to mention, that makes for more potential gain from multithreading.

My program is object oriented, and I've tried pretty hard to keep all data associated with each OpenGL object wrapped in classes. Would calling glBufferData with a null pointer, then mapping the buffer each frame hold the same performance benefit? That way, I could just keep offsets into the big buffer in each class and manage those offsets somehow, in a manner similar to memory allocation.
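For the offset bookkeeping, one possible shape is a per-frame linear allocator that hands out aligned offsets into the big buffer. This is a sketch under my own assumptions, not an answer to the performance question: UboLinearAllocator and kOutOfSpace are made-up names, and the GL calls are shown only as comments.

```cpp
#include <cassert>
#include <cstddef>

// Returned by allocate() when the request does not fit (hypothetical sentinel).
constexpr std::size_t kOutOfSpace = static_cast<std::size_t>(-1);

// Hands out aligned byte offsets into one big UBO; reset once per frame.
class UboLinearAllocator {
public:
    UboLinearAllocator(std::size_t capacity, std::size_t alignment)
        : capacity_(capacity), alignment_(alignment), next_(0) {
        assert(alignment > 0);
    }

    // Reserve `size` bytes and return the aligned offset of the reservation.
    std::size_t allocate(std::size_t size) {
        const std::size_t offset =
            (next_ + alignment_ - 1) / alignment_ * alignment_;
        if (offset > capacity_ || size > capacity_ - offset) {
            return kOutOfSpace;
        }
        next_ = offset + size;
        return offset;
    }

    // Forget all reservations, e.g. at the start of a frame.
    void reset() { next_ = 0; }

private:
    std::size_t capacity_;
    std::size_t alignment_;
    std::size_t next_;
};

// Per frame, with a live GL context (not executed here):
//   glBufferData(GL_UNIFORM_BUFFER, capacity, nullptr, GL_DYNAMIC_DRAW);
//   void* base = glMapBuffer(GL_UNIFORM_BUFFER, GL_WRITE_ONLY);
//   allocator.reset();
//   each object: offset = allocator.allocate(block_size); write at base + offset
//   glUnmapBuffer(GL_UNIFORM_BUFFER);
```

Each wrapper class would then store only the offset it was given for the current frame and pass it to glBindBufferRange at draw time.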