Uniform Buffers vs. Texture Buffers

OpenGL 3.1 introduced two new sources from which shaders can retrieve their data, namely uniform buffers and texture buffers. These can be used to accelerate rendering when shaders make heavy use of application-provided data, as in the case of skeletal animation, especially when combined with geometry instancing. However, even though the functionality has been in the core specification for about a year now, there are few demos out there showing its usage, so there is a lot of confusion about when to use these buffers and which one is more suitable for a particular use case.

Both AMD and NVIDIA have updated their GPU programming guides to present the latest facilities provided by both OpenGL and DirectX. Still, I see that people don't really understand how these buffers work, and that prevents them from taking full advantage of the features.

Once, on some online forum, I found somebody asking why the Khronos Group introduced this whole confusion in the first place, and why there is no single general buffer type instead, with the decision whether to use uniform or texture buffers left to the driver. That particular post motivated me to write this article.

Such an abstraction may well be suitable at the application level. However, one should never forget that OpenGL is just a thin layer on top of graphics-capable hardware, and as such it should not hide details that, in the hands of a good programmer, can provide additional performance benefits.

When used as input to shaders, both uniform buffers and texture buffers have strengths and weaknesses that are well documented for application developers, especially considering the detailed descriptions of each in the vendors' GPU programming guides. It would be very difficult, if not impossible, for the driver to decide which buffer type to use based on the shader source code alone, and doing so would give the programmer less flexibility.

To decide which of the two should be used for a particular purpose, the developer must investigate the characteristics of both and make the choice based on that. To ease this decision, I will try to present the most important features of each. I will also talk about what I've used them for and what results I've achieved.

Uniform Buffers

Uniform buffers were introduced in OpenGL 3.1, but they are also available on driver implementations that don't conform to version 3.1 of the standard via the GL_ARB_uniform_buffer_object extension. As the specification says, uniform buffers provide a way to group GLSL uniforms into so-called "uniform blocks" and source their data from buffer objects, providing more streamlined access possibilities for the application.
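A uniform block looks like this in GLSL (a minimal sketch; the block and member names here are hypothetical):

```glsl
// Hypothetical uniform block; its storage comes from a buffer object
// bound to the block's uniform buffer binding point.
layout(std140) uniform TransformBlock {
    mat4 projection;
    mat4 modelView;
    vec4 lightPosition;
};
```

The std140 layout qualifier requests the standardized memory layout defined by the specification, so the application can compute member offsets without querying them.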

Screenshot from the Instanced Tessellation demo of NVIDIA

As uniform buffers are relatively small, they can easily fit in local (on-chip) memory. This makes data access nearly instant and thus provides optimal performance whenever the size constraints don't prevent the application developer from using them. However, the vendors also state that uniform buffers prefer a sequential memory access pattern. This means they perform best when the accesses to the data in the uniform buffer are relatively local; it does not necessarily mean that the sequential reads must occur within a single shader execution, since, as in the case of geometry instancing, subsequent shader executions can provide the desired access pattern.
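On the application side, setting up such a buffer takes only a few calls. A sketch (it assumes a linked program object `prog` whose shader declares a uniform block named `TransformBlock`, and `blockData` stands in for the application's actual data; a GL 3.1 context is required):

```c
/* Query the block's index and assign it to binding point 0. */
GLuint blockIndex = glGetUniformBlockIndex(prog, "TransformBlock");
glUniformBlockBinding(prog, blockIndex, 0);

/* Create the buffer object that backs the block and attach it
   to the same binding point. */
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(blockData), &blockData, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);
```

Updating the data for every frame is then just another glBufferData or glBufferSubData call on the same buffer object.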

Personally, I use them for instanced rendering by storing the model-view matrix and related information of every instance in a common uniform buffer and using the instance ID as an index into this combined data structure. This usage performs very well on my system.
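A vertex shader sketch of this setup (the array size and names are made up; note that 256 mat4s are exactly 16 KiB, the minimum value of GL_MAX_UNIFORM_BLOCK_SIZE the specification guarantees):

```glsl
#version 140

layout(std140) uniform InstanceData {
    // One model-view matrix per instance; 256 * 64 bytes = 16 KiB,
    // which fits the minimum guaranteed uniform block size.
    mat4 modelView[256];
};

uniform mat4 projection;
in vec4 position;

void main() {
    // The instance ID selects this instance's transform from the block.
    gl_Position = projection * modelView[gl_InstanceID] * position;
}
```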

Uniform buffers can also be used to store the bone matrices for implementing skeletal animation. However, I personally prefer using normal 2D textures for this purpose to take advantage of the free interpolation thanks to the dedicated texture fetching units, but that's another story.

Uniform buffers can also be used for other rendering techniques like skinned instancing or geometry deformation, but the buffer size limitation may rule out such use cases.

Texture Buffers

Texture buffers also became core OpenGL in version 3.1 of the specification, but they are available via the GL_ARB_texture_buffer_object extension as well (or via the GL_EXT_texture_buffer_object extension on earlier implementations). Buffer textures are one-dimensional arrays of texels whose storage comes from an attached buffer object.
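Creating one takes just a few calls. A sketch (requires a GL 3.1 context; `data` and `dataSize` are placeholders for the application's actual data):

```c
/* Create the buffer object that holds the raw data. */
GLuint buf, tex;
glGenBuffers(1, &buf);
glBindBuffer(GL_TEXTURE_BUFFER, buf);
glBufferData(GL_TEXTURE_BUFFER, dataSize, data, GL_STATIC_DRAW);

/* Create the texture and attach the buffer as its storage.
   GL_RGBA32F makes each texel appear as a vec4 in the shader. */
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_BUFFER, tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, buf);
```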

Screenshot from the Simple Texture Buffer Object demo of NVIDIA

They provide the largest storage capacity for raw data access, much larger than that of equivalent 1D textures. However, they don't provide texture filtering and the other facilities usually available for other texture types; they represent formatted 1D data arrays rather than texture images. From some perspective, however, they are still textures that reside in global memory, so the access method is totally different from that of uniform buffers. This has both advantages and disadvantages.
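In the shader, a buffer texture is accessed through a samplerBuffer with raw integer indices; no filtering or sampler state is involved (the sampler name below is hypothetical):

```glsl
#version 140

uniform samplerBuffer boneData;  // hypothetical buffer texture

vec4 fetchRow(int index) {
    // texelFetch addresses the buffer texture by texel index;
    // there is no filtering, mipmapping or wrapping.
    return texelFetch(boneData, index);
}
```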

First, global texture memory access means texture fetching, which involves the use of a texture unit and possibly requires several clock cycles to complete. Still, thanks to the latency hiding mechanisms inside today's commodity GPUs, this can sometimes be as cheap as accessing uniform buffers. This part of the story is implementation dependent and up to the hardware vendor. However, as stated in their programming guides, both AMD and NVIDIA have such latency hiding facilities, and they also suggest that one should not expect a huge performance impact when using texture buffers.

On the other hand, texture memory access provides a huge benefit compared to uniform buffers: textures are designed for scattered accesses and are thus more capable of dealing with random memory access patterns. As the AMD HD2000 series programming guide says, if a certain set of data is accessed in a very random fashion, it may even be faster to use texture fetches than indexed uniform access.

So even though texture buffers can be used in the same scenarios as uniform buffers, the performance of either depends much more on the actual shader implementation than on the hardware implementation of the features.

Besides the aforementioned use cases, texture buffers can be used in more advanced techniques like instanced skeletal animation or even for implementing geometry tessellation, though I'm not convinced the latter has any practical use, as it involves tricks that don't perform well on current hardware. Personally, I use texture buffers for various geometry deformation techniques, to resolve batching issues when the size limitation of uniform buffers is a blocking factor, and for some inverse kinematics effects.

Conclusion

It is now up to you to draw your own conclusion based on the information presented here, but I recommend reading the mentioned programming guides for a more accurate description of both methods. My personal conclusion is that there is no ultimate choice, as the two buffer types serve different purposes. Even though their possible use cases overlap, there are plenty of rendering techniques that would take advantage of the benefits of one but would suffer from the disadvantages of the other.

For further details on the topic, please refer to the OpenGL extension registry and the vendor-supplied GPU programming guides.

Thanks for this article, it's a godsend to me.
I just finished a small GPU ray tracing experiment that I'm working on for my master's thesis, and although I'm using textures to feed the shader with the scene data, I was considering using an alternative method to pass the data, like uniforms.
I was unsure of the benefits of this, but thanks to you I now know the pros and cons of each method.

So does that imply that when a UBO is bound, it might be _copied_ onto the local data share caches on chip?

Reading the R700 docs, it looks like ATI chips can source ‘constants’ from _either_ VRAM via a pointer or a direct (but _very_ small) constant file. Of course, direct fetches from VRAM are always possible.

I bring this up because I’m wondering about the cost of using UBO values vs the cost of _changing_ which UBO is bound. If the GPU is using UBOs “by ptr” a bind has the potential to be cheap, but if it’s using them “by caching” then we pay for the cache upload at batch start.

Well, that’s a pretty interesting question. I think this is something that only the AMD guys can tell, but I believe UBO binding should be cheap even if caching is used. At least its cost should be very minor compared to the performance gained thanks to the on-chip constant store.

“however, I personally prefer using normal 2D textures for this purpose to take advantage of the free interpolation thanks to the dedicated texture fetching units but that’s another story.”

—do you mind sharing the story with us? I am just looking for a way to transfer the bone matrices to the shader. If a 2D texture can do it, how should I prepare/organize the texture's data for better performance? I'm not sure what "dedicated texture fetching units" means; is it a new capability of a specific vendor or …? Thank you.

When I said “we can take advantage of the free interpolation thanks to the dedicated texture fetching units” I meant that the texture fetching units of GPUs perform linear filtering (interpolation) on the texels for free.
As in the case of animation you usually interpolate matrices, you can reduce the number of necessary texture fetches by using a 2D texture instead of a texture buffer, for example.
Just put the first row of a transformation matrix in the RGBA components of a float texture texel, then the next row in the texel below that, and so on. Then put the data of the next transformation matrix in the next texel column. This way, if you want to interpolate between two matrices, you need only 4 texture fetches (one for each texel row), using a U (column) coordinate that causes the rows of the two matrices to be linearly interpolated thanks to the free bilinear filtering provided by the hardware. With a texture buffer, you can achieve the same thing only with 8 texture fetches and a manual interpolation in the shader. While the latter may be more accurate (because the bilinear filtering done by the texture fetching units is less accurate), in practice the filtered result may prove acceptable and you can potentially get a 100% speedup.
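The arithmetic behind the trick can be checked on the CPU: one filtered sample taken between two texel columns yields exactly the component-wise lerp of the two columns. A plain C sketch (no GL involved; the values are arbitrary):

```c
/* What the texture unit returns for a U coordinate falling a fraction
   `t` of the way between two adjacent texel centers (linear filtering). */
float filtered_fetch(float texelA, float texelB, float t) {
    return texelA * (1.0f - t) + texelB * t;
}

/* What a shader computes from two unfiltered fetches plus a manual mix(). */
float manual_lerp(float texelA, float texelB, float t) {
    return texelA + (texelB - texelA) * t;
}
```

For any t in [0, 1] the two functions agree, which is why four filtered fetches can replace eight unfiltered fetches plus a shader-side mix, modulo the reduced precision of the fixed-function filter.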

Thank you very much for the quick reply, and I really appreciate the answer. One more thing I'm unsure about is whether the matrices can be interpolated safely without something strange happening. I used to think the interpolation should be done on the CPU side, with the rotation quaternions slerped and the position vectors linearly interpolated, and the bone matrix regenerated from those?
Besides, can the texture have just three rows, since bone matrices generally only need their 3x4 sub-matrices of values? Thank you.

Yes, you are right, quaternions avoid several artifacts that may appear when you lerp the bone matrices; however, if you have enough key matrices then the artifacts are not really visible. But nobody stops you from simply storing the quaternions in the texture, you can still take advantage of the linear filtering to lerp them.
Also, for bone matrices you usually need only three rows, not all four, so you are right, but the theory still stands: you can halve the number of fetches, in this case from 6 to 3. I just didn't want to confuse you, so I simply talked about 4x4 matrices.
So the conclusion should be that no matter whether you interpolate quaternions or matrices, you can theoretically speed things up by 100% by using the free linear filtering, of course with the trade-off of limited precision.

First, I doubt quaternions can be interpolated using the HW texture lerp… (slerp(q1, q2, t) = q1 (q1⁻¹ q2)^t is not a linear operation.)
Secondly, I have implemented instancing with both UBOs and TBOs, with both matrices and quaternions, using a baked animation with 12 keyframes (to fit in a UBO)… Performance is about the same… :/ I noticed a little improvement using quaternions on the Fermi architecture, but that's all…
Concerning TBO vs UBO:
It seems that the local memory access performance of UBOs is compensated for by the texture cache with TBOs…
If you've got more information about all that stuff I would be glad…

The fact that you got the same performance with both UBO and TBO in your case might indicate that your bottleneck was somewhere else. They do have different performance characteristics: the constant cache and the texture cache are slightly different, and the types of accesses these two caches prefer differ as well.

Not to mention that addressing of UBOs is always done with so-called dynamically uniform values, so all shader invocations get the same value. This allows for optimizations in the hardware. The same does not apply to texture accesses, as there the address can diverge, so more bandwidth is required between the texture cache and the shader cores than between the constant cache and the shader cores, even if you use the same address for the texture lookup in all shader instances.

You can see that for the use case of instance data they should perform the same: you use uniform (non-constant) indices to access the UBO, i.e. 4 bytes/cycle, and you probably hit the L1 cache 99.99% of the time when accessing the TBO, i.e. again 4 bytes/cycle. Thus you might be right that they can perform the same in this particular scenario, but it all depends on the use case.
