Uniform Buffers VS Texture Buffers
OpenGL 3.1 introduced two new sources from which shaders can retrieve their data: uniform buffers and texture buffers. They can accelerate rendering when shaders rely heavily on application-provided data, as in skeletal animation, especially when combined with geometry instancing. However, even though the functionality has been in the core specification for about a year now, there are few demos out there showing their usage, so there is considerable confusion about when to use them and which one is more suitable for a particular use case.
Both AMD and NVIDIA have updated their GPU programming guides to cover the latest facilities provided by OpenGL and DirectX. Still, I see that people don’t really understand how these features work, and that prevents them from taking advantage of them effectively.
Once, on an online forum, I found somebody asking why the Khronos Group introduced this confusion at all, why there is no general buffer type to use instead, and why the choice between uniform and texture buffers is not left to the driver. That post motivated me to write this article.
Such an abstraction may look convenient for the application. However, one should never forget that OpenGL is just a thin layer on top of graphics hardware, and as such it should not hide details that, in the hands of a good programmer, can provide performance benefits.
When used as input to shaders, both uniform buffers and texture buffers have strengths and weaknesses that are known to application developers, especially given the detailed descriptions of each in the vendors’ GPU programming guides. It would be very difficult, if not impossible, for the driver to decide which buffer type to use based on shader source code alone, and it would leave the programmer with less flexibility.
To decide which of the two to use for a particular purpose, the developer must examine the characteristics of both and choose accordingly. To ease this decision I will present the most important features of each. I will also talk about what I’ve used them for and what results I’ve achieved.
Uniform Buffers
Maximum size: 64KByte (or more)
Memory access pattern: coherent access
Memory storage: usually local memory
Use case examples: geometry instancing, skeletal animation, etc.
Uniform buffers were introduced in OpenGL 3.1, but they are also available on driver implementations that don’t conform to version 3.1 of the standard via the GL_ARB_uniform_buffer_object extension. As the specification says, uniform buffers provide a way to group GLSL uniforms into so-called “uniform blocks” and source their data from buffer objects, giving the application a more streamlined way to update them.
As uniform buffers are relatively small, they can easily fit in local memory. This makes data access very fast and thus provides optimal performance whenever the size constraints don’t prevent the application developer from using them. However, vendors also state that uniform buffers prefer a sequential memory access pattern. This means that they perform best when accesses to the data in the uniform buffer are relatively local. It does not necessarily mean that the sequential reads must occur within one shader execution: as in the case of geometry instancing, subsequent shader executions can together produce the desired access pattern.
Personally, I use them for instanced rendering by storing the model-view matrix and related information of every instance in a common uniform buffer and using the instance ID as an index into this combined data structure. This usage performs very well on my system.
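To make the instancing layout above more concrete, here is a minimal CPU-side sketch in Python. It only illustrates how per-instance matrices could be packed back to back and then looked up by instance ID; it makes no OpenGL calls. The helper names (translation, pack_instances, fetch_instance) are hypothetical, and the sketch assumes one column-major 4x4 float matrix per instance, which gives the 64-byte stride that an array of mat4 has in the std140 layout.

```python
import struct

MAT4_BYTES = 16 * 4  # one 4x4 float matrix: 16 floats of 4 bytes each


def translation(x, y, z):
    """Column-major 4x4 translation matrix as a flat list of 16 floats."""
    return [1.0, 0.0, 0.0, 0.0,
            0.0, 1.0, 0.0, 0.0,
            0.0, 0.0, 1.0, 0.0,
            x, y, z, 1.0]


def pack_instances(matrices):
    """Concatenate the matrices into one byte string, instance 0 first.

    This is the data an application would upload to the buffer object
    backing the uniform block (hypothetical helper, not an OpenGL call).
    """
    return b"".join(struct.pack("<16f", *m) for m in matrices)


def fetch_instance(buffer, instance_id):
    """Mimic the shader-side lookup: index the matrix array by instance ID."""
    offset = instance_id * MAT4_BYTES
    return list(struct.unpack_from("<16f", buffer, offset))


# Four instances, each shifted one unit further along X.
instances = [translation(float(i), 0.0, 0.0) for i in range(4)]
ubo_data = pack_instances(instances)
```

In a real shader the lookup would be an array access inside the uniform block indexed by the built-in instance ID; fetch_instance only imitates that addressing on the CPU.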
Uniform buffers can also store bone matrices for implementing skeletal animation. However, I personally prefer using normal 2D textures for this purpose to take advantage of the free interpolation provided by the dedicated texture fetching units, but that’s another story.
Uniform buffers can also be used for other rendering techniques like skinned instancing or geometry deformation, but the buffer size limitation may rule out such uses.
Texture Buffers
Maximum size: 128MByte (or more)
Memory access pattern: random access
Memory storage: global texture memory
Use case examples: skinned instancing, geometry tessellation, etc.
Texture buffers also became core OpenGL in version 3.1 of the specification, but they are also available via the GL_ARB_texture_buffer_object extension (or via the GL_EXT_texture_buffer_object extension on earlier implementations). Buffer textures are one-dimensional arrays of texels whose storage comes from an attached buffer object.
They offer the largest capacity for raw data access, far more than equivalent 1D textures. However, they don’t provide texture filtering and other facilities that are usually available for other texture types. They represent formatted 1D data arrays rather than texture images. Still, they are textures that reside in global memory, so the access method is completely different from that of uniform buffers. This has both advantages and disadvantages.
First, global texture memory access means texture fetching, which involves a texture unit and may take several clock cycles to complete. However, thanks to the latency hiding mechanisms in today’s commodity GPUs, this can sometimes be as cheap as accessing uniform buffers. This part of the story is implementation dependent and is up to the hardware vendor. As stated in their programming guides, both AMD and NVIDIA have such latency hiding facilities, and they also suggest that one should not expect a huge performance impact when using texture buffers.
Texture memory access also has a major benefit compared to uniform buffers: textures tolerate scattered accesses better and are thus more capable of dealing with random memory access. As the AMD HD2000 series programming guide says, if a certain set of data is accessed in a very random fashion, it may even be faster to use texture fetches than indexed uniform access.
So although texture buffers can be used in the same use case scenarios as uniform buffers, the performance of either depends much more on the actual shader implementation than on the hardware implementation of the features.
Besides the aforementioned use cases, texture buffers can be used in more advanced techniques like instanced skeletal animation or even geometry tessellation. I’m not convinced that the latter has any practical use, though, as it involves tricks that don’t perform well on current hardware. Personally, I use texture buffers for different geometry deformation techniques, to resolve batching issues when the size limitation of uniform buffers is a blocking factor, and for some inverse kinematics effects.
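The batching issue mentioned above comes down to simple arithmetic. The following Python sketch estimates how many draw batches a given number of instances would need under the typical buffer sizes listed in this article. The 64-byte per-instance footprint (one 4x4 float matrix) is an assumption for illustration, as are the function names; actual limits should be queried from the implementation.

```python
import math

UBO_BYTES = 64 * 1024          # typical uniform buffer size quoted in the article
TBO_BYTES = 128 * 1024 * 1024  # typical texture buffer size quoted in the article
BYTES_PER_INSTANCE = 64        # assumption: one 4x4 float matrix per instance


def instances_per_buffer(buffer_bytes, bytes_per_instance=BYTES_PER_INSTANCE):
    """How many instances fit into a single buffer of the given size."""
    return buffer_bytes // bytes_per_instance


def batches_needed(instance_count, buffer_bytes,
                   bytes_per_instance=BYTES_PER_INSTANCE):
    """Number of draw batches needed if each batch reads from one buffer."""
    capacity = instances_per_buffer(buffer_bytes, bytes_per_instance)
    return math.ceil(instance_count / capacity)


# Compare both buffer types for a scene with 10,000 instances.
ubo_batches = batches_needed(10_000, UBO_BYTES)
tbo_batches = batches_needed(10_000, TBO_BYTES)
```

Under these assumptions a uniform buffer holds 1024 instances, so large instance counts must be split into several batches, while a texture buffer of the quoted size takes them all at once.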
Conclusion
From here, it’s up to you to draw a conclusion based on the information presented, but I recommend reading the mentioned programming guides for a more accurate presentation of both methods. My personal conclusion is that there is no ultimate choice, as the two buffer types serve different purposes. Even if their possible use cases overlap, there are plenty of rendering techniques that would take advantage of the benefits of one but would suffer from the disadvantages of the other.
For further details on the topic, please refer to the OpenGL extension registry and the vendor supplied GPU programming guides:
- AMD’s ATI Radeon HD2000 programming guide: http://developer.amd.com/media/gpu_assets/ATI_Radeon_HD_2000_programming_guide.pdf
- NVIDIA’s G80 GPU programming guide: http://developer.download.nvidia.com/GPU_Programming_Guide/GPU_Programming_Guide_G80.pdf
Posted by Daniel Rákos on January 18, 2010 at 10:18 pm. Filed under Graphics, Programming.

about 3 years ago
Thanks for this article, it’s a godsend to me.
I just finished a small GPU raytracing experiment that I’m working on for my master’s thesis, and although I’m using textures to feed the shader with the scene data, I was considering using an alternative method to pass the data, like uniforms.
I was unsure of the benefits of this, but thanks to you I now know all the pros and cons of each method.
Once again, thank you very much!
about 2 years ago
Thank you for this informative post. Most valuable! I was having trouble finding out the differences.
about 2 years ago
So does that imply that when a UBO is bound, it might be _copied_ onto the local data share caches on chip?
Reading the R700 docs, it looks like ATI chips can source ‘constants’ from _either_ VRAM by a ptr or a direct (but _very_ small) constant file. Of course, direct fetches from VRAM are always possible.
I bring this up because I’m wondering about the cost of using UBO values vs the cost of _changing_ which UBO is bound. If the GPU is using UBOs “by ptr” a bind has the potential to be cheap, but if it’s using them “by caching” then we pay for the cache upload at batch start.
about 2 years ago
Well, that’s a pretty interesting question. I think this is something that only AMD guys can tell, but I think UBO binding should be cheap if caching is used as well. At least its cost should be very minor compared to the additional performance gain thanks to the on-chip constant store.
about 1 year ago
“however, I personally prefer using normal 2D textures for this purpose to take advantage of the free interpolation thanks to the dedicated texture fetching units but that’s another story.”
Do you mind sharing the story with us? I am just looking for a way to pass the bones’ matrices to the shader. If a 2D texture can do it, how should I prepare/organize the texture’s data for better performance? I’m not sure what “dedicated texture fetching units” means; is it a new capability of a specific vendor or …? Thank you.
about 1 year ago
When I said “we can take advantage of the free interpolation thanks to the dedicated texture fetching units” I meant that the texture fetching units of GPUs perform linear filtering (interpolation) on the texels for free.
Since in animation you usually interpolate matrices, you can reduce the number of necessary texture fetches by using, for example, a 2D texture instead of a texture buffer.
Just put the first row of a transformation matrix in the RGBA components of a texel of a float texture, then the next row in the texel below it, and so on. Then put the next transformation matrix in the next texel column. This way, if you want to interpolate between two matrices, you need only 4 texture fetches (one for each texel row), using a U (column) coordinate that causes the rows of the two matrices to be linearly interpolated by the free bilinear filtering of the hardware. With a texture buffer, you can achieve the same thing only with 8 texture fetches and a manual interpolation in the shader. While the latter may be more accurate (because the filtering done by the texture fetching units has limited precision), in practice the filtered result may prove acceptable, and you can potentially get a 100% speedup.
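The layout described above can be imitated on the CPU to see where the 4 versus 8 fetches come from. The Python sketch below is only a model, not GPU code: the MatrixTexture class and its methods are hypothetical, filtering is simulated along U only with clamp-to-edge addressing, it assumes at least two matrices are stored, and it uses full floating-point precision, which real filtering hardware may not.

```python
import math


class MatrixTexture:
    """Model of a float RGBA texture: one matrix per column, one row per texel."""

    def __init__(self, matrices):
        # matrices: list of at least two 4x4 matrices, each 4 rows of 4 floats.
        self.columns = matrices
        self.width = len(matrices)
        self.fetches = 0  # counts simulated texture fetches

    def texel(self, column, row):
        """Unfiltered read of one texel, like a texture buffer fetch."""
        self.fetches += 1
        return self.columns[column][row]

    def sample_linear(self, u, row):
        """One filtered fetch that blends two neighbouring columns along U."""
        self.fetches += 1
        x = u * self.width - 0.5                    # texel centers sit at i + 0.5
        x = min(max(x, 0.0), self.width - 1.0)      # clamp to edge
        left = min(int(math.floor(x)), self.width - 2)
        weight = x - left
        a, b = self.columns[left][row], self.columns[left + 1][row]
        return [(1.0 - weight) * p + weight * q for p, q in zip(a, b)]


def blend_filtered(tex, k, t):
    """Blend matrices k and k+1 with 4 filtered fetches (one per row)."""
    u = (k + 0.5 + t) / tex.width
    return [tex.sample_linear(u, row) for row in range(4)]


def blend_manual(tex, k, t):
    """Same blend with 8 unfiltered fetches and a manual lerp."""
    rows = []
    for row in range(4):
        a, b = tex.texel(k, row), tex.texel(k + 1, row)
        rows.append([(1.0 - t) * p + t * q for p, q in zip(a, b)])
    return rows


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
scaled = [[2.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

filtered_tex = MatrixTexture([identity, scaled])
filtered = blend_filtered(filtered_tex, 0, 0.25)

manual_tex = MatrixTexture([identity, scaled])
manual = blend_manual(manual_tex, 0, 0.25)
```

Both paths produce the same blended matrix in this model; the difference is only in the fetch counters, which is where the potential speedup comes from.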
about 1 year ago
Thank you very much for the quick reply, I really appreciate the answer. One more thing I am unsure about is whether the matrices can be interpolated safely without something strange happening. I used to think the interpolation should be done on the CPU side, with the rotation quaternions slerped and the position vectors linearly interpolated, and the bone’s matrix regenerated from those?
Besides, can the texture have just three rows, since bone matrices generally only need their 3×4 sub-matrix? Thank you.
about 1 year ago
Yes, you are right, quaternions avoid several artifacts that may appear in case you lerp the bone matrices, however, if you have enough key matrices then the artifacts are not really visible. But nobody stops you from storing simply the quaternions in the texture, you can still take advantage of the linear filtering to lerp them.
Also, for bone matrices you usually need only three rows, not all four, so you are right. The theory still stands, though: you can halve the number of fetches, in that case from 6 to 3. I just didn’t want to confuse you, so I simply talked about 4×4 matrices.
So the conclusion should be that no matter if you interpolate quaternions or matrices, you can speed up things theoretically by 100% by using the free linear filtering, of course, with the trade-off of limited precision.
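To get a feeling for the precision trade-off when quaternions are filtered linearly, here is a small Python sketch. It compares a plain component-wise lerp (what a linear filter would return), the same result renormalized (often called nlerp, which is an extra step assumed here and not something the hardware does), and a reference slerp. All function names are hypothetical, quaternions are stored as (x, y, z, w), and the two keys are assumed to have a non-negative dot product.

```python
import math


def lerp(q1, q2, t):
    """Component-wise linear blend, as a linear texture filter would compute."""
    return [(1.0 - t) * a + t * b for a, b in zip(q1, q2)]


def length(q):
    return math.sqrt(sum(c * c for c in q))


def normalize(q):
    n = length(q)
    return [c / n for c in q]


def slerp(q1, q2, t):
    """Reference spherical interpolation (assumes dot(q1, q2) >= 0)."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(q1, q2))))
    theta = math.acos(dot)
    if theta < 1e-6:
        # Keys are nearly identical; fall back to the normalized lerp.
        return normalize(lerp(q1, q2, t))
    s = math.sin(theta)
    w1 = math.sin((1.0 - t) * theta) / s
    w2 = math.sin(t * theta) / s
    return [w1 * a + w2 * b for a, b in zip(q1, q2)]


def rotation_z(angle):
    """Unit quaternion (x, y, z, w) for a rotation about the Z axis."""
    return [0.0, 0.0, math.sin(angle / 2.0), math.cos(angle / 2.0)]


# Two keyframes 30 degrees apart, sampled a quarter of the way between them.
key_a = rotation_z(0.0)
key_b = rotation_z(math.radians(30.0))

filtered = lerp(key_a, key_b, 0.25)   # not unit length any more
nlerped = normalize(filtered)         # renormalized in the "shader"
reference = slerp(key_a, key_b, 0.25)
max_error = max(abs(a - b) for a, b in zip(nlerped, reference))
```

With closely spaced keys like these, the renormalized result stays close to the slerp reference; the gap grows as the angle between keys grows, which matches the remark that enough key frames keep the artifacts small.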
about 4 months ago
First, I doubt quaternions can be interpolated using the hardware texture lerp… (slerp(q1, q2, t) = q1 (q1⁻¹ q2)^t is not a linear operation.)
Secondly, I have implemented instancing with both UBO and TBO, and with both matrices and quaternions, using a baked animation with 12 keyframes (to fit in a UBO)… Performance is about the same… :/ I noticed a small improvement using quaternions on the Fermi architecture, but that’s all…
Concerning TBO vs UBO:
It seems that the local memory access performance of the UBO is matched by the texture cache with the TBO…
If you’ve got more information about all this, I would be glad to hear it…
about 4 months ago
Moreover, I think that quaternion and matrix interpolation give the same result… If you disagree, I would like to know why.
about 4 months ago
The fact that you got the same performance in your case with both UBO and TBO might indicate that you had your bottleneck somewhere else. They do have different performance characteristics. The constant cache and the texture cache are slightly different. The types of accesses these two caches prefer differs as well.
Not to mention that UBOs are always addressed with so-called dynamically uniform values, so all shader invocations get the same value. This allows for optimizations in the hardware. The same does not apply to texture accesses, as there the address can diverge, so more bandwidth is required between the texture cache and the shader cores than is needed between the constant cache and the shader cores, even if you use the same address for the texture lookup in all shader instances.
about 4 months ago
Also, you might find my other article about Evergreen (HD5000 series) memory access performance interesting: http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/
There you can see that uniform access varies from 16 bytes/cycle (constant indices) to 4 bytes/cycle (uniform indices); varying indices aren’t supported through OpenGL’s UBOs. Texture access, in turn, varies from 4 bytes/cycle (L1 cached) to 0.6 bytes/cycle (global memory access).
For instance data, the two should perform the same: you use uniform (non-constant) indices to access the UBO, i.e. 4 bytes/cycle, and you probably hit the L1 cache in 99.99% of the cases when accessing the TBO, i.e. again 4 bytes/cycle. Thus you might be right that they perform the same in this particular scenario, but it all depends on the use case.