Texture and buffer access performance
Currently there are several ways to feed data to the GPU, regardless of which API we use and what type of application we develop. In the case of OpenGL we have uniform buffers, texture buffers, texture images, etc. The same is true for OpenCL and other compute APIs, which provide even more fine-grained memory management by taking advantage of the local data store (LDS) available on today’s hardware. In this article I’ll present the memory access performance characteristics of AMD’s Evergreen-class GPUs, focusing on what all this means from an OpenGL point of view. While most of the data is about the HD5870, the general principles and relative performance characteristics are valid for other GPUs, including ones from other vendors.
Traditional CPU-based applications don’t have to worry too much about where they put their data, as they have a simple set of possibilities: registers and global memory (accessed through a series of linear caches called L1, L2 and, on newer architectures, L3). While even this can be quite cumbersome to utilize efficiently, GPU-based algorithms need even more investigation as the underlying architecture uses a more complex multi-level memory design.
Typical questions an OpenGL graphics developer could ask nowadays are:
- Where should I put my per-object data?
- From where should I source animation data?
- Should I use uniform buffers, texture buffers or vertex buffers for my per-instance data?
- What does it mean from a performance point of view if I use read-write buffers or textures?
Of course, the list could go on, and answering the individual questions is not easy and often requires performance measurements to confirm our suspicions. Instead of trying to answer all these questions, it is easier to take a look at the actual hardware performance characteristics and solve the individual issues based on that.
I’ve already touched on this topic in the past in the article Uniform Buffers VS Texture Buffers, where I presented the key differences between the two data access methods and a few examples of when to use one or the other. In this article I’ll go further and try to provide more accurate data about how various memory access methods perform in practice.
Earlier, there was little to no detailed information available about the actual performance of API-level memory access methods, but fortunately the increasing popularity of OpenCL prompted vendors to provide more technical details about the architecture and performance of their products to enable software developers to fully leverage the power of today’s GPUs. While these documents focus on OpenCL or other compute APIs, most of the data applies indirectly to OpenGL as well.
The Evergreen architecture
In order to be able to provide some actual performance data, I’ve selected AMD’s Evergreen architecture and the Radeon HD5870 as the reference hardware. Note that most of the presented details roughly apply to all other modern GPUs, including NVIDIA’s Fermi architecture. Whenever there is a clear difference between the two, I’ll try to point it out. However, I cannot be 100% sure what these differences are, as AMD’s OpenCL programming guide is somewhat more talkative about actual performance details than NVIDIA’s.
From the OpenCL platform model’s point of view, the Radeon HD5870 is structured in the following way:
- Total of 20 compute units.
- Each compute unit consists of 16 stream cores.
- Each stream core consists of 5 processing elements (4 traditional, 1 transcendental).
This sums up to a total of 1600 processing elements on the Radeon HD5870.
The basic OpenCL architecture applies in the same way to NVIDIA GPUs; however, there are differences between AMD’s and NVIDIA’s GPU architectures. AMD has used a special super-scalar architecture since their HD2000 series that allows each stream core to execute up to 5 separate instructions at once.
What this already reveals from an OpenGL point of view is that AMD’s architecture groups 16 stream cores together, so fragment shaders are most probably running on 4×4 tiles of fragments in lockstep. This is important to keep in mind when using heavy dynamic branching in shaders: if the branch selection is not coherent across such a fragment neighborhood, performance can drop because the hardware masks out those processing elements that did not select the branch currently being executed.
Also, it is important to note that usually only one out of four or five processing elements (depending on hardware generation and vendor) is capable of executing transcendental instructions such as logarithms, exponentials or trigonometric functions.
Memory capacity and performance
AMD is very clear about the memory capacity and performance details in their OpenCL programming guide. The figure below showcases these hardware characteristics of the Radeon HD5870:
| OpenCL Memory Type | Hardware Resource | Size/CU | Size/GPU | Peak Read Bandwidth / Stream Core |
|---|---|---|---|---|
| Constant | Direct-addressed constant | – | 48KB | 16 bytes/cycle |
| | Same-indexed constant | – | – | 4 bytes/cycle |
| | Varying-indexed constant | – | – | ~0.6 bytes/cycle |
| Images | L1 Cache | 8KB | 160KB | 4 bytes/cycle |
| | L2 Cache | – | 512KB | ~1.6 bytes/cycle |
| Global | Global Memory | – | 1GB | ~0.6 bytes/cycle |
- GPRs – General Purpose Registers
- LDS – Local Data Store
- Direct-addressed constant – a constant accessed using a constant address.
- Same-indexed constant – a varying-indexed constant where each processing element accesses the same index.
- Varying-indexed constant – a varying-indexed constant where the processing elements access different indices.
Of course, these figures assume properly aligned fetches. In the case of unaligned data access the actual throughput can be much lower. In order to reach peak bandwidth we usually have to align our data to multiples of 4, 8 or 16 bytes (depending on the actual hardware).
As can be seen, constant storage falls into three different access performance categories, and so do buffers and images. While the actual numbers differ between platforms, the guidelines apply to most modern GPUs: choose a particular addressing method wisely and take access locality into consideration in order to get optimal performance.
These numbers are no different in OpenGL terminology either: just replace the word “constant” with uniform buffer and think of images and global data as texture images or buffer objects. The only exception is that there is no direct alternative to local memory in OpenGL.
An additional thing to consider on Shader Model 5.0 hardware is read-write images and buffers. AMD refers to the two memory access methods as FastPath and CompletePath. In the case of read-only textures or buffers the GPU uses the FastPath, which is able to take full advantage of the L2 cache, while read-write textures and buffers usually use the so-called CompletePath, which sacrifices the advantages of the L2 cache to enable the use of atomic operations on global memory objects. This, of course, has a huge performance effect, reducing the throughput of the Radeon HD5870 roughly fivefold:
| Kernel | Effective Bandwidth | Ratio to Peak Bandwidth |
|---|---|---|
| copy 32-bit 1D FastPath | 96 GB/s | 63% |
| copy 32-bit 1D CompletePath | 18 GB/s | 12% |
Now that we’ve seen how the various OpenCL memory types perform in reality, let’s see how all this information translates to the OpenGL world. Here are my top ten recommendations about when and how to use the various data access possibilities present in modern OpenGL:
- Align your data to multiples of 16 bytes and fetch them accordingly.
- Use direct addressing of data in uniform buffers and try to avoid indexing into uniform buffers.
- If you must use indexing into uniform buffers, make sure that the indices are coherent across processing elements working in sync.
- If you heavily use indexed data, consider using texture buffers instead of uniform buffers to take advantage of the L1 and L2 caches.
- Texture and buffer caches are linear, so consider this when planning your access patterns.
- Bind textures and buffers in read-write mode only when it is really necessary; use regular texture bindings otherwise to ensure optimal performance.
- A single atomic buffer operation forces the shader to use the slow path, so use atomic operations wisely.
- Do not use atomic buffer operations to implement atomic counters; use the built-in hardware atomic counters instead, as they are much faster.
- Consider using dynamic branching to avoid costly memory operations as often as possible.
- Try to make your branch selection coherent across processing elements working in sync (e.g. 4×4 fragment tile in case of a fragment shader).
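As an example of the first and fourth recommendations combined, here is a sketch of how per-instance data could be packed for a texture buffer. The record layout (a 4×4 matrix plus a color, i.e. five vec4 slots) is purely hypothetical; the point is that every field starts on a 16-byte boundary, so each shader-side texelFetch maps to one aligned texel read.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-instance record packed into five 16-byte vec4
 * slots: slots 0..3 hold a 4x4 model matrix, slot 4 holds a color.
 * The resulting float array can be uploaded as-is with
 * glBufferData(GL_TEXTURE_BUFFER, ...) and exposed to the shader
 * through glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, buffer). */
enum { FLOATS_PER_VEC4 = 4, SLOTS_PER_INSTANCE = 5 };

/* Byte offset of an instance's first slot inside the packed buffer. */
static size_t instance_byte_offset(size_t instance)
{
    return instance * SLOTS_PER_INSTANCE * FLOATS_PER_VEC4 * sizeof(float);
}

/* Copy one instance's matrix and color into its five slots. */
static void pack_instance(float *buffer, size_t instance,
                          const float matrix[16], const float color[4])
{
    float *dst = buffer + instance * SLOTS_PER_INSTANCE * FLOATS_PER_VEC4;
    memcpy(dst,      matrix, 16 * sizeof(float)); /* slots 0..3 */
    memcpy(dst + 16, color,   4 * sizeof(float)); /* slot 4     */
}
```

In the shader, instance i’s color would then live at texel i * 5 + 4, fetched with a single texelFetch on a samplerBuffer.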
Note: This article may contain inaccurate data and some of the advice may not apply to other hardware platforms. I’ve written this article in the hope that it proves useful for some developers out there. For accurate details or more information, please contact your hardware vendor.
This entry was posted by Daniel Rákos on November 2, 2010 and is filed under Graphics, Multiprocessing, Programming.