Multi-Draw-Indirect is here
You might remember that I wrote an article about my suggestions for OpenGL 4.2 and beyond. One of the features I recommended was a then non-existent extension I called GL_ARB_draw_indirect2, which proposed new draw commands similar in fashion to the ancient MultiDraw* commands, but built on top of the indirect drawing mechanism introduced by the GL_ARB_draw_indirect extension and OpenGL 4.0. I contacted both AMD and NVIDIA with the idea, with different levels of success; AMD saw the potential in the functionality and actually implemented it in the form of GL_AMD_multi_draw_indirect, well, at least partially…
First of all, let’s recap what exactly GL_ARB_draw_indirect brought us:
This extension provides a mechanism for supplying the arguments to a DrawArraysInstanced or DrawElementsInstancedBaseVertex from buffer object memory. This is not particularly useful for applications where the CPU knows the values of the arguments beforehand, but is helpful when the values will be generated on the GPU through any mechanism that can write to a buffer object including image stores, atomic counters, or compute interop. This allows the GPU to consume these arguments without a round-trip to the CPU or the expensive synchronization that would involve. This is similar to the DrawTransformFeedbackEXT command from EXT_transform_feedback2, but offers much more flexibility in both generating the arguments and in the type of Draws that can be accomplished.
If you know my Nature or Mountains demo, you know that I have dug deeply into the domain of GPU-based culling algorithms. In these algorithms, the GPU consumes the scene data, performs visibility determination over a list of objects, and writes the culled data into a buffer object. The problem is that the algorithms I implemented in the aforementioned demo applications work only for instanced objects. To make such algorithms work efficiently with arbitrary object sets, we still need a number of new features (some of which may even require newer GPU generations). The most important ones are discussed in detail in the following sections.
Atomic counters
This feature enables us to use the global atomic counters present on the GPU, which, at least on the AMD implementation, have dedicated hardware that provides efficient chip-wide access from any shader. This can be expected in the near future in the form of the not-yet-published GL_ARB_shader_atomic_counter extension. The extension also provides a way to back the atomic counter values with buffer object memory.
The currently available GPU-based culling algorithms, including those presented in my demos, work around the lack of this feature by using transform feedback to capture the culled data, as each transform feedback output stream has an implicit atomic counter associated with it. However, this has a few drawbacks. First of all, transform feedback is not as efficient as using atomic counters together with the random memory read/write mechanism exposed by the GL_EXT_shader_image_load_store extension, because, by its nature, transform feedback (and thus the geometry shader feeding it) has to preserve the original order of the incoming primitives. This is also why the first GPU generation with geometry shader support had so many performance problems: the use of geometry shaders easily became the bottleneck of the rendering. Besides the performance benefits of having dedicated atomic counters, there are plenty of other reasons to want them, like the ability to implement an append/consume buffer, if I'm allowed to use the D3D terminology.
It may seem that I went a bit off-topic, but think about how nicely atomic counters can interact with indirect drawing. The indirect draw commands have an instance count field: what if we bind that very buffer address as the backing memory for an atomic counter? We could then save the costly asynchronous query that we otherwise need to get the number of visible objects when applying an ICR or Hi-Z map based occlusion culling. You may say that you can achieve the same thing with the atomic read/write operations provided by GL_EXT_shader_image_load_store. That's true, provided the additional cost of atomic memory writes is acceptable (atomic counters are much, much faster, although in a GPU-based culling algorithm those few writes shouldn't be the bottleneck anyway). But now let's think more deeply about the problem. If we can use atomic operations to count the instances within an indirect draw command stored in the buffer object, what if we also count the number of draw commands written into the indirect draw buffer using an atomic counter? And there we are: we have the first building block of a GPU-based culling algorithm that can handle arbitrary data sets.
Multi-Draw-Indirect phase 1
Now let's say we somehow managed to generate an indirect draw buffer object containing the list of instanced draw command arguments necessary to render the visible objects, no matter whether we used the OpenGL toolset as in my demos or a compute API like OpenCL. Then we somehow have to initiate the drawing. We can do this by issuing several DrawArraysIndirect or DrawElementsIndirect commands, one for each instanced draw command argument record we've generated.
But what if we could do this with a single command? This is where GL_AMD_multi_draw_indirect comes into the picture, and that's what AMD implemented for us. We can do exactly this by using one of the MultiDraw*Indirect commands introduced by the extension.
The best thing about it is that, even in the absence of hardware support, the driver can implement it by simply looping over the appropriate Draw*Indirect commands, so every piece of hardware that supports GL_ARB_draw_indirect can support GL_AMD_multi_draw_indirect; and when the hardware actually supports the functionality natively, we get a slight performance increase for free.
Multi-Draw-Indirect phase 2
While the new extension adds quite some flexibility to the existing indirect drawing mechanism, it still lacks an important feature needed to become the Holy Grail of GPU-based culling and scene management algorithms: we still have to perform an asynchronous query, or otherwise determine on the CPU, the number of records written into the indirect draw buffer.
Of course, we can alleviate the problem by always initializing the indirect draw buffer with zero values (so that issuing an indirect draw command with any untouched record in the buffer renders nothing) and then simply calling a MultiDraw*Indirect command with a primcount argument equal to the theoretical maximum number of generated records. However, this might result in a performance decrease, especially if this theoretical maximum is much bigger than the actual number of draw commands present in the buffer.
In order to circumvent this problem, we need some mechanism that allows us to source the primcount argument of the MultiDraw*Indirect commands from buffer object memory as well. While such functionality is not yet exposed by any of the major graphics APIs (and may not be supported by current hardware), it could be the next major step towards a fully self-feeding renderer that handles graphics-related data at a much higher level than triangles and pixels.
While the indirect drawing mechanism introduced with OpenGL 4.0 is just a small part of the feature set introduced by Shader Model 5.0 GPUs, it still has a lot of room for improvement and evolution ahead. AMD made the first step with GL_AMD_multi_draw_indirect, and I really hope that indirect drawing and other GPU self-feeding mechanisms will gain more developer attention in the near future.
Finally, I would like to thank Graham Sellers, the creator of the extension, Pierre Bourdier for his support in promoting the new functionality, and all the engineers at AMD who contributed to the specification and the implementation work behind it. I'm really glad to see that they listen to developers when deciding in which direction to improve their OpenGL support.
This entry was posted by Daniel Rákos on June 19, 2011, and is filed under Graphics, Programming.