DISCLAIMER: This article was migrated from the old blog thus may contain formatting and content differences compared to the original post. Additionally, it likely contains technical inaccuracies, opinions that I may no longer align with, and most certainly poor use of English (I was young and foolish :)). This article remains public for those who may find it useful despite its flaws.
You might remember that I wrote an article about my suggestions for OpenGL 4.2 and beyond. One of the features that I recommended to be added to OpenGL was a yet non-existent extension called GL_ARB_draw_indirect2 which suggested the addition of new draw commands that are similar in fashion to the ancient MultiDraw* commands but they are meant to build on top of the indirect drawing mechanism introduced by the GL_ARB_draw_indirect extension and OpenGL 4.0. I contacted both AMD and NVIDIA with my idea with different levels of success, but AMD saw the potential in the functionality and they actually implemented it in the form of GL_AMD_multi_draw_indirect, well at least partially…
First of all, let’s recap what exactly GL_ARB_draw_indirect brought us:
This extension provides a mechanism for supplying the arguments to a DrawArraysInstanced or DrawElementsInstancedBaseVertex from buffer object memory. This is not particularly useful for applications where the CPU knows the values of the arguments beforehand, but is helpful when the values will be generated on the GPU through any mechanism that can write to a buffer object including image stores, atomic counters, or compute interop. This allows the GPU to consume these arguments without a round-trip to the CPU or the expensive synchronization that would involve. This is similar to the DrawTransformFeedbackEXT command from EXT_transform_feedback2, but offers much more flexibility in both generating the arguments and in the type of Draws that can be accomplished.
If you know my Nature or Mountains demo you know that I have dug deeply into the domain of GPU based culling algorithms. In case of these algorithms, the GPU consumes the scene data and performs visibility determination over a list of objects and writes out the culled data into a buffer object. The problem is that those algorithms that I’ve implemented in the aforementioned demo applications work only for instanced objects. In order to make it possible for the algorithms to be able to efficiently work with arbitrary object sets we still need a lot of new features (some of them may even require newer GPU generations). The most important ones are discussed in detail in the following sections.
This feature enables us to use the global atomic counters present on the GPU, which have, at least on the AMD implementation, dedicated hardware to provide efficient chip-wide access to these counters from any shader. This can be expected in the near future in the form of the yet not published GL_ARB_shader_atomic_counter extension. The extension also provides a way to back up the atomic counter values in buffer object memory.
The currently available GPU based culling algorithms, including those presented in my demos, bypass the lack of this feature by using transform feedback to capture the culled data which has implicit atomic counters that are associated with each output stream. However, this has a few drawbacks. First of all, transform feedback is not as efficient if one would use atomic counters together with the random memory read/write mechanism exposed by the GL_EXT_shader_image_load_store extension. This is because of its nature, geometry shaders and thus transform feedback has to preserve the original order of the incoming primitives. This is why the first GPU generation with geometry shader support had so much performance problems as the use of geometry shaders easily became the bottleneck of the rendering. Besides the performance benefits of having our own atomic counters, there are a lot of other reasons, like the ability to implement an append/consume buffer, if I’m allowed to use the D3D terminology.
It may seem that I went a bit off-topic, however, just think about how atomic counters can interact so nicely with indirect drawing. There is the instance count field of the indirect draw commands, what if we bind that address as the back-up buffer memory for the atomic counter? Yes, we can save that costly asynchronous query to get the number of visible objects that we did otherwise in case of applying an ICR or Hi-Z map based occlusion culling. You may say that you can achieve the same thing with atomic read/writes as provided by the GL_EXT_shader_image_load_store. Well, that’s true, unless the additional performance hit by doing atomic memory writes is acceptable (atomic counters are much, much faster, however, it is true that in case of a GPU based culling algorithm, those few writes shouldn’t be the bottleneck). But now let us think more deeply into the problem. If we can use atomic read/writes to count the instances, as it is present in the indirect draw command in the buffer object, then what if we count the number of draw commands written into the indirect draw buffer using atomic counters? And here we are, we have the first building block of a GPU based culling algorithm that can handle arbitrary data sets.
Multi-Draw-Indirect phase 1
Now let’s say we somehow managed to generate an indirect draw buffer object with the list of the instanced draw command arguments necessary to render the visible objects, no matter whether we used the OpenGL toolset as in my demos or we used some compute API like OpenCL. Now somehow we have to initiate the drawing. We can do this by issuing several DrawArraysIndirect or DrawElementsIndirect command based on how many instanced draw command arguments we’ve generated.
But what if we could do this with a single command? This is where GL_AMD_multi_draw_indirect comes into picture and that’s what AMD implemented for us. We can actually do this by using one of the MultiDraw*Indirect commands introduced by the extension.
The best thing in it is that in case of lack of hardware support for it, the driver can still implement it by simply making a loop that calls the appropriate Draw*Indirect commands so every hardware that supports GL_ARB_draw_indirect can support GL_AMD_multi_draw_indirect, and in case the hardware actually supports the functionality, then we can get a slight performance increase for free.
Multi-Draw-Indirect phase 2
While the new extension adds quite some flexibility to the existing indirect drawing mechanism, it still lacks an important feature to become the Holy Grail of GPU based culling and scene management algorithms. We still have to perform an asynchronous query or otherwise determine the number of records written into the indirect draw buffer.
Of course, we can alleviate the problem by always initializing the indirect draw buffer with zero values (so that if one would issue an indirect draw command using any of the data in the buffer no actual rendering would take place) and then simply using a MultiDraw*Indirect command passing a primcount argument that is equal to the theoretical maximum of generated records. However, this might result in a performance decrease, especially if this theoretical maximum value is much bigger than the actual draw commands present in the buffer.
In order to circumvent this problem, we need some mechanism that allows us to also source the primcount argument of the MultiDraw*Indirect commands from buffer object memory. While such functionality is not exposed yet by any of the major graphics APIs (and may not be supported by current hardware) this could be the next major step towards a fully self-feeding renderer that handles graphics related data on a much higher level beyond triangles and pixels.
While the indirect drawing mechanism introduced with OpenGL 4.0 is just a very little part of the feature set introduced by Shader Model 5.0 GPUs, it has still a lot of room for improvement and evolution ahead. AMD made the first step with GL_AMD_multi_draw_indirect and I really hope that indirect drawing and other GPU self-feed mechanisms will gain more developer attention in the near future.
Finally, I would like to thank to Graham Sellers, the creator of the extension, Pierre Bourdier for his support on promoting the new functionality and all the engineers at AMD who have contributed to the specification and the implementation work behind it. I’m really glad to see that they take the word of the developers in which direction they improve their OpenGL support.