GPU based dynamic geometry LOD
Dynamic geometry level-of-detail (LOD) algorithms are popular and powerful: they improve rendering performance considerably while preserving perceived detail by using less detailed geometry for objects that are far away, too small, or otherwise less significant to the quality of the final image. Many of them have been in use since the very beginning of computer graphics and are present in some form in current CAD software, video games and other graphics applications. While determining the appropriate geometry LOD was previously the task of the CPU, with today’s hardware it is possible to offload this work to the GPU, which excels at handling large numbers of objects in parallel.
Introduction
With the advent of Shader Model 5.0 GPUs and programmable tessellation hardware it may seem that the geometry LOD problem is solved once and for all. However, in many cases tessellation alone is not enough: for far away objects even a pass-through tessellation shader produces more geometry than the added detail is worth. As a result, classic geometry LOD algorithms are still a good tool to have in the developer’s toolbox. Not to mention that all vendors recommend disabling tessellation shaders entirely when no geometry amplification is needed, as even a pass-through tessellation shader has its overhead.
This means that there still has to be a conventional rendering path for geometry that should not be tessellated. Then why not try offloading the geometry LOD determination to the GPU as well?
This article presents a technique that was already demonstrated by AMD’s March of the Froblins demo and by NVIDIA’s Skinned Instancing demo. It allows GPU based dynamic geometry LOD determination using a geometry shader that selects the most appropriate LOD from a group of geometry LODs based on the object’s distance from the camera. While this article and the reference implementation (OpenGL 4.0 – Mountains demo) present the application of the technique only for instanced geometry, the same method can be easily extended to support heterogeneous objects by taking advantage of the latest functionality introduced in OpenGL 4.
The algorithm
The technique is based on the geometry shader’s ability to emit, or decline to emit, primitives into a transform feedback buffer, as done in the mentioned DX based implementations. One major improvement compared to earlier approaches is that the LOD determination is done in a single pass rather than requiring a separate pass for each geometry LOD. Additionally, this LOD determination pass can also be merged with other visibility determination passes like Instance Cloud Reduction or Hierarchical-Z map based occlusion culling, as is done in the reference implementation. This is made possible by the latest transform feedback capabilities introduced in OpenGL 4.0 (see the extension ARB_transform_feedback3), which enable the geometry shader to output data to separate primitive streams.

Flow-chart presenting the culling and dynamic LOD algorithms used in AMD's March of the Froblins demo. The implementation needs five passes for culling and separating three detail levels and performs two asynchronous queries meanwhile. Requires OpenGL 3 compliant hardware.

Flow-chart presenting the culling and dynamic LOD algorithm used in our Mountains demo. The implementation requires only one pass for culling and separating three detail levels without the need to use asynchronous queries. Requires OpenGL 4 compliant hardware.
The algorithm itself is very simple and straightforward. For each object instance, determine the appropriate geometry LOD based on its distance from the camera and the LOD distances passed as a uniform to the shader. Then output the instance’s data to the output stream whose ID corresponds to the determined LOD’s index. Here is a GLSL implementation of the algorithm:
#version 400 core

uniform mat4 ModelViewMatrix;
uniform vec2 LodDistance;  // x: upper distance limit of LOD 0, y: upper limit of LOD 1

layout(points) in;
layout(points, max_vertices = 1) out;

in vec3 InstancePosition[1];

// one output stream per geometry LOD
layout(stream=0) out vec3 InstPosLOD0;
layout(stream=1) out vec3 InstPosLOD1;
layout(stream=2) out vec3 InstPosLOD2;

void main() {
    // distance of the instance from the camera (the view space origin);
    // only xyz is used so that the w component does not add to the length
    float dist = length((ModelViewMatrix * vec4(InstancePosition[0], 1.0)).xyz);
    if ( dist < LodDistance.x ) {
        InstPosLOD0 = InstancePosition[0];
        EmitStreamVertex(0);
    } else if ( dist < LodDistance.y ) {
        InstPosLOD1 = InstancePosition[0];
        EmitStreamVertex(1);
    } else {
        InstPosLOD2 = InstancePosition[0];
        EmitStreamVertex(2);
    }
}
Additionally, the geometry LOD determination pass has to be executed with primitive queries enabled for all the relevant output streams to acquire the number of instances for each geometry LOD index:
for (int i=0; i<NUM_LOD; i++)
    glBeginQueryIndexed(GL_PRIMITIVES_GENERATED, i, lodQuery[i]);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, instanceCount);
glEndTransformFeedback();
for (int i=0; i<NUM_LOD; i++)
    glEndQueryIndexed(GL_PRIMITIVES_GENERATED, i);
Finally, the only thing left is to issue an instanced draw call for each geometry LOD index to draw all the instances:
for (int i=0; i<NUM_LOD; i++) {
    // glGetQueryObjectiv writes the result through a pointer
    glGetQueryObjectiv(lodQuery[i], GL_QUERY_RESULT, &instanceCountLOD[i]);
    if ( instanceCountLOD[i] > 0 )
        glDrawElementsInstanced(..., instanceCountLOD[i]);
}
That’s all, and what you get as a result is a fully GPU based geometry LOD selection algorithm.
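To make the data flow of the three steps easier to follow, here is a CPU-only C sketch that mimics what the single GPU pass produces: one compacted instance list per LOD plus the per-LOD instance counts that the primitive queries return. It is an illustration under assumptions, not the demo’s code; LodStreams and bin_instances_by_lod are hypothetical names, and the per-instance distances are assumed to be computed beforehand.

```c
#include <stddef.h>

#define NUM_LOD 3

typedef struct { float x, y, z; } Vec3;

/* CPU stand-in for the transform feedback buffers (one per stream) and the
   per-stream GL_PRIMITIVES_GENERATED query results. */
typedef struct {
    Vec3  *stream[NUM_LOD];  /* each buffer must hold instance_count items */
    size_t count[NUM_LOD];   /* number of instances written to each stream */
} LodStreams;

/* One pass over all instances: pick the stream from the camera distance and
   append the instance to it. Instances keep their relative order within a
   stream, as they do with transform feedback. */
static void bin_instances_by_lod(const Vec3 *positions, const float *distances,
                                 size_t instance_count,
                                 float lod0_max, float lod1_max,
                                 LodStreams *out)
{
    for (int l = 0; l < NUM_LOD; ++l)
        out->count[l] = 0;

    for (size_t i = 0; i < instance_count; ++i) {
        int lod = distances[i] < lod0_max ? 0
                : distances[i] < lod1_max ? 1 : 2;
        out->stream[lod][out->count[lod]++] = positions[i];
    }
}
```

After the call, count[l] corresponds to instanceCountLOD[l] in the listing above, and stream[l] holds the per-instance data that the instanced draw call for LOD l would read.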
The Mountains demo
The reference implementation is provided as part of the OpenGL 4.0 – Mountains demo, which is available with full source code and a Windows executable in the downloads section. Besides the dynamic geometry LOD algorithm presented here, the demo application implements the same visibility determination algorithms that were presented in the SIGGRAPH 2008 Course Notes, all in a single pass.
Dynamic LOD can be enabled in the demo with the F3 key. Once enabled, the demo separates the various geometry detail levels according to the configured LOD distances. As can be seen, there is almost no visible difference between the scene rendered with dynamic geometry LOD enabled and disabled. Also, with appropriately set LOD distances, the algorithm provides a seamless transition between subsequent geometry detail levels as the camera moves.
When dynamic LOD is enabled, the demo also makes it possible to visualize the various geometry detail levels by pressing the F4 key. The highest detail LOD is marked red, the mid-level LOD green and the lowest detail LOD blue. It can be seen that as the camera moves the renderer automatically adjusts the detail of each individual instance.
Besides maintaining a constant quality without the viewer noticing any transitions between the various detail levels, the algorithm provides a huge performance gain for complex geometry, as can be seen in the figure below:

Performance comparison of the demo in frames per second on a Radeon HD5770 (higher is better): no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).
Conclusion
We’ve seen how straightforward it is to implement GPU based dynamic geometry LOD determination using geometry shaders on OpenGL 4.0 compliant hardware, and provided a reference implementation that uses the algorithm to efficiently determine detail levels for a large number of instanced geometries. We also briefly mentioned that the algorithm can be extended to handle arbitrary object sets. We discussed a possible OpenGL 3 based implementation but did not provide one, as it requires several rendering passes to perform all the operations that can be implemented in a single pass on Shader Model 5.0 hardware.
Even though the algorithm is already extremely efficient, it still involves the use of asynchronous primitive queries that may induce some latency. Of course, this latency can be easily hidden by performing other operations on the CPU/GPU until the results are available.
Furthermore, by taking full advantage of Shader Model 5.0 GPUs it would be possible to eliminate the need for asynchronous queries by using atomic counters and indirect rendering. However, the core OpenGL specification does not yet expose such functionality, so this improvement is left for a future release of the demo.
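As a rough illustration of the indirect rendering idea (this is not something the demo implements), the C sketch below fills one indirect draw command per LOD. The struct follows the command layout that glDrawElementsIndirect reads in OpenGL 4.0; LodMesh and build_indirect_commands are hypothetical names. Here the CPU writes the instance counts only to show the data layout. In a query-free version the GPU itself would have to write the primCount field, for example from a counter incremented during the LOD pass, so that the CPU never reads the counts back.

```c
#include <stdint.h>

#define NUM_LOD 3

/* One indexed indirect draw command, laid out as glDrawElementsIndirect
   expects it in OpenGL 4.0. */
typedef struct {
    uint32_t count;               /* number of indices of the mesh */
    uint32_t primCount;           /* number of instances to draw */
    uint32_t firstIndex;          /* offset into the index buffer */
    int32_t  baseVertex;          /* value added to each index */
    uint32_t reservedMustBeZero;
} DrawElementsIndirectCommand;

/* Hypothetical description of where each LOD mesh lives in shared
   index and vertex buffers. */
typedef struct {
    uint32_t index_count;
    uint32_t first_index;
    int32_t  base_vertex;
} LodMesh;

/* Builds one command per LOD from the mesh layout and the per-LOD
   instance counts. */
static void build_indirect_commands(const LodMesh mesh[NUM_LOD],
                                    const uint32_t instance_count[NUM_LOD],
                                    DrawElementsIndirectCommand cmd[NUM_LOD])
{
    for (int l = 0; l < NUM_LOD; ++l) {
        cmd[l].count              = mesh[l].index_count;
        cmd[l].primCount          = instance_count[l];
        cmd[l].firstIndex         = mesh[l].first_index;
        cmd[l].baseVertex         = mesh[l].base_vertex;
        cmd[l].reservedMustBeZero = 0;
    }
}
```

A command with primCount equal to zero simply draws nothing, so the per-LOD check for an empty instance list would no longer need a CPU round trip.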
Classic dynamic geometry LOD algorithms are still first class citizens of every rendering system. Even though the introduction of hardware tessellation somewhat reduces the need for these classic techniques, practice shows that the best way to implement a full-fledged dynamic LOD system is to use geometry LOD selection and tessellation together rather than one instead of the other.
Posted by Daniel Rákos on October 25, 2010 at 7:35 pm. Filed under Graphics, Programming, Samples.


about 2 years ago
Hi Daniel,
I was wondering if these techniques (ICR, Hi-Z map, dynamic LOD) could take advantage of a spatial division structure (e.g. an octree) to decrease the number of objects to check.
about 2 years ago
Yes, if the amount of data to process is large enough, a loose octree based implementation could be worth the additional processing passes. However, taking into consideration the number of parallel stream processors in current hardware, you usually won’t need any such mechanism unless you have to deal with about 100,000 objects or more.
about 2 years ago
OK. Another question: none of these methods could be used with OpenGL ES 2.0 as it has no geometry shaders, right?
And with the absence of occlusion queries, do you know some efficient technique to perform occlusion culling with OpenGL ES 2.0?
Would it decrease the application performance instead of boosting it, as you would have to do all the calculations on the CPU?
about 2 years ago
OpenGL ES 2.0 and the usual hardware it runs on are simply not a suitable platform for these types of techniques. Of course, one of the biggest problems is the lack of geometry shaders.
Due to the lack of this kind of programmability, when you have to use OpenGL ES you should rather fall back to standard CPU based methods.
E.g. ICR (aka per-object view frustum culling) obviously has a clear benefit on the CPU. This is also true for dynamic LOD selection.
For occlusion culling, well, it is a much more difficult question…
If you’ve read the previous article about Hi-Z map based occlusion culling you can see that Steve Hill has linked some material on how they used Hi-Z map based occlusion culling. They used a simple render-to-buffer technique, read the results back on the CPU, and then the CPU decided what has to be drawn and what can be culled. I think it should be possible to implement this using OpenGL ES 2.0.
about 2 years ago
OK, I’ll take a look at it.
Thanks
about 1 year ago
Good day. Thank you for your articles, they are very interesting and help to understand the technology.
But I have questions.
I’m developing a project where it is necessary to render vast areas with a large number of objects (grass, trees, shrubs, etc.).
The terrain surface is quite large (in principle, it is not limited at all), so I split it into tiles. Each tile is a container of resources, including vegetation. Each tile and its resources are loaded as needed.
Now I have implemented an approach where each type of object (vegetation) corresponds to an Instance object, and correspondingly to its own TBO. Initial tests showed that the overall performance depends mostly on the number of Instance objects and drops sharply with a sufficiently large number of TBOs. Then I decided to take advantage of your ICR approach, and I was very impressed with the result (with the number of objects can be seen when more than 500k less than 20 I got the best performance), but when I added a few more Instance objects of another type to the test case, the performance dropped significantly (60fps -> 32fps). I thought that maybe it’s because of frequent feedback switching. Then I implemented the dynamic LOD. As a result, I got a performance gain.
In the test case I used the tree model from your Mountains demo. With 3339 visible objects I only got 11fps, and with two visible objects 30fps. The total number of objects in the scene is 5000.
Now I’m at a loss as to what I could have overlooked in the implementation. Is it wrong to create a TBO for each object type? Or is a performance drop inevitable when using feedback buffers? Or is the problem the use of TBOs?
Data for your demos
Nature 1.2 200-840fps
Nature 2.0 200-830fps
mountains:
No-culling 16-70fps
No-culling + dynamicLOD 33-250fps
ICR 20-1054fps
ICR + dynamicLOD 67-1050fps
ICR + Hi-Z 130-550fps
ICR + Hi-Z + dynamicLOD 250-400fps
Description of the computer:
CPU Intel Core2Duo E4400 – 2 GHz
memory – 3 GB
GPU NVIDIA GTX 550 Ti
Windows XP sp3