OpenGL 4.0 – Mountains demo released
OpenGL 3.0 capable GPUs introduced a level of processing power and programming flexibility that no earlier generation could match. OpenGL 4.0 and the hardware supporting it pushed the limits even further. Thanks to these features, graphics developers now have more and more ways to implement GPU based scene management and culling algorithms. The Mountains demo showcases some of these rendering techniques that, as far as I know, have not been implemented in OpenGL before. In this article I will present the key features of the demo; they will be discussed in more detail in subsequent articles. Demo binaries with full source code are also published.
The demo is mainly inspired by the March of the Froblins demo released by AMD and by the SIGGRAPH 2008 Course Notes by Jeremy Shopf, Joshua Barczak, Christopher Oat and Natalya Tatarchuk, which present its implementation in detail. That demo targeted the Radeon HD4800 series and presented several practical GPU based culling algorithms implemented using DirectX10. The Mountains demo implements these techniques in OpenGL and further improves on the approach used in AMD’s demo by exploiting the new features introduced by Shader Model 5.0 hardware and OpenGL 4.0.
This article only briefly presents the demo and the rendering techniques used. Each technique needs a longer, more thorough discussion than would fit here, so the details will be presented in subsequent articles.
Introduction
The Mountains demo renders a tiled terrain block with thousands of high detail tree models (the full detail tree model is over five thousand triangles). Because the view distance used in the demo is quite large, several tiles of the terrain block are potentially visible on the screen, and this results in a huge explosion in the number of triangles the GPU has to render. Also, with traditional methods the rendering of the terrain blocks and the several thousand tree models would need loads of draw calls. In order to solve this problem, the demo renders the trees using geometry instancing to minimize the number of draw calls.
In a traditional rendering engine CPU based culling methods would be used. While that would work in practice, it is more convenient to perform the culling on the GPU as all the information needed to do it is available there. Moreover, culling is a typical algorithm that can easily take advantage of the highly parallel architecture of the GPU. Also, performing the culling on the CPU would make geometry instancing barely beneficial.
Another problem with a scene like this is that simple per-object view frustum culling would not solve the problem completely, as most of the trees in the view frustum are not visible because they are hidden by the terrain. In traditional OpenGL the way to solve this problem would be to use per-object occlusion queries and render bounding volumes. While this may work in practice, it involves too much CPU intervention even if we take advantage of conditional rendering, and it also breaks instancing.
These are the issues that motivated me in creating this demo and I established the following goals for the project:
- All the object-level information must stay on the GPU and the CPU should not make decisions on a per-object basis.
- The renderer should use as few draw calls as possible in order to solve the problem of visibility determination.
- Don’t draw anything that is not inside the view frustum or is occluded by terrain.
The result is a renderer that does little to no scene management on the CPU and instead uses the GPU for visibility determination. In most cases it is able to reduce the scene’s geometric complexity from over 400 million triangles to under one million triangles, providing an interactive experience on a Radeon HD5770 at around 200 frames per second.
Implementation
The scene consists of a tiled terrain with over 130 thousand triangles per block and more than 1400 tree instances, each with almost 6 thousand triangles. This sums up to about 8 million triangles for a single block of terrain. As the view range needs to be quite large, we actually deal with a 7×7 grid of terrain blocks that is dynamically placed so that the camera always resides in the middle block. This means that even though we dynamically generate the scenery around the camera, we still have to deal with a scene consisting of over 400 million triangles. This is simply too much for the GPU to deal with.
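The figures above can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this article; since the text says "more than 1400" instances, the constant 1400 is an assumed lower bound, and all names are made up for the illustration.

```cpp
#include <cstdint>

// Numbers quoted in the article. The instance count is a lower bound
// (the text says "more than 1400"), so the totals are lower bounds too.
constexpr std::uint64_t kTerrainTrianglesPerBlock = 130000;
constexpr std::uint64_t kTreeInstancesPerBlock    = 1400;
constexpr std::uint64_t kTrianglesPerTree         = 5811;  // full detail tree model
constexpr std::uint64_t kBlocksPerSide            = 7;     // 7x7 grid around the camera

// Terrain plus all full detail trees of a single terrain block.
constexpr std::uint64_t trianglesPerBlock() {
    return kTerrainTrianglesPerBlock + kTreeInstancesPerBlock * kTrianglesPerTree;
}

// The whole 7x7 grid before any culling or LOD selection.
constexpr std::uint64_t trianglesInScene() {
    return trianglesPerBlock() * kBlocksPerSide * kBlocksPerSide;
}

static_assert(trianglesPerBlock() > 8000000ULL, "about 8 million triangles per block");
static_assert(trianglesInScene() > 400000000ULL, "over 400 million triangles in total");
```

With these inputs a block holds 8,265,400 triangles and the full grid 405,004,600, which matches the "about 8 million" and "over 400 million" figures.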
The first step in reducing the geometric complexity of the scene is done on the CPU by performing view frustum culling on a per-terrain-block basis. This limits our 7×7 grid to a smaller subset that contains only those blocks that lie within the view frustum. The result is a scene of usually around 50 million triangles.
While this is already a reasonable amount of simplification, we have to do per-object culling to further reduce the amount of geometry to render. But as mentioned before, we would not like to do such fine grained scene management on the CPU, so we need some more sophisticated methods to do it on the GPU.
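A per-block frustum test of this kind could look roughly like the sketch below. It is not taken from the demo's source: the types, the function names and the plane convention (normals pointing into the frustum) are assumptions, and it uses the common "positive vertex" box-versus-plane test, which is conservative and may keep a few blocks that are actually outside.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Plane dot(n, p) + d = 0, with the normal pointing towards the inside
// of the frustum (assumed convention for this sketch).
struct Plane { Vec3 n; float d; };

// Axis-aligned bounding box of one terrain block.
struct Aabb { Vec3 min, max; };

using Frustum = std::array<Plane, 6>;

// Conservative test: a box is rejected only if it lies completely
// behind at least one of the six frustum planes.
bool intersectsFrustum(const Aabb& box, const Frustum& frustum) {
    for (const Plane& p : frustum) {
        // Corner of the box that lies farthest along the plane normal.
        const Vec3 v{ p.n.x >= 0.0f ? box.max.x : box.min.x,
                      p.n.y >= 0.0f ? box.max.y : box.min.y,
                      p.n.z >= 0.0f ? box.max.z : box.min.z };
        if (p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d < 0.0f)
            return false;  // even the farthest corner is outside this plane
    }
    return true;
}

// Returns the indices of the terrain blocks that survive the CPU culling step.
std::vector<std::size_t> visibleBlocks(const std::vector<Aabb>& blocks,
                                       const Frustum& frustum) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < blocks.size(); ++i)
        if (intersectsFrustum(blocks[i], frustum))
            visible.push_back(i);
    return visible;
}
```

With only 49 blocks to test, this step is cheap enough to stay on the CPU without conflicting with the goal of keeping per-object decisions on the GPU.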
In order to accomplish this, we will take advantage of the geometry shader’s capability of discarding geometry. We will use it to do the per-object decisions in order to cull the tree instances that are not visible. The three techniques implemented in the culling geometry shader and the accompanying vertex shader are the following:
- Instance Cloud Reduction (ICR) – This method does view frustum culling on a per-instance basis based on the bounding box of the instanced geometry, in this case the tree. The technique was first presented in my previous article titled Instance culling using geometry shaders and then further improved according to the instructions presented in Instance Cloud Reduction reloaded. In this case, the technique allows us to do a more fine grained yet still high level view frustum culling of the tree instances than that allowed by the simple per-tile culling performed on the CPU.
- Hierarchical-Z Map based Occlusion Culling – This technique allows for conservative per-instance occlusion culling done and evaluated completely on the GPU, using an algorithm similar to the one the hardware depth buffer uses to hierarchically reject fragments based on their depth values. Using this technique, a coarse occlusion culling can be performed on the instances without the need for occlusion queries and CPU intervention. Update! The technique is discussed in detail in the article Hierarchical-Z map based occlusion culling.
- Dynamic Level-of-Detail Determination – This method allows us to dynamically select a suitable geometry level-of-detail on a per-instance basis completely on the GPU based on the application provided LOD parameters and the distance of the instance from the camera. The Mountains demo uses three LOD levels for the tree object: one with 5811 triangles, another with 2893 triangles and the lowest detailed version contains 1492 triangles. Update! The technical details of the algorithm are presented in the article GPU based dynamic geometry LOD.
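To make the LOD decision concrete, here is a CPU-side sketch of distance-based selection among the three tree LODs. In the demo this decision is made per instance in the culling shaders; the code below only illustrates the logic. The `LodParams` structure, the two threshold distances and the function names are hypothetical, as the article does not specify the form of the application-provided LOD parameters.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Triangle counts of the three tree LODs quoted in the article.
constexpr std::array<unsigned, 3> kLodTriangles{{5811u, 2893u, 1492u}};

// Hypothetical application-provided LOD parameters.
struct LodParams {
    float lod1Distance;  // at or beyond this distance use the medium LOD
    float lod2Distance;  // at or beyond this distance use the lowest LOD
};

// Picks the LOD index (0 = full detail) from the camera distance.
std::size_t selectLod(const Vec3& instance, const Vec3& camera, const LodParams& params) {
    const float dx = instance.x - camera.x;
    const float dy = instance.y - camera.y;
    const float dz = instance.z - camera.z;
    const float distance = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (distance < params.lod1Distance) return 0;
    if (distance < params.lod2Distance) return 1;
    return 2;
}

// Sorts instance indices into one bucket per LOD, so that each bucket
// can later be drawn with a single instanced draw call.
std::array<std::vector<std::size_t>, 3> binByLod(const std::vector<Vec3>& instances,
                                                 const Vec3& camera,
                                                 const LodParams& params) {
    std::array<std::vector<std::size_t>, 3> buckets;
    for (std::size_t i = 0; i < instances.size(); ++i)
        buckets[selectLod(instances[i], camera, params)].push_back(i);
    return buckets;
}
```

Keeping one bucket per LOD is what makes instancing still work after the selection: each LOD needs only one instanced draw call, whatever the number of trees assigned to it.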
While in the Mountains demo all these techniques are used to determine the visibility and the LOD of static scenery (as trees are unlikely to move), these methods apply to dynamic scenery as well, without modification. This is very important to note, as dynamic objects are usually the ones that make many CPU based scene management and visibility determination algorithms difficult to use or simply inefficient.
The key improvement compared to how these techniques are used in AMD’s demo is that my implementation applies all the algorithms to the instance set in a single rendering pass compared to the several passes needed by the original implementation. This is because the Mountains demo takes advantage of the latest technologies introduced by OpenGL 4.0 and the supporting hardware (in this case the functionality provided by the extension GL_ARB_transform_feedback3).
By using these techniques the GPU is able to reduce the geometric complexity of the scene from 50 million triangles down to a few million, sometimes even under a million. Of course, the actual reduction efficiency is heavily influenced by the view position and direction.
Besides the scene management and visibility determination techniques, the demo also showcases a few simple visual effects:
- A simple infinitely far skybox generated using a geometry shader.
- Simple diffuse lighting applied to the tree instances.
- Global illumination-like effect that simulates the terrain casting shadows over the trees even though no shadow rendering technique is applied.
- Fog effect to smooth out the disappearance of the terrain at the far clip plane.
- Simplistic fake depth-of-field effect that makes far away objects look blurry.
Maybe I will also present some of these techniques in detail in another article if there is interest in it.
As I mentioned, I used a geometry shader to render the skybox, and I did the same when rendering full screen quads for the image space algorithms. I’ve done this because I always feel kind of stupid when I have to put such simple geometry as a skybox or a full screen quad into a vertex buffer. In these situations I feel like simply using immediate mode to draw that damn little piece of geometry, but I want to stick to core OpenGL so I quickly change my mind. As a simple alternative, I used geometry shaders to emit these simple geometric objects, which are used so often that I wonder why OpenGL does not have e.g. a glDrawScreenQuad-like command. Of course, geometry shaders don’t start by themselves, so I used dummy draw commands to make the geometry shader do its job.
Performance
Now let’s see how our GPU based optimizations perform in practice. I’ve collected results from typical view positions from where a moderate number of trees are visible. The tests were done on a Radeon HD 5770. Other configuration parameters are not really relevant as the demo is clearly GPU bound: only a few state changes and render commands are executed on the CPU. Of course, this is kind of a synthetic demo, as you would normally want to balance the workload between the CPU and the GPU. In practice, however, the CPU also has AI, physics and other things to handle, so transferring as much work to the GPU as possible usually gives a great benefit.

Performance comparison of the demo in frames per second on a Radeon HD5770 (higher is better): no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).
As you can see in the figure above, using all the optimizations clearly benefits the frame rate of the demo, even though Hi-Z map based occlusion culling requires several additional draw passes for the construction of the Hi-Z map. It is also clearly visible that in a scene like this, where there are a lot of occluders, ICR is simply not sufficient on its own. One final note: dynamic LOD has a more significant effect without Hi-Z, as occlusion culling removes the largest share of the instances.

Amount of visible geometry after culling in millions of triangles: no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).
Our next chart shows the amount of geometry that is finally drawn after culling, in millions of triangles. In this figure we see exactly the inverse of the previous chart, which is not surprising as we obviously have a geometry throughput bottleneck. It also clearly shows how important dynamic LOD is even if we don’t use more sophisticated visibility determination algorithms.
| Culling method | No LOD | Dynamic LOD |
| No culling | 17 draw calls | 19 draw calls |
| Instance cloud reduction | 17 draw calls | 19 draw calls |
| ICR + Hi-Z map based occlusion query | 27 draw calls | 29 draw calls |
Finally, in the table above we’ve listed the number of draw calls needed by each technique from the reference point of view. The techniques applied do not have a significant effect on the amount of draw calls: we have a fixed number of draw calls and additionally two draw calls if we use LOD. The only exception is when we use Hi-Z map based occlusion culling as the Hi-Z map is a full mipmap chain and we need ten additional draw calls to generate all the mip-levels.
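The ten extra draw calls follow from the length of a full mipmap chain. As a sketch, assuming level 0 of the Hi-Z map comes directly from the depth pass and every further level takes one downsampling draw, the number of passes can be computed as below. The demo's actual render target size is not stated here, so the resolutions in the note that follows are only examples.

```cpp
#include <algorithm>
#include <cassert>

// Number of levels in a full mipmap chain of a width x height texture:
// floor(log2(max(width, height))) + 1.
unsigned mipLevelCount(unsigned width, unsigned height) {
    assert(width > 0 && height > 0);
    unsigned levels = 1;
    for (unsigned size = std::max(width, height); size > 1; size /= 2)
        ++levels;
    return levels;
}

// Assumption: level 0 is filled by the depth pass itself, so one extra
// draw call is needed for each of the remaining levels.
unsigned hiZDownsamplePasses(unsigned width, unsigned height) {
    return mipLevelCount(width, height) - 1;
}
```

Any target whose larger dimension is between 1024 and 2047 pixels, for example 1024×1024 or 1280×720, has 11 mip levels and therefore needs 10 downsampling passes, which is consistent with the difference between 17 and 27 draw calls in the table.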
Conclusion
The techniques presented are rather simple to implement and can provide huge performance increases. Moreover, they allow the renderer to offload even some of the object-level algorithms from the CPU to the GPU, and obviously this is the direction to go in the future.
We’ve also mostly met the goals set at the beginning. Not fully, of course, as the occlusion culling performed is a rather coarse method and does not completely eliminate all the instances that will not contribute to the final image.
Future work
While the implementation almost completely eliminates all need of CPU intervention during the rendering phase, I still had to use a few asynchronous queries to get the amount of visible instances for each geometry LOD, although the latency incurred by the use of query objects is hidden in the demo by rendering the skybox between the initiation of the queries and the retrieving of the results.
As soon as atomic counters get into core OpenGL and drivers start supporting them, I will further improve the technique using indirect rendering and atomic counters so that even the need for these queries will be eliminated.
Additionally, as mentioned several times, I plan to write detailed articles about the individual techniques I used in the demo. I decided to go in this direction as a thorough description of all the details of the demo would be simply too long in one piece.
Running the demo
The demo uses OpenGL 4.0 so a Shader Model 5.0 capable graphics card is a must. Even though most of the techniques used could be implemented on OpenGL 3.x, this time I wanted to stick to GL 4.0 as I took advantage of its new features to further improve the implementation.
First, don’t be alarmed if the demo runs at very low frame rates after startup. This is because all GPU based optimizations are disabled by default.
You can use the SPACE key to switch between the various culling methods:
- No culling at all
- Instance cloud reduction
- ICR with Hi-Z map based occlusion culling
Finally, you can turn dynamic LOD on and off using the F3 key.
There are a few other controls present in the demo that you may figure out if you read the code, but I don’t want to go into the details of them as they will be presented in the upcoming articles where I will present Hi-Z map based occlusion culling and dynamic LOD in detail. So stay tuned: follow me on twitter or subscribe to the RSS feed.
The demo can be downloaded with full source code in the downloads section.
This entry was posted by Daniel Rákos on October 11, 2010 at 9:19 pm, and is filed under Graphics, Programming, Samples.


about 2 years ago
I have the latest Nvidia drivers and an 8800 GT card.
Guess my card doesn’t support the used OpenGL 4 features since I get this error:
Failed to link shaders:
Vertex info
-----------
0(1) : warning C7568: #version 400 not fully supported on current GPU target profile
Geometry info
-------------
0(1) : warning C7568: #version 400 not fully supported on current GPU target profile
(0) : warning C6504: Unknown profile option 'Invocations' ignored
(0) : warning C6504: Unknown profile option 'Invocations' ignored
0(32) : error C5108: unknown semantics "THREAD_ID" specified for "gl_InvocationID"
Fragment info
-------------
0(1) : warning C7568: #version 400 not fully supported on current GPU target profile
about 2 years ago
OpenGL 4 is only supported on the ATI Radeon HD5000 series, the NVIDIA GeForce 400 series and up. It is roughly equivalent to DirectX11.
In short, it won’t work on a GF8800.
about 2 years ago
You can further improve your terrain culling using http://vertexasylum.com/2010/07/11/oh-no-another-terrain-rendering-paper/ with instancing.
Is there some way to run this demo on a Radeon 4650?
about 2 years ago
Unfortunately not. As I mentioned, the theory behind the algorithms can also be implemented for GL3 GPUs, I just wanted to take advantage of the new features in GL4. A GL3 implementation is possible, though it would be a bit more complicated. If there is interest in it, maybe I can provide a GL3.3 version as well. However, I cannot promise it as it would be difficult to find the time for it.
Btw, I know the terrain optimization technique you mentioned, but terrain rendering is not the bottleneck here so I decided not to deal too much with it.
about 2 years ago
I’ve got error C1008: undefined variable “EndStreamPrimitive” in cull.gs on a gtx 470 with the latest drivers. I tried to add #extension GL_ARB_gpu_shader5 : enable with no luck.
Any idea ? Thanks
about 2 years ago
Enabling GL_ARB_gpu_shader5 is unnecessary as it is in core GL 4.0 and GLSL 4.00 and the shaders explicitly specify the shading language version.
Also, EndStreamPrimitive is not a variable, but a built-in function specified in GLSL 4.00 so I suppose this is a problem with the NVIDIA driver.
If you don’t have the latest drivers installed, I would recommend installing them.
If you have them, I think this is worth a bug report to NVIDIA.
about 2 years ago
I got this running on Linux (archlinux, nvidia-260.19.06, gcc-4.5.1) with some changes:
- uint type is ambiguous, use glm::uint (also you should not use using namespace in header files)
- add #include to tga.cpp (memcpy undefined)
- remove EndStreamPrimitive() calls in cull.gs, they are not needed in this case, perhaps the nvidia glsl compiler undefines them because they are not needed
sf::Window::GetInput doesn’t work for me, perhaps sfml2 svn has a bug or is incompatible with xorg 1.9.
about 2 years ago
Thanks for the pointers. I’ll update the code according to the first two comments.
About the third one, I’m not sure whether we can simply remove the EndStreamPrimitive() call. I’ll double check the specification and if you are right, I’ll remove that as well. But anyway, the driver should not complain about them.
About SFML, please make sure that you have the latest version of the trunk, as there were a lot of bug fixes related to GetInput lately. If it still doesn’t work after that, then I think it really is a bug in SFML.
about 2 years ago
Removing EndStreamPrimitive() makes it work.
The depth of field is highly distracting. I suppose this was to counteract the foliage aliasing (which appeared when I removed it)? Why not use MSAA and/or tree lods?
The camera movements should be scaled by the fps, btw.
Performance varies between 18 and 60 fps on GTX470.
about 2 years ago
I fixed the broken mouse input, on my system it seems the mouse pos in Input is delayed by one frame. I changed the code so I only reset the cursor pos if the mouse pos differs from the center. I am on the latest sfml2 (svn rev 1579).
I posted about the missing EndStreamPrimitive: http://www.nvnews.net/vbulletin/showthread.php?t=156039
Let’s see what nvidia says, I suppose it’s not implemented yet.
I am looking forward to your interesting articles and samples.
about 2 years ago
The camera movement is in fact scaled by the fps. The only problem is that I did not implement sophisticated camera movement smoothing but only scaled the movement based on frame time, which is not the most accurate way.
I used the depth of field effect only because I had to use framebuffer objects for the Hi-Z map based occlusion culling, so I wanted to at least do something in the full screen quad pass at the end of the rendering. Also, the depth of field effect I’ve used is very basic; it is not how you should do it if you want to create something really cool. However, outstanding visual quality is not the main goal of the demo, which is why I didn’t bother with these issues that much as I had limited time (just a few days).
I didn’t use MSAA just because I did not want to use it, and in fact there are various tree LODs. The only problem is that even the lowest detail LOD is too complex and that’s why such a high degree of aliasing is visible.
I know that there is a lot of room for improvement in the demo and thanks for the feedback! I’ll try to solve as many of the mentioned issues as possible, however I have limited time.
about 2 years ago
About the EndStreamPrimitive issue.
The specification does not state whether it is required or not, and it seems that the algorithm works without it with AMD drivers as well, so I’ll remove the calls just to make the demo work on NVIDIA cards too.
I’ll update the downloads based on your comments then.
about 2 years ago
Kaspersky is reporting a Trojan in mountains.exe
08/05/2011 2:51:20 PM Detected: HEUR:Trojan.Win32.Generic Firefox http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-win32.zip/Mountains.exe
about 2 years ago
Thanks for the info, I’ll investigate it, however, my anti-virus did not find anything.
Update: Checked with NOD32 and Norton Anti Virus. None of them found any problem.
about 2 years ago
My mistake, sorry. I just updated from last month’s databases and it reports nothing, it’s clean. I also uploaded it to http://www.virustotal.com and it was clean across 41 antivirus progs. Again, sorry for that alert.
about 2 years ago
No problem. At least you took the time to notify me, even though it was a false alarm, so thanks for that.
about 1 year ago
I just ran the .exe file on my machine, with a GTX 470. However, no trees appear. Only the terrain and the sky can be seen. What could be the problem in this case? Have you come across the same situation?
about 1 year ago
I would first suspect a driver bug, so maybe you should update your drivers. However, I have heard that NVIDIA has introduced a lot of bugs in their OpenGL drivers since version 270.x, so maybe you should try a 260.x version, but I’m unsure as I don’t own an NVIDIA card.
To the best of my knowledge, the demo is compliant with the standard so my bet is on the driver, especially because other NVIDIA users managed to run the demo.