Hierarchical-Z map based occlusion culling
Hierarchical-Z is a well known and standard feature of modern GPUs that allows them to speed up depth testing by rejecting large group of incoming fragments using a reduced and compressed version of the depth buffer that resides in on-chip memory. The technique presented in this article uses the same basic idea to allow batched occlusion culling for large amount of individual objects using a geometry shader without the need of any CPU intervention that is unavoidable using traditional occlusion queries. The article also provides a reference implementation in the form of the OpenGL 4.0 Mountains demo that uses the technique for culling thousands of object instances.
Introduction
Occlusion culling is a visibility determination algorithm that is used to identify those objects that did reside in the view volume but still aren’t visible on the screen due to occlusion. That means they are hidden by such objects that reside closer to the camera.
For several generations now GPUs allow hardware accelerated methods to perform occlusion culling in the form of occlusion queries. OpenGL provides the functionality via the extension ARB_occlusion_query. Occlusion queries are very simple: when you draw an object with occlusion query enabled the query returns the number of samples that passed the depth test (or simply return true or false based on whether any samples of the objects passed the depth test or not as it is provided by the OpenGL extension ARB_occlusion_query2).
So actually performing occlusion culling using occlusion queries means simply the following:
- Draw the object while occlusion query is enabled.
- If the query result is that the object is visible then draw the object.
At first, this may sound stupid as you have to draw the object in order to tell whether it is visible or not. While in this form it really sounds silly, in practice occlusion query can save a lot of work for the GPU. Think about you have a complex object with several thousands of triangles. If you would like to determine the visibility of it using occlusion query you would simply render e.g. the bounding box of the object and if the bounding box is visible (occlusion query returns that some samples have passed) then it means the object itself is most probably visible. This way you can save the GPU from the unnecessary processing of large amount of geometry.
I have to mention here that I intentionally used the expression “most probably visible” as occlusion queries provide just a conservative estimate on whether the object is visible or not rather than an exact result. This is because the bounding box occupies a different (larger) portion of the screen than the original geometry. So what we expect from an occlusion culling algorithm is to give one of the following results: the object is not visible or the object is most probably visible. The bigger this probability is the better the occlusion culling effectiveness is.
While we would always want an occlusion culling algorithm to be as effective as possible usually we have to make a trade-off between effectiveness and efficiency. In the above example if we would like to have 100% effectiveness then we would have to draw the whole object and that would defeat most of the goals of occlusion culling. The algorithm presented in this article is somewhat even more conservative but enables the use of occlusion culling for much larger datasets.
Motivation
While hardware accelerated occlusion query is a powerful tool to use in visibility determination it puts a quite reasonable burden on the application to manage the occlusion queries and to draw the objects based on the results when they are available (taking in consideration the asynchronous nature of occlusion queries). The most naive use of occlusion queries would be to execute the query right before we have to draw the object. While this seems like a feasible idea, it does not perform well in practice as the CPU has to be stalled until the result of the query is available and that involves also empty cycles on the GPU as well thus results in unacceptable performance. In order to resolve this, the application has to fill the time between the query execution and the drawing of the object based on the query result. While there are techniques to accomplish this, it definitely comes at a cost as the implementation becomes more complex.
The aforementioned problem is somewhat resolved by using conditional rendering introduced in OpenGL 3 (NV_conditional_render extension). However, this extension does nothing just in case the results of the query are not available yet then we simply draw the object no matter if it is visible or not. This can avoid the stalling of the rendering pipeline and can be done in software if the extension is not available, however, it somewhat defeats the purpose of occlusion culling.
Another deficit when using occlusion queries is that there is still need for CPU intervention in order to make a decision about the visibility of the object. For today’s hardware where proper batching is one of the most crucial aspects of the renderer such an approach is rather ineffective.
The occlusion culling technique presented in this article solves both these issues by providing an implementation that is very simple to integrate into any renderer, does put little to no burden on the renderer and makes decision about the visibility of objects entirely on the GPU.
The algorithm
As in case of many other GPU based culling algorithm presented by me and others, the hierarchical-Z map based occlusion culling uses the geometry shader’s ability to deny the emission of primitives that are determined to be invisible on the final rendering. The shader will only emit data for those objects that are visible and this data is streamed out into a buffer object using transform feedback.
The algorithm itself is similar in spirit to the hierarchical Z testing that is implemented in modern GPUs. After rendering all the occluders in the scene, we construct a hierarchical depth image from the depth buffer which we will refer to as the Hi-Z map. This texture map is a mip-mapped, screen resolution image where each texel in mip level i contains the maximum depth of all corresponding texels in mip level i-1. This depth information can be collected during the main rendering pass for the occluding objects as we need a texture of the same resolution so we don’t need a separate depth pass. This can be simply accomplished using OpenGL framebuffer objects.
After the construction of the Hi-Z map, occlusion culling can be performed by comparing depth value of the object’s bounding volume and the depth information stored in the Hi-Z map. This is when the hierarchical mip-mapped structure of the Hi-Z map comes handy as we can do conservative depth comparisons with less texture fetches by sampling directly from a particular mip level.
This is why we constructed the Hi-Z map using a “store maximum depth” policy. This will work with a usual depth buffer setup where the depth comparison function is either GREATER or GEQUAL. For a reverse directed depth buffer the “store minimum depth” policy has to be used.
Hi-Z map construction
In case of single-sample rendering, one can use the Hi-Z map as the main depth buffer for rendering the scene. The technique extends also to multi-sampled rendering but in this case a separate full-screen quad pass is needed to calculate the maximum depth of each individual sample in the multi-sampled depth buffer and store it in the single-sampled Hi-Z map. This is possible since OpenGL 3.2 or using the extension ARB_texture_multisample. Besides this additional step, the algorithm remains the same.
The Hi-Z map can be constructed using OpenGL framebuffer objects by rendering a full-screen quad pass for each mip level where the previous mip level is bound as the input texture and the current mip level is bound as render target. As OpenGL does allow rendering from and to the same texture object as far as we don’t access the same mip level for both reading and writing, the algorithm simply looks like the following:
// bind depth texture
glBindTexture(GL_TEXTURE_2D, depthTexture);
// calculate the number of mipmap levels for NPOT texture
int numLevels = 1 + (int)floorf(log2f(fmaxf(SCREEN_WIDTH, SCREEN_HEIGHT)));
int currentWidth = SCREEN_WIDTH;
int currentHeight = SCREEN_HEIGHT;
for (int i=1; i<numLevels; i++) {
// calculate next viewport size
currentWidth /= 2;
currentHeight /= 2;
// ensure that the viewport size is always at least 1x1
currentWidth = currentWidth > 0 ? currentWidth : 1;
currentHeight = currentHeight > 0 ? currentHeight : 1;
glViewport(0, 0, currentWidth, currentHeight);
// bind next level for rendering but first restrict fetches only to previous level
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, i-1);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, i-1);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
GL_TEXTURE_2D, depthTexture, i);
// draw full-screen quad
............
}
// reset mipmap level range for the depth image
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, 0);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, numLevels-1);
It is very important not to forget about the step when we ensure that the viewport size is always at least 1×1 as in case of non-power-of-two (NPOT) textures due to rounding problems. I forgot this first and I was wondering an hour why my last mip level didn’t get filled.
While one may wonder how this technique can be efficient after so many full-screen quad passes, it is in fact very efficient and it constructs the Hi-Z map on my Radeon HD5770 in less than 0.2 milliseconds. The measurement should be quite accurate as I’ve done it using OpenGL timer queries (see the extension ARB_timer_query).
The fragment shader used for the construction of the Hi-Z map is very straightforward except one thing. We use an NPOT depth texture due to the aspect ratio of the window and as NPOT textures use a “floor” convention to determine the size of subsequent mip levels (see the extension ARB_texture_non_power_of_two) we need predicated fetches as in case of reduction from odd-sized mip levels we should not forgot about the edge texels:
#version 400 core
uniform sampler2D LastMip;
uniform ivec2 LastMipSize;
in vec2 TexCoord;
void main(void)
{
vec4 texels;
texels.x = texture( LastMip, TexCoord ).x;
texels.y = textureOffset( LastMip, TexCoord, ivec2(-1, 0) ).x;
texels.z = textureOffset( LastMip, TexCoord, ivec2(-1,-1) ).x;
texels.w = textureOffset( LastMip, TexCoord, ivec2( 0,-1) ).x;
float maxZ = max( max( texels.x, texels.y ), max( texels.z, texels.w ) );
vec3 extra;
// if we are reducing an odd-width texture then fetch the edge texels
if ( ( (LastMipSize.x & 1) != 0 ) && ( int(gl_FragCoord.x) == LastMipSize.x-3 ) ) {
// if both edges are odd, fetch the top-left corner texel
if ( ( (LastMipSize.y & 1) != 0 ) && ( int(gl_FragCoord.y) == LastMipSize.y-3 ) ) {
extra.z = textureOffset( LastMip, TexCoord, ivec2( 1, 1) ).x;
maxZ = max( maxZ, extra.z );
}
extra.x = textureOffset( LastMip, TexCoord, ivec2( 1, 0) ).x;
extra.y = textureOffset( LastMip, TexCoord, ivec2( 1,-1) ).x;
maxZ = max( maxZ, max( extra.x, extra.y ) );
} else
// if we are reducing an odd-height texture then fetch the edge texels
if ( ( (LastMipSize.y & 1) != 0 ) && ( int(gl_FragCoord.y) == LastMipSize.y-3 ) ) {
extra.x = textureOffset( LastMip, TexCoord, ivec2( 0, 1) ).x;
extra.y = textureOffset( LastMip, TexCoord, ivec2(-1, 1) ).x;
maxZ = max( maxZ, max( extra.x, extra.y ) );
}
gl_FragDepth = maxZ;
}
I was experimenting with using texture gather lookups to reduce the number of texture fetches from 4-to-7 fetches per fragment down to 1-to-3 fetches per fragment (see the extension ARB_texture_gather) it seems that texture gather works only if the image is linearly sampled and to avoid the additional burden involved by switching filtering state during rendering I stuck to simple texture lookups as using texture gather lookups did not show any visible effect on the construction time of the Hi-Z map.

Various mip levels of the Hi-Z map. The Hi-Z map size is 1024x768 and the displayed mip levels are: level 4 (left), level 5 (middle) and level 6 (right).
For debugging and demonstration purposes the Mountains demo has built-in function to display the content of the various mip levels of the Hi-Z map. This is available by pressing the F4 key while Hi-Z map based occlusion culling is enabled. The + and – keys can be used to switch between the mip levels.
In order to better visualize the depth information in the depth buffer I converted the non-linear depth values stored in the depth texture into linear depth values as presented in [GeeXLab] How to Visualize the Depth Buffer in GLSL.
Culling with the Hi-Z map
Once we have constructed the Hi-Z map, we can perform the actual occlusion culling by fetching the 2×2 texel neighborhood corresponding to the screen area occupied by the bounding volume of the object whose visibility has to be determined. In the demo I used bounding boxes but any other bounding volume can be used (e.g. a bounding sphere is usually accurate enough for this technique).
First, we have to calculate the clip space bounding rectangle of the bounding volume. In the bounding box case this is done by transforming the bounding box vertices into clip space and then calculate the minimum and maximum X and Y coordinates. This bounding rectangle will be used for two things: it defines the texture coordinates that we’ll have to use for the Hi-Z map lookup and it helps determining the appropriate LOD for the texture lookup.
In order to determine the texture LOD that we’ll have to fetch we have to calculate the screen space size of the bounding square corresponding to the clip space bounding rectangle determined previously. This can be simply done by calculating the width and height of the bounding rectangle in clip space and then transforming this into screen space:
float ViewSizeX = (BoundingRect[1].x-BoundingRect[0].x) * Transform.Viewport.y; float ViewSizeY = (BoundingRect[1].y-BoundingRect[0].y) * Transform.Viewport.z;
After this, the texture LOD can be simply calculated using the following formula:
float LOD = ceil( log2( max( ViewSizeX, ViewSizeY ) / 2.0 ) );
Finally, as we have the texture coordinates (the vertices of the clip space bounding rectangle) and the texture LOD, we simply have to make four texture lookups into the Hi-Z map using these parameters, calculate the maximum of the four depth values returned and compare it to the depth value corresponding to the object (this is the object’s front-most point’s depth value that comes also from the clip space coordinates of the bounding box). If the object depth is greater than the reference depth the object is occluded and so it is culled by the geometry shader as usual.
One may ask why we use a 2×2 texel footprint for calculating the reference depth value why not just fetch the next mip level only once (as there we also get the maximum values of a 2×2 texel footprint due to the Hi-Z map construction method). That’s what I’ve also asked myself at first sight but quickly figured out the reason (see the figure below).

Comparison of number of fetches used for occlusion culling. Both figures show the magnified screen coverage of a single Hi-Z map texel at mip level N, texel coverage for mip level N-1 is in cyan and texel coverage for mip level N-2 is in blue. Object is show as red and yellow indicates the fetched texels.
In case of four texels not just the determination of the texture LOD is much easier but also it better encompasses the actual object bounding rectangle. In case of one texture fetch the computation of texture LOD is more complicated and expensive but the main problem is that a larger LOD has to be fetched and it is not always the LOD determined in the case of four fetches plus one. In the most extreme situation (if the bounding rectangle is right at the middle of the screen) it is possible that we have to fetch the largest LOD. This does not result in any false culling but it severely degrades the effectiveness of the culling.
Of course, it is possible to use more complex screen space bounding polygon as well as more fetches but those would increase the effectiveness of the culling much less than the additional burden and expensive operations worth.
Conclusion
We’ve seen how traditional hardware occlusion culling works by using occlusion queries. We also discussed that we sometimes need a better algorithm that does the occlusion culling for large amount of objects without CPU intervention.
The article also described a way to implement such an occlusion culling algorithm by using a hierarchical-Z map and geometry shaders. We’ve also managed to provide a reference implementation in the form of the demo called Mountains that can be downloaded with full source code in the downloads section.
The algorithm performs very well in practice on current hardware. The Hi-Z map construction takes less than 0.2 milliseconds and the actual culling comes at almost no cost for even thousands of objects. For more detail about performance comparison between rendering with and without hierarchical-Z map based occlusion culling read the article about the OpenGL 4.0 Mountains Demo.
While the demo uses the technique only for culling instances of the same object, the technique can be easily extended to work for heterogeneous set of objects as the actual culling algorithm works on a per-object basis and is completely indifferent regarding to the method used for rendering the actual geometry.
This technique can be thought of as the next step towards a completely GPU based visibility determination and scene management system.
Acknowledgements go to Jeremy Shopf, Joshua Barczak, Christopher Oat and Natalya Tatarchuk and their SIGGRAPH 2008 Course Notes that inspired this work.
| Print article | This entry was posted by Daniel Rákos on October 19, 2010 at 7:13 pm, and is filed under Graphics, Programming, Samples. Follow any responses to this post through RSS 2.0. You can leave a response or trackback from your own site. |


about 2 years ago
Really nice implementation!
about 2 years ago
First of all, congratulations for your tutorials. They are really great
I want to ask you a couple of questions about scene management.
It seems that is possible to do all object visibility management in the GPU. But, what about the world geometry? For example, in your mountains demo, are the mountains geometry also culled like the trees? What would be the case if the world geometry were a polygon soup instead of a height map?
My second question is about the ‘Dynamic Level-of-Detail Determination’ technique. What’s the difference with Tessellation?
about 2 years ago
About scene management:
Usually you have to split up your objects into two sets: occluders and occludees. Here the spliting was straightforward as the terrain is the object which is likely to occlude the rest of the scene. Of course, this can be determined on-the-fly on the GPU but such an implementation is left for a future demo (I plan to write one that does so).
In case you have a polygon soup that represents the static scene you usually want to split it to objects (like groups of polygons that fit in a unit sized cube or whatever). You usually do not want all your static world scenery to be just one big object.
So the technique can be really extended to handle arbitrary scene and I’ll write demos and articles about how to do it.
About dynamic LOD:
Even though you use tessellation, you usually still want to have a few LOD levels in case of some objects as tessellation is not an all-in-one answer for all the LOD management. However, you can also combine GPU based dynamic LOD with tessellation and actually that’s how it is done in AMD’s March of the Froblins demo.
There they use three geometry LODs and in the highest quality LOD they use tessellation to further improve geometry detail based on the distance to camera.
I’ll write an article also about dynamic LOD in the near future so stay tuned.
about 2 years ago
Just amazing.
Discoverd this site by chance and even when there are just a few articles they are all worth reading.
These tiny pieces of “intense & condensed” information you post every once in a while have been really helpful to me, and I’m sure quite a lot of people will find them useful too.
Keep up the good work!
about 2 years ago
“This technique can be thought of as the next step towards a completely GPU based visibility determination and scene management system”
You may be interested to know that we did this type of culling in Splinter Cell: Conviction for all geometry, lights, AO fields, particle systems, shadow casters and arbitrary bounds (you name it):
http://bit.ly/dvkYii
Nick Darnell also did a couple of write-ups on the subject:
http://www.nickdarnell.com/?tag=hierarchical-z-buffer
about 2 years ago
Thanks for the links. Didn’t know about this, I heard about the technique only from AMD’s March of the Froblins demo.
Anyway, it is good to hear that this technique is an industry proven one and it really works well in practice.
about 2 years ago
I’ve read your article and I see you used buffer readbacks after the occlusion culling happened as you evaluated the results on the CPU. The latency incurred by this was not visible after all?
about 2 years ago
The latency from the processing itself wasn’t an issue for us (it’s really fast), but it does introduce a sync point, so it’s important to order things well and to try to keep both the GPU and CPU busy. I really simple example would be rendering multiple cameras in a frame: it’s generally better to do all of the visibility processing upfront instead of interleaved (vis, render, vis, render).
On consoles this stuff is a lot easier to measure and tweak, but in practice we didn’t see any problems on PC.
about 2 years ago
Oops, I did mean the sync point not the latency. But anyway, it is good to know that even with such an explicit sync point the algorithm has clear benefits.
I suppose you used two buffers to minimize the sync issue so that the culling writes to the first buffer, then the content of the buffer is copied to the second and the second one is accessed by the CPU. Didn’t you?
about 2 years ago
We didn’t double/triple buffer as we wanted to avoid visibility errors that come from using the results from the previous frame (the usual alternative). It really keeps things simple when you can get away with doing that.
about 2 years ago
No, I meant double buffering within one frame, not between frames. Just for minimizing the sync.
about 2 years ago
Ooh, sorry, I think I see what you mean now! I can’t speak for OpenGL but with straight DX9, you’re required to copy the results to CPU-accessible memory anyway via GetRenderTargetData() (DEFAULT rendertarget to SYSTEM texture). As you allude to, this can reduce the chance of stalls if the copying is asynchronous and the actual reading of the results happens later in the frame.
about 2 years ago
Yes, that’s what I mean.
In OpenGL you can map the buffer to be accessible from CPU, however there is no guarantee that in such cases an actual copy won’t happen, so maybe my idea does not provide any benefit at all.
Also, as you use DX9 I suppose you used a simple R2VA method. In that case you cannot even avoid CPU decision making as there is no such method like that made possible by geometry shaders to discard data rather than streaming it out.
about 2 years ago
Right. In fact, we just rendered to a regular texture on PC; CPU processing was unavoidable anyway since the results were used for more than just instanced geometry.
about 2 years ago
Actually this technique extends also to non-instanced geometry by taking advantage of ARB_draw_indirect so if you would have targeted SM5.0 hardware you would be able to avoid the CPU processing.
about 2 years ago
Ah, that’s really interesting. It’s a shame that it appears to be restricted to just draw calls though and not draw plus state, or indirect command buffers (unless I missed something). Without that, there isn’t the opportunity to skip all of the CPU setup.
about 2 years ago
In fact, there is way to skill CPU setup in most cases. If you use texture arrays, shader subroutines and a few other things wisely you barely ever have to modify state, but I don’t want to go into the details as we have to leave something for my future articles…
about 2 years ago
I see.
Well, it partly depends on content as to how far you can go with this stuff and still keep things reasonably optimal/flexible.
about 2 years ago
Exactly. And the sad thing is that mostly the API is the limiting factor not the hardware. That’s why I continuously try to bomb the ARB with ideas what should be included into core. My current project is to make them create ARB_draw_instanced2 and ARB_draw_indirect2 based on my requests
Unfortunately, that is not this simple, however I try to keep the pressure and maybe my voice get heard…
about 2 years ago
Right, what you were suggesting earlier is a workaround for current API limitations.
I was going to write more but I think I’ll leave that discussion for later if you plan to blog about your ideas in that area.
about 2 years ago
Feel free to contact me on e-mail for further discussion. This comment-flow already went a bit off-topic
about 2 years ago
Perfect article!
Have never heard about this technique and find text very useful and interesting to read. But there is one thing I can’t understand: could you please explain why one cant use glGenerateMipmaps on once rendered screen-size depth buffer to get all Hi-Z map levels?
about 2 years ago
If you would use simply automatic mipmap generation, then it would result in mipmaps where every texel is the average of the corresponding texels in the earlier mip level. However, this cannot be used for culling as we cannot say that an object is occluded just because its distance from the camera is greater than the average distance of the already drawn pixels. You have to make sure that it is further than any of the already drawn pixels, so you need to use max instead of average. Unfortunately we don’t have control over the automatic mipmap generation in OpenGL so we cannot specify a min/max operator instead of the default averaging one, so that’s the reason we need a custom mipmap generation technique.
about 2 years ago
Oh, sure, I forgot it must be max values in mipmap, thank you
about 1 year ago
Hey,
Thanks for taking the time to share these examples with the rest of us, they’re a great resource. I have a quick question about your mipmapping shader.
Is there a factor 2 missing in the following conditions?
if (int(gl_FragCoord.y) == LastMipSize.y-3)
As int(gl_FragCoord.y) will have a range of [0,LastMipSize.y/2)?
Assuming I’m right, you can also combine the two conditions (odd texture and end of texture) into a single condition like so
if ( 2 * (int(gl_FragCoord.y) + 1) == LastMipSize.y-1))
Please let me know if I’ve made a mistake.
Marcus
about 1 year ago
Hmm, looks odd, but it seems you’re right as the currently rendered texture has half the resolution of the previous. If I don’t miss anything, this would result in some popping artifacts.
I double checked the original implementation from AMD and, even though they use DX pixel shaders, they also do the exact same thing. Looks like both of us made it wrong.
I’ll double-check it when I’ll have time. Thanks for the feedback!