OpenGL 4.0 - Mountains demo

OpenGL 4.0 - Mountains demo

Hierarchical-Z is a well known and standard feature of modern GPUs that allows them to speed up depth testing by rejecting large group of incoming fragments using a reduced and compressed version of the depth buffer that resides in on-chip memory. The technique presented in this article uses the same basic idea to allow batched occlusion culling for large amount of individual objects using a geometry shader without the need of any CPU intervention that is unavoidable using traditional occlusion queries. The article also provides a reference implementation in the form of the OpenGL 4.0 Mountains demo that uses the technique for culling thousands of object instances.

Introduction

Occlusion culling is a visibility determination algorithm that is used to identify those objects that did reside in the view volume but still aren’t visible on the screen due to occlusion. That means they are hidden by such objects that reside closer to the camera.

For several generations now GPUs allow hardware accelerated methods to perform occlusion culling in the form of occlusion queries. OpenGL provides the functionality via the extension ARB_occlusion_query. Occlusion queries are very simple: when you draw an object with occlusion query enabled the query returns the number of samples that passed the depth test (or simply return true or false based on whether any samples of the objects passed the depth test or not as it is provided by the OpenGL extension ARB_occlusion_query2).

So actually performing occlusion culling using occlusion queries means simply the following:

  1. Draw the object while occlusion query is enabled.
  2. If the query result is that the object is visible then draw the object.

At first, this may sound stupid as you have to draw the object in order to tell whether it is visible or not. While in this form it really sounds silly, in practice occlusion query can save a lot of work for the GPU. Think about you have a complex object with several thousands of triangles. If you would like to determine the visibility of it using occlusion query you would simply render e.g. the bounding box of the object and if the bounding box is visible (occlusion query returns that some samples have passed) then it means the object itself is most probably visible. This way you can save the GPU from the unnecessary processing of large amount of geometry.

I have to mention here that I intentionally used the expression “most probably visible” as occlusion queries provide just a conservative estimate on whether the object is visible or not rather than an exact result. This is because the bounding box occupies a different (larger) portion of the screen than the original geometry. So what we expect from an occlusion culling algorithm is to give one of the following results: the object is not visible or the object is most probably visible. The bigger this probability is the better the occlusion culling effectiveness is.

While we would always want an occlusion culling algorithm to be as effective as possible usually we have to make a trade-off between effectiveness and efficiency. In the above example if we would like to have 100% effectiveness then we would have to draw the whole object and that would defeat most of the goals of occlusion culling. The algorithm presented in this article is somewhat even more conservative but enables the use of occlusion culling for much larger datasets.

Motivation

While hardware accelerated occlusion query is a powerful tool to use in visibility determination it puts a quite reasonable burden on the application to manage the occlusion queries and to draw the objects based on the results when they are available (taking in consideration the asynchronous nature of occlusion queries). The most naive use of occlusion queries would be to execute the query right before we have to draw the object. While this seems like a feasible idea, it does not perform well in practice as the CPU has to be stalled until the result of the query is available and that involves also empty cycles on the GPU as well thus results in unacceptable performance. In order to resolve this, the application has to fill the time between the query execution and the drawing of the object based on the query result. While there are techniques to accomplish this, it definitely comes at a cost as the implementation becomes more complex.

The aforementioned problem is somewhat resolved by using conditional rendering introduced in OpenGL 3 (NV_conditional_render extension). However, this extension does nothing just in case the results of the query are not available yet then we simply draw the object no matter if it is visible or not. This can avoid the stalling of the rendering pipeline and can be done in software if the extension is not available, however, it somewhat defeats the purpose of occlusion culling.

Another deficit when using occlusion queries is that there is still need for CPU intervention in order to make a decision about the visibility of the object. For today’s hardware where proper batching is one of the most crucial aspects of the renderer such an approach is rather ineffective.

The occlusion culling technique presented in this article solves both these issues by providing an implementation that is very simple to integrate into any renderer, does put little to no burden on the renderer and makes decision about the visibility of objects entirely on the GPU.

The algorithm

As in case of many other GPU based culling algorithm presented by me and others, the hierarchical-Z map based occlusion culling uses the geometry shader’s ability to deny the emission of primitives that are determined to be invisible on the final rendering. The shader will only emit data for those objects that are visible and this data is streamed out into a buffer object using transform feedback.

The algorithm itself is similar in spirit to the hierarchical Z testing that is implemented in modern GPUs. After rendering all the occluders in the scene, we construct a hierarchical depth image from the depth buffer which we will refer to as the Hi-Z map. This texture map is a mip-mapped, screen resolution image where each texel in mip level i contains the maximum depth of all corresponding texels in mip level i-1. This depth information can be collected during the main rendering pass for the occluding objects as we need a texture of the same resolution so we don’t need a separate depth pass. This can be simply accomplished using OpenGL framebuffer objects.

After the construction of the Hi-Z map, occlusion culling can be performed by comparing depth value of the object’s bounding volume and the depth information stored in the Hi-Z map. This is when the hierarchical mip-mapped structure of the Hi-Z map comes handy as we can do conservative depth comparisons with less texture fetches by sampling directly from a particular mip level.

This is why we constructed the Hi-Z map using a “store maximum depth” policy. This will work with a usual depth buffer setup where the depth comparison function is either GREATER or GEQUAL. For a reverse directed depth buffer the “store minimum depth” policy has to be used.

Hi-Z map construction

In case of single-sample rendering, one can use the Hi-Z map as the main depth buffer for rendering the scene. The technique extends also to multi-sampled rendering but in this case a separate full-screen quad pass is needed to calculate the maximum depth of each individual sample in the multi-sampled depth buffer and store it in the single-sampled Hi-Z map. This is possible since OpenGL 3.2 or using the extension ARB_texture_multisample. Besides this additional step, the algorithm remains the same.

The Hi-Z map can be constructed using OpenGL framebuffer objects by rendering a full-screen quad pass for each mip level where the previous mip level is bound as the input texture and the current mip level is bound as render target. As OpenGL does allow rendering from and to the same texture object as far as we don’t access the same mip level for both reading and writing, the algorithm simply looks like the following:

// bind depth texture
glBindTexture(GL_TEXTURE_2D, depthTexture);
// calculate the number of mipmap levels for NPOT texture
int numLevels = 1 + (int)floorf(log2f(fmaxf(SCREEN_WIDTH, SCREEN_HEIGHT)));
int currentWidth = SCREEN_WIDTH;
int currentHeight = SCREEN_HEIGHT;
for (int i=1; i<numLevels; i++) {
  // calculate next viewport size
  currentWidth /= 2;
  currentHeight /= 2;
  // ensure that the viewport size is always at least 1x1
  currentWidth = currentWidth > 0 ? currentWidth : 1;
  currentHeight = currentHeight > 0 ? currentHeight : 1;
  glViewport(0, 0, currentWidth, currentHeight);
  // bind next level for rendering but first restrict fetches only to previous level
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, i-1);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, i-1);
  glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                         GL_TEXTURE_2D, depthTexture, i);
  // draw full-screen quad
  ............
}
// reset mipmap level range for the depth image
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, 0);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, numLevels-1);

It is very important not to forget about the step when we ensure that the viewport size is always at least 1×1 as in case of non-power-of-two (NPOT) textures due to rounding problems. I forgot this first and I was wondering an hour why my last mip level didn’t get filled.

While one may wonder how this technique can be efficient after so many full-screen quad passes, it is in fact very efficient and it constructs the Hi-Z map on my Radeon HD5770 in less than 0.2 milliseconds. The measurement should be quite accurate as I’ve done it using OpenGL timer queries (see the extension ARB_timer_query).

The fragment shader used for the construction of the Hi-Z map is very straightforward except one thing. We use an NPOT depth texture due to the aspect ratio of the window and as NPOT textures use a “floor” convention to determine the size of subsequent mip levels (see the extension ARB_texture_non_power_of_two) we need predicated fetches as in case of reduction from odd-sized mip levels we should not forgot about the edge texels:

#version 400 core

uniform sampler2D LastMip;
uniform ivec2 LastMipSize;

in vec2 TexCoord;

void main(void)
{
  vec4 texels;
  texels.x = texture( LastMip, TexCoord ).x;
  texels.y = textureOffset( LastMip, TexCoord, ivec2(-1, 0) ).x;
  texels.z = textureOffset( LastMip, TexCoord, ivec2(-1,-1) ).x;
  texels.w = textureOffset( LastMip, TexCoord, ivec2( 0,-1) ).x;

  float maxZ = max( max( texels.x, texels.y ), max( texels.z, texels.w ) );

  vec3 extra;
  // if we are reducing an odd-width texture then fetch the edge texels
  if ( ( (LastMipSize.x & 1) != 0 ) && ( int(gl_FragCoord.x) == LastMipSize.x-3 ) ) {
    // if both edges are odd, fetch the top-left corner texel
    if ( ( (LastMipSize.y & 1) != 0 ) && ( int(gl_FragCoord.y) == LastMipSize.y-3 ) ) {
      extra.z = textureOffset( LastMip, TexCoord, ivec2( 1, 1) ).x;
      maxZ = max( maxZ, extra.z );
    }
    extra.x = textureOffset( LastMip, TexCoord, ivec2( 1, 0) ).x;
    extra.y = textureOffset( LastMip, TexCoord, ivec2( 1,-1) ).x;
    maxZ = max( maxZ, max( extra.x, extra.y ) );
  } else
  // if we are reducing an odd-height texture then fetch the edge texels
  if ( ( (LastMipSize.y & 1) != 0 ) && ( int(gl_FragCoord.y) == LastMipSize.y-3 ) ) {
    extra.x = textureOffset( LastMip, TexCoord, ivec2( 0, 1) ).x;
    extra.y = textureOffset( LastMip, TexCoord, ivec2(-1, 1) ).x;
    maxZ = max( maxZ, max( extra.x, extra.y ) );
  }

  gl_FragDepth = maxZ;
}

I was experimenting with using texture gather lookups to reduce the number of texture fetches from 4-to-7 fetches per fragment down to 1-to-3 fetches per fragment (see the extension ARB_texture_gather) it seems that texture gather works only if the image is linearly sampled and to avoid the additional burden involved by switching filtering state during rendering I stuck to simple texture lookups as using texture gather lookups did not show any visible effect on the construction time of the Hi-Z map.

Various mip levels of the Hi-Z map

Various mip levels of the Hi-Z map. The Hi-Z map size is 1024x768 and the displayed mip levels are: level 4 (left), level 5 (middle) and level 6 (right).

For debugging and demonstration purposes the Mountains demo has built-in function to display the content of the various mip levels of the Hi-Z map. This is available by pressing the F4 key while Hi-Z map based occlusion culling is enabled. The + and – keys can be used to switch between the mip levels.

In order to better visualize the depth information in the depth buffer I converted the non-linear depth values stored in the depth texture into linear depth values as presented in [GeeXLab] How to Visualize the Depth Buffer in GLSL.

Culling with the Hi-Z map

Once we have constructed the Hi-Z map, we can perform the actual occlusion culling by fetching the 2×2 texel neighborhood corresponding to the screen area occupied by the bounding volume of the object whose visibility has to be determined. In the demo I used bounding boxes but any other bounding volume can be used (e.g. a bounding sphere is usually accurate enough for this technique).

First, we have to calculate the clip space bounding rectangle of the bounding volume. In the bounding box case this is done by transforming the bounding box vertices into clip space and then calculate the minimum and maximum X and Y coordinates. This bounding rectangle will be used for two things: it defines the texture coordinates that we’ll have to use for the Hi-Z map lookup and it helps determining the appropriate LOD for the texture lookup.

In order to determine the texture LOD that we’ll have to fetch we have to calculate the screen space size of the bounding square corresponding to the clip space bounding rectangle determined previously. This can be simply done by calculating the width and height of the bounding rectangle in clip space and then transforming this into screen space:

float ViewSizeX = (BoundingRect[1].x-BoundingRect[0].x) * Transform.Viewport.y;
float ViewSizeY = (BoundingRect[1].y-BoundingRect[0].y) * Transform.Viewport.z;

After this, the texture LOD can be simply calculated using the following formula:

float LOD = ceil( log2( max( ViewSizeX, ViewSizeY ) / 2.0 ) );

Finally, as we have the texture coordinates (the vertices of the clip space bounding rectangle) and the texture LOD, we simply have to make four texture lookups into the Hi-Z map using these parameters, calculate the maximum of the four depth values returned and compare it to the depth value corresponding to the object (this is the object’s front-most point’s depth value that comes also from the clip space coordinates of the bounding box). If the object depth is greater than the reference depth the object is occluded and so it is culled by the geometry shader as usual.

One may ask why we use a 2×2 texel footprint for calculating the reference depth value why not just fetch the next mip level only once (as there we also get the maximum values of a 2×2 texel footprint due to the Hi-Z map construction method). That’s what I’ve also asked myself at first sight but quickly figured out the reason (see the figure below).

Comparison of four texel fetches and one texel fetch for depth comparison

Comparison of number of fetches used for occlusion culling. Both figures show the magnified screen coverage of a single Hi-Z map texel at mip level N, texel coverage for mip level N-1 is in cyan and texel coverage for mip level N-2 is in blue. Object is show as red and yellow indicates the fetched texels.

In case of four texels not just the determination of the texture LOD is much easier but also it better encompasses the actual object bounding rectangle. In case of one texture fetch the computation of texture LOD is more complicated and expensive but the main problem is that a larger LOD has to be fetched and it is not always the LOD determined in the case of four fetches plus one. In the most extreme situation (if the bounding rectangle is right at the middle of the screen) it is possible that we have to fetch the largest LOD. This does not result in any false culling but it severely degrades the effectiveness of the culling.

Of course, it is possible to use more complex screen space bounding polygon as well as more fetches but those would increase the effectiveness of the culling much less than the additional burden and expensive operations worth.

Conclusion

We’ve seen how traditional hardware occlusion culling works by using occlusion queries. We also discussed that we sometimes need a better algorithm that does the occlusion culling for large amount of objects without CPU intervention.

The article also described a way to implement such an occlusion culling algorithm by using a hierarchical-Z map and geometry shaders. We’ve also managed to provide a reference implementation in the form of the demo called Mountains that can be downloaded with full source code in the downloads section.

The algorithm performs very well in practice on current hardware. The Hi-Z map construction takes less than 0.2 milliseconds and the actual culling comes at almost no cost for even thousands of objects. For more detail about performance comparison between rendering with and without hierarchical-Z map based occlusion culling read the article about the OpenGL 4.0 Mountains Demo.

While the demo uses the technique only for culling instances of the same object, the technique can be easily extended to work for heterogeneous set of objects as the actual culling algorithm works on a per-object basis and is completely indifferent regarding to the method used for rendering the actual geometry.

This technique can be thought of as the next step towards a completely GPU based visibility determination and scene management system.

Acknowledgements go to Jeremy Shopf, Joshua Barczak, Christopher Oat and Natalya Tatarchuk and their SIGGRAPH 2008 Course Notes that inspired this work.