<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>RasterGrid Blog &#187; Samples</title>
	<atom:link href="http://rastergrid.com/blog/category/programming/samples/feed/" rel="self" type="application/rss+xml" />
	<link>http://rastergrid.com/blog</link>
	<description>A technical blog from Daniel Rákos (aka aqnuep)</description>
	<lastBuildDate>Fri, 24 Feb 2012 03:23:41 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Frei-Chen edge detector</title>
		<link>http://rastergrid.com/blog/2011/01/frei-chen-edge-detector/</link>
		<comments>http://rastergrid.com/blog/2011/01/frei-chen-edge-detector/#comments</comments>
		<pubDate>Sun, 30 Jan 2011 15:27:43 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[detection]]></category>
		<category><![CDATA[edge]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenGL]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=532</guid>
		<description><![CDATA[In this article, I would like to present you an edge detection algorithm that shares similar performance characteristics like the well-known Sobel operator but provides slightly better edge detection and can be seamlessly extended with little to no performance overhead to also detect corners alongside with edges. The algorithm works on a 3&#215;3 texel footprint]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 160px"><img title="Frei-Chen edge detector" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/frei-chen.png" alt="Frei-Chen edge detector" width="150" height="150" /><p class="wp-caption-text">Frei-Chen edge detector</p></div>
<p>In this article, I would like to present an edge detection algorithm that has performance characteristics similar to those of the well-known Sobel operator but provides slightly better edge detection and can be extended, with little to no performance overhead, to detect corners as well as edges. Like the Sobel filter, the algorithm works on a 3&#215;3 texel footprint, but it applies a total of nine convolution masks to the image that can be used for either edge or corner detection. The article presents the mathematical background needed to implement the edge detector and provides a reference implementation written in C/C++ using OpenGL that showcases both the Frei-Chen and the Sobel edge detection filters applied to the same image.</p>
<p><span id="more-532"></span>I came across the algorithm during my computer graphics studies, when one of my homework assignments was to implement the Frei-Chen edge detector. As I mentioned in an earlier post, I am happy to provide source code for more basic graphics algorithms after seeing the success of <a title="Efficient Gaussian blur with linear sampling" href="http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/">my former post</a> about the Gaussian blur filter. This is a similarly basic article: it only shows how to apply a particular convolution filter based algorithm to a still image, while the possibilities this edge detection algorithm opens up are a more complex topic that is out of the scope of this article.</p>
<p>As the provided reference implementation also showcases applying the Sobel operator to an image, I would like to present that first and then continue with the Frei-Chen mask set. Those who are already familiar with edge detection and the Sobel operator can skip the following two sections.</p>
<h2>Edge detection</h2>
<p>Before getting deep into how to implement edge detectors, let&#8217;s first talk about what an edge detector is and why we need it.</p>
<p>In general, edge detection is one of the most fundamental image processing tools, particularly used in the areas of feature detection and feature extraction. The aim of the technique is to identify points of a digital image at which the intensity changes sharply. These intensity changes can be caused by discontinuities in depth or surface orientation, changes in lighting conditions, and many other factors. In the ideal case, applying an edge detector to an image yields a set of connected lines or curves that indicate the boundaries of objects.</p>
<p>Not going that far, what an edge detector gives us in the first place is a gray-scale image where each pixel intensity approximates the likelihood that the pixel belongs to an object boundary. How well a particular algorithm can detect such pixels depends on many factors, and it is usually better to try multiple edge detectors in order to choose the one that best fits the particular use case.</p>
<p>Once we have this gray-scale image, we usually have to define a threshold value that is used as the acceptance criterion for edge pixels. If the previously calculated intensity value is above this threshold, we accept the pixel as an edge; otherwise we don&#8217;t. This is the so-called binarization stage. Additionally, subsequent image processing algorithms can be used to further interpret the edge image.</p>
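<p>As a rough illustration of the binarization stage, here is a minimal CPU-side C++ sketch. The function name <code>binarizeEdges</code> and the 0/255 output convention are assumptions made for this example; they are not taken from the reference implementation.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Binarize an edge-likelihood image: 255 marks an accepted edge pixel, 0 everything else.
// `edgeStrength` holds one value per pixel (hypothetical layout); `threshold` is the
// application-chosen acceptance criterion.
std::vector<std::uint8_t> binarizeEdges(const std::vector<float>& edgeStrength, float threshold) {
    assert(threshold >= 0.0f);
    std::vector<std::uint8_t> mask(edgeStrength.size());
    for (std::size_t i = 0; i < edgeStrength.size(); ++i) {
        // A pixel is accepted only if its value is strictly above the threshold.
        mask[i] = edgeStrength[i] > threshold ? 255 : 0;
    }
    return mask;
}
```

<p>Calling <code>binarizeEdges(values, 0.5f)</code> on the output of any edge detector would give a mask that later stages can interpret further.</p>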
<p>In computer graphics, edge detection is usually used to implement various image decoration algorithms. Maybe the most popular applications of edge detectors nowadays are non-photorealistic rendering (NPR) and screen-space anti-aliasing techniques.</p>
<h2>Sobel filter</h2>
<p>The Sobel edge detection filter works on a 3&#215;3 texel footprint and applies two convolution masks to the image that are intended to detect its horizontal and vertical gradients. The filter weights are shown in the figure below:</p>
<p style="text-align: center;"><img class="   aligncenter" title="Sobel masks" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/sobel-masks.png" alt="Sobel masks" width="457" height="119" /></p>
<p>These masks are applied to the intensities gathered from the 3&#215;3 footprint of the image, and the two results are then combined to produce the final gradient value in the following way:</p>
<p style="text-align: center;"><img class="aligncenter" title="Sobel gradient" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/sobel-grad.png" alt="Sobel gradient" width="321" height="84" /></p>
<p>The actual algorithm can be seen in the accompanying demo, which provides a GLSL based implementation. The algorithm is defined to work on a single-channel image; however, it can easily be extended to a usual three-channel RGB image, either by applying it separately to each channel or by first calculating a gray-scale value from the color components. The former is more computationally intensive but usually provides better results when the threshold criterion is defined so that a pixel is accepted as a boundary point if the gradient value is larger than the threshold for any of the color channels. The reference implementation, however, is based on the latter approach for the sake of simplicity, so for each pixel an intensity value is first calculated simply by taking the length of the vector comprised of the RGB components.</p>
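<p>For readers who prefer code to figures, the following is a minimal CPU-side C++ sketch of the Sobel step for one 3&#215;3 footprint. It assumes the textbook Sobel weights and the usual square-root-of-squares combination of the two mask responses; sign and orientation conventions vary between sources, and the names <code>intensity</code> and <code>sobelGradient</code> are made up for this example rather than taken from the demo.</p>

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Intensity of a texel as the length of its RGB vector.
float intensity(float r, float g, float b) {
    assert(r >= 0.0f && g >= 0.0f && b >= 0.0f);
    return std::sqrt(r * r + g * g + b * b);
}

// Sobel gradient magnitude for a 3x3 footprint of intensities stored row by row.
float sobelGradient(const std::array<float, 9>& I) {
    // Textbook Sobel weights (assumed here; sign conventions differ between sources).
    static const std::array<float, 9> kx = {-1.0f, 0.0f, 1.0f,
                                            -2.0f, 0.0f, 2.0f,
                                            -1.0f, 0.0f, 1.0f};
    static const std::array<float, 9> ky = {-1.0f, -2.0f, -1.0f,
                                             0.0f,  0.0f,  0.0f,
                                             1.0f,  2.0f,  1.0f};
    float gx = 0.0f;
    float gy = 0.0f;
    for (std::size_t i = 0; i < 9; ++i) {
        gx += kx[i] * I[i];
        gy += ky[i] * I[i];
    }
    // Combine the two mask responses into one gradient value.
    return std::sqrt(gx * gx + gy * gy);
}
```

<p>A uniform footprint yields a gradient of zero, while a footprint with a sharp vertical boundary yields a large value that can then be compared against the threshold.</p>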
<h2>Frei-Chen filter</h2>
<p>The Frei-Chen edge detector also works on a 3&#215;3 texel footprint but applies a total of nine convolution masks to the image. Together, the nine Frei-Chen masks form a complete set of basis vectors for the 3&#215;3 footprint. This implies that any 3&#215;3 image area can be represented as a weighted sum of the nine Frei-Chen masks shown below:</p>
<p style="text-align: center;"><img class="aligncenter" title="Frei-Chen masks" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/frei-chen-masks.png" alt="Frei-Chen masks" width="650" height="237" /></p>
<p>The first four Frei-Chen masks above are used for edges, the next four are used for lines and the last mask is used to compute averages. For edge detection, the appropriate masks are chosen and the image is projected onto them. The projection equation is given below:</p>
<p style="text-align: center;"><img class="aligncenter" title="Frei-Chen equation" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/frei-chen-eq.png" alt="Frei-Chen equation" width="631" height="108" /></p>
<p>When using the Frei-Chen masks for edge detection, we are looking for the cosine defined above, and we use the first four masks as the elements of importance, so the first sum above goes from one to four.</p>
<p>Applying a threshold and applying the filter to multi-channel images work exactly the same way as for the Sobel filter. Similarly, the reference implementation applies the filter to the image as if it were a single-channel image, first calculating the intensity value for each texel in the same fashion as with the previously presented filter.</p>
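<p>The projection can be sketched on the CPU as follows. This is a minimal C++ sketch, assuming the commonly published Frei-Chen mask weights and the usual form of the measure, sqrt(M / S), where M sums the squared projections onto the four edge masks and S sums the squared projections onto all nine. The names <code>Patch</code>, <code>project</code> and <code>freiChenEdge</code> are hypothetical and not taken from the demo.</p>

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Patch = std::array<float, 9>;  // 3x3 footprint of intensities, row by row

// Projection of the footprint onto one mask (plain dot product).
float project(const Patch& mask, const Patch& I) {
    float p = 0.0f;
    for (std::size_t i = 0; i < 9; ++i) p += mask[i] * I[i];
    return p;
}

// Edge measure sqrt(M / S); returns 0 for an all-black footprint to avoid dividing by zero.
float freiChenEdge(const Patch& I) {
    const float z = 0.0f;
    const float a = 1.0f / (2.0f * std::sqrt(2.0f));  // scale of masks 1-4
    const float h = 0.5f;                             // sqrt(2) * a
    const float b = 0.5f;                             // scale of masks 5-6
    const float c = 1.0f / 6.0f;                      // scale of masks 7-8
    const float d = 1.0f / 3.0f;                      // scale of mask 9
    // Commonly published Frei-Chen weights (assumed): 4 edge, 4 line, 1 average mask.
    const std::array<Patch, 9> G = {{
        {a, h, a, z, z, z, -a, -h, -a},
        {a, z, -a, h, z, -h, a, z, -a},
        {z, -a, h, a, z, -a, -h, a, z},
        {h, -a, z, -a, z, a, z, a, -h},
        {z, b, z, -b, z, -b, z, b, z},
        {-b, z, b, z, z, z, b, z, -b},
        {c, -2.0f * c, c, -2.0f * c, 4.0f * c, -2.0f * c, c, -2.0f * c, c},
        {-2.0f * c, c, -2.0f * c, c, 4.0f * c, c, -2.0f * c, c, -2.0f * c},
        {d, d, d, d, d, d, d, d, d},
    }};
    float M = 0.0f;  // energy in the edge subspace (first four masks)
    float S = 0.0f;  // total energy over all nine masks
    for (std::size_t k = 0; k < G.size(); ++k) {
        const float p = project(G[k], I);
        S += p * p;
        if (k < 4) M += p * p;
    }
    assert(M <= S);
    return S > 0.0f ? std::sqrt(M / S) : 0.0f;
}
```

<p>Because the result is a ratio, it stays between 0 and 1 regardless of the overall brightness of the footprint, which is one way to read the normalization discussed in the comparison below.</p>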
<h2>Comparison</h2>
<p>Based on my experience, the Frei-Chen edge detector looks better than the Sobel filter as it is less sensitive to noise and is able to detect edges that have small gradients and thus are not found by the basic Sobel filter. For a comparison, you can check the figure below:</p>
<div class="wp-caption aligncenter" style="width: 610px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/ed-comparison.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2011/01/ed-comparison.png?referer=');"><img title="Comparison of edge detectors" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/ed-comparison-thumb.png" alt="Comparison of edge detectors" width="600" height="200" /></a><p class="wp-caption-text">Comparison of edge detectors: original image (left), Sobel filter (middle), Frei-Chen filter (right).</p></div>
<p>The Frei-Chen edge detector seems to work better because its construction includes a normalization factor as well as other factors that are meant to exclude all features except edges. A normalization factor can also be added to the Sobel filter by having a third mask that is equivalent to the ninth Frei-Chen mask and is used to normalize the gradients. This could help in reducing the number of undetected edges and the amount of noise that arises from the fact that the Sobel filter calculates absolute gradients rather than relative ones.</p>
<p>From a performance point of view, the Frei-Chen edge detector is much more heavyweight as it uses nine masks instead of two. In practice, however, the performance difference between the two is much smaller, considering that both use the same sized texel footprint and that the computational performance of today&#8217;s GPUs is usually much higher than their texture fetching performance.</p>
<h2>Conclusion</h2>
<p>We presented the Frei-Chen edge detector as an alternative to the Sobel filter that provides better edge detection quality with little performance impact compared to the Sobel operator. As there is little to no difference in how the input data has to be organized and how the result is output, the Frei-Chen edge detector can easily be used as a drop-in replacement in implementations that previously used the Sobel filter.</p>
<p><strong>Source code</strong> and <strong>Win32 binary</strong> can be acquired in the <a title="Frei-Chen Edge Detector" href="http://rastergrid.com/blog/downloads/frei-chen-edge-detector/">downloads section</a>.</p>
<p>I would like to encourage those who read this article to add the Frei-Chen edge detector to their software and compare whether it yields better results than the Sobel filter in applications that rely on the output of the edge detection filter. I would be interested to hear how the filter works in real-life computer graphics scenarios.</p>
<p>Thanks in advance and hope you enjoyed the article!</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2011/01/frei-chen-edge-detector/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>GPU based dynamic geometry LOD</title>
		<link>http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/</link>
		<comments>http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/#comments</comments>
		<pubDate>Mon, 25 Oct 2010 19:35:13 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LOD]]></category>
		<category><![CDATA[occlusion culling]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[tessellation]]></category>
		<category><![CDATA[vertex buffer]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=428</guid>
		<description><![CDATA[Dynamic geometry level-of-detail (LOD) algorithms are very popular and powerful algorithms that provide a great level of rendering performance optimization while preserving detail by using less detailed geometry for objects that are far away, too small or otherwise less significant in the quality of the final rendering. Many of these are used since the very]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 210px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/10/mountains.png"><img class="  " title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-thumb.png" alt="OpenGL 4.0 - Mountains demo" width="200" height="150" /></a><p class="wp-caption-text">OpenGL 4.0 - Mountains demo</p></div>
<p>Dynamic geometry level-of-detail (LOD) algorithms are very popular and powerful algorithms that provide a great level of rendering performance optimization while preserving detail by using less detailed geometry for objects that are far away, too small or otherwise less significant in the quality of the final rendering. Many of these have been used since the very beginning of computer graphics and are present in some form in current CAD software, video games and other graphics applications. While determining the appropriate geometry LOD was previously the task of the CPU, with today&#8217;s hardware it is possible to also offload this to the GPU, which excels at handling large numbers of objects in parallel.<br />
<span id="more-428"></span></p>
<h2>Introduction</h2>
<p>With the advent of Shader Model 5.0 GPUs and the appearance of programmable tessellation hardware it may seem like the geometry LOD problem is solved once and for all. However, in many cases tessellation alone is not enough, as for far away objects even a patch pass-through tessellation shader already produces more geometry than the added detail is worth. As a result, classic geometry LOD algorithms are still good to have in the developer&#8217;s toolbox. Not to mention that all vendors recommend disabling tessellation shaders altogether if we don&#8217;t need any geometry amplification, as even a pass-through tessellation shader has its overhead.</p>
<p>This means that there still has to be a conventional rendering path for geometries that should not be tessellated. So why not try offloading the geometry LOD determination to the GPU if possible?</p>
<p>This article describes a technique that was previously shown in AMD&#8217;s <a title="March of the Froblins" href="http://developer.amd.com/samples/demos/pages/froblins.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/samples/demos/pages/froblins.aspx?referer=');">March of the Froblins</a> demo and in NVIDIA&#8217;s <a title="NVIDIA DX10 Samples" href="http://developer.download.nvidia.com/SDK/10/direct3d/samples.html" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.download.nvidia.com/SDK/10/direct3d/samples.html?referer=');">Skinned Instancing</a> demo. It allows GPU based dynamic geometry LOD determination using a geometry shader that selects the most appropriate LOD from a group of geometry LODs based on the object&#8217;s distance from the camera. While this article and the reference implementation (<a title="OpenGL 4.0 - Mountains demo released" href="http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/">OpenGL 4.0 &#8211; Mountains demo</a>) present the application of the technique only for instanced geometry, the same method can easily be extended to support heterogeneous objects by taking advantage of the latest functionality introduced in OpenGL 4.</p>
<h2>The algorithm</h2>
<p>The technique is based on the geometry shader&#8217;s ability to emit or deny the emission of primitives into a transform feedback buffer, as done in the mentioned DX based implementations. One major improvement compared to earlier approaches is that the LOD determination is done in a single pass rather than requiring a separate pass for each geometry LOD. Additionally, this LOD determination pass can also be merged with other visibility determination passes like <a title="Instance culling using geometry shaders" href="http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/">Instance Cloud Reduction</a> or <a title="Hierarchical-Z map based occlusion culling" href="http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/">Hierarchical-Z map based occlusion culling</a>, as is done in the reference implementation. This was made possible by the latest transform feedback capabilities introduced in OpenGL 4.0 (see the extension <a title="GL_ARB_transform_feedback3" href="http://www.opengl.org/registry/specs/ARB/transform_feedback3.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback3.txt?referer=');">ARB_transform_feedback3</a>), which enable the geometry shader to output data to separate primitive streams.</p>
<div class="wp-caption aligncenter" style="width: 660px"><img class="    " title="Culling and dynamic LOD in the March of the Froblins demo" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/froblin-lod.png" alt="Culling and dynamic LOD in the March of the Froblins demo" width="650" height="340" /><p class="wp-caption-text">Flow-chart presenting the culling and dynamic LOD algorithms used in AMD&#39;s March of the Froblins demo. The implementation needs five passes for culling and separating three detail levels and performs two asynchronous queries meanwhile. Requires OpenGL 3 compliant hardware.</p></div>
<div class="wp-caption aligncenter" style="width: 660px"><img title="Culling and dynamic LOD in the Mountains demo" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-lod.png" alt="Culling and dynamic LOD in the Mountains demo" width="650" height="281" /><p class="wp-caption-text">Flow-chart presenting the culling and dynamic LOD algorithm used in our Mountains demo. The implementation requires only one pass for culling and separating three detail levels without the need to use asynchronous queries. Requires OpenGL 4 compliant hardware.</p></div>
<p>The algorithm itself is very simple and straightforward. For each object instance, determine the appropriate geometry LOD based on its distance from the camera and the LOD distances passed as a uniform to the shader. After this, output the instance&#8217;s data to the output stream whose ID corresponds to the determined LOD&#8217;s index. Here you can see a GLSL implementation of the algorithm:</p>
<pre class="brush:c">#version 400 core

uniform mat4 ModelViewMatrix;
uniform vec2 LodDistance;

layout(points) in;
layout(points, max_vertices = 1) out;

in vec3 InstancePosition[1];

layout(stream=0) out vec3 InstPosLOD0;
layout(stream=1) out vec3 InstPosLOD1;
layout(stream=2) out vec3 InstPosLOD2;

void main() {
  // Use only xyz: the w component of the view-space position would otherwise skew the length.
  float dist = length((ModelViewMatrix * vec4(InstancePosition[0], 1.0)).xyz);
  if ( dist &lt; LodDistance.x ) {
    InstPosLOD0 = InstancePosition[0];
    EmitStreamVertex(0);
  } else
  if ( dist &lt; LodDistance.y ) {
    InstPosLOD1 = InstancePosition[0];
    EmitStreamVertex(1);
  } else {
    InstPosLOD2 = InstancePosition[0];
    EmitStreamVertex(2);
  }
}</pre>
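<p>To make the selection logic easier to test outside of a shader, here is a minimal CPU-side C++ sketch of the same binning. It is only an analogue of the geometry shader above: the <code>Vec3</code> type and the <code>selectLod</code> function are hypothetical, and the positions are assumed to be already transformed to view space by the caller.</p>

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Bins view-space instance positions into three lists, one per LOD, mirroring the
// three output streams of the shader. `lodDistance0` and `lodDistance1` play the
// role of LodDistance.x and LodDistance.y.
std::array<std::vector<Vec3>, 3> selectLod(const std::vector<Vec3>& viewSpacePositions,
                                           float lodDistance0, float lodDistance1) {
    assert(lodDistance0 <= lodDistance1);
    std::array<std::vector<Vec3>, 3> streams;
    for (const Vec3& p : viewSpacePositions) {
        // Distance from the camera, which sits at the origin in view space.
        const float dist = std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
        const std::size_t lod = dist < lodDistance0 ? 0 : (dist < lodDistance1 ? 1 : 2);
        streams[lod].push_back(p);
    }
    return streams;
}
```

<p>The size of each returned list corresponds to the per-stream primitive count that the GPU version obtains through queries.</p>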
<p>Additionally, the geometry LOD determination pass has to be executed with primitive queries enabled for all the relevant output streams to acquire the number of instances for each geometry LOD index:</p>
<pre class="brush:cpp">for (int i=0; i&lt;NUM_LOD; i++)
  glBeginQueryIndexed(GL_PRIMITIVES_GENERATED, i, lodQuery[i]);

glBeginTransformFeedback(GL_POINTS);
  glDrawArrays(GL_POINTS, 0, instanceCount);
glEndTransformFeedback();

for (int i=0; i&lt;NUM_LOD; i++)
  glEndQueryIndexed(GL_PRIMITIVES_GENERATED, i);</pre>
<p>Finally, the only thing left is to issue an instanced draw call for each geometry LOD index to draw all the instances:</p>
<pre class="brush:cpp">for (int i=0; i&lt;NUM_LOD; i++) {
  glGetQueryObjectiv(lodQuery[i], GL_QUERY_RESULT, &amp;instanceCountLOD[i]);
  if ( instanceCountLOD[i] &gt; 0 )
    glDrawElementsInstanced(..., instanceCountLOD[i]);
}</pre>
<p>That&#8217;s all, and what you get as a result is a fully GPU based geometry LOD selection algorithm.</p>
<h2>The Mountains demo</h2>
<p>The reference implementation is provided as part of the <a title="OpenGL 4.0 - Mountains demo" href="http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/">OpenGL 4.0 &#8211; Mountains demo</a>, which is available with full source code and a Windows executable in the <a title="Mountains Demo download" href="http://rastergrid.com/blog/downloads/mountains-demo/">downloads section</a>. The demo application implements the same visibility determination algorithms that were presented in the <a title="SIGGRAPH 2008 Course Notes about the March of the Froblins" href="http://developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf?referer=');">SIGGRAPH 2008 Course Notes</a>, together with the dynamic geometry LOD algorithm presented here, in a single pass.</p>
<p>Dynamic LOD can be enabled in the demo by using the F3 key. Once enabled, the demo separates the various geometry detail levels according to the configured LOD distances. As can be seen, there is almost no visible difference between the scene rendered with dynamic geometry LOD enabled and disabled. Also, by setting the LOD distances appropriately, the algorithm provides a seamless transition between subsequent geometry detail levels as the camera is moved.</p>
<table style="width: 100%;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #ffffff;" align="center">
<div class="wp-caption alignnone" style="width: 338px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/lod-comp.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/lod-comp.png?referer=');"><img title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/lod-comp-thumb.png" alt="Close-up view to compare image quality without and with dynamic LOD" width="328" height="160" /></a><p class="wp-caption-text">Close-up view of distant objects to compare the image quality without (left) and with (right) dynamic LOD.</p></div></td>
<td style="background-color: #ffffff;" align="center">
<p><div class="wp-caption alignnone" style="width: 223px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/visual-lod.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/visual-lod.png?referer=');"><img title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/visual-lod-thumb.png" alt="LOD visualization" width="213" height="160" /></a><p class="wp-caption-text">Geometry LOD visualization: LOD 0 (red), LOD 1 (green), LOD 2 (blue).</p></div></td>
</tr>
</tbody>
</table>
<p>When dynamic LOD is enabled, the demo also makes it possible to visualize the various geometry detail levels by pressing the F4 key. The highest detail LOD is marked in red, the mid-level in green and the lowest detail geometries in blue. It can be seen that as the camera moves the renderer automatically adjusts the detail of each individual instance.</p>
<p>Besides maintaining a constant quality without the viewer noticing any transitions between the various detail levels, the algorithm provides a huge performance gain in case of complex geometries, as can be seen in the figure below:</p>
<p><div class="wp-caption aligncenter" style="width: 654px"><img class="   " src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-fps.png" alt="Performance comparison of various culling and LOD techniques in frames per second on a Radeon HD5770 (higher is better)" width="644" height="224" /><p class="wp-caption-text">Performance comparison of the demo in frames per second on a Radeon HD5770 (higher is better): no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).</p></div>
<h2>Conclusion</h2>
<p>We&#8217;ve seen how straightforward it is to implement GPU based dynamic geometry LOD determination using geometry shaders on OpenGL 4.0 compliant hardware, and provided a reference implementation that uses the algorithm to efficiently determine detail levels for a large number of instanced geometries. We also briefly mentioned that the algorithm can be extended to handle arbitrary object sets. We touched on a possible OpenGL 3 based implementation but did not provide one as it requires several rendering passes to perform all the operations that can be implemented in a single pass on Shader Model 5.0 hardware.</p>
<p>Even though the algorithm is already extremely efficient, it still involves the use of asynchronous primitive queries that may induce some latency. Of course, this latency can be easily hidden by performing other operations on the CPU/GPU until the results are available.</p>
<p>Furthermore, taking full advantage of Shader Model 5.0 GPUs, it would be possible to eliminate the need for asynchronous queries by using atomic counters and indirect rendering. However, the core OpenGL specification does not yet expose such functionality, so this improvement is left for a future release of the demo.</p>
<p>Classic dynamic geometry LOD algorithms are still first class citizens of every rendering system, and even though the introduction of hardware tessellation somewhat subsumes the need for these classic techniques, practice shows that the best way to implement a full-fledged dynamic LOD system is by using geometry LOD selection and tessellation together rather than one instead of the other.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Hierarchical-Z map based occlusion culling</title>
		<link>http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/</link>
		<comments>http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/#comments</comments>
		<pubDate>Tue, 19 Oct 2010 19:13:32 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[depth buffer]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LOD]]></category>
		<category><![CDATA[mipmap]]></category>
		<category><![CDATA[occlusion culling]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[transform feedback]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=397</guid>
		<description><![CDATA[Hierarchical-Z is a well known and standard feature of modern GPUs that allows them to speed up depth testing by rejecting large group of incoming fragments using a reduced and compressed version of the depth buffer that resides in on-chip memory. The technique presented in this article uses the same basic idea to allow batched]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 210px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/10/mountains.png"><img class="  " title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-thumb.png" alt="OpenGL 4.0 - Mountains demo" width="200" height="150" /></a><p class="wp-caption-text">OpenGL 4.0 - Mountains demo</p></div>
<p>Hierarchical-Z is a well known and standard feature of modern GPUs that allows them to speed up depth testing by rejecting large groups of incoming fragments using a reduced and compressed version of the depth buffer that resides in on-chip memory. The technique presented in this article uses the same basic idea to allow batched occlusion culling for large numbers of individual objects using a geometry shader, without the CPU intervention that is unavoidable with traditional occlusion queries. The article also provides a reference implementation in the form of the OpenGL 4.0 Mountains demo that uses the technique for culling thousands of object instances.</p>
<p><span id="more-397"></span></p>
<h2>Introduction</h2>
<p>Occlusion culling is a visibility determination algorithm used to identify those objects that reside in the view volume but still aren&#8217;t visible on the screen due to occlusion. That means they are hidden by objects that are closer to the camera.</p>
<p>For several generations now, GPUs have offered hardware accelerated occlusion culling in the form of occlusion queries. OpenGL provides the functionality via the extension <a title="GL_ARB_occlusion_query" href="http://www.opengl.org/registry/specs/ARB/occlusion_query.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/occlusion_query.txt?referer=');">ARB_occlusion_query</a>. Occlusion queries are very simple: when you draw an object with an occlusion query active, the query returns the number of samples that passed the depth test (or simply returns true or false depending on whether any sample of the object passed the depth test, as provided by the OpenGL extension <a title="GL_ARB_occlusion_query2" href="http://www.opengl.org/registry/specs/ARB/occlusion_query2.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/occlusion_query2.txt?referer=');">ARB_occlusion_query2</a>).</p>
<p>So actually performing occlusion culling using occlusion queries means simply the following:</p>
<ol>
<li>Draw the object while occlusion query is enabled.</li>
<li>If the query result is that the object is visible then draw the object.</li>
</ol>
<p>At first, this may sound stupid as you have to draw the object in order to tell whether it is visible or not. While in this form it really sounds silly, in practice occlusion queries can save a lot of work for the GPU. Suppose you have a complex object with several thousand triangles. To determine its visibility using an occlusion query you would simply render e.g. the bounding box of the object, and if the bounding box is visible (the occlusion query reports that some samples have passed) then the object itself is most probably visible. This way you can save the GPU from unnecessarily processing a large amount of geometry.</p>
<p>I have to mention here that I intentionally used the expression &#8220;most probably visible&#8221; as occlusion queries provide just a conservative estimate on whether the object is visible or not rather than an exact result. This is because the bounding box occupies a different (larger) portion of the screen than the original geometry. So what we expect from an occlusion culling algorithm is to give one of the following results: the object is not visible or the object is most probably visible. The bigger this probability is the better the occlusion culling effectiveness is.</p>
<p>While we would always want an occlusion culling algorithm to be as effective as possible usually we have to make a trade-off between effectiveness and efficiency. In the above example if we would like to have 100% effectiveness then we would have to draw the whole object and that would defeat most of the goals of occlusion culling. The algorithm presented in this article is somewhat even more conservative but enables the use of occlusion culling for much larger datasets.</p>
<h2>Motivation</h2>
<p>While hardware accelerated occlusion query is a powerful tool for visibility determination, it puts a considerable burden on the application, which has to manage the occlusion queries and draw the objects based on the results once they are available (taking into consideration the asynchronous nature of occlusion queries). The most naive use of occlusion queries would be to execute the query right before we have to draw the object. While this seems like a feasible idea, it does not perform well in practice: the CPU has to stall until the result of the query is available, which also leaves the GPU idle and thus results in unacceptable performance. In order to resolve this, the application has to fill the time between the query execution and the drawing of the object based on the query result. While there are techniques to accomplish this, it definitely comes at a cost as the implementation becomes more complex.</p>
<p>The aforementioned problem is somewhat mitigated by conditional rendering, introduced in OpenGL 3 (<a title="GL_NV_conditional_render" href="http://www.opengl.org/registry/specs/NV/conditional_render.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/conditional_render.txt?referer=');">NV_conditional_render</a> extension). However, all this extension does is that when the results of the query are not yet available, the object is simply drawn whether it is visible or not. This avoids stalling the rendering pipeline and can be done in software if the extension is not available; however, it somewhat defeats the purpose of occlusion culling.</p>
<p>Another drawback of occlusion queries is that CPU intervention is still needed in order to make a decision about the visibility of the object. For today&#8217;s hardware, where proper batching is one of the most crucial aspects of the renderer, such an approach is rather inefficient.</p>
<p>The occlusion culling technique presented in this article solves both of these issues by providing an implementation that is very simple to integrate into any renderer, puts little to no burden on the renderer and makes decisions about the visibility of objects entirely on the GPU.</p>
<h2>The algorithm</h2>
<p>As in the case of many other GPU based culling algorithms presented by me and others, hierarchical-Z map based occlusion culling uses the geometry shader&#8217;s ability to suppress the emission of primitives that are determined to be invisible in the final rendering. The shader only emits data for those objects that are visible, and this data is streamed out into a buffer object using transform feedback.</p>
<p>The algorithm itself is similar in spirit to the hierarchical Z testing that is implemented in modern GPUs. After rendering all the occluders in the scene, we construct a hierarchical depth image from the depth buffer which we will refer to as the Hi-Z map. This texture map is a mip-mapped, screen resolution image where each texel in mip level <em>i</em> contains the maximum depth of all corresponding texels in mip level <em>i-1</em>. This depth information can be collected during the main rendering pass for the occluding objects as we need a texture of the same resolution so we don&#8217;t need a separate depth pass. This can be simply accomplished using OpenGL framebuffer objects.</p>
<p>After the construction of the Hi-Z map, occlusion culling can be performed by comparing the depth value of the object&#8217;s bounding volume with the depth information stored in the Hi-Z map. This is where the hierarchical mip-mapped structure of the Hi-Z map comes in handy, as we can do conservative depth comparisons with fewer texture fetches by sampling directly from a particular mip level.</p>
<p>This is why we constructed the Hi-Z map using a &#8220;store maximum depth&#8221; policy: the value fetched from a coarser mip level is never closer than any of the depth values it summarizes, so the comparison stays conservative. This works with the usual depth buffer setup where the depth comparison function is either LESS or LEQUAL. For a reversed depth buffer (GREATER or GEQUAL) the &#8220;store minimum depth&#8221; policy has to be used.</p>
<h3>Hi-Z map construction</h3>
<p>In case of single-sample rendering, one can use the Hi-Z map as the main depth buffer for rendering the scene. The technique also extends to multi-sampled rendering, but in this case a separate full-screen quad pass is needed to calculate, for each pixel, the maximum depth over its individual samples in the multi-sampled depth buffer and store it in the single-sampled Hi-Z map. This is possible since OpenGL 3.2 or using the extension <a title="GL_ARB_texture_multisample" href="http://www.opengl.org/registry/specs/ARB/texture_multisample.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_multisample.txt?referer=');">ARB_texture_multisample</a>. Besides this additional step, the algorithm remains the same.</p>
<p>The Hi-Z map can be constructed using OpenGL framebuffer objects by rendering a full-screen quad pass for each mip level, where the previous mip level is bound as the input texture and the current mip level is bound as the render target. As OpenGL allows rendering from and to the same texture object as long as we don&#8217;t access the same mip level for both reading and writing, the algorithm simply looks like the following:</p>
<pre class="brush:cpp">// bind depth texture
glBindTexture(GL_TEXTURE_2D, depthTexture);
// calculate the number of mipmap levels for NPOT texture
int numLevels = 1 + (int)floorf(log2f(fmaxf(SCREEN_WIDTH, SCREEN_HEIGHT)));
int currentWidth = SCREEN_WIDTH;
int currentHeight = SCREEN_HEIGHT;
for (int i=1; i&lt;numLevels; i++) {
  // calculate next viewport size
  currentWidth /= 2;
  currentHeight /= 2;
  // ensure that the viewport size is always at least 1x1
  currentWidth = currentWidth &gt; 0 ? currentWidth : 1;
  currentHeight = currentHeight &gt; 0 ? currentHeight : 1;
  glViewport(0, 0, currentWidth, currentHeight);
  // bind next level for rendering but first restrict fetches only to previous level
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, i-1);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, i-1);
  glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                         GL_TEXTURE_2D, depthTexture, i);
  // draw full-screen quad
  ............
}
// reset mipmap level range for the depth image
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, 0);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, numLevels-1);</pre>
<p>It is very important not to forget the step that ensures the viewport size is always at least 1&#215;1: in case of non-power-of-two (NPOT) textures the integer division rounds the smaller dimension down to zero before the larger one reaches one. I forgot this at first and wondered for an hour why my last mip level didn&#8217;t get filled.</p>
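<p>To see where the clamp kicks in, the mip chain sizes can be listed on the CPU. The following is only a minimal sketch; the helper name <code>hizMipSizes</code> is hypothetical and not part of the demo.</p>

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Sizes of every mip level of a width x height texture, using the "floor"
// convention and clamping each dimension to at least 1 (as in the loop above).
// hizMipSizes is a hypothetical helper, not a function of the demo.
std::vector<std::pair<int, int>> hizMipSizes(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<std::pair<int, int>> sizes;
    sizes.emplace_back(width, height);
    while (width > 1 || height > 1) {
        // halve, then make sure neither dimension drops to zero
        width = width / 2 > 0 ? width / 2 : 1;
        height = height / 2 > 0 ? height / 2 : 1;
        sizes.emplace_back(width, height);
    }
    return sizes;
}
```

<p>For a 1024&#215;768 map this yields 11 levels. Level 9 is 2&#215;1, so without the clamp the height of level 10 would become zero and nothing would be rendered into it.</p>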
<p>While one may wonder how this technique can be efficient after so many full-screen quad passes, it is in fact very efficient and it constructs the Hi-Z map on my Radeon HD5770 in less than <strong>0.2 milliseconds</strong>. The measurement should be quite accurate as I&#8217;ve done it using OpenGL timer queries (see the extension <a title="GL_ARB_timer_query" href="http://www.opengl.org/registry/specs/ARB/timer_query.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/timer_query.txt?referer=');">ARB_timer_query</a>).</p>
<p>The fragment shader used for the construction of the Hi-Z map is very straightforward except for one thing. We use an NPOT depth texture due to the aspect ratio of the window, and as NPOT textures use a &#8220;floor&#8221; convention to determine the size of subsequent mip levels (see the extension <a title="GL_ARB_texture_non_power_of_two" href="http://www.opengl.org/registry/specs/ARB/texture_non_power_of_two.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_non_power_of_two.txt?referer=');">ARB_texture_non_power_of_two</a>) we need predicated fetches: when reducing an odd-sized mip level we must not forget about the edge texels:</p>
<pre class="brush:c">#version 400 core

uniform sampler2D LastMip;
uniform ivec2 LastMipSize;

in vec2 TexCoord;

void main(void)
{
  vec4 texels;
  texels.x = texture( LastMip, TexCoord ).x;
  texels.y = textureOffset( LastMip, TexCoord, ivec2(-1, 0) ).x;
  texels.z = textureOffset( LastMip, TexCoord, ivec2(-1,-1) ).x;
  texels.w = textureOffset( LastMip, TexCoord, ivec2( 0,-1) ).x;

  float maxZ = max( max( texels.x, texels.y ), max( texels.z, texels.w ) );

  vec3 extra;
  // if we are reducing an odd-width texture then fetch the edge texels
  if ( ( (LastMipSize.x &amp; 1) != 0 ) &amp;&amp; ( int(gl_FragCoord.x) == LastMipSize.x-3 ) ) {
    // if both dimensions are odd, fetch the top-right corner texel
    if ( ( (LastMipSize.y &amp; 1) != 0 ) &amp;&amp; ( int(gl_FragCoord.y) == LastMipSize.y-3 ) ) {
      extra.z = textureOffset( LastMip, TexCoord, ivec2( 1, 1) ).x;
      maxZ = max( maxZ, extra.z );
    }
    extra.x = textureOffset( LastMip, TexCoord, ivec2( 1, 0) ).x;
    extra.y = textureOffset( LastMip, TexCoord, ivec2( 1,-1) ).x;
    maxZ = max( maxZ, max( extra.x, extra.y ) );
  } else
  // if we are reducing an odd-height texture then fetch the edge texels
  if ( ( (LastMipSize.y &amp; 1) != 0 ) &amp;&amp; ( int(gl_FragCoord.y) == LastMipSize.y-3 ) ) {
    extra.x = textureOffset( LastMip, TexCoord, ivec2( 0, 1) ).x;
    extra.y = textureOffset( LastMip, TexCoord, ivec2(-1, 1) ).x;
    maxZ = max( maxZ, max( extra.x, extra.y ) );
  }

  gl_FragDepth = maxZ;
}</pre>
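<p>As a cross-check for the shader output, the same &#8220;store maximum depth&#8221; reduction can be written on the CPU. This is a sketch of the reduction rule only, not a port of the shader above: it assumes a row-major float array, and the function name <code>reduceMaxDepth</code> is made up for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One Hi-Z reduction step on the CPU: every destination texel stores the
// maximum depth of the source texels it covers. Row-major layout is assumed.
std::vector<float> reduceMaxDepth(const std::vector<float>& src,
                                  int srcW, int srcH) {
    assert(srcW > 0 && srcH > 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH);
    const int dstW = std::max(1, srcW / 2);
    const int dstH = std::max(1, srcH / 2);
    std::vector<float> dst(static_cast<std::size_t>(dstW) * dstH);
    for (int y = 0; y != dstH; ++y) {
        const int y0 = 2 * y;
        // the last row also absorbs the extra source row of an odd-height level
        const int y1 = (y == dstH - 1) ? srcH - 1 : y0 + 1;
        for (int x = 0; x != dstW; ++x) {
            const int x0 = 2 * x;
            // the last column also absorbs the extra column of an odd-width level
            const int x1 = (x == dstW - 1) ? srcW - 1 : x0 + 1;
            float maxZ = src[static_cast<std::size_t>(y0) * srcW + x0];
            for (int sy = y0; sy != y1 + 1; ++sy) {
                for (int sx = x0; sx != x1 + 1; ++sx) {
                    maxZ = std::max(maxZ, src[static_cast<std::size_t>(sy) * srcW + sx]);
                }
            }
            dst[static_cast<std::size_t>(y) * dstW + x] = maxZ;
        }
    }
    return dst;
}
```

<p>Reducing a 3&#215;3 level gives a single texel that must hold the maximum of all nine source values, including the odd edge row and column, which is exactly the case the predicated fetches are there for.</p>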
<p>I experimented with texture gather lookups to reduce the number of texture fetches from 4-to-7 per fragment down to 1-to-3 per fragment (see the extension <a title="GL_ARB_texture_gather" href="http://www.opengl.org/registry/specs/ARB/texture_gather.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_gather.txt?referer=');">ARB_texture_gather</a>). It seems, however, that texture gather works only if the image is linearly sampled. To avoid the additional burden of switching filtering state during rendering I stuck to simple texture lookups, as texture gather did not show any visible effect on the construction time of the Hi-Z map.</p>
<div class="wp-caption aligncenter" style="width: 602px"><img title="Various mip levels of the Hi-Z map" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/depth-lods.png" alt="Various mip levels of the Hi-Z map" width="592" height="144" /><p class="wp-caption-text">Various mip levels of the Hi-Z map. The Hi-Z map size is 1024x768 and the displayed mip levels are: level 4 (left), level 5 (middle) and level 6 (right).</p></div>
<p>For debugging and demonstration purposes the Mountains demo has a built-in function to display the content of the various mip levels of the Hi-Z map. This is available by pressing the F4 key while Hi-Z map based occlusion culling is enabled. The + and &#8211; keys can be used to switch between the mip levels.</p>
<p>In order to better visualize the depth information in the depth buffer I converted the non-linear depth values stored in the depth texture into linear depth values as presented in <a title="[GeeXLab] How to Visualize the Depth Buffer in GLSL" href="http://www.geeks3d.com/20091216/geexlab-how-to-visualize-the-depth-buffer-in-glsl/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.geeks3d.com/20091216/geexlab-how-to-visualize-the-depth-buffer-in-glsl/?referer=');">[GeeXLab] How to Visualize the Depth Buffer in GLSL</a>.</p>
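<p>For reference, the conversion can be sketched as follows. This is the standard inversion of a perspective depth mapping, assuming the default <code>glDepthRange(0, 1)</code>; it is not necessarily the exact formula used in the linked article or in the demo, and the function name is hypothetical.</p>

```cpp
#include <cassert>

// Converts a window-space depth value in [0, 1] back to an eye-space distance,
// assuming a standard perspective projection and the default depth range.
// linearizeDepth is a hypothetical helper, not taken from the demo.
float linearizeDepth(float depth, float zNear, float zFar) {
    assert(zNear > 0.0f && zFar > zNear);
    const float ndcZ = 2.0f * depth - 1.0f;  // [0, 1] -> [-1, 1]
    return (2.0f * zNear * zFar) / (zFar + zNear - ndcZ * (zFar - zNear));
}
```

<p>With a near plane at 1 and a far plane at 100, a stored depth of 0.5 maps to a distance of only about 2, which shows how strongly the stored values are biased towards the near plane and why the raw depth texture looks almost uniformly white. Dividing the result by the far plane distance gives a value in [0, 1] that is suitable for display.</p>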
<h3>Culling with the Hi-Z map</h3>
<p>Once we have constructed the Hi-Z map, we can perform the actual occlusion culling by fetching the 2&#215;2 texel neighborhood corresponding to the screen area occupied by the bounding volume of the object whose visibility has to be determined. In the demo I used bounding boxes but any other bounding volume can be used (e.g. a bounding sphere is usually accurate enough for this technique).</p>
<p>First, we have to calculate the clip space bounding rectangle of the bounding volume. In the bounding box case this is done by transforming the bounding box vertices into clip space and then calculating the minimum and maximum X and Y coordinates. This bounding rectangle will be used for two things: it defines the texture coordinates that we&#8217;ll have to use for the Hi-Z map lookup and it helps determine the appropriate LOD for the texture lookup.</p>
<p>In order to determine the texture LOD that we&#8217;ll have to fetch we have to calculate the screen space size of the bounding square corresponding to the clip space bounding rectangle determined previously. This can be simply done by calculating the width and height of the bounding rectangle in clip space and then transforming this into screen space:</p>
<pre class="brush:c">float ViewSizeX = (BoundingRect[1].x-BoundingRect[0].x) * Transform.Viewport.y;
float ViewSizeY = (BoundingRect[1].y-BoundingRect[0].y) * Transform.Viewport.z;</pre>
<p>After this, the texture LOD can be simply calculated using the following formula:</p>
<pre class="brush:c">float LOD = ceil( log2( max( ViewSizeX, ViewSizeY ) / 2.0 ) );</pre>
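<p>The two steps can be combined into one CPU-side sketch. It assumes that the bounding rectangle is given in normalized device coordinates ([-1, 1] on both axes), which is one reading of the division by 2.0 in the formula above; the function and parameter names are hypothetical.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Picks the Hi-Z mip level whose texels are at least as large as the bounding
// rectangle. Assumption: the rectangle is in normalized device coordinates, so
// its extent times the viewport size is twice its size in pixels and the
// division by 2 undoes that.
int hizLod(float minX, float minY, float maxX, float maxY,
           float viewportWidth, float viewportHeight) {
    assert(maxX >= minX && maxY >= minY);
    const float viewSizeX = (maxX - minX) * viewportWidth;
    const float viewSizeY = (maxY - minY) * viewportHeight;
    const float pixels = std::max(viewSizeX, viewSizeY) / 2.0f;
    // rectangles of one pixel or less would give a non-positive LOD; use level 0
    return pixels > 1.0f ? static_cast<int>(std::ceil(std::log2(pixels))) : 0;
}
```

<p>For example, a rectangle spanning 0.25 NDC units horizontally on a 1024 pixel wide viewport is 128 pixels wide and selects LOD 7, where one texel covers 128&#215;128 pixels, so the rectangle can overlap at most two texels along each axis.</p>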
<p>Finally, as we have the texture coordinates (the vertices of the clip space bounding rectangle) and the texture LOD, we simply have to make four texture lookups into the Hi-Z map using these parameters, calculate the maximum of the four depth values returned and compare it to the depth value corresponding to the object (this is the object&#8217;s front-most point&#8217;s depth value that comes also from the clip space coordinates of the bounding box). If the object depth is greater than the reference depth the object is occluded and so it is culled by the geometry shader as usual.</p>
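<p>The comparison itself can be illustrated on the CPU as well. The sketch below assumes nearest-texel fetches, texture coordinates in [0, 1] and the &#8220;store maximum depth&#8221; policy; <code>HiZLevel</code>, <code>fetchDepth</code> and <code>isOccluded</code> are illustrative names and not taken from the demo.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One mip level of the Hi-Z map, stored row-major on the CPU for illustration.
struct HiZLevel {
    int width;
    int height;
    std::vector<float> depth;
};

// Nearest-texel fetch with clamping; u and v are texture coordinates in [0, 1].
float fetchDepth(const HiZLevel& level, float u, float v) {
    assert(level.width > 0 && level.height > 0);
    const int x = std::min(std::max(static_cast<int>(u * level.width), 0), level.width - 1);
    const int y = std::min(std::max(static_cast<int>(v * level.height), 0), level.height - 1);
    return level.depth[static_cast<std::size_t>(y) * level.width + x];
}

// Fetches the four corners of the bounding rectangle and compares the
// front-most depth of the object against the farthest occluder depth found.
bool isOccluded(const HiZLevel& level, float u0, float v0, float u1, float v1,
                float objectDepth) {
    const float maxZ = std::max(
        std::max(fetchDepth(level, u0, v0), fetchDepth(level, u1, v0)),
        std::max(fetchDepth(level, u0, v1), fetchDepth(level, u1, v1)));
    return objectDepth > maxZ;
}
```

<p>Note that a single distant texel among the four is enough to keep the object: the test can only report &#8220;occluded&#8221; when the object lies behind everything stored in all four texels.</p>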
<p>One may ask why we use a 2&#215;2 texel footprint for calculating the reference depth value instead of just fetching the next mip level once (as there we also get the maximum of a 2&#215;2 texel footprint due to the Hi-Z map construction method). That&#8217;s what I asked myself at first, but I quickly figured out the reason (see the figure below).</p>
<div class="wp-caption aligncenter" style="width: 530px"><img class=" " title="Comparison of four texel fetches and one texel fetch for depth comparison" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/fetch-modes.png" alt="Comparison of four texel fetches and one texel fetch for depth comparison" width="520" height="256" /><p class="wp-caption-text">Comparison of the number of fetches used for occlusion culling. Both figures show the magnified screen coverage of a single Hi-Z map texel at mip level N; texel coverage for mip level N-1 is in cyan and texel coverage for mip level N-2 is in blue. The object is shown in red and yellow indicates the fetched texels.</p></div>
<p>With four texels, not only is the texture LOD much easier to determine, but the fetched texels also encompass the actual object bounding rectangle more tightly. With a single texture fetch the computation of the texture LOD is more complicated and expensive, but the main problem is that a larger LOD has to be fetched, and it is not always the LOD determined in the four-fetch case plus one. In the most extreme situation (if the bounding rectangle is right at the middle of the screen) it is possible that we have to fetch the largest LOD. This does not result in any false culling but it severely degrades the effectiveness of the culling.</p>
<p>Of course, it is possible to use a more complex screen space bounding polygon as well as more fetches, but the gain in culling effectiveness would not be worth the additional burden and expensive operations.</p>
<h2>Conclusion</h2>
<p>We&#8217;ve seen how traditional hardware occlusion culling works by using occlusion queries. We also discussed that we sometimes need a better algorithm that performs occlusion culling for a large number of objects without CPU intervention.</p>
<p>The article also described a way to implement such an occlusion culling algorithm by using a hierarchical-Z map and geometry shaders. We&#8217;ve also managed to provide a reference implementation in the form of the demo called Mountains that can be downloaded with full source code in the <a title="OpenGL 4.0 - Mountains demo download" href="http://rastergrid.com/blog/downloads/mountains-demo/">downloads section</a>.</p>
<p>The algorithm performs very well in practice on current hardware. The Hi-Z map construction takes less than 0.2 milliseconds and the actual culling comes at almost no cost for even thousands of objects. For more detail about performance comparison between rendering with and without hierarchical-Z map based occlusion culling read the article about the <a title="OpenGL 4.0 - Mountains demo released" href="http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/">OpenGL 4.0 Mountains Demo</a>.</p>
<p>While the demo uses the technique only for culling instances of the same object, it can easily be extended to work for a heterogeneous set of objects, as the actual culling algorithm works on a per-object basis and is completely indifferent to the method used for rendering the actual geometry.</p>
<p>This technique can be thought of as the next step towards a completely GPU based visibility determination and scene management system.</p>
<p>Acknowledgements go to Jeremy Shopf, Joshua Barczak, Christopher Oat and Natalya Tatarchuk and their <a title="SIGGRAPH 2008 Course Notes about the March of the Froblins" href="http://developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf?referer=');">SIGGRAPH 2008 Course Notes</a> that inspired this work.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/feed/</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>OpenGL 4.0 &#8211; Mountains demo released</title>
		<link>http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/</link>
		<comments>http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/#comments</comments>
		<pubDate>Mon, 11 Oct 2010 21:19:21 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LOD]]></category>
		<category><![CDATA[occlusion culling]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[SFML]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=339</guid>
		<description><![CDATA[OpenGL 3.0 capable GPUs introduced a level of processing power and programming flexibility that isn&#8217;t comparable with any earlier generations. After that, OpenGL 4.0 and the hardware supporting it even further pushed the limits of what previously seemed to be impossible. Thanks to these features nowadays more and more possibilities are available for the graphics]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 210px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/10/mountains.png"><img class="  " title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-thumb.png" alt="OpenGL 4.0 - Mountains demo" width="200" height="150" /></a><p class="wp-caption-text">OpenGL 4.0 - Mountains demo</p></div>
<p>OpenGL 3.0 capable GPUs introduced a level of processing power and programming flexibility that isn&#8217;t comparable with any earlier generation. After that, OpenGL 4.0 and the hardware supporting it pushed the limits of what previously seemed to be impossible even further. Thanks to these features, more and more possibilities are available nowadays for graphics developers to implement GPU based scene management and culling algorithms. The Mountains demo showcases some of these rendering techniques that, as far as I know, have never been implemented before using OpenGL. In this article I will present the key features of the demo that will be discussed in more detail in subsequent articles. Demo binaries with full source code are also published.</p>
<p><span id="more-339"></span>The demo itself is mainly inspired by the <a title="March of the Froblins" href="http://developer.amd.com/samples/demos/pages/froblins.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/samples/demos/pages/froblins.aspx?referer=');">March of the Froblins</a> demo released by AMD and the <a title="Chapter03-SBOT-March_of_The_Froblins.pdf" href="http://developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf?referer=');">SIGGRAPH 2008 Course Notes</a> by Jeremy Shopf, Joshua Barczak, Christopher Oat and Natalya Tatarchuk presenting the actual implementation in detail. That demo targeted the Radeon HD4800 series and presented several practical GPU based culling algorithms implemented using DirectX10. The Mountains demo implements these techniques in OpenGL and further improves the technique used in AMD&#8217;s demo by unleashing the new features introduced by Shader Model 5.0 hardware and OpenGL 4.0.</p>
<p>While this article briefly presents the demo and the used rendering techniques, the details of each individual technique will be presented in subsequent articles as the thorough examination of them needs a longer discussion that would render this article simply too long and overwhelming.</p>
<h2>Introduction</h2>
<p>The Mountains demo renders a tiled terrain block with thousands of high detail tree models (the full detail tree model is over five thousand triangles). Because the view distance used in the demo is quite large, several tiles of the terrain block are potentially visible on the screen, and this results in a huge explosion in the number of triangles the GPU has to render. Also, with traditional methods the rendering of the terrain blocks and the several thousand tree models would need loads of draw calls. In order to solve this problem, the demo renders the trees using geometry instancing to minimize the number of draw calls.</p>
<p>In a traditional rendering engine CPU based culling methods would be used. While that would work in practice, it is more convenient to perform the culling on the GPU as all the information needed to do it is available there. Moreover, culling is a typical algorithm that can easily take advantage of the highly parallel architecture of the GPU. Also, performing the culling on the CPU would make geometry instancing barely beneficial.</p>
<p>Another problem with a scene like this is that simple per-object view frustum culling would not solve the problem completely, as most of the trees in the view frustum are not visible because they are hidden by the terrain. In traditional OpenGL the way to solve this problem would be to use per-object occlusion queries and render bounding volumes. While this may work in practice, it involves too much CPU intervention even if we take advantage of conditional rendering, and it also breaks instancing.</p>
<p>These are the issues that motivated me in creating this demo and I established the following goals for the project:</p>
<div class="wp-caption alignright" style="width: 210px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains2.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains2.png?referer=');"><img class="  " title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains2-thumb.png" alt="View from above" width="200" height="150" /></a><p class="wp-caption-text">View from above</p></div>
<ul>
<li>All the object-level information must stay on the GPU and the CPU should not make decisions on a per-object basis.</li>
<li>The renderer should use as few draw calls as possible in order to solve the problem of visibility determination.</li>
<li>Don&#8217;t draw anything that is not inside the view frustum or is occluded by terrain.</li>
</ul>
<p>The result is a renderer that does little to no scene management on the CPU and instead uses the GPU for visibility determination. In most cases it is able to reduce the scene&#8217;s geometric complexity from over 400 million triangles to under one million triangles, providing an interactive experience on a Radeon HD5770 with around 200 frames per second.</p>
<h2>Implementation</h2>
<p>The scene consists of a tiled terrain with over 130 thousand triangles and more than 1400 tree instances, each with almost 6 thousand triangles. This sums up to about 8 million triangles for a single terrain block. As the view range needs to be quite large, we actually deal with a 7&#215;7 grid of terrain blocks that is dynamically placed so that the camera always resides in the middle block. This means that even though we dynamically generate the scenery around the camera, we still have to deal with a scene consisting of over 400 million triangles. This is simply too much for the GPU to deal with.</p>
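<p>The totals can be reproduced with a few lines of arithmetic. The sketch below plugs in the lower bounds quoted in the text (130 thousand terrain triangles, 1400 trees) together with the 5811 triangles of the full detail tree model, so the result is an estimate rather than the exact count of the demo; the function name is hypothetical.</p>

```cpp
#include <cassert>

// Estimate of the number of triangles in a gridSize x gridSize set of terrain
// blocks when every tree is rendered at full detail.
long long sceneTriangles(long long terrainTriangles, long long treesPerBlock,
                         long long trianglesPerTree, int gridSize) {
    assert(gridSize > 0);
    const long long perBlock = terrainTriangles + treesPerBlock * trianglesPerTree;
    return perBlock * gridSize * gridSize;
}
```

<p>Calling <code>sceneTriangles(130000, 1400, 5811, 1)</code> gives 8,265,400 triangles for one block, and using a grid size of 7 gives 405,004,600 for the whole 7&#215;7 arrangement, which matches the figures of about 8 million and over 400 million triangles.</p>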
<p>The first step in reducing the geometric complexity of the scene is done on the CPU by performing view frustum culling on a per-terrain-block basis. This limits our 7&#215;7 grid to a smaller subset that contains only those blocks that lie within the view frustum. The result is a scene of usually around 50 million triangles.</p>
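<p>A coarse test like this is commonly implemented by checking the bounding box of a terrain block against the six frustum planes. The sketch below shows that common approach and is not necessarily what the demo does; the <code>Plane</code> and <code>Aabb</code> types and the inward-facing plane convention are assumptions.</p>

```cpp
#include <array>
#include <cassert>

// Plane in the form a*x + b*y + c*z + d = 0; points inside the frustum are
// assumed to satisfy a*x + b*y + c*z + d >= 0 for all six planes.
struct Plane { float a, b, c, d; };

// Axis-aligned bounding box of one terrain block.
struct Aabb { float min[3]; float max[3]; };

// Conservative test: returns false only if the box lies completely outside
// at least one frustum plane.
bool isBoxVisible(const std::array<Plane, 6>& frustum, const Aabb& box) {
    for (const Plane& p : frustum) {
        // pick the box corner that lies farthest along the plane normal
        const float x = p.a >= 0.0f ? box.max[0] : box.min[0];
        const float y = p.b >= 0.0f ? box.max[1] : box.min[1];
        const float z = p.c >= 0.0f ? box.max[2] : box.min[2];
        if (p.a * x + p.b * y + p.c * z + p.d < 0.0f) {
            return false;  // even the most favorable corner is outside
        }
    }
    return true;
}
```

<p>With a 7&#215;7 grid this is at most 49 box tests per frame, which is cheap enough to keep on the CPU while the per-instance decisions are left to the GPU.</p>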
<p>While this is already a reasonable amount of simplification, in order to further reduce the amount of geometry we have to render we have to do per-object culling. But as mentioned before, we would not like to do such fine grained scene management on the CPU so we need some sophisticated methods to do it on the GPU.</p>
<p>In order to accomplish this, we will take advantage of the geometry shader&#8217;s capability of discarding geometry. We will use it to do the per-object decisions in order to cull the tree instances that are not visible. The three techniques implemented in the culling geometry shader and the accompanying vertex shader are the following:</p>
<ul>
<li><strong>Instance Cloud Reduction (ICR)</strong> &#8211; This method does view frustum culling on a per-instance basis based on the bounding box of the instanced geometry, in this case the tree. The technique was first presented in my previous article titled <a title="Instance culling using geometry shaders" href="http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/">Instance culling using geometry shaders</a> and then further improved according to the instructions presented in <a title="Instance Cloud Reduction reloaded" href="http://rastergrid.com/blog/2010/06/instance-cloud-reduction-reloaded/">Instance Cloud Reduction reloaded</a>. In this case, the technique allows us to do a more fine grained yet still high level view frustum culling of the tree instances than that allowed by the simple per-tile culling performed on the CPU.</li>
<li><strong>Hierarchical-Z Map based Occlusion Culling</strong> &#8211; This technique allows for conservative per-instance occlusion culling done and evaluated completely on the GPU, using an algorithm similar to the one the hardware depth buffer uses to hierarchically reject fragments based on their depth values. Using this technique, a coarse occlusion culling can be performed on the instances without the need for occlusion queries and CPU intervention. <strong>Update!</strong> The technique is discussed in detail in the article <a title="Hierarchical-Z map based occlusion culling" href="http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/">Hierarchical-Z map based occlusion culling</a>.</li>
<li><strong>Dynamic Level-of-Detail Determination</strong> &#8211; This method allows us to dynamically select a suitable geometry level-of-detail on a per-instance basis completely on the GPU based on the application provided LOD parameters and the distance of the instance from the camera. The Mountains demo uses three LOD levels for the tree object: one with 5811 triangles, another with 2893 triangles and the lowest detailed version contains 1492 triangles. <strong>Update!</strong> The technical details of the algorithm are presented in the article <a title="GPU based dynamic geometry LOD" href="http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/">GPU based dynamic geometry LOD</a>.</li>
</ul>
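<p>To make the dynamic LOD idea a bit more tangible, here is a minimal CPU-side sketch of a distance-based LOD decision in C++. In the demo this decision is made on the GPU, so treat this only as an illustration: the <code>Vec3</code> type, the <code>selectLod</code> function and the two distance thresholds are hypothetical names and parameters, not taken from the demo source.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal 3D vector type used only for this sketch.
struct Vec3 { double x, y, z; };

// Returns 0 (most detailed), 1 or 2 (least detailed) based on the distance
// between the camera and the instance. The two thresholds stand in for the
// "application provided LOD parameters"; their values are assumptions.
int selectLod(const Vec3& camera, const Vec3& instance,
              double lod1Distance, double lod2Distance)
{
    assert(lod1Distance <= lod2Distance);
    const double dx = instance.x - camera.x;
    const double dy = instance.y - camera.y;
    const double dz = instance.z - camera.z;
    const double distance = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (distance < lod1Distance) return 0;
    if (distance < lod2Distance) return 1;
    return 2;
}
```

<p>With thresholds of 50 and 150 units, for example, an instance 100 units away from the camera would be drawn with the middle LOD.</p>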
<p>While in the Mountains demo all these techniques are used to determine the visibility and the LOD of static scenery (as trees are unlikely to move), the truth is that these methods also apply to dynamic scenery without modification. This is very important to note, as dynamic objects are usually the ones that make many of the CPU based scene management and visibility determination algorithms difficult to use or simply inefficient.</p>
<p>The key improvement compared to how these techniques are used in AMD&#8217;s demo is that my implementation applies all the algorithms to the instance set in a single rendering pass compared to the several passes needed by the original implementation. This is because the Mountains demo takes advantage of the latest technologies introduced by OpenGL 4.0 and the supporting hardware (in this case the functionality provided by the extension <a title="GL_ARB_transform_feedback3" href="http://www.opengl.org/registry/specs/ARB/transform_feedback3.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback3.txt?referer=');">GL_ARB_transform_feedback3</a>).</p>
<p>By using these techniques the GPU is able to reduce the geometric complexity of the scene from 50 million triangles down to a few million, sometimes even under a million. Of course, the actual reduction efficiency is heavily influenced by the view position and direction.</p>
<p>Besides the scene management and visibility determination techniques, the demo also showcases a few simple visual effects:</p>
<div class="wp-caption alignright" style="width: 210px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains3.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains3.png?referer=');"><img title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains3-thumb.png" alt="View horizon and sky" width="200" height="150" /></a><p class="wp-caption-text">View horizon and sky</p></div>
<ul>
<li>A simple infinitely far skybox generated using a geometry shader.</li>
<li>Simple diffuse lighting applied to the tree instances.</li>
<li>Global illumination-like effect that simulates the terrain casting shadows over the trees even though no shadow rendering technique is applied.</li>
<li>Fog effect to smooth out the disappearance of the terrain at the far clip plane.</li>
<li>Simplistic fake depth-of-field effect that makes far away objects look blurry.</li>
</ul>
<p>Maybe I will also present some of these techniques in detail in another article if there is interest in it.</p>
<p>As I mentioned, I used a geometry shader to render the skybox, and I did the same when rendering full screen quads to apply image space algorithms. I&#8217;ve done this because I always feel kind of stupid when I have to put such simple geometry as a skybox or a full screen quad into a vertex buffer. In these situations I feel like simply using immediate mode to draw that damn little piece of geometry, but I want to stick to core OpenGL so I quickly change my mind. As a simple alternative, I used geometry shaders instead to emit these simple geometric objects, which are used so often that I even wonder why OpenGL does not have e.g. a glDrawScreenQuad-like command. Of course, the geometry shaders don&#8217;t start by themselves so I used dummy draw commands to make the geometry shader do its job.</p>
<h2>Performance</h2>
<p>Now let&#8217;s see how our GPU based optimizations perform in practice. I&#8217;ve collected results from typical view positions from where a moderate number of trees are visible. The tests were done on a Radeon HD 5770. Other configuration parameters are not really relevant as the demo is clearly GPU bound as only a few state changes and render commands are executed on the CPU. Of course, this is kind of a synthetic demo as you would usually want to balance the workload between the CPU and the GPU but usually you have AI, physics and other things for the CPU so transferring as much work to the GPU as possible usually gives a great benefit.</p>
<div class="wp-caption aligncenter" style="width: 654px"><img class="   " src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-fps.png" alt="Performance comparison of various culling and LOD techniques in frames per second on a Radeon HD5770 (higher is better)" width="644" height="224" /><p class="wp-caption-text">Performance comparison of the demo in frames per second on a Radeon HD5770 (higher is better): no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).</p></div>
<p>As you can see in the figure above, using all the optimizations clearly benefits the frame rate of the demo, even though the Hi-Z map based occlusion culling requires several additional draw passes due to the construction of the Hi-Z map. It is also clearly visible that in a scene like this, where there are a lot of occluders, ICR is simply not sufficient on its own. One final note: the application of dynamic LOD has a more significant effect without Hi-Z, as occlusion culling removes the largest share of the instances.</p>
<div class="wp-caption aligncenter" style="width: 654px"><img src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-mtris.png" alt="Amount of visible geometry after culling in millions of triangles: no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red)." width="644" height="224" /><p class="wp-caption-text">Amount of visible geometry after culling in millions of triangles: no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).</p></div>
<p>Our next chart shows the amount of geometry that is finally drawn after culling in millions of triangles. On this figure we see exactly the inverse of the previous chart and it is not surprising as obviously we have a geometry throughput bottleneck. It also clearly shows how important dynamic LOD is even if we don&#8217;t perform more sophisticated visibility determination algorithms.</p>
<table style="width: 100%;" border="0">
<tbody>
<tr>
<td></td>
<td style="text-align: center;"><strong>No LOD</strong></td>
<td style="text-align: center;"><strong>Dynamic LOD</strong></td>
</tr>
<tr>
<td><strong>No culling</strong></td>
<td style="text-align: center;">17 draw calls</td>
<td style="text-align: center;">19 draw calls</td>
</tr>
<tr>
<td><strong>Instance cloud reduction</strong></td>
<td style="text-align: center;">17 draw calls</td>
<td style="text-align: center;">19 draw calls</td>
</tr>
<tr>
<td><strong>ICR + Hi-Z map based occlusion culling</strong></td>
<td style="text-align: center;">27 draw calls</td>
<td style="text-align: center;">29 draw calls</td>
</tr>
</tbody>
</table>
<p>Finally, in the table above we&#8217;ve listed the number of draw calls needed by each technique from the reference point of view. The techniques applied do not have a significant effect on the number of draw calls: we have a fixed number of draw calls, plus two additional draw calls if we use LOD. The only exception is when we use Hi-Z map based occlusion culling, as the Hi-Z map is a full mipmap chain and we need ten additional draw calls to generate all the mip-levels.</p>
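<p>The ten additional draw calls follow directly from the length of the mipmap chain. As a small sketch in C++, assuming the Hi-Z map is 1024 texels along its larger side (the size is my assumption here, it is not stated above) and that each mip-level below the base is generated with one draw call, the count can be derived like this. The function name <code>mipLevelsBelowBase</code> is made up for the illustration.</p>

```cpp
#include <cassert>

// Number of mip levels below the base level of a width x height texture,
// i.e. how many times the larger dimension can be halved before reaching 1.
// With one draw call per generated level this is also the number of extra
// draw calls needed to build the full chain.
int mipLevelsBelowBase(int width, int height)
{
    assert(width > 0 && height > 0);
    int largest = width > height ? width : height;
    int levels = 0;
    while (largest > 1) {
        largest /= 2;
        ++levels;
    }
    return levels;
}
```

<p>For a 1024&#215;1024 map this gives ten levels below the base level, which matches the difference between 17 and 27 draw calls in the table.</p>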
<h2>Conclusion</h2>
<p>The techniques presented are rather simple to implement and can provide huge performance increases. Moreover, they allow the renderer to offload even some of the object-level algorithms from the CPU to the GPU, and obviously this is the direction to go in the future.</p>
<p>We&#8217;ve also mostly met the goals set at the beginning. Of course not fully, as the occlusion culling performed is a rather coarse culling method and does not completely eliminate all the instances that will not contribute to the final image.</p>
<h2>Future work</h2>
<p>While the implementation almost completely eliminates the need for CPU intervention during the rendering phase, I still had to use a few asynchronous queries to get the number of visible instances for each geometry LOD, although the latency incurred by the use of query objects is hidden in the demo by rendering the skybox between the initiation of the queries and the retrieval of the results.</p>
<div class="wp-caption alignright" style="width: 210px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains4.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains4.png?referer=');"><img title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains4-thumb.png" alt="Deep in the forest" width="200" height="150" /></a><p class="wp-caption-text">Deep in the forest</p></div>
<p>As soon as we get atomic counters into core OpenGL and consequently when we&#8217;ll have drivers supporting it, I will further improve the technique using indirect rendering and atomic counters so even the need for these queries will be eliminated.</p>
<p>Additionally, as mentioned several times, I plan to write detailed articles about the individual techniques I used in the demo. I decided to go in this direction as a thorough description of all the details of the demo would be simply too long in one piece.</p>
<h2>Running the demo</h2>
<p>The demo uses OpenGL 4.0 so a Shader Model 5.0 capable graphics card is a must. Even though most of the techniques used could be implemented on OpenGL 3.x, this time I wanted to stick to GL 4.0 as I took advantage of its new features to improve the implementation even further.</p>
<p>First, don&#8217;t be alarmed if the demo runs at a very low frame rate after startup. This is because by default all GPU based optimizations are disabled.</p>
<p>You can use the SPACE key to switch between the various culling methods:</p>
<ul>
<li>No culling at all</li>
<li>Instance cloud reduction</li>
<li>ICR with Hi-Z map based occlusion culling</li>
</ul>
<p>Finally, you can turn dynamic LOD on and off using the F3 key.</p>
<p>There are a few other controls present in the demo that you may figure out if you read the code, but I don&#8217;t want to go into the details of them as they will be presented in the upcoming articles where I will present Hi-Z map based occlusion culling and dynamic LOD in detail. So stay tuned: <a title="Follow me on twitter" href="http://www.twitter.com/aqnuep" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.twitter.com/aqnuep?referer=');">follow me on twitter</a> or <a title="RSS Feeds" href="http://rastergrid.com/blog/feed/">subscribe to the RSS feed</a>.</p>
<p>The demo can be downloaded with full source code in the <a title="Downloads" href="http://rastergrid.com/blog/downloads/mountains-demo/">downloads section</a>.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Efficient Gaussian blur with linear sampling</title>
		<link>http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/</link>
		<comments>http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/#comments</comments>
		<pubDate>Tue, 07 Sep 2010 20:48:16 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[bloom]]></category>
		<category><![CDATA[blur]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[depth-of-field]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[postprocessing]]></category>
		<category><![CDATA[SFML]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=299</guid>
		<description><![CDATA[Gaussian blur is an image space effect that is used to create a softly blurred version of the original image. This image then can be used by more sophisticated algorithms to produce effects like bloom, depth-of-field, heat haze or fuzzy glass. In this article I will present how to take advantage of the various properties]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 160px"><br />
<img class=" " title="Gaussian blur" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_thumbnail.png" alt="Gaussian blur" width="150" height="150" /><p class="wp-caption-text">Gaussian blur</p></div>
<p>Gaussian blur is an image space effect that is used to create a softly blurred version of the original image. This image then can be used by more sophisticated algorithms to produce effects like bloom, depth-of-field, heat haze or fuzzy glass. In this article I will present how to take advantage of the various properties of the Gaussian filter to create an efficient implementation as well as a technique that can greatly improve the performance of a naive Gaussian blur filter implementation by taking advantage of bilinear texture filtering to reduce the number of necessary texture lookups. While the article focuses on the Gaussian blur filter, most of the principles presented are valid for most convolution filters used in real-time graphics.</p>
<p><span id="more-299"></span></p>
<p>Gaussian blur is a widely used technique in the domain of computer graphics and many rendering techniques rely on it in order to produce convincing photorealistic effects, no matter if we talk about an offline renderer or a game engine. Since the advent of configurable fragment processing, first through texture combiners and then through fragment shaders, the use of Gaussian blur or some other blur filter is almost a must for every rendering engine. While the basic convolution filter algorithm is a rather expensive one, there are a lot of neat techniques that can drastically reduce its computational cost, making it available for real-time rendering even on pretty outdated hardware. This article is mostly a tutorial that tries to present most of the available optimization techniques. Some of them may be familiar to all of you, but maybe the linear sampling will bring you some surprise. Let&#8217;s not jump that far ahead, though, and start with the basics.</p>
<h2>Terminology</h2>
<p>In order to prevent any possibility of confusion, I&#8217;ll start the article with the introduction of some terms and concepts that I will use in the post.</p>
<p><strong>Convolution filter</strong> &#8211; An algorithm that combines the color value of a group of pixels.</p>
<p><strong>NxN-tap filter &#8211; </strong>A filter that uses a square shaped footprint of pixels with the square&#8217;s side length being N pixels.</p>
<p><strong>N-tap filter</strong> &#8211; A filter that uses an N-pixel footprint. Note that an N-tap filter does <em>not</em> necessarily mean that the filter has to sample N texels, as we will see that an N-tap filter can be implemented using fewer than N texel fetches.</p>
<p><strong>Filter kernel</strong> &#8211; A collection of relative coordinates and weights that are used to combine the pixel footprint of the filter.</p>
<p><strong>Discrete sampling</strong> &#8211; Texture sampling method when we fetch the data of exactly one texel (aka GL_NEAREST filtering).</p>
<p><strong>Linear sampling</strong> &#8211; Texture sampling method when we fetch a footprint of 2&#215;2 texels and we apply a bilinear filter to acquire the final color information (aka GL_LINEAR filtering).</p>
<h2>Gaussian filter</h2>
<p>The image space Gaussian filter is an NxN-tap convolution filter that weights the pixels inside of its footprint based on the Gaussian function:</p>
<p style="text-align: center;"><img class=" aligncenter" title="Gaussian function 2D" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_function_2D.png" alt="Gaussian function 2D" width="190" height="41" /></p>
<p>The pixels of the filter footprint are weighted using the values obtained from the Gaussian function, thus providing a blur effect. The spatial representation of the Gaussian filter, sometimes referred to as the &#8220;bell surface&#8221;, demonstrates how much the individual pixels of the footprint contribute to the final pixel color.</p>
<div class="wp-caption aligncenter" style="width: 444px"><img title="Gaussian function graphical representation" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_graph.png" alt="Gaussian function graphical representation" width="434" height="351" /><p class="wp-caption-text">The graphical representation of the 2-dimensional Gaussian function</p></div>
<p>Based on this some of you may already say &#8220;aha, so we simply need to do NxN texture fetches and weight them together and voilà&#8221;. While this is true, it is not as efficient as it looks. In case of a 1024&#215;1024 image, using a fragment shader that implements a 33&#215;33-tap Gaussian filter based on this approach would need an enormous number of 1024*1024*33*33 ≈ 1.14 billion texture fetches in order to apply the blur filter to the whole image.</p>
<p>In order to get to a more efficient algorithm we have to analyze a bit some of the nice properties of the Gaussian function:</p>
<ul>
<li>The 2-dimensional Gaussian function can be calculated by multiplying two 1-dimensional Gaussian functions:</li>
</ul>
<p style="text-align: center;"><img class="aligncenter" title="Gaussian function 1D" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_function_1D.png" alt="Gaussian function 1D" width="190" height="41" /></p>
<ul>
<li>Applying a Gaussian filter repeatedly is equivalent to applying a single, wider Gaussian filter: the convolution of two Gaussian functions with a standard deviation of σ is a Gaussian function with a standard deviation of √2·σ.</li>
</ul>
<p>Both of these properties of the Gaussian function give us room for heavy optimization.</p>
<p>Based on the first property, we can separate our 2-dimensional Gaussian function into two 1-dimensional ones. In case of the fragment shader implementation this means that we can separate our Gaussian filter into a horizontal blur filter and a vertical blur filter, still getting accurate results after the rendering. This results in two N-tap filters and an additional rendering pass needed for the second filter. Getting back to our example, applying the two filters to a 1024&#215;1024 image using two 33-tap Gaussian filters will get us to 1024*1024*33*2 ≈ 69 million texture fetches. That is already more than an order of magnitude less than what the original approach required.</p>
<p>Using the second property of the Gaussian function, we can separate our 33&#215;33-tap filter into three 9&#215;9-tap filters (9+8=17, 17+16=33). Back to our example, for the 1024&#215;1024 sized image this results in 1024*1024*9*9*3 ≈ 255 million texture fetches. As we can see, we save a large number of texture fetches using this approach as well.</p>
<p>Of course, the combination of the two techniques is also possible. That means we both separate our filter into a vertical and a horizontal filter and decompose our 33-tap filter into three 9-tap filters. This will get us to the almost optimal number of 1024*1024*9*3*2 ≈ 56 million texture fetches.</p>
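<p>To double check the arithmetic of the four variants discussed so far, here is a tiny C++ sketch that simply multiplies the image size by the number of fetches per pixel and the number of passes. The function names are made up for this illustration.</p>

```cpp
#include <cassert>
#include <cstdint>

// Texture fetches needed to filter a width x height image when every pixel
// runs `passes` filter passes of `fetchesPerPass` texture fetches each.
std::uint64_t blurFetches(std::uint64_t width, std::uint64_t height,
                          std::uint64_t fetchesPerPass, std::uint64_t passes)
{
    assert(width > 0 && height > 0);
    return width * height * fetchesPerPass * passes;
}

// The four variants from the text, for a 1024x1024 image.
std::uint64_t naive33x33()     { return blurFetches(1024, 1024, 33 * 33, 1); }
std::uint64_t separable33()    { return blurFetches(1024, 1024, 33, 2); }
std::uint64_t three9x9()       { return blurFetches(1024, 1024, 9 * 9, 3); }
std::uint64_t threeSeparable9(){ return blurFetches(1024, 1024, 9, 3 * 2); }
```

<p>The four functions return 1141899264, 69206016, 254803968 and 56623104 respectively, which are the rounded figures quoted above.</p>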
<h2>Gaussian kernel weights</h2>
<p>We&#8217;ve seen how to implement an efficient Gaussian blur filter for our application, at least in theory, but we haven&#8217;t talked about how we should calculate the weights for each pixel we combine using the filter in order to get the proper results. The most straightforward way to determine the kernel weights is by simply calculating the value of the Gaussian function for various distribution and coordinate values. While this is the most generic solution, there is a simpler way to get some weights by using the binomial coefficients. Why can we do that? Because the Gaussian function is actually the probability density function of the normal distribution, and the normal distribution&#8217;s discrete equivalent is the binomial distribution, which uses the binomial coefficients for weighting its samples.</p>
<div class="wp-caption aligncenter" style="width: 630px"><img title="Binomial coefficients" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/binomial_coeff2.png" alt="Binomial coefficients" width="620" height="300" /><p class="wp-caption-text">The Pascal triangle showcasing the binomial coefficients that can be used to calculate the kernel weights (each element in the succeeding rows is the sum of its &quot;parents&quot;).</p></div>
<p>For implementing our 9-tap horizontal and vertical Gaussian filter we will use the last row of the Pascal triangle illustrated above in order to calculate our weights. One may ask why we don&#8217;t use the row with index 8, as it has 9 coefficients. This is a justifiable question, but it is rather easy to answer. With a typical 32 bit color buffer the outermost coefficients of the last row don&#8217;t have any effect on the final image, while the second outermost ones have little to no effect, so we can simply leave them out and still end up with nine coefficients. We would like to minimize the number of texture fetches but provide the highest quality blur possible with our 9-tap filter. Obviously, in case very high precision results are a must and a higher precision color buffer is available, preferably a floating point one, using the row with index 8 is better. But let&#8217;s stick to our original idea and use the last row&#8230;</p>
<p>Having the necessary coefficients, it is very easy to calculate the weights that will be used to linearly interpolate our pixels. We just have to divide each coefficient by the sum of the coefficients, which is 4096 in this case. Of course, to correct for the elimination of the four outermost coefficients, we shall reduce the sum to 4070, otherwise the image may get darker if we apply the filter several times.</p>
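<p>The weight calculation described above can be sketched in a few lines of C++. The sketch assumes that the last row of the illustrated triangle is the row with index 12 (its coefficients sum to 4096, which matches the sum mentioned above); the function names <code>pascalRow</code> and <code>gaussianWeights</code> are made up for the example.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns row `n` of Pascal's triangle (n + 1 binomial coefficients).
std::vector<std::uint64_t> pascalRow(unsigned n)
{
    std::vector<std::uint64_t> row(n + 1, 1);
    for (unsigned k = 1; k <= n; ++k)
        row[k] = row[k - 1] * (n - k + 1) / k;   // C(n,k) from C(n,k-1)
    return row;
}

// Builds the one-sided kernel weights (center weight first) from row `n`,
// ignoring the `dropped` outermost coefficients on each side and dividing
// by the sum of the remaining coefficients.
std::vector<double> gaussianWeights(unsigned n, unsigned dropped)
{
    assert(n % 2 == 0 && dropped <= n / 2);
    const std::vector<std::uint64_t> row = pascalRow(n);

    std::uint64_t sum = 0;
    for (unsigned k = dropped; k + dropped <= n; ++k)
        sum += row[k];                           // 4070 for n = 12, dropped = 2

    std::vector<double> weights;
    for (unsigned k = n / 2; k + dropped <= n; ++k)
        weights.push_back(static_cast<double>(row[k]) / static_cast<double>(sum));
    return weights;
}
```

<p>Calling <code>gaussianWeights(12, 2)</code> yields the five values 924/4070, 792/4070, 495/4070, 220/4070 and 66/4070: the weight of the center texel followed by the weights of the texels at distance one to four.</p>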
<p>Now that we have our weights, it is very straightforward to implement our fragment shaders. Let&#8217;s see what the vertical filter shader looks like in GLSL:</p>
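<pre class="brush:cpp">#version 330 core

// Source image to blur.
uniform sampler2D image;

out vec4 FragmentColor;

// Texel offsets and weights of the 9-tap kernel; index 0 is the center texel.
uniform float offset[5] = float[]( 0.0, 1.0, 2.0, 3.0, 4.0 );
uniform float weight[5] = float[]( 0.2270270270, 0.1945945946, 0.1216216216,
                                   0.0540540541, 0.0162162162 );

void main(void)
{
    // gl_FragCoord is in window space, so dividing it by the (hardcoded)
    // image size of 1024 yields normalized texture coordinates.
    FragmentColor = texture( image, vec2(gl_FragCoord)/1024.0 ) * weight[0];
    for (int i=1; i&lt;5; i++) {
        // Texel at distance offset[i] above the center...
        FragmentColor +=
            texture( image, ( vec2(gl_FragCoord)+vec2(0.0, offset[i]) )/1024.0 )
                * weight[i];
        // ...and its mirror pair below the center.
        FragmentColor +=
            texture( image, ( vec2(gl_FragCoord)-vec2(0.0, offset[i]) )/1024.0 )
                * weight[i];
    }
}</pre>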
<p>Obviously the horizontal filter is no different, except that the offset value is applied to the X component rather than to the Y component of the fragment coordinate. Note that we hardcoded the size of the image here, as we divide the resulting window space coordinate by 1024. In a real life scenario one may replace that with a uniform or simply use texture rectangles that don&#8217;t use normalized texture coordinates.</p>
<p>If you have to apply the filter several times in order to get a stronger blur effect, the only thing you have to do is ping-pong between two framebuffers and apply the shaders to the result of the previous step.</p>
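<p>The ping-pong scheme is easiest to see in a CPU-side model. The following C++ sketch is not the demo&#8217;s OpenGL code: it stands in a plain single-channel <code>Image</code> struct for the two framebuffers and assumes clamp-to-edge addressing, both of which are my assumptions for the illustration. It applies the same 9-tap kernel horizontally and then vertically, swapping the roles of the two buffers after each pass.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-channel image standing in for a framebuffer texture.
struct Image {
    int width, height;
    std::vector<double> texels;          // row-major storage

    double at(int x, int y) const        // clamp-to-edge addressing
    {
        x = x < 0 ? 0 : (x >= width ? width - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// One 9-tap pass from src into dst; (dx, dy) is (1, 0) for the horizontal
// pass and (0, 1) for the vertical one.
void blurPass(const Image& src, Image& dst, int dx, int dy)
{
    assert(src.width == dst.width && src.height == dst.height);
    const double w[5] = { 924.0 / 4070, 792.0 / 4070, 495.0 / 4070,
                          220.0 / 4070, 66.0 / 4070 };
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            double sum = src.at(x, y) * w[0];
            for (int i = 1; i < 5; ++i)
                sum += (src.at(x + i * dx, y + i * dy) +
                        src.at(x - i * dx, y - i * dy)) * w[i];
            dst.texels[static_cast<std::size_t>(y) * src.width + x] = sum;
        }
    }
}

// Applies the separable blur `steps` times, ping-ponging between two buffers.
Image blur(Image image, int steps)
{
    Image scratch = image;
    for (int s = 0; s < steps; ++s) {
        blurPass(image, scratch, 1, 0);  // horizontal: image -> scratch
        blurPass(scratch, image, 0, 1);  // vertical:   scratch -> image
    }
    return image;
}
```

<p>Because the weights are normalized with 4070 they sum to exactly one, so a uniformly colored image keeps its brightness no matter how many blur steps are applied.</p>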
<div class="wp-caption aligncenter" style="width: 610px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian1.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian1.png?referer=');"><img class=" " title="Gaussian blur effect" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian1_thumbnail.png" alt="Gaussian blur effect" width="600" height="200" /></a><p class="wp-caption-text">9-tap Gaussian blur filter applied to an image of size 1024x1024: no filter applied (left), applied once (middle), applied nine times (right). Click to view the full-sized image in order to better see the difference.</p></div>
<h2>Linear sampling</h2>
<p>So far, we have seen how to implement a separable Gaussian filter using two rendering passes in order to get a 9-tap Gaussian blur. We&#8217;ve also seen that we can run this filter three times over a 1024&#215;1024 sized image in order to get a 33-tap Gaussian blur by using only 56 million texture fetches. While this is already quite efficient, it does not really exploit any GPU-specific capabilities, as this form of the algorithm would work perfectly, almost unmodified, on a CPU as well.</p>
<p>Now, we will see that we can take advantage of the fixed function hardware available on the GPU that can even further reduce the number of required texture fetches. In order to get to this optimization let&#8217;s discuss one of the assumptions that we made from the beginning of the article:</p>
<p>So far, we assumed that in order to get information about a single pixel we have to make a texture fetch, which means that for 9 pixels we need 9 texture fetches. While this is true in case of a CPU implementation, it is not necessarily true in case of a GPU implementation. This is because in the GPU case we have bilinear texture filtering at our disposal, which comes at practically no cost. That means if we don&#8217;t fetch our texture at texel center positions then we can get information about multiple pixels. As we already use the separability property of the Gaussian function, we are actually working in 1D, so for us the bilinear filter will provide information about two pixels. How much each texel contributes to the final color value depends on the coordinate that we use.</p>
<p>By properly adjusting the texture coordinate offsets we can get the accurate information of two texels or pixels using a single texture fetch. That means for implementing a 9-tap horizontal/vertical Gaussian filter we need only 5 texture fetches. In general, for an N-tap filter we need ⌈N/2⌉ texture fetches (N/2 rounded up).</p>
<p>What does this mean for the weight values we previously used for the discrete sampled Gaussian filter? It means that each time we use a single texture fetch to get information about two texels, we have to weight the retrieved color value by the sum of the weights corresponding to the two texels. Now that we know what our weights are, we just have to calculate the texture coordinate offsets properly.</p>
<p>For texture coordinates, we could simply use the middle coordinate between the two texel centers. While this is a good approximation, we won&#8217;t accept it, as we can calculate much better coordinates that will give us exactly the same values as when we used discrete sampling.</p>
<p>When merging two texels like this, we have to adjust the coordinate so that its distance from the center of texel #1 equals the weight of texel #2 divided by the sum of the two weights. In the same way, the distance of the coordinate from the center of texel #2 should equal the weight of texel #1 divided by the sum of the two weights.</p>
<p>As a result, we get the following formulas to determine the weights and offsets for our linear sampled Gaussian blur filter:</p>
<p style="text-align: center;"><img class="aligncenter" title="Weight and offset calculation for linear sampling" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/equation.png" alt="Weight and offset calculation for linear sampling" width="597" height="116" /></p>
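<p>As I read the rule described above, the merged weight is the sum of the two discrete weights and the merged offset is the weighted average of the two discrete offsets. Here is a minimal C++ sketch of that calculation; the <code>Tap</code> struct and the function name are made up for this illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One texture fetch of the linearly sampled filter.
struct Tap { double offset; double weight; };

// Merges pairs of neighbouring discrete taps (1,2), (3,4), ... into single
// linearly sampled taps. Index 0 (the center texel) is kept as it is.
std::vector<Tap> mergeForLinearSampling(const std::vector<double>& offset,
                                        const std::vector<double>& weight)
{
    assert(offset.size() == weight.size() && offset.size() % 2 == 1);
    std::vector<Tap> taps{ { offset[0], weight[0] } };
    for (std::size_t i = 1; i + 1 < offset.size(); i += 2) {
        // Merged weight: sum of the two discrete weights.
        const double w = weight[i] + weight[i + 1];
        // Merged offset: average of the two offsets weighted by their weights.
        const double o = (offset[i] * weight[i] +
                          offset[i + 1] * weight[i + 1]) / w;
        taps.push_back({ o, w });
    }
    return taps;
}
```

<p>Feeding it the five discrete offsets and weights of the 9-tap filter produces three taps with offsets of about 0, 1.3846 and 3.2308 and weights of about 0.2270, 0.3162 and 0.0703.</p>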
<p>By using this information we just have to replace our uniform constants and decrease the number of iterations in our vertical filter shader and we get the following:</p>
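<pre class="brush:cpp">#version 330 core

// Source image to blur; it must use GL_LINEAR filtering for this to work.
uniform sampler2D image;

out vec4 FragmentColor;

// Each entry after the first one covers two neighbouring texels at once.
uniform float offset[3] = float[]( 0.0, 1.3846153846, 3.2307692308 );
uniform float weight[3] = float[]( 0.2270270270, 0.3162162162, 0.0702702703 );

void main(void)
{
    // The center texel is still fetched exactly at its center.
    FragmentColor = texture( image, vec2(gl_FragCoord)/1024.0 ) * weight[0];
    for (int i=1; i&lt;3; i++) {
        // One bilinear fetch between two texels above the center...
        FragmentColor +=
            texture( image, ( vec2(gl_FragCoord)+vec2(0.0, offset[i]) )/1024.0 )
                * weight[i];
        // ...and the mirrored fetch below the center.
        FragmentColor +=
            texture( image, ( vec2(gl_FragCoord)-vec2(0.0, offset[i]) )/1024.0 )
                * weight[i];
    }
}</pre>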
<p>This simplification of the algorithm is mathematically correct, and if we don&#8217;t consider possible rounding errors resulting from the hardware implementation of the bilinear filter, we should get exactly the same result with our linear sampling shader as with the discrete sampling one.</p>
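<p>The equivalence can also be checked on the CPU by emulating a 1D linear texture fetch. The C++ sketch below is only a model under a few assumptions of mine: texel centers sit at integer positions, addressing is clamp-to-edge, and the interpolation is done in double precision (real hardware uses limited precision for the interpolation factor). All function names are made up for the example.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Clamp-to-edge fetch of a single texel (models GL_NEAREST).
double fetchNearest(const std::vector<double>& row, long i)
{
    assert(!row.empty());
    const long last = static_cast<long>(row.size()) - 1;
    return row[static_cast<std::size_t>(i < 0 ? 0 : (i > last ? last : i))];
}

// Models a 1D GL_LINEAR fetch at continuous position x, where an integer x
// addresses a texel center.
double fetchLinear(const std::vector<double>& row, double x)
{
    const long i0 = static_cast<long>(std::floor(x));
    const double f = x - static_cast<double>(i0);
    return fetchNearest(row, i0) * (1.0 - f) + fetchNearest(row, i0 + 1) * f;
}

// 9-tap blur of texel c using nine discrete fetches.
double blurDiscrete(const std::vector<double>& row, long c)
{
    const double w[5] = { 924.0 / 4070, 792.0 / 4070, 495.0 / 4070,
                          220.0 / 4070, 66.0 / 4070 };
    double sum = fetchNearest(row, c) * w[0];
    for (int i = 1; i < 5; ++i)
        sum += (fetchNearest(row, c + i) + fetchNearest(row, c - i)) * w[i];
    return sum;
}

// The same blur using five linear fetches.
double blurLinear(const std::vector<double>& row, long c)
{
    const double o[3] = { 0.0, 1782.0 / 1287.0, 924.0 / 286.0 };
    const double w[3] = { 924.0 / 4070, 1287.0 / 4070, 286.0 / 4070 };
    double sum = fetchLinear(row, static_cast<double>(c)) * w[0];
    for (int i = 1; i < 3; ++i)
        sum += (fetchLinear(row, c + o[i]) + fetchLinear(row, c - o[i])) * w[i];
    return sum;
}
```

<p>For any input row, <code>blurDiscrete</code> and <code>blurLinear</code> agree up to floating point rounding, which is the CPU equivalent of the side-by-side comparison shown below.</p>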
<div class="wp-caption aligncenter" style="width: 523px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/side2side.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/09/side2side.png?referer=');"><img class=" " title="Side-to-side comparison of Gaussian blur with discrete and linear sampling" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/side2side_thumbnail.png" alt="Side-to-side comparison of Gaussian blur with discrete and linear sampling" width="513" height="250" /></a><p class="wp-caption-text">9-tap Gaussian blur applied nine times with discrete sampling (left) and linear sampling (right). Click for the full resolution of the image. Note that there is no visible difference between the two techniques even after several passes.</p></div>
<p>While the implementation of the linear sampling is pretty straightforward, it has a quite visible effect on the performance of the Gaussian blur filter. Taking into consideration that we managed to implement a 9-tap filter using just five texture fetches instead of nine, back to our example, blurring a 1024&#215;1024 image with a 33-tap filter takes only 1024*1024*5*3*2 ≈ 31 million texture fetches instead of the 56 million required by discrete sampling. This is quite a considerable difference, and in order to better show how much it matters I ran some experiments to measure the difference between the two techniques. The result speaks for itself:</p>
<div class="wp-caption aligncenter" style="width: 532px"><img title="Performance comparison of discrete and linear sampling" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/comparison2.png" alt="Performance comparison of discrete and linear sampling" width="522" height="400" /><p class="wp-caption-text">Performance comparison of the 9-tap Gaussian blur filter with discrete and linear sampling on a Radeon HD5770. The vertical axis is the frames per second (higher is better) and the horizontal axis represents results with various number of blur steps (higher is blurrier).</p></div>
<p>As we can see, the Gaussian filter implemented with linear sampling is about 60% faster than the one implemented with discrete sampling, regardless of the number of blur steps applied to the image. This is roughly proportional to the number of texture fetches saved by using linear filtering.</p>
<h2>Conclusion</h2>
<p>We&#8217;ve seen that implementing an efficient Gaussian blur filter is quite straightforward and that the result is a very fast real-time algorithm, especially when using linear sampling, that can be used as the basis of more advanced rendering techniques.</p>
<p>Even though we concentrated on Gaussian blur in this article, many of the discussed principles apply to most convolution filter types. Most of the theory, including linear sampling, also applies when we need a blurred image of reduced size, as is usually the case for the bloom effect. The only real difference for a reduced-size blurred image is that our center pixel is also a &#8220;double-pixel&#8221;. This means that we have to use a row of the Pascal triangle that has an even number of coefficients, as we would like to linearly sample the middle texels as well.</p>
<p>We&#8217;ve also had a brief insight into the computational complexity of the various techniques and how the filter can be efficiently implemented on the GPU.</p>
<p>The demo application used to measure and compare the discrete and linear sampling methods can be downloaded here:</p>
<h3>Binary release</h3>
<p><strong>Platform:</strong> Windows<br />
<strong>Dependency:</strong> OpenGL 3.3 capable graphics driver<br />
<strong>Download link:</strong> <a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_win32.zip" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_win32.zip?referer=');">gaussian_win32.zip (2.96MB)</a></p>
<h3>Source code</h3>
<p><strong>Language:</strong> C++<br />
<strong>Platform:</strong> cross-platform<br />
<strong>Dependency:</strong> GLEW, SFML, GLM<br />
<strong>Download link:</strong> <a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_src.zip" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_src.zip?referer=');">gaussian_src.zip (5.37KB)</a></p>
<p>P.S.: Sorry for the high minimum requirements of the application, but I would really like to stick to strict OpenGL 3+ demos.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/feed/</wfw:commentRss>
		<slash:comments>52</slash:comments>
		</item>
		<item>
		<title>Instance Cloud Reduction reloaded</title>
		<link>http://rastergrid.com/blog/2010/06/instance-cloud-reduction-reloaded/</link>
		<comments>http://rastergrid.com/blog/2010/06/instance-cloud-reduction-reloaded/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 19:36:38 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[attribute divisor]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[instanced array]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[SFML]]></category>
		<category><![CDATA[texture buffer]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[uniform buffer]]></category>
		<category><![CDATA[vertex buffer]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=251</guid>
		<description><![CDATA[A few months ago I&#8217;ve presented an object culling mechanism that I&#8217;ve named Instance Cloud Reduction (ICR) in the article Instance culling using geometry shaders. The technique targets the first generation of OpenGL 3 capable cards and takes advantage of geometry shaders&#8217; capability to reduce the emitted geometry amount in order to get to a]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 160px"><img src="http://rastergrid.com/blog/wp-content/uploads/2010/02/Nature-2010-02-08-20-20-36-24-150x150.png" alt="" width="150" height="150" /><p class="wp-caption-text">OpenGL 3.3 - Nature</p></div>
<p>A few months ago I presented an object culling mechanism that I named Instance Cloud Reduction (ICR) in the article <a title="Instance culling using geometry shaders" href="http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/">Instance culling using geometry shaders</a>. The technique targets the first generation of OpenGL 3 capable cards and takes advantage of the geometry shader&#8217;s ability to reduce the amount of emitted geometry, resulting in a fully GPU accelerated algorithm that performs view frustum culling on instanced geometry without the need for OpenCL or any other GPU compute API. After the culling step, the reduced set of instance data is fed to the drawing pass in the form of a texture buffer. In this article I will present an improved version of the algorithm that uses instanced arrays, recently introduced in OpenGL 3.3, to optimize it further.</p>
<p><span id="more-251"></span>Let&#8217;s recap the basics of the algorithm before I present the improved technique. Geometry shaders have a very nice feature: they can not only emit a modified version of the input geometry but can also change the number of emitted primitives compared to the number received. This works in both directions, which means that we can decrease the number of primitives as well as increase it. That is what the technique takes advantage of.</p>
<p>In the first pass we feed a simple vertex shader &#8211; geometry shader pair with the instance data of the geometry as if it were the data of point primitives. The vertex shader checks whether the current instance is inside the view frustum and sends the result to the geometry shader. If the instance is visible, the geometry shader outputs the instance data; otherwise it discards it. The primitives emitted by the geometry shader are then captured into a buffer object using transform feedback. A query object is also needed to get the number of instances that passed the view frustum culling. In the drawing pass we use the result of the query to decide how many instances we have to draw, and the captured feedback buffer is used as the instance data.</p>
<div class="wp-caption aligncenter" style="width: 660px"><img src="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_combined.png" alt="" width="650" height="347" /><p class="wp-caption-text">Instance Cloud Reduction - Combined view of Pass 1 + Pass 2</p></div>
<p>This is a very brief description of the culling mechanism so for a complete specification please read the <a title="Instance culling using geometry shaders" href="http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/">original article</a>.</p>
<h3>Motivation</h3>
<p>While Instance Cloud Reduction is a quite robust technique that can greatly simplify and speed up the rendering of large amounts of instanced geometry, its performance is limited by some hardware and API restrictions. The most important ones are the following:</p>
<ul>
<li>Needs an extra rendering pass to perform the culling.</li>
<li>Requires the usage of asynchronous queries to determine the number of visible instances.</li>
<li>Uses texture fetching in the vertex shader of the actual drawing pass.</li>
</ul>
<p>The first drawback means that additional draw commands are required, and that the drawing pass uses the output of the culling pass as its input. Together with the second drawback, this may cause stalls because the CPU has to wait for the data to be ready before issuing the second pass, so the GPU is not used effectively.</p>
<p>The improvement presented here tries to solve the third problem. Texture fetching itself is quite fast on the latest generation of hardware; however, it still causes some slowdown due to the latency of the fetches, even though GPUs use latency hiding techniques.</p>
<p>Instanced arrays provide us a way to replace texture fetching with vertex fetching, which is usually done by a different hardware unit that works synchronously with the execution of vertex shaders. I expected quite a reasonable speedup from taking advantage of instanced arrays; however, we will see that the actual results were far from my initial expectations.</p>
<h3>Implementation</h3>
<p>With traditional vertex fetching, one element is fetched from each enabled input attribute buffer and the vertex shader is invoked with these values. One element in a vertex attribute buffer can hold up to four floating point or integer values, and each execution of the vertex shader uses one set of these elements. An internal counter is incremented after each fetch, and the next vertex attribute fetch uses this counter as an index into the buffer object.</p>
<p>While this mechanism is satisfactory for most attributes of a vertex, it is not practical for instance data, as such data belongs to an instance rather than to a vertex. Sourcing instance data from vertex attributes with traditional vertex fetching requires a large amount of redundant storage, because the same information has to be repeated for all the vertices belonging to a particular instance. This is not just a waste of memory but also a waste of bandwidth, and it also defeats the goal of Instance Cloud Reduction.</p>
<p>Compared to traditional vertex fetching, instanced arrays provide a different way to advance the internal counter used as the index into the vertex attribute buffer. In particular, one can set the frequency of the increase using a vertex attribute divisor that specifies after how many instances the counter is incremented. This is a per-attribute property, and by setting it to one we end up with exactly what we need: one vertex fetch per instance.</p>
<p>This means that we need just a very minor change compared to the original technique: we replace our texture buffer with a vertex attribute buffer that has a divisor of one and use it as the source of instance data in the vertex shader of the drawing pass.</p>
<h3>Execution results</h3>
<p>As we are not talking about a new technique but just an optimized implementation of the same method, the best way to evaluate it is by comparing the performance of the new version with the original one.</p>
<p>As I&#8217;ve mentioned earlier, I expected a reasonable performance increase from replacing texture fetches with vertex fetches, but in practice the difference was not that significant. However, the performance difference between the two implementations can depend heavily on the underlying hardware, so cards from different vendors and GPU generations may behave quite differently. In fact, even driver versions may have an effect on the results.</p>
<div class="wp-caption aligncenter" style="width: 620px"><img class="  " src="http://rastergrid.com/blog/wp-content/uploads/2010/06/comparison.png" alt="" width="610" height="139" /><p class="wp-caption-text">Performance comparison of the old implementation and the presented one on an AMD Radeon HD5770. Scale is in frames per second (higher value is better).</p></div>
<p>Due to the lack of hardware to test on, I&#8217;ve checked with only one card, namely a Radeon HD5770 with Catalyst 10.6 drivers. I noticed roughly a 10% speedup, as the new version of the Nature demo ran at 100 FPS compared to the 90 FPS observed with the old implementation.</p>
<p>Even though this was not exactly the outcome I expected from the new implementation, the assumption may still hold for older generations of GPUs or for NVIDIA cards. I suspect so because on Shader Model 4.0 cards the hardware implementations of the texture fetching unit and the vertex fetching unit were most probably more differentiated than on the latest GPUs. My guess is also that the difference may be larger on NVIDIA cards, as the vertex fetching hardware in SM 4.0 GeForce cards is less flexible than AMD&#8217;s, taking into consideration that the first HD series Radeons already had some form of tessellation functionality that requires more freedom from the vertex pushing hardware.</p>
<p>In order to get a better picture about how effective the presented optimization is, I would like to ask all the visitors of this post to try the two releases and send me feedback about it.</p>
<h3>Conclusion</h3>
<p>We&#8217;ve seen how easy it was to take advantage of instanced arrays in an existing implementation of the ICR technique and how it performs on the latest generation of GPUs compared to the previous version. While this small addition provides some benefits, it also comes at a cost, and we have to talk about that as well.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Eliminates the need for texture fetching in the vertex shader thus improving performance.</li>
<li>Does not compromise the goal and the implementation architecture of the original method.</li>
<li>Frees up one texture unit that was previously reserved for the texture buffer containing the instance data.</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li>Requires OpenGL 3.3 or the <a title="GL_ARB_instanced_arrays" href="http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/instanced_arrays.txt?referer=');">GL_ARB_instanced_arrays</a> extension in addition to the OpenGL 3.2 features.</li>
<li>We have to possibly sacrifice multiple vertex input attributes to feed the instance data to the shaders.</li>
</ul>
<p>Most of the mentioned benefits and drawbacks are self-explanatory, however I would like to say a few words about the last mentioned one&#8230;</p>
<p>For the purpose of the showcase I used a simple translation as instance data, which means a single vector of floats. In real-life situations one may need more complex transformation data that can only be stored in a matrix. While in the demo the instance data consumed only one vertex attribute slot, a full transformation matrix would require four of them (not to mention other possible instance attributes). As the maximum number of input attributes is severely limited, usually to 16, the optimization is restricted to situations where all the vertex and instance attributes fit into this limit.</p>
<p>In the original implementation, where a texture buffer was used as input, this did not cause any problem, as the vertex shader is free to fetch any number of texels from it (still, performance can be a concern in this case). When input attribute slots are at a premium in real-life scenarios, it is recommended to use quaternions instead of transformation matrices, as they consume half as many attribute slots. Actually, this can be a general recommendation, as using quaternions decreases the bandwidth requirements of the instance data fetch, thus increasing performance even when there are enough input attribute slots available.</p>
<p>To make the performance comparison easier for you, download links for both versions of the Nature demo are provided below.</p>
<h3>Old version binary release</h3>
<p><strong>Platform:</strong> Windows<br />
<strong>Dependency:</strong> OpenGL 3.2 capable graphics driver<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature12_win32.zip">nature12_win32.zip (3.58MB)</a><br />
<strong>Comments:</strong> This version does <strong>NOT </strong>include the optimization presented in this article.</p>
<h3>Old version source code</h3>
<p><strong>Language:</strong> C++<br />
<strong>Platform:</strong> cross-platform<br />
<strong>Dependency:</strong> GLEW, SFML, GLM<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature12_src.zip">nature12_src.zip (12.6KB)</a><br />
<strong>Comments:</strong> This version does <strong>NOT </strong>include the optimization presented in this article.</p>
<h3>New version binary release</h3>
<p><strong>Platform:</strong> Windows<br />
<strong>Dependency:</strong> OpenGL 3.3 capable graphics driver<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature20_win32.zip">nature20_win32.zip (3.58MB)</a><br />
<strong>Comments:</strong> This version includes the optimization presented in this article.</p>
<h3>New version source code</h3>
<p><strong>Language:</strong> C++<br />
<strong>Platform:</strong> cross-platform<br />
<strong>Dependency:</strong> GLEW, SFML, GLM<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature20_src.zip">nature20_src.zip (12.8KB)</a><br />
<strong>Comments:</strong> This version includes the optimization presented in this article.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/06/instance-cloud-reduction-reloaded/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Going mobile with OpenGL ES</title>
		<link>http://rastergrid.com/blog/2010/04/going-mobile-with-opengl-es/</link>
		<comments>http://rastergrid.com/blog/2010/04/going-mobile-with-opengl-es/#comments</comments>
		<pubDate>Sun, 18 Apr 2010 16:34:53 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[Telecommunication]]></category>
		<category><![CDATA[Android]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[mobile technology]]></category>
		<category><![CDATA[Objective-C]]></category>
		<category><![CDATA[OpenAL]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[OpenGL ES]]></category>
		<category><![CDATA[phone]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=230</guid>
		<description><![CDATA[Many things have changed since the first time the public put their hands on the first mobile phone device as these days the end user rarely makes their choices when buying a mobile equipment based on their telephony capabilities. In fact, nowadays these devices are one of the most popular entertainment platforms out there. The]]></description>
			<content:encoded><![CDATA[
<p>Many things have changed since the public first got their hands on a mobile phone, and these days end users rarely choose a mobile device based on its telephony capabilities. In fact, nowadays these devices are among the most popular entertainment platforms out there. The main problem for application developers used to be that these platforms were very heterogeneous, both in hardware architecture and in API support. Meanwhile things have changed. While the underlying hardware still varies a lot from device to device, the work of application developers has been eased by cross-platform mobile operating systems and open standards, in particular OpenGL ES, the embedded version of the popular graphics API. In this article I would like to talk about some of the big players of the mobile OS industry and about using OpenGL ES for creating impressive mobile applications.</p>
<p><span id="more-230"></span>The first version of the OpenGL ES specification was released in order to provide a lightweight API for embedded graphics using a well-defined subset of the functionality provided by the desktop version of OpenGL. While the specification has been out for quite a while, wide adoption in the industry and strong interest from application developers came only recently. Currently, we have several mobile platforms that come with 3D accelerators and provide a set of features via OpenGL ES that enables developers to create games that weren&#8217;t possible even on desktop platforms about ten years ago.</p>
<h3>Going 3D on mobiles</h3>
<p>Those who know me know well that I have always been interested in graphics, especially when it is used for entertainment purposes. In particular, I have wanted to develop video games ever since I first got my hands on a computer. This is no different now: I&#8217;m writing about OpenGL ES and mobile platforms because I got interested in creating games for mobile phones.</p>
<p>As I&#8217;ve already mentioned, the problem with developing for mobile devices is the variety of hardware and software platforms that they are built on. For somebody who is already familiar with desktop OpenGL, having OpenGL ES in the tool-set eliminates some of the burden I would otherwise have to face.</p>
<p>Things have also changed a lot when it comes to application platforms. Nowadays, we have just a few big players in the mobile OS industry, which eases the work of developers. More precisely, an application developer who plans to go mobile and would like to reach the biggest audience can limit their efforts to the following platforms:</p>
<ul>
<li><strong>iPhone OS</strong> &#8211; This is the one that drives Apple&#8217;s iPhone mobile devices as well as the iPod Touch. It provides an application platform similar to the one Mac developers are used to. It can be said that this platform is the most popular in the industry, especially for gaming applications.</li>
<li><strong>Android</strong> &#8211; This is the newest player in the field, brought by Google. While it&#8217;s a newcomer in the industry, it has already captured the attention of tons of developers. We can say that currently Android and iPhone are dictating the direction of mobile entertainment.</li>
<li><strong>Symbian OS</strong> &#8211; Symbian has the largest share in most markets worldwide, but it is still not that popular in the mobile gaming industry. It is the operating system running on most of today&#8217;s Nokia phones.</li>
<li><strong>Windows Mobile</strong> &#8211; Microsoft&#8217;s product built on Windows CE, the company&#8217;s embedded operating system.</li>
<li><strong>RIM Blackberry OS</strong> &#8211; Operating system primarily designed for the business industry.</li>
</ul>
<p>While most of these mobile operating systems are built on the same design concepts, it is very difficult for the developer to create cross-platform applications for all of them, as they differ in language and tool-set support, which minimizes the possibilities for code reuse. Unfortunately, this goes against one of the most important rules of mobile development: maximize portability.</p>
<p>It is not 100% true that there is no way to achieve good portability across all these platforms, but if we choose this direction we are limited to two possibilities: cross-platform Java applications and web-based applications. While these seem to be excellent alternatives to native programming, they severely limit the developer in creating applications that take full advantage of the underlying hardware. This is where OpenGL ES comes into the picture, as all these platforms support the API, thus providing at least some possibility for code reuse when dealing with entertainment applications.</p>
<p>Now, I would like to continue with talking about the two platforms that I&#8217;m most interested in.</p>
<h3>iPhone OS</h3>
<p>I started to get involved in iPhone game development because one of my friends pushed me to after seeing the great success of his brother-in-law, <a title="zhooley's iPhone applications" href="http://www.zhooley.hu/iphone/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.zhooley.hu/iphone/?referer=');">zhooley</a>, who had some great titles. I don&#8217;t have a Mac to develop on yet, but I have already read some material about iPhone development. This is where the following information comes from.</p>
<p>iPhone is currently the most important platform for mobile application developers. It became such an important factor in the industry thanks to Apple&#8217;s App Store. Previously there was little to no way for end users to extend the software on their phones so easily. While this is good for the end user, it is maybe even better for application developers, as the App Store provides them with quite a large audience.</p>
<p>The secret why iPhone is an excellent gaming platform lies in the palette of features that the phone hardware and the software frameworks provide. Just to mention the most important ones:</p>
<ul>
<li>Touch screen control with support for multi-touch events capturing the movement of up to five fingers.</li>
<li>Three accelerometers for tracking the spatial movement and direction of the device in all axes.</li>
<li>MVC inspired GUI framework for enhanced productivity.</li>
<li>Support for several industry standard APIs like OpenGL ES, OpenAL and much more.</li>
</ul>
<p>But that&#8217;s enough of the general talk, let&#8217;s see what OpenGL ES support looks like on the iPhone&#8230;</p>
<p>As far as I can tell, not being an iPhone owner, the graphics hardware of the phone comes in the form of PowerVR accelerators: MBX and SGX.</p>
<p>The PowerVR MBX is a tile-based deferred renderer that is suitable for most 3D applications. It supports OpenGL ES 1.1, which is roughly equivalent to OpenGL 1.5, so it has only fixed function capabilities; however, that is usually enough for most mobile applications. Also note that it has a very limited amount of texture memory, 24MB.</p>
<p>The PowerVR SGX is a more powerful processor that also supports OpenGL ES 2.0, roughly equivalent to OpenGL 2.0, but has optimized fixed function shaders that provide flawless backward compatibility for OpenGL ES 1.1 applications.</p>
<p>Most importantly, all iPhones are able to do floating point math natively and efficiently. This is an important factor when dealing with OpenGL applications, as the use of fixed point types can be quite a burden for developers, especially for those migrating from desktop development.</p>
<p>Additionally, the OpenGL ES implementation on iPhone provides some nice extensions like <a title="GL_OES_framebuffer_object" href="http://www.khronos.org/registry/gles/extensions/OES/OES_framebuffer_object.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.khronos.org/registry/gles/extensions/OES/OES_framebuffer_object.txt?referer=');">GL_OES_framebuffer_object</a>, <a title="GL_OES_compressed_paletted_texture" href="http://www.khronos.org/registry/gles/extensions/OES/OES_compressed_paletted_texture.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.khronos.org/registry/gles/extensions/OES/OES_compressed_paletted_texture.txt?referer=');">GL_OES_compressed_paletted_texture</a> and <a title="GL_OES_point_sprite" href="http://www.khronos.org/registry/gles/extensions/OES/OES_point_sprite.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.khronos.org/registry/gles/extensions/OES/OES_point_sprite.txt?referer=');">GL_OES_point_sprite</a>. Also, thanks to the iPhone simulator that comes with the SDK, it is easy to test the application during development without an actual device. Still, one important thing to mention is that the iPhone simulator has different OpenGL ES capabilities than the actual hardware, and performance measured on the simulator should not be taken as valid, because the simulator does not really simulate the graphics hardware but only the software platform.</p>
<p>iPhone development is done using the Cocoa API and preferably Objective-C; however, C, C++ and Objective-C++ can also be used. One just has to interface with the Cocoa API somehow, and the rest can be done in almost any native programming language. That is one of the key advantages of the iPhone platform: one can develop native applications, with no need for Java or web-based solutions.</p>
<p>While iPhone may seem to be a perfect choice as a mobile game platform, we should not forget about one big disadvantage: one cannot legally develop iPhone applications without owning a Mac.</p>
<h3>Android</h3>
<p>The Android platform was suggested by one of my workmates who had just bought a Droid. That phone is actually a device capable of competing with the iPhone in both features and performance.</p>
<p>Android was the big hit of last year, and my forecast is that it will be one of the most relevant platforms of the upcoming years. Google adopted Apple&#8217;s idea and also created an open market for software that the end user can easily download and install on their device. This is the Android Market, which can easily become a powerful competitor of the App Store.</p>
<p>While, as I said earlier, the Motorola Droid, as an example, supports about the same feature set that makes the iPhone an excellent gaming platform, this cannot be said about most phones running Android. This is maybe one of the biggest disadvantages of the Android platform. However, we can also take it as an advantage, as it makes it possible for more phones to adopt this operating system.</p>
<p>As the Android operating system runs on various phones from different vendors with different hardware capabilities, there isn&#8217;t much to say about the graphics hardware, except that some devices not only lack a graphics accelerator but also lack floating point support. This is another disadvantage, as it forces developers either to stick to fixed point math in their OpenGL ES applications to maximize portability or to maintain two different rendering paths.</p>
<p>Originally, Android supported only OpenGL ES 1.0, which is roughly equivalent to OpenGL 1.3. However, since NDK r3 there is also OpenGL ES 2.0 support. The feature set here varies much more, both in hardware and in extension support.</p>
<p>Development for Android is done in Java using a proprietary SDK for accessing the Android API. The SDK comes with an emulator that works fine, except for the long initial boot time, which really surprised me when I first tried it out.</p>
<p>One advantage of the SDK is that it can be used on virtually any operating system, so application developers can work on Windows, Linux, Mac OS X or other platforms. There is also a nice Eclipse plugin that makes application development for Android even easier. That&#8217;s why I started with this one.</p>
<p>Just to illustrate how easy it is to put together a working demo with a good SDK, I&#8217;ve created a simple rotating box app to demonstrate OpenGL ES usage on Android. From installation to a working application it took no more than two hours. You can find the download links for both the source code and the binary release at the end of the article.</p>
<h3>Why mobile games?</h3>
<p>I am a person who was, is and will be interested in developing computer games. Previously, I worked with desktop platforms, and when I was 10 years old it was satisfying to put together some simple 2D game, but not anymore.</p>
<p>I had always planned to create a state-of-the-art game engine and use it for some game, as most people like me do, but the efforts of one person are simply insufficient to compete with the players in the industry. Even though I feel capable of writing such an engine, it would take more time than I have now that I am working. Even if I managed to accomplish it in a year or two, the problem of content creation would come into the picture. For an AAA PC game, content creation takes several times longer than the actual programming, and here I even lack the knowledge to do it. Mobile game creation, on the other hand, is a much shorter process in which you can get actual results in a matter of weeks, which is far better than PC game creation.</p>
<p>Also, I would never use third party game engines, except for some basic libraries like OpenGL, a physics library and things like that, because otherwise I wouldn&#8217;t feel that the results were my own creation.</p>
<p>Having game development as a hobby works well during high school and university, but it gets quite difficult once you are out there in the world with a job and responsibilities. Maybe I should have taken the time earlier to develop something concrete for PC but, as most fellow hobbyists know, you usually end up with hundreds of unfinished projects.</p>
<p>While I will never forget about desktop platforms and will actively keep up with the evolution of the industry, mobile application development has opened another world for me in which I can express myself.</p>
<h3>HelloAndroid Demo</h3>
<p>Source code: <a href="http://rastergrid.com/blog/wp-content/uploads/2010/04/files/helloandroid_src.zip">helloandroid_src.zip</a><br />
Binary release: <a href="http://rastergrid.com/blog/wp-content/uploads/2010/04/files/HelloAndroid.apk">HelloAndroid.apk</a></p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/04/going-mobile-with-opengl-es/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Instance culling using geometry shaders</title>
		<link>http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/</link>
		<comments>http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/#comments</comments>
		<pubDate>Mon, 08 Feb 2010 22:58:53 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[SFML]]></category>
		<category><![CDATA[texture buffer]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[uniform buffer]]></category>
		<category><![CDATA[vertex buffer]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=135</guid>
		<description><![CDATA[Since the appearance of Shader Model 4.0 people wonder how to take advantage of the newly introduced programmable pipeline stage. The most important feature enabled by geometry shaders is that one can change the amount of emitted primitives inside the pipeline. The first thing that a naive developer would try to do with it is]]></description>
			<content:encoded><![CDATA[
<div id="attachment_136" class="wp-caption alignleft" style="width: 160px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/02/Nature-2010-02-08-20-20-36-24.png"><img class="size-thumbnail wp-image-136  " title="Nature demo screenshot" src="http://rastergrid.com/blog/wp-content/uploads/2010/02/Nature-2010-02-08-20-20-36-24-150x150.png" alt="Nature demo screenshot" width="150" height="150" /></a><p class="wp-caption-text">OpenGL 3.2 - Nature</p></div>
<p>Since the appearance of Shader Model 4.0, people have wondered how to take advantage of the newly introduced programmable pipeline stage. The most important feature enabled by geometry shaders is that one can change the number of emitted primitives inside the pipeline. The first thing a naive developer would try to do with it is geometry tessellation. However, the new shader stage performs very badly when used for tessellation in a real-life scenario, even though there are demos showcasing this possibility. If we take a closer look at the new feature, we see that the most revolutionary thing about it is not that it can raise the number of emitted primitives but that it can discard them. This article presents a rendering technique that takes advantage of this aspect of geometry shaders to enable GPU-accelerated culling of higher-order primitives.</p>
<p><span id="more-135"></span>Geometry shaders can be used for many advanced rendering techniques that were impossible before the introduction of this flexible programmable shader stage. In this article I would like to present one use case that seemed to me to be one of the most practical applications of the primitive manipulation possibilities introduced by geometry shaders. As I haven&#8217;t seen any whitepaper discussing this particular technique specifically, even though some of them implicitly use it, I dare to name the technique myself: <strong>Instance Cloud Reduction</strong>. I will also present a demo program that shows how to take advantage of the technique in a heavy workload situation.</p>
<p>The idea itself was inspired by AMD&#8217;s tech demo for the Radeon 4800 series cards called <a title="March of the Froblins" href="http://developer.amd.com/samples/demos/pages/froblins.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/samples/demos/pages/froblins.aspx?referer=');">March of the Froblins</a>. A technique almost identical to the one presented in this article is used in that demo to cull a large number of animated creatures against the view frustum. A somewhat similar technique is also used in NVIDIA&#8217;s <a title="Skinned Instancing" href="http://developer.download.nvidia.com/SDK/10/direct3d/samples.html" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.download.nvidia.com/SDK/10/direct3d/samples.html?referer=');">Skinned Instancing</a> demo for determining LOD instance sets. Unfortunately, both demos are for DirectX only and, as far as I can tell, there is no OpenGL demo showing any of the aforementioned rendering techniques.</p>
<h3>Motivation</h3>
<p>Nowadays, as the computational capabilities of GPUs are growing at a much faster pace than those of CPUs, graphics developers face more and more optimization problems related to CPU-bound applications. More and more focus is put on minimizing the number of driver invocations; in fact, that is what motivated the restructuring of the two most commonly used graphics APIs. As a result we now have DirectX 10+ and OpenGL 3+. However, even with the introduction of geometry instancing, texture arrays and buffer storage in local memory for the most important rendering inputs, graphics programmers still need to make wise decisions to take full advantage of the horsepower of the latest GPUs.</p>
<p>Earlier graphics applications relied strongly on CPU-based culling techniques, whether the now quite outdated BSPs or the more generic and still heavily used hierarchical culling techniques. We&#8217;ve already reached the point where even the most efficient CPU-based culling techniques sometimes seem too expensive and usually introduce the small batch problem. Instanced rendering is no exception.</p>
<p>The applicability of geometry instancing is strongly limited by several factors. One of the most important is the culling of instanced geometry. One may choose to cull these objects in the same fashion as others, using the CPU, but that usually breaks the batch, and we may lose the benefits of geometry instancing. A GPU-based alternative is needed more and more urgently. Without CPU-based culling, sending the whole set of instances down the graphics pipeline may choke the vertex processor if we have high-poly geometry and a large number of instances of it.</p>
<p>The rendering technique presented in this article tries to provide such an alternative. We will use a multi-pass technique that, in the first pass, culls the object instances against the view frustum using the GPU and, in the second pass, renders only those instances that are likely to be visible in the final scene. This way we can drastically reduce the amount of vertex data sent through the graphics pipeline.</p>
<h3>Implementation</h3>
<p>To some people the promise of such a technique might seem too naive, most probably relying on very exotic OpenGL features, heavy misuse of some basic features, or data conversions during frame rendering. Fortunately, this is not the case, as OpenGL 3.2 has everything we need to implement the object culling method sketched above. All we need is the following:</p>
<ul>
<li>instanced rendering (core since OpenGL 3.1)</li>
<li>geometry shaders (core since OpenGL 3.2)</li>
<li>transform feedback (core since OpenGL 3.0)</li>
<li>uniform or texture buffers (core since OpenGL 3.1)</li>
</ul>
<p>The method itself is a multi-pass rendering technique. However, unlike other multi-pass techniques, it does not produce any fragments in the first pass. Instead, the first pass performs the view frustum culling and processes data entirely inside buffer objects.</p>
<h3>Culling pass</h3>
<p>In the first pass we feed the graphics pipeline with the information about the instances that is needed to perform the view frustum culling. The shaders need two inputs to perform the required calculations:</p>
<ol>
<li><strong>Instance transformation data</strong> (whether it be a simple transformation matrix or quaternions or whatever) -- This preferably comes from one or more buffer objects that are bound as vertex buffers to the context.</li>
<li><strong>Object extents information</strong> -- Besides the instance positions we have to know the extents of an instance in order to perform correct culling. This can be either a single float representing the object radius, if we choose to use bounding spheres for the culling, or a three-dimensional extent vector, if we would like to use bounding boxes.</li>
</ol>
<p>Using these inputs, we feed the instance transformation data to our culling shader as attributes of point primitives. The culling shader is composed of a vertex and a geometry shader. In a typical setup their roles are the following: the vertex shader determines whether the bounding volume of the current object instance is inside the view frustum and passes a visibility flag to the geometry shader. The geometry shader emits the instance data to the destination buffer if the flag says that the instance is likely to be visible, and emits nothing if the instance was determined to be out of view.</p>
<p>Next, transform feedback is used to capture the primitives emitted by the geometry shader into another buffer object, which will be used in the actual rendering pass as the source of instance transformation data. Besides this, we also need an asynchronous query that counts the generated primitives, so that we know how many instances of the object we actually need to render. The following figure shows the workflow of the first pass:</p>
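<p>To make this division of work more concrete, here is a minimal CPU-side sketch in C++ of the same logic: a bounding sphere test against six frustum planes (the job of the vertex shader) followed by compaction of the surviving instances (the job of the geometry shader together with transform feedback). It is only an illustration and not the shader code of the demo. The names Plane, Instance, sphereInFrustum, cullInstances and makeBoxFrustum are hypothetical, and the sketch assumes plane normals that point into the frustum and a single radius shared by all instances.</p>
<pre class="brush: cpp">#include &lt;array&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// Plane nx*x + ny*y + nz*z + d = 0 with the normal pointing into the frustum (assumption).
struct Plane { float nx, ny, nz, d; };

// Instance position in xyz; w is unused here and only mirrors the vec4 layout.
struct Instance { float x, y, z, w; };

// Vertex shader role: true if the bounding sphere is at least partly inside all six planes.
inline bool sphereInFrustum(const std::array&lt;Plane, 6&gt;&amp; frustum,
                            const Instance&amp; inst, float radius) {
    for (const Plane&amp; p : frustum) {
        const float dist = p.nx * inst.x + p.ny * inst.y + p.nz * inst.z + p.d;
        if (dist &lt; -radius) return false; // sphere lies entirely behind this plane
    }
    return true;
}

// Geometry shader + transform feedback role: copy only the visible instances.
inline std::vector&lt;Instance&gt; cullInstances(const std::array&lt;Plane, 6&gt;&amp; frustum,
                                           const std::vector&lt;Instance&gt;&amp; instances,
                                           float radius) {
    std::vector&lt;Instance&gt; visible;
    visible.reserve(instances.size());
    for (const Instance&amp; inst : instances) {
        if (sphereInFrustum(frustum, inst, radius)) visible.push_back(inst);
    }
    return visible;
}

// Hypothetical helper for experiments: a box-shaped "frustum" covering [-h, h] on each axis.
inline std::array&lt;Plane, 6&gt; makeBoxFrustum(float h) {
    const std::array&lt;Plane, 6&gt; planes = {{
        {  1.0f,  0.0f,  0.0f, h }, { -1.0f,  0.0f,  0.0f, h },
        {  0.0f,  1.0f,  0.0f, h }, {  0.0f, -1.0f,  0.0f, h },
        {  0.0f,  0.0f,  1.0f, h }, {  0.0f,  0.0f, -1.0f, h }
    }};
    return planes;
}</pre>
<p>The size of the vector returned by cullInstances corresponds to the primitive count that the GPU version has to obtain separately, and the vector itself corresponds to the buffer object that holds the culled instance data.</p>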
<div id="attachment_146" class="wp-caption aligncenter" style="width: 460px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_pass1.png"><img class="size-full wp-image-146" title="Culling pass" src="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_pass1.png" alt="Culling pass" width="450" height="200" /></a><p class="wp-caption-text">Instance Cloud Reduction - Pass 1: Culling</p></div>
<p>The geometry shader that performs the actual culling, based on the view frustum check done by the vertex shader, should look like the following:</p>
<pre class="brush: c">#version 150 core

layout(points) in;
layout(points, max_vertices = 1) out;

in vec4 OrigPosition[1];
flat in int objectVisible[1];

out vec4 CulledPosition;

void main() {

	/* only emit primitive if the object is visible */
	if ( objectVisible[0] == 1 )
	{
		CulledPosition = OrigPosition[0];
		EmitVertex();
		EndPrimitive();
	}
}</pre>
<p>In this example we used just a four-component position vector for the instance transformation data, but the technique works well for transformation matrices and quaternions as well.</p>
<p>One more thing: besides setting up transform feedback to fill the buffer object dedicated to the culled instance data, and starting an asynchronous query to determine the number of primitives written into that buffer object, it is also useful to turn off rasterization (by enabling GL_RASTERIZER_DISCARD), as we don&#8217;t want to produce any fragments in the first pass.</p>
<h3>Rendering pass</h3>
<p>In the second pass there is nothing special to do. Simply use whatever rendering setup you would like to use. The only things that change in this step compared to your existing rendering path are that the instance data must be sourced from the generated culled instance data buffer and, as a result, the instance count passed to the instanced drawing functions must be changed in order to render only the visible instances. This number can be read from the result of the asynchronous query that we started in the first pass.</p>
<p>The instance data in the rendering pass can, of course, be sourced from either a uniform or a texture buffer object. This depends on the actual use case and is explained in more detail in the article <a href="http://rastergrid.com/blog/2010/01/uniform-buffers-vs-texture-buffers/">Uniform Buffers VS Texture Buffers</a>.</p>
<p>An important note: when dealing with several instanced geometries, it is recommended to run the culling phase for all of them before rendering any instanced primitives, for the following reasons:</p>
<ul>
<li>The result of the first instance cloud&#8217;s culling is more likely to be finished on the GPU so no sync issues arise from reading the asynchronous query result to determine the number of visible instances.</li>
<li>Probably fewer state changes are needed, as the two passes require very different setups.</li>
<li>Results in tidier renderer design as culling is clearly separated from actual rendering.</li>
</ul>
<p>Putting everything together, the application of the presented technique would result in the following workflow on the GPU:</p>
<div id="attachment_150" class="wp-caption aligncenter" style="width: 660px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_combined.png"><img class="size-full wp-image-150" title="Instance Cloud Reduction" src="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_combined.png" alt="Instance Cloud Reduction" width="650" height="347" /></a><p class="wp-caption-text">Instance Cloud Reduction - Combined view of Pass 1 + Pass 2</p></div>
<h3>Conclusion</h3>
<p>We&#8217;ve seen that the presented rendering technique can help in situations where we have to deal with a large number of instanced geometries, and how to take advantage of the latest features of graphics cards and OpenGL to perform view frustum culling calculations on the GPU. This saves us from having to deal with complicated and expensive CPU-based object culling methods that break the drawing batches, especially when dealing with dynamic objects. To ease the decision on whether to incorporate this technique into your rendering engine, I would like to present its advantages and disadvantages.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Heavily reduces the amount of processed data compared to a naive implementation.</li>
<li>No need for any space partitioning methods in the host application to handle the culling of dynamic objects.</li>
<li>Can handle a huge number of instanced objects due to the enormous horsepower of today&#8217;s GPUs.</li>
<li>Scales well with an increased number of instances, as the per-instance calculation cost is relatively low.</li>
<li>Relies strictly on OpenGL 3.2 core features.</li>
<li>No need for OpenCL capable hardware.</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li>Needs an extra rendering pass to perform the culling.</li>
<li>Requires the usage of asynchronous queries to determine the number of visible instances.</li>
</ul>
<p>I hope you agree with me and think of this technique as one more step towards fully GPU-based scene management. If you have any remarks or improvement ideas regarding the rendering technique itself, feel free to tell me.</p>
<h3>The Demo</h3>
<p>As I promised, the technique presented above comes with a live demo, which actually took most of the time I dedicated to this blog in the last two weeks. The demo itself is more of a technical showcase than a presentation of a real-life use case.</p>
<p>First of all, I used high polygon count models for the rendering to emphasize how much valuable GPU time the culling phase saves. In a real-world application one would never do something like this. As a result, the demo is more like a benchmark than an interactive application. However, it may perform pretty well on high-end graphics cards.</p>
<p>The demo scene consists of two object types: trees and grass blocks. The tree model is further divided into two parts, as they need different textures: the tree trunk and the tree foliage. Obviously, this additional burden can be avoided by using texture arrays, which remove the need for separate draw calls to render the trunk and the foliage.</p>
<p>The tree trunk consists of 33138 triangles, the tree foliage has 16069 triangles, and the grass block, which uses real geometry without any faking, consists of 8961 triangles. I had to model the grass block myself as I didn&#8217;t find any suitable model. This modeling step actually consumed a considerable part of the time I spent on the demo, as I&#8217;m not an expert in this domain. As you can see, these models are not the kind one would use in an interactive real-time application like a game. However, they seemed very suitable for the purpose of the demonstration.</p>
<p>What really pushes the limits of GPUs is that the demo renders 10,000 trees and 250,000 grass blocks using instancing. This adds up to more than <strong>2.7 billion triangles</strong> in the scene, far more than a GPU can handle without the aid of some scene management and culling. However, we will use no scene management at all, and the only culling method we will use is the one presented in this article.</p>
<p>The actual results are quite promising. The view frustum culling step usually saves more than <strong>99.9%</strong> of the GPU horsepower, as the number of triangles actually rendered after the culling step is far below 2 million. This is still quite a lot, but as we use high polygon count models and no LOD techniques, it seems reasonable.</p>
<p>Even if the demo scene statistics don&#8217;t look like a typical use case, the ease of implementation and the compelling visual results pleased me anyway:</p>
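<p>The figures above can be checked with a few lines of C++. This is just the arithmetic behind the quoted numbers, using the triangle and instance counts stated in the text; the constant and function names are hypothetical, and 2 million rendered triangles is taken as the upper bound mentioned above.</p>
<pre class="brush: cpp">#include &lt;cstdint&gt;

// Triangle counts per model and instance counts, as quoted in the text.
constexpr std::uint64_t kTrunkTriangles   = 33138;
constexpr std::uint64_t kFoliageTriangles = 16069;
constexpr std::uint64_t kGrassTriangles   = 8961;
constexpr std::uint64_t kTreeInstances    = 10000;
constexpr std::uint64_t kGrassInstances   = 250000;

// Total number of triangles that would be submitted if nothing were culled.
constexpr std::uint64_t sceneTriangles() {
    return kTreeInstances * (kTrunkTriangles + kFoliageTriangles)
         + kGrassInstances * kGrassTriangles;
}

// Fraction of the scene that is culled when `rendered` triangles remain.
constexpr double culledFraction(std::uint64_t rendered) {
    return 1.0 - static_cast&lt;double&gt;(rendered) / static_cast&lt;double&gt;(sceneTriangles());
}

// 10,000 * 49,207 + 250,000 * 8,961 = 492,070,000 + 2,240,250,000
static_assert(sceneTriangles() == 2732320000ULL, "more than 2.7 billion triangles");</pre>
<p>With at most 2 million triangles left after culling, culledFraction(2000000) is about 0.9993, which matches the claim of more than 99.9%.</p>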
<p style="text-align: center;"><span class="youtube">
<object width="640" height="480">
<param name="movie" value="http://www.youtube.com/v/srbOFTLTe8k?color1=3a3a3a&amp;color2=999999&amp;border=0&amp;fs=1&amp;hl=en&amp;modestbranding=1&amp;loop=&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;rel=1&amp;hd=1" />
<param name="allowFullScreen" value="true" />
<embed wmode="opaque" src="http://www.youtube.com/v/srbOFTLTe8k?color1=3a3a3a&amp;color2=999999&amp;border=0&amp;fs=1&amp;hl=en&amp;modestbranding=1&amp;loop=&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;rel=1&amp;hd=1" type="application/x-shockwave-flash" allowfullscreen="true" width="640" height="480"></embed>
<param name="wmode" value="opaque" />
</object>
</span><p><a href="http://www.youtube.com/watch?v=srbOFTLTe8k&fmt=18" onclick="pageTracker._trackPageview('/outgoing/www.youtube.com/watch?v=srbOFTLTe8k_fmt=18&amp;referer=');">www.youtube.com/watch?v=srbOFTLTe8k</a></p></p>
<p>On my Radeon HD2600XT I achieved 6-7 frames per second, which is acceptable taking into consideration the huge amount of geometry data still passed to the graphics card. On more recent cards I suppose it should run at good frame rates; however, due to the lack of hardware to test on, these are my only results. If anybody manages to take a better screen capture than mine above, please let me know.</p>
<h3>Implementation details</h3>
<p>To say a few words about the techniques and tricks I used while creating the demo, here is a list of the most important ones:</p>
<ul>
<li>As mentioned previously, three models are used with high instance counts, for a total of over 2.7 billion triangles in the scene.</li>
<li>Three 512x512 RGBA textures are used for the models. They are partially handmade and, again, I&#8217;m not a texture artist, so sorry if they don&#8217;t look flawless.</li>
<li>The Wavefront model loader and the TGA image loader that accompany the demo were implemented very roughly, just for the demo, so I would strongly discourage you from using them for any other purpose, as they handle only a subset of the features of the file formats.</li>
<li>The vertex data from the Wavefront model files is transferred in a very naive way, so vertex reuse isn&#8217;t taken into account.</li>
<li>The instance data consists of simple four-component vectors representing the world-space position of the instance. This seemed to be the simplest choice for demonstration purposes.</li>
<li>In the second pass, the instance data is sourced from a texture buffer, though not really because the visible instance count exceeds what would fit in a uniform buffer. I used texture buffers because, for this simple demonstration, they seemed a little easier to integrate.</li>
<li>The morphing effect that simulates blowing wind is done using hard-coded geometry deformation in the vertex shader. It is not physically correct but visually compelling.</li>
<li>The lighting is a simple directional light using Phong&#8217;s shading and reflection model.</li>
<li>Simple fog is simulated with some awkward formula that I&#8217;ve chosen after a few test runs.</li>
<li>Alpha testing is achieved by using the discard operation in the fragment shader.</li>
</ul>
<h3>Driver issues</h3>
<p>During the development of the demonstration program I ran into several driver-related problems, as I had never used the latest OpenGL features so heavily before. I worked with Catalyst 9.12 and 10.1, but both seemed to lack a proper GLSL compiler. Here are some of the issues I met:</p>
<ul>
<li>When I forgot to declare the varyings in the geometry shader as arrays, as the standard requires, the driver did not complain about any syntax error, but the program crashed when it tried to execute the code.</li>
<li>Except for the texture sampler uniform, all uniforms failed to work when used only in the fragment shader, so I put them all in the vertex shader.</li>
<li>For loops did not seem to work inside the geometry shader; that is why the culling itself is done in the vertex shader in the demo.</li>
</ul>
<p>All these problems forced me into nasty tricks to make things work and ended in awful shader code. Sorry for that. At least it now works on my configuration, but I am pretty unsure whether it will work with other graphics card and driver combinations. Please report any success or failure when trying out the demo. In any case, be sure to have the latest graphics drivers installed as, at least in the case of AMD, OpenGL 3.2 drivers only came out in the fall of 2009.</p>
<p><em><strong>Edit:</strong></em></p>
<p><em>Thanks to information from Pierre Boudier of AMD, I&#8217;ve updated both the source and binary releases to support the latest drivers properly. The problem was that I didn&#8217;t use attribute location binding as specified in the standard.</em></p>
<p><em>I also have to mention that with my new Radeon HD5770 I managed to achieve over 90 frames per second, which shows that this technique can in fact be used for games and other interactive applications.</em></p>
<p><em>One more thing at the end. As you know, this version of the Nature demo uses a texture buffer to source instance positions. I plan to create another version that will take advantage of instanced arrays, which became core in OpenGL 3.3. I expect a considerable speedup, as that would eliminate the need for texture fetches in the vertex shader by dedicating a vertex fetcher to the purpose instead, thus increasing the overall performance of the technique.</em></p>
<h3>Binary release</h3>
<p><strong>Platform:</strong> Windows<br />
<strong>Dependency:</strong> OpenGL 3.2 capable graphics driver<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature12_win32.zip" target="_blank">nature12_win32.zip (3.58MB)<br />
</a><strong>Comments:</strong> Includes the update that makes it work even with the latest drivers.</p>
<h3>Full source code</h3>
<p><strong>Language:</strong> C++<br />
<strong>Platform:</strong> cross-platform<br />
<strong>Dependency:</strong> GLEW, SFML, GLM<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature12_src.zip" target="_blank">nature12_src.zip (12.6KB)<br />
</a><strong>Comments:</strong> Sorry for the many dependencies, however, I would recommend the mentioned libraries for everybody who is doing OpenGL development.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/feed/</wfw:commentRss>
		<slash:comments>46</slash:comments>
		</item>
		<item>
		<title>Synchronizable objects for C++</title>
		<link>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/</link>
		<comments>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/#comments</comments>
		<pubDate>Tue, 02 Feb 2010 19:01:56 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Multiprocessing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[lock]]></category>
		<category><![CDATA[macro]]></category>
		<category><![CDATA[multithreading]]></category>
		<category><![CDATA[mutex]]></category>
		<category><![CDATA[OOP]]></category>
		<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[synchronization]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=120</guid>
		<description><![CDATA[Previously I talked about how one can easily take advantage of multiprocessing using OpenMP. Even if the C pragmas introduced by the parallel programming API standard is very straightforward for simple programs, it simply doesn&#8217;t fit nicely in a complex C++ application that is built from the ground with the OOP in mind. To smoothly]]></description>
			<content:encoded><![CDATA[
<p>Previously I talked about how one can easily take advantage of multiprocessing using OpenMP. Even though the C pragmas introduced by this parallel programming API standard are very straightforward for simple programs, they simply don&#8217;t fit nicely into a complex C++ application that is built from the ground up with OOP in mind. To smoothly introduce OpenMP into such projects one needs higher-level constructs that hide the actual implementation details. This is the first article of a series that will try to provide reference implementations of such abstractions. We will start with synchronizable primitives that try to mirror the functionality provided by the &#8220;synchronized&#8221; statement of Java.</p>
<p><span id="more-120"></span>This article is heavily inspired by an article written by <a title="A &quot;synchronized&quot; statement for C++ like in Java" href="http://www.codeproject.com/KB/threads/cppsyncstm.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.codeproject.com/KB/threads/cppsyncstm.aspx?referer=');">Achilleas Margaritis</a><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;"> and mostly follows his ideas. My article tries to provide a portable reference implementation of a slightly modified version of the trick presented by Margaritis, using OpenMP as the multiprocessing API back-end.</span></p>
<h2>Motivation</h2>
<p><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;">According to the OO paradigm, classes, and consequently objects, provide an abstract interface to the underlying internal data or services of the modeled entity or entity class. When it comes to parallel programming, we should provide facilities that make concurrent access to shared resources, in this case objects, safe. Using plain OpenMP can be satisfactory; however, when used extensively, the OpenMP pragmas and API function calls can greatly hurt the readability and maintainability of the code. Moreover, there may be platforms that use other APIs for handling race conditions. It is clear that we need to encapsulate these facilities and provide an abstract tool-set instead.</span></p>
<h2>Implementation</h2>
<p><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;">The very first building block of such a framework can be a mutex class that provides mutually exclusive access to certain resources. In the world of OpenMP it could look something like the following:</span></p>
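<pre class="brush: cpp">#include &lt;omp.h&gt;

// Thin wrapper around an OpenMP lock; the lock lives as long as the object.
class Mutex {
public:
    Mutex() { omp_init_lock(&amp;_mutex); }
    ~Mutex() { omp_destroy_lock(&amp;_mutex); }
    void lock() { omp_set_lock(&amp;_mutex); }      // blocks until the lock is acquired
    void unlock() { omp_unset_lock(&amp;_mutex); }
private:
    omp_lock_t _mutex;
};</pre>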
<p>This already seems enough to build our Java-like &#8220;synchronized&#8221; statement; however, we would like a framework that makes usage as easy and safe as possible. To get closer to this goal we apply the RAII (Resource Acquisition Is Initialization) idiom to create our lock class:</p>
<pre class="brush: cpp">class Lock {
public:
    Lock(Mutex&amp; mutex) : _mutex(mutex), _release(false) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    // Conversion to bool lets the lock act as a loop condition: true until release() is called.
    operator bool() const { return !_release; }
    void release() { _release = true; }
private:
    Mutex&amp; _mutex;
    bool _release;
};</pre>
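<p>To see why the RAII wrapper is worth the extra class, the following minimal sketch contrasts manual locking with the lock class when the protected code throws. It is only an illustration under stated assumptions: the <em>Mutex</em> here is a stand-in that merely counts how often it is held (so no OpenMP runtime is needed), and <em>failingWork</em>, <em>heldAfterManualLocking</em> and <em>heldAfterRaiiLocking</em> are hypothetical names that are not part of omp_sync.h.</p>

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for the OpenMP-backed Mutex (assumption for this sketch):
// it only tracks how many times it is currently held.
class Mutex {
public:
    Mutex() : _held(0) {}
    void lock() { ++_held; }
    void unlock() { --_held; }
    int held() const { return _held; }
private:
    int _held;
};

// The RAII lock from above, reduced to the part that matters here.
class Lock {
public:
    Lock(Mutex& mutex) : _mutex(mutex) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
private:
    Mutex& _mutex;
};

// Hypothetical critical section that fails half-way through.
void failingWork() {
    throw std::runtime_error("failed inside the critical section");
}

// Manual locking: the exception skips unlock(), so the mutex stays held.
int heldAfterManualLocking() {
    Mutex mutex;
    try {
        mutex.lock();
        failingWork();
        mutex.unlock();  // never reached
    } catch (const std::runtime_error&) {
    }
    return mutex.held();
}

// RAII locking: stack unwinding runs ~Lock(), which unlocks the mutex.
int heldAfterRaiiLocking() {
    Mutex mutex;
    try {
        Lock guard(mutex);
        assert(mutex.held() == 1);  // the lock is held inside the scope
        failingWork();
    } catch (const std::runtime_error&) {
    }
    return mutex.held();
}
```

<p>With manual locking the stand-in mutex is still held after the exception has been caught, while the RAII version leaves it released. With a real OpenMP lock the first case would typically mean that the next thread calling <em>lock()</em> blocks forever.</p>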
<p>Our goal is to provide an inheritable interface for objects that need synchronization. However, this step requires careful consideration of the provided interface, as we explicitly need to conform to the following requirements:</p>
<ul>
<li>The interface shall not expose the interface of the underlying synchronization primitive, in our case the mutex class methods.</li>
<li>The interface shall be available only to the synchronizable objects themselves and not to the outside world: we would like not just to hide the implementation details of our abstract entity but also to prevent users from synchronizing our objects, as that should be the responsibility of the object itself.</li>
<li>The interface shall expose methods whose names are unlikely to collide with others, for convenience.</li>
</ul>
<p>If we respect these requirements we end up with an interface similar to the following:</p>
<pre class="brush: cpp">class Synchronizable: protected Mutex {
protected:
	void enterSyncBlock() { this-&gt;lock(); }
	void exitSyncBlock() { this-&gt;unlock(); }
};</pre>
<p>Now we are almost at the finish line. We just need to inherit from this class to give an object the facilities it needs for synchronization. However, using this interface directly is neither the most comfortable nor the safest option. If we would like to have a Java-like &#8220;synchronized&#8221; statement we have to call for additional help. Fortunately, the not so well respected C preprocessor comes to our rescue, as we can use it to make some pseudo-language extensions. The simplest way to define our new statement is using the following line:</p>
<pre class="brush: cpp">#define synchronized(obj)  for(Lock obj##_lock = *obj; obj##_lock; obj##_lock.release())</pre>
<p>From now on, we can use object synchronization in C++ as easily as in Java; we just need the following syntax in the methods of our shared objects:</p>
<pre class="brush: cpp">synchronized(this) {
    // some code that needs synchronization
}</pre>
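<p>Putting the pieces together, a complete example could look like the sketch below. It repeats the classes from above (with the lock converting to <em>bool</em> so that it can serve as the loop condition) and adds a hypothetical <em>Counter</em> class and <em>countInParallel</em> helper that exist only for illustration. The <em>_OPENMP</em> guard is also my addition: when the file is compiled without OpenMP support, the pragma is ignored and the program runs on a single thread, so a do-nothing stand-in mutex is sufficient there.</p>

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

class Mutex {
public:
#ifdef _OPENMP
    Mutex() { omp_init_lock(&_mutex); }
    ~Mutex() { omp_destroy_lock(&_mutex); }
    void lock() { omp_set_lock(&_mutex); }
    void unlock() { omp_unset_lock(&_mutex); }
private:
    omp_lock_t _mutex;
#else
    // Assumption for this sketch: without OpenMP the code below is serial,
    // so locking can safely do nothing.
    void lock() {}
    void unlock() {}
#endif
};

class Lock {
public:
    Lock(Mutex& mutex) : _mutex(mutex), _release(false) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    operator bool() const { return !_release; }
    void release() { _release = true; }
private:
    Mutex& _mutex;
    bool _release;
};

class Synchronizable : protected Mutex {};

// The body runs exactly once: release() ends the loop after the first pass.
#define synchronized(obj)  for(Lock obj##_lock = *obj; obj##_lock; obj##_lock.release())

// Hypothetical shared object, used only for illustration.
class Counter : public Synchronizable {
public:
    Counter() : _value(0) {}
    void increment() {
        synchronized(this) {
            ++_value;  // only one thread at a time gets here
        }
    }
    int value() const { return _value; }
private:
    int _value;
};

// Increments one shared counter from all threads of a parallel loop.
int countInParallel(int iterations) {
    assert(iterations >= 0);
    Counter counter;
    #pragma omp parallel for
    for (int i = 0; i < iterations; ++i) {
        counter.increment();
    }
    return counter.value();
}
```

<p>Because every increment happens inside the synchronized block, <em>countInParallel(n)</em> returns exactly <em>n</em> regardless of the number of threads. Without the block, concurrent increments could be lost.</p>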
<p>Now it is clearly visible how handy the RAII idiom became in our case. Besides being very straightforward to use, this statement provides additional benefits:</p>
<ul>
<li>It makes the code more readable and as a result it is easier to maintain.</li>
<li>No need to call inconveniently named methods and use lock variables.</li>
<li>The synchronized code has its own scope inside the code.</li>
<li>It is exception-safe as the mutex is unlocked upon destruction.</li>
</ul>
<p>Additionally, we can also take advantage of what is usually seen as an inherent problem of C++: multiple inheritance. If our class inherits from two other synchronizable classes, then with a simple type cast we can explicitly specify which ancestor we would like to synchronize on in a particular block. To make this easier, we can define our synchronization statement as follows instead of the Java-like one:</p>
<pre class="brush: cpp">#define synchronized(cls)  for(Lock obj##_lock = *static_cast&lt;cls*&gt;(this); obj##_lock; obj##_lock.release())</pre>
<p>In this case we pass the class name instead of the object pointer <em>this</em>. Using this latter construct we can easily specify the correct ancestor that we would like to synchronize on when we deal with multiple inheritance. Personally I prefer the latter syntax as it is much better tailored to C++ use cases.</p>
<p>As we no longer need a direct interface for entering and exiting our synchronization block, we can simplify our synchronizable interface to the following chunk:</p>
<pre class="brush: cpp">class Synchronizable: protected Mutex {
};</pre>
<p>This is now enough to provide the facilities needed for a synchronization block, and it still complies with the requirement that the details of the synchronization primitive stay hidden.</p>
<p>Besides this, Jörg came up with the idea today to replace the for loop in our macro with a single if statement. This seems reasonable as we don&#8217;t have to sacrifice any of the scoping and safety related benefits of our framework. It simplifies our lock class to the following:</p>
<pre class="brush: cpp">class Lock {
public:
    Lock(Mutex&amp; mutex) : _mutex(mutex) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    // Always true, so the body of the if statement is always entered.
    operator bool() const { return true; }
private:
    Mutex&amp; _mutex;
};</pre>
<p>This definition of the lock class is satisfactory if we redefine our synchronized macro to use an if statement instead:</p>
<pre class="brush: cpp">/* Define only one of the two variants; both use the same macro name. */
/* Java-like synchronized statement */
#define synchronized(obj)  if (Lock obj##_lock = *obj)
/* alternative synchronized statement to support multiple inheritance */
#define synchronized(cls)  if (Lock obj##_lock = *static_cast&lt;cls*&gt;(this))</pre>
<p>Thanks to the useful comments we even managed to further optimize and minimize the support code needed for our new pseudo-language extension.</p>
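<p>As a rough check that the if-based variant keeps the safety properties, the sketch below leaves a synchronized block through an early <em>return</em> and verifies that the lock is released anyway. As before, this is an illustration under assumptions: <em>Mutex</em> is a stand-in that only counts how often it is held, and <em>Inventory</em> with its methods is a hypothetical class invented for the example.</p>

```cpp
#include <cassert>

// Stand-in for the OpenMP-backed Mutex (assumption for this sketch):
// it only tracks how many times it is currently held.
class Mutex {
public:
    Mutex() : _held(0) {}
    void lock() { ++_held; }
    void unlock() { --_held; }
    int held() const { return _held; }
private:
    int _held;
};

class Lock {
public:
    Lock(Mutex& mutex) : _mutex(mutex) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    operator bool() const { return true; }
private:
    Mutex& _mutex;
};

class Synchronizable : protected Mutex {};

#define synchronized(obj)  if (Lock obj##_lock = *obj)

// Hypothetical shared object, used only for illustration.
class Inventory : public Synchronizable {
public:
    explicit Inventory(int stock) : _stock(stock) {}

    // Removes `amount` items if enough are in stock.
    bool take(int amount) {
        assert(amount > 0);
        synchronized(this) {
            assert(held() == 1);   // the lock is held inside the block
            if (amount > _stock)
                return false;      // early return: ~Lock() still unlocks
            _stock -= amount;
        }
        return true;
    }

    int stock() const { return _stock; }
    bool unlocked() const { return held() == 0; }
private:
    int _stock;
};
```

<p>After both a successful and a rejected call to <em>take</em>, <em>unlocked()</em> reports that the mutex is free again. One caveat worth keeping in mind with this variant: since the macro expands to a plain if statement, an <em>else</em> written directly after a synchronized block attaches to the hidden if, and because the condition is always true that branch would never run.</p>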
<h2>Conclusion</h2>
<p>We have seen an example of how one can implement an easy-to-use synchronizable interface for C++. We&#8217;ve also provided a concrete implementation based on OpenMP. This library is still far from an API that provides all the constructs one needs for parallel programming in C++ projects; however, we have made our first step, and I will return to the subject in subsequent articles to further extend this framework.</p>
<p>Credits go to Achilleas Margaritis whose article inspired me to write mine and to Jörg for the useful improvement ideas.</p>
<h3>Full source code</h3>
<p><strong>Language:</strong> C++<br />
<strong> Platform:</strong> cross-platform<br />
<strong> Dependency:</strong> OpenMP<br />
<strong> Download link:</strong> <a title="omp_sync.h" href="/blog/wp-content/uploads/2010/02/files/omp_sync.h" target="_blank">omp_sync.h</a><br />
<strong> Comments:</strong> In order to use it as it is, you will need a C++ compiler supporting OpenMP like GCC 4.2 or Visual C++ 2008.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

