<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>RasterGrid Blog &#187; C++</title>
	<atom:link href="http://rastergrid.com/blog/tag/c/feed/" rel="self" type="application/rss+xml" />
	<link>http://rastergrid.com/blog</link>
	<description>A technical blog from Daniel Rákos (aka aqnuep)</description>
	<lastBuildDate>Tue, 07 Sep 2010 20:49:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Efficient Gaussian blur with linear sampling</title>
		<link>http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/</link>
		<comments>http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/#comments</comments>
		<pubDate>Tue, 07 Sep 2010 20:48:16 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[bloom]]></category>
		<category><![CDATA[blur]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[depth-of-field]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[postprocessing]]></category>
		<category><![CDATA[SFML]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=299</guid>
		<description><![CDATA[

Gaussian blur is an image space effect that is used to create a softly blurred version of the original image. This image then can be used by more sophisticated algorithms to produce effects like bloom, depth-of-field, heat haze or fuzzy glass. In this article I will present how to take advantage of the various properties [...]]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 160px"><br />
<img class=" " title="Gaussian blur" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_thumbnail.png" alt="Gaussian blur" width="150" height="150" /><p class="wp-caption-text">Gaussian blur</p></div>
<p>Gaussian blur is an image space effect that is used to create a softly blurred version of the original image. This image can then be used by more sophisticated algorithms to produce effects like bloom, depth-of-field, heat haze or fuzzy glass. In this article I will show how to take advantage of the properties of the Gaussian filter to create an efficient implementation, and present a technique that greatly improves on the performance of a naive Gaussian blur by using bilinear texture filtering to reduce the number of necessary texture lookups. While the article focuses on the Gaussian blur filter, most of the principles presented apply to most convolution filters used in real-time graphics.</p>
<p><span id="more-299"></span></p>
<p>Gaussian blur is a widely used technique in computer graphics, and many rendering techniques rely on it to produce convincing photorealistic effects, whether we talk about an offline renderer or a game engine. Since the advent of configurable fragment processing, first through texture combiners and later through fragment shaders, Gaussian blur or some other blur filter has become almost a must for every rendering engine. While the basic convolution filter algorithm is rather expensive, there are a lot of neat techniques that can drastically reduce its computational cost, making it suitable for real-time rendering even on pretty outdated hardware. This article is written as a tutorial that tries to present most of the available optimization techniques. Some of them may be familiar to you, but linear sampling may still hold a surprise. Let&#8217;s not jump that far ahead, though, and start with the basics.</p>
<h2>Terminology</h2>
<p>To prevent any confusion, I&#8217;ll start the article by introducing some terms and concepts that I will use in the post.</p>
<p><strong>Convolution filter</strong> &#8211; An algorithm that combines the color values of a group of pixels.</p>
<p><strong>NxN-tap filter</strong> &#8211; A filter that uses a square shaped footprint of pixels with the square&#8217;s side length being N pixels.</p>
<p><strong>N-tap filter</strong> &#8211; A filter that uses an N-pixel footprint. Note that an N-tap filter does <em>not</em> necessarily have to sample N texels, as we will see that an N-tap filter can be implemented using fewer than N texel fetches.</p>
<p><strong>Filter kernel</strong> &#8211; A collection of relative coordinates and weights that are used to combine the pixel footprint of the filter.</p>
<p><strong>Discrete sampling</strong> &#8211; Texture sampling method where we fetch the data of exactly one texel (aka GL_NEAREST filtering).</p>
<p><strong>Linear sampling</strong> &#8211; Texture sampling method where we fetch a footprint of 2&#215;2 texels and apply a bilinear filter to acquire the final color information (aka GL_LINEAR filtering).</p>
<h2>Gaussian filter</h2>
<p>The image space Gaussian filter is an NxN-tap convolution filter that weights the pixels inside of its footprint based on the Gaussian function:</p>
<p style="text-align: center;"><img class=" aligncenter" title="Gaussian function 2D" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_function_2D.png" alt="Gaussian function 2D" width="190" height="41" /></p>
<p>The pixels of the filter footprint are weighted using the values of the Gaussian function, which produces the blur effect. The spatial representation of the Gaussian filter, sometimes referred to as the &#8220;bell surface&#8221;, shows how much the individual pixels of the footprint contribute to the final pixel color.</p>
<div class="wp-caption aligncenter" style="width: 444px"><img title="Gaussian function graphical representation" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_graph.png" alt="Gaussian function graphical representation" width="434" height="351" /><p class="wp-caption-text">The graphical representation of the 2-dimensional Gaussian function</p></div>
<p>Based on this some of you may already say &#8220;aha, so we simply need to do NxN texture fetches and weight them together and voilà&#8221;. While this is true, it is not as efficient as it looks. For a 1024&#215;1024 image, a fragment shader that implements a 33&#215;33-tap Gaussian filter this way would need an enormous 1024*1024*33*33 ≈ 1.14 billion texture fetches to blur the whole image.</p>
<p>To get to a more efficient algorithm, we have to take a closer look at some of the nice properties of the Gaussian function:</p>
<ul>
<li>The 2-dimensional Gaussian function can be calculated by multiplying two 1-dimensional Gaussian functions:</li>
</ul>
<p style="text-align: center;"><img class="aligncenter" title="Gaussian function 1D" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_function_1D.png" alt="Gaussian function 1D" width="190" height="41" /></p>
<ul>
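<li>Applying two Gaussian filters with a standard deviation of σ one after the other (that is, convolving them) is equivalent to applying a single Gaussian filter with a standard deviation of σ√2, because the variances of the two filters add.</li>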
</ul>
<p>Both of these properties of the Gaussian function give us room for heavy optimization.</p>
<p>Based on the first property, we can separate our 2-dimensional Gaussian function into two 1-dimensional ones. For the fragment shader implementation this means that we can split our Gaussian filter into a horizontal blur filter and a vertical blur filter and still get the exact same result after rendering. This gives us two N-tap filters and an additional rendering pass for the second filter. Getting back to our example, applying two 33-tap Gaussian filters to a 1024&#215;1024 image takes 1024*1024*33*2 ≈ 69 million texture fetches. That is already more than an order of magnitude fewer than the original approach required.</p>
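<p>The separability property is easy to check numerically. The following C++ sketch is not part of the demo application: it assumes the usual normalized form of the Gaussian function, and the function names are made up for this illustration. It compares the 2-dimensional function with the product of two 1-dimensional ones over the footprint of a filter:</p>
<pre class="brush:cpp">#include &lt;algorithm&gt;
#include &lt;cmath&gt;

// 1D Gaussian: exp(-x^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
double gaussian1D(double x, double sigma)
{
    const double pi = 3.14159265358979323846;
    return std::exp(-(x * x) / (2.0 * sigma * sigma))
         / std::sqrt(2.0 * pi * sigma * sigma);
}

// 2D Gaussian: exp(-(x^2 + y^2) / (2 sigma^2)) / (2 pi sigma^2)
double gaussian2D(double x, double y, double sigma)
{
    const double pi = 3.14159265358979323846;
    return std::exp(-(x * x + y * y) / (2.0 * sigma * sigma))
         / (2.0 * pi * sigma * sigma);
}

// Largest absolute difference between G(x, y) and G(x) * G(y) over a
// (2 * radius + 1) x (2 * radius + 1) texel footprint.
double maxSeparabilityError(int radius, double sigma)
{
    double maxError = 0.0;
    for (int y = -radius; y &lt;= radius; ++y) {
        for (int x = -radius; x &lt;= radius; ++x) {
            const double direct  = gaussian2D(x, y, sigma);
            const double product = gaussian1D(x, sigma) * gaussian1D(y, sigma);
            maxError = std::max(maxError, std::fabs(direct - product));
        }
    }
    return maxError;
}</pre>
<p>For example, maxSeparabilityError(16, 3.0) covers the footprint of a 33&#215;33-tap filter, and the difference it returns should be at the level of floating point rounding errors.</p>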
<p>Using the second property of the Gaussian function, we can approximate our 33&#215;33-tap filter by applying a 9&#215;9-tap filter three times in succession. Each additional pass widens the footprint by 8 texels (9+8=17, 17+8=25), and because the variances of the passes add, the accumulated blur gets stronger with every pass. Back to our example, for the 1024&#215;1024 sized image this results in 1024*1024*9*9*3 ≈ 255 million texture fetches. As we can see, this approach also spares a large number of the necessary texture fetches.</p>
<p>Of course, the two techniques can also be combined. That means we both separate our filter into a vertical and a horizontal filter and decompose our 33-tap filter into three 9-tap filters. This gets us to the almost optimal number of 1024*1024*9*3*2 ≈ 56 million texture fetches.</p>
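<p>The texture fetch counts quoted in this section can be reproduced with a few lines of C++. This is just a sketch of the arithmetic; the function names are hypothetical and do not come from the demo application:</p>
<pre class="brush:cpp">#include &lt;cstdint&gt;

// Fetches of an NxN-tap filter applied `repeats` times in a single pass each.
std::uint64_t fetchesFull2D(std::uint64_t width, std::uint64_t height,
                            std::uint64_t taps, std::uint64_t repeats)
{
    return width * height * taps * taps * repeats;
}

// Fetches of a separable filter: one horizontal and one vertical N-tap pass,
// with the pair of passes applied `repeats` times.
std::uint64_t fetchesSeparable(std::uint64_t width, std::uint64_t height,
                               std::uint64_t taps, std::uint64_t repeats)
{
    return width * height * taps * 2 * repeats;
}</pre>
<p>With a 1024&#215;1024 image, fetchesFull2D(1024, 1024, 33, 1) gives 1,141,899,264, fetchesSeparable(1024, 1024, 33, 1) gives 69,206,016, fetchesFull2D(1024, 1024, 9, 3) gives 254,803,968 and fetchesSeparable(1024, 1024, 9, 3) gives 56,623,104.</p>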
<h2>Gaussian kernel weights</h2>
<p>We&#8217;ve seen how to implement an efficient Gaussian blur filter for our application, at least in theory, but we haven&#8217;t talked about how to calculate the weight of each pixel we combine using the filter in order to get proper results. The most straightforward way to determine the kernel weights is to simply calculate the value of the Gaussian function for various distribution and coordinate values. While this is the most generic solution, there is a simpler way to get some weights: using the binomial coefficients. Why can we do that? Because the Gaussian function is the probability density function of the normal distribution, and the discrete counterpart of the normal distribution is the binomial distribution, whose probabilities are proportional to the binomial coefficients.</p>
<div class="wp-caption aligncenter" style="width: 610px"><img class=" " title="Binomial coefficients" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/binomial_coeff.png" alt="Binomial coefficients" width="600" height="250" /><p class="wp-caption-text">The Pascal triangle showcasing the binomial coefficients that can be used to calculate the kernel weights (each element in the succeeding rows is the sum of its &quot;parents&quot;).</p></div>
<p>For implementing our 9-tap horizontal and vertical Gaussian filters we will use the last row of the Pascal triangle illustrated above to calculate our weights. One may ask why we don&#8217;t use the row with index 8, as it has exactly 9 coefficients. This is a justifiable question, but it is rather easy to answer. With a typical 32-bit color buffer the outermost coefficients don&#8217;t have any effect on the final image, while the second outermost ones have little to no effect. We would like to minimize the number of texture fetches but provide the highest quality blur possible with our 9-tap filter. Obviously, if very high precision results are a must and a higher precision color buffer is available, preferably a floating point one, using the row with index 8 is better. But let&#8217;s stick to our original idea and use the last row&#8230;</p>
<p>Once we have the necessary coefficients, it is very easy to calculate the weights that will be used to combine our pixels. We just have to divide each coefficient by the sum of the coefficients, which is 4096 for this row. Of course, to compensate for the elimination of the four outermost coefficients (1 and 12 on each side), we can reduce the sum to 4070. For a single blur step this hardly has any visible effect, as our color buffer has limited precision, although the small loss of brightness does accumulate when the filter is applied many times.</p>
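<p>The weight calculation described above can be sketched in C++ as follows. The helper names are made up for this illustration, and the sketch assumes a row with an even index, so that there is a single center coefficient:</p>
<pre class="brush:cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Row n of the Pascal triangle: C(n, 0) ... C(n, n).
std::vector&lt;std::uint64_t&gt; pascalRow(unsigned n)
{
    std::vector&lt;std::uint64_t&gt; row(n + 1, 1);
    for (unsigned k = 1; k &lt;= n; ++k) {
        // C(n, k) = C(n, k - 1) * (n - k + 1) / k, the division is always exact.
        row[k] = row[k - 1] * (n - k + 1) / k;
    }
    return row;
}

// Kernel weights from the center outwards, after dropping `dropped`
// coefficients on each side of row n (n is assumed to be even).
// renormalize == false divides by the sum of the full row (2^n),
// renormalize == true divides by the sum of the kept coefficients.
std::vector&lt;double&gt; binomialWeights(unsigned n, unsigned dropped, bool renormalize)
{
    const std::vector&lt;std::uint64_t&gt; row = pascalRow(n);
    std::uint64_t keptSum = 0;
    for (std::size_t k = dropped; k + dropped &lt; row.size(); ++k) {
        keptSum += row[k];
    }
    const double divisor = renormalize
        ? static_cast&lt;double&gt;(keptSum)
        : static_cast&lt;double&gt;(std::uint64_t(1) &lt;&lt; n);

    std::vector&lt;double&gt; weights;
    for (std::size_t k = n / 2; k + dropped &lt; row.size(); ++k) {
        weights.push_back(static_cast&lt;double&gt;(row[k]) / divisor);
    }
    return weights;
}</pre>
<p>Calling binomialWeights(12, 2, false) returns the five weights 924/4096, 792/4096, 495/4096, 220/4096 and 66/4096, while passing true as the last argument divides by 4070 instead.</p>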
<p>Now that we have our weights, it is very straightforward to implement our fragment shaders. Let&#8217;s see how the vertical filter shader looks in GLSL:</p>
<pre class="brush:cpp">#version 330 core

uniform sampler2D image;

out vec4 FragmentColor;

// Texel offsets from the center and the corresponding kernel weights.
uniform float offset[5] = float[]( 0.0, 1.0, 2.0, 3.0, 4.0 );
uniform float weight[5] = float[]( 0.2255859375, 0.193359375, 0.120849609375,
                                   0.0537109375, 0.01611328125 );

void main(void)
{
    // gl_FragCoord.xy is in window space; dividing it by the (hardcoded)
    // image size gives normalized texture coordinates.
    FragmentColor = texture( image, gl_FragCoord.xy/1024.0 ) * weight[0];
    for (int i=1; i&lt;5; i++) {
        // One sample above and one below the center, with the same weight.
        FragmentColor +=
            texture( image, ( gl_FragCoord.xy+vec2(0.0, offset[i]) )/1024.0 )
                * weight[i];
        FragmentColor +=
            texture( image, ( gl_FragCoord.xy-vec2(0.0, offset[i]) )/1024.0 )
                * weight[i];
    }
}</pre>
<p>The horizontal filter is no different, except that the offset is applied to the X component rather than to the Y component of the fragment coordinate. Note that the size of the image is hardcoded here, as we divide the window space coordinate by 1024. In a real-life scenario one may replace that with a uniform or simply use texture rectangles, which don&#8217;t use normalized texture coordinates.</p>
<p>If you have to apply the filter several times to get a stronger blur effect, the only thing you have to do is ping-pong between two framebuffers and apply the shaders to the result of the previous step.</p>
<div class="wp-caption aligncenter" style="width: 610px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian1.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian1.png?referer=');"><img class=" " title="Gaussian blur effect" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian1_thumbnail.png" alt="Gaussian blur effect" width="600" height="200" /></a><p class="wp-caption-text">9-tap Gaussian blur filter applied to an image of size 1024x1024: no filter applied (left), applied once (middle), applied nine times (right). Click to view the full-sized image in order to better see the difference.</p></div>
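<p>To make the ping-pong idea more tangible, here is a CPU-side C++ sketch of the same scheme on a single-channel image. It is only an analogy of the shader based implementation, not code from the demo: the function names are hypothetical, the edge handling (clamping, similar to GL_CLAMP_TO_EDGE) is an assumption, and the weights are divided by 4070 so that the brightness is preserved.</p>
<pre class="brush:cpp">#include &lt;algorithm&gt;
#include &lt;vector&gt;

// One 9-tap blur pass over a width x height image stored row by row.
// `horizontal` selects the direction of the blur.
void blurPass(const std::vector&lt;float&gt;&amp; src, std::vector&lt;float&gt;&amp; dst,
              int width, int height, bool horizontal)
{
    static const float weight[5] = { 924.0f / 4070.0f, 792.0f / 4070.0f,
                                     495.0f / 4070.0f, 220.0f / 4070.0f,
                                     66.0f / 4070.0f };
    for (int y = 0; y &lt; height; ++y) {
        for (int x = 0; x &lt; width; ++x) {
            float sum = src[y * width + x] * weight[0];
            for (int i = 1; i &lt; 5; ++i) {
                for (int sign = -1; sign &lt;= 1; sign += 2) {
                    // Clamp the sample coordinate to the image edges.
                    const int sx = horizontal
                        ? std::min(std::max(x + sign * i, 0), width - 1) : x;
                    const int sy = horizontal
                        ? y : std::min(std::max(y + sign * i, 0), height - 1);
                    sum += src[sy * width + sx] * weight[i];
                }
            }
            dst[y * width + x] = sum;
        }
    }
}

// Applies the horizontal + vertical pair of passes `steps` times,
// ping-ponging between the image and a scratch buffer.
std::vector&lt;float&gt; blur(std::vector&lt;float&gt; image, int width, int height, int steps)
{
    std::vector&lt;float&gt; scratch(image.size());
    for (int s = 0; s &lt; steps; ++s) {
        blurPass(image, scratch, width, height, true);   // image -&gt; scratch
        blurPass(scratch, image, width, height, false);  // scratch -&gt; image
    }
    return image;
}</pre>
<p>As every blur step consists of two passes, the result always ends up in the first buffer again, which is the same thing that happens with the two framebuffers on the GPU.</p>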
<h2>Linear sampling</h2>
<p>So far, we have seen how to implement a separable Gaussian filter using two rendering passes to get a 9-tap Gaussian blur. We&#8217;ve also seen that we can run this filter three times over a 1024&#215;1024 sized image to get a 33-tap Gaussian blur using only 56 million texture fetches. While this is already quite efficient, it does not really exploit the capabilities of the GPU, as this form of the algorithm would work almost unmodified on a CPU as well.</p>
<p>Now, we will see that we can take advantage of the fixed function hardware available on the GPU that can even further reduce the number of required texture fetches. In order to get to this optimization let&#8217;s discuss one of the assumptions that we made from the beginning of the article:</p>
<p>So far, we assumed that in order to get information about a single pixel we have to make one texture fetch, which means that for 9 pixels we need 9 texture fetches. While this is true for a CPU implementation, it is not necessarily true for a GPU implementation. This is because on the GPU we have bilinear texture filtering at our disposal, which comes at practically no cost. That means that if we don&#8217;t sample the texture at texel center positions, a single fetch gives us information about multiple pixels. As we already use the separability property of the Gaussian function, we are actually working in 1D, so for us the bilinear filter provides information about two pixels. How much each texel contributes to the final color value depends on the coordinate that we use.</p>
<p>By properly adjusting the texture coordinate offsets we can get the accurate information of two texels or pixels using a single texture fetch. That means that for a 9-tap horizontal/vertical Gaussian filter we need only 5 texture fetches. In general, for an N-tap filter with odd N we need ⌈N/2⌉, that is (N+1)/2, texture fetches.</p>
<p>What does this mean for the weight values we used for the discrete sampled Gaussian filter? It means that whenever we use a single texture fetch to get information about two texels, we have to weight the retrieved color value by the sum of the weights corresponding to the two texels. Now that we know our weights, we just have to calculate the texture coordinate offsets properly.</p>
<p>For the texture coordinates, we could simply use the middle coordinate between the two texel centers. While this is a good approximation, we won&#8217;t settle for it, as we can calculate much better coordinates that give us exactly the same values as discrete sampling.</p>
<p>When we merge two texels like this, we have to choose the coordinate so that its distance from the center of texel #1 equals the weight of texel #2 divided by the sum of the two weights. In the same way, its distance from the center of texel #2 equals the weight of texel #1 divided by the sum of the two weights.</p>
<p>As a result, we get the following formulas to determine the weights and offsets for our linear sampled Gaussian blur filter:</p>
<p style="text-align: center;"><img class="aligncenter" title="Weight and offset calculation for linear sampling" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/equation.png" alt="Weight and offset calculation for linear sampling" width="597" height="116" /></p>
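<p>In code, merging the discrete taps pairwise could look like the following C++ sketch. It follows the description above (the merged weight is the sum of the two weights, the merged offset is the weighted average of the two offsets); the structure and function names are made up for this illustration, and the sketch assumes that tap 0 is the center and that every other tap has a partner to be merged with.</p>
<pre class="brush:cpp">#include &lt;cstddef&gt;
#include &lt;vector&gt;

struct LinearKernel {
    std::vector&lt;double&gt; offset;
    std::vector&lt;double&gt; weight;
};

// Converts discrete sampling offsets/weights (center first) into linear
// sampling ones by merging taps 1+2, 3+4, and so on.
LinearKernel toLinearKernel(const std::vector&lt;double&gt;&amp; offset,
                            const std::vector&lt;double&gt;&amp; weight)
{
    LinearKernel kernel;
    // The center tap is kept as it is.
    kernel.offset.push_back(offset[0]);
    kernel.weight.push_back(weight[0]);
    for (std::size_t i = 1; i + 1 &lt; weight.size(); i += 2) {
        const double merged = weight[i] + weight[i + 1];
        kernel.weight.push_back(merged);
        // Weighted average of the two texel offsets.
        kernel.offset.push_back(
            (offset[i] * weight[i] + offset[i + 1] * weight[i + 1]) / merged);
    }
    return kernel;
}</pre>
<p>Feeding it the offsets 0 to 4 and the five discrete weights of our 9-tap filter yields three taps: the untouched center and two merged taps with the offsets 18/13 ≈ 1.3846 and 42/13 ≈ 3.2308 and the weights 1287/4096 and 286/4096.</p>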
<p>By using this information we just have to replace our uniform constants and decrease the number of iterations in our vertical filter shader and we get the following:</p>
<pre class="brush:cpp">#version 330 core

uniform sampler2D image;

out vec4 FragmentColor;

// Offsets and weights of the merged taps; the image has to be sampled
// with GL_LINEAR filtering for these to work.
uniform float offset[3] = float[]( 0.0, 1.3846153846, 3.2307692308 );
uniform float weight[3] = float[]( 0.2255859375, 0.314208984375, 0.06982421875 );

void main(void)
{
    FragmentColor = texture( image, gl_FragCoord.xy/1024.0 ) * weight[0];
    for (int i=1; i&lt;3; i++) {
        // Each fetch lands between two texel centers and returns their
        // properly weighted combination.
        FragmentColor +=
            texture( image, ( gl_FragCoord.xy+vec2(0.0, offset[i]) )/1024.0 )
                * weight[i];
        FragmentColor +=
            texture( image, ( gl_FragCoord.xy-vec2(0.0, offset[i]) )/1024.0 )
                * weight[i];
    }
}</pre>
<p>This simplification of the algorithm is mathematically correct, and if we disregard possible rounding errors in the hardware implementation of the bilinear filter, our linear sampling shader produces exactly the same result as the discrete sampling one.</p>
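<p>The equivalence can also be checked on the CPU. The following C++ sketch emulates both sampling modes on a 1D row of texels; it is an illustration only, with hypothetical function names, integer positions standing for texel centers and edge clamping assumed as the wrap mode.</p>
<pre class="brush:cpp">#include &lt;algorithm&gt;
#include &lt;cmath&gt;
#include &lt;vector&gt;

// Reads exactly one texel with edge clamping (stand-in for GL_NEAREST).
double fetchDiscrete(const std::vector&lt;double&gt;&amp; texels, int index)
{
    const int last = static_cast&lt;int&gt;(texels.size()) - 1;
    return texels[std::min(std::max(index, 0), last)];
}

// Reads a linearly filtered value at a fractional position
// (1D stand-in for GL_LINEAR).
double fetchLinear(const std::vector&lt;double&gt;&amp; texels, double position)
{
    const double base = std::floor(position);
    const double t = position - base;
    const int index = static_cast&lt;int&gt;(base);
    return fetchDiscrete(texels, index) * (1.0 - t)
         + fetchDiscrete(texels, index + 1) * t;
}

// 9-tap blur using nine discrete fetches.
double blurDiscrete(const std::vector&lt;double&gt;&amp; texels, int center)
{
    static const double weight[5] = { 0.2255859375, 0.193359375, 0.120849609375,
                                      0.0537109375, 0.01611328125 };
    double sum = fetchDiscrete(texels, center) * weight[0];
    for (int i = 1; i &lt; 5; ++i) {
        sum += fetchDiscrete(texels, center + i) * weight[i];
        sum += fetchDiscrete(texels, center - i) * weight[i];
    }
    return sum;
}

// The same 9-tap blur using five linearly filtered fetches.
double blurLinear(const std::vector&lt;double&gt;&amp; texels, int center)
{
    static const double offset[3] = { 0.0, 1.3846153846, 3.2307692308 };
    static const double weight[3] = { 0.2255859375, 0.314208984375, 0.06982421875 };
    double sum = fetchLinear(texels, center) * weight[0];
    for (int i = 1; i &lt; 3; ++i) {
        sum += fetchLinear(texels, center + offset[i]) * weight[i];
        sum += fetchLinear(texels, center - offset[i]) * weight[i];
    }
    return sum;
}</pre>
<p>For any row of texels the two functions should agree up to the precision of the rounded offset constants. On real hardware the precision of the bilinear filter unit puts an additional limit on how closely the two results match.</p>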
<div class="wp-caption aligncenter" style="width: 523px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/side2side.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/09/side2side.png?referer=');"><img class=" " title="Side-to-side comparison of Gaussian blur with discrete and linear sampling" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/side2side_thumbnail.png" alt="Side-to-side comparison of Gaussian blur with discrete and linear sampling" width="513" height="250" /></a><p class="wp-caption-text">9-tap Gaussian blur applied nine times with discrete sampling (left) and linear sampling (right). Click for the full resolution of the image. Note that there is no visible difference between the two techniques even after several passes.</p></div>
<p>While the implementation of linear sampling is pretty straightforward, it has a quite visible effect on the performance of the Gaussian blur filter. Taking into consideration that we managed to implement a 9-tap filter using just five texture fetches instead of nine, back to our example, blurring a 1024&#215;1024 image with a 33-tap filter takes only 1024*1024*5*3*2 ≈ 31 million texture fetches instead of the 56 million required by discrete sampling. This is a considerable difference, and to better show how much it matters I&#8217;ve done some experiments to measure the difference between the two techniques. The result speaks for itself:</p>
<div class="wp-caption aligncenter" style="width: 532px"><img title="Performance comparison of discrete and linear sampling" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/comparison2.png" alt="Performance comparison of discrete and linear sampling" width="522" height="400" /><p class="wp-caption-text">Performance comparison of the 9-tap Gaussian blur filter with discrete and linear sampling on a Radeon HD5770. The vertical axis is the frames per second (higher is better) and the horizontal axis represents results with various number of blur steps (higher is blurrier).</p></div>
<p>As we can see, the Gaussian filter implemented with linear sampling is about 60% faster than the one implemented with discrete sampling, regardless of the number of blur steps applied to the image. This is roughly proportional to the number of texture fetches spared by using linear filtering.</p>
<h2>Conclusion</h2>
<p>We&#8217;ve seen that implementing an efficient Gaussian blur filter is quite straightforward and the result is a very fast real-time algorithm, especially using the linear sampling, that can be used as the basis of more advanced rendering techniques.</p>
<p>Even though we concentrated on Gaussian blur in this article, many of the discussed principles apply to most convolution filter types. Also, most of the theory, including linear sampling, applies when we need a blurred image of reduced size, as is usually the case for the bloom effect. The only thing that is really different for a reduced size blurred image is that our center pixel is also a &#8220;double-pixel&#8221;. This means that we have to use a row of the Pascal triangle that has an even number of coefficients, as we would like to linearly sample the middle texels as well.</p>
<p>We&#8217;ve also had a brief insight into the computational complexity of the various techniques and how the filter can be efficiently implemented on the GPU.</p>
<p>The demo application used for the measurements performed to compare the discrete and linear sampling method can be downloaded here:</p>
<h3>Binary release</h3>
<p><strong>Platform:</strong> Windows<br />
<strong>Dependency:</strong> OpenGL 3.3 capable graphics driver<br />
<strong>Download link:<span style="font-weight: normal;"> </span><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_win32.zip" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_win32.zip?referer=');"><span style="font-weight: normal;">gaussian_win32.zip (2.96MB)</span></a></strong></p>
<h3>Source code</h3>
<p><strong>Language:</strong> C++<br />
<strong>Platform:</strong> cross-platform<br />
<strong>Dependency:</strong> GLEW, SFML, GLM<br />
<strong>Download link:</strong> <a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_src.zip" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/09/gaussian_src.zip?referer=');">gaussian_src.zip (5.41KB)</a></p>
<p>P.S.: Sorry for the high minimum requirements of the application, but I would really like to stick to strict OpenGL 3+ demos.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Instance Cloud Reduction reloaded</title>
		<link>http://rastergrid.com/blog/2010/06/instance-cloud-reduction-reloaded/</link>
		<comments>http://rastergrid.com/blog/2010/06/instance-cloud-reduction-reloaded/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 19:36:38 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[attribute divisor]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[instanced array]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[SFML]]></category>
		<category><![CDATA[texture buffer]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[uniform buffer]]></category>
		<category><![CDATA[vertex buffer]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=251</guid>
		<description><![CDATA[

A few months ago I&#8217;ve presented an object culling mechanism that I&#8217;ve named Instance Cloud Reduction (ICR) in the article Instance culling using geometry shaders. The technique targets the first generation of OpenGL 3 capable cards and takes advantage of geometry shaders&#8217; capability to reduce the emitted geometry amount in order to get to a [...]]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 160px"><img src="http://rastergrid.com/blog/wp-content/uploads/2010/02/Nature-2010-02-08-20-20-36-24-150x150.png" alt="" width="150" height="150" /><p class="wp-caption-text">OpenGL 3.3 - Nature</p></div>
<p>A few months ago I presented an object culling mechanism that I named Instance Cloud Reduction (ICR) in the article <a title="Instance culling using geometry shaders" href="http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/">Instance culling using geometry shaders</a>. The technique targets the first generation of OpenGL 3 capable cards and takes advantage of the geometry shaders&#8217; ability to reduce the amount of emitted geometry in order to get a fully GPU accelerated algorithm that performs view frustum culling on instanced geometry without the need for OpenCL or any other GPU compute API. After the culling step the reduced set of instance data is fed to the drawing pass in the form of a texture buffer. In this article I will present an improved version of the algorithm that uses instanced arrays, recently introduced in OpenGL 3.3, to optimize it further.</p>
<p><span id="more-251"></span>Let&#8217;s recap the basics of the algorithm before I present the improved technique. Geometry shaders have a very nice feature: they can not only emit a modified version of the input geometry but can also change the number of emitted primitives compared to the number of received ones. This works both ways, which means that we can not only increase but also decrease the number of primitives. That is what the technique takes advantage of.</p>
<p>In the first pass we feed a simple vertex shader &#8211; geometry shader pair with the instance data of the geometries as if it were the data of point primitives. The vertex shader checks whether the actual instance is inside the view frustum and sends the result to the geometry shader. If the instance is visible, the geometry shader outputs the instance data, otherwise it discards it. The primitives emitted by the geometry shader are then captured into a buffer object using transform feedback. A query object is also needed to get the number of instances that passed the view frustum culling. In the drawing pass we use the result of the query to decide how many instances we have to draw, and the captured feedback buffer is used as instance data.</p>
<div class="wp-caption aligncenter" style="width: 660px"><img src="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_combined.png" alt="" width="650" height="347" /><p class="wp-caption-text">Instance Cloud Reduction - Combined view of Pass 1 + Pass 2</p></div>
<p>This is a very brief description of the culling mechanism so for a complete specification please read the <a title="Instance culling using geometry shaders" href="http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/">original article</a>.</p>
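<p>For those who prefer code, the following C++ sketch is a CPU analogy of the culling pass. It is not how the GPU implementation looks: the types and function names are hypothetical, and testing a bounding sphere against six frustum planes is only an assumption made for the sake of the example. The returned vector plays the role of the transform feedback buffer, and its size plays the role of the query result.</p>
<pre class="brush:cpp">#include &lt;array&gt;
#include &lt;vector&gt;

// Plane in the form n.p + d = 0, with the normal pointing inside the frustum.
struct Plane { float nx, ny, nz, d; };

// Instance data: position of the instance and its bounding sphere radius.
struct Instance { float x, y, z, radius; };

// Counterpart of the vertex shader: is the instance (partially) inside?
bool isVisible(const Instance&amp; instance, const std::array&lt;Plane, 6&gt;&amp; frustum)
{
    for (const Plane&amp; plane : frustum) {
        const float distance = plane.nx * instance.x + plane.ny * instance.y
                             + plane.nz * instance.z + plane.d;
        if (distance &lt; -instance.radius) {
            return false;  // completely behind one of the planes
        }
    }
    return true;
}

// Counterpart of the geometry shader + transform feedback: only the
// visible instances are written to the output.
std::vector&lt;Instance&gt; cullInstances(const std::vector&lt;Instance&gt;&amp; instances,
                                    const std::array&lt;Plane, 6&gt;&amp; frustum)
{
    std::vector&lt;Instance&gt; visible;
    for (const Instance&amp; instance : instances) {
        if (isVisible(instance, frustum)) {
            visible.push_back(instance);
        }
    }
    return visible;
}</pre>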
<h3>Motivation</h3>
<p>While Instance Cloud Reduction is a quite robust technique that can greatly simplify and speed up the rendering of a large amount of instanced geometry, its performance is also limited by some hardware and API restrictions. The most important ones are the following:</p>
<ul>
<li>Needs an extra rendering pass to perform the culling.</li>
<li>Requires the usage of asynchronous queries to determine the number of visible instances.</li>
<li>Uses texture fetching in the vertex shader of the actual drawing pass.</li>
</ul>
<p>The first drawback means that more draw commands are required, and that the second pass uses the output of the first pass as its input. This and the second drawback may cause stalls, because the CPU has to wait for the data to be ready before issuing the second pass, so the GPU is not used effectively.</p>
<p>What this improvement tries to solve is the third problem. Texture fetching itself is quite fast on the latest generation of hardware, but it still causes some slowdown because of the latency of the fetches, even though GPUs use latency hiding techniques.</p>
<p>Instanced arrays provide a way to replace texture fetching with vertex fetching, which is usually done by a different hardware unit that works synchronously with the execution of vertex shaders. I expected quite a reasonable speedup from taking advantage of instanced arrays, however we will see that the actual results were far from my initial expectations.</p>
<h3>Implementation</h3>
<p>Traditional vertex fetching happens in a way that one element is fetched from each enabled input attribute buffer and the vertex shader is issued with these values. One element in a vertex attribute buffer can mean up to four floating point or integer values and for each execution of the vertex shader one set of these elements is used. There is an internal counter that is increased after each fetch and the next vertex attribute fetch will use this counter as an index into the buffer object.</p>
<p>While this mechanism is satisfactory for most attributes of a vertex, it is not practical for instance data, as such data belongs to an instance rather than to a vertex. Sourcing instance data from vertex attributes with traditional vertex fetching requires a large amount of redundant storage, because the same information has to be repeated for every vertex belonging to a particular instance. This is a waste of both memory and bandwidth, and it also defeats the goal of Instance Cloud Reduction.</p>
<p>Compared to traditional vertex fetching, instanced arrays change how the internal counter used as the index into the vertex attribute buffer is increased. One can set the frequency of the increase using a vertex attribute divisor, which specifies after how many instances the counter shall be increased. This is a per-attribute property, and by setting it to one we end up with exactly what we need: one vertex fetch per instance.</p>
<p>This means that we need only a very minor change compared to the original technique: we replace our texture buffer with a vertex attribute buffer that has a divisor of one and use it as the source of instance data in the vertex shader of the drawing pass.</p>
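<p>As a rough illustration of the counter behavior described above, the following CPU-side sketch models which buffer element is fetched for a given vertex and instance. It is not taken from the demo: <em>fetchIndex</em> and <em>valuesSeenByInstance</em> are hypothetical helpers, and the model ignores details such as base instance offsets. In real OpenGL code the divisor would be set per attribute with <em>glVertexAttribDivisor</em> (or <em>glVertexAttribDivisorARB</em> when using the extension).</p>

```cpp
#include <cstddef>
#include <vector>

// Index of the element fetched from an attribute buffer.
// divisor == 0: the index advances per vertex (traditional fetching).
// divisor  > 0: the index advances once every `divisor` instances.
std::size_t fetchIndex(std::size_t vertex, std::size_t instance, unsigned divisor) {
    return divisor == 0 ? vertex : instance / divisor;
}

// Attribute values seen by the vertex shader invocations of one instance.
// Hypothetical helper, only meant to make the indexing rule visible.
std::vector<float> valuesSeenByInstance(const std::vector<float>& attribute,
                                        std::size_t vertexCount,
                                        std::size_t instance,
                                        unsigned divisor) {
    std::vector<float> seen;
    for (std::size_t vertex = 0; vertex < vertexCount; ++vertex) {
        seen.push_back(attribute.at(fetchIndex(vertex, instance, divisor)));
    }
    return seen;
}
```

<p>With a divisor of one, every vertex of instance 2 reads element 2 of the buffer. With a divisor of zero each vertex reads its own element, which is why traditional fetching would need the instance data duplicated for every vertex.</p>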
<h3>Execution results</h3>
<p>As we are not talking about a new technique but just an optimized implementation of the same method, the best way to evaluate it is by comparing the performance of the new version with the original one.</p>
<p>As I mentioned earlier, I expected a reasonable performance increase from replacing texture fetches with vertex fetches, but in practice the difference was not that significant. However, the performance difference between the two implementations can depend heavily on the underlying hardware, so cards from different vendors and GPU generations may show more divergent behavior. In fact, even driver versions may have an effect on the results.</p>
<div class="wp-caption aligncenter" style="width: 620px"><img class="  " src="http://rastergrid.com/blog/wp-content/uploads/2010/06/comparison.png" alt="" width="610" height="139" /><p class="wp-caption-text">Performance comparison of the old implementation and the presented one on an AMD Radeon HD5770. Scale is in frames per second (higher value is better).</p></div>
<p>Due to the lack of hardware to test with, I checked only one card, namely a Radeon HD5770 with Catalyst 10.6 drivers. I noticed roughly a 10% speedup, as the new version of the Nature demo showed 100 FPS compared to the 90 FPS observed with the old implementation.</p>
<p>Even though this was not exactly the outcome I expected from the new implementation, the assumption may still be valid for older generations of GPUs or for NVIDIA cards. I suspect so because on Shader Model 4.0 cards the hardware implementations of the texture fetching unit and the vertex fetching unit were most probably more differentiated than on the latest GPUs. My guess is also that the difference may be higher on NVIDIA cards, as the vertex fetching hardware in SM 4.0 GeForce cards is less flexible than AMD&#8217;s, considering that the first HD series Radeons already had some form of tessellation functionality that requires more freedom from the vertex pushing hardware.</p>
<p>In order to get a better picture about how effective the presented optimization is, I would like to ask all the visitors of this post to try the two releases and send me feedback about it.</p>
<h3>Conclusion</h3>
<p>We&#8217;ve seen how easy it was to take advantage of instanced arrays in an existing implementation of the ICR technique and how it performs on the latest generation of GPUs compared to the previous version. While this small addition provides some benefits, it also comes at a cost, and we have to talk about that as well.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Eliminates the need for texture fetching in the vertex shader thus improving performance.</li>
<li>Does not compromise the goal and the implementation architecture of the original method.</li>
<li>Frees up one texture unit that was previously reserved for the texture buffer containing the instance data.</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li>Requires OpenGL 3.3 or the <a title="GL_ARB_instanced_arrays" href="http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/instanced_arrays.txt?referer=');">GL_ARB_instanced_arrays</a> extension in addition to the OpenGL 3.2 features.</li>
<li>We may have to sacrifice multiple vertex input attributes to feed the instance data to the shaders.</li>
</ul>
<p>Most of the mentioned benefits and drawbacks are self-explanatory, however I would like to say a few words about the last mentioned one&#8230;</p>
<p>For the purpose of the showcase I used a simple translation as instance data, which means a single vector of floats. In a real life situation one may need more complex transformation data that can only be stored in a matrix. While in the demo the instance data consumed only one vertex attribute slot, a full transformation matrix would require four of them (not to mention other possible instance attributes). As the maximum number of input attributes is severely limited, usually to 16, the optimization can only be applied when all the vertex and instance attributes fit into this limit.</p>
<p>In the original implementation, where a texture buffer was used as input, this did not cause any problem, as the vertex shader is free to fetch any number of texels from it (still, performance can be a concern in this case). When input attribute slots are at a premium, in real life scenarios it is recommended to use quaternions instead of transformation matrices, as they consume half as many attribute slots. Actually this can be a general recommendation, as using quaternions decreases the bandwidth requirements of the instance data fetch and thus increases performance even when there are enough input attribute slots available.</p>
<p>In order to ease the performance comparison for you, you can find download links for both versions of the Nature demo.</p>
<h3>Old version binary release</h3>
<p><strong>Platform:</strong> Windows<br />
<strong>Dependency:</strong> OpenGL 3.2 capable graphics driver<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature12_win32.zip">nature12_win32.zip (3.58MB)</a><br />
<strong>Comments:</strong> This version does <strong>NOT </strong>include the optimization presented in this article.</p>
<h3>Old version source code</h3>
<p><strong>Language:</strong> C++<br />
<strong>Platform:</strong> cross-platform<br />
<strong>Dependency:</strong> GLEW, SFML, GLM<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature12_src.zip">nature12_src.zip (12.6KB)</a><br />
<strong>Comments:</strong> This version does <strong>NOT </strong>include the optimization presented in this article.</p>
<h3>New version binary release</h3>
<p><strong>Platform:</strong> Windows<br />
<strong>Dependency:</strong> OpenGL 3.3 capable graphics driver<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature20_win32.zip">nature20_win32.zip (3.58MB)</a><br />
<strong>Comments:</strong> This version includes the optimization presented in this article.</p>
<h3>New version source code</h3>
<p><strong>Language:</strong> C++<br />
<strong>Platform:</strong> cross-platform<br />
<strong>Dependency:</strong> GLEW, SFML, GLM<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature20_src.zip">nature20_src.zip (12.8KB)</a><br />
<strong>Comments:</strong> This version includes the optimization presented in this article.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/06/instance-cloud-reduction-reloaded/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Flexible static analysis for C++ code bases</title>
		<link>http://rastergrid.com/blog/2010/03/flexible-static-analysis-for-c-code-bases/</link>
		<comments>http://rastergrid.com/blog/2010/03/flexible-static-analysis-for-c-code-bases/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 17:12:37 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[code analysis]]></category>
		<category><![CDATA[CppDepend]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GoogleMock]]></category>
		<category><![CDATA[maintenance]]></category>
		<category><![CDATA[refactoring]]></category>
		<category><![CDATA[SFML]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=190</guid>
		<description><![CDATA[

The importance of static code analysis is already a well known thing in the domain of software development. There are plenty of useful and less useful tools for the purpose, especially in the case of C++. However, even if in general the quality of these softwares is adequate they usually suffer from the inability for [...]]]></description>
			<content:encoded><![CDATA[
<p>The importance of static code analysis is already well known in the domain of software development. There are plenty of useful and less useful tools for the purpose, especially in the case of C++. However, even if the quality of these tools is adequate in general, they usually cannot be extended or customized. Another common problem arises from the fact that the C++ language syntax is overwhelmingly complex, which makes writing the code parser of any static analysis tool a nightmare. In this article I would like to present a tool called CppDepend that gracefully solves these problems, primarily by focusing on an interface that enables full adaptability and extensibility for creating customized metrics that are relevant or applicable in a particular domain.</p>
<p><span id="more-190"></span></p>
<h3>Why static code analysis?</h3>
<p>Analysis of computer software, in particular verification and validation, is a very important factor in professional software development. The process itself can take different forms. Generally, all kinds of verification and validation techniques can be categorized into two major groups: static analysis and dynamic analysis. The key difference between the two is that while dynamic analysis verifies the execution of the code, static analysis works strictly on the code base itself.</p>
<p>There are thousands of reasons why using a static code analysis tool benefits a particular software development process. If you ask various people they will all have their own reasons and rationale. Just to mention my favorites, here is a brief excerpt from the long list:</p>
<ul>
<li>Find coding errors before executing a single line of code. This matters because the project does not have to be built or executed, and in many cases these two phases are quite expensive in both time and budget.</li>
<li>Identify parts of the code that seem to be difficult to maintain or do not conform to the policies of a particular company or organization. This helps us move towards sustainable development by heavily reducing maintenance costs.</li>
<li>Obtain miscellaneous metrics about our code that can be of key importance in measuring the quality of the code base.</li>
</ul>
<blockquote><p>If you can&#8217;t measure it, you can&#8217;t improve it &#8211; Lord Kelvin</p></blockquote>
<p>Many people still think that code metrics are overrated. Even if at first sight this seems to be true for micro-projects, their importance becomes very obvious when one meets a large code base, especially when legacy code is inherited from earlier generations of software developers. When the size of the software exceeds what a programmer is capable of keeping in mind (this means 99% of software products), code metrics provide great value in identifying &#8220;hot spots&#8221; in the code base, no matter what actual situation we are talking about.</p>
<p>Also, deciding whether the evolution of the software goes in the right direction is very difficult, if not impossible, without ways of measuring the quality of the code. The most naive solution to this problem is to measure the number of bug reports filed over time; however, code metrics provide a much more sophisticated way of measuring quality from different aspects and on different levels.</p>
<p>During my career as a software developer, I have also faced many situations when the inspection of legacy code was necessary in order to introduce new functionality. Unfortunately, in most cases, due to the lack of an adequate static code analyzer, this required developers to read and manually inspect the code in order to solve the particular problem. I can tell you that it&#8217;s not a joyful duty. Just to mention some of the most critical situations that developers currently meet in this area:</p>
<p><strong>Removing dependencies on deprecated features.</strong> This is something that every software development project faces from time to time. The interval is usually short: a few years, which is quite frequent compared to other industries. Just think about situations when one migrates to a new version of a third party library that the whole software depends on. As a recent example, we can mention the release of version 3 of the OpenGL specification. Companies developing CAD software faced a huge challenge by being forced to adopt the new features as the old ones became deprecated and obsolete. Actually they were quite lucky that vendors declined to drop the old features from their implementations. Using a code analyzer one can easily identify the modules that need to be modified in order to adapt to the latest changes.</p>
<p><strong>Introducing multiprocessing.</strong> This is also an imminent problem that every software development company will face sooner or later. Code bases inherited from previous decades were not prepared to handle concurrent execution, which gives software architects big headaches when they have to redesign the code to be SMP compliant, especially when dealing with multi-core processors. I&#8217;ve also faced this situation during my career, and it was a painful lesson in how important code analysis capabilities are. Without carefully inspecting the whole code base it is very difficult to identify the problems that may arise from the introduction of multiprocessing. Automatic inspection of the code can be a very handy tool for minimizing the required effort.</p>
<h3>What makes up a good static code analysis tool?</h3>
<p>There are many different aspects that affect how good a particular static code analysis tool is. In many situations competing alternatives for this purpose are scarce. Fortunately, this is not the case for C++, as it is a programming language that is well supported by the community. However, in order to choose a suitable alternative we have to collect our requirements:</p>
<ul>
<li><strong>Correctness</strong> &#8211; It must correctly analyze the code. This is a very basic requirement of any software development tool. While this seems to be a completely obvious requirement and one expects tools to behave accordingly, most such tools for C++ do not conform to this principle. Those who know the C++ language standard know well that writing a good parser for it is almost impossible.</li>
<li><strong>Usefulness</strong> &#8211; There is no sense in using a static code analyzer if we don&#8217;t get any benefit from it. The reports generated by the analyzer should provide useful information that is directly applicable in a particular use case. One typical example that I have also faced quite often is when one analyzes legacy code and gets a report about thousands of problematic code parts. Such reports are almost impossible to handle, and developers struggle to answer even the very simple question: where to start?</li>
<li><strong>Customizability</strong> &#8211; This requirement directly relates to the previous one. Taking the previous example, if there were a way to customize the report so that it covers only the 10 most problematic modules, it would be much easier to handle. However, this requirement goes far beyond this. As an example, beside its built-in metrics, the analysis tool should provide means to add or modify metrics in order to get more relevant measures of the code that fit a particular domain or use case.</li>
</ul>
<p>We&#8217;ve just mentioned three requirements explicitly and we already heavily reduced the number of alternatives&#8230;</p>
<h3>CppDepend as a flawless alternative</h3>
<p>Recently I got a request to review a C++ static code analysis tool called CppDepend. After taking a brief look at the product I realized that it deserves a thorough inspection, as it features a revolutionary technology called CQL that I will talk about a bit later in the article.</p>
<p>CppDepend was developed in partnership with NDepend and was released six months ago after two years of development by a very small team of experts. It is accompanied by its siblings NDepend and XDepend, which accomplish the same job for .NET and Java projects respectively.</p>
<p>We are talking about a Windows application that has tight integration with Visual Studio projects but can also be applied to projects built with other development tool-sets. Beside being a command-line static code analysis tool for the C++ language, it provides a powerful GUI tool for visual inspection of different aspects of the code base, thus enabling increased productivity and ease of use.</p>
<p>Let&#8217;s take our first look at the tool by using the visual interface to analyze a sample code base, which in our case will be the source code of <a title="Simple and Fast Multimedia Library" href="http://www.sfml-dev.org/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.sfml-dev.org/?referer=');">SFML</a>.</p>
<p>Setting up the basic configuration for an analysis project is very straightforward. Beside that, the code analysis itself is surprisingly fast. While testing, the longest run was when I parsed the code of the <a title="Bullet Physics Library" href="http://bulletphysics.org/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/bulletphysics.org/?referer=');">Bullet Physics Library</a>, but even that didn&#8217;t take a minute on my system.</p>
<div id="attachment_194" class="wp-caption aligncenter" style="width: 624px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/03/cppdepend.png"><img class="size-large wp-image-194 " title="CppDepend graphical user interface" src="http://rastergrid.com/blog/wp-content/uploads/2010/03/cppdepend-1024x789.png" alt="CppDepend graphical user interface" width="614" height="473" /></a><p class="wp-caption-text">CppDepend graphical user interface</p></div>
<p>The visual controls themselves are sometimes not very responsive due to the complex structures and relationships they present, but we soon forgive CppDepend this minor issue when we take a closer look at the navigation possibilities offered by the tool.</p>
<p>At first sight, the user interface seems to be a bit overcomplicated, but we soon realize that each and every element of it is there on purpose in order to provide as much freedom in navigation as possible. Just to mention the most interesting ones, here&#8217;s an explanation of the purpose of the graphical figures at the top right part of the GUI:</p>
<ul>
<li>At top left we see a graphical representation of the currently selected code metric. It shows the magnitude of the result of the metric according to the selected level of granularity. We can easily visualize here as an example how the size of different classes of our project compare to each other.</li>
<li>At middle left is the dependency matrix of our solution. Here we can easily find &#8220;hot spots&#8221; in our code regarding coupling, by default on project level. The granularity of the table can be easily changed in a non-proportional way from project level down to method level. I used the word &#8220;non-proportional&#8221; intentionally, as we can examine dependencies even between a method and a foreign project, which provides additional flexibility over how fine grained we would like our numbers to be.</li>
<li>My favorite is in the middle, called dependency graph. It can present the dependencies between different software elements from project level down to method level, as usual, by means of a graph that is very convenient for human inspection.</li>
</ul>
<p>The whole user interface is designed in a way that each time we point on a particular element it shows convenient information about that particular element and its environment, no matter if we talk about the metrics view, the dependency graph or matrix.</p>
<p>Beside the tools for navigation and easy visualization, the GUI provides a collection of built-in reports about different aspects of the code. One of the first things everybody would try out from these is the query called &#8220;Quick summary of methods to refactor&#8221;. This is exactly the answer the developer would like to have to the question &#8220;where to start?&#8221; that I mentioned earlier.</p>
<p>To emphasize even more how convenient the user interface is: when one selects a particular query, it immediately shows the results as a list of classes, methods or other code elements, and the elements in question are immediately highlighted in the relevant graphical views as well.</p>
<p>Maybe I have already convinced most of you that CppDepend deserves attention as a valuable tool in good hands, but I haven&#8217;t even talked about the most interesting feature, the one that really makes it uniquely powerful software.</p>
<h3>The power of extensibility</h3>
<p>I have repeatedly stressed the importance of extensibility and customizability of a static code analyzer. This, in fact, is not just my obsession but an important factor in the decision of most software developers out there. Being able to get some common metrics about the code is one thing, having the possibility to define our own metrics and analysis criteria is another&#8230;</p>
<p>The power of CppDepend lies in a revolutionary technology that lets us retrieve the information about the code that is relevant for us as easily as querying a relational database. The apparatus in our hand to achieve this is the <a title="Code Query Language 1.8 Specification" href="http://www.cppdepend.com/CQL.htm" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.cppdepend.com/CQL.htm?referer=');">Code Query Language (CQL)</a>. CppDepend actually builds an internal database structure from the source code and provides us an SQL-like language to write queries that fetch reports from this internal database. Those who are already familiar with SQL will adore this feature. Just to illustrate how easy it is to build custom queries with CQL, querying the classes that have more than 20 methods is as simple as the following line of CQL code:</p>
<pre class="brush: sql">SELECT TYPES WHERE NbMethods &gt; 20</pre>
<p>Simple, isn&#8217;t it? For further details, please refer to the specification of the Code Query Language: <a href="http://www.cppdepend.com/CQL.htm" onclick="pageTracker._trackPageview('/outgoing/www.cppdepend.com/CQL.htm?referer=');">http://www.cppdepend.com/CQL.htm</a></p>
<p>This means that the software developers have complete freedom over how they define the metrics that indicate whether the code quality reaches the levels required by company policies or individual needs. It is also useful to solve the problems arising from the sample situations I&#8217;ve mentioned earlier, namely the problem with dependency on deprecated features and the introduction of multiprocessing, by easily and clearly identifying the modules that need to be changed even in situations when the code base is extremely huge and traditional ways for identifying affected modules are not applicable or simply not feasible.</p>
<h3>Endurance test</h3>
<p>Well, I&#8217;ve already talked enough about the abilities of CppDepend regarding usefulness and customizability; however, I&#8217;ve barely touched the topic of correctness. As I&#8217;ve already mentioned, parsing C++ code correctly is not as easy as it may look. To test this I&#8217;ve prepared a bunch of template heavy libraries like <a title="OpenGL Mathematics" href="http://glm.g-truc.net/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/glm.g-truc.net/?referer=');">GLM</a> and <a title="GoogleMock" href="http://code.google.com/p/googlemock/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/googlemock/?referer=');">GoogleMock</a> to check how well CppDepend handles code bases when it comes to awkward features of the C++ language.</p>
<p>Even though static analysis tools generally do not provide much useful information about such projects, due to their special nature, it still seemed worthwhile to have CppDepend parse these libraries in order to get a picture of how it would handle huge projects that also take advantage of the templating mechanisms of C++. I have to say that the results are very promising, as it had problems only with GoogleMock, and the developers have already been informed about the problem I encountered.</p>
<h3>The dark side of the story</h3>
<p>While CppDepend is an excellent tool for software developers working under Windows, especially if they use Visual Studio, I would like to see a cross-platform version of CppDepend in the future, at least for Linux and MacOSX.</p>
<p>Also, CppDepend does not come for free, although the price is reasonable. Even though individuals and hobbyists would most probably not consider buying it, for enterprises, even small ones, the tool will most probably pay for itself soon by heavily decreasing the short- and long-run maintenance costs of development.</p>
<h3>Conclusion</h3>
<p>A clever static code analysis tool is nowadays a must for every software development company that deals with code whose size has already passed a certain threshold, but it is also good to use one from the very beginning of a new project. Selecting a particular tool for this purpose is the choice of the enterprise; still, the requirements for such software are usually the same.</p>
<p>CppDepend proved to me to be a valuable piece of software in the tool-chain of every C++ programmer using Windows as the primary development platform. If you are still not convinced then check out the <a title="CppDepend - Features" href="http://www.cppdepend.com/Features.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.cppdepend.com/Features.aspx?referer=');">full feature list</a> on the official site.</p>
<p>Even if you are not interested in using CppDepend or in static analysis tools at all, you should still take a look at CQL and the great idea behind it as it is a perfect example how a solution for a well discussed problem can ascend to new levels by adopting good practices from other domains, in this case from relational databases and related technologies.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/03/flexible-static-analysis-for-c-code-bases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unit testing OpenGL applications</title>
		<link>http://rastergrid.com/blog/2010/02/unit-testing-opengl-applications/</link>
		<comments>http://rastergrid.com/blog/2010/02/unit-testing-opengl-applications/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 19:54:15 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GoogleMock]]></category>
		<category><![CDATA[macro]]></category>
		<category><![CDATA[mocks]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[TDD]]></category>
		<category><![CDATA[unit test]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=182</guid>
		<description><![CDATA[

Nowadays comprehensive testing is a must for any software product. However, it isn&#8217;t such a general rule when it comes to graphics applications. Many developers face difficulties when they have to test their rendering codes. Manual tests and visual feedback is sometimes satisfactory but if one would like to have automated regression tests usual approaches [...]]]></description>
			<content:encoded><![CDATA[
<p>Nowadays comprehensive testing is a must for any software product. However, it isn&#8217;t such a general rule when it comes to graphics applications. Many developers face difficulties when they have to test their rendering code. Manual tests and visual feedback are sometimes satisfactory, but if one would like to have automated regression tests, the usual approaches seem to fail. Even if at first sight unit testing of rendering code doesn&#8217;t look straightforward, in fact it is, and OpenGL is no exception. In this article I would like to briefly present a few methods for unit testing OpenGL rendering code and also present my choice and the reasons behind the decision.</p>
<p><span id="more-182"></span>There are several ways how to create automated test cases for rendering code. To present the different approaches we first have to select a small portion of our rendering code to demonstrate the differences of each technique, mentioning the strengths and weaknesses of them.</p>
<p>Before going any further, we have to lay down our requirements against a good OpenGL unit testing environment:</p>
<ul>
<li><strong>Verifies results</strong> &#8211; This is the most basic requirement for any testing framework. We have to have the ability to check whether the rendering code executed by the module is valid and works as expected.</li>
<li><strong>Productive</strong> &#8211; The usage and maintenance of the framework shall require minimal effort. Many times unit testing is attacked because it requires additional code writing. While this is generally true, a nice unit testing environment can be kept very simple yet flexible. An OpenGL testing environment shouldn&#8217;t be different.</li>
<li><strong>Fast</strong> &#8211; This is a general requirement for any unit testing environment especially when combined with a continuous integration framework. We want our test results as fast as possible as long feedback cycles severely slow down the development process.</li>
<li><strong>Standalone</strong> &#8211; Does not require complex setup or environmental support in order to be executed. This is a general requirement when we deal with unit testing, because if the code is tightly coupled to any of its surroundings then both development and maintenance costs increase.</li>
<li><strong>Compatible</strong> &#8211; Does not require any special hardware, so the tests can be run on a machine that wouldn&#8217;t necessarily be suitable for manually testing the actual product. This is especially important when the target hardware is some type of embedded platform. It is also important to ensure that it will work on hardware provided by different vendors. In short, it should comply with the standard, not with driver implementations.</li>
<li><strong>Cross-platform</strong> &#8211; Does not rely on the services of a particular operating system or platform; instead it can be executed on any machine, as unit tests usually can. Of course, this restriction can be relaxed depending on actual use case scenarios.</li>
</ul>
<p>Now that we know what we would like to achieve, we can continue with a sample use case. Let&#8217;s say we would like to create an OpenGL 3.2 based rendering engine. One of the first things we would write is a class (or set of classes) that helps us handle OpenGL buffer objects, as they are among the main building blocks of such a system. As a very basic example, the first version of our buffer handling class will simply act as a wrapper for buffer objects, with the following interface:</p>
<pre class="brush: cpp">class Buffer {
public:
    Buffer();
    virtual ~Buffer();
};</pre>
<p>As can be seen, for now we only require our class to handle the creation and deletion of a buffer object. Obviously, our test has to check that the constructor creates a buffer by calling <em>glGenBuffers</em> and that the destructor deletes it by calling <em>glDeleteBuffers</em> with the proper arguments. Now let&#8217;s see what possibilities we have for testing OpenGL rendering code, whether each of them meets our requirements, and whether it is able to test our simple module.</p>
<h3>Checking rendered image</h3>
<p>The most naive solution for creating automated tests for rendering code is to actually execute the OpenGL commands and check whether the rendering happened as expected. This can be done by comparing the actual rendering results to reference ones. This approach has the benefit that we verify the concrete behavior, but let&#8217;s see how it fares against our previously laid down requirements:</p>
<ul>
<li><strong>Verifies results</strong> &#8211; Partially fulfilled. We check against the correct behavior; however, reproducing exactly the same image is often difficult if not impossible, because both the standard and the driver implementations allow some latitude in precision. In order to have reproducible results, the testing environment also has to provide some mechanism that tolerates slight differences.</li>
<li><strong>Productive</strong> &#8211; Not met. It can be quite expensive to create an assertion system. Also, the production of reference data can be quite time consuming.</li>
<li><strong>Fast</strong> &#8211; Not met. Even if the checkers are highly optimized components of the framework, it wouldn&#8217;t fit into the time-frame of unit test cycles to execute possibly thousands of test cases that require complete verification of the produced image.</li>
<li><strong>Standalone</strong> &#8211; Not met. We have to set up a complete rendering environment in order to test even the simplest rendering code. Also, it relies on the assumption that the rendering code actually produces some image. As we can see in our buffer handling example, this is not always the case.</li>
<li><strong>Compatible</strong> &#8211; Not met. We need a testing machine that has the hardware capabilities to execute the rendering code and produce the required image.</li>
<li><strong>Cross-platform</strong> &#8211; Partially fulfilled. If our rendering code is cross-platform then it is possible to test it on any of the supported platforms. However, this makes the assertion system even more complicated as it also has to support the target platforms. Also, driver implementations may vary even further when dealing with different operating systems.</li>
</ul>
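<p>The tolerance mechanism mentioned under &#8220;Verifies results&#8221; can be sketched as follows. This is only a minimal illustration, assuming tightly packed 8-bit channel data; the helper name <em>imagesMatch</em> and the per-channel threshold are hypothetical and not part of any existing framework.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: returns true when two equally sized 8-bit images
// differ by at most `tolerance` in every channel value.
bool imagesMatch(const std::vector<std::uint8_t>& reference,
                 const std::vector<std::uint8_t>& actual,
                 int tolerance) {
    if (reference.size() != actual.size()) {
        return false;  // different dimensions can never match
    }
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const int diff = std::abs(static_cast<int>(reference[i]) -
                                  static_cast<int>(actual[i]));
        if (diff > tolerance) {
            return false;  // one channel is further off than allowed
        }
    }
    return true;
}
```

<p>Even such a simple check has to touch every channel of every pixel, which hints at why this approach scales badly to thousands of test cases.</p>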
<p>As we can see, even though this approach is the natural way of thinking for most people, it is simply impractical for actual use. To find a good solution we must look deeper into what unit testing really is, as the presented solution has little to do with it. In order to do real unit testing we have to eliminate the dependency on OpenGL driver implementations and concentrate strictly on the module under test.</p>
<h3>Fake OpenGL driver</h3>
<p>The second solution is to create a layer between the code under test and the actual OpenGL driver implementation. This can be achieved by creating a fake driver, for example a dynamic library called <em>opengl32.dll</em> in the case of Windows. This additional layer does nothing more than record the API calls and check whether they happened as expected. It also provides an interface towards the testing environment that can be used to request the information needed to reach a verdict about the success of the test case.</p>
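<p>To make the idea more concrete, here is a minimal single-file sketch of such a recording layer. It is not a real driver replacement: the <em>GLsizei</em> and <em>GLuint</em> typedefs are stand-ins so that the code compiles without OpenGL headers, and <em>RecordedCall</em>, <em>fakeDriverLog</em> and the body of <em>Buffer</em> are assumptions made for illustration only.</p>

```cpp
#include <string>
#include <vector>

// Stand-ins for the OpenGL typedefs so the sketch compiles without GL headers.
typedef int GLsizei;
typedef unsigned int GLuint;

// One recorded API call: the function name and the buffer ID involved.
struct RecordedCall {
    std::string name;
    GLuint id;
};

// Hypothetical query interface of the fake driver, used by the test code.
std::vector<RecordedCall>& fakeDriverLog() {
    static std::vector<RecordedCall> log;
    return log;
}

// Fake entry points: they touch no hardware, they only record what happened.
void glGenBuffers(GLsizei n, GLuint* buffers) {
    static GLuint nextId = 1;
    for (GLsizei i = 0; i < n; ++i) {
        buffers[i] = nextId++;
        fakeDriverLog().push_back(RecordedCall{"glGenBuffers", buffers[i]});
    }
}

void glDeleteBuffers(GLsizei n, const GLuint* buffers) {
    for (GLsizei i = 0; i < n; ++i) {
        fakeDriverLog().push_back(RecordedCall{"glDeleteBuffers", buffers[i]});
    }
}

// Assumed implementation of the Buffer class whose interface was shown earlier.
class Buffer {
public:
    Buffer() : _id(0) { glGenBuffers(1, &_id); }
    virtual ~Buffer() { glDeleteBuffers(1, &_id); }
private:
    GLuint _id;
};
```

<p>A test would create and destroy a <em>Buffer</em> and then inspect the log to verify that the generated ID is the one that was deleted. In a real fake driver the same functions would be exported from the replacement library instead of being compiled into the test.</p>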
<p>Besides fitting the idea behind unit testing much better, this version also has the benefit that it acts as a totally independent layer and does not directly disturb the development of the actual code. Still, if we go back to our checklist, some issues raise concerns about the applicability of this approach:</p>
<ul>
<li><strong>Verifies results</strong> &#8211; Partially fulfilled. It is up to the implementation of the new layer whether it provides the facilities required to properly check the behavior of our tested code. It also depends heavily on how that implementation defines correct behavior and where it draws the responsibility of the library.</li>
<li><strong>Productive</strong> &#8211; Partially fulfilled. Now we have a separate module that helps us in the testing. This may introduce some additional maintenance work but, of course, this depends on how intelligently the library is actually implemented.</li>
<li><strong>Fast</strong> &#8211; Mostly resolved. We do not have expensive assertions; however, as the interface between our testing environment and the new layer is quite restricted, we will most probably meet situations where we have to make trade-offs between speed and flexibility.</li>
<li><strong>Standalone</strong> &#8211; Resolved. We have a totally independent module that is responsible for simulating the surrounding environment of the code under test, as it should be when doing unit testing. However, the question arises whether we would like this layer to be that separated from the testing code.</li>
<li><strong>Compatible</strong> &#8211; Resolved. There is no dependency on dedicated graphics hardware or any other piece of metal. In case of a robust driver simulation layer we can test our code on whatever platform we prefer.</li>
<li><strong>Cross-platform</strong> &#8211; Resolved. As previously mentioned, if the additional layer is well designed, there should be no problems in this regard.</li>
</ul>
<p>Now we have a solution that can be seriously considered as a good way to test rendering code. It can easily be applied to test our buffer handling code as well. Since it is a totally standalone software element, it is also very portable, so it is easy to reuse between projects written in different programming languages and for different platforms.</p>
<p>Still, there is one thing that may need further investigation. Most probably we already use some kind of mocking mechanism for unit testing the other portions of our production code. Having an additional interface type to handle the OpenGL related mocking (the presented fake driver approach is nothing more than a mock library for OpenGL) may reduce the productivity of our developers. It can also make the testing code less uniform, introducing a slight maintenance penalty. At least for comparison, we should try to integrate the OpenGL mocking into our existing mocking facilities.</p>
<h3>API mocks</h3>
<p>All the people who seriously do unit testing use some mocking techniques to eliminate dependency on any external software element like databases, network or another code element. Why should the OpenGL API be different?</p>
<p>As I have already written, I use <a title="GoogleMock" href="http://code.google.com/p/googlemock/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/googlemock/?referer=');">GoogleMock</a> to test my C++ code. Let&#8217;s see how well this mocking framework can remove OpenGL related dependencies. By default, GoogleMock supports only class mocks; however, it is fairly straightforward to mock out OpenGL API functions as well. As an example, our buffer handling class needs at least a mock for the <em>glGenBuffers</em> and <em>glDeleteBuffers</em> API functions. These mocks can be created very easily with GoogleMock as part of a class in the following way:</p>
<pre class="brush: cpp">class CGLMock {
public:
    MOCK_METHOD2( GenBuffers, void(GLsizei n, GLuint* buffers) );
    MOCK_METHOD2( DeleteBuffers, void(GLsizei n, const GLuint* buffers) );
};
CGLMock GLMock;</pre>
<p>This, however, is not enough to replace the already existing real API function pointers with the fake ones. I did this with a nasty little trick that takes advantage of the C preprocessor:</p>
<pre class="brush: cpp">#undef glGenBuffers
#define glGenBuffers                  GLMock.GenBuffers
#undef glDeleteBuffers
#define glDeleteBuffers               GLMock.DeleteBuffers</pre>
<p>The <em>#undef</em> is needed because I use <a title="GLEW" href="http://glew.sourceforge.net/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/glew.sourceforge.net/?referer=');">GLEW</a> for accessing OpenGL API functions and it uses macros for the API function names as well.</p>
<p>All of this goes into a file that can be called something like <em>glmock.h</em>. In order to force the production code to use these definitions when accessing the API inside a test case, we have to create a wrapper header called something like <em>opengl.h</em> that includes the original headers in a normal build and the mock library in a unit test build. This is kind of a workaround, but it works quite well in practice.</p>
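<p>The header selection can be sketched in a single file as shown below. Several things here are assumptions made only to keep the example free of dependencies: the macro name <em>UNIT_TEST_BUILD</em> is hypothetical, the typedefs stand in for the real OpenGL types, and a hand-written counting class takes the place of the GoogleMock based mock, which in the real setup would come from <em>glmock.h</em>.</p>

```cpp
// Normally passed by the build system (for example -DUNIT_TEST_BUILD).
#define UNIT_TEST_BUILD 1

// ---- what a wrapper header such as opengl.h could contain ----
#ifdef UNIT_TEST_BUILD
// Unit test build: stand-in for the contents of glmock.h.
typedef int GLsizei;
typedef unsigned int GLuint;

struct CountingGLMock {
    int genCalls = 0;
    int deleteCalls = 0;
    void GenBuffers(GLsizei n, GLuint* buffers) {
        ++genCalls;
        for (GLsizei i = 0; i < n; ++i) buffers[i] = 13;  // fixed fake ID
    }
    void DeleteBuffers(GLsizei, const GLuint*) { ++deleteCalls; }
};
static CountingGLMock GLMock;

// Redirect the API names to the mock object, as described in the article.
#undef glGenBuffers
#define glGenBuffers    GLMock.GenBuffers
#undef glDeleteBuffers
#define glDeleteBuffers GLMock.DeleteBuffers
#else
// Normal build: the real OpenGL entry points.
#include <GL/glew.h>
#endif

// ---- production code: sees only the wrapper, unaware of the build type ----
class Buffer {
public:
    Buffer() : _id(0) { glGenBuffers(1, &_id); }
    virtual ~Buffer() { glDeleteBuffers(1, &_id); }
private:
    GLuint _id;
};
```

<p>The production class is written exactly as it would be against the real API; only the preprocessor decides where the calls end up.</p>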
<p>In theory, this trick can be applied in case of any mocking framework. As a result, from now we can write a very simple test case to check the creation and deletion of our buffer object as easily as the following few lines of code:</p>
<pre class="brush: cpp">TEST(BufferTest, CreationAndDestruction) {
    EXPECT_CALL(GLMock, GenBuffers(1,_))
        .WillOnce(SetArgumentPointee&lt;1&gt;(13));
    Buffer* buffer = new Buffer;
    EXPECT_CALL(GLMock, DeleteBuffers(1,Pointee(13)));
    delete buffer;
}</pre>
<p>I would not like to go into the details of the GoogleMock interface. In short, the test case above checks whether the constructor calls <em>glGenBuffers</em> requesting exactly one buffer object, makes the mock return the buffer ID 13 through the pointer argument, and at the end checks whether <em>glDeleteBuffers</em> was called with the buffer ID received at creation.</p>
<p>It is perhaps a matter of taste whether the second or this third solution is more attractive to you. My choice was this last solution because I didn&#8217;t want to develop a separate library, and I was also afraid of cluttering my test code with different syntactical representations of mocks. Finally, let&#8217;s sum up the achievements of this last version:</p>
<ul>
<li><strong>Verifies results</strong> &#8211; Fulfilled. An existing mocking framework is used for emulating the OpenGL API thus we have all the facilities required for the proper checking of the API calls.</li>
<li><strong>Productive</strong> &#8211; Fulfilled. Again, we don&#8217;t have to write our own mocking mechanism as we have everything out of the box. We can also incrementally extend our mock library on-the-fly while editing the test cases and the production code.</li>
<li><strong>Fast</strong> &#8211; Resolved. Our rendering related unit test cases should be as fast as any other test code, as they are no different in nature; only their purpose differs.</li>
<li><strong>Standalone</strong> &#8211; Mostly resolved. The mocking library is independent; however, as we&#8217;ve seen, introducing it may require some nasty tricks in order to inject foreign code into the production code.</li>
<li><strong>Compatible</strong> &#8211; Resolved. From this point of view, this approach behaves the same as the previous version.</li>
<li><strong>Cross-platform</strong> &#8211; Resolved. Again, the same as in the previous case, and maybe even a bit easier to make portable.</li>
</ul>
<h3>Conclusion</h3>
<p>We&#8217;ve seen a few ways to extend our testing environment in order to support the verification of rendering code. We&#8217;ve also seen that the range goes from techniques that provide high level methods especially suitable for functional testing to very low level methods that integrate tightly into the mocking methodology of unit testing. These, of course, do not replace traditional testing methods; rather, they extend them in order to find problems in the early phases of software development.</p>
<p>I also tried to present a very basic example of production code that needs such a facility in order to be tested, as well as a sample test case written with GoogleMock that applies the last presented technique.</p>
<p>While writing this article I got the idea that it would be nice to have a complete and general framework for OpenGL testing. If there is interest in it, maybe I&#8217;ll allocate some time to write one. I&#8217;m also interested in which approach is the most attractive to you, especially if you have some concrete experience with any of these or with some other technique.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/02/unit-testing-opengl-applications/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
		</item>
		<item>
		<title>One more degree of freedom for C++</title>
		<link>http://rastergrid.com/blog/2010/02/one-more-degree-of-freedom-for-c/</link>
		<comments>http://rastergrid.com/blog/2010/02/one-more-degree-of-freedom-for-c/#comments</comments>
		<pubDate>Sun, 14 Feb 2010 14:38:42 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[callback]]></category>
		<category><![CDATA[delegate]]></category>
		<category><![CDATA[delegation]]></category>
		<category><![CDATA[Delphi]]></category>
		<category><![CDATA[event handling]]></category>
		<category><![CDATA[message]]></category>
		<category><![CDATA[signal]]></category>
		<category><![CDATA[signals and slots]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=176</guid>
		<description><![CDATA[

Those who worked enough with C or other procedure oriented languages know how much flexibility callbacks provide. The simplest example is the qsort function of the C standard library. It is also not unintentional that many libraries, windowing system APIs and operating system APIs also highly rely on callbacks to pass a particular task over [...]]]></description>
			<content:encoded><![CDATA[
<p>Those who have worked enough with C or other procedural languages know how much flexibility callbacks provide. The simplest example is the qsort function of the C standard library. It is no accident that many libraries, windowing system APIs and operating system APIs rely heavily on callbacks to pass a particular task over to another program module; callbacks are one of the fundamental tools needed to implement an event-driven application. At the same time, object oriented languages do not directly support the concept of callbacks, as they don&#8217;t really fit into the paradigms used by these languages. Fortunately, even if not always as a language feature, most object oriented languages support a facility similar to callbacks in the form of delegates.</p>
<p><span id="more-176"></span>Delegation as a design pattern describes the situation when one object passes on the implementation of a particular task to another object. This clearly reflects the purpose of callbacks in procedural languages. Many languages natively support some form of delegation; some of the well known ones are C# and Delphi.</p>
<h2>Callbacks</h2>
<p>As mentioned before, procedural languages use callbacks to delegate functionality to other modules. These callbacks are specified by passing function pointers to registration functions provided by the library. Here is a very simple C example:</p>
<pre class="brush: c">/* server header */
void registerFooCallback(int (*fooCB)(int, float));
int doFoo(int a, float b);

/* client code */
#include &lt;stdio.h&gt;

int myFooCallback(int a, float b) {
    /* ... do something ... */
    return 0;
}

int main(void) {
    registerFooCallback(myFooCallback);
    printf("%d\n", doFoo(5, 3.2f));
    return 0;
}</pre>
<p>Here we can see how easily callbacks allow user code to be injected for handling events that happen in the server.</p>
<h2>Delegation as a design pattern</h2>
<p>The simplest way to create object oriented callbacks is by applying the design pattern of delegation. If we would like to construct the C++ equivalent of the example above using the mentioned pattern, we end up with something like the following:</p>
<pre class="brush: cpp">/* server header */
class IFooCallback {
public:
    virtual ~IFooCallback() {}
    virtual int operator() (int a, float b) = 0;
};

class Foo {
private:
    IFooCallback* _fooCB;
public:
    void registerCallback(IFooCallback* fooCB);
    int doFoo(int a, float b);
};

/* client code */
#include &lt;iostream&gt;

class MyFooCallback: public IFooCallback {
public:
    int operator() (int a, float b) {
        /* ... do something ... */
        return 0;
    }
};

int main() {
    Foo foo;
    MyFooCallback fooCB;
    foo.registerCallback(&amp;fooCB);  /* the server expects a pointer */
    std::cout &lt;&lt; foo.doFoo(5, 3.2f);
    return 0;
}</pre>
<p>As you can see, it is quite straightforward to provide an object oriented alternative to callbacks. However, there is a very significant drawback when using the technique above, namely the type intrusion that inherently comes with this definition of a callback. The client code needs to explicitly derive its own class from a type defined in the server. This results in tight coupling and is likely to bring further disadvantages in terms of maintainability and migration.</p>
<h2>Delegate methods</h2>
<p>In our previous attempt to provide an easy to use C++ alternative for callbacks with OOP in mind, we replaced function pointers with an abstract base class that acts as an interface definition for our callback. However, this somewhat violates the original goal of delegates, which by definition should be some form of run-time inheritance (this varies from definition to definition; still, this is the one that I&#8217;m referring to in this article). We soon figure out that the most convenient way would be to assign member functions of any class as a callback. Obviously, the parameters and return type should still match as before to provide type safety, but we would like to remove any additional dependencies between the client and the server.</p>
<p>While C++ does have the concept of pointers to member functions, there is no easy and standard way to implement callbacks using them. Or is there? First of all, there is no particular problem with static member functions as they are much like C functions; however, limiting delegates to static methods heavily restricts the freedom of the developer. The problem with non-static member functions, and especially with virtual member functions, is that they have the implicit parameter <strong>this</strong> that enables them to access the object they belong to.</p>
<p>The popular Boost library provides mechanisms that enable the use of object member functions as separate entities through the <strong>bind</strong> functor adaptor, which was also adopted into the C++ standard library extensions published as <a title="ISO/IEC TR 19768:2007" href="http://www.iso.org/iso/catalogue_detail.htm?csnumber=43289" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.iso.org/iso/catalogue_detail.htm?csnumber=43289&amp;referer=');">Technical Report 1</a>. This extension makes it possible to use member functions as delegates in a way that does not involve any type intrusion side effects.</p>
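<p>As an illustration, the foo example could be written with a bound member function roughly as follows. The sketch uses <em>std::function</em> and <em>std::bind</em> from the C++11 <em>&lt;functional&gt;</em> header, the standardized descendants of the Boost and TR1 facilities, which is an assumption about the available toolchain; the <em>offset</em> member and the <em>runBindExample</em> helper are made up for the example.</p>

```cpp
#include <functional>

// Server side: stores any callable with a matching signature.
class Foo {
public:
    typedef std::function<int(int, float)> Callback;
    void registerCallback(const Callback& cb) { _fooCB = cb; }
    int doFoo(int a, float b) { return _fooCB ? _fooCB(a, b) : 0; }
private:
    Callback _fooCB;
};

// Client side: an ordinary class that derives from nothing in the server.
class MyClass {
public:
    explicit MyClass(int offset) : _offset(offset) {}
    int handleFoo(int a, float b) { return a + static_cast<int>(b) + _offset; }
private:
    int _offset;
};

// Binds the member function together with the object it should be called on.
int runBindExample() {
    using namespace std::placeholders;
    MyClass myObj(10);
    Foo foo;
    foo.registerCallback(std::bind(&MyClass::handleFoo, &myObj, _1, _2));
    return foo.doFoo(5, 3.2f);
}
```

<p>The bound object must outlive the registered callback, since only its address is stored.</p>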
<p>Unfortunately, these facilities involve a noticeable performance hit when the callback is invoked, compared to simple method invocations. Also, using functor adaptors for implementing delegates is not the most straightforward approach and makes the code quite ugly compared to an ideal situation where delegates are part of the language itself. Of course, this is only my opinion; others who have used these libraries more often may have a different view on the topic.</p>
<p>Anyway, as performance is always a concern for me, I started to look around for alternatives. To my surprise, I quickly found two of them:</p>
<ul>
<li><a title="Fastest Possible C++ Delegates" href="http://www.codeproject.com/KB/cpp/FastDelegate.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.codeproject.com/KB/cpp/FastDelegate.aspx?referer=');">Fastest Possible C++ Delegates</a> by Don Clugston &#8211; This is a library that provides delegates that are as fast as simple virtual method invocations. The implementation strongly relies on the behavior of different compilers, yet is very portable, at least as far as I can tell.</li>
<li><a title="The Impossibly Fast C++ Delegates" href="http://www.codeproject.com/KB/cpp/ImpossiblyFastCppDelegate.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.codeproject.com/KB/cpp/ImpossiblyFastCppDelegate.aspx?referer=');">The Impossibly Fast C++ Delegates</a> by Sergey Ryazanov &#8211; This library was introduced as an alternative to the previous one that relies strictly on standard features of the language. Surprisingly, the latter is less well supported by different compiler implementations and it is also somewhat slower than the previous one.</li>
</ul>
<p>Personally, I go with the first one, as for me performance and portability are more important than conformance to the standard. And, of course, it is not that hard to change the back-end of the delegate support later if I change my mind. Finally, let&#8217;s see what our foo callback looks like when using the fast delegates of Don Clugston:</p>
<pre class="brush: cpp">/* server header */
#include "FastDelegate.h"
using namespace fastdelegate;

class Foo {
private:
    FastDelegate2&lt;int, float, int&gt; _fooCB;
public:
    void registerCallback(FastDelegate2&lt;int, float, int&gt; fooCB);
    int doFoo(int a, float b);
};

/* client code */
#include &lt;iostream&gt;

class MyClass {
public:  /* the handler must be accessible where the delegate is created */
    virtual int handleFoo(int a, float b) {
        /* ... do something ... */
        return 0;
    }
};

int main() {
    Foo foo;
    MyClass myObj;
    foo.registerCallback( MakeDelegate(&amp;myObj, &amp;MyClass::handleFoo) );
    std::cout &lt;&lt; foo.doFoo(5, 3.2f);
    return 0;
}</pre>
<h2>Multicast delegates</h2>
<p>The delegates presented previously can only be bound to a single method, which is how delegates usually behave, although a single method can be bound by many delegates. The signals and slots model extends this to a many-to-many relationship. Thus a signal is actually just a delegate that can bind to multiple methods at once. Such a primitive is sometimes also referred to as a multicast delegate.</p>
<p>Multicast delegates come in handy especially in user interface programming and in other situations where the event based programming model is used. The foundation of this programming model is the idea of &#8220;subscribe and notify&#8221;. That means there are <em>publishers</em> that do some logic and sometimes publish <em>events</em>. When such an <em>event</em> is published, it is sent out to the <em>subscribers</em> that have subscribed to receive that specific event. At implementation level this is nothing more than having a multicast delegate in the <em>publisher</em> object and providing an interface that the <em>subscriber</em> objects use to register one of their methods to be called when a particular <em>event</em> occurs.</p>
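<p>The &#8220;subscribe and notify&#8221; idea can be sketched in a few lines. This is a deliberately simple illustration built on <em>std::function</em> (C++11), not the interface of any of the libraries mentioned in this article; the names <em>Signal</em>, <em>Button</em> and <em>ClickCounter</em> are hypothetical.</p>

```cpp
#include <functional>
#include <vector>

// A minimal multicast delegate: stores any number of slots and calls them all.
template <typename... Args>
class Signal {
public:
    typedef std::function<void(Args...)> Slot;

    // Subscribe: register one more callable to be notified.
    void connect(const Slot& slot) { _slots.push_back(slot); }

    // Notify: invoke every registered slot in registration order.
    void notify(Args... args) const {
        for (const Slot& slot : _slots) {
            slot(args...);
        }
    }
private:
    std::vector<Slot> _slots;
};

// Publisher: owns the signal and publishes events through it.
struct Button {
    Signal<int> clicked;
};

// Subscriber: an ordinary class with a method matching the signal's signature.
struct ClickCounter {
    int total = 0;
    void onClick(int count) { total += count; }
};
```

<p>A subscriber registers itself with something like <em>button.clicked.connect([&amp;counter](int n) { counter.onClick(n); });</em> and every later <em>notify</em> call reaches all connected subscribers. Disconnecting slots and lifetime tracking are left out of the sketch.</p>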
<p>There are plenty of signals and slots libraries out there, including but not limited to the Boost Signals library. However, again, if performance is a concern one must look around carefully to find the library suitable for a particular purpose. One such library, which extends the fast delegates of Clugston with a signals and slots framework, is the one by <a title="Simpler UI Code With Signals and Slots" href="http://www.gallantgames.com/2009/12/13/simpler-ui-code-with-signals-and-slots" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.gallantgames.com/2009/12/13/simpler-ui-code-with-signals-and-slots?referer=');">Patrick Hogan</a>.</p>
<h2>Asynchronous delegates</h2>
<p>If we take one more step forward, we arrive at asynchronous delegates, which can provide a flexible yet efficient messaging system for multi-threaded applications. The only additional things we have to implement are a message queue on the callee side and, optionally, some form of synchronization if we would also like the asynchronous delegates to return data to the caller.</p>
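<p>A callee side message queue of this kind might look like the sketch below. It is only one possible shape, assuming C++11 and using <em>std::mutex</em> for protection; the class name <em>MessageQueue</em> and its <em>post</em> and <em>dispatch</em> methods are hypothetical, and returning data to the caller is not covered.</p>

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Queue of pending delegate invocations owned by the callee.
class MessageQueue {
public:
    // May be called from any thread: stores the bound call instead of running it.
    void post(std::function<void()> call) {
        std::lock_guard<std::mutex> lock(_mutex);
        _pending.push(std::move(call));
    }

    // Called by the callee thread: runs everything queued so far and
    // returns the number of executed calls.
    int dispatch() {
        std::queue<std::function<void()>> batch;
        {
            // Take the pending calls out under the lock, run them outside of it.
            std::lock_guard<std::mutex> lock(_mutex);
            std::swap(batch, _pending);
        }
        int executed = 0;
        while (!batch.empty()) {
            batch.front()();
            batch.pop();
            ++executed;
        }
        return executed;
    }
private:
    std::mutex _mutex;
    std::queue<std::function<void()>> _pending;
};
```

<p>The caller posts a bound call and continues immediately; the callee decides when to process its messages, for example once per iteration of its main loop.</p>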
<p>As this topic deserves a thorough discussion of its own, I will return to the subject in a future article and try to provide a sample implementation using OpenMP, as usual.</p>
<h2>Conclusion</h2>
<p>We&#8217;ve only scratched the surface of the possible use cases of delegates that one can meet during software development. Still, we&#8217;ve seen how many advantages such a programming primitive can give to C++ developers, no matter if they are implementing a very simple library of sorting algorithms like the qsort C standard library function or a robust, fully event-driven multi-threaded application. We&#8217;ve also seen that there are several efficient implementations of such a framework for performance fanatics like me.</p>
<p>It is a perfect example of how easily one can extend C++ with a facility that is usually available only in the most modern managed languages. By the way, I would be interested to hear which features you like the most in other languages like Java and C#, and miss because C++ does not directly provide them. Maybe there exists a C++ alternative for those facilities as well; we just have to look around to find them&#8230;</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/02/one-more-degree-of-freedom-for-c/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Instance culling using geometry shaders</title>
		<link>http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/</link>
		<comments>http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/#comments</comments>
		<pubDate>Mon, 08 Feb 2010 22:58:53 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[SFML]]></category>
		<category><![CDATA[texture buffer]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[uniform buffer]]></category>
		<category><![CDATA[vertex buffer]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=135</guid>
		<description><![CDATA[

Since the appearance of Shader Model 4.0 people wonder how to take advantage of the newly introduced programmable pipeline stage. The most important feature enabled by geometry shaders is that one can change the amount of emitted primitives inside the pipeline. The first thing that a naive developer would try to do with it is [...]]]></description>
			<content:encoded><![CDATA[
<div id="attachment_136" class="wp-caption alignleft" style="width: 160px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/02/Nature-2010-02-08-20-20-36-24.png"><img class="size-thumbnail wp-image-136  " title="Nature demo screenshot" src="http://rastergrid.com/blog/wp-content/uploads/2010/02/Nature-2010-02-08-20-20-36-24-150x150.png" alt="Nature demo screenshot" width="150" height="150" /></a><p class="wp-caption-text">OpenGL 3.2 - Nature</p></div>
<p>Since the appearance of Shader Model 4.0, people have wondered how to take advantage of the newly introduced programmable pipeline stage. The most important feature enabled by geometry shaders is that one can change the number of emitted primitives inside the pipeline. The first thing that a naive developer would try to do with it is geometry tessellation. However, the new shader performs very badly when used for tessellation in a real life scenario, even though there are demos showcasing this possibility. If we take a closer look at the new feature, we observe that its most revolutionary aspect is not that it can raise the number of emitted primitives but that it can discard them. This article presents a rendering technique that takes advantage of this aspect of geometry shaders to enable GPU accelerated culling of higher order primitives.</p>
<p><span id="more-135"></span>Geometry shaders can be used for many different advanced rendering techniques that were impossible before the introduction of this flexible programmable shader stage. In this article I would like to present one use case that seemed to me to be one of the most practical applications of the primitive manipulation possibilities introduced by geometry shaders. As I haven&#8217;t seen any whitepaper talking specifically about this particular technique, even if some of them inherently used it, I dare to name the technique myself: <strong>Instance Cloud Reduction</strong>. I will also present a demo program that shows how to take advantage of the technique in a heavy workload situation.</p>
<p>The idea itself was inspired by AMD&#8217;s tech demo for the Radeon 4800 series cards called <a title="March of the Froblins" href="http://developer.amd.com/samples/demos/pages/froblins.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/samples/demos/pages/froblins.aspx?referer=');">March of the Froblins</a>. That demo uses a technique almost identical to the one presented in this article for culling a large number of animated creatures against the view frustum. A somewhat similar technique is also used in NVIDIA&#8217;s <a title="Skinned Instancing" href="http://developer.download.nvidia.com/SDK/10/direct3d/samples.html" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.download.nvidia.com/SDK/10/direct3d/samples.html?referer=');">Skinned Instancing</a> demo for determining LOD instance sets. Unfortunately, both demos are for DirectX only and, as far as I can tell, there is no OpenGL demo showing any of the aforementioned rendering techniques.</p>
<h3>Motivation</h3>
<p>Nowadays, as the computational capabilities of GPUs are growing at a much faster pace than those of CPUs, graphics developers meet more and more optimization problems related to CPU bound applications. More and more focus is on minimizing the number of driver invocations; in fact, that&#8217;s what motivated the restructuring of the two most commonly used graphics APIs. As a result we now have DirectX 10+ and OpenGL 3+. However, even with the introduction of geometry instancing, texture arrays and local memory buffer storage for the most important rendering inputs, graphics programmers still need to make wise decisions to take full advantage of the horsepower of the latest GPUs.</p>
<p>Earlier graphics applications strongly relied on CPU based culling techniques, whether it be the usage of the quite outdated BSPs or the more generic and still heavily applied hierarchical culling techniques. We&#8217;ve already reached the point where sometimes even the most efficient CPU based culling techniques seem to be too expensive and usually introduce the small batch problem. Instanced rendering is not an exception.</p>
<p>The applicability of geometry instancing is strongly limited by several factors. One of the most important is the culling of instanced geometry. One may choose to cull these objects in the same fashion as others, using the CPU, but that usually breaks the batch, and we may lose the benefits of geometry instancing. A GPU based alternative is becoming more and more pressing. Without CPU based culling, sending the whole set of instances down the graphics pipeline may choke the vertex processor if we have high-poly geometry and a large number of instances of it.</p>
<p>The rendering technique presented in this article tries to provide such an alternative. We will use a multi-pass technique that in the first pass culls the object instances against the view frustum using the GPU and in the second pass renders only those instances that are likely to be visible in the final scene. This way we can drastically reduce the amount of vertex data sent through the graphics pipeline.</p>
<h3>Implementation</h3>
<p>To some people the promise of such a technique might seem too naive, or likely to rely on very exotic OpenGL features, heavy misuse of some basic features, or data conversions during frame rendering. Fortunately, this is not the case, as OpenGL 3.2 has everything we need to implement the object culling method sketched above. All we need is the following:</p>
<ul>
<li>instanced rendering (core since OpenGL 3.1)</li>
<li>geometry shaders (core since OpenGL 3.2)</li>
<li>transform feedback (core since OpenGL 3.0)</li>
<li>uniform or texture buffers (core since OpenGL 3.1)</li>
</ul>
<p>The method itself is a multi-pass rendering technique. However, unlike other multi-pass rendering techniques, it does not produce any fragments in the first pass. Instead, the first pass does the view frustum culling and processes data entirely inside buffer objects.</p>
<h3>Culling pass</h3>
<p>In the first pass we will feed the graphics pipeline with the information about the instances that is needed to perform the view frustum culling. The executed shaders need two inputs in order to be able to perform the required calculations:</p>
<ol>
<li><strong>Instance transformation data</strong> (whether it be a simple transformation matrix or quaternions or whatever) -- This preferably comes from one or more buffer objects that are bound as vertex buffers to the context.</li>
<li><strong>Object extents information</strong> -- Beside the instance positions we have to know the extents of an instance in order to perform correct culling. This can be either a single float representing the object radius if we choose to use bounding spheres for the culling or a three-dimensional extent vector if we would like to use bounding boxes.</li>
</ol>
<p>Using these as input we can feed the instance transformation data to our culling shader as attributes of point primitives. The culling shader is composed of a vertex and a geometry shader. In a typical setup the role of each is the following: the vertex shader determines whether the bounding volume of the current object instance is inside the view frustum and sends a flag with the result to the geometry shader, which emits the instance data to the destination buffer if the flag says that the instance is likely to be visible, and emits nothing if the object instance is determined to be out of view.</p>
<p>Next, transform feedback is used to capture the primitives emitted by the geometry shader into another buffer object that will be used in the actual rendering pass to source instance transformation data. Besides this, we also need an asynchronous query that counts the primitives generated, so that we know how many instances of the object we actually need to render. The following figure shows the workflow of the first pass:</p>
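<p>To make the visibility check more tangible, here is a minimal C++ sketch of a bounding sphere test against six frustum planes, which is the kind of per-instance calculation the vertex shader could perform before passing its flag on. It is only an illustration under stated assumptions: the <code>Plane</code> and <code>Sphere</code> types and the <code>unitBoxFrustum</code> helper are hypothetical names, the plane normals are assumed to be unit length and to point towards the inside of the frustum, and the code is not taken from the demo.</p>

```cpp
#include <array>
#include <cassert>

// Plane a*x + b*y + c*z + d = 0 with a unit-length normal (a, b, c)
// that points towards the inside of the frustum (assumption).
struct Plane { float a, b, c, d; };

// Bounding sphere of one instance: centre and radius.
struct Sphere { float x, y, z, radius; };

// Signed distance from the sphere centre to the plane (positive = inside).
inline float signedDistance(const Plane& p, const Sphere& s) {
    return p.a * s.x + p.b * s.y + p.c * s.z + p.d;
}

// Returns 1 if the sphere may be visible and 0 if it lies completely
// outside at least one plane. The test is conservative: a sphere near a
// frustum corner can pass although it is not actually visible.
inline int objectVisible(const std::array<Plane, 6>& frustum, const Sphere& s) {
    assert(s.radius >= 0.0f);
    for (const Plane& p : frustum) {
        if (signedDistance(p, s) < -s.radius) {
            return 0;
        }
    }
    return 1;
}

// Hypothetical stand-in for a real frustum: the box [-1, 1] on every axis.
inline std::array<Plane, 6> unitBoxFrustum() {
    return {{
        { 1.0f,  0.0f,  0.0f, 1.0f}, {-1.0f,  0.0f,  0.0f, 1.0f},
        { 0.0f,  1.0f,  0.0f, 1.0f}, { 0.0f, -1.0f,  0.0f, 1.0f},
        { 0.0f,  0.0f,  1.0f, 1.0f}, { 0.0f,  0.0f, -1.0f, 1.0f},
    }};
}
```

<p>With the box-shaped stand-in frustum, a sphere of radius 0.5 centered at the origin is reported as visible, while the same sphere centered at (3, 0, 0) is culled. A sphere that merely overlaps the boundary is kept, which matches the &#8220;likely to be visible&#8221; wording used above.</p>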
<div id="attachment_146" class="wp-caption aligncenter" style="width: 460px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_pass1.png"><img class="size-full wp-image-146" title="Culling pass" src="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_pass1.png" alt="Culling pass" width="450" height="200" /></a><p class="wp-caption-text">Instance Cloud Reduction - Pass 1: Culling</p></div>
<p>The geometry shader that performs the actual culling, based on the view frustum check done by the vertex shader, should look like the following chunk:</p>
<pre class="brush: c">#version 150 core

layout(points) in;
layout(points, max_vertices = 1) out;

in vec4 OrigPosition[1];
flat in int objectVisible[1];

out vec4 CulledPosition;

void main() {

	/* only emit primitive if the object is visible */
	if ( objectVisible[0] == 1 )
	{
		CulledPosition = OrigPosition[0];
		EmitVertex();
		EndPrimitive();
	}
}</pre>
<p>In this example we used only a simple four-component position vector for the instance transformation data, but the technique works just as well for transformation matrices and quaternions.</p>
<p>One more thing: besides setting up transform feedback to write into the buffer object dedicated to the culled instance data and starting an asynchronous query to determine the number of primitives written into that buffer object, it is also useful to turn off rasterization, as we don&#8217;t want to produce any fragments as a result of the first pass.</p>
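<p>The net effect of the culling pass can be modeled on the CPU in a few lines of C++. The sketch below is not the GPU path and is not taken from the demo. It only mirrors what the pass computes: instances whose flag is set are copied to an output buffer (the role played by the geometry shader together with transform feedback), and the number of copied instances is returned (the role played by the query). The <code>Vec4</code> type and the <code>cullInstances</code> function are hypothetical names chosen for the illustration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Four-component instance position, as used by the shader in the text.
struct Vec4 { float x, y, z, w; };

// CPU model of pass 1. Every instance whose visibility flag equals 1 is
// appended to culledPositions in its original order; the return value is
// the number of instances written, i.e. the instance count for pass 2.
inline std::size_t cullInstances(const std::vector<Vec4>& origPositions,
                                 const std::vector<int>& objectVisible,
                                 std::vector<Vec4>& culledPositions) {
    assert(origPositions.size() == objectVisible.size());
    culledPositions.clear();
    for (std::size_t i = 0; i < origPositions.size(); ++i) {
        if (objectVisible[i] == 1) {
            // Corresponds to EmitVertex() in the geometry shader.
            culledPositions.push_back(origPositions[i]);
        }
    }
    return culledPositions.size();
}
```

<p>The returned count is what the second pass would use as the number of instances to draw, and the output vector corresponds to the buffer that the second pass sources its instance data from.</p>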
<h3>Rendering pass</h3>
<p>In the second pass there is nothing special to do. Simply use whatever rendering setup you would like to use. The only things that need to change compared to your existing rendering path are that the instance data must be sourced from the generated culled instance data buffer and, as a result, the number of instances passed to the instanced drawing functions must be changed in order to render only the visible instances. This number can be read from the result of the asynchronous query that we started in the first pass.</p>
<p>The instance data in the rendering pass can be, of course, sourced from either a uniform or a texture buffer object. This depends on the actual use case and is more clearly explained in the article <a href="http://rastergrid.com/blog/2010/01/uniform-buffers-vs-texture-buffers/">Uniform Buffers VS Texture Buffers</a>.</p>
<p>An important note: when one has to deal with several instanced geometries, it is recommended to run the culling pass for all of them before rendering any instanced primitives, for the following reasons:</p>
<ul>
<li>The culling of the first instance cloud is more likely to have finished on the GPU, so reading the asynchronous query result to determine the number of visible instances is less likely to stall.</li>
<li>Probably fewer state changes are needed, as the two passes require very different setups.</li>
<li>It results in a tidier renderer design, as culling is clearly separated from actual rendering.</li>
</ul>
<p>Putting everything together, the application of the presented technique would result in the following workflow on the GPU:</p>
<div id="attachment_150" class="wp-caption aligncenter" style="width: 660px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_combined.png"><img class="size-full wp-image-150" title="Instance Cloud Reduction" src="http://rastergrid.com/blog/wp-content/uploads/2010/02/icr_combined.png" alt="Instance Cloud Reduction" width="650" height="347" /></a><p class="wp-caption-text">Instance Cloud Reduction - Combined view of Pass 1 + Pass 2</p></div>
<h3>Conclusion</h3>
<p>We&#8217;ve seen that the presented rendering technique can help in situations where we have to deal with a large number of instanced geometries, and how to take advantage of the latest features of graphics cards and OpenGL to perform view frustum culling calculations on the GPU. This saves us from having to deal with complicated and expensive CPU based object culling methods that break the drawing batches, especially when dealing with dynamic objects. To ease the decision whether to incorporate this technique in your rendering engine, I would like to present its advantages and disadvantages.</p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Heavily reduces the amount of processed data compared to a naive implementation.</li>
<li>No need for any space partitioning methods in the host application to handle the culling of dynamic objects.</li>
<li>Can handle a huge number of instanced objects due to the enormous horsepower of today&#8217;s GPUs.</li>
<li>Scales well with an increased number of instances as the per-instance cost is relatively low.</li>
<li>Relies strictly on OpenGL 3.2 core features.</li>
<li>No need for OpenCL capable hardware.</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li>Needs an extra rendering pass to perform the culling.</li>
<li>Requires the usage of asynchronous queries to determine the number of visible instances.</li>
</ul>
<p>I hope you agree with me and think of this technique as one more step towards fully GPU based scene management. If you have any remarks or improvement ideas regarding the rendering technique itself, feel free to tell me.</p>
<h3>The Demo</h3>
<p>As I promised, the technique presented above comes with a live demo, which actually took most of the time I dedicated to this blog in the last two weeks. The demo itself is more of a technical showcase than a presentation of a real-life use case.</p>
<p>First of all, I used high polygon count models for the rendering to emphasize how much valuable GPU time the culling phase saves. In a real world application one would never do something like this. As a result, the demo is more like a benchmark than an interactive application. However, on high-end graphics cards it may perform pretty well.</p>
<p>The demo scene consists of two object types: trees and grass blocks. The tree model is further divided into two parts as they need different textures: the tree trunk and the tree foliage. Obviously, this additional burden can be prevented by using texture arrays to avoid the need of separate draw calls to render the trunk and the foliage.</p>
<p>The tree trunk consists of 33138 triangles, the tree foliage has 16069 triangles and the faking-free grass block consists of 8961 triangles, which I had to model myself as I didn&#8217;t find any suitable model. Actually this modeling step consumed a considerable amount of the time I spent on the demo, as I&#8217;m not an expert in this domain. As you can see, these models are not the ones that one might use in an interactive real-time application like a game. However, they seemed to be very suitable for the purpose of the demonstration.</p>
<p>What really pushes the boundaries of GPUs is that the demo renders 10,000 trees and 250,000 grass blocks using instancing. This adds up to more than <strong>2.7 billion triangles</strong> in the scene. This is far more than a GPU can handle without the aid of some scene management and culling. However, we will use no scene management at all, and the only culling method that we will use is the one presented in this article.</p>
<p>The actual results are quite promising. The view frustum culling step usually saves more than <strong>99.9%</strong> of the GPU horsepower, as the number of triangles actually rendered after the culling step is far below 2 million. This is still quite a lot, but as we use high polygon count models and no LOD techniques it seems reasonable.</p>
<p>Even if the demo scene statistics don&#8217;t look like a typical use case, the ease of implementation and the compelling visual results pleased me anyway:</p>
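<p>The figures above can be checked with a little arithmetic. The following C++ sketch simply multiplies out the triangle and instance counts quoted in the text and computes the fraction of triangles removed when at most 2 million remain. The constant and function names are made up for this illustration.</p>

```cpp
#include <cassert>
#include <cstdint>

// Triangle counts per model and instance counts as quoted in the text.
constexpr std::uint64_t kTrunkTriangles   = 33138;
constexpr std::uint64_t kFoliageTriangles = 16069;
constexpr std::uint64_t kGrassTriangles   = 8961;
constexpr std::uint64_t kTreeInstances    = 10000;
constexpr std::uint64_t kGrassInstances   = 250000;

// Total number of triangles in the scene before any culling.
constexpr std::uint64_t totalSceneTriangles() {
    return kTreeInstances * (kTrunkTriangles + kFoliageTriangles)
         + kGrassInstances * kGrassTriangles;
}

// Fraction of the scene's triangles removed by culling, given how many
// triangles are still rendered afterwards.
inline double culledFraction(std::uint64_t renderedTriangles) {
    assert(renderedTriangles <= totalSceneTriangles());
    return 1.0 - static_cast<double>(renderedTriangles)
               / static_cast<double>(totalSceneTriangles());
}
```

<p>The total comes out at 2,732,320,000 triangles, and with 2 million triangles left after culling the removed fraction is roughly 99.93%, which agrees with the &#8220;more than 99.9%&#8221; stated above.</p>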
<p style="text-align: center;"><span class="youtube">
<object width="640" height="480">
<param name="movie" value="http://www.youtube.com/v/srbOFTLTe8k&amp;color1=3a3a3a&amp;color2=999999&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0?rel=1&amp;hd=1" />
<param name="allowFullScreen" value="true" />
<embed wmode="transparent" src="http://www.youtube.com/v/srbOFTLTe8k&amp;color1=3a3a3a&amp;color2=999999&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0?rel=1&amp;hd=1" type="application/x-shockwave-flash" allowfullscreen="true" width="640" height="480"></embed>
<param name="wmode" value="transparent" />
</object>
</span><p><a href="http://www.youtube.com/watch?v=srbOFTLTe8k&fmt=18" onclick="pageTracker._trackPageview('/outgoing/www.youtube.com/watch?v=srbOFTLTe8k_fmt=18&amp;referer=');">www.youtube.com/watch?v=srbOFTLTe8k</a></p></p>
<p>On my Radeon HD2600XT I achieved 6-7 frames per second, which is acceptable taking into consideration the huge amount of geometry data still passed to the graphics card. On more recent cards I suppose it should run at good frame rates; however, due to the lack of hardware to test on, these are my only results. If anybody manages to take a better screen capture than mine above, please let me know.</p>
<h3>Implementation details</h3>
<p>To say a few words about the techniques and tricks I&#8217;ve used during the creation of the demo, here is a list of the most important ones:</p>
<ul>
<li>Three models are used with high instance counts, giving over 2.7 billion triangles in total in the scene, as mentioned previously.</li>
<li>Three 512x512 RGBA textures are used for the models. They are partially handmade, and again, I&#8217;m not a texture artist, so sorry if they don&#8217;t look flawless.</li>
<li>The Wavefront model loader and the TGA image loader that accompany the demo are very roughly implemented for the demo only, so I would strongly encourage you not to use them for any other purpose, as they handle only a subset of the possibilities of the file formats.</li>
<li>The vertex data from the Wavefront model files is transferred in a very naive way, so vertex reuse isn&#8217;t taken into account.</li>
<li>The instance data consists of simple four-component vectors representing the world-space position of the instance. This seemed to be the simplest for demonstration purposes.</li>
<li>In the second pass, the instance data is sourced from a texture buffer, though not because the visible instance count exceeded the amount that would fit in a uniform buffer. I used texture buffers because for this simple demonstration they seemed a little easier to integrate.</li>
<li>The morphing effect that simulates wind is done using hard-coded geometry deformation in the vertex shader. It is not physically correct but visually compelling.</li>
<li>The lighting is a simple directional light using Phong&#8217;s shading and reflection model.</li>
<li>Simple fog is simulated with some awkward formula that I&#8217;ve chosen after a few test runs.</li>
<li>Alpha testing is achieved by using the discard operation in the fragment shader.</li>
</ul>
<h3>Driver issues</h3>
<p>During the development of the demonstration program I ran into several driver related problems, as I had never used the latest OpenGL features so heavily before. I&#8217;ve worked with Catalyst 9.12 and 10.1, but both seemed to lack a proper GLSL compiler. Here are some of the issues I&#8217;ve met:</p>
<ul>
<li>When I forgot to declare the varyings in the geometry shader as arrays, as the standard requires, the driver didn&#8217;t complain about any syntax error, but the program crashed when it tried to execute the code.</li>
<li>Except for the texture sampler uniform, all other uniforms failed to work when used in the fragment shader only, so I&#8217;ve put them all in the vertex shader.</li>
<li>For loops seemed not to work when used inside the geometry shader; that&#8217;s why the culling itself is done in the vertex shader in the demo.</li>
</ul>
<p>All these problems resulted in nasty tricks to make things work and ended up in awful shader code. Sorry for that. At least now it works on my configuration, but I&#8217;m pretty unsure whether it will work on other graphics card and driver combos. Please report any success or failure when trying out the demo. Anyway, be sure to have the latest graphics drivers installed as, at least in the case of AMD, OpenGL 3.2 drivers came out only in the fall of 2009.</p>
<p><em><strong>Edit:</strong></em></p>
<p><em>Thanks to information from Pierre Boudier of AMD I&#8217;ve updated both the source and binary releases to support the latest drivers properly. The problem was that I didn&#8217;t use attribute location binding as specified in the standard.</em></p>
<p><em>I also have to mention that with my new Radeon HD5770 I managed to achieve over 90 frames per second, which shows that this technique can in fact be used for games and other interactive applications.</em></p>
<p><em>One more thing in the end. As you know, this version of the Nature demo uses a texture buffer to source instance positions. I plan to create another version that will take advantage of the instanced arrays that became core in OpenGL 3.3. I expect quite a reasonable speedup, as that would eliminate the need for texture fetches in the vertex shader by dedicating a vertex fetcher to the purpose instead, thus increasing the overall performance of the technique.</em></p>
<h3>Binary release</h3>
<p><strong>Platform:</strong> Windows<br />
<strong>Dependency:</strong> OpenGL 3.2 capable graphics driver<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature12_win32.zip" target="_blank">nature12_win32.zip (3.58MB)<br />
</a><strong>Comments:</strong> Includes the update that makes it work even with the latest drivers.</p>
<h3>Full source code</h3>
<p><strong>Language:</strong> C++<br />
<strong>Platform:</strong> cross-platform<br />
<strong>Dependency:</strong> GLEW, SFML, GLM<br />
<strong>Download link:</strong> <a href="http://rastergrid.com/blog/wp-content/uploads/2010/06/nature12_src.zip" target="_blank">nature12_src.zip (12.6KB)<br />
</a><strong>Comments:</strong> Sorry for the many dependencies, however, I would recommend the mentioned libraries for everybody who is doing OpenGL development.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/feed/</wfw:commentRss>
		<slash:comments>38</slash:comments>
		</item>
		<item>
		<title>Synchronizable objects for C++</title>
		<link>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/</link>
		<comments>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/#comments</comments>
		<pubDate>Tue, 02 Feb 2010 19:01:56 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Multiprocessing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[lock]]></category>
		<category><![CDATA[macro]]></category>
		<category><![CDATA[multithreading]]></category>
		<category><![CDATA[mutex]]></category>
		<category><![CDATA[OOP]]></category>
		<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[synchronization]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=120</guid>
		<description><![CDATA[

Previously I talked about how one can easily take advantage of multiprocessing using OpenMP. Even if the C pragmas introduced by the parallel programming API standard are very straightforward for simple programs, they simply don&#8217;t fit nicely in a complex C++ application that is built from the ground up with OOP in mind. To smoothly [...]]]></description>
			<content:encoded><![CDATA[
<p>Previously I talked about how one can easily take advantage of multiprocessing using OpenMP. Even if the C pragmas introduced by the parallel programming API standard are very straightforward for simple programs, they simply don&#8217;t fit nicely in a complex C++ application that is built from the ground up with OOP in mind. To smoothly introduce OpenMP into such projects one needs higher level constructs that hide the actual implementation details. This is the first article of a series that will try to provide reference implementations of such an abstraction. First, we will start with synchronizable primitives that try to reflect the functionality provided by the &#8220;synchronized&#8221; statement of Java.</p>
<p><span id="more-120"></span>This article is highly inspired by an article written by <a title="A &quot;synchronized&quot; statement for C++ like in Java" href="http://www.codeproject.com/KB/threads/cppsyncstm.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.codeproject.com/KB/threads/cppsyncstm.aspx?referer=');">Achilleas Margaritis</a><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;"> and largely follows his ideas. My article tries to provide a portable reference implementation of a slightly modified version of the trick presented by Margaritis, using OpenMP as the multiprocessing API back-end.</span></p>
<h2>Motivation</h2>
<p><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;">According to the OO paradigm, classes and consequently objects provide an abstract interface to the underlying internal data or services of the modeled entity or entity class. When it comes to parallel programming we should provide facilities to enable concurrent access to shared resources, which in this case are objects. Using plain OpenMP can be satisfactory; however, when used extensively, the OpenMP pragmas and API function calls can greatly affect the readability and maintainability of the code. Moreover, there can be platforms that use other APIs for handling race conditions. It is obvious that we need to encapsulate these facilities and provide an abstract tool-set instead.</span></p>
<h2>Implementation</h2>
<p><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;">The very first building block of such a framework can be a mutex class that provides mutually exclusive access to certain resources. In the world of OpenMP this should look like something similar to the following:</span></p>
<pre class="brush: cpp">#include &lt;omp.h&gt;

// Thin wrapper around an OpenMP lock.
class Mutex {
public:
    Mutex() { omp_init_lock(&amp;_mutex); }
    ~Mutex() { omp_destroy_lock(&amp;_mutex); }
    void lock() { omp_set_lock(&amp;_mutex); }
    void unlock() { omp_unset_lock(&amp;_mutex); }
private:
    omp_lock_t _mutex;
};</pre>
<p>This seems already enough for us to make our Java-like &#8220;synchronized&#8221; statement, however we would like to create a framework that makes usage as easy and safe as possible. In order to get closer to this goal we apply the RAII (Resource Acquisition Is Initialization) design pattern to create our lock class:</p>
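<pre class="brush: cpp">class Lock {
public:
    // Acquires the mutex on construction and releases it on destruction (RAII).
    Lock(Mutex&amp; mutex) : _mutex(mutex), _release(false) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    // The conversion to bool is what the synchronized macro tests as its loop condition.
    operator bool() const { return !_release; }
    void release() { _release = true; }
private:
    Mutex&amp; _mutex;
    bool _release;
};</pre>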
<p>Our goal is to provide an inheritable interface for objects that need synchronization. However, this step requires careful consideration of the provided interface, as we explicitly need to conform to the following requirements:</p>
<ul>
<li>The interface shall not expose the interface of the underlying synchronization primitive, in our case the mutex class methods.</li>
<li>The interface shall be available only to the synchronizable objects and not to the external world, as we would like not just to hide the implementation details of our abstract entity but also to prevent users from synchronizing on our objects, as that should be the responsibility of the object itself.</li>
<li>The interface shall expose methods which are less prone to name collision, for convenience.</li>
</ul>
<p>If we take care of the presented conventions we end up with an interface similar to the following:</p>
<pre class="brush: cpp">class Synchronizable: protected Mutex {
protected:
	void enterSyncBlock() { this-&gt;lock(); }
	void exitSyncBlock() { this-&gt;unlock(); }
};</pre>
<p>Now we are almost at the finish line. We just need to inherit from this class in order to have the needed facilities for an object that needs synchronization. However, using this interface directly is neither the most comfortable nor the safest option. If we would like to have a Java-like &#8220;synchronized&#8221; statement we have to call for additional help. Fortunately, the not so well respected C macro language comes to the rescue, as we can use it to make some pseudo-language extensions. The simplest way to define our new statement is using the following line:</p>
<pre class="brush: cpp">#define synchronized(obj)  for(Lock obj##_lock = *obj; obj##_lock; obj##_lock.release())</pre>
<p>From now on, we can use object synchronization in C++ as easily as in Java; we just need the following syntax in the methods of our shared objects:</p>
<pre class="brush: cpp">synchronized(this) {
    // some code that needs synchronization
}</pre>
<p>Now it is clearly visible how handy the RAII pattern became in our case. Besides being very straightforward to use, this statement provides additional benefits:</p>
<ul>
<li>It makes the code more readable and as a result it is easier to maintain.</li>
<li>No need to call inconveniently named methods and use lock variables.</li>
<li>The synchronized code has its own scope inside the code.</li>
<li>It is exception-safe as the mutex is unlocked upon destruction.</li>
</ul>
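<p>To see the pieces working together, including the exception safety claimed in the list above, here is a small self-contained C++ sketch. Note the assumptions: the OpenMP-based <code>Mutex</code> is replaced by a trivial counting stand-in that is <em>not</em> a real mutex, so that the behaviour can be observed without OpenMP and without threads, and the <code>Counter</code> class with its methods is a hypothetical example class, not part of the library.</p>

```cpp
#include <cassert>
#include <stdexcept>

// Counting stand-in for the OpenMP-based Mutex (assumption for this sketch):
// it records lock/unlock calls instead of providing mutual exclusion.
class Mutex {
public:
    Mutex() : _depth(0) {}
    void lock() { ++_depth; }
    void unlock() { assert(_depth > 0); --_depth; }
    bool isLocked() const { return _depth > 0; }
private:
    int _depth;
};

// RAII lock as in the article, with the conversion operator to bool.
class Lock {
public:
    Lock(Mutex& mutex) : _mutex(mutex), _release(false) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    operator bool() const { return !_release; }
    void release() { _release = true; }
private:
    Mutex& _mutex;
    bool _release;
};

class Synchronizable : protected Mutex {};

// The body runs exactly once; the Lock lives for the duration of the loop.
#define synchronized(obj) for (Lock obj##_lock = *obj; obj##_lock; obj##_lock.release())

// Hypothetical shared object used only to exercise the macro.
class Counter : public Synchronizable {
public:
    Counter() : _value(0), _sawLock(false) {}
    void increment() {
        synchronized(this) {
            _sawLock = isLocked();  // the lock is held inside the block
            ++_value;
        }
    }
    void failWhileLocked() {
        synchronized(this) {
            throw std::runtime_error("simulated failure");
        }
    }
    int value() const { return _value; }
    bool sawLock() const { return _sawLock; }
    bool lockedNow() const { return isLocked(); }
private:
    int _value;
    bool _sawLock;
};
```

<p>After a call to <code>increment()</code> the counter has observed the lock being held inside the block and released afterwards. After <code>failWhileLocked()</code> throws, the lock is released as well, because the destructor of <code>Lock</code> runs during stack unwinding.</p>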
<p>Additionally, we can also take advantage of the inherent problem of C++ regarding multiple inheritance. If our class inherits from two other synchronizable classes, then using a simple type cast we can explicitly specify which ancestor we would like to synchronize on in a particular block. To ease this, we can define our synchronization statement using the following line instead of the Java-like one:</p>
<pre class="brush: cpp">#define synchronized(cls)  for(Lock obj##_lock = *static_cast&lt;cls*&gt;(this); obj##_lock; obj##_lock.release())</pre>
<p>In this case we pass the class name instead of the object pointer <em>this</em>. Using this latter construct we can easily specify the correct ancestor that we would like to synchronize on when we deal with multiple inheritance. Personally I prefer the latter syntax as it is much better tailored to C++ use cases.</p>
<p>As we no longer need a direct interface for entering and exiting our synchronization block, we can simplify our synchronizable interface to the following chunk:</p>
<pre class="brush: cpp">class Synchronizable: protected Mutex {
};</pre>
<p>This is now enough to provide the facilities needed for a synchronization block, and it still complies with the requirement that we would like to hide the synchronization primitive related details.</p>
<p>Besides this, Jörg came up with the idea today to replace the for loop in our macro with a single if statement. This seems reasonable as we don&#8217;t have to sacrifice any of the scoping and safety related benefits of our framework. It also simplifies our lock class to the following:</p>
<pre class="brush: cpp">class Lock {
public:
    Lock(Mutex&amp; mutex) : _mutex(mutex) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    // Always true, so the body of the if statement is always executed.
    operator bool() const { return true; }
private:
    Mutex&amp; _mutex;
};</pre>
<p>This definition of the lock class is satisfactory if we redefine our synchronized macro to use an if statement instead:</p>
<pre class="brush: cpp">/* Define only one of the following two variants, as both use the same macro name. */
/* Java-like synchronized statement */
#define synchronized(obj)  if (Lock obj##_lock = *obj)
/* alternative synchronized statement to support multiple inheritance */
#define synchronized(cls)  if (Lock obj##_lock = *static_cast&lt;cls*&gt;(this))</pre>
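<p>The class-name variant can be illustrated with a short C++ sketch for the multiple inheritance case. As before, this rests on assumptions made only for the illustration: the mutex is a counting stand-in rather than the OpenMP-based one, the classes <code>Inventory</code>, <code>Ledger</code> and <code>Shop</code> are hypothetical, and the macro names its lock variable after the class (<code>cls##_lock</code>), which requires an unqualified class name.</p>

```cpp
#include <cassert>

// Counting stand-in for the OpenMP-based Mutex (assumption for this sketch).
class Mutex {
public:
    Mutex() : _depth(0) {}
    void lock() { ++_depth; }
    void unlock() { assert(_depth > 0); --_depth; }
    bool isLocked() const { return _depth > 0; }
private:
    int _depth;
};

// Simplified lock for the if-based macro: its bool conversion is always true.
class Lock {
public:
    Lock(Mutex& mutex) : _mutex(mutex) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    operator bool() const { return true; }
private:
    Mutex& _mutex;
};

class Synchronizable : protected Mutex {};

// Class-name variant: locks the mutex of the ancestor named by cls.
#define synchronized(cls) if (Lock cls##_lock = *static_cast<cls*>(this))

// Two hypothetical synchronizable bases, each carrying its own mutex.
class Inventory : public Synchronizable {
public:
    bool inventoryLocked() const { return isLocked(); }
};

class Ledger : public Synchronizable {
public:
    bool ledgerLocked() const { return isLocked(); }
};

class Shop : public Inventory, public Ledger {
public:
    Shop() : _inventoryHeld(false), _ledgerHeld(true) {}
    void restock() {
        synchronized(Inventory) {
            // Only the Inventory part is locked inside this block.
            _inventoryHeld = inventoryLocked();
            _ledgerHeld = ledgerLocked();
        }
    }
    bool inventoryHeldDuringRestock() const { return _inventoryHeld; }
    bool ledgerHeldDuringRestock() const { return _ledgerHeld; }
private:
    bool _inventoryHeld;
    bool _ledgerHeld;
};
```

<p>Calling <code>restock()</code> on a <code>Shop</code> locks only the mutex inherited through <code>Inventory</code>; the one inherited through <code>Ledger</code> stays untouched, and both are unlocked once the block is left.</p>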
<p>Thanks to the useful comments we even managed to further optimize and minimize the support code needed for our new pseudo-language extension.</p>
<h2>Conclusion</h2>
<p>We have seen an example of how one can implement an easy to use synchronizable interface for C++. We&#8217;ve also provided a concrete implementation based on OpenMP. This library is still far from an API that provides all the constructs that one needs for parallel programming in C++ projects; however, we have made our first step, and I will return to the subject in subsequent articles to further extend this framework.</p>
<p>Credits go to Achilleas Margaritis whose article inspired me to write mine and to Jörg for the useful improvement ideas.</p>
<h3>Full source code</h3>
<p><strong>Language:</strong> C++<br />
<strong> Platform:</strong> cross-platform<br />
<strong> Dependency:</strong> OpenMP<br />
<strong> Download link:</strong> <a title="omp_sync.h" href="/blog/wp-content/uploads/2010/02/files/omp_sync.h" target="_blank">omp_sync.h</a><br />
<strong> Comments:</strong> In order to use it as it is, you will need a C++ compiler supporting OpenMP like GCC 4.2 or Visual C++ 2008.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Flawless alternative to SDL</title>
		<link>http://rastergrid.com/blog/2010/01/flawless-alternative-to-sdl/</link>
		<comments>http://rastergrid.com/blog/2010/01/flawless-alternative-to-sdl/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 19:47:01 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[GLFW]]></category>
		<category><![CDATA[multimedia]]></category>
		<category><![CDATA[multithreading]]></category>
		<category><![CDATA[network]]></category>
		<category><![CDATA[OpenAL]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[SDL]]></category>
		<category><![CDATA[SFML]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=108</guid>
		<description><![CDATA[

There was always big need for libraries that provide an abstract interface towards the basic platform specific facilities that are necessary for setting up an execution environment for a particular application. In the OpenGL world one of the first such libraries was GLUT. After a while more and more functionalities were put into these libraries [...]]]></description>
			<content:encoded><![CDATA[
<p>There has always been a strong need for libraries that provide an abstract interface to the basic platform-specific facilities required to set up an execution environment for an application. In the OpenGL world one of the first such libraries was GLUT. Over time, more and more functionality was added to these libraries, reflecting more or less the requirements of application developers. One such framework is SDL, which still seems to be the most respected of them and the one preferred by the developer community. In this article, however, I will present an alternative that has proved its superiority to me over the last few months&#8230;<br />
<span id="more-108"></span></p>
<h2>Why would I need such a library?</h2>
<p>There are plenty of reasons to have such a framework in your toolkit, and I will mention only the few that I consider most important. First of all, an easy-to-use API for setting up the basic environment of your application, such as a window with an OpenGL rendering context, removes the burden of dealing with platform-specific details and lets you concentrate on the actual product. Next, these libraries usually offer a good degree of platform portability, so you don&#8217;t have to study operating-system-specific APIs and can still deploy your application on multiple platforms. I could list many other reasons, but the most important one is that they are reusable components, so you don&#8217;t reinvent the wheel.</p>
<p>For a while I was quite satisfied with my own implementation of such a toolkit, until I moved from Delphi to C++ development. This forced me to look for a replacement for my proprietary solution. As C++ has a very large developer community, I assumed it wouldn&#8217;t be hard to find a suitable framework. That turned out not to be the case, because this was the time when the OpenGL 3 specification came out with its new context creation and deprecation model. It was disappointing to see how slowly even the most popular multimedia frameworks adopted these new features.</p>
<p>At that time I realized that my choice of library would heavily affect my future productivity, so I had to think carefully about which one to use. To ease the selection I put together a wish-list of what I expect from such a library. The most important criteria were the following:</p>
<ul>
<li>Feature-rich - The library must provide the basic functionality needed by an average OpenGL application. This includes window and rendering context handling, keyboard and mouse input capture, timing facilities and, of course, support for OpenGL 3 contexts. Optionally, it would be nice if the API also provided multi-threading, image and audio handling, basic network operations, joystick support, etc.</li>
<li>Modular - The library must be modular, meaning that I can select which components of the framework to use in a particular application. Sometimes less is more: in a very simple rotating-cube OpenGL demo, for example, I don&#8217;t want to link against a library that contains network handling. Monolithic libraries make life much harder, as they usually rely on exotic dependencies and prevent easy deployment of the application.</li>
<li>Portable - It should work at least on the three most popular PC platforms: Windows, Linux and Mac OS X. It also has to work with a variety of build systems, preferably with Visual C++, GCC and Xcode. Optionally, it would be nice if it could be used from applications written in programming languages other than C/C++.</li>
<li>Easy to use - In an ideal world such a framework has a very clean interface and its usage feels natural to the developer. This is rather subjective, however: what is easy to use for one developer may not be the best for another. On this point I will judge the alternatives, obviously, from my own perspective.</li>
</ul>
<p>Choosing a multimedia library for an OpenGL hobby project may be just a matter of taste, but when it comes to OpenGL 3 context support, developers have very limited choices and will most probably have to accept some trade-offs whichever library they select. On the other hand, since one of my key requirements is that the library must support OpenGL 3 contexts, it is not that difficult to present all of the popular alternatives.</p>
<h2>Simple DirectMedia Layer</h2>
<p>SDL has a long history in this domain and has proved to be an excellent choice for most hobbyists and even for professionals. It has been used in tons of free and commercial applications and is probably still the most preferred library in this category. Let&#8217;s examine it against the criteria presented above.</p>
<p>SDL provides almost all the facilities needed by an average graphics application. Together with add-on libraries such as SDL_image and SDL_net, it meets almost all of my feature requirements. Most importantly, its latest development branch also supports OpenGL 3.</p>
<p>From the point of view of modularity, SDL seems to be a good choice, especially as the less commonly used facilities are provided by separate add-ons. However, the SDL core itself already has a few too many dependencies on operating-system-specific libraries, in particular its reliance on DirectX on the Windows platform. Even if most of you can probably live with this, it is simply unacceptable for me. Still, this was not the biggest problem I faced when I checked whether SDL suits my needs.</p>
<p>Platform portability is one of the key advantages of SDL, as it supports many more platforms than those on my wish-list. As for language portability, the C interface of SDL makes it very easy to drop into a Delphi project, for example. There are also plenty of bindings for other languages, including but not limited to Java, C#, Delphi, Ada, Perl and Python. When it comes to build system portability, however, I have to mention my bad experiences.</p>
<p>First of all, I am the kind of animal who uses GCC for compiling under Windows as well. SDL comes with an automake-based build system, which proved to be unusable with MSYS, and I would rather not use Cygwin, as it introduces quite a few unwanted external dependencies. After giving up on compiling with MinGW, I tried to build the library using the good old Visual C++ IDE. This is when I found out that I would have to install the DirectX SDK in order to compile SDL. The final blow was that, even after downloading and installing the huge DirectX SDK, SDL still refused to compile, failing with weird compiler errors.</p>
<p>By the way, none of these compilation issues would be a problem for me if SDL came with a binary release. It has such releases for earlier versions, but not for the latest development branch, which contains the OpenGL 3 related code. So I cannot even verify whether the OpenGL 3 specific implementation in SDL works in practice.</p>
<p>Owing to SDL&#8217;s long history, its API has become somewhat uneven over time, but I still cannot say that the interface is not clean enough to be used easily in any project. In this regard SDL is still a major player.</p>
<h2>The OpenGL Framework</h2>
<p>My second candidate was GLFW, as it was the only library that supported OpenGL 3 as far as I knew at that time. For those who are not familiar with it, GLFW is a simple yet powerful toolkit with an interface similar to that of GLUT but with added capabilities such as multi-threading and joystick support (see <a title="GLFW Home Page" href="http://glfw.sourceforge.net/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/glfw.sourceforge.net/?referer=');">GLFW home page</a>).</p>
<p>It is a very feature-rich framework for its size and, although it is not very modular, it does not involve as many external dependencies as SDL. As it also provides OpenGL 3 support only in its latest development branch, I had to compile the code here as well. That was straightforward even with MinGW, so I have no complaints on this subject. It also supports multiple platforms, at least those that I am mainly interested in.</p>
<p>Unfortunately, like many other early OpenGL 3 context handling implementations, it did not work on all platforms. More precisely, under Windows the OpenGL 3 code in the development branch at that time (about half a year ago) worked only with NVIDIA cards. Contrary to the Microsoft WGL specification, NVIDIA&#8217;s ICD exposed the wglCreateContextAttribsARB function even when no valid OpenGL context was bound, and since the developers, as usual, tested only on NVIDIA cards, the result was an only partially working OpenGL 3 context handling implementation.</p>
<p>As the code of GLFW is very straightforward and easy to read, I quickly corrected the bug in the OpenGL 3 context handling and used GLFW for a few months in my hobby projects. After a while, however, the lack of some facilities in GLFW made me reconsider my choice of library, as I didn&#8217;t want to end up using several different libraries for different purposes, which would result in a poorly designed code structure.</p>
<h2>Simple and Fast Multimedia Library</h2>
<p>We&#8217;ve arrived at the main topic of this article. Finally I found SFML, which at first sight seemed to be &#8220;just another multimedia library&#8221;, but I soon realized that it&#8217;s far more than that.</p>
<p>First of all, it has all the features I needed. Besides the basic ones, it has very nice support for networking: not just basic socket programming with TCP or UDP packets, but a more comprehensive toolkit that even covers HTTP and FTP transactions. It also has built-in sound support via OpenAL, which was another thing that caught my attention, as I already preferred OpenAL over other audio libraries like FMOD. It has many other interesting features as well, but for those you&#8217;d better check out the <a title="SFML Home Page" href="http://www.sfml-dev.org/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.sfml-dev.org/?referer=');">SFML project site</a>.</p>
<p>As I had come to expect by then, SFML also supported OpenGL 3 context handling only in its latest development branch. This was no problem, though, as I ran into none of the build issues I had met with SDL. More importantly, the OpenGL 3 implementation worked flawlessly, without the need for any modification. Just to demonstrate, setting up an OpenGL 3 window with SFML can be as easy as writing the following line of code:</p>
<pre class="brush: cpp">sf::Window App(sf::VideoMode(800, 600, 32), "OpenGL 3 window", sf::Style::Close, sf::ContextSettings(24, 8, 0, 3, 2));</pre>
<p>When it comes to modularity, SFML wins me over as well. First, rebuilding the framework does not need any exotic libraries or headers. Besides this, it has a minimal number of external dependencies, and those are only on the operating system libraries it uses. It is also modular in the sense that the different sub-systems of the framework are compiled into separate library modules, so in your project you can simply select the ones you intend to use, which further reduces deployment issues.</p>
<p>From the portability point of view, SFML supports all the platforms that are important to me. It also worked without any fuss with all the compiler tool-chains I&#8217;ve tried. At first sight I had concerns about the language portability of SFML, as it is written in C++ and this C++ interface is exposed to the client. However, the library solves this issue by shipping a C wrapper called CSFML together with the framework itself, which makes it rather straightforward to write bindings for virtually any programming language.</p>
<h2>Conclusion</h2>
<p>If I haven&#8217;t convinced you so far that SFML can be the perfect choice of multimedia library for almost any hobby or commercial product, then you should check out the full feature list and see for yourself. I will probably write more about SFML in the future, and you will also see some usage examples in my upcoming demos.</p>
<p>If you are not interested in the additional features provided by SFML and are just searching for a very basic framework that gives you a window and some user input handling, then GLFW can be your choice. Based on my bad experiences, however, I would no longer advise anybody to use SDL, though maybe I&#8217;m being a bit too harsh.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/01/flawless-alternative-to-sdl/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Exploit parallelism with the least effort</title>
		<link>http://rastergrid.com/blog/2010/01/exploit-parallelism-with-the-least-effort/</link>
		<comments>http://rastergrid.com/blog/2010/01/exploit-parallelism-with-the-least-effort/#comments</comments>
		<pubDate>Tue, 19 Jan 2010 21:12:15 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Multiprocessing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Fortran]]></category>
		<category><![CDATA[OpenMP]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=84</guid>
		<description><![CDATA[

Multiprocessing has been there for decades as a premium feature for enterprise applications but adopting this technology still brings huge burden to software companies that still maintain and develop legacy code. Nowadays, as most commodity hardware already have highly parallelized architectures, a modern application is almost unimaginable without proper multi-threading capabilities even if we talk [...]]]></description>
			<content:encoded><![CDATA[
<p>Multiprocessing has been around for decades as a premium feature of enterprise applications, but adopting this technology is still a huge burden for software companies that maintain and develop legacy code. Nowadays, as most commodity hardware already has a highly parallel architecture, a modern application is almost unimaginable without proper multi-threading capabilities, whether we talk about a text editor or a multimedia application. The transition from traditional software development to multiprocessing is not an easy and painless task. Fortunately, we have tools like OpenMP at hand.</p>
<p><span id="more-84"></span>Currently the biggest hit is OpenCL, as it seems to be the ultimate solution for harnessing the power of highly parallel architectures like multi-core CPUs and DSPs and, probably most importantly, it can leverage the huge raw computational capabilities of GPUs. Although it is one of the most important standards to come out lately, it is not the answer to every question. Those who would like to bring multiprocessing technology to their legacy code are probably better advised to look around for other solutions.</p>
<p>This was not my intention when I started to search for a multiprocessing framework. I just wanted to find something that provides an easy-to-use interface for introducing multi-threading and the necessary shared memory semantics into my hobby projects. This is how I found <a title="OpenMP Homepage" href="http://www.openmp.org/" target="_blank">OpenMP</a>.</p>
<h2>What is OpenMP?</h2>
<p>Basically, OpenMP is an API specification for parallel programming that is intended to extend the most preferred programming languages used for computationally heavy and scientific calculations with a tool set that enables cross-platform multi-threading support tightly integrated into the language itself. Namely, OpenMP adds shared memory parallel programming capabilities to the C, C++ and Fortran languages.</p>
<p>While OpenMP is limited to these particular programming languages, it is truly an open and multi-platform API that is very well supported by different compilers (at least as far as I can tell). The standard itself is developed and maintained in a fashion similar to OpenGL, as it has its own <a title="OpenMP Architecture Review Board" href="http://www.openmp.org/wp/about-openmp/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.openmp.org/wp/about-openmp/?referer=');">Architecture Review Board</a> with representatives from all major hardware and software vendors like AMD, HP, IBM, Intel, Sun Microsystems, Microsoft and others.</p>
<p>The specification itself is maintained in two different versions: one for C/C++ and another for Fortran. As I was never involved in development with Fortran, I dug deeper only in the C/C++ specific details, however the facilities provided by the API are basically the same for Fortran as well.</p>
<p>The language extensions are introduced using OpenMP-specific pragmas and a run-time library. At first sight this does not seem to be the most elegant solution, but it fits very well into all versions of the programming language specifications, so there are no compatibility issues and the OpenMP standard can be maintained completely separately from the underlying language itself. Looking at how the C and C++ programming languages have evolved, this makes sense.</p>
<h2>Say Hello World to parallel programming</h2>
<p>I think the best way to demonstrate the power and simplicity of OpenMP is a basic example of how easy it is to add parallel computing capabilities even to the most straightforward algorithms:</p>
<pre class="brush:c">void quicksort(int *a, int lo, int hi) {
    int i=lo, j=hi, h;
    int x=a[(lo+hi)/2];   /* pivot: the middle element */

    /* partition: smaller elements go to the left of the pivot, larger ones to the right */
    do {
        while (a[i] &lt; x) i++;
        while (a[j] &gt; x) j--;
        if (i &lt;= j) {
            h=a[i]; a[i]=a[j]; a[j]=h;
            i++; j--;
        }
    } while (i &lt;= j);

    /* sort the two partitions independently, one section per thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        if (lo &lt; j) quicksort(a, lo, j);
        #pragma omp section
        if (i &lt; hi) quicksort(a, i, hi);
    }
}</pre>
<p>This is the quicksort algorithm in OpenMP fashion. As you may have already observed, this function is not really different from the original sequential version of the famous sorting technique. The only additions are the three OpenMP-specific pragmas and one extra block.</p>
<p>I will now explain how we exploited parallel programming with just these few added lines, but I don&#8217;t want to go into details, as it is always better to read the specification itself before starting to use OpenMP heavily. First, we&#8217;ve created &#8220;parallel sections&#8221;, which expresses our intention to distribute the tasks in the following code block among multiple threads. Next, we&#8217;ve specified the actual &#8220;sections&#8221;, each of which is executed by one thread.</p>
<p>This way, each time we split the array into two pieces, the two regions are sorted by separate threads. Of course, even for a very large array this does not mean that the number of threads grows exponentially, as it saturates at some point. In any case, this is just one parameter that is fully controlled by the programmer.</p>
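<p>One way to take control of that parameter is sketched below. It is only an illustration and not part of the original sample: the <code>depth</code> argument and the names <code>quicksort_limited</code> and <code>sort_ints</code> are made up for this example. The <code>if</code> clause of the pragma switches the region to single-threaded execution once the depth budget is used up. Note also that many OpenMP implementations serialize nested parallel regions unless nesting is explicitly enabled (for instance through the OMP_NESTED environment variable), so the actual thread count depends on the run-time configuration.</p>

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical variant of the quicksort above: `depth` limits how many
// recursion levels are allowed to fork new threads.
void quicksort_limited(int *a, int lo, int hi, int depth) {
    assert(a != nullptr);
    if (lo >= hi) return;               // nothing to sort

    int i = lo, j = hi;
    int x = a[lo + (hi - lo) / 2];      // pivot: the middle element

    do {
        while (a[i] < x) i++;
        while (a[j] > x) j--;
        if (i <= j) {
            std::swap(a[i], a[j]);
            i++; j--;
        }
    } while (i <= j);

    // Once depth reaches zero the if clause is false and the two
    // sections are executed by a single thread.
    #pragma omp parallel sections if(depth > 0)
    {
        #pragma omp section
        if (lo < j) quicksort_limited(a, lo, j, depth - 1);
        #pragma omp section
        if (i < hi) quicksort_limited(a, i, hi, depth - 1);
    }
}

// Convenience wrapper (also hypothetical): at most two forking levels.
void sort_ints(std::vector<int>& v) {
    if (!v.empty())
        quicksort_limited(v.data(), 0, static_cast<int>(v.size()) - 1, 2);
}
```

<p>Calling <code>sort_ints(values)</code> sorts the vector in place. Without OpenMP enabled the pragmas are ignored and the function behaves like the plain sequential algorithm.</p>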
<h2>Parallelize loops with minimal effort</h2>
<p>It often happens that the performance bottleneck is inside a for loop that moves or performs calculations on huge data arrays. One example is an algorithm that interpolates between two float arrays and stores the result in a third one. This could of course be parallelized using the &#8220;sections&#8221; semantics presented earlier, but that would require modifying the original algorithm, after which the code would no longer clearly reflect its purpose. OpenMP supports such cases very elegantly as well:</p>
<pre class="brush:c">#pragma omp parallel for
for(int i = 0; i &lt; size; ++i)
    C[i] = A[i] * alpha + B[i] * (1 - alpha);</pre>
<p>Notice that there are no loop-carried dependencies: no iteration of the loop depends on the result of another iteration, which makes the loop suitable for parallelization. Just by adding a single pragma, the time needed to execute this loop may drop almost in proportion to the number of cores on multi-core systems.</p>
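<p>To make the distinction concrete, here is a small sketch that puts an independent loop next to one with a loop-carried dependency. The function names <code>blend</code> and <code>running_sum</code> are invented for this illustration and std::vector is used only to keep the example self-contained.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: C[i] is computed only from A[i] and B[i],
// so the iterations can be distributed among threads in any order.
std::vector<float> blend(const std::vector<float>& A,
                         const std::vector<float>& B, float alpha) {
    assert(A.size() == B.size());
    std::vector<float> C(A.size());
    const int size = static_cast<int>(A.size());
    #pragma omp parallel for
    for (int i = 0; i < size; ++i)
        C[i] = A[i] * alpha + B[i] * (1 - alpha);
    return C;
}

// Loop-carried dependency: every element needs the total accumulated by
// the previous iteration, so this loop is deliberately left sequential.
std::vector<float> running_sum(const std::vector<float>& A) {
    std::vector<float> S(A.size());
    float total = 0.0f;
    for (std::size_t i = 0; i < A.size(); ++i) {
        total += A[i];
        S[i] = total;
    }
    return S;
}
```

<p>Simply putting the same pragma on the second loop would make the threads race on <code>total</code>, so the results would no longer be reliable.</p>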
<p>For more control over how many threads carry out the work of this for loop, one can specify the exact number of threads to be used for the operation by adding another option to the pragma:</p>
<pre class="brush:c">#pragma omp parallel for num_threads(4)</pre>
<p>Of course there are plenty of other configuration possibilities that control how the parallelized code actually executes but, again, this article is not meant to be a thorough guide to OpenMP. It is just a foretaste, meant to raise interest in learning more about this prominent tool.</p>
<h2>More than just threads</h2>
<p>We&#8217;ve seen so far that OpenMP makes it possible to introduce basic work sharing into an existing project with minimal effort. However, OpenMP is more than just another way to run separate threads. It also provides very easy-to-use facilities for synchronization and shared data handling that can be the building blocks of any multiprocessing application, including but not limited to the following features:</p>
<ul>
<li>Explicitly scoped variables to indicate shared and thread private storage</li>
<li>Atomic operations and critical sections</li>
<li>Execution barriers for fine grained synchronization</li>
</ul>
<p>The best thing about these is that you just specify the appropriate pragmas for the affected statements or variables and the rest is carried out by OpenMP. For more information on their usage please refer to the <a title="OpenMP specification" href="http://www.openmp.org/wp/openmp-specifications/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.openmp.org/wp/openmp-specifications/?referer=');">OpenMP specification</a>.</p>
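<p>As a rough illustration of the features listed above, the following sketch combines an explicitly shared pair of variables, a thread-private local, an atomic update and a critical section in a single loop. The <code>Stats</code> structure and the <code>scan</code> function are hypothetical names chosen for this example only.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical result type for this illustration.
struct Stats {
    long even_count;
    int  max_value;
};

Stats scan(const std::vector<int>& data) {
    assert(!data.empty());
    long even_count = 0;
    int  max_value  = data[0];
    const int n = static_cast<int>(data.size());

    // even_count and max_value are explicitly shared by all threads.
    #pragma omp parallel for shared(even_count, max_value)
    for (int i = 0; i < n; ++i) {
        int value = data[i];        // declared inside the loop: private to each thread

        if (value % 2 == 0) {
            #pragma omp atomic      // a single scalar update, done atomically
            even_count++;
        }

        #pragma omp critical        // compare-and-update must not interleave
        {
            if (value > max_value) max_value = value;
        }
    }
    // The parallel for ends with an implicit barrier, so both results
    // are complete when execution reaches this point.
    Stats result = { even_count, max_value };
    return result;
}
```

<p>For example, scanning the values 3, 8, -2, 7 and 10 counts three even numbers and finds 10 as the maximum, no matter how many threads take part.</p>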
<h2>Compiler support</h2>
<p>One of the best things about OpenMP is that it is well supported by most of the major C/C++ compiler vendors:</p>
<ul>
<li><strong>GCC</strong> version 4.3.2 and later (enabled with the -fopenmp compiler switch)</li>
<li><strong>Visual C++</strong> 2008 and later (enabled with the /openmp compiler switch)</li>
<li><strong>Intel C/C++</strong> compiler version 10.1 and later (using -Qopenmp on Windows or -openmp on Linux or MacOSX)</li>
</ul>
<p>For a <a title="OpenMP compilers" href="http://www.openmp.org/wp/openmp-compilers/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.openmp.org/wp/openmp-compilers/?referer=');">complete list</a> of supported compilers please refer to the official site of OpenMP.</p>
<p>Another advantage that arises from the way the language integration of OpenMP has been designed is that code usually degrades gracefully on compilers without OpenMP support, as the pragmas can be silently ignored. I intentionally used the word &#8220;usually&#8221;: if the business logic of the application deliberately relies on multi-threaded semantics, then it will not execute in exactly the same way with and without OpenMP. The responsibility for watching out for such situations lies with the developer.</p>
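<p>A minimal sketch of what this graceful degradation can look like in practice is shown below. It relies on the <code>_OPENMP</code> macro, which the OpenMP specification requires conforming compilers to define when OpenMP is enabled. The helper names <code>available_threads</code> and <code>sum_of_squares</code> are made up for this example, and the <code>reduction</code> clause is a standard OpenMP feature that is not covered elsewhere in this article.</p>

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>    // run-time library header, used only when OpenMP is enabled
#endif

// Hypothetical helper: how many threads a parallel region may use.
// Without OpenMP the answer is always one.
int available_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Produces the same result with or without OpenMP: each thread sums a
// private partial total and the partial totals are added up at the end.
long sum_of_squares(int n) {
    assert(n >= 0);
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; ++i)
        total += static_cast<long>(i) * i;
    return total;
}
```

<p>Calls into the run-time library are guarded by the macro, while the pragma needs no guard at all, which is exactly the property described above.</p>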
<h2>Conclusion</h2>
<p>My personal opinion is that OpenMP best suits situations where legacy code needs a gradual transition towards a parallelized system, or where one is looking for the easiest possible way to take advantage of multiprocessing-capable environments. Still, OpenMP is capable of handling almost all the tasks involved in implementing completely new applications with parallel programming in mind, so I recommend it to everybody, even for general use.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/01/exploit-parallelism-with-the-least-effort/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Unit testing in C++</title>
		<link>http://rastergrid.com/blog/2010/01/unit-testing-in-c/</link>
		<comments>http://rastergrid.com/blog/2010/01/unit-testing-in-c/#comments</comments>
		<pubDate>Mon, 11 Jan 2010 16:41:39 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[CPPUTest]]></category>
		<category><![CDATA[CxxTest]]></category>
		<category><![CDATA[GoogleMock]]></category>
		<category><![CDATA[GoogleTest]]></category>
		<category><![CDATA[mocks]]></category>
		<category><![CDATA[TDD]]></category>
		<category><![CDATA[unit test]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=9</guid>
		<description><![CDATA[

Many people are looking for information about which particular C++ unit testing framework they should use for their project and there are also many articles discuss the topic but few articles talk about mock frameworks which are even more important factor when applying unit testing in practice and they have much greater effect on the [...]]]></description>
			<content:encoded><![CDATA[
<p>Many people are looking for information about which particular C++ unit testing framework they should use for their project, and many articles discuss the topic. Few of them, however, talk about mock frameworks, which are an even more important factor when applying unit testing in practice and have a much greater effect on productivity when doing test-driven development.<br />
<span id="more-9"></span></p>
<h2>Test-driven development in a nutshell</h2>
<p>My first experience with unit testing came during my university studies. As we learned Java development, we also worked a little with JUnit, as it is the most popular unit testing framework for Java and also the grandfather of almost all current frameworks. It felt like a pain in the a** to me, because it seemed ridiculous to use it for our trivial programs and because I&#8217;m really not a big fan of Java and its monolithic development environment called NetBeans, but I will talk about that in another article.</p>
<p>I next encountered unit testing at my current job, where we used TNSDLUnit, a proprietary framework developed by employees as an internal open-source project for our TNSDL language. Basically, this framework is just a TNSDL-specific version of the well-known CPPUTest framework. This time I saw real-life examples and soon discovered the potential of unit testing, in particular its application in test-driven development.</p>
<p>Test-driven development, or TDD for short, is a software development process that has become very popular in the recent past, and not by accident, as it is a very powerful tool in good hands. It introduces a very simple development cycle that results in simple, clean and test-covered code. The process consists of five straightforward steps:</p>
<ol>
<li><strong>Add a test</strong> &#8211; this is made based on a particular requirement</li>
<li><strong>Run all tests and see if the new one fails</strong> &#8211; this confirms that the new test really requires a change to the code</li>
<li><strong>Write some code</strong> &#8211; modify code to make the new test pass</li>
<li><strong>Run all tests and see them succeed</strong> &#8211; if the implementation is good then the new test now must pass</li>
<li><strong>Refactor code</strong> &#8211; tidy up the code as you can be sure now that if you do something wrong then tests will fail</li>
</ol>
<p>Not only does this development method provide 100% test coverage for your code, it also forces you to base the implementation on the requirements and not vice versa.</p>
<p>In its stricter form, TDD also states that one must not write more code than is necessary to make the test pass. This sometimes results in very awkward implementation stubs at the beginning, but it prevents the developer from creating untested code (note that 100% coverage is usually not enough for proper testing).</p>
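<p>The cycle can be sketched without any framework at all, using nothing but <code>assert</code>. The requirement (&#8220;values outside the 0..255 range are clamped&#8221;) and the names <code>clamp_to_byte</code> and <code>test_clamp_to_byte</code> are invented for this illustration. A real project would use one of the unit test frameworks discussed later.</p>

```cpp
#include <cassert>

// Step 3: just enough code to make the test below pass. During step 2
// this function was only a stub returning its argument unchanged, which
// made the test fail as expected.
int clamp_to_byte(int value) {
    if (value < 0) return 0;
    if (value > 255) return 255;
    return value;
}

// Step 1: the test, written first and derived directly from the requirement.
void test_clamp_to_byte() {
    assert(clamp_to_byte(-5) == 0);     // below the range
    assert(clamp_to_byte(0) == 0);      // lower boundary
    assert(clamp_to_byte(128) == 128);  // inside the range
    assert(clamp_to_byte(255) == 255);  // upper boundary
    assert(clamp_to_byte(300) == 255);  // above the range
}
```

<p>Steps 4 and 5 then consist of calling <code>test_clamp_to_byte()</code> after every change: as long as it passes, the implementation can be refactored freely.</p>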
<h2>Why use unit testing?</h2>
<p>Many people question the necessity of testing and ask why they should test at all, and why with unit tests. It may indeed seem ridiculous to unit test code that consists of a few hundred or a few thousand source lines and is easy to follow even for less experienced people. The importance of testing becomes apparent when developing huge code bases that are continuously modified by a group of people. One can not only test whether newly implemented functionality works as expected but also execute automated regression sets to ensure that no existing functionality has been broken by a modification.</p>
<p>Others are concerned that unit testing doubles the amount of code and thus increases maintenance, because the testing code is as big as the implementation itself. In real-life projects, in fact, the testing code can even be twice the size of the production code. However, the benefit of a well-tested product eliminates most of the cost of bug fixes, as bugs are discovered at the very beginning of the development process.</p>
<h2>C++ unit test frameworks</h2>
<p>When I started to work with C++, one of the first things I did was to look around for a good unit test framework. There are plenty of them, so I will mention only some of the most popular ones: CPPUnit, CPPUTest, CxxTest, UnitTest++, Boost Test Library, GoogleTest, etc.</p>
<p>Most people spend too much time on deciding which unit test framework to use while they all have very similar syntax structure and all support the most needed assertions and facilities that are needed to get started with TDD. I think this is just a matter of taste.</p>
<h2>Resolving dependencies</h2>
<p>As one starts to actively use unit testing, one will face difficulties that arise from the dependencies between different elements of the software system, whether it be an external database, a foreign module or just another class of one&#8217;s own referenced by the code unit under test. These problems force people to write stubs or mocks to remove external dependencies.</p>
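<p>As a rough illustration, here is a hand-written mock that replaces an external dependency. All names (<code>UserStore</code>, <code>Registration</code>, <code>MockUserStore</code>) are hypothetical, and the sketch uses plain <code>assert</code> and C++11 rather than any particular test or mock framework:</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical dependency: in production this would talk to a real database.
class UserStore {
public:
    virtual ~UserStore() = default;
    virtual bool exists(const std::string& name) = 0;
    virtual void save(const std::string& name) = 0;
};

// Unit under test: it knows the dependency only through the interface.
class Registration {
public:
    explicit Registration(UserStore& store) : store_(store) {}

    // Returns true if the user was newly registered.
    bool register_user(const std::string& name) {
        if (store_.exists(name)) {
            return false;
        }
        store_.save(name);
        return true;
    }

private:
    UserStore& store_;
};

// Hand-written mock: returns a canned answer like a stub
// and additionally records the calls made by the unit under test.
class MockUserStore : public UserStore {
public:
    bool exists_result = false;
    std::vector<std::string> saved;

    bool exists(const std::string&) override { return exists_result; }
    void save(const std::string& name) override { saved.push_back(name); }
};

void test_new_user_is_saved() {
    MockUserStore store;
    Registration registration(store);
    assert(registration.register_user("alice"));
    assert(store.saved.size() == 1 && store.saved[0] == "alice");
}

void test_existing_user_is_not_saved() {
    MockUserStore store;
    store.exists_result = true;  // pretend the user is already in the database
    Registration registration(store);
    assert(!registration.register_user("alice"));
    assert(store.saved.empty());
}
```

<p>The two test functions can be called from <code>main</code> or registered with whatever test runner is in use. Every new method on the interface and every new kind of expectation means more hand-written code in the mock, which is where the maintenance cost comes from.</p>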
<p>Stubs are often not flexible enough and writing mocks by hand sometimes seems to be too expensive. Either way, both carry a great penalty in development time and maintenance cost. So, at some point, dependencies become the main problem; the question of which unit test framework to use fades into the background and people start to look for a mock framework. These libraries not only enable easy creation of mock objects but also integrate them tightly into the unit test framework, and sometimes they even provide automated mock generation. Fortunately, there are also plenty of mock frameworks for C++. Here are some of them: mockpp, Mock Objects, mockcpp, GoogleMock, etc.</p>
<p>Here the differences are much more visible from the usability, portability and maturity points of view. Some of them automatically generate mocks, others need code for that as well. Some of them depend on specific application binary interface (ABI) formats, others do not. Some of them need exotic language features while others work on all the significant compilers. Also, due to the complexity of the C++ language, it is very hard to correctly parse code that is heavily obfuscated by macros, so in many cases the automatic generation of mocks simply doesn&#8217;t work.</p>
<p>My vote went to GoogleMock, and as such I started to use GoogleTest as my unit test framework, for the following reasons:</p>
<ul>
<li>It does not depend on any special language feature or ABI format so it&#8217;s portable</li>
<li>It integrates natively with GoogleTest, so there is no need to worry about compatibility issues</li>
<li>Mocks have to be written by hand, but this takes just a few lines of code (it also has a mock generator, though I don&#8217;t use it)</li>
</ul>
<h2>Summary</h2>
<p>As a final conclusion: if you take my advice, then you don&#8217;t take anybody&#8217;s advice. Try things out yourself and see whether a particular tool set fits your taste or not. It strongly depends on what you plan to unit test. Just keep it simple and have fun with TDD. Yes, it&#8217;s really fun!</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/01/unit-testing-in-c/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
