<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>RasterGrid Blog &#187; GPU</title>
	<atom:link href="http://rastergrid.com/blog/tag/gpu/feed/" rel="self" type="application/rss+xml" />
	<link>http://rastergrid.com/blog</link>
	<description>A technical blog from Daniel Rákos (aka aqnuep)</description>
	<lastBuildDate>Fri, 04 Nov 2011 18:10:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>OpenGL vs DirectX: The War Is Far From Over</title>
		<link>http://rastergrid.com/blog/2011/10/opengl-vs-directx-the-war-is-far-from-over/</link>
		<comments>http://rastergrid.com/blog/2011/10/opengl-vs-directx-the-war-is-far-from-over/#comments</comments>
		<pubDate>Fri, 07 Oct 2011 19:02:12 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Direct3D]]></category>
		<category><![CDATA[DirectX]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[occlusion culling]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[tessellation control shader]]></category>
		<category><![CDATA[tessellation evaluation shader]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[uniform buffer]]></category>
		<category><![CDATA[vertex buffer]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=652</guid>
		<description><![CDATA[I&#8217;ve chosen the title based on the popular article that tries to prove that OpenGL lost the war against Direct3D. To be honest, I didn&#8217;t really like the article at all. First, because it compared OpenGL 3, which targeted Shader Model 4.0 hardware, with DirectX 11, which targeted Shader Model 5.0 hardware. Besides that, as we]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 260px"><img title="OpenGL vs DirectX" src="http://rastergrid.com/blog/wp-content/uploads/2011/10/opengl-vs-directx-250x138.jpg" alt="OpenGL vs DirectX" width="250" height="138" /><p class="wp-caption-text">The War Is Far From Over</p></div>
<p>I&#8217;ve chosen the title based on the <a title="OpenGL 3 &amp; DirectX 11: The War Is Over" href="http://www.tomshardware.com/reviews/opengl-directx,2019.html" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.tomshardware.com/reviews/opengl-directx_2019.html?referer=');">popular article</a> that tries to prove that OpenGL lost the war against Direct3D. To be honest, I didn&#8217;t really like the article at all. First, because it compared OpenGL 3, which targeted Shader Model 4.0 hardware, with DirectX 11, which targeted Shader Model 5.0 hardware. Besides that, as we will see, the war is really far from over&#8230; This article aims to list the most important features introduced by OpenGL 3.x, OpenGL 4.x, Direct3D 10 and Direct3D 11, and we will also talk about the promised features of the upcoming Direct3D 11.1 to be fair to DirectX <img src='http://rastergrid.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><span id="more-652"></span></p>
<p>After I wrote <a title="An introduction to OpenGL 4.2" href="http://rastergrid.com/blog/2011/08/an-introduction-to-opengl-4-2/">my article about the latest features introduced in OpenGL</a>, someone asked me whether I could write an article comparing the hardware features exposed by OpenGL and Direct3D. Instead of a long explanation, I decided to simply create a table of the features introduced by the APIs. Please note that the list focuses on hardware features and does not discuss API-level feature differences between the two. The list may be far from complete, and I&#8217;m happy to get feedback about what is missing from the table so that I can extend it. There are also features for which I could not determine whether an equivalent exists in D3D; these are marked with a question mark. If anybody can point me to the answer, I would be happy, as I did not find a specification of the HLSL versions.</p>
<table style="width: 100%;" border="0">
<tbody>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>HARDWARE FEATURES EXPOSED</strong></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Draw command related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Conditional/predicated rendering based on the result of occlusion queries (<a href="http://www.opengl.org/registry/specs/NV/conditional_render.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/conditional_render.txt?referer=');">NV_conditional_render</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Basic geometry instancing support and instanced draw commands (<a href="http://www.opengl.org/registry/specs/ARB/draw_instanced.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_instanced.txt?referer=');">ARB_draw_instanced</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Geometry instancing with the ability to specify instanced vertex attributes (<a href="http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/instanced_arrays.txt?referer=');">ARB_instanced_arrays</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Primitive restart (cut index) feature for batching multiple strips together (<a href="http://www.opengl.org/registry/specs/NV/primitive_restart.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/primitive_restart.txt?referer=');">NV_primitive_restart</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Draw commands allowing modification of the base vertex index (<a href="http://www.opengl.org/registry/specs/ARB/draw_elements_base_vertex.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_elements_base_vertex.txt?referer=');">ARB_draw_elements_base_vertex</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Indirect draw commands that source their parameters from server side buffers (<a href="http://www.opengl.org/registry/specs/ARB/draw_indirect.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_indirect.txt?referer=');">ARB_draw_indirect</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>New shader type related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Geometry shader support and adjacency primitive support (<a href="http://www.opengl.org/registry/specs/ARB/geometry_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/geometry_shader4.txt?referer=');">ARB_geometry_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Instanced geometry shader support with fixed number of invocations (<a href="http://www.opengl.org/registry/specs/ARB/gpu_shader5.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/gpu_shader5.txt?referer=');">ARB_gpu_shader5</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Tessellation control and evaluation (hull and domain) shader support (<a href="http://www.opengl.org/registry/specs/ARB/tessellation_shader.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/tessellation_shader.txt?referer=');">ARB_tessellation_shader</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Transform feedback (stream-output) related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Basic transform feedback (stream-output) support (<a href="http://www.opengl.org/registry/specs/EXT/transform_feedback.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/transform_feedback.txt?referer=');">EXT_transform_feedback</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Transform feedback support without a geometry shader being active (<a href="http://www.opengl.org/registry/specs/EXT/transform_feedback.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/transform_feedback.txt?referer=');">EXT_transform_feedback</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for pausing and resuming transform feedback (stream-output) (<a href="http://www.opengl.org/registry/specs/ARB/transform_feedback2.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback2.txt?referer=');">ARB_transform_feedback2</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Auto-draw support (feed back the contents of the transform feedback buffer) (<a href="http://www.opengl.org/registry/specs/ARB/transform_feedback2.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback2.txt?referer=');">ARB_transform_feedback2</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Instanced auto-draw support (transform feedback buffer drawing with instancing support) (<a href="http://www.opengl.org/registry/specs/ARB/transform_feedback_instanced.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback_instanced.txt?referer=');">ARB_transform_feedback_instanced</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for outputting multiple primitive streams using transform feedback (stream-output) (<a href="http://www.opengl.org/registry/specs/ARB/transform_feedback3.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback3.txt?referer=');">ARB_transform_feedback3</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Asynchronous queries and related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Support for occlusion queries returning the number of samples passed (<a href="http://www.opengl.org/registry/specs/ARB/occlusion_query.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/occlusion_query.txt?referer=');">ARB_occlusion_query</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for occlusion queries returning only a boolean visibility result (<a href="http://www.opengl.org/registry/specs/ARB/occlusion_query2.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/occlusion_query2.txt?referer=');">ARB_occlusion_query2</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the number of vertices processed and the number of vertex shader invocations</td>
<td style="background-color: #cc5555"></td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt1">[1]</a></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the number of geometry shader invocations in case a geometry shader is active</td>
<td style="background-color: #cc5555"></td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt1">[1]</a></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the number of primitives output by the geometry shader (<a href="http://www.opengl.org/registry/specs/EXT/transform_feedback.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/transform_feedback.txt?referer=');">EXT_transform_feedback</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the number of primitives that were sent to the rasterizer (<a href="http://www.opengl.org/registry/specs/EXT/transform_feedback.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/transform_feedback.txt?referer=');">EXT_transform_feedback</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the number of primitives that passed clipping and were actually rendered</td>
<td style="background-color: #cc5555"></td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt1">[1]</a></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the number of times a fragment/pixel shader was invoked</td>
<td style="background-color: #cc5555"></td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt1">[1]</a></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the number of primitives written during transform feedback (stream-output) (<a href="http://www.opengl.org/registry/specs/EXT/transform_feedback.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/transform_feedback.txt?referer=');">EXT_transform_feedback</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the number of primitives generated during transform feedback (stream-output) (<a href="http://www.opengl.org/registry/specs/EXT/transform_feedback.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/transform_feedback.txt?referer=');">EXT_transform_feedback</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query a server side high resolution timestamp (<a href="http://www.opengl.org/registry/specs/ARB/timer_query.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/timer_query.txt?referer=');">ARB_timer_query</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to query the completion of rendering commands (<a href="http://www.opengl.org/registry/specs/ARB/sync.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/sync.txt?referer=');">ARB_sync</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Texture, vertex and renderbuffer format related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Floating point color and depth formats for textures and render buffers (various extensions)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Cube map textures with depth component internal format (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Half-float (16-bit) vertex and pixel data support (<a href="http://www.opengl.org/registry/specs/NV/half_float.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/half_float.txt?referer=');">NV_half_float</a>, <a href="http://www.opengl.org/registry/specs/ARB/half_float_pixel.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/half_float_pixel.txt?referer=');">ARB_half_float_pixel</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Non-normalized integer color formats for textures and renderbuffers (<a href="http://www.opengl.org/registry/specs/EXT/texture_integer.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/texture_integer.txt?referer=');">EXT_texture_integer</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Packed depth/stencil texture and renderbuffer formats (<a href="http://www.opengl.org/registry/specs/EXT/packed_depth_stencil.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/packed_depth_stencil.txt?referer=');">EXT_packed_depth_stencil</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">RGTC texture compression for two-component textures (<a href="http://www.opengl.org/registry/specs/EXT/texture_compression_rgtc.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/texture_compression_rgtc.txt?referer=');">EXT_texture_compression_rgtc</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Signed normalized texture component formats (<a href="http://www.opengl.org/registry/specs/EXT/texture_snorm.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/texture_snorm.txt?referer=');">EXT_texture_snorm</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Seamless cube map filtering support (to hide artifacts at cube map edges) (<a href="http://www.opengl.org/registry/specs/ARB/seamless_cube_map.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/seamless_cube_map.txt?referer=');">ARB_seamless_cube_map</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for swizzling the components of a texture (<a href="http://www.opengl.org/registry/specs/ARB/texture_swizzle.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_swizzle.txt?referer=');">ARB_texture_swizzle</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
</tr>
<tr>
<td style="padding: 0px">BPTC texture compression for floating point and unsigned normalized textures (<a href="http://www.opengl.org/registry/specs/ARB/texture_compression_bptc.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_compression_bptc.txt?referer=');">ARB_texture_compression_bptc</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">64-bit floating point vertex attribute formats (<a href="http://www.opengl.org/registry/specs/ARB/vertex_attrib_64bit.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/vertex_attrib_64bit.txt?referer=');">ARB_vertex_attrib_64bit</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>New texture type related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">One- and two-dimensional layered array textures (<a href="http://www.opengl.org/registry/specs/EXT/texture_array.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/texture_array.txt?referer=');">EXT_texture_array</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Cube map array textures as special two-dimensional array textures (<a href="http://www.opengl.org/registry/specs/ARB/texture_cube_map_array.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_cube_map_array.txt?referer=');">ARB_texture_cube_map_array</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Rectangular textures with no mipmap support and that are accessed with integer coordinates (<a href="http://www.opengl.org/registry/specs/ARB/texture_rectangle.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_rectangle.txt?referer=');">ARB_texture_rectangle</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Multisampled textures and support for fetching specific sample locations (<a href="http://www.opengl.org/registry/specs/ARB/texture_multisample.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_multisample.txt?referer=');">ARB_texture_multisample</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Casting a texture&#8217;s interpreted internal format to another internal format</td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt4">[4]</a></td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt4">[4]</a></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Uniform buffer (constant buffer) related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Basic uniform buffer (constant buffer) support (<a href="http://www.opengl.org/registry/specs/ARB/uniform_buffer_object.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/uniform_buffer_object.txt?referer=');">ARB_uniform_buffer_object</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for large uniform buffers and binding subranges (<a href="http://www.opengl.org/registry/specs/ARB/uniform_buffer_object.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/uniform_buffer_object.txt?referer=');">ARB_uniform_buffer_object</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Framebuffer and texture rendering related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Rendering to textures and renderbuffers (<a href="http://www.opengl.org/registry/specs/EXT/framebuffer_object.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/framebuffer_object.txt?referer=');">EXT_framebuffer_object</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Multisample stretch blit functionality (<a href="http://www.opengl.org/registry/specs/EXT/framebuffer_multisample.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/framebuffer_multisample.txt?referer=');">EXT_framebuffer_multisample</a>, <a href="http://www.opengl.org/registry/specs/EXT/framebuffer_blit.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/framebuffer_blit.txt?referer=');">EXT_framebuffer_blit</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">sRGB rendering and blending support for framebuffers (<a href="http://www.opengl.org/registry/specs/EXT/framebuffer_sRGB.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/framebuffer_sRGB.txt?referer=');">EXT_framebuffer_sRGB</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for enabling or disabling clamping of the depth of fragments (<a href="http://www.opengl.org/registry/specs/ARB/depth_clamp.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/depth_clamp.txt?referer=');">ARB_depth_clamp</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for logical operations on integer render targets (supported for a decade in OpenGL)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Blending related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Support for alpha-to-coverage when using multisampling (<a href="http://www.opengl.org/registry/specs/ARB/multisample.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/multisample.txt?referer=');">ARB_multisample</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Per-color-buffer blend enables and color writemasks (<a href="http://www.opengl.org/registry/specs/EXT/draw_buffers2.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/draw_buffers2.txt?referer=');">EXT_draw_buffers2</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Dual-source color blending support based on a secondary output of the fragment shader (<a href="http://www.opengl.org/registry/specs/ARB/blend_func_extended.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/blend_func_extended.txt?referer=');">ARB_blend_func_extended</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Individual blend equations and blend functions support for each color output (<a href="http://www.opengl.org/registry/specs/ARB/draw_buffers_blend.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_buffers_blend.txt?referer=');">ARB_draw_buffers_blend</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Shader related features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Texture lookup functions to access individual texels of a LOD using integer coordinates (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Query the dimensions of a specific LOD of a texture in shaders (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Ability to apply integer offsets to the texel location during texture lookup (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Ability to explicitly pass in derivative values that are used to compute LOD during texture lookup (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Control over varying variable interpolation: non-perspective, flat, centroid sampling, etc. (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Full signed and unsigned integer support in shaders (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Vertex ID built-in variable available in vertex shader (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Primitive ID built-in variable available in geometry and fragment shader (<a href="http://www.opengl.org/registry/specs/EXT/gpu_shader4.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/gpu_shader4.txt?referer=');">EXT_gpu_shader4</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Instance ID built-in variable available in vertex shader (<a href="http://www.opengl.org/registry/specs/ARB/draw_instanced.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_instanced.txt?referer=');">ARB_draw_instanced</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Shader fragment coordinate convention control (<a href="http://www.opengl.org/registry/specs/ARB/fragment_coord_conventions.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/fragment_coord_conventions.txt?referer=');">ARB_fragment_coord_conventions</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
</tr>
<tr>
<td style="padding: 0px">Provoking vertex control (for flat shaded varying value selection) (<a href="http://www.opengl.org/registry/specs/ARB/provoking_vertex.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/provoking_vertex.txt?referer=');">ARB_provoking_vertex</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cc5555;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for encoding and decoding floating point values from and to integers (<a href="http://www.opengl.org/registry/specs/ARB/shader_bit_encoding.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_bit_encoding.txt?referer=');">ARB_shader_bit_encoding</a>)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for getting the results of the automatic LOD computations in shaders (<a href="http://www.opengl.org/registry/specs/ARB/texture_query_lod.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_query_lod.txt?referer=');">ARB_texture_query_lod</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for coherent indexing into arrays of samplers using non-constant indices (addressable samplers) (<a href="http://www.opengl.org/registry/specs/ARB/gpu_shader5.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/gpu_shader5.txt?referer=');">ARB_gpu_shader5</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for indexing into arrays of uniform blocks (addressable constant buffers) (<a href="http://www.opengl.org/registry/specs/ARB/gpu_shader5.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/gpu_shader5.txt?referer=');">ARB_gpu_shader5</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Gathered texture fetches over a 2&#215;2 footprint (with custom offsets) (<a href="http://www.opengl.org/registry/specs/ARB/texture_gather.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_gather.txt?referer=');">ARB_texture_gather</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Invocation ID built-in variable available in geometry shader (<a href="http://www.opengl.org/registry/specs/ARB/gpu_shader5.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/gpu_shader5.txt?referer=');">ARB_gpu_shader5</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for double-precision floating-point data types in shaders (<a href="http://www.opengl.org/registry/specs/ARB/gpu_shader_fp64.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/gpu_shader_fp64.txt?referer=');">ARB_gpu_shader_fp64</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for sample-frequency fragment shader execution (<a href="http://www.opengl.org/registry/specs/ARB/sample_shading.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/sample_shading.txt?referer=');">ARB_sample_shading</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for indirect subroutine calls in all shader stages (<a href="http://www.opengl.org/registry/specs/ARB/shader_subroutine.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_subroutine.txt?referer=');">ARB_shader_subroutine</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for selecting from multiple viewports using a geometry shader (<a href="http://www.opengl.org/registry/specs/ARB/viewport_array.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/viewport_array.txt?referer=');">ARB_viewport_array</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for dedicated atomic counters in shaders (<a href="http://www.opengl.org/registry/specs/ARB/shader_atomic_counters.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_atomic_counters.txt?referer=');">ARB_shader_atomic_counters</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55; text-align: center;"><a href="#tblcmt2">[2]</a></td>
<td style="background-color: #55cc55; text-align: center;"><a href="#tblcmt2">[2]</a></td>
</tr>
<tr>
<td style="padding: 0px">Support for backing up dedicated atomic counters with buffers (<a href="http://www.opengl.org/registry/specs/ARB/shader_atomic_counters.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_atomic_counters.txt?referer=');">ARB_shader_atomic_counters</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt5">[5]</a></td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt5">[5]</a></td>
</tr>
<tr>
<td style="padding: 0px">Support for load/store (read/write) buffers and textures in shaders (<a href="http://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_image_load_store.txt?referer=');">ARB_shader_image_load_store</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #cccc55; text-align: center;"><a href="#tblcmt3">[3]</a></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for atomic operations on load/store buffers and textures (<a href="http://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_image_load_store.txt?referer=');">ARB_shader_image_load_store</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for disabling or forcing early depth test (<a href="http://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_image_load_store.txt?referer=');">ARB_shader_image_load_store</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for conservative depth (enabling safe early tests even when modifying depth) (<a href="http://www.opengl.org/registry/specs/ARB/conservative_depth.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/conservative_depth.txt?referer=');">ARB_conservative_depth</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support for coverage as input to the fragment shader (<a href="http://www.opengl.org/registry/specs/ARB/gpu_shader5.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/gpu_shader5.txt?referer=');">ARB_gpu_shader5</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="text-align: center; background-color: #c5e526;" colspan="6"><strong>Miscellaneous features</strong></td>
</tr>
<tr style="height: 20px">
<td style="background-color: #aaaaaa;"></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 3.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">GL 4.x</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 10</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11</span></strong></td>
<td style="text-align: center; width: 50px; background-color: #aaaaaa; padding: 0px;"><strong><span style="color: #ffffff;">DX 11.1</span></strong></td>
</tr>
<tr>
<td style="padding: 0px">Support for floating point viewport specification (<a href="http://www.opengl.org/registry/specs/ARB/viewport_array.txt" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/viewport_array.txt?referer=');">ARB_viewport_array</a>)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Per-texture mipmap clamping (supported since the very early versions of OpenGL)</td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
<tr>
<td style="padding: 0px">Support to use a single depth texture for depth testing and as texture input (when depth writes are disabled)</td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #cc5555;"></td>
<td style="background-color: #55cc55;"></td>
<td style="background-color: #55cc55;"></td>
</tr>
</tbody>
</table>
<p><a name="tblcmt1">[1]</a> There is no support for these counters in OpenGL; however, they can be implemented with the help of shader atomic counters.<br />
<a name="tblcmt2">[2]</a> Direct3D offers no way to use the dedicated atomic counter hardware (currently supported only by AMD GPUs) other than through an append/consume buffer. However, as atomic counters are part of UAVs and an arbitrary number of UAVs can be attached to a single resource, the same functionality is supported indirectly.<br />
<a name="tblcmt3">[3]</a> There is read/write buffer and texture support in Direct3D 11; however, it is available only in the fragment (pixel) shader and the compute shader. Direct3D 11.1 plans to remove this restriction.<br />
<a name="tblcmt4">[4]</a> There is no support for texture format casting in OpenGL; a conversion can, however, be done with a copy, preferably using pixel buffer objects.<br />
<a name="tblcmt5">[5]</a> There is no support for automatic storage of atomic counter values in buffers in Direct3D; however, their values can be copied manually to arbitrary resources.</p>
<p>In conclusion, I would like to say just one thing: even though each API lacks a few features that the other supports, we really can say that OpenGL and Direct3D are on par in terms of the number of hardware features they expose.</p>
<p>(Sorry in advance for any mistakes; it took quite some time to create this table and I may have become too tired toward the end.)</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2011/10/opengl-vs-directx-the-war-is-far-from-over/feed/</wfw:commentRss>
		<slash:comments>70</slash:comments>
		</item>
		<item>
		<title>An introduction to OpenGL 4.2</title>
		<link>http://rastergrid.com/blog/2011/08/an-introduction-to-opengl-4-2/</link>
		<comments>http://rastergrid.com/blog/2011/08/an-introduction-to-opengl-4-2/#comments</comments>
		<pubDate>Sun, 28 Aug 2011 14:25:25 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[atomic counter]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[image load store]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[texture buffer]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[uniform buffer]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=611</guid>
		<description><![CDATA[After the release of the OpenGL 4.1 specification the Khronos Group slowed down the pace a little bit, but they didn&#8217;t leave OpenGL developers without a new specification version for too long, as a few weeks ago they released OpenGL 4.2. The new version of the specification brings several API improvements as well as exposes]]></description>
			<content:encoded><![CDATA[
<p>After the release of the OpenGL 4.1 specification the Khronos Group slowed down the pace a little bit, but they didn&#8217;t leave OpenGL developers without a new specification version for too long, as a few weeks ago they released OpenGL 4.2. The new version of the specification brings several API improvements as well as exposes some important pieces of hardware functionality that make OpenGL 4.x class hardware a great step forward in GPU history. This article presents the features newly introduced in the latest version of the OpenGL specification and, as a few months ago I wrote an article about <a title="Suggestion for OpenGL 4.2 and beyond" href="http://rastergrid.com/blog/2010/11/suggestions-for-opengl-4-2-and-beyond/">Suggestions for OpenGL 4.2 and beyond</a>, I will also write a few words about how the new specification reflects my forecast.</p>
<p><span id="more-611"></span></p>
<h2>New features in OpenGL 4.2</h2>
<p>OpenGL 4.2 finally fills the holes in the capability matrix of Shader Model 5.0 hardware with some long-awaited extensions, some of whose functionality was already accessible through cross-vendor and vendor-specific extensions. The new version of the specification also brings some important API improvement extensions and GLSL constructs that continue the transition to easier-to-use state and shader management.</p>
<h3><a title="GL_ARB_texture_compression_bptc" href="http://www.opengl.org/registry/specs/ARB/texture_compression_bptc.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_compression_bptc.txt?referer=');">ARB_texture_compression_bptc</a></h3>
<p>This extension adds the new block compression texture formats called BC7 and BC6H in Direct3D terminology. The extension has actually been available for quite some time, since the release of OpenGL 4.0, but it has now become core. The formats provide high quality block compression for fixed point RGBA and sRGB textures as well as two floating point texture compression formats for signed and unsigned data.</p>
<p>Traditional block compression methods (such as S3TC or RGTC) use the gradients in a block of pixels, which works fine for smooth images but provides poor results in case of sharp edges. BPTC solves the issue by dividing blocks into multiple partitions that are compressed using independent gradients, thus providing better overall quality.</p>
<p>When comparing compression efficiency, BPTC has a compression ratio of 3:1, compared to the 6:1, 4:1 and 2:1 compression ratios of the S3TC DXT1, S3TC DXT5 and RGTC formats respectively.</p>
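<p>As a rough check of these ratios, here is a small Python sketch (an illustration only, not part of any specification) that derives them from the block sizes. It assumes 4x4 texel blocks of 64 bits for DXT1 and single-channel RGTC, and 128 bits for DXT5 and BPTC; the function name and the choice of uncompressed reference format are my assumptions. Note that the 3:1 figure for BPTC corresponds to a 24-bit RGB reference; against 32-bit RGBA data the same block size gives 4:1.</p>

```python
def compression_ratio(uncompressed_bits_per_texel, block_bits, block_texels=16):
    """Ratio of uncompressed size to compressed size for a block format."""
    compressed_bits_per_texel = block_bits / block_texels
    return uncompressed_bits_per_texel / compressed_bits_per_texel


# (reference bits per texel, bits per 4x4 block); the reference format is an assumption
formats = {
    "S3TC DXT1 vs RGB8": (24, 64),
    "S3TC DXT5 vs RGBA8": (32, 128),
    "RGTC1 vs R8": (8, 64),
    "BPTC vs RGB8": (24, 128),
    "BPTC vs RGBA8": (32, 128),
}

ratios = {name: compression_ratio(*sizes) for name, sizes in formats.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:g}:1")
```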
<h3><a title="GL_ARB_compressed_texture_pixel_storage" href="http://www.opengl.org/registry/specs/ARB/compressed_texture_pixel_storage.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/compressed_texture_pixel_storage.txt?referer=');">ARB_compressed_texture_pixel_storage</a></h3>
<p>This is an interesting extension that solves a problem that I didn&#8217;t even know was such a big issue. The extension is designed primarily to support compressed image formats with fixed-size blocks, such as BPTC. The application can use this extension to configure pixel store parameters so that subtexture operations provide consistent results in all cases.</p>
<h3><a title="GL_ARB_texture_storage" href="http://www.opengl.org/registry/specs/ARB/texture_storage.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_storage.txt?referer=');">ARB_texture_storage</a></h3>
<p>This is again an interesting extension that improves the API through which texture storage is allocated in classic OpenGL. As we all know, OpenGL has always been too ad hoc about resource management, in the sense of when actual resources are allocated for a particular API primitive. This is especially a problem in case of textures, where we potentially talk about large amounts of data. In classic OpenGL the driver could not know up front, for example, whether the application would need mipmaps for the texture or how many levels would be required. This could easily result in bad allocation patterns and/or large reallocations. This extension introduces the concept of immutable texture images, where all the levels are allocated up-front for a texture object.</p>
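<p>To illustrate what allocating all the levels up-front means, the sketch below enumerates the full mipmap chain that an immutable allocation (for example through the extension&#8217;s TexStorage2D command) could cover. It is a CPU-side illustration only; the helper name is hypothetical, and it assumes the usual rule that each level halves the previous one, rounding down, with a minimum of 1.</p>

```python
def full_mip_chain(width, height):
    """Return the (width, height) of every mip level down to 1x1."""
    levels = [(width, height)]
    while (width, height) != (1, 1):
        width = max(1, width // 2)
        height = max(1, height // 2)
        levels.append((width, height))
    return levels


chain = full_mip_chain(256, 64)
print(len(chain), chain[-1])
```

<p>The length of the returned list is the level count an application would pass when it wants the complete chain.</p>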
<h3><a title="GL_ARB_transform_feedback_instanced" href="http://www.opengl.org/registry/specs/ARB/transform_feedback_instanced.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback_instanced.txt?referer=');">ARB_transform_feedback_instanced</a></h3>
<p>This extension extends the so called &#8220;AutoDraw&#8221; feature by providing instanced &#8220;AutoDraw&#8221;. This means that geometry captured using transform feedback can be rendered multiple times using geometry instancing. This is actually a feature that even D3D11 does not provide and, as such, I didn&#8217;t even think that hardware supported it, even though I think the list of usage patterns of the extension is most probably pretty narrow.</p>
<h3><a title="GL_ARB_base_instance" href="http://www.opengl.org/registry/specs/ARB/base_instance.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/base_instance.txt?referer=');">ARB_base_instance</a></h3>
<p>This extension is actually the feature I called <strong>ARB_instanced_arrays2</strong> in my <a title="Suggestions for OpenGL 4.2 and beyond." href="http://rastergrid.com/blog/2010/11/suggestions-for-opengl-4-2-and-beyond/" target="_blank">suggestion list</a>. The extension provides three new draw commands; one of them is rather awkwardly named <strong>DrawElementsInstancedBaseVertexBaseInstance</strong>, even though this command can be considered the &#8220;basic&#8221; indexed draw command, as it specifies all parameters. Also, the parameter list of the indirect indexed draw command is extended with the base instance parameter. Fortunately, the ARB chose to add new commands rather than a <strong>SetBaseInstance</strong>-style state specifier command to introduce the new concept. Funnily enough, this feature had been missing for a long time even though, as far as I know, it is supported by all GPUs capable of instanced drawing, and is available in D3D as well.</p>
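<p>A small sketch of what the base instance does, as I understand the extension: for an instanced vertex attribute the fetched element index is floor(instance / divisor) + baseinstance, while the instance ID seen by the shader still starts at zero. The function name below is hypothetical; this is a CPU-side model, not driver code.</p>

```python
def instanced_attribute_index(instance, divisor, base_instance):
    """Element index fetched for an instanced attribute (divisor must be nonzero)."""
    return instance // divisor + base_instance


# Draw 4 instances whose per-instance data starts at element 10 of the buffer.
indices = [instanced_attribute_index(i, 1, 10) for i in range(4)]
print(indices)
```

<p>This lets several objects share one per-instance buffer without re-specifying vertex attribute pointers between draws.</p>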
<h3><a title="GL_ARB_shader_image_load_store" href="http://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_image_load_store.txt?referer=');">ARB_shader_image_load_store</a></h3>
<p>This is where things start to get really interesting. This new extension is the ARBified version of the extension <a title="GL_EXT_shader_image_load_store" href="http://www.opengl.org/registry/specs/EXT/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/shader_image_load_store.txt?referer=');">EXT_shader_image_load_store</a> which fortunately didn&#8217;t make it into core in its current form.</p>
<p>The extension provides GLSL built-in functions allowing shaders to load from, store to, and perform atomic read-modify-write operations on a single level of a texture, called an image, from any shader stage. Also, the extension indirectly enables the same set of operations for buffer objects by using buffer textures. This enables developers to implement more sophisticated shader algorithms that require more complex data structures than just plain arrays.</p>
<p>This, together with atomic counters that we will talk about later, makes it possible to implement append/consume buffers and rendering techniques like AMD&#8217;s Order-Independent Transparency (OIT) algorithm as <a title="OIT and Indirect Illumination  Using DX11 Linked Lists" href="http://www.slideshare.net/hgruen/oit-and-indirect-illumination-using-dx11-linked-lists" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.slideshare.net/hgruen/oit-and-indirect-illumination-using-dx11-linked-lists?referer=');">presented at GDC10</a>.</p>
<p>The introduction of new write operations to fragment shaders, besides the traditional framebuffer writes, gives shader execution side effects and thus makes it sensitive to whether the hardware uses early-Z or not, so the extension also provides a mechanism to force or disable early-Z in the fragment shader.</p>
<p>A similar issue arises in case of vertex shaders, as the post-transform cache may no longer be valid for certain usage patterns of load/store images. Depending on how smart the shader compiler is, the post-transform cache could easily be disabled when a vertex shader uses load/store images, resulting in degraded performance. Care must therefore be taken when using read/write images in vertex shaders, as OpenGL does not have any mechanism to help with these issues (but I actually have a proposal that I&#8217;ll talk about in a future article).</p>
<p>The API of this extension is greatly improved compared to the EXT version, especially when dealing with various texture image formats. The extension also provides a future-proof DSA-style API. Further, the ARB version of the extension supports loads from any texture format and corrects some specification bugs of the EXT version.</p>
<p>From a hardware implementation point of view, it must be noted that when a shader contains atomic operations applied to a particular read/write image, the driver uses a different hardware path, as required by atomic read-modify-writes, so care must be taken to use atomic operations only when necessary. Also note that this decision is made statically at compile time by the driver, so even a single atomic operation in a rarely taken branch will result in degraded performance. This is another reason to use atomic counters to implement append/consume buffers instead of read/write image atomics.</p>
<h3><a title="GL_ARB_shader_atomic_counters" href="http://www.opengl.org/registry/specs/ARB/shader_atomic_counters.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_atomic_counters.txt?referer=');">ARB_shader_atomic_counters</a></h3>
<p>This is the other long-awaited feature that I also suggested and that was still missing from OpenGL although it was available in D3D11. The specification had actually been in progress for a long time (about a year) and it even appeared for a while in AMD&#8217;s OpenGL drivers, sometimes as an EXT, sometimes as an ARB extension. The extension provides an API to access a number of hardware atomic counters that provide efficient counter operations on a GPU global scale. Atomic counters come in handy in many cases, like append/consume buffers or indirect draw buffer construction.</p>
<p>The extension provides access to these atomic counters from GLSL and also makes it possible to back them with buffer objects, so that after OpenGL draw calls the values of the counters are preserved in these buffers for later use.</p>
<p>The OpenGL implementation is superior to D3D&#8217;s as it provides access to atomic counters from all shader stages. There are caveats, of course: as mentioned in the previous section, the side effects made possible with read/write images and atomic counters require special care in case of fragment and vertex shaders, as they may result in invalid rendering and/or lower performance.</p>
<p>Regarding hardware vendor implementations, it must be noted that atomic counters are much, much faster than read/write image atomics, at least on AMD hardware, which has dedicated hardware for atomic counters. On NVIDIA hardware, though, it seems that there is no separate hardware path for atomic counters, as their performance is roughly the same as that of read/write image atomics.</p>
<p>The dedicated hardware implementation of atomic counters, however, comes with a trade-off, as the number of atomic counters is severely limited on AMD hardware, but one can still use read/write image atomics after running out of atomic counters.</p>
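<p>The append pattern built on such a counter can be modelled on the CPU as follows. This is only an analogy: the lock-protected counter stands in for a hardware atomic counter, the threads stand in for shader invocations, and all names are made up for the illustration.</p>

```python
import threading


class AtomicCounter:
    """Lock-protected stand-in for a GPU atomic counter."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        """Return the value before the increment, so each caller gets a unique slot."""
        with self._lock:
            slot = self._value
            self._value += 1
            return slot

    @property
    def value(self):
        return self._value


def append_visible(items, is_visible, out, counter):
    """Each 'invocation' reserves a unique slot and writes its item there."""
    for item in items:
        if is_visible(item):
            out[counter.increment()] = item


counter = AtomicCounter()
out = [None] * 100
chunks = [range(start, start + 25) for start in range(0, 100, 25)]
threads = [
    threading.Thread(target=append_visible,
                     args=(chunk, lambda x: x % 2 == 0, out, counter))
    for chunk in chunks
]
for t in threads:
    t.start()
for t in threads:
    t.join()

appended = out[:counter.value]
print(counter.value, sorted(appended)[:5])
```

<p>Note that the order of the appended items is not deterministic, only their count and contents are; the same holds for an append buffer filled by parallel shader invocations.</p>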
<h3><a title="GL_ARB_conservative_depth" href="http://www.opengl.org/registry/specs/ARB/conservative_depth.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/conservative_depth.txt?referer=');">ARB_conservative_depth</a></h3>
<p>This is another extension I&#8217;ve suggested and that fills another functionality hole compared to D3D11. The extension is actually an ARBified version of <a title="GL_AMD_conservative_depth" href="http://www.opengl.org/registry/specs/AMD/conservative_depth.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/conservative_depth.txt?referer=');">AMD_conservative_depth</a> that extends the application developer&#8217;s control over early depth and stencil tests. <a title="GL_ARB_shader_image_load_store" href="http://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_image_load_store.txt?referer=');">ARB_shader_image_load_store</a> already provides a way to force or disable early-Z, and this extension adds further modes that give the driver a hint about how depth is modified in a fragment shader that outputs depth. This passes enough information to the GL implementation to activate some early depth test optimizations safely while still preserving the ability to take the final depth value into account in the depth test.</p>
<p>The extension exposes the new capability in the form of layout qualifiers applied to the fragment shader&#8217;s redeclaration of gl_FragDepth, called &#8220;depth_any&#8221;, &#8220;depth_greater&#8221;, &#8220;depth_less&#8221; and &#8220;depth_unchanged&#8221;. The interesting ones are those that assume a greater or lesser depth value as output and provide the ability to reject groups of fragments early using Hi-Z and early-Z even when depth is modified. This technique can greatly improve the rendering performance of volumetric particles, decals and billboards.</p>
<p>As far as I can tell, though, the extension currently provides performance benefits only on AMD hardware, as NVIDIA hardware does not have such functionality; using the extension would thus still force NVIDIA GPUs to disable early-Z in case the fragment shader outputs a depth value, but future hardware may change this.</p>
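<p>The reasoning behind the greater/less modes can be sketched like this. The model below is my simplification, not the specification&#8217;s wording: it only looks at the LESS and GREATER depth functions, treats the depth comparison as a plain float comparison, and uses hypothetical names.</p>

```python
def can_reject_early(qualifier, depth_func, interpolated_depth, stored_depth):
    """True if the fragment is certain to fail the depth test before shading.

    qualifier: 'depth_any', 'depth_greater', 'depth_less' or 'depth_unchanged'
    depth_func: 'less' (passes when the final depth is below the stored depth)
                or 'greater' (passes when it is above)
    """
    if depth_func == "less":
        fails_already = interpolated_depth >= stored_depth
        # The shader may only push depth further away, so a failure stays a failure.
        safe = qualifier in ("depth_greater", "depth_unchanged")
    elif depth_func == "greater":
        fails_already = stored_depth >= interpolated_depth
        # The shader may only pull depth closer, so a failure stays a failure.
        safe = qualifier in ("depth_less", "depth_unchanged")
    else:
        return False
    return safe and fails_already


print(can_reject_early("depth_greater", "less", 0.8, 0.5))
print(can_reject_early("depth_any", "less", 0.8, 0.5))
```

<p>With &#8220;depth_any&#8221; the shader may move the depth in either direction, so in this model nothing can be rejected before the shader has run.</p>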
<h3><a title="GL_ARB_shading_language_420pack" href="http://www.opengl.org/registry/specs/ARB/shading_language_420pack.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shading_language_420pack.txt?referer=');">ARB_shading_language_420pack</a></h3>
<p>This is a strangely named extension that provides a lot of improvements to GLSL. These are mostly API improvements only, but have a great value when looking at source code maintainability and resource management.</p>
<p>I think the most useful addition of the extension is the &#8220;binding&#8221; layout qualifier that I referred to as ARB_explicit_sampler_location and ARB_explicit_uniform_block_index in my <a title="Suggestions for OpenGL 4.2 and beyond." href="http://rastergrid.com/blog/2010/11/suggestions-for-opengl-4-2-and-beyond/" target="_blank">suggestion list</a>. This enables shader writers to explicitly assign a uniform block binding index to a uniform block, as well as to explicitly assign texture and image binding points to sampler and image variables.</p>
<p>Besides that, the extension adds other minor improvements, like implicit conversion of return values of functions, UTF-8 character set support, C-style initializer list support and scalar swizzle operators.</p>
<h3><a title="GL_ARB_internalformat_query" href="http://www.opengl.org/registry/specs/ARB/internalformat_query.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/internalformat_query.txt?referer=');">ARB_internalformat_query</a></h3>
<p>This is another kind of strangely named extension that was meant to provide the possibility to query information about the internal format of textures; however, it actually fails to do so, as it provides only the ability to query the number of samples available for different texture formats.</p>
<p>The extension was ambitious, as it planned to provide internal format information like the ability to query the actual internal format used, whether the format is renderable or accessible in a particular shader stage, whether it can be used as a read/write image, and even to provide performance hints about using a particular texture internal format. Unfortunately, all of these were left for a future extension.</p>
<h3><a title="GL_ARB_map_buffer_alignment" href="http://www.opengl.org/registry/specs/ARB/map_buffer_alignment.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/map_buffer_alignment.txt?referer=');">ARB_map_buffer_alignment</a></h3>
<p>This is the last new extension introduced in OpenGL 4.2. It simply requires that pointers returned by buffer mapping commands have a minimum alignment of 64 bytes, to support processing the data directly with special CPU instructions like SSE or AVX. This can provide a further performance increase when the client modifies buffer data.</p>
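<p>Checking such a guarantee on the client side comes down to simple arithmetic. The sketch below is a generic illustration with made-up helper names and example addresses; the constant 64 mirrors the minimum alignment mentioned above.</p>

```python
MIN_MAP_BUFFER_ALIGNMENT = 64  # bytes, the minimum alignment mentioned above


def is_aligned(address, alignment=MIN_MAP_BUFFER_ALIGNMENT):
    """True if the address is a multiple of the alignment."""
    return address % alignment == 0


def align_up(address, alignment=MIN_MAP_BUFFER_ALIGNMENT):
    """Smallest multiple of the alignment that is not below the address."""
    return (address + alignment - 1) // alignment * alignment


print(is_aligned(0x1000), is_aligned(0x1010), hex(align_up(0x1010)))
```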
<h2>Conclusion</h2>
<p>OpenGL 4.2 has again proven that OpenGL is not dead, but in fact aims to be the ultimate choice of 3D API again by pushing the exposed hardware capabilities over the line set by D3D11. Looking back at the list of expected extensions I presented in my earlier article, <a title="Suggestions for OpenGL 4.2 and beyond" href="http://rastergrid.com/blog/2010/11/suggestions-for-opengl-4-2-and-beyond/" target="_blank">Suggestions for OpenGL 4.2 and beyond</a>, we can see that OpenGL 4.2 fulfilled all my expectations and even my wish list was partly fulfilled. Here&#8217;s the list for a better overview:</p>
<p><strong>My expectations for OpenGL 4.2:</strong></p>
<pre style="background-color: #ccffcc;"><strong>GL_EXT_shader_image_load_store</strong>
<span>- added in the form of GL_ARB_shader_image_load_store</span></pre>
<pre style="background-color: #ccffcc;"><strong>GL_ARB_shader_atomic_counters</strong>
<span>- added as is</span></pre>
<pre style="background-color: #ccffcc;"><strong>GL_ARB_instanced_arrays2</strong>
<span>- added in the form of GL_ARB_base_instance</span></pre>
<pre style="background-color: #ccffcc;"><strong>GL_ARB_explicit_sampler_location</strong>
<span>- added in the form of GL_ARB_shading_language_420pack</span></pre>
<pre style="background-color: #ccffcc;"><strong>GL_ARB_explicit_uniform_block_index</strong>
<span>- added in the form of GL_ARB_shading_language_420pack</span></pre>
<p><strong>My personal wish-list for OpenGL 4.2:</strong></p>
<pre style="background-color: #ffcccc;"><strong>GL_ARB_draw_indirect2</strong>
<span>- still missing, though partly available through <a title="GL_AMD_multi_draw_indirect" href="http://www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt?referer=');">GL_AMD_multi_draw_indirect</a></span></pre>
<pre style="background-color: #ffcccc;"><strong>GL_ARB_direct_state_access</strong>
<span>- still missing, however, there is hope that it will be included in the next release where the ARB plans to rewrite the whole structure of the core specification</span></pre>
<pre style="background-color: #ccffcc;"><strong>GL_NV_texture_barrier</strong>
<span>- not in core but it is implicitly subsumed by GL_ARB_shader_image_load_store, they say</span></pre>
<pre style="background-color: #ccffcc;"><strong>GL_AMD_conservative_depth</strong>
<span>- added in the form of GL_ARB_conservative_depth, despite lack of NVIDIA support</span></pre>
<pre style="background-color: #ffcccc;"><strong>GL_ARB_texture_gather_lod</strong>
<span>- still missing, because of lack of supporting hardware</span></pre>
<pre style="background-color: #ffcccc;"><strong>GL_NV_copy_image</strong>
<span>- still missing, even though it could be a good API improvement</span></pre>
<pre style="background-color: #ffcccc;"><strong>GL_EXT_texture_filter_anisotropic</strong>
<span>- still missing, as I was informed, because of patent issues</span></pre>
<pre style="background-color: #ffcccc;"><strong>GL_ARB_shader_stencil_export</strong>
<span>- still missing, most probably because of lack of NVIDIA hardware support</span></pre>
<pre style="background-color: #ffcccc;"><strong>GL_AMD_depth_clamp_separate</strong>
<span>- still missing, most probably because of lack of NVIDIA hardware support</span></pre>
<pre style="background-color: #ffcccc;"><strong>GL_AMD_transform_feedback3_lines_triangles</strong>
<span>- still missing, most probably because of lack of NVIDIA hardware support</span></pre>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2011/08/an-introduction-to-opengl-4-2/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Multi-Draw-Indirect is here</title>
		<link>http://rastergrid.com/blog/2011/06/multi-draw-indirect-is-here/</link>
		<comments>http://rastergrid.com/blog/2011/06/multi-draw-indirect-is-here/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 15:04:12 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[atomic counter]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[indirect draw]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[synchronization]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=578</guid>
		<description><![CDATA[You might remember that I wrote an article about my suggestions for OpenGL 4.2 and beyond. One of the features that I recommended to be added to OpenGL was a yet non-existent extension called GL_ARB_draw_indirect2 which suggested the addition of new draw commands that are similar in fashion to the ancient MultiDraw* commands but they]]></description>
			<content:encoded><![CDATA[
<p>You might remember that I wrote an article about my <a href="http://rastergrid.com/blog/2010/11/suggestions-for-opengl-4-2-and-beyond/">suggestions for OpenGL 4.2 and beyond</a>. One of the features that I recommended to be added to OpenGL was a yet non-existent extension called GL_ARB_draw_indirect2 which suggested the addition of new draw commands that are similar in fashion to the ancient MultiDraw* commands but they are meant to build on top of the indirect drawing mechanism introduced by the <a title="GL_ARB_draw_indirect" href="http://www.opengl.org/registry/specs/ARB/draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_indirect.txt?referer=');">GL_ARB_draw_indirect</a> extension and OpenGL 4.0. I contacted both AMD and NVIDIA with my idea with different levels of success, but AMD saw the potential in the functionality and they actually implemented it in the form of <a title="GL_AMD_multi_draw_indirect" href="http://www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt?referer=');">GL_AMD_multi_draw_indirect</a>, well at least partially&#8230;</p>
<p><span id="more-578"></span></p>
<h2>The proposition</h2>
<p>First of all, let&#8217;s recap what exactly <a title="GL_ARB_draw_indirect" href="http://www.opengl.org/registry/specs/ARB/draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_indirect.txt?referer=');">GL_ARB_draw_indirect</a> brought us:</p>
<blockquote><p>This extension provides a mechanism for supplying the arguments to a DrawArraysInstanced or DrawElementsInstancedBaseVertex from buffer object memory. This is not particularly useful for applications where the CPU knows the values of the arguments beforehand, but is helpful when the values will be generated on the GPU through any mechanism that can write to a buffer object including image stores, atomic counters, or compute interop. This allows the GPU to consume these arguments without a round-trip to the CPU or the expensive synchronization that would involve. This is similar to the DrawTransformFeedbackEXT command from EXT_transform_feedback2, but offers much more flexibility in both generating the arguments and in the type of Draws that can be accomplished.</p></blockquote>
<p>If you know my <a href="http://rastergrid.com/blog/downloads/nature-demo/">Nature</a> or <a href="http://rastergrid.com/blog/downloads/mountains-demo/">Mountains</a> demo you know that I have dug deeply into the domain of GPU based culling algorithms. In case of these algorithms, the GPU consumes the scene data and performs visibility determination over a list of objects and writes out the culled data into a buffer object. The problem is that those algorithms that I&#8217;ve implemented in the aforementioned demo applications work only for instanced objects. In order to make it possible for the algorithms to be able to efficiently work with arbitrary object sets we still need a lot of new features (some of them may even require newer GPU generations). The most important ones are discussed in detail in the following sections.</p>
<h4>Atomic counters</h4>
<p>This feature enables us to use the global atomic counters present on the GPU, which have, at least on the AMD implementation, dedicated hardware to provide efficient chip-wide access to these counters from any shader. This can be expected in the near future in the form of the not yet published GL_ARB_shader_atomic_counters extension. The extension also provides a way to back the atomic counter values with buffer object memory.</p>
<p>The currently available GPU based culling algorithms, including those presented in my demos, work around the lack of this feature by using transform feedback to capture the culled data, which has implicit atomic counters associated with each output stream. However, this has a few drawbacks. First of all, transform feedback is not as efficient as using atomic counters together with the random memory read/write mechanism exposed by the <a title="GL_EXT_shader_image_load_store" href="http://www.opengl.org/registry/specs/EXT/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/shader_image_load_store.txt?referer=');">GL_EXT_shader_image_load_store</a> extension. This is because, by their nature, geometry shaders and thus transform feedback have to preserve the original order of the incoming primitives. This is why the first GPU generation with geometry shader support had so many performance problems, as the use of geometry shaders easily became the bottleneck of the rendering. Besides the performance benefits of having our own atomic counters, there are a lot of other reasons to want them, like the ability to implement an append/consume buffer, if I&#8217;m allowed to use the D3D terminology.</p>
<p>It may seem that I went a bit off-topic; however, just think about how nicely atomic counters can interact with indirect drawing. There is the instance count field of the indirect draw commands: what if we bind that address as the back-up buffer memory for the atomic counter? Yes, we can save that costly asynchronous query to get the number of visible objects that we otherwise needed when applying an ICR or Hi-Z map based occlusion culling. You may say that you can achieve the same thing with atomic read/writes as provided by <a title="GL_EXT_shader_image_load_store" href="http://www.opengl.org/registry/specs/EXT/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/shader_image_load_store.txt?referer=');">GL_EXT_shader_image_load_store</a>. Well, that&#8217;s true, as long as the additional performance hit of doing atomic memory writes is acceptable (atomic counters are much, much faster; however, it is true that in case of a GPU based culling algorithm those few writes shouldn&#8217;t be the bottleneck). But now let us think more deeply about the problem. If we can use atomic operations to count the instances, with the count stored in the indirect draw command in the buffer object, then what if we also count the number of draw commands written into the indirect draw buffer using atomic counters? And here we are: we have the first building block of a GPU based culling algorithm that can handle arbitrary data sets.</p>
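<p>Here is a CPU-side model of that idea. It assumes the DrawArraysIndirect command layout from GL_ARB_draw_indirect (four 32-bit unsigned integers: count, primCount, first, and a reserved field that must be zero) and pretends that a counter lives at the byte offset of primCount. The helper names are hypothetical and Python&#8217;s struct module stands in for buffer object memory.</p>

```python
import struct

COMMAND_FORMAT = "=4I"  # count, primCount, first, reservedMustBeZero
PRIM_COUNT_OFFSET = 4   # byte offset of primCount inside the command


def increment_counter(buffer, offset):
    """Model of a counter increment on a 32-bit value stored in the buffer."""
    (value,) = struct.unpack_from("=I", buffer, offset)
    struct.pack_into("=I", buffer, offset, value + 1)
    return value


# One command for a 36-vertex mesh; the instance count starts at zero.
indirect_buffer = bytearray(struct.pack(COMMAND_FORMAT, 36, 0, 0, 0))

# "Culling pass": every visible object bumps the instance count in place.
visibility = [True, False, True, True, False]
for visible in visibility:
    if visible:
        increment_counter(indirect_buffer, PRIM_COUNT_OFFSET)

count, prim_count, first, reserved = struct.unpack(COMMAND_FORMAT, indirect_buffer)
print(count, prim_count, first, reserved)
```

<p>After the pass the command is ready to be consumed by an indirect draw, and the CPU never had to read the instance count back.</p>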
<h4>Multi-Draw-Indirect phase 1</h4>
<p>Now let&#8217;s say we somehow managed to generate an indirect draw buffer object with the list of the instanced draw command arguments necessary to render the visible objects, no matter whether we used the OpenGL toolset as in my demos or some compute API like OpenCL. Now we somehow have to initiate the drawing. We can do this by issuing several DrawArraysIndirect or DrawElementsIndirect commands, depending on how many instanced draw command arguments we&#8217;ve generated.</p>
<p>But what if we could do this with a single command? This is where <a title="GL_AMD_multi_draw_indirect" href="http://www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt?referer=');">GL_AMD_multi_draw_indirect</a> comes into the picture, and that&#8217;s what AMD implemented for us: we can do this by using one of the MultiDraw*Indirect commands introduced by the extension.</p>
<p>The best thing about it is that, in the absence of hardware support, the driver can still implement it by simply looping over the appropriate Draw*Indirect commands, so all hardware that supports <a title="GL_ARB_draw_indirect" href="http://www.opengl.org/registry/specs/ARB/draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_indirect.txt?referer=');">GL_ARB_draw_indirect</a> can support <a title="GL_AMD_multi_draw_indirect" href="http://www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt?referer=');">GL_AMD_multi_draw_indirect</a>, and in case the hardware actually supports the functionality, we get a slight performance increase for free.</p>
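<p>Such a fallback loop could look roughly like the following model. It assumes the parameters of the extension&#8217;s MultiDrawArraysIndirect command (a starting offset, primcount and stride, where a stride of zero means tightly packed commands) and the 16-byte DrawArraysIndirect command layout; record_draw is a hypothetical stand-in that merely records what a real Draw*Indirect call would read.</p>

```python
import struct

COMMAND_FORMAT = "=4I"  # count, primCount, first, reservedMustBeZero
COMMAND_SIZE = struct.calcsize(COMMAND_FORMAT)


def multi_draw_arrays_indirect(buffer, offset, primcount, stride, draw_arrays_indirect):
    """Emulate a multi-draw by issuing one indirect draw per command record."""
    if stride == 0:
        stride = COMMAND_SIZE  # zero is taken to mean tightly packed commands
    for i in range(primcount):
        draw_arrays_indirect(buffer, offset + i * stride)


issued = []


def record_draw(buffer, offset):
    """Stand-in for DrawArraysIndirect: decode the command and remember it."""
    issued.append(struct.unpack_from(COMMAND_FORMAT, buffer, offset))


commands = [(36, 10, 0, 0), (24, 2, 36, 0), (6, 1, 60, 0)]
indirect_buffer = b"".join(struct.pack(COMMAND_FORMAT, *c) for c in commands)
multi_draw_arrays_indirect(indirect_buffer, 0, len(commands), 0, record_draw)
print(issued)
```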
<h4>Multi-Draw-Indirect phase 2</h4>
<p>While the new extension adds quite some flexibility to the existing indirect drawing mechanism, it still lacks an important feature to become the Holy Grail of GPU based culling and scene management algorithms. We still have to perform an asynchronous query or otherwise determine the number of records written into the indirect draw buffer.</p>
<p>Of course, we can alleviate the problem by always initializing the indirect draw buffer with zero values (so that issuing an indirect draw command using any unwritten record in the buffer results in no actual rendering) and then simply using a MultiDraw*Indirect command with a primcount argument equal to the theoretical maximum number of generated records. However, this might result in a performance decrease, especially if this theoretical maximum is much larger than the number of actual draw commands present in the buffer.</p>
<p>In order to circumvent this problem, we need some mechanism that allows us to also source the primcount argument of the MultiDraw*Indirect commands from buffer object memory. While such functionality is not exposed yet by any of the major graphics APIs (and may not be supported by current hardware) this could be the next major step towards a fully self-feeding renderer that handles graphics related data on a much higher level beyond triangles and pixels.</p>
<h2>Conclusion</h2>
<p>While the indirect drawing mechanism introduced with OpenGL 4.0 is just a small part of the feature set introduced by Shader Model 5.0 GPUs, it still has a lot of room for improvement and evolution ahead. AMD made the first step with <a title="GL_AMD_multi_draw_indirect" href="http://www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt?referer=');">GL_AMD_multi_draw_indirect</a> and I really hope that indirect drawing and other GPU self-feed mechanisms will gain more developer attention in the near future.</p>
<p>Finally, I would like to thank Graham Sellers, the creator of the extension, Pierre Bourdier for his support in promoting the new functionality, and all the engineers at AMD who have contributed to the specification and the implementation work behind it. I&#8217;m really glad to see that they listen to developers when deciding in which direction to improve their OpenGL support.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2011/06/multi-draw-indirect-is-here/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Frei-Chen edge detector</title>
		<link>http://rastergrid.com/blog/2011/01/frei-chen-edge-detector/</link>
		<comments>http://rastergrid.com/blog/2011/01/frei-chen-edge-detector/#comments</comments>
		<pubDate>Sun, 30 Jan 2011 15:27:43 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[detection]]></category>
		<category><![CDATA[edge]]></category>
		<category><![CDATA[filter]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenGL]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=532</guid>
		<description><![CDATA[In this article, I would like to present you an edge detection algorithm that shares similar performance characteristics like the well-known Sobel operator but provides slightly better edge detection and can be seamlessly extended with little to no performance overhead to also detect corners alongside with edges. The algorithm works on a 3&#215;3 texel footprint]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 160px"><img title="Frei-Chen edge detector" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/frei-chen.png" alt="Frei-Chen edge detector" width="150" height="150" /><p class="wp-caption-text">Frei-Chen edge detector</p></div>
<p>In this article, I would like to present an edge detection algorithm that has performance characteristics similar to the well-known Sobel operator but provides slightly better edge detection and can be seamlessly extended, with little to no performance overhead, to also detect corners alongside edges. Like the Sobel filter, the algorithm works on a 3&#215;3 texel footprint, but it applies a total of nine convolution masks over the image that can be used for either edge or corner detection. The article presents the mathematical background that is needed to implement the edge detector and provides a reference implementation written in C/C++ using OpenGL that showcases both the Frei-Chen and the Sobel edge detection filter applied to the same image.</p>
<p><span id="more-532"></span>I came across the algorithm during my computer graphics studies when one of my homework assignments was to implement the Frei-Chen edge detector. As I already mentioned in an earlier post, I intend to provide source code for more basic graphics algorithms after seeing the success of <a title="Efficient Gaussian blur with linear sampling" href="http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/">my former post</a> about the Gaussian blur filter. This one is a similarly basic article, considering that it only shows how to apply a particular convolution filter based algorithm to a still image, while the possibilities this edge detection algorithm opens up are a more complex topic that is out of the scope of this article.</p>
<p>As the provided reference implementation also showcases applying the Sobel operator to an image, I would like to present that first and then continue with the presentation of the Frei-Chen mask set. Those who are already familiar with edge detection and the Sobel operator can skip the following two sections.</p>
<h2>Edge detection</h2>
<p>Before getting deep into how to implement edge detectors, let&#8217;s first talk about what an edge detector is and why we need it.</p>
<p>In general, edge detection is one of the most fundamental image processing tools, particularly used in the areas of feature detection and feature extraction. The aim of the technique is to identify points of a digital image at which the intensity changes sharply. These intensity changes can be caused by discontinuities in depth or surface orientation, by changes in lighting conditions and by many other factors. In the ideal case, applying an edge detector to an image gives us a set of connected lines or curves that indicate the boundaries of objects.</p>
<p>Not going that far, what an edge detector gives us in the first place is a gray-scale image where each pixel intensity approximates the likelihood that the pixel belongs to an object boundary. How well a particular algorithm can detect such pixels depends on many factors, and usually it is better to try multiple edge detectors in order to choose the one that best fits the particular use case.</p>
<p>After we have this gray-scale image, we usually have to define a threshold value that will be used as the acceptance criterion for edge pixels. If the previously calculated intensity value is above this threshold then we accept the pixel as an edge, otherwise we don&#8217;t. This part is the so-called binarization stage. Additionally, subsequent image processing algorithms can be used to further interpret the edge image.</p>
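<p>The binarization stage is simple enough to sketch in a few lines of C++. The function name <code>binarize</code> and the flat array layout are my own choices for illustration, not something taken from the reference implementation.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Turns per-pixel edge strengths into a 0/1 edge mask: a pixel is accepted
// as an edge only if its strength is strictly above the threshold.
std::vector<std::uint8_t> binarize(const std::vector<float>& edgeStrength,
                                   float threshold) {
    assert(threshold >= 0.0f); // edge strengths are non-negative
    std::vector<std::uint8_t> mask(edgeStrength.size());
    for (std::size_t i = 0; i < edgeStrength.size(); ++i) {
        mask[i] = edgeStrength[i] > threshold ? 1 : 0;
    }
    return mask;
}
```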
<p>In computer graphics, edge detection is usually used to implement various image decoration algorithms. Maybe the most popular applications of edge detectors nowadays are non-photorealistic rendering (NPR) and screen-space anti-aliasing techniques.</p>
<h2>Sobel filter</h2>
<p>The Sobel edge detection filter works on a 3&#215;3 texel footprint and applies two convolution masks to the image that are intended to detect the horizontal and vertical gradients of the image. The filter weights can be seen in the figure below:</p>
<p style="text-align: center;"><img class="   aligncenter" title="Sobel masks" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/sobel-masks.png" alt="Sobel masks" width="457" height="119" /></p>
<p>These masks are applied to the intensities gathered from the 3&#215;3 footprint of the image and then are accumulated to produce the final gradient value in the following way:</p>
<p style="text-align: center;"><img class="aligncenter" title="Sobel gradient" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/sobel-grad.png" alt="Sobel gradient" width="321" height="84" /></p>
<p>The actual algorithm can be seen in the accompanying demo that provides a GLSL based implementation. The algorithm is defined to work on a single-channel image, however, it can easily be extended to a usual three-channel RGB image either by applying it separately to each channel or by first calculating a gray-scale value from the color component values. The former is more computationally intensive but usually provides better results when the threshold criterion is defined so that a pixel is accepted as a boundary point if the gradient value is larger than the threshold for any of the color channels. The reference implementation, however, is based on the latter approach for the sake of simplicity, so for each pixel an intensity value is first calculated simply by taking the length of the vector comprised of the RGB components.</p>
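<p>For readers who prefer code over figures, here is a small CPU-side C++ sketch of the same computation for a single 3&#215;3 footprint. It assumes the standard textbook Sobel weights and the usual square-root-of-squares combination of the two responses, so the signs and orientation may differ from the figures above and from the GLSL shader in the demo. The names <code>Rgb</code>, <code>intensity</code> and <code>sobelMagnitude</code> are hypothetical.</p>

```cpp
#include <cassert>
#include <cmath>

struct Rgb { float r, g, b; };

// Gray-scale intensity taken as the length of the RGB vector.
float intensity(const Rgb& c) {
    return std::sqrt(c.r * c.r + c.g * c.g + c.b * c.b);
}

// Gradient magnitude of a 3x3 intensity footprint, indexed p[row][column].
float sobelMagnitude(const float p[3][3]) {
    assert(p != nullptr);
    // Standard Sobel weights (assumed): gx responds to horizontal
    // intensity changes, gy to vertical ones.
    static const float gx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const float gy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    float sx = 0.0f;
    float sy = 0.0f;
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            sx += gx[r][c] * p[r][c];
            sy += gy[r][c] * p[r][c];
        }
    }
    return std::sqrt(sx * sx + sy * sy);
}
```

<p>A uniform footprint yields zero, while a footprint with a sharp vertical boundary yields a large value that can then be compared against the threshold.</p>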
<h2>Frei-Chen filter</h2>
<p>The Frei-Chen edge detector also works on a 3&#215;3 texel footprint but applies a total of nine convolution masks to the image. The nine Frei-Chen masks are special in that together they form a complete set of basis vectors for the 3&#215;3 neighborhood. This implies that any 3&#215;3 image area can be represented as a weighted sum of the nine Frei-Chen masks that can be seen below:</p>
<p style="text-align: center;"><img class="aligncenter" title="Frei-Chen masks" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/frei-chen-masks.png" alt="Frei-Chen masks" width="650" height="237" /></p>
<p>The first four Frei-Chen masks above are used for edges, the next four are used for lines and the last mask is used to compute averages. For edge detection, the appropriate masks are chosen and the image is projected onto them. The projection equation is given below:</p>
<p style="text-align: center;"><img class="aligncenter" title="Frei-Chen equation" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/frei-chen-eq.png" alt="Frei-Chen equation" width="631" height="108" /></p>
<p>When we are using the Frei-Chen masks for edge detection we are searching for the cosine defined above and we use the first four masks as the elements of importance so the first sum above goes from one to four.</p>
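<p>As a rough CPU-side illustration, the projection can be sketched in C++ as follows. The sketch assumes the commonly cited normalized form of the nine masks and computes the square root of M/S, where M is the sum of the squared responses of the first four masks and S is the sum over all nine. The mask ordering and the function name <code>freiChenEdge</code> are assumptions on my part, so check them against the figures above before relying on the code.</p>

```cpp
#include <cassert>
#include <cmath>

// Edge measure sqrt(M/S) for a 3x3 intensity footprint, indexed p[row][column].
float freiChenEdge(const float p[3][3]) {
    assert(p != nullptr);
    const float s2 = std::sqrt(2.0f);
    const float a = 1.0f / (2.0f * s2);
    // Normalization factors that make every mask unit length.
    const float scale[9] = {a, a, a, a, 0.5f, 0.5f,
                            1.0f / 6.0f, 1.0f / 6.0f, 1.0f / 3.0f};
    const float mask[9][3][3] = {
        {{  1,  s2,   1}, {  0, 0,   0}, { -1, -s2,  -1}}, // edge masks
        {{  1,   0,  -1}, { s2, 0, -s2}, {  1,   0,  -1}},
        {{  0,  -1,  s2}, {  1, 0,  -1}, {-s2,   1,   0}},
        {{ s2,  -1,   0}, { -1, 0,   1}, {  0,   1, -s2}},
        {{  0,   1,   0}, { -1, 0,  -1}, {  0,   1,   0}}, // line masks
        {{ -1,   0,   1}, {  0, 0,   0}, {  1,   0,  -1}},
        {{  1,  -2,   1}, { -2, 4,  -2}, {  1,  -2,   1}},
        {{ -2,   1,  -2}, {  1, 4,   1}, { -2,   1,  -2}},
        {{  1,   1,   1}, {  1, 1,   1}, {  1,   1,   1}}, // average mask
    };
    float edgeEnergy = 0.0f;  // M: squared responses of masks 1..4
    float totalEnergy = 0.0f; // S: squared responses of masks 1..9
    for (int k = 0; k < 9; ++k) {
        float dot = 0.0f;
        for (int r = 0; r < 3; ++r) {
            for (int c = 0; c < 3; ++c) {
                dot += scale[k] * mask[k][r][c] * p[r][c];
            }
        }
        if (k < 4) edgeEnergy += dot * dot;
        totalEnergy += dot * dot;
    }
    // An all-zero footprint carries no information; report no edge.
    return totalEnergy > 0.0f ? std::sqrt(edgeEnergy / totalEnergy) : 0.0f;
}
```

<p>Because the result is a ratio, it is independent of the overall brightness of the footprint, unlike the absolute gradient computed by the Sobel filter.</p>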
<p>Applying a threshold and applying the filter to multi-channel images work exactly the same way as in case of the Sobel filter. Similarly, the reference implementation applies the filter to the image as if it were a single-channel image by first calculating the intensity value for each texel in the same fashion as with the previously presented filter.</p>
<h2>Comparison</h2>
<p>Based on my experience, the Frei-Chen edge detector looks better than the Sobel filter as it is less sensitive to noise and is able to detect edges that have small gradients and thus are not found by the basic Sobel filter. For a comparison, you can check the figure below:</p>
<div class="wp-caption aligncenter" style="width: 610px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/ed-comparison.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2011/01/ed-comparison.png?referer=');"><img title="Comparison of edge detectors" src="http://www.rastergrid.com/blog/wp-content/uploads/2011/01/ed-comparison-thumb.png" alt="Comparison of edge detectors" width="600" height="200" /></a><p class="wp-caption-text">Comparison of edge detectors: original image (left), Sobel filter (middle), Frei-Chen filter (right).</p></div>
<p>The Frei-Chen edge detector seems to work better because its construction includes a normalization factor as well as other factors that are meant to exclude all features except edges. A normalization factor can also be added to the Sobel filter by having a third mask that is equivalent to the ninth Frei-Chen mask and is used to normalize the gradients. This could help in reducing the number of undetected edges and the amount of noise that arises from the fact that the Sobel filter calculates absolute gradients rather than relative ones.</p>
<p>From a performance point of view, the Frei-Chen edge detector is more heavyweight as it uses nine masks instead of two. In practice, however, the performance difference between the two is much smaller than that suggests, considering that both use the same sized texel footprint and the computational performance of today&#8217;s GPUs is usually much higher than their texture fetching performance.</p>
<h2>Conclusion</h2>
<p>We have presented an alternative to the Sobel filter in the form of the Frei-Chen edge detector which, while having little impact on performance compared to the Sobel operator, provides better edge detection quality. As there is little to no difference in how the input data has to be organized and how the result is output, the Frei-Chen edge detector can easily be used as a drop-in replacement in implementations that previously used the Sobel filter.</p>
<p><strong>Source code</strong> and <strong>Win32 binary</strong> can be acquired in the <a title="Frei-Chen Edge Detector" href="http://rastergrid.com/blog/downloads/frei-chen-edge-detector/">downloads section</a>.</p>
<p>I would like to encourage those who read this article to add the Frei-Chen edge detector to their software in order to compare whether it yields better results than the Sobel filter for applications that rely on the output of the edge detection filter. I would be interested to hear how the filter works in real-life computer graphics scenarios.</p>
<p>Thanks in advance and hope you enjoyed the article!</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2011/01/frei-chen-edge-detector/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Suggestions for OpenGL 4.2 and beyond</title>
		<link>http://rastergrid.com/blog/2010/11/suggestions-for-opengl-4-2-and-beyond/</link>
		<comments>http://rastergrid.com/blog/2010/11/suggestions-for-opengl-4-2-and-beyond/#comments</comments>
		<pubDate>Sun, 14 Nov 2010 17:15:23 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[callback]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[texture buffer]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[uniform buffer]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=504</guid>
		<description><![CDATA[The Khronos Group did a great job in the last few years to once again prove that OpenGL is still in game and that it can become the ultimate graphics API of choice, if it is not that already. However, we must note that it is not quite yet true that OpenGL 4.1 is a]]></description>
			<content:encoded><![CDATA[
<p>The Khronos Group did a great job in the last few years to once again prove that OpenGL is still in the game and that it can become the ultimate graphics API of choice, if it is not that already. However, we must note that it is not quite true yet that OpenGL 4.1 is a superset of its competitor, DirectX 11. There are some holes that still have to be filled, and I think the ARB should not stop there, as there is much more potential in the current hardware architectures than what is currently exposed by any graphics API, so establishing the future of OpenGL should start by going one step further than DX11. In this article I would like to present my vision of the important items that should be included in the next revision of the specification and how I see the future of OpenGL.</p>
<p><span id="more-504"></span>Since the original OpenGL Longs Peak announcement, graphics developers were really excited to get their hands on the completely revised OpenGL 3 specification. However, due to severe backward compatibility and portability issues the original plan fell through, and developers expressed great disappointment about the ARB&#8217;s decision to choose a more evolutionary move away from the legacy API instead of the radical rewrite. Still, the Khronos Group has proved that the decision was not necessarily bad for OpenGL, and in fact we now have a pretty powerful API, even though the coexistence of the legacy and the new design greatly increased the complexity of the specification.</p>
<p>What we have now is an API that can really compete with DirectX 11, but I strongly believe that this is not the end of the story yet as we still have a lot of things to do ahead of us. I mean this both from the point of view of exposing more hardware capabilities and of streamlining the API language itself to increase the productivity of the developers who use it. My plan is to target both of these issues in this article, also trying to focus on hardware functionalities that are not even exposed by other graphics APIs yet.</p>
<h2>Exposing more hardware capabilities</h2>
<p>In this chapter of the article I will talk about some familiar and some not so familiar hardware features and corresponding OpenGL extensions that should be included in the next revision of the specification in order to be able to confidently say that OpenGL is a strict superset of the competing graphics APIs. The extensions listed here are not in any particular priority order, they are just listed in a way that eases the discussion of their functionality.</p>
<h3><a title="GL_EXT_shader_image_load_store" href="http://www.opengl.org/registry/specs/EXT/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/shader_image_load_store.txt?referer=');">GL_EXT_shader_image_load_store</a></h3>
<p>This extension provides GLSL built-in functions allowing shaders to load from, store to, and perform atomic read-modify-write operations to a single level of a texture from any shader stage. The extension also indirectly enables the same operations for buffer objects by using texture buffers. This enables developers to implement more sophisticated algorithms using shaders that require more complex data structures than just plain arrays.</p>
<p>An example use case can be the implementation of Order-Independent Transparency (OIT) using fragment linked lists as presented by <a title="OIT And Indirect Illumination Using Dx11 Linked Lists" href="http://www.slideshare.net/hgruen/oit-and-indirect-illumination-using-dx11-linked-lists" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.slideshare.net/hgruen/oit-and-indirect-illumination-using-dx11-linked-lists?referer=');">AMD at GDC10</a>. Of course, there are a lot of other techniques that could benefit from hardware accelerated random access images (called UAV textures/buffers in DX11 terminology) including algorithms related to global illumination, ray tracing, and my personal favorite: scene management.</p>
<p>As the introduction of new write operations to fragment shaders besides the traditional framebuffer writes makes the execution of the shaders sensitive to whether early-Z is used or not by the hardware, the extension also introduces a new fragment shader input layout qualifier called &#8220;early_fragment_tests&#8221; to force OpenGL to use early depth and stencil tests. Without the qualifier, the existing specification language applies, which states that the depth and stencil tests are performed after fragment shader execution.</p>
<p>Finally, the extension enables some form of control over the order of image loads, stores, and atomics relative to other pipeline operations accessing the same memory region both using the OpenGL API and from within shaders.</p>
<p>The API itself provides a DSA-style binding mechanism that enables binding to so-called &#8220;image units&#8221; that are separate from the texture image units. In the same style, the specification language and GLSL refer to the introduced read-write textures with the term &#8220;image&#8221;.</p>
<p>In my opinion this is one of the most important extensions that should be made core with OpenGL 4.2 and I&#8217;m pretty sure this will actually happen.</p>
<h3><a title="GL_NV_texture_barrier" href="http://www.opengl.org/registry/specs/NV/texture_barrier.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/texture_barrier.txt?referer=');">GL_NV_texture_barrier</a></h3>
<p>This extension relaxes the restrictions of OpenGL on rendering to a currently bound texture and provides a mechanism to avoid read-after-write problems. More precisely, the extension allows rendering to a currently bound texture in the following cases:</p>
<ul>
<li>If the reads and writes are from/to disjoint sets of texels (after accounting for texture filtering rules) so it should work unless the drawn areas overlap, or</li>
<li>If there is only a single read and write of each texel, and the read is in the fragment shader invocation that writes the same texel (e.g. using texelFetch2D).</li>
</ul>
<p>Some of these situations were already supported implicitly like rendering to a texture level and fetching from another texture level. But the extension goes further and provides an API function to put an explicit barrier between draw calls to ensure proper rendering.</p>
<p>The extension can be used to accomplish a limited form of programmable blending and can eliminate the need of any image or buffer data copy in case we can live with the restrictions mentioned above.</p>
<p>One may ask why we need this extension if we have the <a title="GL_EXT_shader_image_load_store" href="http://www.opengl.org/registry/specs/EXT/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/shader_image_load_store.txt?referer=');">GL_EXT_shader_image_load_store</a> extension, as this one is just a subset of the functionality provided by that. The answer is simple: performance. While read-write textures can mimic the same functionality, they usually use different hardware paths that are slower than regular read-only texture accesses. So it would be a definite benefit to also have this extension in core OpenGL.</p>
<h3>GL_ARB_shader_atomic_counters</h3>
<p>This extension does not have a public specification yet, however it can be found in the extension lists of the latest Catalyst driver releases, sometimes with the EXT and sometimes with the ARB prefix. The extension itself provides an API to access a number of hardware atomic counters that provide efficient counter operations on a GPU global scale.</p>
<p>Atomic counters come in handy when one has to read or write individual elements of a buffer or texture. As an example, this extension is needed to efficiently implement the OIT algorithm mentioned earlier because, when constructing the fragment linked list, we need unique offsets into the linked list buffer. These unique offsets can, of course, be acquired by using atomic read-modify-write operations, but those perform much slower than hardware atomic counters.</p>
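<p>The role of the counter in the linked list construction can be sketched on the CPU with a few lines of C++. Here <code>std::atomic</code> stands in for the hardware atomic counter, and all type and member names (<code>FragmentLists</code>, <code>append</code>, <code>kEndOfList</code>) are hypothetical. The sketch is single-threaded. In a real fragment shader the update of the per-pixel head pointer would also have to be an atomic exchange.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const std::uint32_t kEndOfList = 0xFFFFFFFFu;

struct FragmentNode {
    float depth;
    std::uint32_t color; // packed RGBA
    std::uint32_t next;  // index of the next node or kEndOfList
};

struct FragmentLists {
    std::vector<std::uint32_t> head;        // per-pixel list head
    std::vector<FragmentNode> nodes;        // shared node storage
    std::atomic<std::uint32_t> counter{0u}; // stand-in for the atomic counter

    FragmentLists(std::size_t pixelCount, std::size_t maxNodes)
        : head(pixelCount, kEndOfList), nodes(maxNodes) {}

    // Links a new fragment in front of the pixel's list. Returns false
    // when the node storage is exhausted.
    bool append(std::size_t pixel, float depth, std::uint32_t color) {
        assert(pixel < head.size());
        // The counter hands out a unique offset into the node storage.
        const std::uint32_t slot = counter.fetch_add(1u);
        if (slot >= nodes.size()) return false;
        nodes[slot] = FragmentNode{depth, color, head[pixel]};
        head[pixel] = slot; // would be an atomic exchange on the GPU
        return true;
    }
};
```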
<p>Besides the mentioned example, atomic counters are useful in many algorithms from many domains, one important use case is to perform feedback operations similar to that provided by transform feedback. Such feedback operations can be used to perform various scene management or culling mechanisms.</p>
<p>The extension provides access to these atomic counters from GLSL and also makes it possible to back them up with buffer objects so after OpenGL draw calls the value of the counters is conserved in these buffers for subsequent use.</p>
<h3><a title="GL_AMD_conservative_depth" href="http://www.opengl.org/registry/specs/AMD/conservative_depth.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/conservative_depth.txt?referer=');">GL_AMD_conservative_depth</a></h3>
<p>Early depth test is a common optimization for hardware accelerated graphics that can skip the evaluation of fragment shaders for fragments that end up being discarded because they don&#8217;t pass the depth test. The problem is that in case the fragment shader modifies the depth value of the fragment then the early depth test is disabled. One can force early depth test with the functionality introduced by the extension <a title="GL_EXT_shader_image_load_store" href="http://www.opengl.org/registry/specs/EXT/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/shader_image_load_store.txt?referer=');">GL_EXT_shader_image_load_store</a> but that can lead to some rendering artifacts as the modified depth value output by the fragment shader is not taken into account.</p>
<p>This extension allows the application to pass enough information to the GL implementation to activate some early depth test optimizations safely while still preserving the ability to take the final depth value into account in the depth test. In order to achieve this, the extension introduces four new fragment shader input layout qualifiers called &#8220;depth_unchanged&#8221;, &#8220;depth_any&#8221;, &#8220;depth_greater&#8221; and &#8220;depth_less&#8221;. The most interesting ones are the last two, which provide the ability to do early-Z and hierarchical-Z tests from one direction to discard some groups of fragments and still allow the fragment shader to safely modify the depth value.</p>
<p>This technique comes in very handy when rendering volumetric particles, decals or billboards. Without this extension one has to sacrifice the possibility of early fragment rejection in order to be able to create the volumetric primitives mentioned.</p>
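<p>The reasoning behind the one-directional test can be written down as a small C++ predicate. This is only a model of the decision, assuming that depth_greater means the shader never decreases the interpolated depth and depth_less means it never increases it. The enum and function names are made up for the example, and only the two strict depth comparison functions are covered.</p>

```cpp
#include <cassert>

enum class DepthLayout { Any, Greater, Less, Unchanged };
enum class DepthFunc { Less, Greater };

// Returns true if the fragment can be rejected before the fragment shader
// runs, given only its interpolated depth and the stored depth value.
bool canRejectEarly(DepthLayout layout, DepthFunc func,
                    float interpolatedDepth, float storedDepth) {
    assert(interpolatedDepth >= 0.0f && interpolatedDepth <= 1.0f);
    switch (layout) {
    case DepthLayout::Unchanged:
        // Final depth equals the interpolated one: a plain early test.
        return func == DepthFunc::Less ? !(interpolatedDepth < storedDepth)
                                       : !(interpolatedDepth > storedDepth);
    case DepthLayout::Greater:
        // Final depth can only grow, so a fragment that already fails a
        // LESS test can never pass it later.
        return func == DepthFunc::Less && interpolatedDepth >= storedDepth;
    case DepthLayout::Less:
        // Final depth can only shrink, the mirrored case for GREATER.
        return func == DepthFunc::Greater && interpolatedDepth <= storedDepth;
    case DepthLayout::Any:
        return false; // nothing is known about the final depth
    }
    return false;
}
```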
<p>As far as I know this feature is also present in DirectX 11 so it should be a must for OpenGL 4.x also. As the extension is an AMD one, I don&#8217;t know whether NVIDIA GPUs do support anything like this in hardware but even if not, they can simply ignore the new layout qualifiers and do late depth test instead. Of course, it would result in lower performance but if only functionality is concerned it should be just okay.</p>
<h3>GL_ARB_instanced_arrays2</h3>
<p>OpenGL provides two means to perform geometry instancing via the extensions <a title="GL_ARB_draw_instanced" href="http://www.opengl.org/registry/specs/ARB/draw_instanced.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_instanced.txt?referer=');">GL_ARB_draw_instanced</a> and <a title="GL_ARB_instanced_arrays" href="http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/instanced_arrays.txt?referer=');">GL_ARB_instanced_arrays</a>. While this (not yet existing) extension would extend both, it is more relevant to the latter extension so I named it accordingly.</p>
<p>The extension should trivially add the possibility to specify a &#8220;first instance&#8221; parameter for the instanced draw commands. Whether this is accomplished by introducing new variants of the glDrawElement* and glDrawArrays* draw commands or by having a separate command for specifying the new parameter is up to the ARB. The extension should also interact with <a title="GL_ARB_draw_indirect" href="http://www.opengl.org/registry/specs/ARB/draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_indirect.txt?referer=');">GL_ARB_draw_indirect</a>, which already mentions the lack of the parameter in GL and has already reserved a field in the indirect draw command structure for specifying the &#8220;first instance&#8221; parameter.</p>
<p>This extension would be more of a bug fix than a completely new feature, as this functionality should have been exposed already when instancing was first introduced to OpenGL.</p>
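<p>One plausible definition of the new parameter, assumed here purely for illustration, is that it offsets the element fetched from an instanced vertex attribute array. In C++ this boils down to a one-liner. The function name is hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>

// Index of the element fetched from an instanced attribute array for a
// given instance, with the assumed "first instance" offset applied.
std::uint32_t instancedAttribIndex(std::uint32_t instance,
                                   std::uint32_t divisor,
                                   std::uint32_t firstInstance) {
    assert(divisor > 0u); // a divisor of zero means a per-vertex attribute
    return firstInstance + instance / divisor;
}
```

<p>With such a parameter, per-instance data of many different objects could live in a single buffer and each draw command would simply start reading at its own offset.</p>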
<h3>GL_ARB_draw_indirect2</h3>
<p>This is one of the extensions I would be happiest to see in the next release of the OpenGL specification. It would be a functional addition to the <a title="GL_ARB_draw_indirect" href="http://www.opengl.org/registry/specs/ARB/draw_indirect.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/draw_indirect.txt?referer=');">GL_ARB_draw_indirect</a> extension that currently only allows the execution of a single instanced draw command that sources its parameters from a buffer object.</p>
<p>The new extension would add a new buffer binding point called e.g. GL_DRAW_INDIRECT_PRIMITIVE_COUNT that would specify the source of the &#8220;primcount&#8221; parameter to the following newly introduced draw commands:</p>
<pre>    void <strong>MultiDrawArraysIndirect</strong>( enum <em>mode</em>, sizei stride,
                                  const void *<em>indirect</em>,
                                  const void *<em>primcount</em> );
    void <strong>MultiDrawElementsIndirect</strong>( enum <em>mode</em>, enum <em>type</em>, sizei stride,
                                    const void *<em>indirect</em>,
                                    const void *<em>primcount</em> );</pre>
<p>This would not just allow for executing multiple indirect draw commands at once, without further CPU action, but would also source the &#8220;primcount&#8221; parameter from a buffer object. Thus, if the draw commands are generated using transform feedback, read-write buffers or OpenCL (e.g. based on some GPU based scene management algorithm), the application does not have to use asynchronous queries or other means that may introduce sync points in the rendering in order to feed the &#8220;primcount&#8221; parameter.</p>
<p>Some people said that this is quite a futuristic feature to expect and that most probably such functionality will be available only on a newer generation of GPUs and maybe with OpenGL 5. I was not that pessimistic so I decided to raise the question with the relevant ARB members at NVIDIA and AMD. While I did not receive any answer from NVIDIA, I did receive some good news from AMD as they said that this functionality can be implemented for Shader Model 5.0 level hardware.</p>
<p>What this extension would give developers is a way to efficiently implement GPU based scene management where the GPU bakes together all the rendering commands for the current frame using atomic counters and buffer writes, and the CPU just has to issue a few or maybe just a single MultiDraw*Indirect command to render the whole scene. But of course, the feature can also increase draw command throughput in case of CPU based scene management.</p>
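<p>The intended behavior of such a command can be modeled on the CPU with a short C++ sketch. Everything here is hypothetical: the buffers are plain vectors, the single indirect draw is a callback, and the clamp to the number of available records is a safety measure of the sketch rather than part of the proposal.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Record layout read by DrawArraysIndirect (GL_ARB_draw_indirect).
struct DrawArraysIndirectCommand {
    std::uint32_t count;
    std::uint32_t primCount;
    std::uint32_t first;
    std::uint32_t reservedMustBeZero;
};

// Executes as many records as the value stored in `parameterBuffer` at
// `primcountIndex` says, without the caller ever reading that value.
// Returns the number of draws issued.
std::uint32_t multiDrawFromBuffers(
        const std::vector<DrawArraysIndirectCommand>& commands,
        const std::vector<std::uint32_t>& parameterBuffer,
        std::size_t primcountIndex,
        const std::function<void(const DrawArraysIndirectCommand&)>& drawOne) {
    assert(primcountIndex < parameterBuffer.size());
    std::uint32_t primcount = parameterBuffer[primcountIndex];
    // Safety clamp of this sketch: never read past the command storage.
    if (primcount > commands.size()) {
        primcount = static_cast<std::uint32_t>(commands.size());
    }
    for (std::uint32_t i = 0; i < primcount; ++i) {
        drawOne(commands[i]);
    }
    return primcount;
}
```

<p>In the GPU version the value in the parameter buffer would be the final state of an atomic counter written by the culling pass, so the CPU never needs to know how many objects survived.</p>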
<p>So my message to the Khronos Group is please, start working on such an extension as this would not just make developers happy, but you can also strengthen OpenGL&#8217;s position in the industry by putting something into the specification that even DirectX 11 cannot do.</p>
<h3><a title="GL_AMD_transform_feedback3_lines_triangles" href="http://www.opengl.org/registry/specs/AMD/transform_feedback3_lines_triangles.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/transform_feedback3_lines_triangles.txt?referer=');">GL_AMD_transform_feedback3_lines_triangles</a></h3>
<p>OpenGL 4.0 introduced the extension <a title="GL_ARB_transform_feedback3" href="http://www.opengl.org/registry/specs/ARB/transform_feedback3.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback3.txt?referer=');">GL_ARB_transform_feedback3</a> that further extended the transform feedback capabilities provided by earlier extensions to allow output to separate vertex streams. However, there is one caveat: separate vertex streams are only supported for point primitives.</p>
<p>This new AMD extension does nothing more than remove that restriction for separate output streams, allowing the same set of primitive types to be used with multiple transform feedback streams as with a single stream, as long as the primitive types are the same for all output streams.</p>
<p>Limiting the possible output primitive types for transform feedback into multiple streams should not be a problem unless you also want to rasterize some triangles at the same time you output. Without relaxing this restriction, you can do this only by issuing two separate draw commands, which incurs a performance hit.</p>
<p>I don&#8217;t know if the restriction is present in the ARB extension because NVIDIA does not support this in hardware but if this is not the case then I think this extension should be included in the next release of the specification. Otherwise, please NVIDIA include this feature in your next GPU generation.</p>
<h3><a title="GL_NV_copy_image" href="http://www.opengl.org/registry/specs/NV/copy_image.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/copy_image.txt?referer=');">GL_NV_copy_image</a></h3>
<p>OpenGL 3.1 already introduced a method to provide GPU accelerated copy of buffer data. This NVIDIA extension provides a similar functionality that can be used to execute efficient image data transfer between image objects (i.e. textures and renderbuffers).</p>
<p>While there are already methods to perform image data copies between textures, e.g. using the <a title="GL_EXT_framebuffer_blit" href="http://www.opengl.org/registry/specs/EXT/framebuffer_blit.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/framebuffer_blit.txt?referer=');">GL_EXT_framebuffer_blit</a> extension promoted to core with OpenGL 3.0, these require expensive framebuffer object operations and they also lack direct support for transferring 3D image data.</p>
<p>This extension simply introduces a single command that allows such image data copies for every type of textures (including cube maps, 3D textures and array textures) without the need to bind the image objects or otherwise configure the rendering.</p>
<h3><a title="GL_AMD_depth_clamp_separate" href="http://www.opengl.org/registry/specs/AMD/depth_clamp_separate.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/depth_clamp_separate.txt?referer=');">GL_AMD_depth_clamp_separate</a></h3>
<p>The extension <a title="GL_ARB_depth_clamp" href="http://www.opengl.org/registry/specs/ARB/depth_clamp.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/depth_clamp.txt?referer=');">GL_ARB_depth_clamp</a> promoted to core with OpenGL 3.2 introduced the ability to control the clamping of the depth value for both the near and far clip planes. This eliminates artifacts like seeing inside an object happening when the object&#8217;s geometry is clipped by the near clip plane.</p>
<p>This new extension provides a means for the application to enable depth clamp separately for the near and the far clip plane. This increases the flexibility of depth clamping and can save some fill-rate in certain situations.</p>
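<p>As a rough illustration, the effect of separate near and far clamping can be modeled per fragment as below. This is only a simplified sketch under my own assumptions: it uses a depth range of [0, 1], treats clipping as a per-fragment decision (real hardware clips primitives), and the names <code>DepthClampState</code> and <code>resolve_fragment_depth</code> are made up for the example.</p>

```c
#include <stdbool.h>

/* Hypothetical state mirroring the two independent enables. */
typedef struct {
    bool clamp_near; /* clamp instead of clipping at the near plane */
    bool clamp_far;  /* clamp instead of clipping at the far plane  */
} DepthClampState;

/* Simplified per-fragment model with a depth range of [0, 1].
 * Returns false if the fragment lies in a clipped part of the primitive,
 * otherwise stores the (possibly clamped) depth in *out_depth. */
static bool resolve_fragment_depth(DepthClampState state, float z,
                                   float *out_depth)
{
    if (z < 0.0f) {
        if (!state.clamp_near)
            return false;   /* near clipping still active */
        z = 0.0f;
    } else if (z > 1.0f) {
        if (!state.clamp_far)
            return false;   /* far clipping still active */
        z = 1.0f;
    }
    *out_depth = z;
    return true;
}
```

<p>With only the near clamp enabled, geometry crossing the near plane is kept and flattened onto it, while everything beyond the far plane is still clipped away and thus costs no fill-rate.</p>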
<h3><a title="GL_EXT_texture_filter_anisotropic" href="http://www.opengl.org/registry/specs/EXT/texture_filter_anisotropic.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/texture_filter_anisotropic.txt?referer=');">GL_EXT_texture_filter_anisotropic</a></h3>
<p>I don&#8217;t think that I have to talk too much about this extension as it should be familiar to all of you. It simply makes it possible to use anisotropic filtering on a per-texture basis. I really wonder how this extension didn&#8217;t make its way into core, as it has been supported by hardware for more than a decade.</p>
<p>I know that the extension itself is supported by all relevant graphics driver vendors but really, why can&#8217;t we just simply include it in the core specification?</p>
<h3>GL_ARB_texture_gather_lod</h3>
<p>This is another yet non-existent extension that would extend <a title="GL_ARB_texture_gather" href="http://www.opengl.org/registry/specs/ARB/texture_gather.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_gather.txt?referer=');">GL_ARB_texture_gather</a> by adding GLSL built-in functions called textureGatherLod that would allow gathered fetches with explicit LOD. I&#8217;m not sure if these functions are missing from the specification because of lack of hardware support or just because the ARB thought they might not be of any use. Anyway, if the hardware supports it then OpenGL should expose it to developers as there are certain situations when one has to use explicit LOD and could benefit from the increased fetching performance enabled by gathered fetches.</p>
<h3><a title="GL_ARB_shader_stencil_export" href="http://www.opengl.org/registry/specs/ARB/shader_stencil_export.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/shader_stencil_export.txt?referer=');">GL_ARB_shader_stencil_export</a></h3>
<p>This extension was published at the time the OpenGL 4.1 specification came out and provides the ability for the fragment shader to output the stencil reference value that was otherwise configurable only using API calls. This enables a great level of flexibility to existing and future stencil buffer based algorithms making it possible also to directly write independent values to the stencil buffer on a per-fragment basis.</p>
<p>The predecessor of the extension is <a title="GL_AMD_shader_stencil_export" href="http://www.opengl.org/registry/specs/AMD/shader_stencil_export.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/shader_stencil_export.txt?referer=');">GL_AMD_shader_stencil_export</a>, which suggests that it may only be supported in hardware on AMD GPUs. However, if this is not the case and NVIDIA could support it as well, then I think it is worth promoting this feature to core OpenGL too.</p>
<h2>Streamlining the API</h2>
<p>After discussing the long list of functional features that would be nice to be included into the next release of OpenGL let&#8217;s focus on the API improvement extensions and ideas that are necessary to improve the usability of the API itself. Actually this part could go way longer than I&#8217;ll discuss because as we get more and more features to OpenGL, developers struggle with the increased complexity of the API. I&#8217;ll try to focus on the most crucial issues.</p>
<h3><a title="GL_EXT_direct_state_access" href="http://www.opengl.org/registry/specs/EXT/direct_state_access.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/direct_state_access.txt?referer=');">GL_EXT_direct_state_access</a></h3>
<p>This is the extension that all OpenGL developers have been waiting for for a long time now. Direct state access eliminates the OpenGL API&#8217;s stupid &#8220;bind-to-modify&#8221; nature.</p>
<p>For a very long time the only vendor supporting the extension was NVIDIA. Fortunately, since Catalyst 10.7 AMD also exposes the extension to developers. Still, I have one problem: this extension is very poorly designed.</p>
<p>The main problem with the extension is that the functions were designed in a way that a naive implementation could be done by simply using &#8220;bind-to-modify&#8221; under the hood. That&#8217;s what resulted in crazy API functions like MultiTexParameter* and friends. Also, enabling DSA for all of the deprecated functionality would result in an explosion of the API and, as a consequence, in bloated specification language. Finally, I would also like to object somewhat to the lack of creativity of the contributors regarding the awkward naming conventions present in the current DSA extension.</p>
<p>In my opinion the Khronos Group has to address the issue by creating a new ARB version of the DSA extension that focuses strictly on core functionalities, throwing away DSA support for deprecated features (if somebody needs to use deprecated features they can still use the EXT version) and provide a naming convention that fits much better into the current API language.</p>
<p>Anyway, I completely agree with the other developers out there and scream for DSA. I think the Khronos Group has to eliminate the problem of the &#8220;bind-to-modify&#8221; semantics as soon as possible otherwise, even though the core specification exposes more and more hardware features, developers will not be attracted to use OpenGL.</p>
<h3>GL_ARB_explicit_sampler_location</h3>
<p>The ARB moved in the right direction when they introduced the <a title="GL_ARB_explicit_attrib_location" href="http://www.opengl.org/registry/specs/ARB/explicit_attrib_location.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/explicit_attrib_location.txt?referer=');">GL_ARB_explicit_attrib_location</a> extension, eliminating the need to use dummy API calls to bind vertex attributes and output buffers to shader variables, but they should not stop here. One of the most important additions could be a similar language syntax in GLSL that would allow us to bind sampler uniforms to texture image units. Obviously, the same goes for read-write images if <a title="GL_EXT_shader_image_load_store" href="http://www.opengl.org/registry/specs/EXT/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/shader_image_load_store.txt?referer=');">GL_EXT_shader_image_load_store</a> is included.</p>
<h3>GL_ARB_explicit_uniform_block_index</h3>
<p>Similar to the previous request, uniform block indices should be as well explicitly specifiable in the shaders themselves. This extension would add exactly such functionality. The implementation is also straightforward: just a simple uniform block layout qualifier has to be added.</p>
<h3>Other API clarifications</h3>
<p>Besides the major issues the current specification language also has some bugs and unclear parts that should be addressed as well:</p>
<ul>
<li>Program pipeline objects are created by binding the object name, which is not in line with the rest of the API language.</li>
<li>The specification does not say whether program pipeline objects are shared among contexts or not. This suggests that they aren&#8217;t, which is not in line with the fact that program and shader objects are shared.</li>
</ul>
<p>Most probably there are a lot more issues with the specification language but for now just these came to my mind. Maybe some of you can extend the list with tons of other specification mistakes.</p>
<h2>OpenGL 4.2 and beyond</h2>
<p>While my feature requests cover most of the needed functionality that should be included in the next revision of the OpenGL specification, there are a lot of other things that could be very useful for developers but are very unlikely to make their way into the specification any time soon. I will talk about these features in this section of the article, as they raise many more questions than could be settled by simply including them in OpenGL 4.2.</p>
<h3>Affinity contexts</h3>
<p>We have had multi-GPU designs like SLI and CrossFire for a long time now. Fortunately, we also have vendor specific extensions to create affinity contexts that are associated with a single GPU of a multi-GPU configuration. We have <a title="WGL_AMD_gpu_association" href="http://www.opengl.org/registry/specs/AMD/wgl_gpu_association.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/wgl_gpu_association.txt?referer=');">WGL_AMD_gpu_association</a> and <a title="WGL_NV_gpu_affinity" href="http://www.opengl.org/registry/specs/NV/gpu_affinity.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/gpu_affinity.txt?referer=');">WGL_NV_gpu_affinity</a> for Windows and <a title="GLX_AMD_gpu_association" href="http://www.opengl.org/registry/specs/AMD/glx_gpu_association.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/glx_gpu_association.txt?referer=');">GLX_AMD_gpu_association</a> on GLX based platforms. I have just two problems with this:</p>
<ul>
<li>First, these are vendor specific extensions.</li>
<li>Second, NVIDIA exposes its affinity context support only on Windows and just for their professional cards, leaving consumer hardware owners without affinity context support.</li>
</ul>
<p>I would be pleased to see in the future extensions like <span style="text-decoration: underline;">WGL_ARB_gpu_affinity_context</span> and <span style="text-decoration: underline;">GLX_ARB_gpu_affinity_context</span> that will be supported by both NVIDIA and AMD, and that are supported on both professional and consumer hardware.</p>
<h3>Command buffers</h3>
<p>I would like to see something in OpenGL similar to what we have in OpenCL. Having several separate command buffers for a single OpenGL context can have its performance benefits, as some of the implicit sync points that are otherwise present in OpenGL draw commands could be eliminated. Another solution would be to simply use multiple GL contexts, but that is much more complicated and context switches are quite heavy-weight operations. This would be something like how framebuffer objects replaced pbuffers.</p>
<p>Also, this could go as far as encapsulating state manipulation data into command buffers, similar to how display lists allowed this in many cases, just in a more efficient and hardware centric manner.</p>
<h3>Immutable state objects</h3>
<p>Another thing strongly related to the previous idea would be immutable state objects. If state management data could not be efficiently stored in such a command buffer we could use instead immutable state objects that would be very similar in nature to display lists that are hiding the underlying representation of the commands.</p>
<p>Display lists are deprecated and I don&#8217;t think it was a wrong decision. They made the API language complex and you never knew which commands compile into display lists and how. I remember the time I was making an OpenGL app on my GeForce2 and used DrawElements calls inside display lists that referenced buffer object data. Funnily enough, it was working on NVIDIA hardware, even though the specification says otherwise, and I was wondering why my app crashed on ATI cards.</p>
<p>Anyway, display lists are gone, but we need some complex state objects that could fill those holes that were left after them.</p>
<h3>More callbacks</h3>
<p>I was very happy to see the appearance of an extension that introduced the callback concept into OpenGL (<a title="GL_AMD_debug_output" href="http://www.opengl.org/registry/specs/AMD/debug_output.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/debug_output.txt?referer=');">GL_AMD_debug_output</a>). Since then, the functionality was promoted to an ARB extension, meaning that the ARB has accepted the fact that we need callbacks.</p>
<p>What I would like to see in the future is more OpenGL callbacks. One of the most trivial things I can think of is asynchronous queries. It would be so much easier if we were able to receive a callback from OpenGL when the results of our asynchronous queries are available, rather than having to manually poll for the result in various phases of the rendering.</p>
<p>Actually, I could imagine callbacks for every rendering command issued that will be called by the driver as soon as the actual rendering is complete on the GPU side.</p>
<h3>Programmable blending</h3>
<p>This is yet another thing that developers are screaming for. Fortunately, we now have indirect methods to solve most of the issues of programmable blending via the extensions <a title="GL_EXT_shader_image_load_store" href="http://www.opengl.org/registry/specs/EXT/shader_image_load_store.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/EXT/shader_image_load_store.txt?referer=');">GL_EXT_shader_image_load_store</a> and <a title="GL_NV_texture_barrier" href="http://www.opengl.org/registry/specs/NV/texture_barrier.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/texture_barrier.txt?referer=');">GL_NV_texture_barrier</a>, however a more general solution would be welcome.</p>
<p>I don&#8217;t know whether this would be actually possible on current hardware but if not, then this is a message to hardware vendors to solve the issue in the near future.</p>
<h2>Summary</h2>
<p>We&#8217;ve seen that even though OpenGL is on track and the Khronos Group is keeping pace with its competitors, there is still lots of room for improvement in the OpenGL specification, both from a functional and from an API design point of view.</p>
<p>I would like to end the article with a summary of what I expect to be part of the OpenGL 4.2 specification and my personal wish-list beyond those in some kind of priority order.</p>
<p><strong>My expectations for OpenGL 4.2:</strong></p>
<ul>
<li>GL_EXT_shader_image_load_store</li>
<li>GL_ARB_shader_atomic_counters</li>
<li>GL_ARB_instanced_arrays2</li>
<li>GL_ARB_explicit_sampler_location</li>
<li>GL_ARB_explicit_uniform_block_index</li>
</ul>
<p><strong>My personal wish-list for OpenGL 4.2:</strong></p>
<ul>
<li>GL_ARB_draw_indirect2</li>
<li>GL_ARB_direct_state_access</li>
<li>GL_NV_texture_barrier</li>
<li>GL_AMD_conservative_depth</li>
<li>GL_ARB_texture_gather_lod</li>
<li>GL_NV_copy_image</li>
<li>GL_EXT_texture_filter_anisotropic</li>
<li>GL_ARB_shader_stencil_export</li>
<li>GL_AMD_depth_clamp_separate</li>
<li>GL_AMD_transform_feedback3_lines_triangles</li>
</ul>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/11/suggestions-for-opengl-4-2-and-beyond/feed/</wfw:commentRss>
		<slash:comments>29</slash:comments>
		</item>
		<item>
		<title>Texture and buffer access performance</title>
		<link>http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/</link>
		<comments>http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/#comments</comments>
		<pubDate>Tue, 02 Nov 2010 20:44:57 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Multiprocessing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[texture buffer]]></category>
		<category><![CDATA[uniform buffer]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=475</guid>
		<description><![CDATA[Currently there are several ways to feed data to the GPU no matter of what API we use and what type of application we develop. In case of OpenGL we have uniform buffers, texture buffers, texture images, etc. The same is true for OpenCL and other compute APIs that even provide more fine-grained memory management]]></description>
			<content:encoded><![CDATA[
<p>Currently there are several ways to feed data to the GPU no matter what API we use and what type of application we develop. In case of OpenGL we have uniform buffers, texture buffers, texture images, etc. The same is true for OpenCL and other compute APIs, which provide even more fine-grained memory management by taking advantage of the local data store (LDS) available on today&#8217;s hardware. In this article I&#8217;ll present the memory access performance characteristics of AMD&#8217;s Evergreen-class GPUs, focusing on what all this means from an OpenGL point of view. While most of the data is about the HD5870, the general principles and relative performance characteristics are valid for other GPUs, including ones from other vendors.</p>
<p><span id="more-475"></span></p>
<h2>Introduction</h2>
<p>Traditional CPU based applications don&#8217;t have to worry too much about where they put their data as they have a simple set of possibilities: registers and global memory (accessed through a series of linear caches called L1, L2 and on newer architectures also L3). While this and its details can be already quite cumbersome to utilize efficiently, GPU based algorithms need even more investigation as their architecture is based on a more complex multi-level memory design.</p>
<p>Typical questions an OpenGL graphics developer could ask nowadays are:</p>
<ul>
<li>Where should I put my per-object data?</li>
<li>From where should I source animation data?</li>
<li>Should I use uniform buffers, texture buffers or vertex buffers for my per-instance data?</li>
<li>What does it mean from a performance point of view if I use read-write buffers or textures?</li>
</ul>
<p>Of course, the list could continue and answering the individual questions is not easy and often requires performance measurements to prove our suspicions. Instead of trying to answer all these questions it is easier to take a look at the actual hardware performance characteristics and solve the individual issues based on that.</p>
<p>I&#8217;ve already touched on the topic in the past with the article <a title="Uniform Buffers VS Texture Buffers" href="http://rastergrid.com/blog/2010/01/uniform-buffers-vs-texture-buffers/">Uniform Buffers VS Texture Buffers</a> where I&#8217;ve presented the key differences between the two data access methods and a few examples of when to use one or the other. In this article I&#8217;ll go further and try to provide more accurate data about how various memory access methods perform in practice.</p>
<p>Earlier there was little to no detailed information about the actual performance of API level memory access methods, but fortunately the increasing popularity of OpenCL made vendors provide more technical details about the architecture and performance of their products to enable software developers to fully leverage the power of today&#8217;s GPUs. While these documents focus on OpenCL or other compute APIs, most of the data applies indirectly to OpenGL as well.</p>
<h2>The Evergreen architecture</h2>
<p>In order to be able to provide some actual performance data, I&#8217;ve selected AMD&#8217;s Evergreen architecture as the reference and the Radeon HD5870 as the target hardware. Note that most of the presented details roughly apply to all other modern GPUs, including NVIDIA&#8217;s Fermi architecture. Each time there is a clear difference between the two, I&#8217;ll try to point it out. However, I cannot be 100% sure what these differences are, as <a title="ATI Stream SDK OpenCL Programming Guide" href="http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf?referer=');">ATI&#8217;s OpenCL programming guide</a> is somewhat more talkative about actual performance details than <a title="NVIDIA OpenCL Programming Guide" href="http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf?referer=');">NVIDIA&#8217;s OpenCL programming guide</a>.</p>
<div class="wp-caption aligncenter" style="width: 510px"><img class=" " title="OpenCL Platform Model" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/11/opencl_platform_model.png" alt="OpenCL Platform Model" width="500" height="273" /><p class="wp-caption-text">OpenCL Platform Model</p></div>
<p>From OpenCL platform model&#8217;s point of view the Radeon HD5870 is structured in the following way:</p>
<ul>
<li>Total of 20 compute units.</li>
<li>Each compute unit consists of 16 stream cores.</li>
<li>Each stream core consists of 5 processing elements (4 traditional, 1 transcendental).</li>
</ul>
<p>This sums up to a total of 1600 processing elements on the Radeon HD5870.</p>
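<p>The arithmetic behind that number is trivial, but writing it down makes it easy to plug in other GPUs. The struct and function names below are just illustrative choices of mine:</p>

```c
/* Hierarchy of the OpenCL platform model as described above. */
typedef struct {
    unsigned compute_units;
    unsigned stream_cores_per_cu;
    unsigned elements_per_stream_core;
} GpuTopology;

/* Number of stream cores on the whole GPU. */
static unsigned total_stream_cores(GpuTopology t)
{
    return t.compute_units * t.stream_cores_per_cu;
}

/* Number of processing elements (individual ALUs) on the whole GPU. */
static unsigned total_processing_elements(GpuTopology t)
{
    return total_stream_cores(t) * t.elements_per_stream_core;
}
```

<p>For the Radeon HD5870 (20 compute units, 16 stream cores each, 5 processing elements per stream core) this gives 320 stream cores and 1600 processing elements.</p>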
<p>The basic OpenCL architecture applies in the same way to NVIDIA GPUs, however, there are differences between AMD&#8217;s and NVIDIA&#8217;s GPU architectures. AMD has used a special super-scalar architecture since their HD2000 series that allows them to execute 5 separate instructions in each core.</p>
<div class="wp-caption aligncenter" style="width: 439px"><img class="   " title="ATI super-scalar architecture" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/11/ati_superscalar.gif" alt="ATI super-scalar architecture" width="429" height="237" /><p class="wp-caption-text">ATI super-scalar architecture consisting of one transcendental unit (left), four traditional units and a dedicated branch execution unit (right).</p></div>
<p>What this already reveals from an OpenGL point of view is that AMD&#8217;s architecture groups together 16 stream cores, so fragment shaders are most probably running on 4&#215;4 tiles of fragments in sync. This is important to note, for example, when we use heavy dynamic branching in shaders: if the branch selection is not coherent within that fragment neighborhood, performance can drop because the hardware masks out those processing elements that did not select the branch currently being executed.</p>
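<p>A toy cost model shows why incoherent branching hurts. This is only a sketch under the assumption that a group of lanes runs in lock-step and pays for every branch that at least one lane takes; the function name and the cycle costs used with it are made up for illustration.</p>

```c
#include <stddef.h>

/* Cost (in arbitrary cycles) of an if/else executed by a group of lanes
 * running in lock-step. A branch costs its full price as soon as at least
 * one lane selects it; lanes that did not select it are simply masked out. */
static unsigned lockstep_branch_cost(const int *takes_then, size_t lanes,
                                     unsigned then_cost, unsigned else_cost)
{
    int any_then = 0;
    int any_else = 0;

    for (size_t i = 0; i < lanes; ++i) {
        if (takes_then[i])
            any_then = 1;
        else
            any_else = 1;
    }
    return (any_then ? then_cost : 0u) + (any_else ? else_cost : 0u);
}
```

<p>If all 16 fragments of a tile pick the same branch, the tile pays for that branch only. A single fragment choosing differently is enough to make the whole tile pay for both.</p>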
<p>Also, it is important to note that usually only one out of four or five processing elements (depending on hardware generation and vendor) is capable of executing transcendental instructions such as logarithm, exponential or trigonometric functions.</p>
<h2>Memory capacity and performance</h2>
<p>AMD is very clear about the memory capacity and performance details in their OpenCL programming guide. The table below showcases these hardware characteristics of the Radeon HD5870:</p>
<table style="width: 100%;" border="0">
<tbody>
<tr>
<td style="text-align: center;"><strong>OpenCL Memory Type</strong></td>
<td style="text-align: center;"><strong>Hardware Resource</strong></td>
<td style="text-align: center;"><strong>Size/CU</strong></td>
<td style="text-align: center;"><strong>Size/GPU</strong></td>
<td style="text-align: center;"><strong>Peak Read Bandwidth / Stream Core</strong></td>
</tr>
<tr>
<td>Private</td>
<td>GPRs</td>
<td style="text-align: center;">256KB</td>
<td style="text-align: center;">5MB</td>
<td style="text-align: center;">48 bytes/cycle</td>
</tr>
<tr>
<td>Local</td>
<td>LDS</td>
<td style="text-align: center;">32KB</td>
<td style="text-align: center;">640KB</td>
<td style="text-align: center;">8 bytes/cycle</td>
</tr>
<tr>
<td rowspan="3">Constant</td>
<td>Direct-addressed constant</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">48KB</td>
<td style="text-align: center;">16 bytes/cycle</td>
</tr>
<tr>
<td>Same-indexed constant</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">4 bytes/cycle</td>
</tr>
<tr>
<td>Varying-indexed constant</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">~0.6 bytes/cycle</td>
</tr>
<tr>
<td rowspan="2">Images</td>
<td>L1 Cache</td>
<td style="text-align: center;">8KB</td>
<td style="text-align: center;">160KB</td>
<td style="text-align: center;">4 bytes/cycle</td>
</tr>
<tr>
<td>L2 Cache</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">512KB</td>
<td style="text-align: center;">~1.6 bytes/cycle</td>
</tr>
<tr>
<td>Global</td>
<td>Global Memory</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">1GB</td>
<td style="text-align: center;">~0.6 bytes/cycle</td>
</tr>
</tbody>
</table>
<p><strong>GPRs</strong> &#8211; General Purpose Registers<br />
<strong>LDS</strong> &#8211; Local Data Store<br />
<strong>Direct-addressed constant</strong> &#8211; a constant accessed using a constant address.<br />
<strong>Same-indexed constant</strong> &#8211; a varying-indexed constant where each processing element accesses the same index.<br />
<strong>Varying-indexed constant</strong> &#8211; a varying-indexed constant where the processing elements access different indices.</p>
<p>Of course, these figures apply to fetches that are properly aligned. In case of unaligned data access the actual throughput can be much lower. In order to be able to reach the peak bandwidth we usually have to align our data to multiples of 4, 8 or 16 bytes (depending on the actual hardware).</p>
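<p>On the application side, keeping data aligned mostly comes down to how the structures that get uploaded into buffers are laid out. Below is a minimal C11 sketch that assumes a 16-byte alignment target; the <code>InstanceData</code> struct and the <code>align_up</code> helper are hypothetical names used only for this example.</p>

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Per-instance record in which every member starts on a 16-byte boundary,
 * so each member can be fetched as one aligned 16-byte element. */
typedef struct {
    alignas(16) float position[3]; /* 12 bytes of data + 4 bytes of padding */
    alignas(16) float color[4];    /* 16 bytes */
    alignas(16) float scale;       /* 4 bytes of data + 12 bytes of padding */
} InstanceData;

/* Compile-time checks of the layout described in the comments above. */
static_assert(offsetof(InstanceData, color) == 16, "color must start at 16");
static_assert(sizeof(InstanceData) % 16 == 0, "size must be a multiple of 16");

/* Rounds `offset` up to the next multiple of `alignment` (alignment > 0). */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment > 0);
    return (offset + alignment - 1) / alignment * alignment;
}
```

<p>The padding wastes some memory (48 bytes instead of 32 per instance here), which is the usual trade-off for aligned fetches.</p>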
<p>As can be seen, constant storage can fall into three different access performance categories, and so can buffers and images. While actual numbers differ on various platforms, the guidelines apply to most modern GPUs: use a particular addressing method wisely and take access locality into consideration in order to get optimum performance.</p>
<p>These numbers are no different in case of OpenGL terminology either, just replace the word &#8220;constant&#8221; with uniform buffers and think about images and global data as texture images or buffer objects. The only exception is that there is no direct alternative for local memory in OpenGL.</p>
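<p>The per-stream-core figures of the table are easier to compare with the usual GB/s numbers once scaled up to the whole chip. The sketch below assumes the HD5870&#8217;s 320 stream cores and an engine clock of 850 MHz; the clock comes from the card&#8217;s public specification, not from the table, and the function name is my own.</p>

```c
/* Scales a "bytes per cycle per stream core" figure up to an aggregate
 * bandwidth in GB/s (1 GB = 1e9 bytes) for the whole GPU. */
static double aggregate_bandwidth_gbps(double bytes_per_cycle_per_core,
                                       unsigned stream_cores,
                                       double clock_mhz)
{
    double bytes_per_second =
        bytes_per_cycle_per_core * stream_cores * clock_mhz * 1e6;
    return bytes_per_second / 1e9;
}
```

<p>With these assumptions the L1 cache figure of 4 bytes/cycle corresponds to roughly 1088 GB/s across the chip, while the ~0.6 bytes/cycle of global memory corresponds to roughly 163 GB/s, which is in the same ballpark as the card&#8217;s memory bandwidth.</p>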
<p>An additional thing to consider since Shader Model 5.0 hardware is read-write images and buffers. AMD refers to the two memory access methods as FastPath and CompletePath. In case of read-only textures or buffers the GPU uses the FastPath, which is able to take full advantage of the L2 cache, while read-write textures and buffers usually use the so called CompletePath, which sacrifices the advantages of the L2 cache to enable the use of atomic operations on global memory objects. This, of course, has a huge performance effect, reducing the throughput of the GPU to roughly a fifth on the Radeon HD5870:</p>
<table style="width: 100%;" border="0">
<tbody>
<tr>
<td><strong>Kernel</strong></td>
<td style="text-align: center;"><strong>Effective Bandwidth</strong></td>
<td style="text-align: center;"><strong>Ratio to Peak Bandwidth</strong></td>
</tr>
<tr>
<td>copy 32-bit 1D FastPath</td>
<td style="text-align: center;">96 GB/s</td>
<td style="text-align: center;">63%</td>
</tr>
<tr>
<td style="text-align: left;">copy 32-bit 1D CompletePath</td>
<td style="text-align: center;">18 GB/s</td>
<td style="text-align: center;">12%</td>
</tr>
</tbody>
</table>
<h2>Summary</h2>
<p>Now that we&#8217;ve seen how the various OpenCL memory types perform in reality, let&#8217;s see how all this information translates to the OpenGL world. Here are my <strong>top-10 recommendations</strong> about when and how to use the various data access methods present in modern OpenGL:</p>
<ol>
<li>Align your data to multiples of 16 bytes and fetch them accordingly.</li>
<li>Use direct-addressing of data in uniform buffers and try to avoid indexing into uniform buffers.</li>
<li>If you must use indexing into uniform buffers, make sure that the indices are coherent across processing elements working in sync.</li>
<li>If you heavily use indexed data consider using texture buffers instead of uniform buffers to take advantage of the L1 and L2 cache.</li>
<li>Texture and buffer caches are linear so consider this when planning your access patterns.</li>
<li>Bind textures and buffers for read-write mode only when it is really necessary, use regular texture binding otherwise to ensure optimum performance.</li>
<li>A single atomic buffer operation forces the shader to use the slow path so use atomic operations wisely.</li>
<li>Do not use atomic buffer operations to implement atomic counters, use built-in hardware atomic counters instead as they are much faster.</li>
<li>Consider using dynamic branching to avoid costly memory operations as often as possible.</li>
<li>Try to make your branch selection coherent across processing elements working in sync (e.g. 4&#215;4 fragment tile in case of a fragment shader).</li>
</ol>
<div class="wp-caption aligncenter" style="width: 505px"><img class="  " title="Memory Access Performance" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/11/mem_perf.png" alt="Memory Access Performance" width="495" height="233" /><p class="wp-caption-text">Relative performance characteristics of memory access methods (higher is better).</p></div>
<p><em>Note: This article may contain inaccurate data and some of the advice may not apply to other hardware platforms. I&#8217;ve written this article in the hope that it may prove useful for some developers out there. For accurate details or more information, please contact your hardware vendor.</em></p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>GPU based dynamic geometry LOD</title>
		<link>http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/</link>
		<comments>http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/#comments</comments>
		<pubDate>Mon, 25 Oct 2010 19:35:13 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LOD]]></category>
		<category><![CDATA[occlusion culling]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[tessellation]]></category>
		<category><![CDATA[vertex buffer]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=428</guid>
		<description><![CDATA[Dynamic geometry level-of-detail (LOD) algorithms are very popular and powerful algorithms that provide a great level of rendering performance optimization while preserving detail by using less detailed geometry for objects that are far away, too small or otherwise less significant in the quality of the final rendering. Many of these are used since the very]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 210px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/10/mountains.png"><img class="  " title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-thumb.png" alt="OpenGL 4.0 - Mountains demo" width="200" height="150" /></a><p class="wp-caption-text">OpenGL 4.0 - Mountains demo</p></div>
<p>Dynamic geometry level-of-detail (LOD) algorithms are popular and powerful techniques that improve rendering performance while preserving visual detail by using less detailed geometry for objects that are far away, too small or otherwise less significant for the quality of the final rendering. Many of them have been in use since the very beginning of computer graphics and are present in some form in current CAD software, video games and other graphics applications. While determining the appropriate geometry LOD was previously the task of the CPU, with today&#8217;s hardware it is possible to offload this to the GPU as well, which excels at handling large numbers of objects in parallel.<br />
<span id="more-428"></span></p>
<h2>Introduction</h2>
<p>With the advent of Shader Model 5.0 GPUs and the appearance of programmable tessellation hardware it may seem like the geometry LOD problem is solved once and for all. However, in many cases it is simply not enough, as for far away objects even a patch pass-through tessellation shader already produces more geometry than the added detail is worth. As a result, classic geometry LOD algorithms are still good to have in the developer&#8217;s toolbox. Not to mention that all vendors recommend disabling tessellation shaders altogether if we don&#8217;t need any geometry amplification, as even a pass-through tessellation shader has its overhead.</p>
<p>This means that there still has to be a conventional rendering path for geometry that should not be tessellated. Then why not try offloading the geometry LOD determination to the GPU if possible?</p>
<p>This article presents a technique that was already demonstrated by AMD&#8217;s <a title="March of the Froblins" href="http://developer.amd.com/samples/demos/pages/froblins.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/samples/demos/pages/froblins.aspx?referer=');">March of the Froblins</a> demo and by NVIDIA&#8217;s <a title="NVIDIA DX10 Samples" href="http://developer.download.nvidia.com/SDK/10/direct3d/samples.html" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.download.nvidia.com/SDK/10/direct3d/samples.html?referer=');">Skinned Instancing</a> demo. It allows GPU based dynamic geometry LOD determination using a geometry shader that selects the most appropriate LOD from a group of geometry LODs based on the object&#8217;s distance from the camera. While this article and the reference implementation (<a title="OpenGL 4.0 - Mountains demo released" href="http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/">OpenGL 4.0 &#8211; Mountains demo</a>) present the application of the technique only for instanced geometry, the same method can easily be extended to support heterogeneous objects by taking advantage of the latest functionality introduced in OpenGL 4.</p>
<h2>The algorithm</h2>
<p>The technique is based on the geometry shader&#8217;s ability to emit or deny the emission of primitives into a transform feedback buffer as done in the mentioned DX based implementations. One major improvement compared to earlier approaches is that the LOD determination is done in a single pass rather than requiring a separate pass for each geometry LOD. Additionally, this LOD determination pass can also be merged with other visibility determination passes like <a title="Instance culling using geometry shaders" href="http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/">Instance Cloud Reduction</a> or <a title="Hierarchical-Z map based occlusion culling" href="http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/">Hierarchical-Z map based occlusion culling</a>, as is done in the reference implementation. This was made possible by the latest transform feedback capabilities introduced in OpenGL 4.0 (see the extension <a title="GL_ARB_transform_feedback3" href="http://www.opengl.org/registry/specs/ARB/transform_feedback3.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback3.txt?referer=');">ARB_transform_feedback3</a>) that enable the geometry shader to output data to separate primitive streams.</p>
<div class="wp-caption aligncenter" style="width: 660px"><img class="    " title="Culling and dynamic LOD in the March of the Froblins demo" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/froblin-lod.png" alt="Culling and dynamic LOD in the March of the Froblins demo" width="650" height="340" /><p class="wp-caption-text">Flow-chart presenting the culling and dynamic LOD algorithms used in AMD&#39;s March of the Froblins demo. The implementation needs five passes for culling and separating three detail levels and performs two asynchronous queries meanwhile. Requires OpenGL 3 compliant hardware.</p></div>
<div class="wp-caption aligncenter" style="width: 660px"><img title="Culling and dynamic LOD in the Mountains demo" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-lod.png" alt="Culling and dynamic LOD in the Mountains demo" width="650" height="281" /><p class="wp-caption-text">Flow-chart presenting the culling and dynamic LOD algorithm used in our Mountains demo. The implementation requires only one pass for culling and separating three detail levels without the need to use asynchronous queries. Requires OpenGL 4 compliant hardware.</p></div>
<p>The algorithm itself is very simple and straightforward. For each object instance, determine the appropriate geometry LOD based on its distance from the camera and the LOD distances passed to the shader as a uniform. After this, write the instance&#8217;s data to the output stream whose ID corresponds to the index of the selected LOD. Here you can see a GLSL implementation of the algorithm:</p>
<pre class="brush:c">#version 400 core

uniform mat4 ModelViewMatrix;
uniform vec2 LodDistance;

layout(points) in;
layout(points, max_vertices = 1) out;

in vec3 InstancePosition[1];

layout(stream=0) out vec3 InstPosLOD0;
layout(stream=1) out vec3 InstPosLOD1;
layout(stream=2) out vec3 InstPosLOD2;

void main() {
  // eye-space distance; use .xyz so that w = 1.0 does not inflate the length
  float distance = length((ModelViewMatrix * vec4(InstancePosition[0], 1.0)).xyz);
  if ( distance &lt; LodDistance.x ) {
    InstPosLOD0 = InstancePosition[0];
    EmitStreamVertex(0);
  } else
  if ( distance &lt; LodDistance.y ) {
    InstPosLOD1 = InstancePosition[0];
    EmitStreamVertex(1);
  } else {
    InstPosLOD2 = InstancePosition[0];
    EmitStreamVertex(2);
  }
}</pre>
<p>Additionally, the geometry LOD determination pass has to be executed with primitive queries enabled for all the relevant output streams to acquire the number of instances for each geometry LOD index:</p>
<pre class="brush:cpp">for (int i=0; i&lt;NUM_LOD; i++)
  glBeginQueryIndexed(GL_PRIMITIVES_GENERATED, i, lodQuery[i]);

glBeginTransformFeedback(GL_POINTS);
  glDrawArrays(GL_POINTS, 0, instanceCount);
glEndTransformFeedback();

for (int i=0; i&lt;NUM_LOD; i++)
  glEndQueryIndexed(GL_PRIMITIVES_GENERATED, i);</pre>
<p>Finally, the only thing left is to issue an instanced draw call for each geometry LOD index to draw all the instances:</p>
<pre class="brush:cpp">for (int i=0; i&lt;NUM_LOD; i++) {
  glGetQueryObjectiv(lodQuery[i], GL_QUERY_RESULT, &amp;instanceCountLOD[i]);
  if ( instanceCountLOD[i] &gt; 0 )
    glDrawElementsInstanced(..., instanceCountLOD[i]);
}</pre>
<p>That&#8217;s all, and what you get as a result is a fully GPU based geometry LOD selection algorithm.</p>
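<p>To make the selection logic easier to verify outside the shader, here is a minimal CPU-side sketch that mirrors the same branching. It is not part of the demo: <code>Vec3</code>, <code>selectLod</code> and <code>binByLod</code> are hypothetical names, and the positions are assumed to be already transformed into eye space. The sizes of the resulting buckets correspond to the per-stream primitive query results.</p>
<pre class="brush:cpp">#include &lt;array&gt;
#include &lt;cmath&gt;
#include &lt;vector&gt;

struct Vec3 { float x, y, z; };

// Mirrors the branch structure of the geometry shader: returns the index of
// the output stream (0, 1 or 2) for a given eye-space distance.
inline int selectLod(float distance, float lodDistanceX, float lodDistanceY) {
    if (distance &lt; lodDistanceX) return 0;
    if (distance &lt; lodDistanceY) return 1;
    return 2;
}

// Sorts eye-space instance positions into one bucket per LOD.
inline std::array&lt;std::vector&lt;Vec3&gt;, 3&gt; binByLod(const std::vector&lt;Vec3&gt;&amp; eyePositions,
                                                 float lodDistanceX, float lodDistanceY) {
    std::array&lt;std::vector&lt;Vec3&gt;, 3&gt; buckets;
    for (const Vec3&amp; p : eyePositions) {
        float distance = std::sqrt(p.x * p.x + p.y * p.y + p.z * p.z);
        buckets[selectLod(distance, lodDistanceX, lodDistanceY)].push_back(p);
    }
    return buckets;
}</pre>
<p>Comparing the bucket sizes with the values read back from the queries is a cheap way to check that the LOD distances are set up as intended.</p>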
<h2>The Mountains demo</h2>
<p>The reference implementation is provided as part of the <a title="OpenGL 4.0 - Mountains demo" href="http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/">OpenGL 4.0 &#8211; Mountains demo</a>, which is available with full source code and a Windows executable in the <a title="Mountains Demo download" href="http://rastergrid.com/blog/downloads/mountains-demo/">downloads section</a>. Besides the dynamic geometry LOD algorithm presented here, the demo application implements the same visibility determination algorithms that were presented in the <a title="SIGGRAPH 2008 Course Notes about the March of the Froblins" href="http://developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf?referer=');">SIGGRAPH 2008 Course Notes</a>, all in a single pass.</p>
<p>Dynamic LOD can be enabled in the demo with the F3 key. Once enabled, the demo separates the various geometry detail levels according to the configured LOD distances. As can be seen, there is almost no visible difference between the scene rendered with dynamic geometry LOD enabled and disabled. Also, by setting the LOD distances appropriately, the algorithm provides a seamless transition between subsequent geometry detail levels as the camera moves.</p>
<table style="width: 100%;" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="background-color: #ffffff;" align="center">
<div class="wp-caption alignnone" style="width: 338px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/lod-comp.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/lod-comp.png?referer=');"><img title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/lod-comp-thumb.png" alt="Close-up view to compare image quality without and with dynamic LOD" width="328" height="160" /></a><p class="wp-caption-text">Close-up view of distant objects to compare the image quality without (left) and with (right) dynamic LOD.</p></div></td>
<td style="background-color: #ffffff;" align="center">
<p><div class="wp-caption alignnone" style="width: 223px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/visual-lod.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/visual-lod.png?referer=');"><img title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/visual-lod-thumb.png" alt="LOD visualization" width="213" height="160" /></a><p class="wp-caption-text">Geometry LOD visualization: LOD 0 (red), LOD 1 (green), LOD 2 (blue).</p></div></td>
</tr>
</tbody>
</table>
<p>When dynamic LOD is enabled, the demo also makes it possible to visualize the various geometry detail levels by pressing the F4 key. The highest detail LOD is marked with red, the mid-level with green and the lowest detail geometry with blue. It can be seen that as the camera moves the renderer automatically adjusts the detail of each individual instance.</p>
<p>Besides maintaining constant quality without the viewer noticing any transitions between the various detail levels, the algorithm provides a huge performance gain in case of complex geometry, as can be seen in the figure below:</p>
<p><div class="wp-caption aligncenter" style="width: 654px"><img class="   " src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-fps.png" alt="Performance comparison of various culling and LOD techniques in frames per second on a Radeon HD5770 (higher is better)" width="644" height="224" /><p class="wp-caption-text">Performance comparison of the demo in frames per second on a Radeon HD5770 (higher is better): no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).</p></div>
<h2>Conclusion</h2>
<p>We&#8217;ve seen how straightforward it is to implement GPU based dynamic geometry LOD determination using geometry shaders on OpenGL 4.0 compliant hardware, and provided a reference implementation that uses the algorithm to efficiently determine detail levels for a large number of instanced geometries. We also briefly mentioned that the algorithm can be extended to handle arbitrary object sets. We discussed a possible OpenGL 3 based implementation but did not provide one, as it requires several rendering passes to perform all the operations that can be implemented in a single pass on Shader Model 5.0 hardware.</p>
<p>Even though the algorithm is already extremely efficient, it still involves the use of asynchronous primitive queries that may induce some latency. Of course, this latency can be easily hidden by performing other operations on the CPU/GPU until the results are available.</p>
<p>Furthermore, taking full advantage of Shader Model 5.0 GPUs, it would be possible to eliminate the need for asynchronous queries by using atomic counters and indirect rendering. However, the core OpenGL specification does not yet expose such functionality, so this improvement is left for a future release of the demo.</p>
<p>Classic dynamic geometry LOD algorithms are still first class citizens of every rendering system. Even though the introduction of hardware tessellation somewhat reduces the need for these classic techniques, practice shows that the best way to implement a full-fledged dynamic LOD system is to use geometry LOD selection and tessellation together rather than one instead of the other.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Hierarchical-Z map based occlusion culling</title>
		<link>http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/</link>
		<comments>http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/#comments</comments>
		<pubDate>Tue, 19 Oct 2010 19:13:32 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[depth buffer]]></category>
		<category><![CDATA[fragment shader]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LOD]]></category>
		<category><![CDATA[mipmap]]></category>
		<category><![CDATA[occlusion culling]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[transform feedback]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=397</guid>
		<description><![CDATA[Hierarchical-Z is a well known and standard feature of modern GPUs that allows them to speed up depth testing by rejecting large group of incoming fragments using a reduced and compressed version of the depth buffer that resides in on-chip memory. The technique presented in this article uses the same basic idea to allow batched]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 210px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/10/mountains.png"><img class="  " title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-thumb.png" alt="OpenGL 4.0 - Mountains demo" width="200" height="150" /></a><p class="wp-caption-text">OpenGL 4.0 - Mountains demo</p></div>
<p>Hierarchical-Z is a well known and standard feature of modern GPUs that allows them to speed up depth testing by rejecting large groups of incoming fragments using a reduced and compressed version of the depth buffer that resides in on-chip memory. The technique presented in this article uses the same basic idea to allow batched occlusion culling for large numbers of individual objects using a geometry shader, without the CPU intervention that is unavoidable with traditional occlusion queries. The article also provides a reference implementation in the form of the OpenGL 4.0 Mountains demo that uses the technique for culling thousands of object instances.</p>
<p><span id="more-397"></span></p>
<h2>Introduction</h2>
<p>Occlusion culling is a visibility determination algorithm that is used to identify objects that reside in the view volume but are still not visible on the screen due to occlusion, that is, they are hidden by objects that are closer to the camera.</p>
<p>For several generations now, GPUs have offered hardware accelerated occlusion culling in the form of occlusion queries. OpenGL provides the functionality via the extension <a title="GL_ARB_occlusion_query" href="http://www.opengl.org/registry/specs/ARB/occlusion_query.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/occlusion_query.txt?referer=');">ARB_occlusion_query</a>. Occlusion queries are very simple: when you draw an object with an occlusion query enabled, the query returns the number of samples that passed the depth test (or simply returns true or false based on whether any samples of the object passed the depth test, as provided by the OpenGL extension <a title="GL_ARB_occlusion_query2" href="http://www.opengl.org/registry/specs/ARB/occlusion_query2.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/occlusion_query2.txt?referer=');">ARB_occlusion_query2</a>).</p>
<p>So actually performing occlusion culling using occlusion queries means simply the following:</p>
<ol>
<li>Draw the object while occlusion query is enabled.</li>
<li>If the query result is that the object is visible then draw the object.</li>
</ol>
<p>At first, this may sound stupid as you have to draw the object in order to tell whether it is visible or not. While in this form it really sounds silly, in practice occlusion queries can save a lot of work for the GPU. Suppose you have a complex object with several thousand triangles. To determine its visibility using an occlusion query you would simply render e.g. the bounding box of the object, and if the bounding box is visible (the occlusion query reports that some samples have passed) then the object itself is most probably visible. This way you can save the GPU from unnecessarily processing a large amount of geometry.</p>
<p>I have to mention here that I intentionally used the expression &#8220;most probably visible&#8221; as occlusion queries provide just a conservative estimate on whether the object is visible or not rather than an exact result. This is because the bounding box occupies a different (larger) portion of the screen than the original geometry. So what we expect from an occlusion culling algorithm is to give one of the following results: the object is not visible or the object is most probably visible. The bigger this probability is the better the occlusion culling effectiveness is.</p>
<p>While we would always want an occlusion culling algorithm to be as effective as possible usually we have to make a trade-off between effectiveness and efficiency. In the above example if we would like to have 100% effectiveness then we would have to draw the whole object and that would defeat most of the goals of occlusion culling. The algorithm presented in this article is somewhat even more conservative but enables the use of occlusion culling for much larger datasets.</p>
<h2>Motivation</h2>
<p>While hardware accelerated occlusion queries are a powerful tool for visibility determination, they put a considerable burden on the application, which has to manage the occlusion queries and draw the objects based on the results once they are available (taking into consideration the asynchronous nature of occlusion queries). The most naive use of occlusion queries would be to execute the query right before we have to draw the object. While this seems like a feasible idea, it does not perform well in practice as the CPU has to stall until the result of the query is available, which also leaves the GPU idle and thus results in unacceptable performance. In order to resolve this, the application has to fill the time between the query execution and the drawing of the object based on the query result. While there are techniques to accomplish this, it definitely comes at a cost as the implementation becomes more complex.</p>
<p>The aforementioned problem is somewhat resolved by using conditional rendering introduced in OpenGL 3 (<a title="GL_NV_conditional_render" href="http://www.opengl.org/registry/specs/NV/conditional_render.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/NV/conditional_render.txt?referer=');">NV_conditional_render</a> extension). However, all this extension does is the following: if the results of the query are not yet available, the object is simply drawn regardless of whether it is visible or not. This avoids stalling the rendering pipeline and can be done in software if the extension is not available; however, it somewhat defeats the purpose of occlusion culling.</p>
<p>Another drawback of occlusion queries is that CPU intervention is still needed to make the decision about the visibility of the object. For today&#8217;s hardware, where proper batching is one of the most crucial aspects of the renderer, such an approach is rather inefficient.</p>
<p>The occlusion culling technique presented in this article solves both of these issues by providing an implementation that is very simple to integrate into any renderer, puts little to no burden on the renderer and makes the decision about the visibility of objects entirely on the GPU.</p>
<h2>The algorithm</h2>
<p>As in the case of many other GPU based culling algorithms presented by me and others, hierarchical-Z map based occlusion culling uses the geometry shader&#8217;s ability to suppress the emission of primitives that are determined to be invisible in the final rendering. The shader will only emit data for those objects that are visible and this data is streamed out into a buffer object using transform feedback.</p>
<p>The algorithm itself is similar in spirit to the hierarchical Z testing that is implemented in modern GPUs. After rendering all the occluders in the scene, we construct a hierarchical depth image from the depth buffer which we will refer to as the Hi-Z map. This texture map is a mip-mapped, screen resolution image where each texel in mip level <em>i</em> contains the maximum depth of all corresponding texels in mip level <em>i-1</em>. This depth information can be collected during the main rendering pass for the occluding objects as we need a texture of the same resolution so we don&#8217;t need a separate depth pass. This can be simply accomplished using OpenGL framebuffer objects.</p>
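<p>The relation between two neighbouring mip levels can be sketched on the CPU as follows. This is only an illustration under my own assumptions, not the code used by the demo: <code>DepthLevel</code> and <code>reduceMax</code> are hypothetical names, depth values are assumed to lie in [0, 1], and level sizes are halved with rounding down, so the last row and column of an odd-sized level are folded into the last texel of the next level.</p>
<pre class="brush:cpp">#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// A single mip level stored row by row.
struct DepthLevel {
    int width;
    int height;
    std::vector&lt;float&gt; texels;
    float at(int x, int y) const { return texels[static_cast&lt;std::size_t&gt;(y) * width + x]; }
};

// Builds mip level i from level i-1 by storing the maximum depth of the
// covered texels. For odd-sized levels the last column/row of the result
// also covers the extra edge texels of the previous level.
inline DepthLevel reduceMax(const DepthLevel&amp; prev) {
    assert(prev.width &gt; 0 &amp;&amp; prev.height &gt; 0);
    DepthLevel next;
    next.width = std::max(prev.width / 2, 1);
    next.height = std::max(prev.height / 2, 1);
    next.texels.resize(static_cast&lt;std::size_t&gt;(next.width) * next.height);
    for (int y = 0; y &lt; next.height; ++y) {
        // inclusive range of source rows covered by destination row y
        int y0 = 2 * y;
        int y1 = (y == next.height - 1) ? prev.height - 1 : 2 * y + 1;
        for (int x = 0; x &lt; next.width; ++x) {
            int x0 = 2 * x;
            int x1 = (x == next.width - 1) ? prev.width - 1 : 2 * x + 1;
            float maxZ = prev.at(x0, y0);
            for (int sy = y0; sy &lt;= y1; ++sy)
                for (int sx = x0; sx &lt;= x1; ++sx)
                    maxZ = std::max(maxZ, prev.at(sx, sy));
            next.texels[static_cast&lt;std::size_t&gt;(y) * next.width + x] = maxZ;
        }
    }
    return next;
}</pre>
<p>Calling <code>reduceMax</code> repeatedly until a 1&#215;1 level is reached yields the whole pyramid, and a CPU version like this can serve as a reference when validating the output of the GPU passes.</p>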
<p>After the construction of the Hi-Z map, occlusion culling can be performed by comparing the depth value of the object&#8217;s bounding volume with the depth information stored in the Hi-Z map. This is where the hierarchical mip-mapped structure of the Hi-Z map comes in handy, as we can do conservative depth comparisons with fewer texture fetches by sampling directly from a particular mip level.</p>
<p>This is why we constructed the Hi-Z map using a &#8220;store maximum depth&#8221; policy. This works with the usual depth buffer setup where the depth comparison function is either LESS or LEQUAL. For a reversed depth buffer, where the comparison is GREATER or GEQUAL, the &#8220;store minimum depth&#8221; policy has to be used.</p>
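<p>The conservative comparison itself can be sketched as below. This is a simplified CPU illustration and not the shader used by the demo: <code>isOccluded</code> and its parameters are hypothetical, the level stores maximum depths, and smaller depth values are assumed to be closer to the camera.</p>
<pre class="brush:cpp">#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

// 'hiZ' is one row-major width x height level of the Hi-Z map. The inclusive
// texel rectangle [x0, x1] x [y0, y1] is the area covered by the object's
// bounding volume and 'nearestDepth' is the smallest depth of that volume.
inline bool isOccluded(const std::vector&lt;float&gt;&amp; hiZ, int width, int height,
                       int x0, int y0, int x1, int y1, float nearestDepth) {
    assert(width &gt; 0 &amp;&amp; height &gt; 0);
    assert(hiZ.size() == static_cast&lt;std::size_t&gt;(width) * height);
    // clamp the rectangle to the level to handle partially off-screen volumes
    x0 = std::max(x0, 0);
    y0 = std::max(y0, 0);
    x1 = std::min(x1, width - 1);
    y1 = std::min(y1, height - 1);
    // nothing to compare against: stay conservative and report "not occluded"
    if (x0 &gt; x1 || y0 &gt; y1) return false;
    float farthestOccluder = 0.0f;
    for (int y = y0; y &lt;= y1; ++y)
        for (int x = x0; x &lt;= x1; ++x)
            farthestOccluder = std::max(farthestOccluder,
                                        hiZ[static_cast&lt;std::size_t&gt;(y) * width + x]);
    // hidden only if even the closest point of the volume lies behind the
    // farthest occluder depth stored for the covered area
    return nearestDepth &gt; farthestOccluder;
}</pre>
<p>Picking a coarser mip level shrinks the rectangle to a handful of texels, which is what keeps the number of fetches low; the price is that the stored maximum covers a larger screen area and the test becomes more conservative.</p>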
<h3>Hi-Z map construction</h3>
<p>In case of single-sample rendering, one can use the Hi-Z map as the main depth buffer for rendering the scene. The technique extends also to multi-sampled rendering but in this case a separate full-screen quad pass is needed to calculate the maximum depth of each individual sample in the multi-sampled depth buffer and store it in the single-sampled Hi-Z map. This is possible since OpenGL 3.2 or using the extension <a title="GL_ARB_texture_multisample" href="http://www.opengl.org/registry/specs/ARB/texture_multisample.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_multisample.txt?referer=');">ARB_texture_multisample</a>. Besides this additional step, the algorithm remains the same.</p>
<p>The Hi-Z map can be constructed using OpenGL framebuffer objects by rendering a full-screen quad pass for each mip level where the previous mip level is bound as the input texture and the current mip level is bound as the render target. As OpenGL allows rendering from and to the same texture object as long as we don&#8217;t access the same mip level for both reading and writing, the algorithm simply looks like the following:</p>
<pre class="brush:cpp">// bind depth texture
glBindTexture(GL_TEXTURE_2D, depthTexture);
// calculate the number of mipmap levels for NPOT texture
int numLevels = 1 + (int)floorf(log2f(fmaxf(SCREEN_WIDTH, SCREEN_HEIGHT)));
int currentWidth = SCREEN_WIDTH;
int currentHeight = SCREEN_HEIGHT;
for (int i=1; i&lt;numLevels; i++) {
  // calculate next viewport size
  currentWidth /= 2;
  currentHeight /= 2;
  // ensure that the viewport size is always at least 1x1
  currentWidth = currentWidth &gt; 0 ? currentWidth : 1;
  currentHeight = currentHeight &gt; 0 ? currentHeight : 1;
  glViewport(0, 0, currentWidth, currentHeight);
  // bind next level for rendering but first restrict fetches only to previous level
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, i-1);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, i-1);
  glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                         GL_TEXTURE_2D, depthTexture, i);
  // draw full-screen quad
  ............
}
// reset mipmap level range for the depth image
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, 0);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL, numLevels-1);</pre>
<p>It is very important not to forget the step that ensures the viewport size is always at least 1&#215;1: with non-power-of-two (NPOT) textures, integer rounding can bring one dimension down to zero before the other one reaches 1. I forgot this at first and spent an hour wondering why my last mip level didn&#8217;t get filled.</p>
<p>While one may wonder how this technique can be efficient after so many full-screen quad passes, it is in fact very efficient and it constructs the Hi-Z map on my Radeon HD5770 in less than <strong>0.2 milliseconds</strong>. The measurement should be quite accurate as I&#8217;ve done it using OpenGL timer queries (see the extension <a title="GL_ARB_timer_query" href="http://www.opengl.org/registry/specs/ARB/timer_query.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/timer_query.txt?referer=');">ARB_timer_query</a>).</p>
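<p>The viewport size calculation from the construction loop can be checked in isolation with a small helper like the one below. The function name <code>mipChainSizes</code> is hypothetical; it applies the same halving and clamping as the loop and returns the size of every level.</p>
<pre class="brush:cpp">#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;utility&gt;
#include &lt;vector&gt;

// Returns the (width, height) of every mip level, starting with level 0.
// Each dimension is halved with integer division and clamped to 1, so the
// chain always ends with a 1x1 level, even for NPOT sizes.
inline std::vector&lt;std::pair&lt;int, int&gt;&gt; mipChainSizes(int width, int height) {
    assert(width &gt; 0 &amp;&amp; height &gt; 0);
    std::vector&lt;std::pair&lt;int, int&gt;&gt; sizes;
    sizes.emplace_back(width, height);
    while (width &gt; 1 || height &gt; 1) {
        width = std::max(width / 2, 1);
        height = std::max(height / 2, 1);
        sizes.emplace_back(width, height);
    }
    return sizes;
}</pre>
<p>Taking a 1280&#215;720 depth texture as an example (a resolution picked only for illustration), the chain has 11 levels, which matches the <code>numLevels</code> formula above. Level 9 is 2&#215;1, so without the clamp the height of the last level would become zero.</p>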
<p>The fragment shader used for the construction of the Hi-Z map is very straightforward except for one thing. We use an NPOT depth texture due to the aspect ratio of the window, and NPOT textures use a &#8220;floor&#8221; convention to determine the size of subsequent mip levels (see the extension <a title="GL_ARB_texture_non_power_of_two" href="http://www.opengl.org/registry/specs/ARB/texture_non_power_of_two.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_non_power_of_two.txt?referer=');">ARB_texture_non_power_of_two</a>). We therefore need predicated fetches: when reducing an odd-sized mip level we must not forget about the edge texels:</p>
<pre class="brush:c">#version 400 core

uniform sampler2D LastMip;
uniform ivec2 LastMipSize;

in vec2 TexCoord;

void main(void)
{
  vec4 texels;
  texels.x = texture( LastMip, TexCoord ).x;
  texels.y = textureOffset( LastMip, TexCoord, ivec2(-1, 0) ).x;
  texels.z = textureOffset( LastMip, TexCoord, ivec2(-1,-1) ).x;
  texels.w = textureOffset( LastMip, TexCoord, ivec2( 0,-1) ).x;

  float maxZ = max( max( texels.x, texels.y ), max( texels.z, texels.w ) );

  vec3 extra;
  // if we are reducing an odd-width texture then fetch the edge texels
  if ( ( (LastMipSize.x &amp; 1) != 0 ) &amp;&amp; ( int(gl_FragCoord.x) == LastMipSize.x-3 ) ) {
    // if both edges are odd, fetch the top-left corner texel
    if ( ( (LastMipSize.y &amp; 1) != 0 ) &amp;&amp; ( int(gl_FragCoord.y) == LastMipSize.y-3 ) ) {
      extra.z = textureOffset( LastMip, TexCoord, ivec2( 1, 1) ).x;
      maxZ = max( maxZ, extra.z );
    }
    extra.x = textureOffset( LastMip, TexCoord, ivec2( 1, 0) ).x;
    extra.y = textureOffset( LastMip, TexCoord, ivec2( 1,-1) ).x;
    maxZ = max( maxZ, max( extra.x, extra.y ) );
  } else
  // if we are reducing an odd-height texture then fetch the edge texels
  if ( ( (LastMipSize.y &amp; 1) != 0 ) &amp;&amp; ( int(gl_FragCoord.y) == LastMipSize.y-3 ) ) {
    extra.x = textureOffset( LastMip, TexCoord, ivec2( 0, 1) ).x;
    extra.y = textureOffset( LastMip, TexCoord, ivec2(-1, 1) ).x;
    maxZ = max( maxZ, max( extra.x, extra.y ) );
  }

  gl_FragDepth = maxZ;
}</pre>
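<p>When debugging a shader like this, it can help to have a CPU-side reference of the result the reduction is expected to produce. The following is only a sketch under my own assumptions (row-major float arrays, hypothetical function names); it is not code from the demo. Each destination texel stores the maximum depth of the 2&#215;2 source texels it covers, and the last column and row additionally take the leftover edge texels of an odd-sized source level.</p>

```c
#include <stddef.h>

static float max2(float a, float b) { return a > b ? a : b; }

/* Reduces a src_w x src_h depth level (row-major) into the next level of
   size max(src_w/2, 1) x max(src_h/2, 1). Every destination texel gets
   the maximum depth of all source texels it covers. */
void reduce_depth_max(const float *src, int src_w, int src_h, float *dst)
{
    int dst_w = src_w / 2 > 0 ? src_w / 2 : 1;
    int dst_h = src_h / 2 > 0 ? src_h / 2 : 1;

    for (int y = 0; y < dst_h; y++) {
        for (int x = 0; x < dst_w; x++) {
            /* Regular 2x2 footprint, clamped for 1-texel-wide sources. */
            int x0 = 2 * x, y0 = 2 * y;
            int x1 = x0 + 1 < src_w ? x0 + 1 : x0;
            int y1 = y0 + 1 < src_h ? y0 + 1 : y0;
            /* The last column/row also covers the leftover edge texels
               of an odd-sized source level. */
            if (x == dst_w - 1) x1 = src_w - 1;
            if (y == dst_h - 1) y1 = src_h - 1;

            float m = src[(size_t)y0 * src_w + x0];
            for (int sy = y0; sy <= y1; sy++)
                for (int sx = x0; sx <= x1; sx++)
                    m = max2(m, src[(size_t)sy * src_w + sx]);
            dst[(size_t)y * dst_w + x] = m;
        }
    }
}
```

<p>Comparing the output of such a reference against a read-back of the GPU generated mip levels is a quick way to check that no edge texel is skipped.</p>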
<p>I experimented with texture gather lookups to reduce the number of texture fetches from 4-to-7 per fragment down to 1-to-3 per fragment (see the extension <a title="GL_ARB_texture_gather" href="http://www.opengl.org/registry/specs/ARB/texture_gather.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/texture_gather.txt?referer=');">ARB_texture_gather</a>). However, it seems that texture gather works only if the image is linearly sampled. To avoid the additional burden of switching filtering state during rendering I stuck to simple texture lookups, as texture gather did not show any visible effect on the construction time of the Hi-Z map.</p>
<div class="wp-caption aligncenter" style="width: 602px"><img title="Various mip levels of the Hi-Z map" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/depth-lods.png" alt="Various mip levels of the Hi-Z map" width="592" height="144" /><p class="wp-caption-text">Various mip levels of the Hi-Z map. The Hi-Z map size is 1024x768 and the displayed mip levels are: level 4 (left), level 5 (middle) and level 6 (right).</p></div>
<p>For debugging and demonstration purposes the Mountains demo has a built-in function to display the content of the various mip levels of the Hi-Z map. This is available by pressing the F4 key while Hi-Z map based occlusion culling is enabled. The + and &#8211; keys can be used to switch between the mip levels.</p>
<p>In order to better visualize the depth information in the depth buffer I converted the non-linear depth values stored in the depth texture into linear depth values as presented in <a title="[GeeXLab] How to Visualize the Depth Buffer in GLSL" href="http://www.geeks3d.com/20091216/geexlab-how-to-visualize-the-depth-buffer-in-glsl/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.geeks3d.com/20091216/geexlab-how-to-visualize-the-depth-buffer-in-glsl/?referer=');">[GeeXLab] How to Visualize the Depth Buffer in GLSL</a>.</p>
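<p>As an illustration of what such a conversion looks like, here is a sketch of the textbook linearization for a standard OpenGL perspective projection with the default depth range. This is not necessarily the exact formula used in the linked article or in the demo, and the function names are my own.</p>

```c
/* Converts a depth-buffer value d in [0, 1] to eye-space distance,
   assuming a standard OpenGL perspective projection and the default
   glDepthRange(0, 1). */
float linearize_depth(float d, float z_near, float z_far)
{
    float z_ndc = 2.0f * d - 1.0f;   /* [0, 1] -> [-1, 1] */
    return (2.0f * z_near * z_far) /
           (z_far + z_near - z_ndc * (z_far - z_near));
}

/* Maps the linear distance to [0, 1] so it can be shown as a gray level. */
float depth_to_gray(float d, float z_near, float z_far)
{
    return (linearize_depth(d, z_near, z_far) - z_near) / (z_far - z_near);
}
```

<p>With a near plane at 1 and a far plane at 100, a stored depth of 0.5 corresponds to a distance of only about 1.98, which shows how strongly the raw values are concentrated near the camera and why the unconverted image looks almost uniformly white.</p>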
<h3>Culling with the Hi-Z map</h3>
<p>Once we have constructed the Hi-Z map, we can perform the actual occlusion culling by fetching the 2&#215;2 texel neighborhood corresponding to the screen area occupied by the bounding volume of the object whose visibility has to be determined. In the demo I used bounding boxes but any other bounding volume can be used (e.g. a bounding sphere is usually accurate enough for this technique).</p>
<p>First, we have to calculate the clip space bounding rectangle of the bounding volume. In the bounding box case this is done by transforming the bounding box vertices into clip space and then calculating the minimum and maximum X and Y coordinates. This bounding rectangle will be used for two things: it defines the texture coordinates that we&#8217;ll have to use for the Hi-Z map lookup and it helps determine the appropriate LOD for the texture lookup.</p>
<p>In order to determine the texture LOD that we&#8217;ll have to fetch we have to calculate the screen space size of the bounding square corresponding to the clip space bounding rectangle determined previously. This can be simply done by calculating the width and height of the bounding rectangle in clip space and then transforming this into screen space:</p>
<pre class="brush:c">float ViewSizeX = (BoundingRect[1].x-BoundingRect[0].x) * Transform.Viewport.y;
float ViewSizeY = (BoundingRect[1].y-BoundingRect[0].y) * Transform.Viewport.z;</pre>
<p>After this, the texture LOD can be simply calculated using the following formula:</p>
<pre class="brush:c">float LOD = ceil( log2( max( ViewSizeX, ViewSizeY ) / 2.0 ) );</pre>
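<p>To make the two formulas above concrete, here is a small CPU-side sketch in C. It assumes the bounding rectangle is already expressed in [0, 1] texture coordinates; the parameter names and the clamping of the result to the available mip levels are my own additions. The loop computes the same value as <code>ceil(log2(size / 2))</code> for sizes above two pixels without needing the math library.</p>

```c
/* Returns the Hi-Z mip level at which the bounding rectangle covers at
   most 2 texels along its larger side. The rectangle is given in [0, 1]
   texture coordinates, the viewport size in pixels. */
int hiz_lod(float min_x, float min_y, float max_x, float max_y,
            float viewport_w, float viewport_h, int num_levels)
{
    float size_x = (max_x - min_x) * viewport_w;
    float size_y = (max_y - min_y) * viewport_h;
    float size = size_x > size_y ? size_x : size_y;

    /* Halve the size until it fits into 2 texels; for size > 2 this
       equals ceil(log2(size / 2)). Smaller rectangles use level 0. */
    int lod = 0;
    while (size > 2.0f) {
        size *= 0.5f;
        lod++;
    }
    /* Hypothetical safety clamp to the last level of the mip chain. */
    return lod < num_levels ? lod : num_levels - 1;
}
```

<p>For example, on a 1024&#215;768 Hi-Z map a rectangle that is 100 pixels wide ends up at level 6, where it spans 100 / 64, that is, fewer than two texels.</p>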
<p>Finally, as we have the texture coordinates (the vertices of the clip space bounding rectangle) and the texture LOD, we simply have to make four texture lookups into the Hi-Z map using these parameters, calculate the maximum of the four depth values returned and compare it to the depth value corresponding to the object (this is the depth of the object&#8217;s front-most point, which also comes from the clip space coordinates of the bounding box). If the object depth is greater than the reference depth, the object is occluded and so it is culled by the geometry shader as usual.</p>
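<p>The comparison described above can be sketched on the CPU as follows. This is only an illustration under my own assumptions: the <code>HizLevel</code> structure, the nearest-texel fetch with clamp-to-edge addressing and all names are hypothetical, while in the demo the lookups are done in the shader.</p>

```c
typedef struct {
    const float *texels;  /* row-major depth values of one mip level */
    int width;
    int height;
} HizLevel;

/* Nearest-texel fetch with clamp-to-edge addressing; u, v in [0, 1]. */
static float hiz_fetch(const HizLevel *level, float u, float v)
{
    int x = (int)(u * (float)level->width);
    int y = (int)(v * (float)level->height);
    if (x < 0) x = 0;
    if (y < 0) y = 0;
    if (x > level->width - 1) x = level->width - 1;
    if (y > level->height - 1) y = level->height - 1;
    return level->texels[y * level->width + x];
}

/* Returns 1 if the object can be culled: its front-most depth lies
   behind the farthest occluder depth found at the four corners of its
   bounding rectangle. */
int is_occluded(const HizLevel *level,
                float min_u, float min_v, float max_u, float max_v,
                float object_depth)
{
    float d0 = hiz_fetch(level, min_u, min_v);
    float d1 = hiz_fetch(level, max_u, min_v);
    float d2 = hiz_fetch(level, min_u, max_v);
    float d3 = hiz_fetch(level, max_u, max_v);

    float max_depth = d0;
    if (d1 > max_depth) max_depth = d1;
    if (d2 > max_depth) max_depth = d2;
    if (d3 > max_depth) max_depth = d3;

    return object_depth > max_depth;
}
```

<p>Taking the maximum keeps the test conservative: a single texel that contains the far plane (depth 1.0) is enough to keep the object visible.</p>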
<p>One may ask why we use a 2&#215;2 texel footprint for calculating the reference depth value instead of just fetching the next mip level once (as there we also get the maximum of a 2&#215;2 texel footprint due to the Hi-Z map construction method). That&#8217;s what I asked myself at first as well, but I quickly figured out the reason (see the figure below).</p>
<div class="wp-caption aligncenter" style="width: 530px"><img class=" " title="Comparison of four texel fetches and one texel fetch for depth comparison" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/fetch-modes.png" alt="Comparison of four texel fetches and one texel fetch for depth comparison" width="520" height="256" /><p class="wp-caption-text">Comparison of number of fetches used for occlusion culling. Both figures show the magnified screen coverage of a single Hi-Z map texel at mip level N, texel coverage for mip level N-1 is in cyan and texel coverage for mip level N-2 is in blue. The object is shown in red and yellow indicates the fetched texels.</p></div>
<p>With four texels, not only is the determination of the texture LOD much easier, but the footprint also encompasses the actual object bounding rectangle more tightly. With a single texture fetch the computation of the texture LOD is more complicated and expensive, but the main problem is that a larger LOD has to be fetched, and it is not always just one level above the LOD determined in the four-fetch case. In the most extreme situation (if the bounding rectangle is right in the middle of the screen) we may have to fetch the largest LOD. This does not result in any false culling but it severely degrades the effectiveness of the culling.</p>
<p>Of course, it is possible to use a more complex screen space bounding polygon as well as more fetches, but the gain in culling effectiveness would not be worth the additional burden of the more expensive operations.</p>
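<p>The extreme case can be checked with a bit of integer arithmetic. The sketch below looks at one axis only and assumes a power-of-two sized Hi-Z map for simplicity; both helper functions are hypothetical and exist only for this illustration. Pixel column <code>x</code> belongs to texel <code>x &gt;&gt; level</code> at a given mip level.</p>

```c
/* Smallest mip level at which the pixel range x0..x1 (0 <= x0 <= x1)
   falls inside a single texel. */
int single_texel_level(int x0, int x1)
{
    int level = 0;
    while ((x0 >> level) != (x1 >> level))
        level++;
    return level;
}

/* Smallest mip level at which the pixel range x0..x1 (0 <= x0 <= x1)
   spans at most two adjacent texels. */
int two_texel_level(int x0, int x1)
{
    int level = 0;
    while ((x1 >> level) - (x0 >> level) > 1)
        level++;
    return level;
}
```

<p>On a 1024 pixel wide map, an object covering only the two pixel columns 511 and 512 straddles the center line of every mip level below the last one, so a single fetch has to go all the way up to level 10 (the 1&#215;1 level), while two texels per axis already cover it at level 0.</p>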
<h2>Conclusion</h2>
<p>We&#8217;ve seen how traditional hardware occlusion culling works by using occlusion queries. We also discussed that we sometimes need a better algorithm that does the occlusion culling for a large number of objects without CPU intervention.</p>
<p>The article also described a way to implement such an occlusion culling algorithm by using a hierarchical-Z map and geometry shaders. We&#8217;ve also managed to provide a reference implementation in the form of the demo called Mountains that can be downloaded with full source code in the <a title="OpenGL 4.0 - Mountains demo download" href="http://rastergrid.com/blog/downloads/mountains-demo/">downloads section</a>.</p>
<p>The algorithm performs very well in practice on current hardware. The Hi-Z map construction takes less than 0.2 milliseconds and the actual culling comes at almost no cost even for thousands of objects. For more details about the performance comparison between rendering with and without hierarchical-Z map based occlusion culling, read the article about the <a title="OpenGL 4.0 - Mountains demo released" href="http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/">OpenGL 4.0 Mountains Demo</a>.</p>
<p>While the demo uses the technique only for culling instances of the same object, the technique can be easily extended to work for a heterogeneous set of objects, as the actual culling algorithm works on a per-object basis and is completely independent of the method used for rendering the actual geometry.</p>
<p>This technique can be thought of as the next step towards a completely GPU based visibility determination and scene management system.</p>
<p>Acknowledgements go to Jeremy Shopf, Joshua Barczak, Christopher Oat and Natalya Tatarchuk and their <a title="SIGGRAPH 2008 Course Notes about the March of the Froblins" href="http://developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf?referer=');">SIGGRAPH 2008 Course Notes</a> that inspired this work.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/feed/</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>OpenGL 4.0 &#8211; Mountains demo released</title>
		<link>http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/</link>
		<comments>http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/#comments</comments>
		<pubDate>Mon, 11 Oct 2010 21:19:21 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[culling]]></category>
		<category><![CDATA[geometry instancing]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GLEW]]></category>
		<category><![CDATA[GLM]]></category>
		<category><![CDATA[GLSL]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[LOD]]></category>
		<category><![CDATA[occlusion culling]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[SFML]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=339</guid>
		<description><![CDATA[OpenGL 3.0 capable GPUs introduced a level of processing power and programming flexibility that isn&#8217;t comparable with any earlier generations. After that, OpenGL 4.0 and the hardware supporting it even further pushed the limits of what previously seemed to be impossible. Thanks to these features nowadays more and more possibilities are available for the graphics]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 210px"><a href="http://rastergrid.com/blog/wp-content/uploads/2010/10/mountains.png"><img class="  " title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-thumb.png" alt="OpenGL 4.0 - Mountains demo" width="200" height="150" /></a><p class="wp-caption-text">OpenGL 4.0 - Mountains demo</p></div>
<p>OpenGL 3.0 capable GPUs introduced a level of processing power and programming flexibility that isn&#8217;t comparable with any earlier generation. After that, OpenGL 4.0 and the hardware supporting it pushed the limits even further, into what previously seemed impossible. Thanks to these features, more and more possibilities are now available for graphics developers to implement GPU based scene management and culling algorithms. The Mountains demo showcases some of these rendering techniques that, as far as I know, have never been implemented using OpenGL before. In this article I will present the key features of the demo that will be discussed in more detail in subsequent articles. Demo binaries with full source code are also published.</p>
<p><span id="more-339"></span>The demo itself is mainly inspired by the <a title="March of the Froblins" href="http://developer.amd.com/samples/demos/pages/froblins.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/samples/demos/pages/froblins.aspx?referer=');">March of the Froblins</a> demo released by AMD and the <a title="Chapter03-SBOT-March_of_The_Froblins.pdf" href="http://developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/documentation/presentations/legacy/Chapter03-SBOT-March_of_The_Froblins.pdf?referer=');">SIGGRAPH 2008 Course Notes</a> by Jeremy Shopf, Joshua Barczak, Christopher Oat and Natalya Tatarchuk presenting the actual implementation in detail. That demo targeted the Radeon HD4800 series and presented several practical GPU based culling algorithms implemented using DirectX10. The Mountains demo implements these techniques in OpenGL and further improves the technique used in AMD&#8217;s demo by unleashing the new features introduced by Shader Model 5.0 hardware and OpenGL 4.0.</p>
<p>While this article briefly presents the demo and the used rendering techniques, the details of each individual technique will be presented in subsequent articles as the thorough examination of them needs a longer discussion that would render this article simply too long and overwhelming.</p>
<h2>Introduction</h2>
<p>The Mountains demo renders a tiled terrain block with thousands of high detail tree models (the full detail tree model is over five thousand triangles). Because the view distance used in the demo is quite large, several tiles of the terrain block are potentially visible on the screen, which results in a huge explosion in the number of triangles the GPU has to render. Also, with traditional methods the rendering of the terrain blocks and the several thousand tree models would need loads of draw calls. In order to solve this problem, the demo renders the trees using geometry instancing to minimize the number of draw calls.</p>
<p>In a traditional rendering engine CPU based culling methods would be used. While that would work in practice, it is more convenient to perform the culling on the GPU as all the information needed to do it is available there. Moreover, culling is a typical algorithm that can easily take advantage of the highly parallel architecture of the GPU. Also, performing the culling on the CPU would make geometry instancing barely beneficial.</p>
<p>Another problem with a scene like this is that simple per-object view frustum culling would not solve the problem completely, as most of the trees in the view frustum are not visible because they are hidden by the terrain. In traditional OpenGL the way to solve this problem would be to use per-object occlusion queries and render bounding volumes. While this may work in practice, it involves too much CPU intervention even if we take advantage of conditional rendering, and it also breaks instancing.</p>
<p>These are the issues that motivated me in creating this demo and I established the following goals for the project:</p>
<div class="wp-caption alignright" style="width: 210px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains2.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains2.png?referer=');"><img class="  " title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains2-thumb.png" alt="View from above" width="200" height="150" /></a><p class="wp-caption-text">View from above</p></div>
<ul>
<li>All the object-level information must stay on the GPU and the CPU should not make decisions on a per-object basis.</li>
<li>The renderer should use as few draw calls as possible in order to solve the problem of visibility determination.</li>
<li>Don&#8217;t draw anything that is not inside the view frustum or is occluded by terrain.</li>
</ul>
<p>The result is a renderer that does little to no scene management on the CPU and instead uses the GPU for visibility determination. In most cases it is able to reduce the scene&#8217;s geometric complexity from over 400 million triangles to under one million triangles, providing an interactive experience on a Radeon HD5770 at around 200 frames per second.</p>
<h2>Implementation</h2>
<p>The scene consists of a tiled terrain with over 130 thousand triangles and more than 1400 tree instances, each with almost 6 thousand triangles. This sums up to roughly 8 million triangles for a single tile block of terrain. As the view range needs to be quite large, we actually deal with a 7&#215;7 grid of terrain tiles that is dynamically placed so that the camera always resides in the middle block. All this means that even though we dynamically generate the scenery around the camera, we still have to deal with a scene consisting of over 400 million triangles. This is simply too much for the GPU to deal with.</p>
<p>The first step in reducing the geometric complexity of the scene is performed on the CPU: view frustum culling on a per-terrain-block basis. This limits our 7&#215;7 tile to a smaller subset that contains only those blocks that lie within the view frustum. The result is a scene of usually around 50 million triangles.</p>
<p>While this is already a reasonable amount of simplification, we have to do per-object culling in order to further reduce the amount of geometry to render. But as mentioned before, we would not like to do such fine grained scene management on the CPU, so we need some more sophisticated methods to do it on the GPU.</p>
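<p>The triangle budget can be double-checked with a few lines of C. The constants are the lower bounds quoted in the text (the tree count and the terrain triangle count are approximate), and 5811 is the triangle count of the full detail tree model.</p>

```c
/* Lower-bound figures quoted in the text. */
enum {
    TERRAIN_TRIANGLES_PER_TILE = 130000,
    TREES_PER_TILE             = 1400,
    TRIANGLES_PER_TREE         = 5811,  /* full detail tree model */
    TILES_PER_SIDE             = 7
};

/* Triangles of one terrain tile together with all of its trees. */
long long triangles_per_tile(void)
{
    return (long long)TERRAIN_TRIANGLES_PER_TILE +
           (long long)TREES_PER_TILE * TRIANGLES_PER_TREE;
}

/* Triangles of the whole 7x7 grid of tiles. */
long long triangles_in_scene(void)
{
    return triangles_per_tile() * TILES_PER_SIDE * TILES_PER_SIDE;
}
```

<p>This gives 8,265,400 triangles per tile and 405,004,600 triangles for the 49 tiles, in line with the figures above.</p>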
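<p>A per-tile test of this kind is commonly written as an axis-aligned bounding box versus frustum plane check. The following is a generic sketch of that well-known test, not the code of the demo: the <code>Plane</code> and <code>Aabb</code> types, the plane orientation convention and the function name are all assumptions made for this illustration.</p>

```c
/* Plane a*x + b*y + c*z + d = 0 with the normal pointing into the
   frustum, so points with a*x + b*y + c*z + d >= 0 are on the inside. */
typedef struct { float a, b, c, d; } Plane;

/* Axis-aligned bounding box of one terrain tile. */
typedef struct { float min[3], max[3]; } Aabb;

/* Returns 0 if the box lies completely outside at least one plane and
   1 otherwise. The test is conservative: a box that only intersects
   the frustum is reported as visible. */
int aabb_in_frustum(const Plane planes[6], const Aabb *box)
{
    for (int i = 0; i < 6; i++) {
        const Plane *p = &planes[i];
        /* Pick the box corner that lies farthest along the plane normal. */
        float x = p->a >= 0.0f ? box->max[0] : box->min[0];
        float y = p->b >= 0.0f ? box->max[1] : box->min[1];
        float z = p->c >= 0.0f ? box->max[2] : box->min[2];
        if (p->a * x + p->b * y + p->c * z + p->d < 0.0f)
            return 0;  /* even the farthest corner is outside */
    }
    return 1;
}
```

<p>With only 49 tiles this costs a handful of dot products per frame, so doing this coarse step on the CPU does not conflict with the goal of keeping per-object decisions on the GPU.</p>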
<p>In order to accomplish this, we will take advantage of the geometry shader&#8217;s capability of discarding geometry. We will use it to do the per-object decisions in order to cull the tree instances that are not visible. The three techniques implemented in the culling geometry shader and the accompanying vertex shader are the following:</p>
<ul>
<li><strong>Instance Cloud Reduction (ICR)</strong> &#8211; This method does view frustum culling on a per-instance basis based on the bounding box of the instanced geometry, in this case the tree. The technique was first presented in my previous article titled <a title="Instance culling using geometry shaders" href="http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/">Instance culling using geometry shaders</a> and then further improved according to the instructions presented in <a title="Instance Cloud Reduction reloaded" href="http://rastergrid.com/blog/2010/06/instance-cloud-reduction-reloaded/">Instance Cloud Reduction reloaded</a>. In this case, the technique allows us to do a more fine grained yet still high level view frustum culling of the tree instances than that allowed by the simple per-tile culling performed on the CPU.</li>
<li><strong>Hierarchical-Z Map based Occlusion Culling</strong> &#8211; This technique allows for conservative per-instance occlusion culling completely done and evaluated on the GPU using an algorithm similar to the one the hardware depth buffer uses to hierarchically reject fragments based on their depth values. Using this technique, a coarse occlusion culling can be performed on the instances without the need of occlusion queries and CPU intervention. <strong>Update!</strong> The technique is discussed in detail in the article <a title="Hierarchical-Z map based occlusion culling" href="http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/">Hierarchical-Z map based occlusion culling</a>.</li>
<li><strong>Dynamic Level-of-Detail Determination</strong> &#8211; This method allows us to dynamically select a suitable geometry level-of-detail on a per-instance basis completely on the GPU based on the application provided LOD parameters and the distance of the instance from the camera. The Mountains demo uses three LOD levels for the tree object: one with 5811 triangles, another with 2893 triangles and the lowest detailed version contains 1492 triangles. <strong>Update!</strong> The technical details of the algorithm are presented in the article <a title="GPU based dynamic geometry LOD" href="http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/">GPU based dynamic geometry LOD</a>.</li>
</ul>
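<p>As a rough illustration of the last item, distance based LOD selection boils down to comparing the camera distance against a couple of thresholds. The sketch below is plain C for readability; in the demo the decision is made on the GPU, and the threshold parameters here are hypothetical stand-ins for the application provided LOD parameters. The triangle counts are the ones listed above.</p>

```c
/* Triangle counts of the three tree LODs used by the demo. */
static const int TREE_LOD_TRIANGLES[3] = { 5811, 2893, 1492 };

/* Selects a LOD index from the distance between the instance and the
   camera. Instances closer than lod1_distance get the full detail
   model, instances beyond lod2_distance the lowest detail one. */
int select_lod(float distance, float lod1_distance, float lod2_distance)
{
    if (distance < lod1_distance) return 0;
    if (distance < lod2_distance) return 1;
    return 2;
}
```

<p>With thresholds of, say, 50 and 150 units, a tree 500 units away would be drawn with 1492 instead of 5811 triangles, roughly a quarter of the full geometry.</p>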
<p>While in the Mountains demo all these techniques are used to determine the visibility and the LOD of static scenery (as trees are unlikely to move), the truth is that these methods also apply to dynamic scenery with no modification. This is a very important thing to note, as dynamic objects are usually the ones that make many of the CPU based scene management and visibility determination algorithms difficult to use or simply inefficient.</p>
<p>The key improvement compared to how these techniques are used in AMD&#8217;s demo is that my implementation applies all the algorithms to the instance set in a single rendering pass compared to the several passes needed by the original implementation. This is because the Mountains demo takes advantage of the latest technologies introduced by OpenGL 4.0 and the supporting hardware (in this case the functionality provided by the extension <a title="GL_ARB_transform_feedback3" href="http://www.opengl.org/registry/specs/ARB/transform_feedback3.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/transform_feedback3.txt?referer=');">GL_ARB_transform_feedback3</a>).</p>
<p>By using these techniques the GPU is able to reduce the geometric complexity of the scene from 50 million triangles down to a few million, sometimes even under a million. Of course, the actual reduction efficiency is heavily influenced by the view position and direction.</p>
<p>Besides the scene management and visibility determination techniques, the demo also showcases a few simple visual effects:</p>
<div class="wp-caption alignright" style="width: 210px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains3.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains3.png?referer=');"><img title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains3-thumb.png" alt="View horizon and sky" width="200" height="150" /></a><p class="wp-caption-text">View horizon and sky</p></div>
<ul>
<li>A simple infinitely far skybox generated using a geometry shader.</li>
<li>Simple diffuse lighting applied to the tree instances.</li>
<li>Global illumination-like effect that simulates the terrain casting shadows over the trees even though no shadow rendering technique is applied.</li>
<li>Fog effect to smooth out the disappearance of the terrain at the far clip plane.</li>
<li>Simplistic fake depth-of-field effect that makes far away objects look blurry.</li>
</ul>
<p>Maybe I will present also some of these techniques in detail in another article if there is interest for it.</p>
<p>As I mentioned, I used a geometry shader to render the skybox and so I did when rendering full screen quads to apply image space algorithms. I&#8217;ve done this because I always feel kind of stupid when I have to put such a simple geometry like a skybox or a full screen quad into a vertex buffer. In these situations I feel like I would simply use immediate mode to draw that damn little piece of geometry but I want to stick to core OpenGL so I quickly change my mind. As a simple alternative, I rather used geometry shaders to emit these simple geometric objects that are used so often that I even wonder how OpenGL does not have e.g. a glDrawScreenQuad-like command. Of course, the geometry shaders don&#8217;t start by themselves so I used dummy draw commands to make the geometry shader do its job.</p>
<h2>Performance</h2>
<p>Now let&#8217;s see how our GPU based optimizations perform in practice. I&#8217;ve collected results from typical view positions from where a moderate number of trees are visible. The tests were done on a Radeon HD 5770. Other configuration parameters are not really relevant as the demo is clearly GPU bound as only a few state changes and render commands are executed on the CPU. Of course, this is kind of a synthetic demo as you would usually want to balance the workload between the CPU and the GPU but usually you have AI, physics and other things for the CPU so transferring as much work to the GPU as possible usually gives a great benefit.</p>
<div class="wp-caption aligncenter" style="width: 654px"><img class="   " src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-fps.png" alt="Performance comparison of various culling and LOD techniques in frames per second on a Radeon HD5770 (higher is better)" width="644" height="224" /><p class="wp-caption-text">Performance comparison of the demo in frames per second on a Radeon HD5770 (higher is better): no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).</p></div>
<p>As you can see in the figure above, using all the optimizations clearly benefits the frame rate of the demo, even though Hi-Z map based occlusion culling requires several additional draw passes for the construction of the Hi-Z map. It is also clearly visible that in a scene like this, where there are a lot of occluders, ICR is simply not sufficient on its own. One final note: dynamic LOD has a more significant effect without Hi-Z, as occlusion culling already removes the largest share of the instances.</p>
<div class="wp-caption aligncenter" style="width: 654px"><img src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains-mtris.png" alt="Amount of visible geometry after culling in millions of triangles: no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red)." width="644" height="224" /><p class="wp-caption-text">Amount of visible geometry after culling in millions of triangles: no culling (bottom), instance cloud reduction (middle), ICR + Hi-Z map based occlusion culling (top), no geometry LOD (blue), dynamic geometry LOD (red).</p></div>
<p>Our next chart shows the amount of geometry that is finally drawn after culling in millions of triangles. On this figure we see exactly the inverse of the previous chart and it is not surprising as obviously we have a geometry throughput bottleneck. It also clearly shows how important dynamic LOD is even if we don&#8217;t perform more sophisticated visibility determination algorithms.</p>
<table style="width: 100%;" border="0">
<tbody>
<tr>
<td></td>
<td style="text-align: center;"><strong>No LOD</strong></td>
<td style="text-align: center;"><strong>Dynamic LOD</strong></td>
</tr>
<tr>
<td><strong>No culling</strong></td>
<td style="text-align: center;">17 draw calls</td>
<td style="text-align: center;">19 draw calls</td>
</tr>
<tr>
<td><strong>Instance cloud reduction</strong></td>
<td style="text-align: center;">17 draw calls</td>
<td style="text-align: center;">19 draw calls</td>
</tr>
<tr>
<td><strong>ICR + Hi-Z map based occlusion culling</strong></td>
<td style="text-align: center;">27 draw calls</td>
<td style="text-align: center;">29 draw calls</td>
</tr>
</tbody>
</table>
<p>Finally, the table above lists the number of draw calls needed by each technique from the reference point of view. The techniques applied do not have a significant effect on the number of draw calls: we have a fixed number of draw calls plus two additional draw calls if we use LOD. The only exception is Hi-Z map based occlusion culling: as the Hi-Z map is a full mipmap chain, we need ten additional draw calls to generate all the mip levels.</p>
<h2>Conclusion</h2>
<p>The techniques presented are rather simple to implement and can provide huge performance increases. Moreover, they allow the renderer to offload even some of the object-level algorithms from the CPU to the GPU, and obviously this is the direction to go in the future.</p>
<p>We&#8217;ve also mostly met the goals set at the beginning. Not fully, of course, as the occlusion culling performed is a rather coarse culling method and does not completely eliminate all the instances that will not contribute to the final image.</p>
<h2>Future work</h2>
<p>While the implementation almost completely eliminates all need of CPU intervention during the rendering phase, I still had to use a few asynchronous queries to get the amount of visible instances for each geometry LOD, although the latency incurred by the use of query objects is hidden in the demo by rendering the skybox between the initiation of the queries and the retrieving of the results.</p>
<div class="wp-caption alignright" style="width: 210px"><a href="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains4.png" onclick="pageTracker._trackPageview('/outgoing/www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains4.png?referer=');"><img title="Click to enlarge" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/10/mountains4-thumb.png" alt="Deep in the forest" width="200" height="150" /></a><p class="wp-caption-text">Deep in the forest</p></div>
<p>As soon as we get atomic counters into core OpenGL and consequently when we&#8217;ll have drivers supporting it, I will further improve the technique using indirect rendering and atomic counters so even the need for these queries will be eliminated.</p>
<p>Additionally, as mentioned several times, I plan to write detailed articles about the individual techniques I used in the demo. I decided to go in this direction as a thorough description of all the details of the demo would be simply too long in one piece.</p>
<h2>Running the demo</h2>
<p>The demo uses OpenGL 4.0, so a Shader Model 5.0 capable graphics card is a must. Even though most of the techniques used could be implemented on OpenGL 3.x, this time I wanted to stick to GL 4.0 as I took advantage of its new features to improve the implementation even further.</p>
<p>First, don&#8217;t be alarmed if the demo runs at a very low frame rate after startup. This is because all GPU based optimizations are disabled by default.</p>
<p>You can use the SPACE key to switch between the various culling methods:</p>
<ul>
<li>No culling at all</li>
<li>Instance cloud reduction</li>
<li>ICR with Hi-Z map based occlusion culling</li>
</ul>
<p>Finally, you can turn dynamic LOD on and off using the F3 key.</p>
<p>There are a few other controls present in the demo that you may figure out if you read the code, but I don&#8217;t want to go into the details of them as they will be presented in the upcoming articles where I will present Hi-Z map based occlusion culling and dynamic LOD in detail. So stay tuned: <a title="Follow me on twitter" href="http://www.twitter.com/aqnuep" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.twitter.com/aqnuep?referer=');">follow me on twitter</a> or <a title="RSS Feeds" href="http://rastergrid.com/blog/feed/">subscribe to the RSS feed</a>.</p>
<p>The demo can be downloaded with full source code in the <a title="Downloads" href="http://rastergrid.com/blog/downloads/mountains-demo/">downloads section</a>.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/10/opengl-4-0-mountains-demo-released/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>History of hardware tessellation</title>
		<link>http://rastergrid.com/blog/2010/09/history-of-hardware-tessellation/</link>
		<comments>http://rastergrid.com/blog/2010/09/history-of-hardware-tessellation/#comments</comments>
		<pubDate>Wed, 29 Sep 2010 21:14:33 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[geometry shader]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[tessellation]]></category>
		<category><![CDATA[tessellation control shader]]></category>
		<category><![CDATA[tessellation evaluation shader]]></category>
		<category><![CDATA[tessellator]]></category>
		<category><![CDATA[transform feedback]]></category>
		<category><![CDATA[TruForm]]></category>
		<category><![CDATA[vertex shader]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=324</guid>
		<description><![CDATA[The introduction of Shader Model 5.0 hardware and the API support provided by OpenGL 4.0 made GPU based geometry tessellation a first class citizen in the latest graphics applications. While official support from all the commodity graphics card vendors and the relevant APIs is quite recent news, few people know that]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignleft" style="width: 260px"><img title="Geometry Tessellation" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/tessellation.png" alt="Geometry Tessellation" width="250" height="124" /><p class="wp-caption-text">Geometry Tessellation</p></div>
<p>The introduction of Shader Model 5.0 hardware and the API support provided by OpenGL 4.0 made GPU based geometry tessellation a first class citizen in the latest graphics applications. While official support from all the commodity graphics card vendors and the relevant APIs is quite recent news, few people know that hardware tessellation has a long history in the world of consumer graphics cards. In this article I would like to present a brief introduction to tessellation and discuss the evolution that resulted in what we can see in the latest technology demos and game titles.<br />
<span id="more-324"></span></p>
<h2>Introduction</h2>
<p>Geometry tessellation is a graphics technique used to amplify the geometric detail of a particular mesh. This is done by subdividing the polygons of the mesh into smaller polygons and, if needed, altering the position of the generated vertices to better fit the theoretical shape of the object that is being modeled by the mesh.</p>
<p>Tessellation has long been a commonly used technique in offline rendering software to add a greater level of realism to computer modeled objects, and it has often been used as a preprocessing technique for real-time graphics applications as well. However, due to the increased amount of geometry data, the usage of tessellated geometry was very limited in the early eras of real-time computer graphics, as it needed a huge amount of disk and memory storage as well as much higher processing capabilities in order to achieve interactive frame rates.</p>
<p>The key problem with tessellating offline and using the detailed mesh in real-time graphics is that even the latest generation of GPUs lacks the memory size and bandwidth needed to make this a practical approach, and that is before counting the additional cost of running a much larger dataset through possibly complex vertex processing steps like skeletal animation. Having the tessellation technology integrated into the GPU makes it possible to overcome most of these restrictions.</p>
<p>While hardware tessellation as a generic feature made its way into the relevant APIs only in the recent past, several earlier hardware generations already made efforts to popularize this technology. To present the evolution of hardware tessellation I will go through the relevant technologies in chronological order, which also helps to see why this great feature didn&#8217;t make its way into the core API specifications until now.</p>
<h2>TruForm</h2>
<p>The first consumer graphics card featuring hardware tessellation that made its way to the market was the ATI Radeon 8500 in 2001. The tessellation feature of the GPU became known as TruForm and soon became available in OpenGL via the extension <a title="GL_ATI_pn_triangles" href="http://www.opengl.org/registry/specs/ATI/pn_triangles.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ATI/pn_triangles.txt?referer=');">GL_ATI_pn_triangles</a>, but the functionality never made its way into core due to the lack of any similar hardware support from other graphics card vendors.</p>
<p>The tessellation hardware present in the Radeon 8500 was a completely fixed function component with predefined tessellation evaluation modes, even though the GPU already had support for both programmable vertex and fragment processing. It is also interesting that the tessellator operated on vertices emitted by the vertex shader if one was present.</p>
<div class="wp-caption aligncenter" style="width: 548px"><img class="   " title="Simplified rendering pipeline of the ATI Radeon 8500" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/sm14_tess.png" alt="Simplified rendering pipeline of the ATI Radeon 8500" width="538" height="160" /><p class="wp-caption-text">Simplified rendering pipeline of the ATI Radeon 8500</p></div>
<p>The tessellator itself has one configurable parameter: the tessellation level. This controls the number of cuts performed on each edge of the input primitive, which in the case of TruForm must always be a triangle (whether it comes from a list, strip or fan). As support for the extension was removed a few years ago, I unfortunately cannot tell the upper limit for the tessellation level supported by TruForm, but I remember it being about 15 or so (I hope somebody can confirm it or correct me).</p>
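<p>To get a feeling for how quickly the geometry grows with the tessellation level, here is a small Python sketch. It assumes a uniform subdivision where the level is the number of cuts per edge, as described above; the function name is hypothetical.</p>

```python
def tessellated_counts(cuts_per_edge):
    """Triangles and vertices produced from one triangle by uniform subdivision."""
    segments = cuts_per_edge + 1                      # pieces each edge is split into
    triangles = segments * segments                   # small triangles covering the original
    vertices = (segments + 1) * (segments + 2) // 2   # points of the triangular grid
    return triangles, vertices


for level in (0, 1, 3, 7):
    print(level, tessellated_counts(level))
```

<p>With no cuts the triangle is left as it is (1 triangle, 3 vertices), while seven cuts per edge already turn every input triangle into 64 triangles with 45 vertices.</p>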
<p>Besides that, the tessellation evaluator has a few other configurable parameters that control how vertex positions and normals are evaluated after the geometry amplification. For normals there is a linear and a quadratic interpolation mode, and for vertex positions linear and cubic interpolation are available. All other vertex attributes are linearly interpolated over the tessellated geometry.</p>
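<p>The cubic position mode can be sketched in Python as follows. This is only an illustration: it assumes the hardware follows the published curved PN triangle formulation (a cubic B&eacute;zier triangle built from the three corner positions and normals), and all the helper names are hypothetical.</p>

```python
def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def edge_point(p_i, p_j, n_i):
    """Control point a third of the way from p_i to p_j, projected onto the
    tangent plane defined by p_i and its unit normal n_i."""
    w = dot(sub(p_j, p_i), n_i)
    return scale(sub(add(scale(p_i, 2.0), p_j), scale(n_i, w)), 1.0 / 3.0)

def pn_position(p, n, u, v):
    """Cubic position at barycentric (w, u, v), with w = 1 - u - v.
    p holds the three corners, n the three unit normals."""
    w = 1.0 - u - v
    p1, p2, p3 = p
    n1, n2, n3 = n
    b210 = edge_point(p1, p2, n1)
    b120 = edge_point(p2, p1, n2)
    b021 = edge_point(p2, p3, n2)
    b012 = edge_point(p3, p2, n3)
    b102 = edge_point(p3, p1, n3)
    b201 = edge_point(p1, p3, n1)
    # Center control point: pushed away from the flat centroid by the edge points.
    e = (0.0, 0.0, 0.0)
    for b in (b210, b120, b021, b012, b102, b201):
        e = add(e, scale(b, 1.0 / 6.0))
    centroid = scale(add(add(p1, p2), p3), 1.0 / 3.0)
    b111 = add(e, scale(sub(e, centroid), 0.5))
    terms = (
        (p1, w ** 3), (p2, u ** 3), (p3, v ** 3),
        (b210, 3 * w * w * u), (b120, 3 * w * u * u), (b201, 3 * w * w * v),
        (b021, 3 * u * u * v), (b102, 3 * w * v * v), (b012, 3 * u * v * v),
        (b111, 6 * w * u * v),
    )
    result = (0.0, 0.0, 0.0)
    for point, weight in terms:
        result = add(result, scale(point, weight))
    return result

s = 0.5 ** 0.5
corners = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
normals = ((-s, 0.0, s), (s, 0.0, s), (0.0, 0.0, 1.0))
# Midpoint of the first edge: lifted out of the z = 0 plane by the tilted normals.
print(pn_position(corners, normals, 0.5, 0.0))
```

<p>Note that the points along an edge depend only on the two corners of that edge and their normals, which is why the normals on shared edges matter so much for crack-free rendering.</p>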
<p>The good thing about TruForm is that it can be added very simply to an existing rendering engine with just a few API calls. However, as the hardware component can only be managed through rendering state, tessellation parameters can only be controlled on a per object basis, and changing the tessellation configuration breaks batches as well.</p>
<p>Another advantage of TruForm is that it works on transformed vertices, which means that we can safely use tessellation with complex vertex processing techniques like skeletal animation without worrying about the huge transformation costs that would be inherent in a post-tessellation vertex shader.</p>
<p>Another issue that must be mentioned when talking about hardware tessellation is crack-free rendering. As tessellation usually works on individual primitives, there is often no guarantee that no cracks will appear between adjacent polygons after tessellation is applied. In the case of TruForm this is relevant only if cubic position interpolation is used, as only that mode alters the vertex positions themselves. If this vertex position evaluation mode is used, the artists must ensure that vertices on common edges have the same normal. This is quite a limiting factor in certain situations but should not cause any problems in most of the common use cases.</p>
<p>A huge deficit of ATI&#8217;s original N-Patch implementation is that the tessellation evaluation is not programmable and offers few to no options to control what the resulting vertices will look like. This meant that novel graphics techniques like displacement mapping could not be implemented with it. While this is quite a severe limitation, TruForm was still a great feature for increasing the detail of already existing and upcoming game titles.</p>
<p>Unfortunately TruForm wasn&#8217;t well received by the developer community due to the additional burden it put on artists and the lack of flexibility on the programming side. Still, I think the most important factor was the lack of wide adoption of the feature by the other relevant vendors.</p>
<h2>Beyond TruForm</h2>
<p>After the original appearance of hardware tessellation there were several further efforts to make geometry tessellation a popular feature in real-time graphics. Besides ATI, Matrox also released GPUs with N-Patch support, and ATI improved its TruForm feature with the appearance of the Radeon 9700. These cards were able to do two very important things that the original TruForm was lacking.</p>
<p>First, they provided means to do the tessellation evaluation based on a texture, which enabled the implementation of displacement mapping. Second, and in my opinion even more important, they supported adaptive tessellation, which means that the tessellation factor was calculated dynamically based on the distance from the camera. In addition, the new tessellation implementations also offered a continuous tessellation mode, allowing seamless transitions between various tessellation levels.</p>
<p>Unfortunately I don&#8217;t know of any OpenGL extensions that exposed this functionality, which also means that I&#8217;ve never had a closer look at it, so if you are interested in these technologies you&#8217;ll have to search around the internet a little.</p>
<h2>Geometry Shaders</h2>
<p>After the failure of the early attempts to introduce hardware tessellation to the general public, the appearance of Shader Model 4.0 capable graphics cards made many developers think that we would see hardware tessellation in the form of geometry shaders. In fact, some cards on the market at this time did have a new generation of tessellation hardware that had nothing to do with geometry shaders, but I will talk about that later&#8230;</p>
<p>Many developers incorrectly saw a practical tessellator in the geometry shader when it appeared. While a geometry shader can indeed be used to perform geometry amplification, several hardware limitations make this approach rather inefficient in practice. First I will talk about how geometry shaders can be used for tessellation, and after that I will explain why not to do so.</p>
<p>The geometry shader is a new programmable stage introduced by Shader Model 4.0 that operates on whole primitives, after vertex processing and before rasterization. A geometry shader has a fixed input primitive type and a fixed output primitive type, and the two don&#8217;t have to match. This means it is possible to emit triangles even though the input primitives were points.</p>
<p>The greatest feature of geometry shaders is that they can output a dynamically adjustable number of geometric primitives based on the input primitive, including even the possibility to discard the current primitive. The former allows us to do a certain amount of geometry amplification and evaluate the output primitives as we wish (within the limits of what a shader can do, of course). The only limit is the maximum size of the output buffer available on the target hardware. This is in fact a rather restrictive limit, especially with a large number of vertex attributes.</p>
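<p>To show how quickly the output limit bites, here is a small Python sketch. The two constants are the minimum values that an OpenGL 3.2 implementation has to support; real hardware may report higher values, and the function name is hypothetical.</p>

```python
# Minimum limits required by OpenGL 3.2; actual values can be queried with
# glGetIntegerv and may be larger on real hardware.
MAX_GEOMETRY_OUTPUT_VERTICES = 256
MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS = 1024


def max_emitted_vertices(components_per_vertex):
    """Upper bound on the vertices one geometry shader invocation may emit."""
    by_components = MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS // components_per_vertex
    return min(MAX_GEOMETRY_OUTPUT_VERTICES, by_components)


# 4: position only; 9: position, normal and texture coordinate;
# 32: a hypothetical vertex carrying many attributes.
for components in (4, 9, 32):
    print(components, max_emitted_vertices(components))
```

<p>With these limits a vertex with 32 output components leaves room for only 32 emitted vertices per input primitive, which is not much for a tessellator.</p>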
<div class="wp-caption aligncenter" style="width: 454px"><img title="Simplified rendering pipeline of Shader Model 4.0" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/sm40_tess.png" alt="Simplified rendering pipeline of Shader Model 4.0" width="444" height="160" /><p class="wp-caption-text">Simplified rendering pipeline of Shader Model 4.0</p></div>
<p>In this use case, the geometry shader acts as both the tessellator and the evaluator, as it performs the geometry amplification as well as the interpolation of the vertex attributes. This provides almost complete flexibility over how we would like to implement our tessellation algorithm. We can choose the tessellation factor as part of the programmable stage, so adaptive tessellation is no problem. We can also easily add displacement mapping or any other technique to control how our newly generated primitives will be positioned and oriented.</p>
<p>Now that we have seen how easy and flexible a geometry shader based tessellation implementation is, let&#8217;s see its dark side&#8230;</p>
<p>First of all, as the geometry shader is a revolutionary feature compared to earlier programmable GPU capabilities, it suffers from the fact that it doesn&#8217;t really fit into the existing architecture. Previously, every fixed-function and programmable hardware component on the GPU had a fixed amount of input and output data, making it possible to create a kind of synchronous pipeline architecture. This way it was rather easy for the execution dispatcher to share the workload among the computing units and keep them busy most of the time.</p>
<p>The programmable amount of output data made possible by geometry shaders somewhat breaks this synchronous architecture. This means that a more dynamic dispatching mechanism is required to control the consumption of the data output by the shader. To achieve this, two important issues have to be addressed:</p>
<p>First, there should be a temporary buffer that holds the output, as we cannot guarantee that the outputs can be immediately fed to the subsequent stages of the rendering pipeline. This has been implemented with memory buffers and/or caches by the various vendors.</p>
<p>Second, because the geometry shader can be executed in parallel (at least in theory) and different instances of the geometry shader can output different numbers of primitives, there can be problems with synchronizing the data emissions and with the order in which the output primitives are placed in the output buffer.</p>
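<p>The ordering problem can be illustrated with a few lines of Python. This sketch does not describe how any vendor actually implements it; it only shows that the position of an instance&#8217;s output in an ordered buffer depends on the output counts of all the instances before it. The function name is hypothetical.</p>

```python
def output_offsets(counts):
    """Exclusive prefix sum: where the first primitive of each instance lands."""
    offsets = []
    total = 0
    for count in counts:
        offsets.append(total)
        total += count
    return offsets


# Primitives emitted by four hypothetical geometry shader instances.
counts = [2, 0, 5, 1]
print(output_offsets(counts))  # prints [0, 2, 2, 7]
```

<p>No instance knows its offset until all the earlier instances have reported how much they emit, so the instances either have to wait for each other or write into separately reserved storage.</p>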
<p>AMD solved these problems by introducing a new cache designed to handle the special nature of primitive emissions executed by the geometry shader. Unfortunately NVIDIA&#8217;s implementation is much more limited and, as far as I can tell, it may result in geometry shader instances being executed on only one or just a few computing units, which can severely degrade performance in the case of tessellation. This is the reason why we have to specify in GLSL the maximum number of vertices that our geometry shader can output. NVIDIA drivers use this value to plan the necessary storage strategy for the geometry shader, while it has no effect at all on AMD GPUs. So if you want your geometry shader to run faster on AMD GPUs, just set this maximum limit as high as possible <img src='http://rastergrid.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>There is another problem with a geometry shader based tessellator implementation: the geometry amplification is done iteratively within a single shader invocation, which is quite a waste on a highly parallel processor architecture like that of the GPU. This results in a considerable amount of delay.</p>
<p>Back to the topic that geometry shaders break the synchronous nature of GPUs, I would like to talk about how the number and type of emitted primitives affect the overall performance of the rendering pipeline (not even considering the aforementioned negative factors).</p>
<p>The best performance can be achieved in case both the input and output primitive type of the geometry shader is the same (e.g. triangle -&gt; triangle). Besides that, usually GPUs have an accelerated path for outputting four vertices for one vertex input (e.g. point -&gt; triangle strip) that is useful for rendering point sprites or billboards. All the other combinations should be avoided if possible.</p>
<p>I hope I was clear enough to convince all of you that geometry shaders are not meant for tessellation. It really drove me mad to see everybody talking only about this particular use case when in fact geometry shaders are much more useful in other situations.</p>
<h2>Tessellation on HD2000 series</h2>
<p>The true successor of the original hardware tessellation feature reappeared with the Xbox 360&#8217;s GPU and then on the PC with the introduction of the AMD Radeon HD2000 series. This hardware generation came equipped with a fixed function hardware tessellator similar to that of the Radeon 8500 but with added programming flexibility. The functionality is accessible in OpenGL through the extension <a title="GL_AMD_vertex_shader_tessellator" href="http://www.opengl.org/registry/specs/AMD/vertex_shader_tessellator.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/AMD/vertex_shader_tessellator.txt?referer=');">GL_AMD_vertex_shader_tessellator</a> but, again, it didn&#8217;t make its way into core OpenGL or into DX10 due to the lack of support on NVIDIA GPUs. The extension in fact turns the traditional vertex shader into a tessellation evaluation shader (or a domain shader in DX terminology) and, even though the extension does not explicitly name it as such, I will sometimes refer to it this way.</p>
<p>Still, one important restriction of this functionality has to be mentioned, namely that this tessellation mechanism cannot be used together with geometry shaders due to hardware limitations. My guess is that the tessellator output is most probably emitted to the same cache that is used by geometry shaders (somebody from AMD can confirm this or correct me).</p>
<div class="wp-caption aligncenter" style="width: 454px"><img class=" " title="Simplified rendering pipeline of the AMD Radeon HD2900 with tessellation" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/hd2000_tess.png" alt="Simplified rendering pipeline of the AMD Radeon HD2900 with tessellation" width="444" height="160" /><p class="wp-caption-text">Simplified rendering pipeline of the AMD Radeon HD2900 with tessellation</p></div>
<p>The upgraded vertex shader introduced by the extension is provided with barycentric coordinates generated by the tessellator and with the control point indices (three indices in case of a triangle and four in case of a quad). The actual control point data is then fetched from within the vertex buffers used.</p>
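<p>What such an evaluation shader has to do for a triangle can be sketched in Python as follows. The function and variable names are hypothetical and the sketch only interpolates positions linearly; a real shader would be written in GLSL and could apply any evaluation it likes.</p>

```python
def evaluate_vertex(vertex_buffer, indices, barycentric):
    """Blend the control points named by indices using the tessellator weights."""
    result = [0.0, 0.0, 0.0]
    for index, weight in zip(indices, barycentric):
        control_point = vertex_buffer[index]  # explicit fetch by control point index
        for axis in range(3):
            result[axis] += weight * control_point[axis]
    return tuple(result)


positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
print(evaluate_vertex(positions, (0, 1, 2), (0.25, 0.25, 0.5)))  # prints (0.5, 1.0, 0.0)
```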
<p>One important disadvantage of the tessellation architecture provided by the extension is that there is no programmable stage before the tessellator, so we cannot do expensive per-vertex operations on the control cage (e.g. skeletal animation). Fortunately, it is very easy to overcome this limitation, as on this hardware generation we already have transform feedback (stream out in DX terminology) and auto draw at our disposal. This way we can simply use an additional rendering step and an auxiliary buffer to make things work as expected.</p>
<div class="wp-caption aligncenter" style="width: 643px"><img title="Using transform feedback to add real vertex shaders to the rendering pipeline in case tessellation is used on AMD Radeon HD2900" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/hd2000_tess2.png" alt="Using transform feedback to add real vertex shaders to the rendering pipeline in case tessellation is used on AMD Radeon HD2900" width="633" height="160" /><p class="wp-caption-text">Using transform feedback to add real vertex shaders to the rendering pipeline in case tessellation is used on AMD Radeon HD2900</p></div>
<p>The maximum tessellation level is 15 and there is a discrete and a continuous tessellation mode that is configurable using API calls.</p>
<p>While this already looks almost like the tessellation mechanism introduced by DX11, the key disadvantage is the lack of adaptive tessellation, that is, the possibility to define the tessellation level algorithmically on the GPU. This makes it rather impractical for dynamic LOD based tessellation level selection, as the required API calls would be batch breakers.</p>
<p>Still, I think this feature should have caught the attention of developers, but it seems that it remained a tool for tech demos only, as developers preferred to wait for the appearance of DX11, which forced NVIDIA to finally implement their own hardware tessellator.</p>
<h2>Tessellation shader</h2>
<p>Finally, with the advent of Shader Model 5.0 GPUs we have our &#8220;official&#8221; hardware tessellation. The functionality is exposed in OpenGL via the extension <a title="GL_ARB_tessellation_shader" href="http://www.opengl.org/registry/specs/ARB/tessellation_shader.txt" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.opengl.org/registry/specs/ARB/tessellation_shader.txt?referer=');">GL_ARB_tessellation_shader</a>, which was introduced as part of the fourth major revision of the specification. The extension introduces two new shader types: the tessellation control shader (referred to as hull shader in DX11) and the tessellation evaluation shader (domain shader in DX terminology).</p>
<p>The new feature enables programmable tessellation levels up to 64 via the newly introduced tessellation control shader, which allows us to process our control points in a parallel yet synchronized manner.</p>
<div class="wp-caption aligncenter" style="width: 663px"><img class=" " title="Simplified rendering pipeline of Shader Model 5.0 GPUs" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/09/sm50_tess.png" alt="Simplified rendering pipeline of Shader Model 5.0 GPUs" width="653" height="160" /><p class="wp-caption-text">Simplified rendering pipeline of Shader Model 5.0 GPUs</p></div>
<p>This final revision of the feature allows us to use all the advanced techniques needed for a novel tessellation based renderer like adaptive continuous tessellation and displacement mapping. Also the vertex shader is completely separate and even though many of the vertex shader tasks have to be moved to the tessellation evaluation shader, complex operations like skeletal animation can be simply kept in the vertex shader.</p>
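<p>As an illustration of adaptive continuous tessellation, here is a minimal Python sketch of a distance based level selection. The linear falloff and the near and far distances are assumptions made for the example; only the upper limit of 64 comes from the text above. In a real renderer this logic would live in the tessellation control shader.</p>

```python
MAX_TESS_LEVEL = 64.0  # upper limit of the programmable tessellation level


def tess_level(distance, near=10.0, far=200.0):
    """Fractional tessellation level that falls off linearly with distance."""
    # near and far are hypothetical tuning parameters.
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)  # clamp so the level stays between 1 and the maximum
    return MAX_TESS_LEVEL + t * (1.0 - MAX_TESS_LEVEL)


for distance in (5.0, 10.0, 105.0, 200.0, 500.0):
    print(distance, tess_level(distance))
```

<p>As the level is fractional, it can change smoothly while the camera moves, which is what a continuous tessellation mode needs in order to avoid popping.</p>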
<p>Another thing is that Shader Model 5.0 hardware lifts the restriction on the concurrent usage of tessellation and geometry shaders, so algorithms that rely on geometry shaders can freely be combined with tessellation.</p>
<p>Still, the implementation of a tessellation evaluation shader does not really differ from the one used for Shader Model 4.0 tessellation, as we use the barycentric coordinates generated by the tessellator in the same way. We just don&#8217;t need explicit vertex fetches, as the primitive data is handed to us from the start.</p>
<p>Crack-free rendering is still an issue, however, especially with the programmable evaluators available in the last two versions. OpenGL 4.0 addresses this issue by introducing the precise qualifier, which prevents the shader compiler from applying optimizations like operation reordering or fused multiply-add that may change floating point rounding, thus at least guaranteeing that the same sequence of operations will produce the same result. Any further steps against cracks introduced by tessellation are the responsibility of the programmer.</p>
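<p>Why operation reordering matters can be shown with a few lines of Python. The example uses double precision numbers instead of GPU floats, but the effect is the same: two patches that evaluate the same expression for a shared edge vertex can get different results if the additions are performed in a different order.</p>

```python
a, b, c = 0.1, 0.2, 0.3

left_patch = (a + b) + c   # order used while evaluating the first patch
right_patch = a + (b + c)  # order a compiler may pick for the second patch

print(left_patch == right_patch)  # prints False
print(left_patch, right_patch)
```

<p>A difference in the last bit is enough for the two vertices to end up at slightly different positions, which can show up as a crack along the shared edge.</p>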
<h2>Conclusion</h2>
<p>We&#8217;ve seen how various GPU generations addressed the issue of hardware tessellation as well as in what form they are available in various OpenGL implementations. I&#8217;ve also tried to collect the most relevant advantages and disadvantages of the implementations in various hardware generations.</p>
<p>There was also a completely separate discussion about geometry shaders and their use for geometry amplification and I hope I managed to convince everybody that it is not the way to go.</p>
<p>We&#8217;ve also briefly mentioned some of the major issues that may raise concerns about the use of tessellation. A thorough examination of these issues needs a much longer discussion that is out of the scope of this article, but it is an interesting topic for a future one.</p>
<p>Unfortunately, I didn&#8217;t prepare any sample application demonstrating the usage of the various tessellation implementations due to the lack of time so this has to be also postponed to a future article.</p>
<p>For further reading, and especially for sample applications, I recommend checking out the following links:</p>
<p><a title="Hardware Tessellation on Radeon in OpenGL @ Geeks3D" href="http://www.geeks3d.com/20100209/test-hardware-tessellation-on-radeon-in-opengl-part-12/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.geeks3d.com/20100209/test-hardware-tessellation-on-radeon-in-opengl-part-12/?referer=');">Hardware Tessellation on Radeon in OpenGL @ Geeks3D</a><br />
<a title="First Contact with OpenGL 4.0 GPU Tessellation @ Geeks3D" href="http://www.geeks3d.com/20100730/test-first-contact-with-opengl-4-0-gpu-tessellation/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.geeks3d.com/20100730/test-first-contact-with-opengl-4-0-gpu-tessellation/?referer=');"> First Contact with OpenGL 4.0 GPU Tessellation @ Geeks3D</a></p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/09/history-of-hardware-tessellation/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

