<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>RasterGrid Blog &#187; Multiprocessing</title>
	<atom:link href="http://rastergrid.com/blog/category/programming/multiprocessing/feed/" rel="self" type="application/rss+xml" />
	<link>http://rastergrid.com/blog</link>
	<description>A technical blog from Daniel Rákos (aka aqnuep)</description>
	<lastBuildDate>Fri, 04 Nov 2011 18:10:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Texture and buffer access performance</title>
		<link>http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/</link>
		<comments>http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/#comments</comments>
		<pubDate>Tue, 02 Nov 2010 20:44:57 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Multiprocessing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[OpenGL]]></category>
		<category><![CDATA[texture buffer]]></category>
		<category><![CDATA[uniform buffer]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=475</guid>
		<description><![CDATA[Currently there are several ways to feed data to the GPU no matter of what API we use and what type of application we develop. In case of OpenGL we have uniform buffers, texture buffers, texture images, etc. The same is true for OpenCL and other compute APIs that even provide more fine-grained memory management]]></description>
			<content:encoded><![CDATA[
<p>There are currently several ways to feed data to the GPU, no matter what API we use or what type of application we develop. In the case of OpenGL we have uniform buffers, texture buffers, texture images, etc. The same is true for OpenCL and other compute APIs, which provide even more fine-grained memory management by taking advantage of the local data store (LDS) available on today&#8217;s hardware. In this article I&#8217;ll present the memory access performance characteristics of AMD&#8217;s Evergreen-class GPUs, focusing on what all this means from an OpenGL point of view. While most of the data is about the HD5870, the general principles and relative performance characteristics are valid for other GPUs, including ones from other vendors.</p>
<p><span id="more-475"></span></p>
<h2>Introduction</h2>
<p>Traditional CPU-based applications don&#8217;t have to worry too much about where they put their data, as they have a simple set of possibilities: registers and global memory (accessed through a series of linear caches called L1, L2 and, on newer architectures, L3). While even this can be quite cumbersome to use efficiently, GPU-based algorithms need even more investigation, as GPU architectures are based on a more complex multi-level memory design.</p>
<p>Typical questions an OpenGL graphics developer could ask nowadays are:</p>
<ul>
<li>Where should I put my per-object data?</li>
<li>From where should I source animation data?</li>
<li>Should I use uniform buffers, texture buffers or vertex buffers for my per-instance data?</li>
<li>What does it mean from a performance point of view if I use read-write buffers or textures?</li>
</ul>
<p>Of course, the list could continue and answering the individual questions is not easy and often requires performance measurements to prove our suspicions. Instead of trying to answer all these questions it is easier to take a look at the actual hardware performance characteristics and solve the individual issues based on that.</p>
<p>I&#8217;ve already touched on the topic in the past with the article <a title="Uniform Buffers VS Texture Buffers" href="http://rastergrid.com/blog/2010/01/uniform-buffers-vs-texture-buffers/">Uniform Buffers VS Texture Buffers</a>, where I presented the key differences between the two data access methods and a few examples of when to use one or the other. In this article I&#8217;ll go further and try to provide more accurate data about how various memory access methods perform in practice.</p>
<p>Earlier there was little to no detailed information about the actual performance of API-level memory access methods, but fortunately the increasing popularity of OpenCL made vendors provide more technical details about the architecture and performance of their products to enable software developers to fully leverage the power of today&#8217;s GPUs. While these documents focus on OpenCL or other compute APIs, most of the data applies indirectly to OpenGL as well.</p>
<h2>The Evergreen architecture</h2>
<p>In order to be able to provide some actual performance data, I&#8217;ve selected AMD&#8217;s Evergreen architecture as the reference and the Radeon HD5870 as the target hardware. Note that most of the presented details roughly apply to all other modern GPUs, including NVIDIA&#8217;s Fermi architecture. Each time there is a clear difference between the two, I&#8217;ll try to point it out. However, I cannot be 100% sure what these differences are, as <a title="ATI Stream SDK OpenCL Programming Guide" href="http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf?referer=');">ATI&#8217;s OpenCL programming guide</a> is somewhat more talkative about actual performance details than <a title="NVIDIA OpenCL Programming Guide" href="http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf" target="_blank" onclick="pageTracker._trackPageview('/outgoing/developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf?referer=');">NVIDIA&#8217;s OpenCL programming guide</a>.</p>
<div class="wp-caption aligncenter" style="width: 510px"><img class=" " title="OpenCL Platform Model" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/11/opencl_platform_model.png" alt="OpenCL Platform Model" width="500" height="273" /><p class="wp-caption-text">OpenCL Platform Model</p></div>
<p>From the point of view of the OpenCL platform model, the Radeon HD5870 is structured in the following way:</p>
<ul>
<li>Total of 20 compute units.</li>
<li>Each compute unit consists of 16 stream cores.</li>
<li>Each stream core consists of 5 processing elements (4 traditional, 1 transcendental).</li>
</ul>
<p>This sums up to a total of 1600 processing elements on the Radeon HD5870.</p>
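<p>As a quick sanity check, the multiplication can be written out in a few lines of C++. This is only a sketch: the constant and function names are made up for illustration, and the three figures are the ones listed above.</p>
<pre class="brush: cpp">// Figures from the list above (Radeon HD5870); all names are illustrative only.
constexpr int kComputeUnits = 20;
constexpr int kStreamCoresPerComputeUnit = 16;
constexpr int kProcessingElementsPerStreamCore = 5;

// Number of stream cores on the whole GPU.
constexpr int totalStreamCores() {
    return kComputeUnits * kStreamCoresPerComputeUnit;
}

// Number of processing elements on the whole GPU.
constexpr int totalProcessingElements() {
    return totalStreamCores() * kProcessingElementsPerStreamCore;
}

static_assert(totalStreamCores() == 320, "20 compute units x 16 stream cores");
static_assert(totalProcessingElements() == 1600, "320 stream cores x 5 processing elements");</pre>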
<p>The basic OpenCL architecture applies in the same way to NVIDIA GPUs; however, there are differences between AMD&#8217;s and NVIDIA&#8217;s GPU architectures. Since their HD2000 series, AMD has used a special super-scalar architecture that allows them to execute 5 separate instructions in each stream core.</p>
<div class="wp-caption aligncenter" style="width: 439px"><img class="   " title="ATI super-scalar architecture" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/11/ati_superscalar.gif" alt="ATI super-scalar architecture" width="429" height="237" /><p class="wp-caption-text">ATI super-scalar architecture consisting of one transcendental unit (left), four traditional units and a dedicated branch execution unit (right).</p></div>
<p>What this already reveals from an OpenGL point of view is that AMD&#8217;s architecture groups together 16 stream cores, so fragment shaders most probably run in sync on 4&#215;4 tiles of fragments. This matters, for example, when we use heavy dynamic branching in shaders: if the branch selection is not coherent across such a fragment neighborhood, performance can drop because the hardware masks out the processing elements that did not select the branch currently being executed.</p>
<p>Also, it is important to note that usually only one out of four or five processing elements (depending on hardware generation and vendor) is capable of executing transcendental instructions such as logarithm, exponential or trigonometric functions.</p>
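<p>The cost of incoherent branching can be illustrated with a toy model. The sketch below is not a description of how the hardware schedules work; it merely assumes that a group of lanes running in lockstep has to pay for every branch that at least one of its lanes selects. The function name and the cost values passed to it are hypothetical.</p>
<pre class="brush: cpp">#include &lt;vector&gt;

// Toy model (assumption): lanes in a group run in lockstep, so the group
// pays for branch A if any lane takes it and for branch B if any lane
// takes it, while the lanes on the other side are masked out.
int lockstepCost(const std::vector&lt;bool&gt;&amp; takesBranchA, int costA, int costB) {
    bool anyA = false;
    bool anyB = false;
    for (bool a : takesBranchA) {
        if (a) anyA = true;
        else   anyB = true;
    }
    return (anyA ? costA : 0) + (anyB ? costB : 0);
}</pre>
<p>In this model a group whose lanes all agree pays for one branch only, while a single disagreeing lane is enough to make the whole group pay for both.</p>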
<h2>Memory capacity and performance</h2>
<p>AMD is very clear about the memory capacity and performance details in their OpenCL programming guide. The table below shows these hardware characteristics for the Radeon HD5870:</p>
<table style="width: 100%;" border="0">
<tbody>
<tr>
<td style="text-align: center;"><strong>OpenCL Memory Type</strong></td>
<td style="text-align: center;"><strong>Hardware Resource</strong></td>
<td style="text-align: center;"><strong>Size/CU</strong></td>
<td style="text-align: center;"><strong>Size/GPU</strong></td>
<td style="text-align: center;"><strong>Peak Read Bandwidth / Stream Core</strong></td>
</tr>
<tr>
<td>Private</td>
<td>GPRs</td>
<td style="text-align: center;">256KB</td>
<td style="text-align: center;">5MB</td>
<td style="text-align: center;">48 bytes/cycle</td>
</tr>
<tr>
<td>Local</td>
<td>LDS</td>
<td style="text-align: center;">32KB</td>
<td style="text-align: center;">640KB</td>
<td style="text-align: center;">8 bytes/cycle</td>
</tr>
<tr>
<td rowspan="3">Constant</td>
<td>Direct-addressed constant</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">48KB</td>
<td style="text-align: center;">16 bytes/cycle</td>
</tr>
<tr>
<td>Same-indexed constant</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">4 bytes/cycle</td>
</tr>
<tr>
<td>Varying-indexed constant</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">~0.6 bytes/cycle</td>
</tr>
<tr>
<td rowspan="2">Images</td>
<td>L1 Cache</td>
<td style="text-align: center;">8KB</td>
<td style="text-align: center;">160KB</td>
<td style="text-align: center;">4 bytes/cycle</td>
</tr>
<tr>
<td>L2 Cache</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">512KB</td>
<td style="text-align: center;">~1.6 bytes/cycle</td>
</tr>
<tr>
<td>Global</td>
<td>Global Memory</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">1GB</td>
<td style="text-align: center;">~0.6 bytes/cycle</td>
</tr>
</tbody>
</table>
<p><strong>GPRs</strong> &#8211; General Purpose Registers<br />
<strong>LDS</strong> &#8211; Local Data Store<br />
<strong>Direct-addressed constant</strong> &#8211; a constant accessed using a constant address.<br />
<strong>Same-indexed constant</strong> &#8211; a varying-indexed constant where each processing element accesses the same index.<br />
<strong>Varying-indexed constant</strong> &#8211; a varying-indexed constant where the processing elements access different indices.</p>
<p>Of course, these figures apply to fetches that are properly aligned. In case of unaligned data access the actual throughput can be much lower. In order to reach the peak bandwidth we usually have to align our data to multiples of 4, 8 or 16 bytes (depending on the actual hardware).</p>
<p>As can be seen, constant storage falls into three different access performance categories depending on how it is addressed, and images likewise perform differently depending on which cache level serves the fetch. While actual numbers differ between platforms, the guidelines apply to most modern GPUs: choose the addressing method wisely and take access locality into consideration in order to get optimum performance.</p>
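<p>On the application side, one way to keep records on 16-byte boundaries in C++ is <code>alignas</code> combined with compile-time checks. This is a minimal sketch: the <code>InstanceData</code> struct and its fields are hypothetical, and the layout a shader actually sees still depends on the buffer layout rules of the API in use.</p>
<pre class="brush: cpp">#include &lt;cstddef&gt;

// Hypothetical per-instance record, padded so that every group of four
// floats starts on a 16-byte boundary.
struct alignas(16) InstanceData {
    float position[3];
    float scale;      // fills the fourth float of the first 16-byte slot
    float color[4];   // occupies the second 16-byte slot
};

// Compile-time checks of the intended layout.
static_assert(alignof(InstanceData) == 16, "record must start on a 16-byte boundary");
static_assert(sizeof(InstanceData) % 16 == 0, "record size must be a multiple of 16 bytes");
static_assert(offsetof(InstanceData, color) == 16, "color must start the second slot");</pre>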
<p>These numbers are no different in case of OpenGL terminology either, just replace the word &#8220;constant&#8221; with uniform buffers and think about images and global data as texture images or buffer objects. The only exception is that there is no direct alternative for local memory in OpenGL.</p>
<p>An additional thing to consider on Shader Model 5.0 hardware is read-write images and buffers. AMD refers to the two memory access methods as FastPath and CompletePath. For read-only textures or buffers the GPU uses the FastPath, which is able to take full advantage of the L2 cache, while read-write textures and buffers usually use the so-called CompletePath, which sacrifices the advantages of the L2 cache to enable atomic operations on global memory objects. This, of course, has a huge performance effect, reducing the throughput of the GPU roughly fivefold on the Radeon HD5870:</p>
<table style="width: 100%;" border="0">
<tbody>
<tr>
<td><strong>Kernel</strong></td>
<td style="text-align: center;"><strong>Effective Bandwidth</strong></td>
<td style="text-align: center;"><strong>Ratio to Peak Bandwidth</strong></td>
</tr>
<tr>
<td>copy 32-bit 1D FastPath</td>
<td style="text-align: center;">96 GB/s</td>
<td style="text-align: center;">63%</td>
</tr>
<tr>
<td style="text-align: left;">copy 32-bit 1D CompletePath</td>
<td style="text-align: center;">18 GB/s</td>
<td style="text-align: center;">12%</td>
</tr>
</tbody>
</table>
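<p>The ratio column can be reproduced with simple arithmetic. The sketch below assumes a peak memory bandwidth of 153.6 GB/s for the Radeon HD5870; that figure comes from the card&#8217;s published specification rather than from the table above, and the function names are illustrative.</p>
<pre class="brush: cpp">// Assumption: published peak memory bandwidth of the Radeon HD5870, in GB/s.
constexpr double kPeakBandwidthGBs = 153.6;

// Fraction of the peak bandwidth reached by a kernel.
constexpr double ratioToPeak(double effectiveGBs) {
    return effectiveGBs / kPeakBandwidthGBs;
}

// How many times slower the CompletePath copy is than the FastPath copy.
constexpr double slowdown(double fastPathGBs, double completePathGBs) {
    return fastPathGBs / completePathGBs;
}</pre>
<p>With the values from the table, <code>ratioToPeak(96.0)</code> is about 0.625 and <code>ratioToPeak(18.0)</code> about 0.117, which the table rounds to 63% and 12%, while <code>slowdown(96.0, 18.0)</code> is about 5.3.</p>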
<h2>Summary</h2>
<p>Now that we&#8217;ve seen how the various OpenCL memory types perform in reality, let&#8217;s see how all this information translates to the OpenGL world. Here are my <strong>top-10 recommendations</strong> about when and how to use the various data access possibilities present in modern OpenGL:</p>
<ol>
<li>Align your data to multiples of 16 bytes and fetch them accordingly.</li>
<li>Use direct-addressing of data in uniform buffers and try to avoid indexing into uniform buffers.</li>
<li>If you must use indexing into uniform buffers, make sure that the indices are coherent across processing elements working in sync.</li>
<li>If you heavily use indexed data consider using texture buffers instead of uniform buffers to take advantage of the L1 and L2 cache.</li>
<li>Texture and buffer caches are linear, so consider this when planning your access patterns.</li>
<li>Bind textures and buffers for read-write mode only when it is really necessary; use regular texture binding otherwise to ensure optimum performance.</li>
<li>A single atomic buffer operation forces the shader to use the slow path, so use atomic operations wisely.</li>
<li>Do not use atomic buffer operations to implement atomic counters; use built-in hardware atomic counters instead, as they are much faster.</li>
<li>Consider using dynamic branching to avoid costly memory operations as often as possible.</li>
<li>Try to make your branch selection coherent across processing elements working in sync (e.g. 4&#215;4 fragment tile in case of a fragment shader).</li>
</ol>
<div class="wp-caption aligncenter" style="width: 505px"><img class="  " title="Memory Access Performance" src="http://www.rastergrid.com/blog/wp-content/uploads/2010/11/mem_perf.png" alt="Memory Access Performance" width="495" height="233" /><p class="wp-caption-text">Relative performance characteristics of memory access methods (higher is better).</p></div>
<p><em>Note: This article may contain inaccurate data and some of the advice may not apply to other hardware platforms. I&#8217;ve written this article in the hope that it may prove useful for some developers out there. For accurate details or more information, please contact your hardware vendor.</em></p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/11/texture-and-buffer-access-performance/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Synchronizable objects for C++</title>
		<link>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/</link>
		<comments>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/#comments</comments>
		<pubDate>Tue, 02 Feb 2010 19:01:56 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Multiprocessing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Samples]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[lock]]></category>
		<category><![CDATA[macro]]></category>
		<category><![CDATA[multithreading]]></category>
		<category><![CDATA[mutex]]></category>
		<category><![CDATA[OOP]]></category>
		<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[synchronization]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=120</guid>
		<description><![CDATA[Previously I talked about how one can easily take advantage of multiprocessing using OpenMP. Even if the C pragmas introduced by the parallel programming API standard is very straightforward for simple programs, it simply doesn&#8217;t fit nicely in a complex C++ application that is built from the ground with the OOP in mind. To smoothly]]></description>
			<content:encoded><![CDATA[
<p>Previously I talked about how one can easily take advantage of multiprocessing using OpenMP. Even if the C pragmas introduced by this parallel programming API standard are very straightforward for simple programs, they simply don&#8217;t fit nicely into a complex C++ application that is built from the ground up with OOP in mind. To smoothly introduce OpenMP into such projects one needs higher-level constructs that hide the actual implementation details. This is the first article of a series that will try to provide reference implementations of such an abstraction. We will start with synchronizable primitives that try to reflect the functionality provided by the &#8220;synchronized&#8221; statement of Java.</p>
<p><span id="more-120"></span>This article is highly inspired by an article written by <a title="A &quot;synchronized&quot; statement for C++ like in Java" href="http://www.codeproject.com/KB/threads/cppsyncstm.aspx" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.codeproject.com/KB/threads/cppsyncstm.aspx?referer=');">Achilleas Margaritis</a><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;"> and mostly follows his ideas. My article tries to provide a portable reference implementation of a slightly modified version of the trick presented by Margaritis, using OpenMP as the multiprocessing API back-end.</span></p>
<h2>Motivation</h2>
<p><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;">According to the OO paradigm, classes and consequently objects provide an abstract interface to the underlying internal data or services of the modeled entity or entity class. When it comes to parallel programming we should provide facilities that enable safe concurrent access to shared resources, which in this case are objects. Using plain OpenMP can be satisfactory; however, when used extensively, the OpenMP pragmas and API function calls can greatly hurt the readability and the maintainability of the code. Moreover, there can be platforms that use other APIs for handling race conditions. It is obvious that we need to encapsulate these facilities and provide an abstract tool-set instead.</span></p>
<h2>Implementation</h2>
<p><span style="line-height: normal; -webkit-border-horizontal-spacing: 5px; -webkit-border-vertical-spacing: 5px; font-size: small;">The very first building block of such a framework can be a mutex class that provides mutually exclusive access to certain resources. In the world of OpenMP this could look similar to the following:</span></p>
<pre class="brush: cpp">#include &lt;omp.h&gt;

// Thin wrapper around an OpenMP lock.
class Mutex {
public:
    Mutex() { omp_init_lock(&amp;_mutex); }
    ~Mutex() { omp_destroy_lock(&amp;_mutex); }
    void lock() { omp_set_lock(&amp;_mutex); }
    void unlock() { omp_unset_lock(&amp;_mutex); }
private:
    omp_lock_t _mutex;
};</pre>
<p>This seems already enough for us to make our Java-like &#8220;synchronized&#8221; statement, however we would like to create a framework that makes usage as easy and safe as possible. In order to get closer to this goal we apply the RAII (Resource Acquisition Is Initialization) design pattern to create our lock class:</p>
<pre class="brush: cpp">class Lock {
public:
    Lock(Mutex&amp; mutex) : _mutex(mutex), _release(false) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    // Conversion to bool lets the lock act as a loop condition.
    operator bool() const { return !_release; }
    void release() { _release = true; }
private:
    Mutex&amp; _mutex;
    bool _release;
};</pre>
<p>Our goal is to provide an inheritable interface for objects that need synchronization. However, this step requires careful consideration of the provided interface, as we explicitly need to conform to the following requirements:</p>
<ul>
<li>The interface shall not expose the interface of the underlying synchronization primitive, in our case the mutex class methods.</li>
<li>The interface shall be available only to the synchronizable objects and not to the external world, as we would like not just to hide the implementation details of our abstract entity but also to prevent users from synchronizing our objects, which should be the responsibility of the object itself.</li>
<li>The interface shall expose methods which are less prone to name collision, for convenience.</li>
</ul>
<p>If we take care of the presented conventions we end up with an interface similar to the following:</p>
<pre class="brush: cpp">class Synchronizable: protected Mutex {
protected:
	void enterSyncBlock() { this-&gt;lock(); }
	void exitSyncBlock() { this-&gt;unlock(); }
};</pre>
<p>Now we are almost at the finish line. We just need to inherit from this class in order to have the needed facilities for an object that needs synchronization. However, using this interface directly is neither the most comfortable nor the safest option. If we would like to have a Java-like &#8220;synchronized&#8221; statement we have to call for additional help. Fortunately, the not so well respected C macro language comes to our rescue, as we can use it to make some pseudo-language extensions. The simplest way to define our new statement is the following line:</p>
<pre class="brush: cpp">#define synchronized(obj)  for(Lock obj##_lock = *obj; obj##_lock; obj##_lock.release())</pre>
<p>From now on, we can use object synchronization in C++ as easily as in Java; we just need the following syntax in the methods of our shared objects:</p>
<pre class="brush: cpp">synchronized(this) {
    // some code that needs synchronization
}</pre>
<p>Now it is clearly visible how handy the RAII pattern is in our case. Besides making the statement very straightforward to use, it provides additional benefits:</p>
<ul>
<li>It makes the code more readable and as a result it is easier to maintain.</li>
<li>No need to call inconveniently named methods and use lock variables.</li>
<li>The synchronized code has its own scope inside the code.</li>
<li>It is exception-safe as the mutex is unlocked upon destruction.</li>
</ul>
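<p>Putting the pieces together, a complete usage sketch could look like the following. It makes a few assumptions that are not part of the library presented here: <code>std::mutex</code> (C++11) stands in for the OpenMP lock so that the example builds without OpenMP flags, the macro uses direct initialization so the lock never has to be copied, and the <code>Counter</code> class is purely hypothetical.</p>
<pre class="brush: cpp">#include &lt;mutex&gt;

// Stand-in back-end (assumption of this sketch): std::mutex replaces omp_lock_t.
class Mutex {
public:
    void lock() { _mutex.lock(); }
    void unlock() { _mutex.unlock(); }
private:
    std::mutex _mutex;
};

// RAII lock: acquires in the constructor, releases in the destructor.
class Lock {
public:
    explicit Lock(Mutex&amp; mutex) : _mutex(mutex), _release(false) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    Lock(const Lock&amp;) = delete;
    Lock&amp; operator=(const Lock&amp;) = delete;
    operator bool() const { return !_release; }
    void release() { _release = true; }
private:
    Mutex&amp; _mutex;
    bool _release;
};

class Synchronizable : protected Mutex {};

// The loop body runs exactly once while the lock is held.
#define synchronized(obj) for (Lock obj##_lock(*obj); obj##_lock; obj##_lock.release())

// Hypothetical shared object that guards its own state.
class Counter : public Synchronizable {
public:
    Counter() : _value(0) {}
    void increment() {
        synchronized(this) { ++_value; }
    }
    int value() {
        int result = 0;
        synchronized(this) { result = _value; }
        return result;
    }
private:
    int _value;
};</pre>
<p>Callers of <code>Counter</code> never see the mutex: every method acquires the lock on entry to its synchronized block and the destructor of <code>Lock</code> releases it when the block is left.</p>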
<p>Additionally, we can turn a notorious C++ difficulty, multiple inheritance, to our advantage. If our class inherits from two other synchronizable classes, a simple type cast lets us explicitly specify which ancestor we would like to synchronize on in a particular block. To make this easier, we can define our synchronization statement as follows, instead of the Java-like one:</p>
<pre class="brush: cpp">#define synchronized(cls)  for(Lock obj##_lock = *static_cast&lt;cls*&gt;(this); obj##_lock; obj##_lock.release())</pre>
<p>In this case we pass the class name instead of the object pointer <em>this</em>. Using this latter construct we can easily specify the correct ancestor that we would like to synchronize on when we deal with multiple inheritance. Personally I prefer the latter syntax as it is much better suited to C++ use cases.</p>
<p>As we no longer need a direct interface for entering and exiting our synchronization block, we can simplify our synchronizable interface to the following:</p>
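<p>To make the multiple inheritance case more tangible, here is a hedged sketch of a class with two synchronizable ancestors, each carrying its own mutex. As assumptions of this example only, <code>std::mutex</code> stands in for the OpenMP lock, the macro is written with an if statement and brace initialization (C++11) under the fixed variable name <code>sync_lock</code>, and the <code>Inbox</code>, <code>Outbox</code> and <code>Mailbox</code> classes are hypothetical.</p>
<pre class="brush: cpp">#include &lt;mutex&gt;

// Stand-in back-end (assumption of this sketch): std::mutex replaces omp_lock_t.
class Mutex {
public:
    void lock() { _mutex.lock(); }
    void unlock() { _mutex.unlock(); }
private:
    std::mutex _mutex;
};

// RAII lock that always converts to true, for use as an if condition.
class Lock {
public:
    explicit Lock(Mutex&amp; mutex) : _mutex(mutex) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    Lock(const Lock&amp;) = delete;
    Lock&amp; operator=(const Lock&amp;) = delete;
    explicit operator bool() const { return true; }
private:
    Mutex&amp; _mutex;
};

class Synchronizable : protected Mutex {};

// The class name selects which ancestor (and therefore which mutex) to lock.
#define synchronized(cls) if (Lock sync_lock{*static_cast&lt;cls*&gt;(this)})

// Hypothetical ancestors, each with its own mutex and its own data.
class Inbox : public Synchronizable {
protected:
    int _received = 0;
};

class Outbox : public Synchronizable {
protected:
    int _sent = 0;
};

class Mailbox : public Inbox, public Outbox {
public:
    void receive() { synchronized(Inbox) { ++_received; } }
    void send() { synchronized(Outbox) { ++_sent; } }
    int received() { int r = 0; synchronized(Inbox) { r = _received; } return r; }
    int sent() { int s = 0; synchronized(Outbox) { s = _sent; } return s; }
};</pre>
<p>Because each ancestor owns a separate mutex, a thread inside <code>receive()</code> does not block another thread calling <code>send()</code> on the same object.</p>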
<pre class="brush: cpp">class Synchronizable: protected Mutex {
};</pre>
<p>This is enough to provide the facilities needed for a synchronization block, and it still complies with the requirement that we hide the details of the synchronization primitive.</p>
<p>Besides this, Jörg came up with the idea today to replace the for loop in our macro with a single if statement. This seems reasonable as we don&#8217;t have to sacrifice any of the scoping and safety related benefits of our framework. This simplifies our lock class to the following:</p>
<pre class="brush: cpp">class Lock {
public:
    Lock(Mutex&amp; mutex) : _mutex(mutex) { _mutex.lock(); }
    ~Lock() { _mutex.unlock(); }
    // Always true, so the body of the if statement is always executed.
    operator bool() const { return true; }
private:
    Mutex&amp; _mutex;
};</pre>
<p>This definition of the lock class is satisfactory if we redefine our synchronized macro to use an if statement instead:</p>
<pre class="brush: cpp">/* Java-like synchronized statement */
#define synchronized(obj)  if (Lock obj##_lock = *obj)
/* alternative synchronized statement to support multiple inheritance
   (define only one of the two, as they share the same macro name) */
#define synchronized(cls)  if (Lock obj##_lock = *static_cast&lt;cls*&gt;(this))</pre>
<p>Thanks to the useful comments we even managed to further optimize and minimize the support code needed for our new pseudo-language extension.</p>
<h2>Conclusion</h2>
<p>We have seen an example of how one can implement an easy-to-use synchronizable interface for C++. We&#8217;ve also provided a concrete implementation that is based on OpenMP. This library is still far from an API that provides all the constructs one needs for parallel programming in C++ projects; however, we have made our first step and I will return to the subject in subsequent articles to further extend this framework.</p>
<p>Credits go to Achilleas Margaritis whose article inspired me to write mine and to Jörg for the useful improvement ideas.</p>
<h3>Full source code</h3>
<p><strong>Language:</strong> C++<br />
<strong> Platform:</strong> cross-platform<br />
<strong> Dependency:</strong> OpenMP<br />
<strong> Download link:</strong> <a title="omp_sync.h" href="/blog/wp-content/uploads/2010/02/files/omp_sync.h" target="_blank">omp_sync.h</a><br />
<strong> Comments:</strong> In order to use it as it is, you will need a C++ compiler supporting OpenMP like GCC 4.2 or Visual C++ 2008.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/02/synchronizable-objects-for-c/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Exploit parallelism with the least effort</title>
		<link>http://rastergrid.com/blog/2010/01/exploit-parallelism-with-the-least-effort/</link>
		<comments>http://rastergrid.com/blog/2010/01/exploit-parallelism-with-the-least-effort/#comments</comments>
		<pubDate>Tue, 19 Jan 2010 21:12:15 +0000</pubDate>
		<dc:creator>Daniel Rákos</dc:creator>
				<category><![CDATA[Multiprocessing]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Fortran]]></category>
		<category><![CDATA[OpenMP]]></category>

		<guid isPermaLink="false">http://rastergrid.com/blog/?p=84</guid>
		<description><![CDATA[Multiprocessing has been there for decades as a premium feature for enterprise applications but adopting this technology still brings huge burden to software companies that still maintain and develop legacy code. Nowadays, as most commodity hardware already have highly parallelized architectures, a modern application is almost unimaginable without proper multi-threading capabilities even if we talk]]></description>
			<content:encoded><![CDATA[
<p>Multiprocessing has been around for decades as a premium feature for enterprise applications, but adopting this technology still places a huge burden on software companies that maintain and develop legacy code. Nowadays, as most commodity hardware already has a highly parallel architecture, a modern application is almost unimaginable without proper multi-threading capabilities, whether we talk about a text editor or a multimedia application. The transition from traditional software development to multiprocessing is not an easy and painless task. Fortunately we have tools like OpenMP at hand.</p>
<p><span id="more-84"></span>Currently the biggest hit is OpenCL, as it seems to be the ultimate solution to harness the power of highly parallel architectures like multi-core CPUs and DSPs and, probably most importantly, it can leverage the huge raw computational capabilities of GPUs. Although it is one of the most important standards to come out lately, it is not the answer to all questions. For those who would like to bring multiprocessing technology to their legacy code, it may be better to look around for other solutions.</p>
<p>My intention was not related to this when I started to search around for a multiprocessing framework. I just wanted to find something that provides an easy-to-use interface to introduce multi-threading and the needed shared memory semantics into my hobby projects. This is how I found <a title="OpenMP Homepage" href="http://www.openmp.org/" target="_blank">OpenMP</a>.</p>
<h2>What is OpenMP?</h2>
<p>Basically, OpenMP is an API specification for parallel programming that is intended to extend the most preferred programming languages used for computationally heavy and scientific calculations with a tool set that enables cross-platform multi-threading support tightly integrated into the language itself. Namely, OpenMP adds shared memory parallel programming capabilities to the C, C++ and Fortran languages.</p>
<p>While OpenMP is limited to these particular programming languages, it is truly an open and multi-platform API that is very well supported by different compilers (at least as far as I can tell). The standard itself is developed and maintained in a similar fashion to OpenGL, as it has its own <a title="OpenMP Architecture Review Board" href="http://www.openmp.org/wp/about-openmp/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.openmp.org/wp/about-openmp/?referer=');">Architecture Review Board</a> with representatives from all major hardware and software vendors like AMD, HP, IBM, Intel, Sun Microsystems, Microsoft and others.</p>
<p>The specification itself is maintained in two different versions: one for C/C++ and another for Fortran. As I was never involved in development with Fortran, I dug deeper only in the C/C++ specific details, however the facilities provided by the API are basically the same for Fortran as well.</p>
<p>The language extensions are introduced using OpenMP specific pragmas and a run-time library. At first sight this does not seem to be the most elegant solution, but it fits very well into all versions of the programming language specifications, so there are no interoperability issues and the OpenMP standard can be maintained completely separately from the underlying language itself. Considering the evolution process of the C and C++ programming languages, this makes sense.</p>
<h2>Say Hello World to parallel programming</h2>
<p>I think the best way to show the power and simplicity of OpenMP is a basic example of how easy it is to add parallel computing capabilities even to the most straightforward algorithms:</p>
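<pre class="brush:c">void quicksort(int *a, int lo, int hi) {
    int i=lo, j=hi, h;
    int x=a[(lo+hi)/2];  /* pivot value taken from the middle of the range */

    /* partition: move elements smaller than the pivot to the left
       and elements larger than the pivot to the right */
    do {
        while (a[i] &lt; x) i++;
        while (a[j] &gt; x) j--;
        if (i &lt;= j) {
            h=a[i]; a[i]=a[j]; a[j]=h;
            i++; j--;
        }
    } while (i &lt;= j);

    /* the two halves are independent, so each one can be sorted
       by a separate thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        if (lo &lt; j) quicksort(a, lo, j);
        #pragma omp section
        if (i &lt; hi) quicksort(a, i, hi);
    }
}</pre>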
<p>This is the quick sort algorithm in OpenMP fashion. As you may have already observed, this function is not really different from the original sequential version of the famous sorting technique. The only additions are the three OpenMP specific pragmas and one extra block.</p>
<p>I will now explain how we exploited parallel programming with just these few added lines, but I don&#8217;t want to go into details, as it is always better to read the specification itself before starting to use OpenMP heavily. First, we&#8217;ve created &#8220;parallel sections&#8221;, which expresses our intention to distribute the tasks in the next code block between multiple threads. Next, we&#8217;ve specified the actual &#8220;sections&#8221;, each of which should be executed by one thread.</p>
<p>This way, each time we&#8217;ve split the array into two pieces, we sort the separate regions using separate threads. Of course, for a very large array this does not mean that the number of threads will grow exponentially, as it saturates at some point. How far the nesting goes is just one more parameter that is fully controlled by the programmer.</p>
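<p>As a rough illustration of how that control might look, here is a minimal sketch that caps the recursion at a fixed nesting depth. The helper names <code>quicksort_limited</code> and <code>sort_with_limited_threads</code> and the <code>depth</code> parameter are my own assumptions for this example, and <code>omp_set_max_active_levels</code> requires an OpenMP 3.0 capable compiler. With a depth of <code>d</code> at most 2<sup>d</sup> threads work on the array at the same time.</p>
<pre class="brush:c">#ifdef _OPENMP
#include &lt;omp.h&gt;
#endif

/* Hypothetical helper: same partitioning as above, but a new team of
   threads is only started while the depth budget is not used up. */
static void quicksort_limited(int *a, int lo, int hi, int depth) {
    int i = lo, j = hi, h;
    int x = a[lo + (hi - lo) / 2];

    do {
        while (a[i] &lt; x) i++;
        while (a[j] &gt; x) j--;
        if (i &lt;= j) {
            h = a[i]; a[i] = a[j]; a[j] = h;
            i++; j--;
        }
    } while (i &lt;= j);

    /* when depth reaches zero the if clause makes the region run
       on the encountering thread only */
    #pragma omp parallel sections if(depth &gt; 0) num_threads(2)
    {
        #pragma omp section
        if (lo &lt; j) quicksort_limited(a, lo, j, depth - 1);
        #pragma omp section
        if (i &lt; hi) quicksort_limited(a, i, hi, depth - 1);
    }
}

/* Hypothetical entry point: sorts n integers using at most depth
   nested levels of parallel regions. */
void sort_with_limited_threads(int *a, int n, int depth) {
    if (n &lt; 2) return;
#ifdef _OPENMP
    /* nested regions are often disabled by default, so allow them explicitly */
    omp_set_max_active_levels(depth);
#endif
    quicksort_limited(a, 0, n - 1, depth);
}</pre>
<p>Calling <code>sort_with_limited_threads(values, count, 2)</code> would then sort the array with up to four threads, while a depth of zero falls back to the plain sequential algorithm.</p>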
<h2>Parallelize loops with minimal effort</h2>
<p>It often happens that the performance bottleneck is inside a for loop that moves or does calculations on huge data arrays. One example is an algorithm that interpolates between two float arrays and stores the result in a third one. This can of course be parallelized using the &#8220;sections&#8221; semantics presented earlier, but that would require modifying the original algorithm, which would then no longer clearly reflect its purpose. OpenMP also supports such cases very elegantly:</p>
<pre class="brush:c">#pragma omp parallel for
for(int i = 0; i &lt; size; ++i)
    C[i] = A[i] * alpha + B[i] * (1 - alpha);</pre>
<p>Notice that there are no loop-carried dependencies: no iteration of the loop depends on the result of another iteration. This makes the loop appropriate for parallelization. Just by adding a single pragma, the time needed to execute this loop can scale down very well on multi-core systems.</p>
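<p>To make the distinction more tangible, the following sketch puts a loop with independent iterations next to one with a loop-carried dependency. Both functions, <code>add_arrays</code> and <code>running_total</code>, are hypothetical examples of mine and not part of the code above.</p>
<pre class="brush:c">/* Safe to parallelize: iteration i only touches out[i], x[i] and y[i],
   so the iterations can run in any order. */
void add_arrays(int *out, const int *x, const int *y, int n) {
    #pragma omp parallel for
    for (int i = 0; i &lt; n; ++i)
        out[i] = x[i] + y[i];
}

/* Not safe to annotate this way: iteration i reads a[i - 1], which is
   written by iteration i - 1, so the loop is left sequential here. */
void running_total(int *a, int n) {
    for (int i = 1; i &lt; n; ++i)
        a[i] += a[i - 1];
}</pre>
<p>Putting the same pragma on the second loop would still compile, but the result would depend on thread timing, which is exactly the kind of bug that is hard to track down later.</p>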
<p>For more control, one can specify the exact number of threads that should carry out this for loop by adding another clause to the pragma:</p>
<pre class="brush:c">#pragma omp parallel for num_threads(4)</pre>
<p>Of course there are plenty of other configuration possibilities that control how the parallelized code will actually execute but, again, this article is not meant to be a thorough guide to the usage of OpenMP. It&#8217;s just a foretaste meant to raise interest in learning more about this prominent tool.</p>
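<p>One such possibility is the <code>schedule</code> clause, which controls how loop iterations are handed out to the threads. The sketch below is only an illustration under my own assumptions: <code>count_divisors</code> and <code>divisors_for_all</code> are made-up functions, and the chunk size of 64 is an arbitrary choice. It uses dynamic scheduling because the cost of an iteration grows with the loop index, so an even static split would leave some threads idle.</p>
<pre class="brush:c">/* Hypothetical workload: the cost grows linearly with n */
static int count_divisors(int n) {
    int count = 0;
    for (int d = 1; d &lt;= n; ++d)
        if (n % d == 0) count++;
    return count;
}

/* Stores the number of divisors of i + 1 in out[i] */
void divisors_for_all(int *out, int n) {
    /* threads fetch chunks of 64 iterations whenever they become idle */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i &lt; n; ++i)
        out[i] = count_divisors(i + 1);
}</pre>
<p>Whether dynamic scheduling actually pays off depends on the workload and the chunk size, so it is worth measuring before settling on one.</p>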
<h2>More than just threads</h2>
<p>We&#8217;ve seen so far that OpenMP makes it possible to introduce basic work sharing into an already existing project with minimal effort. However, OpenMP is more than just another way to execute separate threads. It also provides very easy-to-use facilities for synchronization and shared data handling that can be the building blocks of any multiprocessing application, including, but not limited to, the following features:</p>
<ul>
<li>Explicitly scoped variables to indicate shared and thread private storage</li>
<li>Atomic operations and critical sections</li>
<li>Execution barriers for fine grained synchronization</li>
</ul>
<p>The best thing about these is that you just specify the appropriate pragmas for the affected statements or variables and the rest is carried out by OpenMP. For more information on their usage please refer to the <a title="OpenMP specification" href="http://www.openmp.org/wp/openmp-specifications/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.openmp.org/wp/openmp-specifications/?referer=');">OpenMP specification</a>.</p>
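<p>To give a feeling for how the features listed above fit together, here is a small sketch. The function <code>scan_values</code> is a hypothetical example of mine that counts the negative entries of an array and finds its largest value; it is one possible way to combine these pragmas, not the only one.</p>
<pre class="brush:c">#include &lt;limits.h&gt;

/* Hypothetical helper: counts negative entries and finds the largest value */
void scan_values(const int *v, int n, int *negatives, int *largest) {
    int count = 0;
    int best = INT_MIN;

    /* default(none) forces every variable used inside to be scoped explicitly */
    #pragma omp parallel default(none) shared(v, n, count, best)
    {
        /* declared inside the region, so each thread has its own copy */
        int local_best = INT_MIN;

        #pragma omp for
        for (int i = 0; i &lt; n; ++i) {
            if (v[i] &lt; 0) {
                /* atomic update of the shared counter */
                #pragma omp atomic
                count++;
            }
            if (v[i] &gt; local_best) local_best = v[i];
        }
        /* implicit barrier here: all threads have finished the loop */

        /* only one thread at a time may merge its result */
        #pragma omp critical
        {
            if (local_best &gt; best) best = local_best;
        }
    }

    *negatives = count;
    *largest = best;
}</pre>
<p>Keeping a private <code>local_best</code> and merging it once per thread keeps the critical section out of the hot loop, which is usually cheaper than synchronizing in every iteration.</p>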
<h2>Compiler support</h2>
<p>One of the best things about OpenMP is that it is well supported by most of the major C/C++ compiler vendors:</p>
<ul>
<li><strong>GCC</strong> version 4.3.2 and later (enabled with the -fopenmp compiler switch)</li>
<li><strong>Visual C++</strong> 2008 and later (enabled with the /openmp compiler switch)</li>
<li><strong>Intel C/C++</strong> compiler version 10.1 and later (using -Qopenmp on Windows or -openmp on Linux or MacOSX)</li>
</ul>
<p>For a <a title="OpenMP compilers" href="http://www.openmp.org/wp/openmp-compilers/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.openmp.org/wp/openmp-compilers/?referer=');">complete list</a> of supporting compilers please refer to the official site of OpenMP.</p>
<p>Another advantage that arises from the way the language integration of OpenMP has been designed is that it usually degrades gracefully on compilers without OpenMP support, as the pragmas can be silently ignored. I intentionally used the word &#8220;usually&#8221;: if the business logic of the application consciously relies on multi-threaded semantics, then it won&#8217;t execute in exactly the same way with and without OpenMP. Monitoring such situations is the responsibility of the developer.</p>
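<p>Calls into the run-time library are not ignored the way pragmas are, so they need a little extra care. A common approach, sketched below, is to guard them with the <code>_OPENMP</code> macro that OpenMP capable compilers define when the feature is enabled. The function name <code>available_threads</code> is just an assumption for this example.</p>
<pre class="brush:c">#ifdef _OPENMP
#include &lt;omp.h&gt;
#endif

/* Hypothetical helper: number of threads a parallel region may use,
   or 1 when the code is compiled without OpenMP support */
int available_threads(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}</pre>
<p>With this guard in place the same source file builds both with and without the OpenMP compiler switch.</p>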
<h2>Conclusion</h2>
<p>My personal opinion about OpenMP is that it best suits situations where legacy code needs a gradual transition towards a parallelized system, or where one is looking for the easiest possible way to take advantage of multiprocessing capable environments. Still, OpenMP is suitable for almost all the tasks needed to implement completely new applications with parallel programming in mind, so I recommend it to everybody, even for general use.</p>

]]></content:encoded>
			<wfw:commentRss>http://rastergrid.com/blog/2010/01/exploit-parallelism-with-the-least-effort/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

