Multiprocessing has been around for decades as a premium feature of enterprise applications, but adopting this technology is still a huge burden for software companies that maintain and develop legacy code. Nowadays, as most commodity hardware already has a highly parallel architecture, a modern application is almost unimaginable without proper multi-threading capabilities, even if we are talking about a text editor or a multimedia application. The transition from traditional software development to multiprocessing is not an easy and painless task. Fortunately, we have tools like OpenMP at hand.

Currently the biggest hit is OpenCL, as it seems to be the ultimate solution for harnessing the power of highly parallel architectures like multi-core CPUs and DSPs, and, probably most importantly, for leveraging the huge raw computational capabilities of GPUs. However, even though it is one of the most important standards to come out lately, it is not the answer to every question. Those who would like to converge their legacy code with multiprocessing technology may be better advised to look around for other solutions.

My intention had nothing to do with this when I started searching for a multiprocessing framework. I just wanted to find something that provides an easy-to-use interface for introducing multi-threading, and the shared memory semantics it needs, into my hobby projects. This is how I found OpenMP.

What is OpenMP?

Basically, OpenMP is an API specification for parallel programming, intended to extend the languages most preferred for computationally heavy and scientific calculations with a tool set that enables cross-platform multi-threading support tightly integrated into the language itself. Namely, OpenMP adds shared memory parallel programming capabilities to the C, C++ and Fortran languages.

While OpenMP is limited to these particular programming languages, it is a truly open and multi-platform API that is very well supported by different compilers (at least as far as I can tell). The standard itself is developed and maintained in a similar fashion to OpenGL, as it has its own Architecture Review Board with representatives from all major hardware and software vendors such as AMD, HP, IBM, Intel, Sun Microsystems, Microsoft and others.

The specification itself is maintained in two different versions: one for C/C++ and another for Fortran. As I was never involved in Fortran development, I dug deeper only into the C/C++-specific details; however, the facilities provided by the API are basically the same for Fortran as well.

The language extensions are introduced through OpenMP-specific pragmas and a run-time library. At first sight this does not seem to be the most elegant solution, but it fits very well into all versions of the language specifications, so there are no further interworking issues and the OpenMP standard can be maintained entirely separately from the underlying language itself. Looking at the evolution process of the C and C++ programming languages, this approach makes a lot of sense.
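To illustrate these two ingredients, here is a minimal sketch of my own that combines a pragma with a call into the run-time library:

#include <stdio.h>
#include <omp.h>   /* the run-time library part of OpenMP */

int main(void) {
    /* The pragma part: the following statement is executed
       by every thread of the team. */
    #pragma omp parallel
    printf("Thread %d of %d reporting\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}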

Say Hello World to parallel programming

I think the best way to show the power and simplicity of OpenMP is a basic example of how easy it is to add parallel computing capabilities to even the most straightforward algorithms:

void quicksort(int *a, int lo, int hi) {
    int i = lo, j = hi, h;
    int x = a[(lo + hi) / 2];   /* pivot element */

    /* Partition: after the loop, a[lo..j] <= x and a[i..hi] >= x. */
    do {
        while (a[i] < x) i++;
        while (a[j] > x) j--;
        if (i <= j) {
            h = a[i]; a[i] = a[j]; a[j] = h;
            i++; j--;
        }
    } while (i <= j);

    /* Sort the two partitions, potentially on separate threads. */
    #pragma omp parallel sections
    {
        #pragma omp section
        if (lo < j) quicksort(a, lo, j);
        #pragma omp section
        if (i < hi) quicksort(a, i, hi);
    }
}

This is the quicksort algorithm, OpenMP fashion. As you may have already observed, this function is not really different from the original sequential version of the famous sorting technique. The only additions are the three OpenMP-specific pragmas and an extra block.

I will now explain how we exploited parallel programming with just these few added lines, but I do not want to go into details, as it is always better to read the specification itself before starting to use OpenMP heavily. First, we created a "parallel sections" region, which expresses our intention to distribute the tasks in the following code block among multiple threads. Then we specified the actual "sections", each of which is to be executed by a single thread.

This way, each time we split the array in two pieces, we sort the two regions on separate threads. Of course, for a very large array this does not mean that the number of threads grows exponentially, as it saturates at some point; the thread count is just one of the parameters that are fully controlled by the programmer.
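A word of caution before trying this at home: most implementations disable nested parallel regions by default, so only the first level of recursion would actually fork threads. A minimal driver sketch of my own (the array contents are made up for illustration) could look like this:

#include <omp.h>

int main(void) {
    int data[] = { 9, 4, 7, 1, 8, 3, 6, 2, 5, 0 };
    int n = sizeof(data) / sizeof(data[0]);

    /* Nested parallel regions are typically disabled by default;
       enable them so the recursive sections can really run on
       separate threads. */
    omp_set_nested(1);

    quicksort(data, 0, n - 1);
    return 0;
}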

Parallelize loops with minimal effort

It often happens that the performance bottleneck is inside a for loop that moves, or does calculations on, huge data arrays. One example is an algorithm that interpolates two float arrays into a third one. This could of course be parallelized using the "sections" semantics presented earlier, but that would require modifying the original algorithm, which would then no longer clearly reflect its purpose. OpenMP also supports such cases very elegantly:

#pragma omp parallel for
for (int i = 0; i < size; ++i)
    C[i] = A[i] * alpha + B[i] * (1 - alpha);

Notice that there are no loop-carried dependencies: no iteration of the loop depends on the result of another iteration. This makes the loop suitable for parallelization. By adding a single pragma, the time needed to execute this loop may scale down almost perfectly on multi-core systems.
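To make the distinction concrete, here is a made-up counterexample: a prefix sum, where each iteration reads the value written by the previous one, so putting the same pragma on it would yield incorrect results:

/* NOT safe to parallelize: iteration i reads S[i - 1],
   which is written by iteration i - 1. */
for (int i = 1; i < size; ++i)
    S[i] = S[i - 1] + A[i];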

Returning to the interpolation loop: for more control over how many threads carry out its work, one can specify the exact number of threads to be used by adding another clause to the pragma:

#pragma omp parallel for num_threads(4)

Of course, there are plenty of other configuration possibilities controlling how the parallelized code actually executes, but, again, this article is not meant to be a thorough guide to OpenMP; it is just a foretaste intended to raise interest in this prominent tool.
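Just to give one more taste: the schedule clause controls how loop iterations are divided among the threads. In this sketch the chunk size of 1024 is an arbitrary choice of mine:

/* Hand out chunks of 1024 iterations on demand; useful when
   the cost of individual iterations varies. */
#pragma omp parallel for schedule(dynamic, 1024)
for (int i = 0; i < size; ++i)
    C[i] = A[i] * alpha + B[i] * (1 - alpha);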

More than just threads

We have seen so far that OpenMP enables the introduction of basic work sharing into an already existing project with minimal effort. However, OpenMP is more than just another way to execute separate threads; it also provides very easy-to-use facilities for synchronization and shared data handling that can be the building blocks of any multiprocessing application, including, but not limited to, the following features:

  • Explicitly scoped variables to indicate shared and thread-private storage
  • Atomic operations and critical sections
  • Execution barriers for fine-grained synchronization

The best thing about these is that you just specify the appropriate pragmas for the affected statements or variables and the rest is carried out by OpenMP. For more information on their usage, please refer to the OpenMP specification.
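As a small taste of these facilities, the following self-contained sketch (my own illustration, not taken from the specification) combines explicit variable scoping with an atomic update and a barrier:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int hits = 0;

    /* "hits" is explicitly shared; "local" is private to each
       thread because it is declared inside the parallel region. */
    #pragma omp parallel shared(hits)
    {
        int local = omp_get_thread_num() + 1;

        /* Serialize only the update of the shared counter. */
        #pragma omp atomic
        hits += local;

        /* The barrier implies a flush, so past this point every
           thread observes the final value of "hits". */
        #pragma omp barrier

        /* Let exactly one thread report the result. */
        #pragma omp single
        printf("total = %d\n", hits);
    }
    return 0;
}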

Compiler support

One of the best things about OpenMP is that it is well supported by most of the major C/C++ compiler vendors:

  • GCC version 4.3.2 and later (enabled with the -fopenmp compiler switch)
  • Visual C++ 2008 and later (enabled with the /openmp compiler switch)
  • Intel C/C++ compiler version 10.1 and later (using /Qopenmp on Windows or -openmp on Linux or Mac OS X)

For a complete list of supported compilers, please refer to the official site of OpenMP.

Another advantage, arising from the way the language integration of OpenMP has been designed, is that it usually degrades gracefully on compilers without OpenMP support, as the pragmas can be silently ignored. I intentionally used the word "usually": if the business logic of the application consciously relies on the multi-threaded semantics, it will not execute in exactly the same way with and without OpenMP. The responsibility to watch out for such situations lies with the developer.
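When the code does rely on OpenMP-specific behaviour, the standard _OPENMP preprocessor macro, which every conforming compiler defines, makes it possible to detect at compile time whether the pragmas are actually honoured. A typical guard might look like this:

#ifdef _OPENMP
#include <omp.h>
#else
/* Fallback stubs so the code still builds and behaves
   sensibly when compiled without OpenMP support. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif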

Conclusion

My personal opinion about OpenMP is that it best suits situations where a gradual transition of legacy code towards a parallelized system is needed, or where one is searching for the easiest possible way to take advantage of multiprocessing-capable environments. Still, OpenMP is suitable for almost all the tasks needed to implement completely new applications with parallel programming in mind, so I recommend it to everybody, even for general use.