Now that your shiny new desktop and/or laptop computers have multiple cores, I’ll bet you’re thinking about how to make use of your new-found hardware. Just adding cores doesn’t make programs go faster; you also need to tell the program how to use the cores.

Programming Models

You can tell your program how to use those extra cores in many ways, and the merits of each method deserve a more detailed discussion than there is room for here. In this article we are going to get you started with OpenMP.

There are many ways to write and design programs. Programming models are abstractions that provide a framework for designing and implementing them. In a massive oversimplification, there are two major programming models for multiple processors: shared memory and distributed memory.

Shared memory models (sometimes called “shared everything” models) assume you have one or more CPUs/cores that can directly access and change all the variables in your program. This is not exactly true, but it is a useful simplification.

Distributed memory models (sometimes called “shared nothing” models) assume that every CPU has access to its own memory space, and can alter its own local variables.

This seems like a minor semantic difference until you realize that in the distributed memory case you have to actively move data between CPUs, while in the shared memory case data movement between CPUs happens implicitly.

The rest of this article will focus upon shared memory programming using OpenMP, and provide some basic examples of how to get started with it.

Preparation time: about one hour, maximum, before programming if you have to build your own compiler. A few minutes if you can install pre-built compiler packages.

Starting with your computer, let’s assume a Linux-based machine with a few cores. I am using a Pegasus many-core workstation. To see how many cores you have in your system, try this:

grep 'processor.*:' /proc/cpuinfo | wc -l

The Pegasus system reports eight cores (four dual-core processors). Yours should show two or more.

We also need to see how much disk space we have. This is fairly easy to do. The “df” command is your tool of choice. A quick df -h will tell us what we have available, though it might not tell us if we can use it.

This tells us that our home directory has as much as 33GB available. So let’s make ourselves a workspace, and continue.

mkdir ~/workspace
cd ~/workspace

We would like to know how much physical RAM we have, so try this:

grep "MemTotal:" /proc/meminfo

The Pegasus machine has 8184008 kB, or about 8GB RAM. For my editor, I may use Vim or Pico. Use whichever you are most comfortable with.

We also need a compiler. Intel, the Portland Group, and others produce excellent commercial compilers that support OpenMP. As it turns out, OpenMP is supported by GCC 4.2 and higher as well. This situation may represent a dilemma for you; why should you purchase the other compilers, if you get OpenMP for free from GCC? In short, the other compilers do offer more in the way of optimization, support, and other added elements of value.

To check your version of gcc, enter gcc -v. If you don’t have gcc/gfortran 4.2 (or higher) you may have to build them yourself. Unfortunately, your system administrators may not let you modify the machines with which you want to experiment. So you might be forced to install the packages yourself into your own directories. Even if you have full access to your whole machine, this is probably a good idea in any case, as it keeps the default system compilers separate from the GNU 4.2 tool chain.

To build GCC, you need GMP and MPFR. Get GMP, build it, and install it as follows (of course, use your own --prefix path):
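A sketch of that build sequence is shown below. The version numbers and the --prefix path (/home/joe/local) are examples only; substitute the versions you downloaded and your own paths.

```shell
# Sketch of the GMP -> MPFR -> GCC build chain; version numbers
# and --prefix paths are examples, not requirements.

# GMP
tar xjf gmp-4.2.1.tar.bz2
cd gmp-4.2.1
./configure --prefix=/home/joe/local
make -j8 && make install
cd ..

# MPFR (needs GMP)
tar xjf mpfr-2.3.0.tar.bz2
cd mpfr-2.3.0
./configure --prefix=/home/joe/local --with-gmp=/home/joe/local
make -j8 && make install
cd ..

# GCC 4.2, pointed at the GMP and MPFR we just built
tar xjf gcc-4.2.0.tar.bz2
mkdir gcc-build && cd gcc-build
../gcc-4.2.0/configure --prefix=/home/joe/local \
    --with-gmp=/home/joe/local --with-mpfr=/home/joe/local \
    --enable-languages=c,fortran
make -j8 && make install
```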

Compilation required 12.25 minutes on 8 processors. It would take about 8 times longer on 1 processor, which is why we used the -j8 switch with make (parallel build). If you have a single dual-core processor, use -j2. To use this compiler, we need to set an environment variable. This enables the linker to find the libraries. (Again, your path may vary.)

export LD_LIBRARY_PATH=/home/joe/local/lib64/lib

The compiler itself is located in /home/joe/bin/gcc, which can be set in the CC variable in the Makefile. One more thing is needed to correct a bug in the installation on 64-bit platforms (it shouldn’t be needed on 32-bit platforms).

Now that we have an operational compiler, we can start to explore OpenMP. At its core, OpenMP is a set of compiler hints and function calls to enable you to run sections of your code in parallel on a shared memory parallel computer (multicore).

OpenMP assumes you will work with threads: streams of execution that share the same memory address space within a single program. If one thread makes a change to a variable that all threads can see, then the next access to that variable will use the new value.

As it turns out, this model is fairly easy to think about. Imagine all your variables in a big block of memory, and all CPUs can see all the variables. This situation is not strictly true, but it is a good first approximation. The threads all see this memory and can modify this memory. Again, an oversimplification, but a useful one for the moment.

The threads can all perform I/O operations (file, print, and so on). So something as simple as our hello world application may in fact be able to run in parallel, though I/O operations generally need to serialize access to global system state (file pointers, etc.).

Compiler Hints

OpenMP operates mostly via compiler hints: directives that the compiler can simply ignore if it is not building for OpenMP. In Fortran, these take the form of specially formatted comments, typically !$omp ..., while in C and C++ they are pragmas, which are a little more complex:

#pragma omp ...
{
...code...
}

where the ...code... is called the parallel region. That is the area of the code that you want to try to run in parallel, if possible.

When the compiler sees the start of the parallel region, it creates a pool of threads. When the program runs, these threads start executing, and are controlled by what information is in the hints. Without additional hints, we simply have a “bunch” or “bag” of threads.

A way to visualize this is to imagine an implicit loop around your parallel region, where you have N CPU/core iterations of the loop. These iterations all occur at the same time, unlike an explicit loop.

The number of threads is controlled by an environment variable, OMP_NUM_THREADS. If it is not set, the default is implementation dependent: it could be 1, or it could be the number of cores on your machine. Just to be sure, you may want to do the following.

export OMP_NUM_THREADS=`grep 'processor' /proc/cpuinfo | wc -l `

Now we’re ready to parallelize hello.c. (Note: all of the source code used in this article is available online.) As a first step, let’s put in the explicit compiler hints, and do nothing else.
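A minimal sketch of what that first step might look like follows; the file and function names here are illustrative, not the article’s exact source. Built with gcc -fopenmp, the parallel region runs once per thread; without -fopenmp, the pragma is ignored and it runs serially.

```c
/* hello_omp.c (hypothetical name): hello.c with the compiler hint
 * added and nothing else.  Build with: gcc -fopenmp hello_omp.c */
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Print one greeting per thread; return how many threads ran. */
int hello(void)
{
    int nthreads = 1;
    #pragma omp parallel
    {
        printf("Hello, world!\n");
        #ifdef _OPENMP
        #pragma omp master
        nthreads = omp_get_num_threads();
        #endif
    }
    return nthreads;
}
```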

It should be pointed out as well that we really should insert a preprocessor directive in our code, in order to pull in the function prototypes and constants needed for OpenMP:

#include <omp.h>

If you want, you can enclose this in an ifdef construct:

#ifdef _OPENMP
#include <omp.h>
#endif

This way, the header will only be pulled in if the compiler has been told to use OpenMP. Before we do something useful with this, let’s explore a few functions that might be helpful.

We can use several OpenMP function calls to query and control our environment. The most frequently used functions are those that return the number of threads operating, and the current thread ID. There are several others that are useful. Our new program incorporates several of these, along with a few “tricks” I have found useful over the years.
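A sketch of these environment queries is shown here. The omp_* functions are real OpenMP API calls; the surrounding code is illustrative, not the article’s actual program.

```c
/* Query the OpenMP environment: each thread reports its ID, and the
 * function returns the team size.  Guarded so it also compiles and
 * runs (serially) without -fopenmp. */
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int report_threads(void)
{
    int nthreads = 1;
    #pragma omp parallel
    {
        int tid = 0;
        #ifdef _OPENMP
        tid = omp_get_thread_num();        /* this thread's ID, 0..N-1 */
        #pragma omp single
        nthreads = omp_get_num_threads();  /* threads in the team */
        #endif
        /* Tag each line with the thread ID so the output can be sorted. */
        printf("%i: hello from thread %i\n", tid, tid);
    }
    return nthreads;
}
```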

The first number you see there is the thread number, or thread ID (the tid variable in the program). Notice that the output does not necessarily come out in thread order. If you examine it closely, you might notice that it doesn’t come out in time order either, though it is pretty close. One of the tricks that I have learned and use is to tag each line with either the thread ID or the time, and then sort the output.

Now we can tell which thread did what, though the time ordering is still off. It’s a simple matter of programming to replace the thread ID tag with a time value that can be sorted. When I do this, I usually change the print format lines to look something like this:

"%-.3f D[%i] ... ",timestamp,tid,...

Once I’ve added this, I can see what happened, in the order that it happened, and still get the thread ID data. Tricks like this are helpful when debugging parallel programs. Now it’s time to move on to where the majority of OpenMP’s power becomes apparent to end users.

Loops

OpenMP helps you in a number of clever ways, allowing you to add threading to your program without thinking through all the details of thread setup and tear-down. It will also effectively re-engineer the loops you tell it to parallelize.

As usual, it helps to start out with an understanding of where your program is spending its time. Without that, you could wind up parallelizing a loop that takes very little time in the overall execution path while ignoring the expensive code. Additionally, it is important to test runs of various sizes, so you understand which portions of your code scale well and which portions may need assistance.

Profiling is the best way to do this. Here we will use a “poor man’s profiler”: basically, timing calipers around sections of the code. This technique will give us approximately millisecond-resolution data (specifically, it is limited by the clock timer tick interrupt rate, which could be anywhere from 10 milliseconds down to 1 millisecond).

You will not get more accurate single-shot data than the timer resolution allows. In order to get better resolution, you need to iterate enough times that you can calculate an average time per iteration. It is also worth noting that OS jitter (management interrupts for disk I/O, network I/O, and running an OS in general) will also affect your measurements somewhat, so take this into account if you would like more precise data.
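The timing-caliper technique can be sketched as follows: record a wallclock timestamp at each caliper point and report the deltas at the end. The function and variable names here are illustrative, not the article’s actual library routines.

```c
/* Timing calipers: mark timestamps around code sections, report at
 * the end.  A sketch; names are hypothetical. */
#include <stdio.h>
#include <sys/time.h>

#define MAX_CALIPERS 16

double caliper_time[MAX_CALIPERS];
int    ncalipers = 0;

/* Record the current wallclock time (in seconds) as the next caliper. */
void caliper_mark(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    if (ncalipers < MAX_CALIPERS)
        caliper_time[ncalipers++] = tv.tv_sec + tv.tv_usec / 1.0e6;
}

/* Report the elapsed time between each adjacent pair of calipers. */
void caliper_report(void)
{
    int i;
    for (i = 1; i < ncalipers; i++)
        printf("calipers %d to %d: %.3f seconds\n",
               i, i + 1, caliper_time[i] - caliper_time[i - 1]);
}
```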

We implement the calipers in the code using some library routines, a data structure, a counter, and a reporting loop at the end. Our test code is named rzf.c. It computes the Riemann Zeta Function of order n by summing terms in a loop. Take a look at the comments for more details; in short, it computes the zeta (ζ) function:

ζ(n) = Σ (k = 1 to ∞) 1/k^n

To run the serial version of the code, first build the executable (make -f Makefile.rzf), then run it as shown here:

./rzf.exe -l INFINITY -n n

where INFINITY is an integer value (e.g. 1000000) of how many terms you would like to use for your sum, and n is the argument to the Riemann Zeta Function. If you use 2 for n, then it will calculate π for you as well. The sum is basically a loop that looks like the following:
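A sketch of that summation loop is shown below. The function and variable names are illustrative; see the actual source for the real loop (which, as noted next, runs in the opposite order).

```c
/* Sum 1/k^n for k = 1 .. inf, where inf is the -l argument and n is
 * the order of the zeta function.  A sketch with hypothetical names;
 * shown in ascending order (the real code sums from inf down to 1). */
double rzf_sum(long inf, int n)
{
    double sum = 0.0;
    long k;
    for (k = 1; k <= inf; k++) {
        double term = 1.0;
        int j;
        for (j = 0; j < n; j++)   /* term = 1 / k^n */
            term /= (double)k;
        sum += term;
    }
    return sum;
}
```

With n = 2 and a large term count, the result approaches π²/6 ≈ 1.6449, which is how the program can recover π.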

For numerical stability (round-off error accumulation) reasons, you actually run the sum in the other order (from inf to 1), which we do in the actual code.

OpenMP provides an easy way to parallelize the loop. As before, you set up a parallel region, and then you tell the compiler that you have a loop in the parallel region that you want to parallelize.

We are going to do this, but first I want to point out that this is an example of something called a reduction operation. This loop is a sum reduction. A reduction operation occurs (in a technical sense) when you reduce the rank or number of dimensions of something. In this case, imagine that we have a big long string (or vector) of numbers:

[1^-2, 2^-2, 3^-2, ... , inf^-2]

We are going to reduce these numbers to a single number by summing them (equivalent to taking a dot product with [1, 1, 1, ..., 1]). OpenMP needs to know that you intend to do this; otherwise you will likely run into a common problem in OpenMP programming (false sharing).

First things first, however: let’s see how the code performs, and specifically, where the slow area is. Let’s run it and find out.

Using -n 2 and -l 1000000000, the loop took about 51 seconds to execute. The wallclock and user times are important as well. In this case, the wallclock time is 3 milliseconds more than the loop time, and the user CPU time is a little less than the wallclock time. The sum of the user and sys times does, in fact, very nearly equal the wallclock time. This result is good.

It’s important to note that as you scale up your parallel application, this time sum should remain very nearly constant. This idea may sound counter-intuitive, but it means that you are effectively partitioning work among the N CPUs. If this sum increases significantly with increasing processor count, you may not be operating as effectively as you could be, and your code won’t scale to the extent you might like it to.

Can we get any benefit out of running this program in parallel? And how hard is it to convert?

Let’s add in the OpenMP bits as we did before. First, make a copy of the original program and Makefile (note: the source code contains the program and Makefile with the changes we are making), and rename them rzf_omp.c and Makefile.rzf_omp. Then edit these and add the -fopenmp compiler flag to CFLAGS, FFLAGS, and LDFLAGS. The rzf_omp.c is a little more complex, but not too much.

Here we need to add a parallel pragma, and tell it about the for loop and the sum reduction. So, insert this line immediately above the for loop:

#pragma omp parallel for reduction(+: sum)

As indicated, the pragma tells the compiler that the following for loop is to be run in parallel, and that it performs a sum reduction over the variable named sum.
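With the directive in place, the loop reads roughly like this (a sketch with illustrative names). Compiled with -fopenmp, the iterations are split across threads, each keeping a private partial sum that is combined with “+” at the end; without -fopenmp, the pragma is ignored and the loop runs serially with the same result.

```c
/* The summation loop with the OpenMP directive applied (a sketch). */
double rzf_sum_omp(long inf, int n)
{
    double sum = 0.0;
    long k;

    #pragma omp parallel for reduction(+: sum)
    for (k = 1; k <= inf; k++) {
        double term = 1.0;
        int j;
        for (j = 0; j < n; j++)   /* term = 1 / k^n */
            term /= (double)k;
        sum += term;              /* each thread's private partial sum */
    }
    return sum;                   /* partial sums combined here */
}
```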

Remember, we have eight processors/cores (you may have fewer, or more), and by default the gcc-4.2 OpenMP runtime will try to use all available processors. The old loop took about 50.95 seconds. For eight processors, if the work is evenly divided among threads, we should get about a 6.37 second run time for the loop, with maybe a little extra for thread setup and tear-down, which the compiler handles automatically.

What we measure is 6.35 seconds. Notice that the estimated error is different from that of the single core run. This leads into a whole other discussion, which we will touch on very briefly in a minute. The point is that our code, or at least that loop in our code, is now about eight times faster on eight cores. That’s not bad at all, especially considering how little work it required (and it got the right answer!).

Notice also that the user time is about the same as it was before. This is because we had eight processors all working on the loop. Each processor working for a particular time interval adds its time to the total user time.

Very briefly, the way round-off error accumulates in these calculations does have an impact upon the final results. Running this sum in the other direction, which most people would normally do, accumulates error in a different order, leading to a different result. That is, when you change the order of your floating point computations, you change the round-off error accumulation. It is possible to observe a catastrophic loss of accuracy, impacting the first and second digits of a summation like this, by not being aware of these issues. But that’s a topic for another time.

Application: Matrix Multiply

We have presented a fairly “trivial” set of examples, and we have used up about 15 of our 30 minutes. The hello case is great to use as a sanity check: specifically, something to verify that OpenMP is installed and operational. The rzf code is an example of an effectively embarrassingly parallel code: there is no communication between parallel threads until the very last step, where the sum reduction is performed.

Note that had we not expressed this in terms of a sum reduction, the compiler would have happily treated the sum variable as shared, with eight cores contending for it. By doing this (remove the reduction clause in the rzf_omp code to see what we mean), you get a 1/N effect.

That is, the sum global variable is being updated by N threads, each of which gets access to the global cache line 1/N of the time on average. When you see this, you typically do not get any performance advantage relative to serial execution, and usually you see a performance penalty.

This case is important: the access to the sum variable in the preceding example is a Single Point of Information Flow (SPIF), also called a point of serialization.

Without getting into a major discussion of Amdahl’s Law, you want to avoid SPIFs as much as possible. They are not merely bad for performance; they are parallel performance killers. The point of serialization indicated above, the 1/N effect, is a SPIF; the cache-line contention behind it is often called “false sharing.”

As it turns out, this is very important for the matmul (matrix multiply) example that follows, in which we will allow one SPIF at the beginning. The way we got out of the SPIF in rzf was to tell the compiler to maintain a private, per-thread copy of the variable sum, and then combine the copies later with the “+” operator. That is the entire point of the reduction clause, and yes, you can have other reductions.

The core algorithm in the matmul code is the triply nested for loop. You can see what happens when we run it with reasonable values of DIM in Table One.

N (matrix dimension)    T(N) in seconds
        100                    0.003
        200                    0.023
        400                    0.553
       1000                   14.46
       1500                   57.46
       2000                  137.1
       4000                 1107

Table One: Matrix size vs run time

A 4000×4000 matrix requires 16 million elements. Each element is 8 bytes (double precision floating point), so each array occupies 128 million bytes: 128 MB, or about 122.07 MiB (remember, an MB is different from an MiB). Looking over the calipers for the 4000×4000 run, we see output similar to this:

This result suggests that the matrix fill time (calipers 1 to 2) and the normalization time (calipers 2 to 3) are under 0.1% of the execution time. So we should focus our parallelization effort upon the matrix multiplication loops rather than the other loops. This is the case even though random number generators, which need to preserve global state (or at least per-thread state), are often SPIFs themselves.

Now that we have established what takes the most time, how do we parallelize this part of the program? As it turns out, we use similar methods to what was used before.

The main multiplication loop looks like this (see the source code for the full program):
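A sketch of that triply nested loop follows: c = a × b for DIM×DIM matrices, with the innermost loop forming a dot product in the variable dot. The declarations here are illustrative; the real source may declare and fill the arrays differently.

```c
/* The main matrix multiplication loop, sketched.  DIM and the array
 * declarations are illustrative stand-ins. */
#define DIM 100

double a[DIM][DIM], b[DIM][DIM], c[DIM][DIM];

void matmul(void)
{
    int i, j, k;
    double dot;

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++) {
            dot = 0.0;
            for (k = 0; k < DIM; k++)     /* inner loop: a sum reduction */
                dot += a[i][k] * b[k][j];
            c[i][j] = dot;
        }
    }
}
```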

You will probably notice immediately that the interior loop is a sum reduction. It is basically a dot product between two vectors. You might be inclined to place the parallelization directives there. Resist that urge.

That inner loop is executed N squared times, which means the parallel region would be set up and torn down N squared times. Setup and teardown are not free: they have a non-zero time cost. This would become abundantly clear were you to place the parallelization directives around that loop.

In general, for most parallelization efforts, you want to enclose the maximum amount of work within the parallel region. This rule suggests you really want to put the directives outside the outermost loop, or as high up in the loop hierarchy as possible.

In this example, private(i,j,k,dot) tells the compiler which variables are private to each thread (i.e., not shared between threads), and shared(a,b,c) indicates which ones are shared across threads. Notice we haven’t specified anything about the dimensions of the arrays.
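Applied to the outermost loop, as recommended above, the directive looks roughly like this (a sketch; the declarations are illustrative). Without -fopenmp the pragma is ignored and the loop runs serially, producing the same result.

```c
/* The parallel directive placed outside the outermost loop.
 * private() gives each thread its own indices and dot accumulator;
 * shared() makes the matrices visible to all threads. */
#define DIM 100

double a[DIM][DIM], b[DIM][DIM], c[DIM][DIM];

void matmul_omp(void)
{
    int i, j, k;
    double dot;

    #pragma omp parallel for private(i, j, k, dot) shared(a, b, c)
    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++) {
            dot = 0.0;
            for (k = 0; k < DIM; k++)
                dot += a[i][k] * b[k][j];
            c[i][j] = dot;
        }
    }
}
```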

Table Two shows what we observe for the N=4000 case.

Number Cores    Time (s)    Speed-up    Efficiency
      1         1024.18       1.00        100.0%
      2          513.89       1.99         99.7%
      4          275.27       3.72         93.0%
      8          152.24       6.73         84.1%

Table Two: Speed-up vs number of cores for first version of matmul

This result isn’t bad. We measure a 6.73 times speedup on eight cores: about 84% efficient. The result is good, but can we make it go any faster? There are two approaches: use more OpenMP features, or tweak the code slightly. The answer is to use both. With a relatively simple code adjustment (one that some compilers might make for you if you can coax them), you can see significantly better performance in parallel. What we do is increase the amount of work done per parallel iteration, by unrolling a loop. Our code now looks like this:
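A sketch of the unrolled loop follows: four rows of c are computed per outer iteration, so each b[k][j] load is reused four times, and dot is now a small array. The declarations are illustrative (the article’s source uses a DIM variable set elsewhere, which we stand in for here), and this sketch assumes the dimension is a multiple of 4.

```c
/* The unrolled multiplication loop, sketched: four rows per outer
 * iteration, dim firstprivate so each thread starts with its own
 * initialized copy.  Assumes dim is a multiple of 4. */
#define DIM 100

double a[DIM][DIM], b[DIM][DIM], c[DIM][DIM];

void matmul_unrolled(void)
{
    int i, j, k;
    int dim = DIM;        /* stands in for the DIM variable in the source */
    double dot[4];

    #pragma omp parallel for private(i, j, k, dot) firstprivate(dim)
    for (i = 0; i < dim; i += 4) {
        for (j = 0; j < dim; j++) {
            dot[0] = dot[1] = dot[2] = dot[3] = 0.0;
            for (k = 0; k < dim; k++) {
                double bkj = b[k][j];          /* load once, use 4 times */
                dot[0] += a[i][k]     * bkj;
                dot[1] += a[i + 1][k] * bkj;
                dot[2] += a[i + 2][k] * bkj;
                dot[3] += a[i + 3][k] * bkj;
            }
            c[i][j]     = dot[0];
            c[i + 1][j] = dot[1];
            c[i + 2][j] = dot[2];
            c[i + 3][j] = dot[3];
        }
    }
}
```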

With this modification, and a few additional ones to the definition and use of the variable dot (now an array), we now measure the performance for the N=4000 case. The results are in Table Three.

Number Cores    Time (s)    Speed-up    Efficiency
      1          307.54       1.00        100.0%
      2          154.93       1.99         99.3%
      4           82.59       3.72         93.1%
      8           43.16       7.13         89.1%

Table Three: Speed-up vs number of cores for the unrolled version of matmul

This version of the program operates on four rows at a time (we unrolled the loop), increasing the amount of work done per iteration. We have also reduced the cache miss penalty per iteration by reusing some of the more expensive elements (the b[k][j] values). In addition, we added the firstprivate(DIM) clause, which specifies that each thread gets its own instance of the variable, initialized with the value the variable had before the parallel construct.

Of course, after all of these modifications, we check the results to make sure that the addition of parallelization has not also caused the addition of bugs!

Summary

We have shown how to obtain, build, and use an OpenMP compiler on a Linux machine. The OpenMP examples shown range from a simple “hello” program through parallel matrix multiplication, and all demonstrate excellent performance.

Perhaps it’s a bit more than 30 minutes, but with this you get the sense that OpenMP is both powerful, and easy to use. There is plenty more to learn, but at least you have already started working with OpenMP. As desktop units surge to eight and even 16 or more cores, OpenMP is likely to become increasingly important for applications development.