An OpenACC Example (Part 1)

In this post I’ll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I’ll add just a few more lines to push that speedup over 4x.

Example Source Code

You can download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1, step 2). The examples use OpenACC 1.0 syntax. You’ll need a compiler with support for OpenACC. I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

Jacobi Iteration

Let’s look at a somewhat more interesting example. This is still a simple program, but it gives us a bit more room to explore various OpenACC directives and options. Jacobi iteration is a standard iterative method for finding solutions to a system of linear equations. In this post I’ll take an existing, simple sequential CPU implementation of two-dimensional Jacobi iteration and optimize it using OpenMP and OpenACC. The core computational kernel of this example is shown below. The outer while loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, tol. For benchmarking purposes, we have set tol low enough that the outer loop always runs for iter_max iterations (1000). The first pair of nested loops applies a 2D Laplace operator at each element of a 2D grid, and the second pair copies the output back to the input for the next iteration. For our benchmark, we use a grid of 4096×4096 elements.

To make sure we’re using all the cores of our CPU for a fair comparison, we use OpenMP parallel for directives (the #pragma statements in the C code and !$omp statements in the Fortran code) on each loop nest. The omp parallel for directive instructs the compiler to parallelize the following loop using CPU threads (the number of threads is specified in the OMP_NUM_THREADS environment variable). We also use PGI’s -fast compiler option to get the best CPU code optimizations, including generation of CPU vector instructions (e.g. SSE) where possible. This option makes a big difference to performance, and in the case of this code it does not noticeably affect accuracy. The PGI compiler also automatically replaces the innermost copy loop with an optimized memory copy. The following table shows the performance of the C code using 1, 2, and 4 threads on a system with a quad-core Intel Xeon X5550 CPU (2.66 GHz).

| Execution     | Time (s) | Speedup vs. 1 CPU Thread |
|---------------|----------|--------------------------|
| CPU 1 thread  | 34.14    | —                        |
| CPU 2 threads | 22.58    | 1.51x                    |
| CPU 4 threads | 21.16    | 1.61x                    |

As you can see, using all 4 cores of the CPU for this code is not 4 times faster. That’s because the computation is memory-bandwidth bound. I won’t claim that this is extremely efficient CPU code; it could be optimized further, for example by splitting the loops for cache blocking. However, I think the code is representative of what an average scientist or engineer would write. We are at least using OpenMP and an optimizing compiler to make sure we use all CPU cores and vector instructions, so it makes a good starting point and comparison for applying GPU directives. As we’ll see in Step 2, below, the PGI Accelerator compiler automatically generates cache-blocked GPU code for us by using the GPU’s on-chip shared memory, so we don’t have to restructure our loops.

Step 1: Our First GPU Directives

Let’s just drop in a single, simple OpenACC directive before each of our for loop nests in the previous code. Just after the #pragma omp lines, we add the following. (We leave the OpenMP directive in place so we retain portability to multicore CPUs.)

#pragma acc kernels

This tells the compiler to generate parallel accelerator kernels (CUDA kernels in our case) for the loop nests following the directive. To compile this code, we tell the PGI compiler to compile OpenACC directives and target NVIDIA GPUs using the -acc -ta=nvidia command line options (-ta=nvidia means “target architecture = NVIDIA GPU”). We can also enable verbose information about the parallelization with the -Minfo=accel option. If you are interested in the details, check out the Appendix below.

Let’s run the program. Our test PC has an Intel Xeon X5550 CPU and an NVIDIA Tesla M2090 GPU. When I run it, I see that it takes about 75 seconds. Uh oh, that’s a bit of a slow-down compared to my parallel CPU code. What’s going on? We can modify the compilation command to enable built-in profiling by specifying -ta=nvidia,time. Now when I run it I get:

Of the 78 or so seconds spent in the parallel regions, we’re spending over 62 seconds (31,135,910us + 31,385,120us) moving data around. The reason is that the compiler doesn’t know that we are using those arrays repeatedly on the device. So each time through the outer while loop, when it encounters an acc kernels directive, it must copy the data used on the accelerator from the host, and then copy it back after the directive region ends. That means we copy the A and Anew arrays back and forth twice for each of the 1000 iterations of the outer while loop. There’s no need for this: we don’t need the results on the host until after the while loop exits. Let’s fix it by making the compiler aware of the data we need on the device.

Step 2: Efficient Data Management

We need to move the data copies outside the while loop, and OpenACC makes this easy. We just put an acc data directive with a copy clause just before the while loop:

#pragma acc data copy(A, Anew)

The data copy directive tells the compiler to copy the arrays A and Anew to the device before the while loop runs and back from the device after it finishes. There is also about 2 seconds of initialization overhead shown in the profile above. We can make sure this is not included in our timing (large applications will typically initialize the GPU only once) by calling acc_init(acc_device_nvidia) at startup.

After making these changes, our GPU code is much faster: with just 3 lines of OpenACC directives and an initialization call we have made our code more than twice as fast by running it on a GPU, as shown in the following table (times are for the C version).

| Execution              | Time (s) | Speedup vs. 1 CPU Thread | Speedup vs. 4 CPU Threads |
|------------------------|----------|--------------------------|---------------------------|
| CPU 1 thread           | 34.14    | —                        | —                         |
| CPU 4 threads (OpenMP) | 21.16    | 1.61x                    | 1.0x                      |
| GPU (OpenACC)          | 9.02     | 3.78x                    | 2.35x                     |

We Can Do Even Better

We’ve already gotten a speedup with only 3 lines of directives (6 in Fortran), but we can do better with a bit deeper understanding of the OpenACC execution model, and that’s where I’ll pick up in my next Parallel Forall post.

Appendix: Compiler Output

With -Minfo=accel, the compiler reports details for each accelerated region; the numbers starting some of the lines are line numbers corresponding to the input code. Line 74 is the #pragma acc kernels directive, and you can see that the compiler has figured out what data it needs to copy from the host (CPU) to the device (GPU) when entering the region (the array A), and what it needs to copy from the device back to the host when leaving it (the array Anew). The compiler reports that the loop nests starting on lines 75 and 77 are both parallelizable, and that two accelerator kernels (that is, functions to run on the GPU) were generated. Both pairs of loops are vectorized with width 16. This processes the 4096×4096 grid in blocks of 16×16 elements, which allows the blocks to be cached in the GPU’s fast shared memory (the compiler caches 18×18 blocks of A because each element depends on its neighbors).

The compiler also indicates that a max reduction was generated for the variable error. A reduction is a parallel algorithm that produces a single output value from a vector of input values; in this case we need the maximum of all the computed error values. Note that technically we should have used an explicit reduction clause on the directive for the first loop nest, but the early implementation of OpenACC in the PGI compiler I used when writing this post did not yet support it. Fortunately, the compiler is clearly able to detect the reduction on its own and automatically generates efficient parallel code for it.

About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter