Easy GPU Parallelism with OpenACC

By Rob Farber, June 11, 2012

An emerging standard uses pragmas to move parallel computations in C/C++ and Fortran to the GPU

The following commands compile the source code for each application with the PGI C and Fortran compilers. These commands assume the source code has been saved to the file name provided in the comment at the beginning of each example.

The command line arguments to the PGI C compiler (pgcc) and Fortran compiler (pgfortran) are:

-fast: Choose generally optimal optimization flags for the target platform.

-mp: Interpret OpenMP pragmas to explicitly parallelize regions of code for execution by multiple threads on a multiprocessor system.

-acc: Interpret OpenACC pragmas.

-Minfo: Emit useful compilation information to stderr.

-Mconcur: Enable auto-concurrentization of loops.

The Portland Group also provides a profiling capability, enabled via the PGI_ACC_TIME environment variable. Profiling is disabled by default; setting PGI_ACC_TIME to a positive integer value enables it, while a negative value disables it again. The profiling overhead is minimal because the runtime reports only information collected by the GPU hardware performance counters. The wealth of information gathered by the runtime profiler can be seen in the output generated by matrix-acc-gpu after setting PGI_ACC_TIME=1.

This output from the PGI runtime profiler tells us that the application spent 7 milliseconds transferring data and 43 milliseconds computing the matrix multiply kernel.

It is possible to create a timeline plot using the NVIDIA Visual Profiler (nvvp), which runs on Windows, Linux and Mac computers. (The nvvp application was previously known as computeprof.) The timeline is a new feature in the CUDA 4.2 release and is extremely useful!

Figure 1: nvvp timeline for matrix-acc-gpu.

Notice that there are:

Three host to device data transfers at the start of the computation. These transfers correspond to the copyin() clauses for matrices a and b plus the copy() clause for matrix c.

A GPU computation that requires 39.1% of the time for kernel main_24_gpu. A helpful feature of the PGI OpenACC compiler is that it intelligently labels the kernel with the routine name and line number to make these timelines intelligible.

A single data transfer back from the device to the host, which was required by the copy clause for matrix c at the end of the kernel.

The visual profiler provides an interactive display of the timeline. A larger screenshot would show the driver API calls for the CUDA context setup and the data transfers, along with a host of other information. In addition, nvvp can analyze the application and offer automated suggestions, but this analysis requires running the application many times; it is best to look at the timeline first, since that requires only a single run.

For example, the following screenshot shows the initial analysis of the timeline shown in Figure 1:

Figure 2: Automated analysis performed by the NVIDIA Visual Profiler.

Matrix Multiply Is an Ideal Case

Most computationally oriented scientists and programmers are familiar with the BLAS (Basic Linear Algebra Subprograms) library, the de facto programming interface for basic linear algebra.

BLAS is structured according to three different levels with increasing data and runtime requirements.

Level-1: Vector-vector operations that require O(N) data and O(N) work. Examples include taking the inner product of two vectors, or scaling a vector by a constant multiplier.

Level-2: Matrix-vector operations that require O(N²) data and O(N²) work. Examples include matrix-vector multiplication or a single right-hand-side triangular solve.

Level-3: Matrix-matrix operations that require O(N²) data and O(N³) work. Examples include dense matrix-matrix multiplication.

The following table summarizes the data transferred from the host to the device and the work performed at each BLAS level, along with the resulting work per datum. It does not account for the time required to transfer results back to the host.

BLAS level    Data      Work      Work per Datum
1             O(N)      O(N)      O(1)
2             O(N²)     O(N²)     O(1)
3             O(N²)     O(N³)     O(N)

Table 2: Work per datum for the three BLAS levels.

Matrix multiply is an ideal example for OpenACC acceleration because the data transfers become less important as the size of the matrices increases. Matrix multiply is a level-3 BLAS operation that performs O(N) work for every floating-point value transferred to the device. The effect of this high computational density can be seen in the following plot of wall clock time on a dedicated system as the problem size increases. Multiplying 1k by 1k square matrices results in a 1.7x speedup of matrix-acc.c over matrix-omp.c when running on an NVIDIA C2050 GPU compared with a 2.65 GHz quad-core Intel Xeon E5630 processor. Increasing the matrix sizes to 11k by 11k shows a 6.4x speedup over OpenMP. This empirically demonstrates the high work-per-datum runtime behavior of matrix multiply. Similar speedups can be expected for other high work-per-datum computations as well.

The Three Rules of Coprocessor Programming

Matrix multiply is an excellent teaching tool, but most real-world calculations do not exhibit such ideal behavior. Instead, the programmer must be creative and pay close attention to data transfers and to computational density on the OpenACC device(s). High performance can be achieved when the compute-intensive portions of the application conform to the following three rules of high-performance coprocessor programming. If they do not, expect application performance to be limited by either PCIe or device memory bandwidth.

Transfer the data across the PCIe bus onto the device and keep it there.

Give the device enough work to do.

Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks.

Using the Create Clause to Allocate on the OpenACC Device

It is easy to create the matrices on the OpenACC device and initialize them there. Creating and initializing data on the OpenACC device conforms to the first rule by avoiding data transfers entirely. The following example, matrix-acc-create.c, demonstrates the use of the create() clause in a kernels region.

