Easy GPU Parallelism with OpenACC

By Rob Farber, June 11, 2012

An emerging standard uses pragmas to move parallel computations in C/C++ and Fortran to the GPU

This is the first in a series of articles by Rob Farber on OpenACC directives, which enable existing C/C++ and Fortran code to run with high performance on massively parallel devices such as GPUs. The magic in OpenACC lies in how it extends the familiar face of OpenMP pragma programming to encompass coprocessors. As a result, OpenACC opens the door to scalable, massively parallel GPU acceleration of millions of lines of legacy application code without requiring a new language such as CUDA or OpenCL, or forking the application source tree to support multiple languages.

OpenACC is a set of standardized, high-level pragmas that enables C/C++ and Fortran programmers to utilize massively parallel coprocessors with much of the convenience of OpenMP. A pragma is a form of code annotation that informs the compiler of something about the code. In this case, it identifies the succeeding block of code or structured loop as a good candidate for parallelization. OpenMP is a well-known and widely supported standard that defines pragmas programmers have used since 1997 to parallelize applications on shared memory multicore processors. The OpenACC standard has generated excitement because it preserves the familiarity of OpenMP code annotation while extending the execution model to encompass devices that reside in separate memory spaces. To support coprocessors, OpenACC pragmas annotate data placement and transfer as well as loop and block parallelism.

The success of GPU computing in recent years has motivated compiler vendors to extend the OpenMP shared memory pragma programming approach to coprocessors. Approved by the OpenACC standards committee in November 2011, the OpenACC version 1.0 standard creates a unified syntax and prevents a "Tower of Babel" proliferation of incompatible pragmas. Adoption by companies such as NVIDIA, PGI (The Portland Group), CAPS Enterprise, and Cray has been rapid.

Make Your Life Simple

Pragmas and high-level APIs are designed to provide software functionality. They hide many details of the underlying implementation to free a programmer's attention for other tasks. A colleague humorously refers to pragma-based programming as a negotiation between the developer and the compiler. Note that pragmas are informational statements provided by the programmer to assist the compiler. This means that pragmas are not subject to the same level of syntax, type, and sanity checking as the rest of the source code. The compiler is free to ignore any pragma for any reason, including: it does not support the pragma, syntax errors, code complexity, unresolved (or potentially unresolved) dependencies, edge cases where the compiler cannot guarantee that vectors or matrices do not overlap, use of pointers, and many others. Profiling tools and informational messages from the compiler about parallelization, or an inability to parallelize, are therefore essential to achieving high performance.

An OpenACC pragma in C/C++ can be identified by the string "#pragma acc", just as an OpenMP pragma can be identified by "#pragma omp". Similarly, Fortran OpenACC directives can be identified by "!$acc". All OpenACC (or OpenMP) pragmas must begin with these strings. Moreover, it is legal to mix OpenMP, OpenACC, and other pragmas in a single source file.

OpenACC pragmas in C/C++ are somewhat more concise than their Fortran counterparts as the compiler can determine a code block from the curly bracket "{}" notation. The OpenACC specification also requires that the _OPENACC preprocessor macro be defined when compiling OpenACC applications. This macro can be used for the conditional compilation of OpenACC code. The _OPENACC macro name will have a value yyyymm where yyyy is the year and mm is the month designation of the version of the OpenACC directives supported by the implementation.

Table 1 shows a list of OpenACC version 1.0 pragmas and clauses.

#pragma acc kernels (C/C++) / !$acc kernels (Fortran)
Clauses: if(), async(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr()

#pragma acc parallel (C/C++) / !$acc parallel (Fortran)
Clauses: if(), async(), num_gangs(), num_workers(), vector_length(), reduction(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr(), private(), firstprivate()

#pragma acc data (C/C++) / !$acc data (Fortran)
Clauses: if(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr()

#pragma acc loop (C/C++) / !$acc loop (Fortran)
Clauses: collapse(), gang(), worker(), vector() (arguments permitted only within kernels regions), seq, private(), reduction()

#pragma acc wait (C/C++) / !$acc wait (Fortran)
Clauses: none

Table 1. OpenACC version 1.0 pragmas and their clauses.

Two OpenACC environment variables, ACC_DEVICE_TYPE and ACC_DEVICE_NUM, can be set by the user:

ACC_DEVICE_TYPE: Controls the default device type to use when executing accelerator parallel and kernels regions, when the program has been compiled to use more than one different type of device.
The allowed values of this environment variable are implementation-defined.
Examples include ACC_DEVICE_TYPE=NVIDIA.

ACC_DEVICE_NUM: Specifies the default device number to use when executing accelerator regions. The value of this environment variable must be a nonnegative integer between zero and the number of devices of the desired type attached to the host.
If the value is zero, the implementation-defined default is used.
If the value is greater than the number of devices attached, the behavior is implementation-defined.
On multi-GPU systems, this variable can be used to avoid the TDR (Timeout Detection and Recovery) watchdog reset for long-running GPU applications by selecting a GPU that is not driving the display. (Consult your vendor's driver documentation to see how to modify the TDR timeout for your operating system.)

Vendor-specific information can be found on the NVIDIA, PGI, CAPS, and Cray websites.

Building, Running and Profiling a First Program

This tutorial uses The Portland Group (PGI) Accelerator C and Fortran compilers, release 12.5, with OpenACC support. PGI has been deeply involved in developing pragma-based programming for coprocessors since 2008, and it is a founding member of the OpenACC standards body. The PGI OpenACC compilers currently target NVIDIA GPUs, but it is important to note that OpenACC can support other coprocessors (such as AMD GPUs and Intel MIC) as well. More information about the PGI compilers is available on the company's website.
