
Creating and Using Libraries with OpenACC

By Rob Farber, October 29, 2012

How to write reusable methods (libraries, subroutines, and functions) that can transparently call optimized CPU and GPU libraries using OpenACC pragmas.

Table 1 makes it easy to view the runtimes of simpleMult.c and fastSimpleMult.c on both the quad-core Xeon and the NVIDIA C2050 GPU. Note that the C2050 maintained high performance regardless of how the nested loops of the matrix multiplication are structured. These results strongly indicate that the GPU hardware is successfully using Thread-Level Parallelism (TLP) and hardware scheduling to hide latency and make the most efficient use of its computational hardware. In contrast, the rearranged loop structure strongly benefits the OpenMP code (providing an average 20x speed-up), which means the C2050 OpenACC version only runs 3x faster than the quad-core Xeon when multiplying 1000x1000 matrices.

It is appropriate to hypothesize that the performance robustness of the GPU against variations in nested loop structure is a generic feature of a TLP-based hardware execution model. TLP was designed to reorder computations to hide latency and increase computational efficiency. Succinctly, how to structure nested loops to most efficiently exploit hardware parallelism is an important question in computer science, and one that lies at the heart of high-performance parallel application design.

The process of pragma-based programming has been described as a "negotiation" with the compiler. Finding the right loop structure can be both difficult and rewarding, as even the simple square matrix multiplication example in this tutorial shows. The frustrating part is that it is never clear whether the best-performing solution has been found. For example, the OpenMP runtimes reported by simpleMult.c look reasonable, and there is no obvious performance concern because both codes show 100% utilization on all processor cores. Only through comparison against the timings reported earlier (in effect, regression testing) was a performance issue even identified. (Regression testing should be a core part of any software development project, no matter how simple the project!) Finally, a non-intuitive solution was found only after a human took the time to work out why one set of nested loops ran so much faster than another.
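The simpleMult.c and fastSimpleMult.c sources are not reproduced in this excerpt, but the kind of loop reordering being discussed can be sketched generically. In the sketch below (hypothetical function names, row-major storage assumed), the i-j-k ordering strides down a column of B in the inner loop, while the i-k-j ordering walks rows of B and C, which is far friendlier to the cache hierarchy that an OpenMP host code depends on:

```c
#include <string.h>

/* i-j-k order: the inner loop strides down a column of B (poor locality) */
void mult_ijk(int n, const float *A, const float *B, float *C)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      float sum = 0.0f;
      for (int k = 0; k < n; k++)
        sum += A[i*n + k] * B[k*n + j];
      C[i*n + j] = sum;
    }
}

/* i-k-j order: the inner loop walks rows of B and C (good locality) */
void mult_ikj(int n, const float *A, const float *B, float *C)
{
  memset(C, 0, (size_t)n * n * sizeof(float));
  for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++) {
      float a = A[i*n + k];
      for (int j = 0; j < n; j++)
        C[i*n + j] += a * B[k*n + j];
    }
}
```

Both orderings compute the same product; only the memory-access pattern differs, which is precisely the kind of difference a TLP-based GPU can hide and a cache-based host cannot.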

OpenACC programmers have a rather unique opportunity to investigate the performance of nested loops on both GPU and conventional SMP/vector hardware. The comparison can benefit application performance on both architectures. Bottom line: If the GPU results look too good, then re-examine the nested loops in the OpenMP code because this might be a sign that the GPU was able to reorder the computation to be more efficient. Alternatively, just run on the GPU.

Time will tell if OpenACC, which can utilize hardware execution models like GPU thread-level parallelism, will make this negotiation process with the compiler easier and more effective. Those interested in this discussion should also investigate applications that exhibit dynamic parallelism (where the parallel work varies by time, grid location, or some other factor), as these problems further complicate the challenges of finding the best performing loop structures.

Conditional Runtime Execution

Conditional runtime execution is possible in OpenACC by calling the acc_get_device_type() or acc_on_device() runtime library methods. The following source code for initSgemm() demonstrates how to initialize the NVIDIA CUBLAS library, but only when the source code is compiled with OpenACC and when the application runs on an NVIDIA device.
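The initSgemm() listing itself is not reproduced in this excerpt. A minimal sketch of such an initializer, assuming the legacy CUBLAS API (cublasInit()) and guarded so it compiles as a no-op without OpenACC, might look like this:

```c
#ifdef _OPENACC
#include <openacc.h>
#include <cublas.h>
#endif

/* Initialize CUBLAS, but only when compiled with OpenACC and
   running on an NVIDIA device. */
void initSgemm(void)
{
#ifdef _OPENACC
  if (acc_get_device_type() == acc_device_nvidia)
    cublasInit();   /* set up CUBLAS for subsequent sgemm() calls */
#endif
}
```

Note the two layers of conditioning: the preprocessor test on `_OPENACC` handles compile-time selection, while the `acc_get_device_type()` test handles runtime selection.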

Conditional runtime behavior based on the device type is necessary because OpenACC can run on a variety of device types. Currently, only the host processor and NVIDIA CUDA-enabled GPUs are supported. Section 4 of the OpenACC specification describes how the environment variable ACC_DEVICE_TYPE is used to specify the default device type at runtime. The type of the device can also be specified at compile time. The PGI compiler, for example, creates a host-only binary when "-ta=host" is specified on the command line. Similarly, specifying "-ta=host,nvidia" will create a fat binary (referred to as a "Unified Binary" by PGI) that can run on either the host processor or an NVIDIA GPU. Finally, the device type can be specified inside the application with acc_init() or acc_set_device_type(). It is expected that additional OpenACC device types will be added in the future.
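The third option, selecting the device from inside the application, is a one- or two-line affair. A hedged sketch (the function name is hypothetical; acc_set_device_type() and acc_init() are the OpenACC runtime routines named above):

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Force execution onto the host processor at runtime, overriding
   whatever ACC_DEVICE_TYPE would otherwise select. */
void useHostDevice(void)
{
#ifdef _OPENACC
  acc_set_device_type(acc_device_host);
  acc_init(acc_device_host);
#endif
}
```

Calling acc_init() explicitly also moves the one-time device startup cost out of the first timed kernel, which matters when benchmarking.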

Conditional runtime execution means developers can write reusable OpenACC code that will transparently call optimized libraries such as CUBLAS and CUFFT, or external CUDA and OpenCL methods. To support this capability, OpenACC provides two ways to determine the address of an OpenACC memory region that is present on the device so it can be passed to an external method or library:

The acc_malloc() function returns a pointer to memory allocated on the device. The pointer can be used in OpenACC code or passed to a method that expects a device pointer.

The OpenACC use_device clause. As of the OpenACC specification v1.0, this is the only valid clause that can be used with the host_data pragma construct. The use_device clause is not implemented by the PGI compilers as of the 12.6 release.

To complete the ability to interact with external code, the deviceptr(list) clause is used to declare that the pointers in list are device pointers. This clause lets OpenACC developers utilize memory created on a device with CUDA, OpenCL, or an optimized library.
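The two pieces fit together naturally: acc_malloc() produces a device address, and deviceptr() tells the compiler that address needs no data movement. A minimal sketch (hypothetical function name; compiles as a no-op without OpenACC):

```c
#include <stddef.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

#define N 1024

void scaleOnDevice(float alpha)
{
#ifdef _OPENACC
  /* acc_malloc() returns a device address usable by both OpenACC
     code and external CUDA libraries. */
  float *d_v = (float *) acc_malloc(N * sizeof(float));

  /* deviceptr(d_v) declares d_v is already a device pointer,
     so the compiler generates no copies for it. */
  #pragma acc parallel loop deviceptr(d_v)
  for (int i = 0; i < N; i++)
    d_v[i] = alpha * i;

  /* d_v could now be handed to a CUBLAS routine before freeing */
  acc_free(d_v);
#else
  (void)alpha;  /* no device path in a host-only build */
#endif
}
```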

The following code snippet from the doSgemm() method (which will be shown shortly) calls acc_malloc() to allocate device memory at a known location for the transposed At, Bt, and Ct matrices. These pointers contain the address of the matrix on the device, so they can be passed to the CUBLAS sgemm() method while still being accessible to OpenACC. Conditional runtime execution based on device type ensures that the CUBLAS sgemm() method is only called when the binary is utilizing an NVIDIA GPU.
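The doSgemm() snippet is not reproduced in this excerpt; the full matrix-lib.c listing appears at the end of the article. The sketch below shows the shape of the idea under stated assumptions: legacy CUBLAS API, row-major host matrices, and a plain host loop standing in for the ACML sgemm() call the article actually uses on the host path. The transpose steps are elided as comments:

```c
#include <stddef.h>
#ifdef _OPENACC
#include <openacc.h>
#include <cublas.h>
#endif

/* Sketch of a doSgemm-style method: C = A * B for n x n matrices. */
void doSgemm_sketch(int n, const float *A, const float *B, float *C)
{
#ifdef _OPENACC
  if (acc_get_device_type() == acc_device_nvidia) {
    size_t bytes = (size_t)n * n * sizeof(float);
    /* device-resident transposed copies at known device addresses */
    float *At = (float *) acc_malloc(bytes);
    float *Bt = (float *) acc_malloc(bytes);
    float *Ct = (float *) acc_malloc(bytes);
    /* ... transpose A, B into At, Bt with OpenACC loops using deviceptr() ... */
    /* the device addresses pass straight to CUBLAS */
    cublasSgemm('n', 'n', n, n, n, 1.0f, At, n, Bt, n, 0.0f, Ct, n);
    /* ... transpose Ct back into C ... */
    acc_free(At); acc_free(Bt); acc_free(Ct);
    return;
  }
#endif
  /* host path: the article calls ACML's sgemm(); a plain loop stands in here */
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      float sum = 0.0f;
      for (int k = 0; k < n; k++)
        sum += A[i*n + k] * B[k*n + j];
      C[i*n + j] = sum;
    }
}
```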

As will be seen in the complete source listing at the end of this article (matrix-lib.c), the doSgemm() method is structured so the ACML sgemm() method is called by default when the OpenACC binary is running on the host processor or the code has not compiled with OpenACC. It is likely that many production library codes will be structured in a similar manner.

Aside from the doSgemm() method, the remainder of the matrix-lib.c test code is straightforward and utilizes concepts already discussed. For timing purposes, separate doMult_acc() and doMult_omp() methods have been created. A checkMult() method has been provided to verify that each of the matrix multiplication methods produces a result that agrees with an optimized sgemm() call.

It is expected that some of the example methods will calculate slightly different results due to round-off errors. Floating-point arithmetic is only approximate, which means that even slight variations in the order in which floating-point operations are performed (say, when compiling with different optimization flags or running on different devices) can cause the same code to produce slightly different numerical results. For validation purposes, the checkMult() method calculates the Root-Mean-Square Deviation (RMSD) as a measure of similarity. The RMSD value will be small when the matrices are similar.
