The OpenACC Execution Model

In this second part of the introduction to OpenACC, the OpenMP-style, directive-based standard for GPU programming, the execution model is explained and samples are benchmarked against straight OpenMP parallelism.

The following code copies the Q matrix onto the GPU, where it is transposed in parallel into the Qt array. The cost of the transpose should be minimal compared with the 2MN² operations (where M and N specify the number of columns and rows of the matrix) of the classic Gram-Schmidt algorithm. Experimentation also determined that changing the vector_length to 128 provided a slight performance increase on an NVIDIA C2070.

The transposed code is noticeably faster in all three implementations (serial, OpenMP, and OpenACC), which highlights how optimizations made for OpenACC can benefit other implementations as well.

1000x1000          Time (sec)   Time with Transpose (sec)
OpenACC (v2)       1.44388      0.227207
OpenACC            1.66707      n/a
OpenMP (4 cores)   2.92685      0.75971
Serial             9.22192      2.39342

Table 3: Improved results from better memory system utilization

The NVIDIA Visual Profiler analysis shows that memory utilization became more efficient, improving from 3.1% load/12.5% store to 40.5% load/12.5% store.

Per the second rule of high-performance accelerator programming, kernel startup latency appears to be the gating performance limitation on a 1000x1000 matrix, as can be seen in the nvvp timeline below. Note the gaps: total utilization across all kernels is only 38.5%, and the two transpositions take up just 0.4% of the time.

Figure 7: Timeline showing inefficiency due to insufficient work.

The following timeline for a 5000x5000 matrix shows that the additional work results in much higher accelerator efficiency. In this case, the total kernel utilization is 92.3%. Even though the matrix is much larger, the two transpose operations still take a minimal amount of time.

Figure 8: Timeline showing better utilization with larger matrices.

The runtime and speedup compared with the OpenMP version for a 5000x5000 matrix are as follows:

5000x5000          Time (sec)
OpenACC            11.0075
OpenMP (4 cores)   87.3252
OpenACC speedup    7.9x

Table 4: Improved results even with the overhead of two matrix transposes.

Eliminating the use of double precision reduces the runtime further, to 10.4265 seconds on a C2070. It will be interesting to see how the upcoming NVIDIA Kepler K10 and K20 chips accelerate these runtimes, both with single precision and with hybrid double precision.

Conclusion

The OpenACC execution model lets the programmer exploit both the massive parallelism of the accelerator device(s) and the capabilities of the latest generation of sequential processors. In this way, the full potential of the host and accelerator hardware can be exploited. When required, the OpenACC gang, worker, and vector clauses can be utilized to tune an application to a particular hardware configuration.

OpenACC and OpenMP implementations of the Classic Gram-Schmidt (CGS) method provided in this tutorial demonstrated how easy it is to work with OpenACC parallel regions, even when the amount of work per loop can vary. Parallel regions are useful because they let programmers annotate code in a style that is conceptually very similar to OpenMP. Kernel regions allow the compiler to automatically generate CUDA-style kernels, which gives advanced programmers the ability to express legal CUDA kernel launch configurations using portable directive-based OpenACC syntax. Conversations with PGI indicate that the ability to extract maximum performance from GPUs using parallel region annotation is still to be demonstrated, which bodes well for the future.

Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.
