Package:

OpenACC provides a high-productivity API for programming GPUs and similar accelerator devices. One of the last steps in tuning OpenACC programs is selecting values for the num_gangs and vector_length clauses, which control how a parallel workload is distributed across an accelerator’s processing units. In this paper, we present OptACC, an autotuner that assists the programmer in selecting high-quality values for these parameters, and we evaluate the effectiveness of two direct search methods in finding solutions. We assess the quality of the num_gangs and vector_length values found by our autotuner by comparing them to the values found by a bounded exhaustive search; we also compare the tuned kernel execution times to those of the untuned kernel. On a suite of 36 OpenACC kernels, one or both of our autotuner’s direct search methods identified values within the top 5% for 29 of the kernels, within the top 10% for five kernels, and within the top 25% for the remaining two. Eleven of the kernels achieved a speedup greater than 2x over the compiler’s defaults, and the autotuner required only 7 to 11 runs of the target program, on average.
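For readers unfamiliar with the two clauses being tuned, the following is a minimal sketch of how num_gangs and vector_length might appear on an OpenACC parallel loop in C. The kernel (saxpy), the macro names NUM_GANGS and VECTOR_LENGTH, and their default values are illustrative assumptions and are not taken from the paper or from OptACC; the sketch only shows where such parameters sit in the source, so that a tuner could vary them between runs (for example by redefining the macros at compile time).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tunable parameters. The defaults are placeholders, not
   recommended values; they can be overridden at compile time, e.g. with
   -DNUM_GANGS=512 -DVECTOR_LENGTH=64. */
#ifndef NUM_GANGS
#define NUM_GANGS 256
#endif
#ifndef VECTOR_LENGTH
#define VECTOR_LENGTH 128
#endif

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
   With an OpenACC compiler the loop is offloaded using the requested
   number of gangs and vector length; without OpenACC support the pragma
   is ignored and the loop runs sequentially on the host. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    #pragma acc parallel loop num_gangs(NUM_GANGS) vector_length(VECTOR_LENGTH) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Exposing the two values as compile-time macros is only one possible design; it keeps the kernel source unchanged while each candidate configuration is rebuilt and timed.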