Enabling Task Parallelism in the CUDA Scheduler

Executive Summary

General-purpose computing on Graphics Processing Units (GPUs) introduces the challenge of scheduling independent tasks on devices designed for data-parallel or SPMD applications. This paper proposes an issue queue that merges workloads that would individually underutilize GPU processing resources so that they can run concurrently on an NVIDIA GPU. Using kernels from microbenchmarks and two applications, the authors show that throughput increases in every case where a single kernel would have underused the GPU. An exception is memory-bound kernels, seen in a Nearest Neighbor application; even in that case, merged execution still outperforms serial execution of the same kernels by 12-20%.
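To make the merging idea concrete, the following is a minimal sketch (not the paper's implementation) of how two independent, small kernels can be fused into a single launch, with thread blocks partitioned between the two original workloads by block index. All names (`vecAddBody`, `vecScaleBody`, `mergedKernel`, `blocksA`) are illustrative assumptions, not identifiers from the paper.

```cuda
// Hypothetical sketch of kernel merging: two independent workloads,
// neither of which fills the GPU on its own, are fused into one
// kernel launch and partitioned by blockIdx.x.

__device__ void vecAddBody(const float *a, const float *b, float *c,
                           int n, int tid) {
    if (tid < n) c[tid] = a[tid] + b[tid];   // workload A: vector add
}

__device__ void vecScaleBody(const float *x, float *y, float s,
                             int n, int tid) {
    if (tid < n) y[tid] = s * x[tid];        // workload B: vector scale
}

// The first `blocksA` blocks execute workload A; the remaining
// blocks execute workload B, each with its own flat thread index.
__global__ void mergedKernel(const float *a, const float *b, float *c, int nA,
                             const float *x, float *y, float s, int nB,
                             int blocksA) {
    if (blockIdx.x < blocksA) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        vecAddBody(a, b, c, nA, tid);
    } else {
        int tid = (blockIdx.x - blocksA) * blockDim.x + threadIdx.x;
        vecScaleBody(x, y, s, nB, tid);
    }
}
```

This launch-time fusion matters because GPUs of the paper's era (pre-Fermi) could execute only one kernel at a time, so two half-sized kernels launched separately would each leave multiprocessors idle; merging them recovers that capacity. On later hardware, CUDA streams provide concurrent kernel execution natively, but the block-partitioning pattern above remains a useful way to co-schedule work within a single launch.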