IEEE Transactions on Visualization and Computer Graphics (TVCG)

The Spotlight Paper for the September 2013 issue

This figure shows five different benchmarks whose proximity queries are parallelized on CPUs and GPUs within our hybrid parallel framework. The different computations of these queries are distributed automatically by our LP-based scheduler, without any parameter tuning. Compared with a single hexa-core CPU running six CPU threads, our method achieves an order-of-magnitude performance improvement by using one additional hexa-core CPU and four different GPUs together with that CPU.

Abstract

We present a novel Linear Programming (LP) based scheduling algorithm that exploits heterogeneous multi-core architectures, such as CPUs and GPUs, to accelerate a wide variety of proximity queries. To represent the complicated performance relationships between heterogeneous architectures and the different computations of proximity queries, we propose a simple yet accurate model that estimates the expected running time of these computations. Based on this model, we formulate an optimization problem that minimizes the largest time spent on any computing resource, and propose a novel, iterative LP-based scheduling algorithm. Because our method is general, we can apply it to various proximity queries used in five applications with different characteristics. Our method achieves an order-of-magnitude performance improvement by using four different GPUs and two hexa-core CPUs over using a single hexa-core CPU. Unlike prior scheduling methods, our method continues to improve performance as more computing resources are added, and its advantage over prior methods grows as the heterogeneity of the computing resources increases. Moreover, for one of the tested applications, our method even outperforms a prior parallel method that was manually optimized for that application. We also show that our method achieves throughput close (e.g., 75%) to a conservative upper bound on the ideal throughput. These results demonstrate an efficiency and robustness of our algorithm that prior methods have not achieved. In addition, we integrate one of our contributions with a work-stealing method; our version of the work-stealing method achieves an 18% average performance improvement over the original, which shows the wide applicability of our approach.
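To make the optimization objective concrete, the following is a minimal, self-contained sketch of makespan minimization for divisible jobs on two heterogeneous devices (one CPU and one GPU). This is an illustration only, not the paper's formulation: the paper's scheduler handles an arbitrary number of resources via an iterative LP and a measured performance model, whereas in the two-device special case the LP optimum can be found greedily. All job timings below are made-up example numbers.

```python
def min_makespan(t_cpu, t_gpu):
    """Return the minimum makespan when each job j (taking t_cpu[j] seconds
    on the CPU or t_gpu[j] seconds on the GPU) may be split fractionally
    between the two devices."""
    n = len(t_cpu)
    # Start with every job fully on its faster device.
    on_cpu = [1.0 if t_cpu[j] <= t_gpu[j] else 0.0 for j in range(n)]
    cpu_load = sum(t_cpu[j] * on_cpu[j] for j in range(n))
    gpu_load = sum(t_gpu[j] * (1.0 - on_cpu[j]) for j in range(n))

    # Rebalance: move fractions of jobs off the overloaded device,
    # cheapest trade-off ratio first, until the two loads are equal
    # (the largest load is the quantity the LP objective minimizes).
    if cpu_load < gpu_load:
        # Jobs currently on the GPU, lowest CPU cost per GPU second freed first.
        movable = sorted((j for j in range(n) if on_cpu[j] == 0.0),
                         key=lambda j: t_cpu[j] / t_gpu[j])
        for j in movable:
            if gpu_load <= cpu_load:
                break
            # A fraction d of job j balances the loads when
            # cpu_load + t_cpu[j]*d == gpu_load - t_gpu[j]*d.
            d = min(1.0, (gpu_load - cpu_load) / (t_cpu[j] + t_gpu[j]))
            cpu_load += t_cpu[j] * d
            gpu_load -= t_gpu[j] * d
    else:
        movable = sorted((j for j in range(n) if on_cpu[j] == 1.0),
                         key=lambda j: t_gpu[j] / t_cpu[j])
        for j in movable:
            if cpu_load <= gpu_load:
                break
            d = min(1.0, (cpu_load - gpu_load) / (t_cpu[j] + t_gpu[j]))
            gpu_load += t_gpu[j] * d
            cpu_load -= t_cpu[j] * d
    return max(cpu_load, gpu_load)

# Three hypothetical proximity-query jobs; per-device running times in seconds.
t_cpu = [4.0, 2.0, 6.0]
t_gpu = [1.0, 3.0, 2.0]
print(min_makespan(t_cpu, t_gpu))  # balanced loads give a makespan of 2.75
```

Assigning every job to its faster device alone gives a makespan of 3.0 here; splitting one eighth of the third job onto the CPU balances both devices at 2.75, which is exactly the kind of win a makespan-minimizing scheduler obtains over a naive per-job assignment.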