Abstract

Recently, general-purpose computation on graphics processing units (GPGPU) has become an increasingly
popular field of study as graphics processing units (GPUs) continue to be proposed as high performance
and relatively low cost implementation platforms for scientific computing applications. Among these
applications figure astrophysical N-bodysimulations, which form one of the most challenging problems
in computational science. However, in most reported studies, a simple
\( \mathcal{O}(N^{2})\) algorithm
was used for GPGPUs, and the resulting performances were not observed to be better than those of conventional
CPUs that were based on more optimized
\( \mathcal{O}(N \log N)\) algorithms
such as the tree algorithm or the particle-particle particle-mesh algorithm. Because of the difficulty
in getting efficient implementations of such algorithms on GPUs, a GPU cluster had no practical
advantage over general-purpose PC clusters for N-bodysimulations. In this paper, we report a new
method for efficient parallel implementation of the tree algorithm on GPUs. Our novel tree code allows
the realization of an N-bodysimulation on a GPU cluster at a much higher performance than
that on general PC clusters. We practically performed a cosmological simulation with 562 million
particles on a GPU cluster using 128 NVIDIA GeForce 8800GTS GPUs at an overall cost of 168172 $.
We obtained a sustained performance of 20.1 Tflops, which when normalized against a general-purpose
CPU implementation leads to a performance of 8.50 Tflops. The achieved cost/performance was
hence a mere $19.8 /Gflops which shows the high competitiveness of GPGPUs.