Computational scientists and engineers have begun making use of many-core GPU architectures because they can provide significant gains in the overall performance of many numerical simulations at a relatively low cost. However, to the average computational scientist, GPUs usually employ a rather unfamiliar and specialized programming model that often requires advanced knowledge of their architecture. In addition, they typically have their own vendor- and platform-specific software development kits (SDKs), which differ from one another in significant ways. For example, Nvidia's GPUs use the CUDA SDK and AMD's GPUs use the Stream SDK, while traditional multi-core processors (from Intel, AMD, IBM) typically employ an OpenMP-based parallel programming model.

In 2009, Apple proposed an open standard, the Open Computing Language (OpenCL), to unify software development for all these different processor architectures, and all major multi-core processor and GPU vendors (Nvidia, AMD, IBM, Intel) have adopted it for their current and future hardware. OpenCL is of tremendous value to the scientific community because it is open, royalty-free, and vendor- and platform-neutral. It delivers a high degree of portability across all major forms of current and future compute hardware without significantly sacrificing performance.

In this project, we use OpenCL to harness the massive parallelism offered by many-core architectures like GPUs in order to perform high-resolution and long-duration black hole binary inspiral computations very efficiently. This plays a critical role in our EMRI Teukolsky Code's ability to achieve the high level of accuracy and efficiency that such simulations require.

Comparative performance using our EMRI Teukolsky Code:

Table 1 below shows the overall performance of our EMRI Teukolsky Code on several current-generation CPUs and GPUs, relative to a baseline system. These results suggest that it is relatively straightforward to obtain order-of-magnitude gains in overall code performance by using many-core GPUs instead of multi-core CPUs, and that this gain is largely independent of the specific hardware architecture and vendor. All the systems in these performance tests ran a variant of the Linux operating system and the OpenCL implementation provided by the appropriate vendor. Detailed specifications of the compute hardware are included in the table. The baseline system has dual AMD Opteron 6200, 8-core, 2.1 GHz CPUs.

It is also noteworthy that the consumer-grade AMD Radeon HD 7970 GPU outperforms Nvidia's HPC-oriented, high-end Fermi M2050 GPU at a significantly lower cost. The cost effectiveness of consumer-grade compute hardware is nearly an order of magnitude higher than that of the alternatives. This observation is consistent with our earlier findings from evaluating the Sony PlayStation 3 consumer gaming console for scientific computing: PS3 Gravity Grid.
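
As a rough illustration of how relative figures like those discussed above could be derived from raw timings, here is a minimal Python sketch. It assumes that relative performance means the baseline runtime divided by the device runtime, and that cost effectiveness means that ratio per unit of hardware cost, normalized to the baseline; the text does not spell out either definition, so both are assumptions. The function names and all numbers are hypothetical placeholders, not measurements from the table.

```python
def relative_performance(baseline_seconds, device_seconds):
    """Speedup over the baseline system; values above 1 mean faster than baseline.

    Assumed definition: baseline runtime divided by device runtime.
    """
    if baseline_seconds <= 0 or device_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / device_seconds


def cost_effectiveness(speedup, device_cost, baseline_cost):
    """Speedup per unit of cost, normalized so the baseline system scores 1.0.

    Assumed definition: speedup scaled by the ratio of baseline cost to device cost.
    """
    if device_cost <= 0 or baseline_cost <= 0:
        raise ValueError("costs must be positive")
    return speedup * baseline_cost / device_cost


# Placeholder inputs for illustration only; NOT taken from the performance table.
baseline_seconds = 1000.0   # hypothetical runtime on the baseline CPU system
device_seconds = 100.0      # hypothetical runtime on some GPU
baseline_cost = 1000.0      # hypothetical hardware cost, arbitrary units
device_cost = 500.0         # hypothetical hardware cost, same units

speedup = relative_performance(baseline_seconds, device_seconds)
value = cost_effectiveness(speedup, device_cost, baseline_cost)
print(f"relative performance: {speedup:.1f}x, cost effectiveness: {value:.1f}x")
```

Normalizing both quantities to the baseline keeps them dimensionless, so devices from different vendors can be compared on the same scale regardless of the units used for time or cost.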