Current publication:

In work that can be considered a continuation of the architecture-specific
optimization analysis of GEM, Konstantinos Krommydas and I evaluate
programmability-performance tradeoffs across three architectures: an Intel
CPU, an Intel Xeon Phi, and an NVIDIA Kepler GPU. Some of the results were
surprising, not least that the fully optimized GPU core code ended up more
readable than the highly optimized CPU code.

Abstract:

General-purpose computing on an ever-broadening array of parallel devices has
led to an increasingly complex and multi-dimensional landscape with respect to
programmability and performance optimization. The growing diversity of
parallel architectures presents many challenges to the domain scientist,
including device selection, programming model, and level of investment in
optimization. All of these choices influence the balance between
programmability and performance.
In this paper, we characterize the performance achievable across a range of
optimizations, along with their programmability, for multi- and many-core
platforms – specifically, an Intel Sandy Bridge CPU, Intel Xeon Phi
co-processor, and NVIDIA Kepler K20 GPU – in the context of an n-body,
molecular-modeling application called GEM. Our systematic approach to
optimization delivers implementations with speed-ups of 194.98×, 885.18×, and
1020.88× on the CPU, Xeon Phi, and GPU, respectively, over the naïve serial
version. Beyond the speed-ups, we characterize the incremental optimization of
the code from naïve serial to fully hand-tuned on each platform through four
distinct phases of increasing complexity to expose the strengths and
weaknesses of the programming models offered by each platform.