UNLEASH LEGACY MPI CODES WITH KEPLER’S HYPER-Q

When we unveiled the Kepler architecture-based NVIDIA Tesla K20 GPU at the GPU Technology Conference in May, we said it would be the highest-performance processor the HPC industry has ever seen.

But recent performance tests on real-world scientific applications show that the forthcoming GPU easily surpasses even our highest expectations. This GPU simply rocks!

I’m particularly thrilled about Kepler’s new Hyper-Q feature, which helps increase performance for thousands of legacy MPI applications without requiring a major code rewrite.

To illustrate the power of Hyper-Q, I picked CP2K, a popular MPI-based molecular simulation code that is traditionally difficult to run well on GPUs. Hyper-Q maximizes GPU utilization for the CP2K application, resulting in more than double the performance compared to running the same code without it.

How Hyper-Q Works

A GPU consists of multiple CUDA cores grouped into streaming multiprocessors operating in parallel. A hardware unit called the CUDA Work Distributor (CWD) is responsible for assigning work to the individual multiprocessors.

In the current Fermi architecture, the CWD has a single connection to the host CPU and work from different MPI processes is merged into this single queue. This serialization could easily lead to false dependencies among work from different MPI processes, limiting the amount of work that can be executed concurrently on the GPU. This often results in an underutilized GPU.

Hyper-Q removes this limitation. As shown in the graphic, the new Kepler-based Tesla K20 GPU provides 32 work queues between the host and the GPU, enabling multiple MPI processes to run concurrently on the GPU. Each MPI process can be assigned to a different hardware work queue, maximizing GPU utilization and increasing overall performance.
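To make the difference between one queue and 32 queues concrete, here is a minimal toy model in Python. It is a sketch only: the function name `makespan`, the round-robin placement of ranks onto queues, and the assumption that the GPU always has room for one kernel from every queue are illustrative choices, not a description of how the hardware actually schedules work.

```python
from typing import Sequence


def makespan(work_per_process: Sequence[Sequence[float]], num_queues: int) -> float:
    """Return the time needed to drain all queues in a deliberately simple model.

    Assumptions (illustrative, not taken from the hardware):
      * kernels in the same queue run strictly one after another, even when
        they come from different processes (the worst-case false dependency);
      * different queues never wait on each other, and the GPU has enough
        free resources to run one kernel from every queue at once.
    """
    if num_queues < 1:
        raise ValueError("num_queues must be at least 1")
    queue_busy_time = [0.0] * num_queues
    for rank, kernel_times in enumerate(work_per_process):
        # Round-robin placement of MPI ranks onto queues (a hypothetical policy).
        queue_busy_time[rank % num_queues] += sum(kernel_times)
    # The GPU is done when the busiest queue is empty.
    return max(queue_busy_time)


# 16 MPI processes, each launching 4 small kernels of 1.0 time unit.
work = [[1.0] * 4 for _ in range(16)]
print(makespan(work, num_queues=1))   # single merged queue: 64.0
print(makespan(work, num_queues=32))  # 32 independent queues: 4.0
```

The 16x gap in this example is an upper bound under the stated assumptions, not a performance prediction. Real kernels compete for multiprocessors and memory bandwidth, so measured gains are smaller.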

While MPI developers will be thrilled with the added performance, they’ll be equally enamored with how Hyper-Q makes porting legacy MPI codes to the GPU significantly easier.

Legacy MPI-based codes were often created to run on multicore CPU systems, with the amount of work assigned to each MPI process scaled accordingly. However, this often meant that MPI processes didn’t generate enough work to fully occupy the GPU. To make the code launch enough work to fully utilize the GPU, developers frequently were required to modify their codes significantly.

Hyper-Q reduces recode efforts considerably because developers can now throw many MPI processes with small- and medium-size workloads at a shared GPU. Developers no longer need to modify their codes to put enough work into a single MPI process. Rather, they can send up to 32 MPI processes with variable workloads to the GPU and just let the GPU do all the heavy lifting to maximize performance.
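A back-of-the-envelope way to see why many small processes help: suppose, purely as an assumption for illustration, that the kernels of one MPI process can keep only a fixed fraction of the GPU busy. The hypothetical helper below estimates how much of the GPU is occupied when several such processes share it, capped by the 32 hardware queues. Real occupancy depends on kernel shapes, memory traffic and scheduling, so treat this as a sketch rather than a measurement.

```python
HARDWARE_QUEUES = 32  # number of work queues on the Tesla K20, as stated in this post


def estimated_utilization(fraction_per_process: float,
                          num_processes: int,
                          concurrent: bool) -> float:
    """Estimate the busy fraction of the GPU in a toy model.

    fraction_per_process: share of the GPU that one process's kernels can
                          fill (an assumed input, not a measured value).
    concurrent: True models overlapping processes (Hyper-Q style),
                False models a single queue serving one process at a time.
    """
    if not 0.0 <= fraction_per_process <= 1.0:
        raise ValueError("fraction_per_process must be between 0 and 1")
    if num_processes < 0:
        raise ValueError("num_processes must be non-negative")
    if concurrent:
        active = min(num_processes, HARDWARE_QUEUES)
    else:
        active = min(num_processes, 1)
    # Utilization cannot exceed the whole GPU.
    return min(1.0, fraction_per_process * active)


# Example: each process is assumed to fill 1/16 of the GPU, 16 processes per GPU.
print(estimated_utilization(1 / 16, 16, concurrent=False))  # 0.0625
print(estimated_utilization(1 / 16, 16, concurrent=True))   # 1.0
```

In this model the per-process workload never changes; only the number of processes that overlap on the GPU does, which is the point of the paragraph above.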

Case In Point: CP2K

CP2K is a widely used atomic and molecular simulation code that runs at many of the world’s supercomputing sites. CP2K is parallelized using MPI and OpenMP, and CUDA is used in some models where GPUs are targeted.

With Fermi-based GPUs, developers actually experienced reduced performance gains when MPI processes were limited to small amounts of work, particularly in strong scaling simulations. While the CPU was highly utilized, the GPU stayed completely inactive in substantial portions of the simulation.

The benchmark below shows the impact of Hyper-Q.

This small data set of 864 water molecules is usually problematic for GPUs. Without Hyper-Q, only one MPI process runs on each node with GPUs, and the performance curve from 1 to 16 nodes is not much better than with CPU-only simulations.

With Hyper-Q, it is now possible to use the same number of MPI processes per node as in the CPU-only case, which means 16 MPI processes per GPU in this instance. This unlocks the full benefit of the GPU, leading to a speedup of 2.5x with Hyper-Q enabled.

And the best part? No extra coding effort is necessary to enable Hyper-Q. All it takes is a Tesla K20 GPU with a CUDA 5 installation and setting an environment variable to let multiple MPI ranks share the GPU – Hyper-Q is then ready to use.

Be Prepared, Start Today

The Tesla K20 will be the first GPU to feature Hyper-Q. It’s scheduled to be available by the end of the year, but you can start preparing today.

Begin by accelerating your code using OpenACC. With OpenACC directives, developers simply insert compiler hints into the code and the compiler will automatically map compute-intensive portions of the code to the GPU. By using directives within MPI processes, you don’t need to worry about how much workload is created by the OpenACC compiler because Hyper-Q ensures the GPU stays as occupied as possible.

You can get a free, 30-day trial license for an OpenACC compiler on the NVIDIA website. If you currently work with MPI codes or if you have any questions about Hyper-Q, I’d love to hear from you.

rtfss

Will be salivating to see OptiX perf on this baby. Can you urge the team to post OptiX perf on GK110?

pmessmer

Thanks, rtfss. Keep those suggestions coming!

jipe4153

I’m particularly excited about the improved atomics performance. From what I gather, the GTX680 is seeing a 3-9 times improvement (speedup in already implemented GPU applications), while the GK110 might deliver even more!

hardbreaker

I am wondering whether this boost comes only from previous underutilization in conjunction with the new LLVM compiler and the added cores?
For my applications I would expect a linear increase because of increased core numbers and a little bit from LLVM. Because – in fact – the multiple streams still go through one hole: the PCI bus …

Austin Harris

There’s a lot of talk about using Hyper-Q with MPI, but how will the implementation change if you are using MPI + OpenMP? For example, what types of challenges would be associated with using a single GPU with 2 MPI tasks, each with 4 OpenMP threads?

Geetisudha39

Is it possible to get some example code for Hyper-Q to test on the GK110?

yurtesen

It would be nice if you could tell us how you managed to compile CP2K for use with the GPU.