GTC 2013: ARM + GPU = GPU'riffic, says Barcelona SCC

In this special guest feature, Dan Olds from Gabriel Consulting writes that the Barcelona Supercomputing Center is making a big bet on ARM processing for HPC.

Over the last few years, we’ve seen a steadily growing buzz surrounding the use of ARM chips in PCs, servers, and supercomputers. Here at GTC 2013, that buzz is even more pronounced due to NVIDIA’s upcoming Project Denver, and advances in their GPU technology that result in even less dependency on having a fast and powerful (read: Xeon) processor feeding the GPU number-crunching beasts. Our pal Rik Myslewski penned a great article on GTC 2013 ARM chatter here.

While most everyone has been debating and speculating about what it would be like to combine ARM processors and GPU accelerators, one organization has put together some hardware in order to separate the theoretical from the real. The Barcelona Supercomputing Center (from the Barcelona in Spain, not the other one) is building clusters to explore the potential advantages of combining low-power ARM processors with fast number-crunching GPUs.

Their first attempt, Tibidabo, was a proof of concept to determine whether it’s possible to build an all-ARM-based cluster. Could they really put together a cluster based on cell phone processors? And, if they could build it, could they find or adapt enough software for it to do useful work?

They were able to construct a two-rack cluster containing 32 blades, 256 nodes, and a total of 512 Tegra 2 ARM cores. They were able to port 11 scientific apps over to ARM with little difficulty, although they did need to fiddle around with the memory hierarchy to optimize some of the apps.

The performance wasn’t all that great. The total system delivered 512 GFLOPs while consuming 3.4 kW, yielding 0.15 GFLOPs/watt. For context, the best systems on the most recent Green500 list come in around 2.4 or 2.5 GFLOPs/watt, while the systems at the end of the list are rated at 0.033 GFLOPs/watt.
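For readers who want to check the arithmetic, the efficiency figure is just sustained performance divided by power draw. A minimal sketch (the function name is ours, not BSC's):

```python
def gflops_per_watt(gflops, watts):
    """Energy efficiency: sustained GFLOPs divided by power draw in watts."""
    return gflops / watts

# Tibidabo: 512 GFLOPs at 3.4 kW (3,400 W)
print(round(gflops_per_watt(512, 3400), 3))  # 0.151
```

The same division applied to the Green500 figures quoted above shows just how far a Tegra 2 cluster sat from the front of that list.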

So they went back to the drawing board and, using NVIDIA’s CARMA development box, clustered 16 of them together as a learning experience they called Pedraforca v1. This system did much better than the ARM-only Tibidabo on energy efficiency, yielding 0.78 GFLOPs/watt on DGEMM and 5.04 on SGEMM (double- and single-precision matrix multiply, respectively), so they were making progress.

Limitations in the platform (a PCIe ceiling of 400 MB/s, plus the inability to overlap computation and data transfers) meant it couldn’t be scaled up very well. However, it did lead to a breakthrough in their thinking for their next system, which they’ve dubbed Pedraforca v2.

They’ve decided the key to building a highly efficient system isn’t to build an accelerated cluster but to build a cluster of accelerators. While there isn’t much difference in the words, there’s a world of difference between the meanings. With Pedraforca v2, they will be decoupling the CPUs from the GPUs, meaning that the CPU-to-GPU ratio can be changed to fit the workload. They will also be using direct GPU-to-GPU data transfers via Mellanox’s ConnectX-3 InfiniBand interconnects.

This will take a huge amount of latency out of the system and, accordingly, reduce the amount of work the CPU needs to do to orchestrate GPU communications. The prototype system will have 64 nodes, each using a quad-core Tegra 3 CPU that slides into a PCIe x4 slot on a Mini-ITX carrier. In this configuration, the CPU will only be managing boot and MPI communications, plus minimal traffic-cop duty for the GPUs. The point is that you don’t need a hugely fast and powerful processor to fulfill these requirements.

However, Pedraforca v2 will have some processing power in the form of Kepler-based NVIDIA K20 GPUs that can deliver 1,170 GFLOP/s through a PCIe Gen3 slot. The GPUs will be able to communicate with each other at 40 Gb/s via the aforementioned Mellanox-fueled InfiniBand interconnect.
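Back-of-the-envelope arithmetic shows why the new interconnect matters. A rough sketch comparing the CARMA platform's 400 MB/s PCIe ceiling with a 40 Gb/s InfiniBand link (assumption: 40 Gb/s treated as a flat 5 GB/s, ignoring encoding and protocol overhead, which would shave the real figure down somewhat):

```python
# Bandwidth limits mentioned in the article, in GB/s
carma_pcie_gbs = 0.4     # CARMA's PCIe ceiling, ~400 MB/s
qdr_ib_gbs = 40 / 8      # 40 Gb/s InfiniBand ~ 5 GB/s (overhead ignored)

def transfer_seconds(gigabytes, link_gbs):
    """Bandwidth-only time to move a payload over a link (no latency term)."""
    return gigabytes / link_gbs

payload_gb = 1.0  # hypothetical 1 GB working set
print(round(transfer_seconds(payload_gb, carma_pcie_gbs), 2))  # 2.5
print(round(transfer_seconds(payload_gb, qdr_ib_gbs), 2))      # 0.2
```

Even under these simplified assumptions, the GPU-to-GPU path is roughly an order of magnitude faster than what Pedraforca v1 had to live with.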

Both presenters pointed out that this isn’t a general purpose HPC system – it is intended as a host for apps that are GPU-optimized. While they didn’t discuss any FLOPs/watt estimates or performance predictions, it’s safe to say that this system should be an eye opener when it comes to energy efficiency and even cost per FLOP. It’s definitely a project worth watching.
