9.18.2012

IBM’s Power 775 wins recent HPC Challenge

Starting out as a
government project 10 years ago, IBM Research’s high performance computing
project, PERCS (pronounced “perks”), led to one of the world’s most
powerful supercomputers, the Power 775. This July, the Power 775 continued to
prove its power by earning the top spot in a series of benchmark components of the HPC Challenge suite.

IBM Research
scientist Ram Rajamony, who was the chief performance architect for the Power
775, talks about how the system beat this HPC Challenge.

How did PERCS become
the Power 775?

Ram Rajamony: In 2002, DARPA (U.S.
Defense Advanced Research Projects Agency) put out a call for the creation of
commercially viable high performance computing systems that would also be highly productive.

Our response was named PERCS – Productive
Easy-to-use Reliable Computing System. From the start, our goal was to combine
ease-of-use and significantly higher efficiencies, compared to the
state-of-the-art at the time (Japan’s
Earth Simulator was the top-ranked supercomputer that year with a peak
speed of 41 TFLOPS).

After four years of research, the third phase of the DARPA
project, which started in 2006, resulted in today's IBM Power 775.

What does PERCS contribute that makes the Power 775 unique?

RR: It’s all in the
software and hardware magic we put into the system!

PERCS chip design

PERCS
blazed the trail for a whole set of new technologies in the industry. We
produced the first 8-core, 4-way simultaneous multi-threaded (SMT) processor:
the POWER7 chip.

The compute workhorse is the 3.84
GHz POWER7 processor. We house four of these in a ceramic substrate to
create a compute monster that has a peak performance of 982 GFLOPS; a peak
memory bandwidth of 512 GB/s; and a peak bi-directional interconnect bandwidth
of 192 GB/s. These advances resulted directly from the PERCS program.

Then, we coupled each set of four POWER7 chips with an interconnect Hub
chip, codenamed Torrent, which in turn connects to other Hub chips through
47 copper and optical links and moves data over these links at more than
8 Tbps. (No typo here. That is indeed eight terabits per second!)

Cool features abound, but one in particular is how the Hub
chip can translate program variable addresses in incoming packets into physical
memory addresses. When used in conjunction with a special arithmetic logic unit
in the POWER7 memory controllers, we get amazingly fast atomic operations.

But it’s not just about the hardware. Through PERCS we
added numerous innovations in areas such as the operating system, compilers,
systems management tools, programmer aids, and debuggers. We even have a new
language called X10 that developers can use.

What is the HPC
Challenge, compared to the Top500, Graph500, and others?

Fast Fourier Transform

The FFT is an efficient algorithm for computing the discrete Fourier
transform of a signal, converting it from one domain, such as the time
domain, to another, such as the frequency domain. FFTs are the backbone of
signal processing and are used in a wide variety of areas, such as music,
medicine, and astronomy.
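As a toy illustration (this sketch is mine, not from the article), the classic radix-2 Cooley-Tukey FFT can be written in a few lines of Python. Feeding it a sine wave that completes exactly one cycle per window puts all the energy in one frequency bin:

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])          # FFT of even-indexed samples
    odd = fft(x[1::2])           # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

# An 8-sample sine at one cycle per window: the energy lands in bin 1
# (and its mirror, bin n-1), identifying the signal's frequency.
n = 8
signal = [math.sin(2 * math.pi * k / n) for k in range(n)]
mags = [abs(c) for c in fft(signal)]
print(max(range(n), key=lambda k: mags[k]))  # 1 (the dominant frequency bin)
```

The recursion is what makes the FFT fast: it splits an n-point transform into two n/2-point transforms, giving O(n log n) work instead of the O(n²) of a direct discrete Fourier transform.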

RR: The HPC Challenge suite was constructed to stress
different parts of a system, such as compute, memory bandwidth, and communication
capability. It also contains components such as the FFT, which is difficult to run at high efficiency on computing systems
but is often indicative of how entire classes of workloads will
perform.

The HPC Challenge gives you a nice fingerprint of your
system’s performance across numerous dimensions that show how a system may
perform on a real-world workload.

For comparison, the Top500 rankings order systems based on
their FLOP rate when computing the Linpack Benchmark. These rankings are biased
towards indicating only a system’s compute capability. The newer Graph500
benchmark measures how fast you can traverse a graph and compute metrics
similar to the Bacon
number over a social network.

How do GUPS and MFLOPS differ?

RR: Giga-Updates per Second (GUPS) and MegaFLOPS (MFLOPS)
are as different as apples and oranges. (Actually, I should rephrase that,
because recent research has shown that apples
and oranges are indeed very much alike, calling into question the validity
of that analogy.)

MFLOPS measures the compute characteristic of a system: the number
of floating-point operations, in millions, that it can execute every
second. Systems have a peak FLOP rating as well as a FLOP rating when executing
various workloads, such as the Top500's Linpack.
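As a back-of-the-envelope sketch (my illustration, not from the article), a sustained FLOP rate can be estimated by timing a known number of floating-point operations. Pure-Python overhead dominates here, so the figure grossly understates what the hardware can do; real benchmarks like Linpack use tuned kernels:

```python
import time

def estimate_mflops(n=1_000_000):
    """Time n multiply-add pairs and report millions of FLOPs per second."""
    x = 1.0
    start = time.perf_counter()
    for _ in range(n):
        x = x * 1.0000001 + 1e-9     # one multiply + one add = 2 FLOPs
    elapsed = time.perf_counter() - start
    return (2 * n) / elapsed / 1e6   # millions of FLOPs per second

print(f"~{estimate_mflops():.0f} MFLOPS (interpreter-bound)")
```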

GUPS measures the rate at which the system can perform random
updates to a large set of values distributed across the memory of the
system. The idea is to find out how well a system handles a workload that
requires extremely fine-grained communication with no locality. The lack of
locality in this context means that consecutive operations in time
are directed at values stored in very different places. The GUPS workload has traditionally
been brutal on systems, but it is representative of workloads that simply don't have
the locality characteristics that machines are optimized to handle well.
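The random-update pattern behind GUPS can be sketched in a few lines (a toy stand-in for the HPC Challenge RandomAccess kernel, not the official benchmark):

```python
import random
import time

def random_access(table_size=1 << 16, n_updates=1 << 18):
    """XOR random 64-bit values into random slots of a table and
    return the achieved update rate (GUPS = this rate / 1e9)."""
    table = [0] * table_size
    rng = random.Random(42)
    start = time.perf_counter()
    for _ in range(n_updates):
        v = rng.getrandbits(64)
        table[v % table_size] ^= v   # no locality: every update hits a random slot
    elapsed = time.perf_counter() - start
    return n_updates / elapsed       # updates per second

print(f"{random_access():.2e} updates/s")
```

Because each update targets an effectively random address, caches and prefetchers get no help; on a distributed-memory machine, most updates also cross the interconnect, which is why GUPS rewards exactly the fine-grained communication the Power 775's Hub chip was built for.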

What did your team do
that put the Power 775 in the #1 position on the HPC Challenge?

RR: Well, we
began many years ago with the goal of disrupting the status quo of
interconnect-intensive workloads. Many of our performance metrics show linear scaling as the system size
increases. In other words, for workloads like GUPS, PTRANS (a measure
of the interconnect bisection bandwidth), and FFT (a workload that
stresses all three elements: compute, memory, and interconnect), system performance
increases linearly with the addition of hardware.

This is unheard of in a typical system. In that sense, the Power
775 has been extremely disruptive as evidenced by the large margins by which we
have taken over the number one position in the HPC Challenge results.

What does being #1 on
this list mean for the Power 775’s capabilities?

RR: People have
always grappled with how to structure large-scale computing systems. If you
look at HPC systems in existence today, there is a spectrum of solutions with
different compute and interconnect characteristics. Each of these solutions
works well for the particular problem that it is used to solve.

The advantage of the Power 775 is that it is a general-purpose
system. It has a completely homogeneous compute component, which leads
to a simple mental model of how the system works. The communication prowess of
the system is forgiving of how programmers write their programs, making it easy
to get high performance out of programs on the Power 775.

And while the system is
suited for general-purpose high-performance computing, it shines
especially on workloads that need more interconnect performance and capability.

Quoting from the definitive source material: Individual "links provide over 9 Tbits/second of raw bandwidth..."[1]. The aggregate bandwidth is obviously a lot higher given the high-dimension mesh interconnect in PERCS/POWER 775 systems.

[1] "The PERCS High-Performance Interconnect", Arimilli et al., in the Proceedings of the 2010 IEEE Symposium on High Performance Interconnects. See http://www.unixer.de/publications/img/ibm-percs-network.pdf