What should replace Linpack for ranking supercomputers?

ISC 2013 The Linpack Fortran benchmark, which has been used to gauge the relative performance of workstations and supercomputers for decades, is looking a little long in the tooth. So, some of the people who love Linpack and know it best are proposing a new benchmark - with the mind-numbing name of High Performance Conjugate Gradient, or HPCG.

All system benchmark tests run their course, and the most successful ones usually overstay their welcome and stop doing what they are designed to do: allow for the comparative ranking of systems in terms of performance across different architectures and software stacks in a meaningful way, and - if we are lucky - provide some insight into bang versus buck so that companies can weigh the relative economic merits of systems that can run a particular application.

Tests run on supercomputers mostly do the former, and rarely do the latter. That is a big problem that Michael Heroux, of Sandia National Laboratories, and Jack Dongarra, of both the University of Tennessee and Oak Ridge National Laboratory, tackled in their HPCG benchmark proposal paper, which you can read here (PDF).

Earlier this week, your friendly neighborhood El Reg server hack defended the Linpack test as the Top500 supercomputer rankings for June 2013 were announced at the International Supercomputing Conference (ISC) in Leipzig, Germany. Linpack is important because it is fun to see the rankings twice a year, and it is an easy enough test to run that people actually do it with some enthusiasm on their machines.

Erich Strohmaier, one of the administrators of the Top500 who hails from Lawrence Berkeley National Laboratory, tells El Reg that there were around 800 submissions of Linpack reports for the June Top500 list, and a bunch of them are tossed out because there is something funky in them. Making it to the Top500 list is a big deal for a lot of supercomputer labs, and every national and district politician around the world loves to see a new machine come online in their sphere of influence for the photo-op to prove they are doing their jobs to protect our future. And moving down in the rankings is also a big deal, because if you are not moving ahead you are falling behind. Jysoo Lee, director of the KISTI supercomputing center in Korea, found this out when Korea's relative rankings slipped in the June list. You get phone calls from people in power who are not pleased.

So Linpack matters and it doesn't matter at the same time. It is a bit like miles per gallon fuel efficiency ratings for cars, or if you live in New York, letter grade ratings for restaurants that the city mandates.

But when it comes to using Linpack as a relative performance metric, there are some issues. If you really want to hurt your head, delve into the HPCG paper put out by Heroux and Dongarra. The basic gist, as this server hack understands it, is that the kinds of codes that were initially deployed on parallel clusters fifteen years ago bore a closer relationship to High Performance Linpack, or HPL, than today's codes do in all cases. (HPL is the parallel implementation of Linpack; earlier versions could only run on a single machine like a PC at the low end or a federated SMP server with shared memory or a vector processor at the high end.) This is one of the reasons why the University of Illinois has refused to run Linpack on the "Blue Waters" ceepie-geepie built by Cray. (Remember, Linpack is voluntary. It is not a ranking of the Top500 supercomputers as much as it is a ranking of the Top500 supercomputers from organizations that want to brag.)

Here's the central problem as Heroux and Dongarra see it:

"At the same time HPL rankings of computer systems are no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by differential equations, which tend to have much stronger needs for high bandwidth and low latency, and tend to access data using irregular patterns. In fact, we have reached a point where designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system."

There is a lot of deep technical description in the paper, but here is the really simplified version. Linpack solves a dense system of linear equations - a calculation dominated by matrix multiplication - and HPL scales up the size of the arrays to try to choke the machine and force it to reach its peak capacity across all of the computing elements in the cluster. The original Linpack from the dawn of time solved a dense 100 x 100 system of linear equations, then it moved to 1000 x 1000 as machines got more powerful, and then the Linpackers took the hood off, let the linear equation count scale, and tweaked the HPL code to run across clusters and take advantage of modern networks and interconnects as well as coprocessors like Intel Xeon Phi and Nvidia Tesla cards.
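For the non-HPC crowd, here is a toy sketch of what HPL actually measures - solve a dense system of equations, then credit the machine with the standard HPL operation count of 2/3*n^3 + 2*n^2 flops divided by the wall time. This is plain Python and nothing like the real blocked, pivoted, cluster-wide implementation, but the idea is the same:

```python
# A toy sketch of the HPL idea: solve a dense system A x = b by Gaussian
# elimination, and credit the machine with the standard HPL operation
# count of 2/3*n^3 + 2*n^2 floating-point operations for an n x n solve.
# (Real HPL uses blocked LU factorization distributed across a cluster.)

def solve_dense(a, b):
    """Gaussian elimination with partial pivoting on an n x n system."""
    n = len(b)
    a = [row[:] for row in a]   # work on copies
    b = b[:]
    for k in range(n):
        # Pivot: swap in the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

def hpl_flops(n):
    """The operation count HPL credits for an n x n dense solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

# The original fixed problem sizes: 100 x 100, then 1000 x 1000.
print(hpl_flops(100))   # ~686,666 operations
print(hpl_flops(1000))  # ~668.7 million operations
```

Time a big enough solve, divide the operation count by the seconds, and you have your flops rating - which is why the test rewards machines that are very good at regular, dense number-crunching.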

The kind of calculations and data access patterns in the Linpack test are what Heroux and Dongarra refer to as Type 1. The machines are loaded up with lots of floating point processing and the data can be organized in such a way as to be particularly efficient at getting the right data to the right FP unit at the right time most of the time. With Type 2 patterns - which are more reflective of the kinds of differential equations increasingly used in simulations - the data access patterns are less regular and the calculations are finer-grained and recursive. Those shiny new Xeon Phi and Tesla accelerators are really good at Type 1 calculations and not so hot (yet) on Type 2 calculations.
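To see why Type 2 patterns hurt, here is a minimal sketch - in plain Python, and not the actual HPCG code - of the kernel at the heart of conjugate gradient solvers: a sparse matrix-vector product that gathers data through index arrays instead of marching through dense ones, which stresses memory bandwidth and latency rather than raw floating-point oomph:

```python
# A sketch of a "Type 2" kernel: the sparse matrix-vector product at the
# heart of conjugate gradient solvers (the CG in HPCG). The matrix is held
# in compressed sparse row (CSR) form, so every multiply chases indices
# through memory - the irregular access pattern that stresses bandwidth
# and latency rather than peak floating-point throughput.

def spmv(vals, cols, rowptr, x):
    """y = A @ x for a CSR matrix: indirect loads via cols[] each step."""
    y = []
    for i in range(len(rowptr) - 1):
        s = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            s += vals[k] * x[cols[k]]   # gather: x is read out of order
        y.append(s)
    return y

def conjugate_gradient(vals, cols, rowptr, b, iters=50):
    """Textbook CG for a symmetric positive definite CSR matrix."""
    n = len(b)
    x = [0.0] * n
    r = b[:]             # residual b - A x, with x = 0 to start
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        ap = spmv(vals, cols, rowptr, p)
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < 1e-20:   # converged
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Note that the inner loop does only two flops per pass but one indirect memory load - exactly the opposite balance to the dense matrix multiply that Linpack rewards.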

The upshot, say Heroux and Dongarra, is that designing a system to reach one exaflops on Linpack might result in a machine that is absolutely terrible at running real-world applications, thus defeating one of the primary purposes of a benchmark test. Benchmarks are an absolutely necessary feedback loop for system designers. Here is an example of the divergence cited by Heroux and Dongarra on the "Titan" Opteron-Tesla ceepie-geepie down at Oak Ridge National Laboratory, with which they are intimately familiar.

Read this carefully:

"The Titan system at Oak Ridge National Laboratory has 18,688 nodes, each with a 16-core, 32 GB AMD Opteron processor and a 6GB Nvidia K20 GPU. Titan was the top-ranked system in November 2012 using HPL. However, in obtaining the HPL result on Titan, the Opteron processors played only a supporting role in the result. All floating-point computation and all data were resident on the GPUs. In contrast, real applications, when initially ported to Titan, will typically run solely on the CPUs and selectively off-load computations to the GPU for acceleration."

There is always an immediate predecessor, so don't try to tell me there isn't

This strikes me as a software problem, not a hardware problem - and, not only that, one that doesn't have any particular bearing on the relevance of Linpack. Even Heroux and Dongarra basically admit as much in the next paragraph of the HPCG proposal:

"Of course, one of the important research efforts in HPC today is to design applications such that more computations are Type 1 patterns, and we will see progress in the coming years. At the same time, most applications will always have some Type 2 patterns and our benchmarks must reflect this reality. In fact, a system's ability to effectively address Type 2 patterns is an important indicator of system balance."

What this server hack really wants to know is how the initial Oak Ridge application set is performing on Titan, and how much work needs to be done to optimize the code to actually use those GPU accelerators that US taxpayers (like me) have shelled out for.

I want something even more devious than that, too. I want to see the Top500 list completely reorganized, using its historical data in a more useful fashion. I want the list to show a machine and its immediate predecessor - there is always an immediate predecessor, so don't try to tell me there isn't - and that predecessor is running a key real-world application, too. So, for these two machines, I want three numbers: peak theoretical flops, sustained Linpack flops, and relative performance of the key workload (you won't be able to get actual performance data for that last one, of course).

I want the Top500 table to show the relative performance gains for all three across the two machines. And predecessor machines don't necessarily have to be in the same physical location, either. The "Kraken" super at the University of Tennessee runs code for NOAA, which also has its own machines.

So, for instance, it might show that Titan has a peak theoretical performance of 27.12 petaflops across its CPUs and GPUs, and running the Linpack test and loading up both computing elements it was able to deliver 17.59 petaflops of oomph in an 8.21 megawatt power envelope. Importantly, the Titan machine has 560,640 Opteron cores running at 2.2GHz. The "Jaguar" super that predates Titan at Oak Ridge (technically, Jaguar was upgraded to Titan in two steps) was an all-CPU machine with 298,592 cores running at 2.6GHz that had a peak theoretical performance of 2.63 petaflops and a sustained Linpack performance of 1.94 petaflops, all in a 5.14 megawatt power envelope.

So the delta on peak performance moving from the last iteration of Jaguar (it had a processor and interconnect upgrade a little more than a year ago before adding the GPUs to make it a Titan late last year) was a factor of 10.31, but on sustained performance for the Linpack test, the delta between the two machines was only a factor of 9.06. And, by the way, that was only because a lot of the calculations were done on the GPUs, and even then, Titan still only had a computational efficiency across those CPUs and GPUs of 64.9 per cent. It may be that it just cannot be pushed much higher than that. The best machines on the list might have an 85 to 90 per cent computational efficiency.

But here's the fun bit. If you just look at the CPU portions of the Jaguar and Titan machines, then the aggregate floating point delta moving from Jaguar to Titan was a mere 58.9 per cent. In other words, if a key Oak Ridge application could, in theory, be deployed across nearly twice as many cores and a faster interconnect and make use of them all perfectly efficiently and did not offload any calculations to the GPU, then you get a speedup of about a factor of 1.6. If you offload some calculations to the Tesla GPUs - but only some, as Heroux and Dongarra said was the case - then maybe, and El Reg is guessing wildly here - maybe you could get that delta up to a factor of 2X, 3X, or even 4X.
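For anyone who wants to check this server hack's arithmetic, here it is in a few lines of Python, using only the figures quoted above (the text's 9.06 truncates where rounding gives 9.07):

```python
# Checking the Jaguar-to-Titan deltas, using the figures quoted in the
# text (petaflops, megawatts, core counts, and clock speeds).
titan_peak, titan_linpack, titan_mw = 27.12, 17.59, 8.21
jaguar_peak, jaguar_linpack, jaguar_mw = 2.63, 1.94, 5.14

print(round(titan_peak / jaguar_peak, 2))          # 10.31: peak delta
print(round(titan_linpack / jaguar_linpack, 2))    # 9.07: sustained delta
print(round(100 * titan_linpack / titan_peak, 1))  # 64.9: Titan efficiency

# CPU-only comparison: aggregate core-GHz, ignoring the GPUs entirely.
titan_cpu = 560_640 * 2.2    # Opteron cores x clock, as listed above
jaguar_cpu = 298_592 * 2.6
print(round(100 * (titan_cpu / jaguar_cpu - 1), 1))  # 58.9 per cent gain
```

Core-GHz is a crude proxy for aggregate floating-point capacity, of course - it ignores instructions per clock, vector widths, and the interconnect - but it is the comparison being made in the paragraph above.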

The point is not to guess, but for Oak Ridge to tell us as part of a Top500 submission. Why are we all guessing? Moreover, why not have the Top500 ranked by the familiar Linpack, add an HPCG column and one key real-world relative performance metric, and show the before and after machines? More data is better.

There are some other things that a revised supercomputer benchmark test needs as well, since we are mulling it over. Every machine has a list price, and every benchmark run should have one. Every machine has a measured wall power when it is running, and every benchmark run should have one. The Top500 list is fun, but it needs to be more useful, and the reality is that supercomputing is more constrained by the power envelope and the budget than any other factors, and this data has to be brought into the equation. Even estimated street prices are better than nothing, and I am just annoyed enough to gin them up myself and get into a whole lot of trouble.
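To make that concrete, here is a hypothetical sketch of what such an extended submission record might look like - none of these field names exist in the real Top500 submission format, this is just the wish list above written down as a data structure:

```python
# A hypothetical extended Top500 submission record: the usual flops
# numbers plus measured wall power and a (possibly estimated) price.
# These field names are invented for illustration, not the real format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Submission:
    name: str
    peak_pflops: float           # theoretical peak
    linpack_pflops: float        # sustained HPL result
    wall_power_mw: float         # measured during the benchmark run
    list_price_musd: Optional[float] = None  # even an estimate beats nothing

    @property
    def efficiency(self) -> float:
        """Computational efficiency: sustained over peak."""
        return self.linpack_pflops / self.peak_pflops

    @property
    def pflops_per_mw(self) -> float:
        """Sustained petaflops per megawatt of measured wall power."""
        return self.linpack_pflops / self.wall_power_mw
```

With power and price on every row, sorting the list by petaflops per megawatt or per million dollars becomes a one-liner - which is the whole point.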

Another thing to consider is that the needs of the very high end of the supercomputer space are far different from the needs of the rest of the HPC community.

At the high end, research labs and HPC centers are writing their own code, but smaller HPC installations are running other people's code. The addition of a real-world workload is useful for such customers, particularly if it is an off-the-shelf application. What would be truly useful is a set of clusters of varying sizes and technologies running the ten top HPC applications - perhaps one for each industry - with all the same before-and-after configurations and deltas as machines are upgraded in the field. I think what you will find is that a lot of commercial applications can't scale as far as you might hope across ever-increasing cluster sizes, and software developers in the HPC racket would be extremely averse to such odious comparisons. It would be nice to be surprised and see third-party apps scale well, so go ahead, I dare you - surprise me.

The HPC industry needs a complete set of benchmark test results that show a wide variety of components being mixed and matched in different patterns, and the effects on application performance. We don't need 50 different Linpack test results on clusters with slightly different configurations; we need results on as many diverse systems as we can find and test. This is how Dongarra originally compiled his Linpack benchmarks, and it was the right way to do it. ®