On HPC benchmarking: measure and report the important things that users care about … wall clock time

Want to improve application performance by 10x or 100x? Few HPC customers would say no. Yet in some cases, the promises of tremendous performance improvements from accelerators, attached processors, field-programmable gate arrays, and the like evaporate when total application performance is evaluated.

This is a bit on the (negative) sensationalist side, though the author is correct in pointing out that these technologies have been overhyped.

I’ve been using a phrase to discuss this and other issues for a while.

With very few exceptions, you cannot simply drop in a new chip, a new memory DIMM, a new network card … and instantly have everything go faster.

This (for the most part) doesn’t happen, unless the system you are doing this to has a deep dependency upon the aspect you are changing (latency in networking, I/O bandwidth in disks …).

In other words, to make efficient use of the resources, you need to adapt your codes to fit the underlying platform better.

This, as it turns out, is hard for many codes.

I am a huge fan of accelerators. Really. In large part because I see them as a hardware embodiment of a portion of software, specifically a portion that could be more effectively accomplished in terms of specialized hardware than generic software. I am also a (severe) critic of compiler quality. Our compilers do not use our machine resources effectively, and they, at a deep level, determine our application performance.

Where accelerators make sense is when you have codes with sections well suited to them. GPU codes effectively use hardware subroutine/method/function calls to perform their work. I/O codes can use hardware accelerated RAIDs as in JackRabbit. Computational codes can exploit the on-board micro-vectors (SSE), or the off-board GPGPUs. Eventually (shortly?) we may be able to exploit many many cores in a Larrabee.

They will accelerate repetitive processing, and help offload various processing from the main machine so it has more time to run Word or OpenOffice. Which will make them seem faster.

Finally, the issue the author took with benchmarking was that of kernels versus applications. This critique is spot on. When GPU-HMMer reported a 100x performance delta, this was NOT 100x on the kernel. It was 100x wall clock. Timed with (the equivalent of) a stopwatch.

End users don’t care how fast a kernel is. They care how fast their application is. What is interesting about our application is that it appears to be (one of) the first to report whole application level performance delta versus kernel level delta.

No one (but computer scientists) cares about the latter. Everyone (but computer scientists) cares about the former.

We benchmarked a number of accelerators in the past and found that claims of 50-100x kernel performance rarely resulted in more than 5-10x application level performance. Basically you have an Amdahl’s law type of thing going.
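To make the Amdahl's law point concrete, here is a minimal Python sketch. The function name and the 90% kernel fraction are my assumptions for illustration, not measurements from any of the benchmarks mentioned above.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Whole-application speedup when only part of the run time is accelerated.

    accelerated_fraction: share of the original wall clock time spent in the kernel (0..1).
    kernel_speedup: how many times faster the kernel alone becomes.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # New run time as a fraction of the old: untouched part plus shrunken kernel part.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


# Assumption for illustration: the kernel is 90% of the original wall clock time.
for kernel_speedup in (50.0, 100.0):
    app_speedup = amdahl_speedup(0.90, kernel_speedup)
    print(f"{kernel_speedup:.0f}x kernel -> {app_speedup:.1f}x application")
```

Under that assumed 90% fraction, a 50x kernel gives roughly 8.5x on the whole application and a 100x kernel roughly 9.2x, which is in the same range as the 5-10x figures above.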

As the article went on, it used something I have been using for more than a decade … a simple equation to represent total execution time. So make one portion of this zero. Imagine your processing time goes to zero. You still have I/O and other elements, so your application will not instantly become a zero time enterprise.

But this can also give you a rubric for estimating a best case scenario.

Imagine 75% of your application’s time is spent in one routine. If you could make that routine take (effectively) zero time, then you would get a 4x speedup. The questions you need to ask are: a) what is the value of that 4x speedup, and b) what is the cost (time/money/effort) of that speedup? We just went through this with a customer. Short version: in some cases, a 4-10x speedup at 1.5x the price (the marginal additional cost of a GPU card) is a win.
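The best case estimate can be sketched in a few lines of Python. The time breakdown below is hypothetical (only the 75% figure comes from the example above; the I/O and "everything else" split is made up), and the function name is mine.

```python
def best_case_speedup(component_seconds: dict, zeroed: str) -> float:
    """Upper bound on speedup if one component of the run time dropped to zero."""
    total = sum(component_seconds.values())
    remaining = total - component_seconds[zeroed]
    if remaining <= 0.0:
        raise ValueError("nothing left to run; the bound is unbounded")
    return total / remaining


# Hypothetical profile: 75% in one routine, the rest split between I/O and other work.
profile = {"hot_routine": 75.0, "io": 15.0, "everything_else": 10.0}

print(best_case_speedup(profile, "hot_routine"))  # 4.0
```

No accelerator can beat this bound for that routine, so it is a quick sanity check to run against any vendor claim before spending money.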

This is why I have been pointing out for years that accelerator price matters tremendously, and why high priced accelerators are doomed to failure even before they hit the market. No one wants to spend 4x to get 4x real speedup. They gain nothing on the cost side. So unless customers are made of money, the cost-benefit analysis works against such accelerators.
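One crude way to put numbers on this is speedup per unit of relative cost. This metric is my own simplification for illustration (it ignores porting effort, power, and so on), and the function name is hypothetical.

```python
def speedup_per_cost(app_speedup: float, cost_ratio: float) -> float:
    """Application speedup divided by relative system cost (1.0 = break even)."""
    if cost_ratio <= 0.0:
        raise ValueError("cost_ratio must be positive")
    return app_speedup / cost_ratio


# 4x real speedup at 1.5x the price: more work done per dollar.
print(f"{speedup_per_cost(4.0, 1.5):.2f}")  # 2.67

# 4x real speedup at 4x the price: no gain on the cost side.
print(f"{speedup_per_cost(4.0, 4.0):.2f}")  # 1.00
```

Anything near 1.0 means you could have bought more of the plain machines instead and ended up in the same place.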