July 12, 2011

Reviewing a paper that uses GPUs

Graphical processing units (GPUs) are all the rage these days. Most journal issues would seem incomplete if at least one article didn’t mention the word “GPUs”. Like any good geek, I was initially intrigued by the idea of using GPUs for statistical computing. However, last summer I messed about with GPUs and the sparkle wore off. After looking at a number of papers, it strikes me that reviewers are forgetting to ask basic questions when reviewing GPU papers.

For speed comparisons, do the authors compare a GPU with a multi-core CPU? In many papers, the comparison is with a single-core CPU. If a programmer can use CUDA, they can certainly code in pthreads or OpenMP. So take off a factor of eight from the claimed speed-up to simulate a comparison with a multi-core CPU.

Since the GPU has (usually) been bought specifically for the purpose of the article, the CPU used in the comparison can be a few years older. So take off a factor of two for each year of age difference between the CPU and the GPU.

I like programming with doubles. I don’t really want to think about single precision and all the difficulties that entails. However, many CUDA programs are compiled in single precision. Take off a factor of two for double precision.

When you use a GPU, you split the job into blocks of threads. The number of threads in each block depends on the type of problem under consideration and can have a massive impact on speed. If your problem is something like matrix multiplication, where each thread multiplies two elements, then after a few test runs it’s straightforward to come up with an optimal thread/block ratio. However, if each thread is a stochastic simulation, it becomes very problem dependent. What works for one model could well be disastrous for another.

So in many GPU articles the speed comparisons could be reduced by a factor of 32!

Just to clarify, I’m not saying that GPUs have no future; rather, there has been some mis-selling of their potential usefulness in the (statistical) literature.



Point #1 used to be a problem, but not so much anymore, and papers like this should not get past review IMO. The more interesting question is whether they’re using the SIMD co-processor on the CPU or not. They’re usually not. But I’m willing to let that slide because, unlike on a GPU where access to the SIMD lanes is implicit in the language, it’s a PITA to access them on the CPU for most programmers. It’s really up to Intel and AMD to fix this one, not NVIDIA (though CUDA will actually soon be an option on x86: http://www.pgroup.com/resources/cuda-x86.htm), and that’s a work in progress right now.

Point #2 is bipolar in my experience. If the author of a paper has crap results but needs to publish or perish, they compare a spiffy new GPU to a 3.4 GHz Pentium 4, usually in single core mode. This is, of course, nonsense.

But…

When Intel and others set out to debunk GPU claims, they do the opposite. In one case (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.2755&rep=rep1&type=pdf), they took an 18-month old, 2 generations obsolete GPU, the GTX 280, and compared it to Intel’s latest and greatest and found it was *only* 2.5-5 times faster. 10x CPU (assuming all cores are firing and applying your 2x factor the other way as cruelly as possible) performance or better is frickin’ awesome in my book.

Point #3 is an excellent rule of thumb, but it’s not strictly necessary: http://dx.doi.org/10.1021/ct200239p. That said, don’t let the GPU vendors off the hook with this one. They really do need to provide access to double-precision on par with CPUs for cases where there’s no other option.

Point #4 is just the first level of the rabbit hole of SIMT programming. To get one of those magic 50-500x performance numbers, you have to uncover every level of parallelism in your problem and exploit it, usually applying warp-level programming in the process of doing so. This remains an art form to this day.

Overall, I’m finding very few people *get* GPU programming and its implications for all the major processor vendors’ roadmaps. But the gist is that multi-core SIMD is the fastest route to performance for the foreseeable future. And that applies to everything from your cell phone (i.e. WebCL and RenderScript) to your desktop. Every single one of them is widening its SIMD lanes and adding cores. Get used to it.

1. I agree: comparing a GPU to a single CPU core should just be rejected. However, at the conference I just attended, one of the talks did just that: GPU vs a single-core CPU.

2. Again I agree. Here I have to come clean. When I messed about with GPUs, I did the same dodgy old-CPU vs new-GPU comparison. I just didn’t think about what I was doing. Two weeks later it dawned on me that I had been a bit naive 😉 I suspect many people do the same.

3. Agreed.

4. In my applications, I am interested in using GPUs for statistical models. However, with a new data set, we usually need a new model. So spending lots of time getting every drop of performance is not really worth it.

Summary: as you point out, GPUs can be useful. I’m particularly excited about GPU support in native R graphics and using GPUs for matrix multiplication. However, trying to leverage GPUs into every possible application is obviously not sensible.

1) I would like to point out that people are not willing (or do not have the time, resources, whatever, it does not matter) to spend time on every option (OpenMP, threads, MPI, parallel languages, etc.) to see what is best for them. That is why you often see comparisons of GPU vs single-threaded code. After all, if my application is 20x faster on the GPU than on a single core of a 2- or 4-core machine, it is still at least 5x or 10x faster than using all 2 or 4 cores.

2) It is also often the case that a low-end NVIDIA card is available, especially in laptops. Why not use every resource available?

1. I completely agree that authors don’t have the time/resources/skill to try every possible implementation. However, many authors never even hint that a factor of 4 (or 8) speed-up can be achieved using a fairly basic parallel implementation on a multi-core machine. A 20x speed-up using a GPU is no longer *that* impressive. If you want, I can give you a list of references where this occurs (off-line).
2. Of course you should use every resource, but I suspect that a low-end NVIDIA card would only give the same speed-ups as the multi-core processor in the laptop.

Don’t get me wrong – I think that GPUs are useful and can provide good speed-ups. But they can’t be used for every problem.