Posted by timothy on Thursday July 15, 2010 @03:37PM
from the expense-report-manipulation-++ dept.

Esther Schindler writes "After several years of trying, graphics processing units (GPUs) are beginning to win over the major server vendors. Dell and IBM are the first tier-one server vendors to adopt GPUs as server processors for high-performance computing (HPC). Here's a high level view of the hardware change and what it might mean to your data center. (Hint: faster servers.) The article also addresses what it takes to write software for GPUs: 'Adopting GPU computing is not a drop-in task. You can't just add a few boards and let the processors do the rest, as when you add more CPUs. Some programming work has to be done, and it's not something that can be accomplished with a few libraries and lines of code.'"

This is just like programming for a computer cluster... after a fashion.

Anyone used to doing both should have no problem with this.

I'm anything but a high end programmer (I mostly only code for myself), and I have written plenty of code that runs with 7-10 threads. Believe me, when you change the way you think about how an algorithm works, it doesn't matter if you are using 3 or 10000 processors.

This isn't hundreds of threads that can each run arbitrary code paths like on a CPU. You have to totally redesign your code, or have already implemented parallel code, so that you run a number of threads that all do the same thing at the same time, just on different data.

The threads all run in lockstep, as in, all the threads had better be at the same program counter (PC) at the same time. If you run into a branch in the code, then you lose your parallelism, as the divergent threads are frozen until they come back together.
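The divergence point above has a data-parallel workaround: compute both sides of the branch for every element and select by a mask, which is essentially what the hardware does to diverged lanes anyway. Here's a minimal sketch in plain NumPy (illustrative only, not actual GPU code):

```python
import numpy as np

# Predication sketch: when lockstep threads would diverge at a branch,
# evaluate BOTH paths for all elements and mask-select the result,
# instead of branching per element.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Branchy scalar version (what you'd write on a CPU):
#   y = x * x if x > 0 else -x
# Branch-free, predicated version (maps well to lockstep execution):
y = np.where(x > 0, x * x, -x)

print(y)  # both paths were computed for all five elements, then selected
```

The trade-off is that you always pay for both paths, which is exactly why the comment recommends structuring code for minimal branching in the first place.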

I'm not a big thread programmer, but I do work on threading tools. Most of the problems with threads seem to come from threads taking totally different code paths, and the unpredictable scheduling interactions that arise between them. GPU coding is a lot more tightly controlled.

I'm really interested in using GPGPU for my physics calculations. But you know - I don't want to learn Nvidia's low-level, proprietary (whateveritis) in order to do an addition or multiplication, which may or may not outperform the CPU version. What would be _really_ great is stuff like porting the standard "low-level numerics" libraries to the GPU: BLAS, LAPACK, FFTs, special functions, and whatnot - the building blocks for most numerical programs. LAPACK+BLAS you already get in multicore versions, and there's no extra work on my part to use all cores on my PC. Please, computer geeks (i.e. more computer geek than myself), let me have the same on the GPU. When that happens, we can all buy Nvidia HotShit gaming cards and get research done. Until then, GPGPU is for the superdupergeeks.
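Part of that wish already works on the CPU side, which is worth making concrete: NumPy delegates its linear algebra to whatever BLAS it was built against, so a plain matrix multiply already hits the optimized (often multithreaded) path with no extra work from the user. A tiny sketch:

```python
import numpy as np

# np.dot on 2-D float64 arrays dispatches to the BLAS dgemm routine that
# NumPy was linked against -- the "no extra work on my part" situation
# the poster describes for the CPU.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])
c = np.dot(a, b)  # BLAS matrix-matrix multiply under the hood
print(c)
```

The GPU analogue the poster is asking for is the same idea one layer down: Nvidia does ship BLAS- and FFT-style libraries (CUBLAS, CUFFT) with CUDA, though wiring them into an existing numerical code is not yet as transparent as swapping the CPU BLAS.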

Well, GPGPU actually, in a way, addresses memory bandwidth. Mostly due to design limitations, each GPU comes with its own memory, and thus its own memory bus and bandwidth. Of course you can get that for CPUs as well (with new Intels or any non-ancient AMD) by going to multiple sockets, but that is more effort and costlier: with 6 PCIe slots (unusual but obtainable) you can have 12 GPUs, each with their own bus; try getting a 12-socket motherboard.
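The aggregate-bandwidth point can be put in rough numbers. All figures below are assumed, ballpark circa-2010 values, not specs for any particular part:

```python
# Rough aggregate memory-bandwidth comparison; every figure here is an
# assumption for illustration, not a measured spec.
gpu_mem_bw = 150e9   # ~150 GB/s of GDDR5 bandwidth per GPU (assumed)
gpus       = 12      # the 6-slot, dual-GPU-board scenario above
cpu_mem_bw = 25e9    # ~25 GB/s per CPU socket (assumed)
sockets    = 2       # a common dual-socket server

gpu_total = gpus * gpu_mem_bw
cpu_total = sockets * cpu_mem_bw
print(f"GPUs: {gpu_total / 1e9:.0f} GB/s aggregate")
print(f"CPUs: {cpu_total / 1e9:.0f} GB/s aggregate")
```

Under these assumptions the GPU box has well over an order of magnitude more aggregate memory bandwidth, which is the whole appeal for bandwidth-bound workloads.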

Most post-secondary schools are now teaching students how to properly thread for parallel programming.

No they aren't. Even grad courses are no substitute for actually doing it. Never mind that general parallel processing is a different animal from the SIMD-like model that most GPUs use.

I haven't had to deal with any of it myself, but I imagine it'll boil down to knowing what calculations in your program can be done simultaneously, and then setting up a way to dump it off onto the next available core.

No, it's not like that. You set up a warp of threads running the same code on different data, and structure it for minimal branching. That's the thumbnail sketch; Nvidia has some good tutorials on the subject, and you can use your current GPU.
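The execution style the parent describes can be sketched as a toy model: one kernel function, many "threads", each distinguished only by its thread index, all running the same body on different data. This is plain Python standing in for a CUDA-style launch (the names `saxpy_kernel` and `tid` are just illustrative), with a loop where the hardware would launch one thread per element:

```python
# Toy model of the warp idea: same code, different data, indexed by tid.
def saxpy_kernel(tid, a, x, y, out):
    # Every simulated thread executes this identical body; only tid differs.
    out[tid] = a * x[tid] + y[tid]

n = 8
a = 2.0
x = list(range(n))    # [0, 1, ..., 7]
y = [1.0] * n
out = [0.0] * n

# On a GPU the hardware launches one thread per element in parallel;
# here we just loop to show that each thread's work is independent.
for tid in range(n):
    saxpy_kernel(tid, a, x, y, out)

print(out)
```

Because no iteration depends on another, the loop order doesn't matter, and that independence is precisely what lets the GPU run all of them in lockstep.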

I've done a little CUDA programming, and I've yet to find significant speedups doing it. Every single time, some limitation in the architecture keeps it from running well. My last little project ran about 30x faster on the GPU than on the CPU; the only problem was that the overhead of getting the data to the GPU, plus the computation, plus the overhead of getting it back, was roughly equal to the time it took to just dedicate a CPU to it.
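That experience fits a simple back-of-envelope model: total GPU time is host-to-device transfer plus kernel time plus device-to-host transfer, so a big kernel speedup can still lose to the PCIe round trip. Every number below is a made-up assumption chosen to reproduce the "roughly equal" outcome, not a measurement:

```python
# Break-even sketch for GPU offload; all figures are assumptions.
bytes_moved = 3 * 1024**3   # input + output data, ~3 GiB total (assumed)
pcie_bw     = 3e9           # ~3 GB/s effective PCIe bandwidth (assumed)
cpu_time    = 1.2           # seconds for the pure-CPU version (assumed)
gpu_speedup = 30.0          # the kernel alone runs 30x faster

transfer_time = bytes_moved / pcie_bw
kernel_time   = cpu_time / gpu_speedup
gpu_total     = transfer_time + kernel_time

print(f"transfer {transfer_time:.2f}s + kernel {kernel_time:.2f}s "
      f"= {gpu_total:.2f}s vs CPU {cpu_time:.2f}s")
# The 30x kernel win nearly vanishes once the PCIe round trip is counted.
```

The takeaway is that offload only pays when the arithmetic done per byte transferred is high enough to amortize the bus, which is why data that can stay resident on the GPU across many kernel calls fares much better.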

I was really excited about AES on the GPU too, until it turned out to be about 5% faster than my CPU.

Now if the GPU were designed more as a proper coprocessor (à la the early x87, or early Weitek) and integrated into the memory hierarchy better (with the funky texture RAM and such put off to the side), some of my problems might go away.