In the continual progression of GPU technology, we've seen GPUs become increasingly useful at generalized tasks as they have added flexibility for game designers to implement more customized and more expansive graphical effects. What started out as a simple fixed-function rendering process, where texture and vertex data were fed into a GPU and pixels were pushed out, has evolved into a system where a great deal of processing takes place inside the GPU. The modern GPU can be used to store and manipulate data in ways that go far beyond just quickly figuring out what happens when multiple textures are mixed together.

What GPUs have evolved into today are devices that are increasingly similar to CPUs in their ability to do more things, while still specializing in only a subset of abilities. Starting with Shader Model 2.0 on cards like the Radeon 9700 and continuing with Shader Model 3.0 and today's latest cards, GPUs have become floating-point powerhouses that are able to do most floating-point calculations many times faster than a CPU, a necessity as 3D rendering is a very FP-intensive process. At the same time, we have seen GPUs add programming constructs like looping, branching, and other abilities previously only used on CPUs, but which are crucial to enable effective programmer use of the GPU resources . In short, today's GPUs have in many ways become extremely powerful floating-point processors that have been used for 3D rendering but little else.

Both ATI and NVIDIA have been looking to put the expanded capabilities of their GPUs to good use, with varying success. So far, the only types of programs that have effectively tapped this power other than applications and games requiring 3D rendering have also been video related, such as video decoders, encoders, and video effect processors. In short, the GPU has been underutilized, as there are many tasks that are floating-point hungry while not visual in nature, and these programs have not used the GPU to any large degree so far.

Meanwhile the academic world has been working on designing and utilizing custom-built floating-point hardware for years for their own research purposes. The class of hardware related to today's topic, stream processors, are extremely powerful floating-point processors able to process whole blocks of data at once, where CPUs carry out only a handful of numerical operations at a time. We've seen CPUs implement some stream processing with instruction sets like SSE and 3DNow!+, but these efforts still pale in comparison to what custom hardware has been able to do. This same progress was happening on GPUs, only in a different direction, and until recently GPUs remained untapped as anything other than a graphics tool.

Today's GPUs have evolved into their own class of stream processors, sharing much in common with the customized hardware of researchers, as a result of the 3D rendering process also being a streaming task. The key difference here however is that while GPU designers have cut a couple of corners where they don't need certain functionality for 3D rendering as compared to what a custom processor can do, by and large they have developed extremely fast stream processors that are just as fast as custom hardware but due to economies of scale are many, many times cheaper than a custom design.

It's here where ATI is looking for new ideas on what to run on their GPUs as part of their new stream computing initiative. The academic world is full of such ideas, chomping at the bit to run their experiments on more than a handful of customized hardware designs. One such application, and part of the star of today's announcement, is Folding@Home, a Stanford research project designed to simulate protein folding in order to unlock the secrets of diseases caused by flawed protein folding.

Post Your Comment

43 Comments

The folding team just hasn't designed their architecture efficiently for parallelism within a system.

No doubt they are brilliant computational biologists, but it's simply an oxymoron to claim a system can scale well using thousands of systems but not with the cores within those systems - Nonsense.

In fact I challenge anyone from their coding team to explain this contradiction.

Now if they say look, we're busy, we just haven't had time to optimize the architecture for multi-core yet, then that makes perfect sense. But to say inherently the problem doesn't lend itself to that is not right.

Not at all true! See above comments, but data dependency is a key. They know the starting point, but beyond that they don't know anything. So they might generate 100,000 (or more) starting points. There's 100K WUs out there. They can't even start the second sequence of any of those points until the first point is complete.

Think of it within a core: They can split up a task into several (or hundreds) of pieces only if each piece is fully independent. It's not like searching for primes where scanning from 2^100000 to 2^100001 is totally unrelated to what happened in 2^99999 to 2^100000. Here, what happens at stage x of Project 2126 (Run 51, Clone 9, Gen 7) absolutely determins where stage x+1 of Project: 2126 (Run 51, Clone 9, Gen 7) begins. A separate task of Project: 2126 (Run 51, Clone 9, Gen 6) or whatever can be running, but the results there have nothing to do with Project: 2126 (Run 51, Clone 9, Gen 7). Reply

Think of it this way - what is the algorithmic difference between submiting jobs to distributed PCs vs. distributed processes within a PC?

Multiple processes within a PC could operate indepedently and easily take advantage of the multi-core parallelism. A master UI process could manage the sub processes on the machine so the that user would not even require special setup by the user.

I'm telling you the problem with leveraging multi-core is not inherent to the folding problem, it's just a limitation of how they've designed their architecture.

Again not to take away credit from all the goodness they have achieved, but if you think about it this is really indisputable. I'm sure their developers would agree. Reply

Are we talking about *can* they get some advantage from multiple cores with different code, or are we talking about gaining a nearly 2X performance boost? I would agree that there is room for them to use more than one core, but I would guess the benefit will be more like a 50% speedup.

Right now, running two instances of FAH nearly doubles throughput, but no individual piece is completed faster. They could build in support for executing multiple cores without user intervention, but that's not a big deal since you can already do that on your own. Their UI could definitely be improved. The difficulty is that they aren't able to crank out individual pieces faster; they can get more pieces done, but if there's a time sensitive project they can't explore it faster. For example, what if they come on a particular folding sequence that seems promising, and they'd like to investigate it further with 100K slices covering several seconds (or whatever). If piece one determines piece 2, and 2 determimes 3... well, they're stuck with a total time to calculate 100K segments that would be in the range of thousands of years (assuming a day or two per piece).

Anyway, there are tasks which are extremely difficult to thread, though I wouldn't expect this to be one. Threading and threading really well aren't the same, though. Four years from now, if they get octal core CPUs, that increases the total number of cores people can process, but they wouldn't be able to look at any longer sequences than today if CPUs are still at the same clockspeed. (GPUs doing 40X faster means they could look at 40X more length/complexity.)

Anyway, without low level access to their code and an understanding of the algorithms they're using, the simple truth is that neither of us can say for sure what they can or can't get from multithreading. Then there's the whole manpower problem - is it more beneficial to work on multithread, or to work on something else? Obviously, so far they have done "something else". :) Reply

Looking at their website, they are working on a multithreaded core which would take advantage of smp systems. Regardless of how well that turns out, a 40x increase is not going to happen until we get > 40 cores in a cpu, so this GPU client is still a very big deal.

I understand what you mean about data dependence and not being able to move on to more involved simulations due to time factors of individual work units, but it seems like this would be fairly easy to solve by simply splitting the work units in half or in quarters, etc. This could definitely be difficult to do, though, depending on how their software has been designed. Perhaps they would have to completely rewrite their software and it isn't worth the trouble. Reply

I don't think they can split a WU in half, though, or whatever. Best they can do would be to split off a computation so that, i.e. atoms 1-5000 are solved at each stage on core 1 and 5001-10000 are on core 2. You still come back to the determination of the "trajectory". If you start at A and you want to know where you end up, the only way to know is to compute each point on the path. You can't just break that calculation into A-->C and then C-->B with C being halfway.

I know the Pande people are working on a lot of stuff right now, so GPUs, PS3, SMP, etc. are all being explored to varying extents. Reply

The reason that modern GPUs are so powerful is that they have many parallel processing pipelines, which is only a little different than saying that they have many processing cores. Even the diagram given in this article is titled: "Modern GPU: 16-48 Multi-threaded cores." If the F@H algorithm can be optimized to use the parallelism that exists within modern GPUs, it should also be optimizeable for the parallelism of multi-core CPUs. Reply

I think it's about data dependency. Let's say you start 2000 processes on different PCs and run them for 1 unit time. The result from this is 2000 processes at 1 unit time, not 1 process at 2000 units time, which is probably what you'd prefer. Having a massive speed up on a single node means that that node can push a single "calculation" farther along. I'd guess that the client itself is not multithreaded because of the threading overhead, it may not be worth the effort to optimize heavily for a dual-core speed up since the overhead will take a chunk out of that but a 40x speed up is another thing altogether. Reply

The way FAH currently works is that pieces of a similation are distributed; some will "fail" (i.e. fold improperly or hit a dead end) early, others will go for a long time. So they're trying to simulate the whole folding sequence under a large set of variables (temperatures, environment, acid/base, whatever), and some will end earlier than others. Eventually, they reach the stage where most of the sequences are in progress, and new work units are generated as old WUs are returned. That's where the problem comes.

If we were still scaling to higher clock speed, they could increase the size/complexity of simulations and still get WUs back in 1-5 days on average. If you add multiple cores at the same clock speed as earlier CPUs (i.e. X2 3800+ is the same as two Athlon 64 3200+ CPUs), you can do twice as many WUs at a time, but you're still waiting the same amount of time for results that may be important for future WU creation.

Basically, Pande Group/Stanford has simulations that they'd like to run that might take months on current high-end CPUs, and then they don't know how fast each person is really crunching away - that's why some WUs have a higher priority. Now they can do those on an X1900 and get the results in a couple days, which makes the work a lot more feasible to conduct.