I played around with CUDA a while back, but I wasn't all that hyped on it. The syntax was shockingly similar to programming with shaders, so I just dumped it, figuring it was a tool for non-graphics programmers to make use of the GPU. I recently decided to give GPGPU another try with OpenCL. I couldn't find any decent demos out there, so I was losing interest pretty fast. I decided to go ahead and implement my own demo to see what all the fuss was about.

The first problem was learning the API. It's not like other popular APIs where you just type a single word into Google and you are bathed in helpful websites. Since it's a relatively new and underrated tech, you actually have to do your homework. Luckily I found a site that gives a decent intro to the OpenCL API. It's not perfect: it's missing some important cleanup routines and doesn't cover performance issues. I had to read over the OpenCL spec to learn about those (note: not fun).

So now that I had a wrapper for OpenCL, I went ahead with my first demo. I was losing patience fast, so I built something I knew was computationally insane and also easy to implement. Ahem... the Universe

At 32768 stars, 193 GFLOPS, the demo runs at about 10 fps on my ATI 5750 HD. Simulating the environment with all 6 cores on my AMD Thuban, it takes 63 seconds to render a single frame.

Overall I'm pleased with the experience. Runtime kernel compilation makes developing and debugging kernels a pleasure. Once you get to know the API, the rest pretty much follows. From what I know, PhysX is a good example of what some games out there today take advantage of, but I look forward to seeing some neat uses such as real-time ocean (water) dynamics and real-time smoke effects (see the Blender 3D smoke emitter). I would not be surprised if you could make a great game on either of those two topics alone.

rouncer
—
2011-04-11T13:37:01Z —
#2

That's really nice, Nut. I'm fully into this GPGPU stuff too. I've thought about how to raytrace displacement maps, do high-poly boolean operations, and even how I could maybe get an accelerated interpreter going for possibly accelerated AI in games.

GPGPU looks real good to me, and OpenCL must make it better. I'm still stuck with shaders for the moment though. Like you said, it's tougher to find educational sites for CUDA and OpenCL than for shaders, so it makes it a little harder to learn.

nice images. *thumbs up*

My weapon for GPGPU is an nVidia GTX 480, and it kicks my CPU's bum. Makes sense to take advantage of it.

fireside
—
2011-04-11T16:14:38Z —
#3

The universe is a little beyond my comprehension, how about a platformer or something? Is this an html5 thing, or download?

mcneilm
—
2011-04-11T16:20:24Z —
#4

Wow! This is cool. Is it doing any physics in the background? I suppose I am asking (as a long-time reader, but first-time commenter!) if you have further detail on what it's calculating.

TheNut
—
2011-04-11T17:04:30Z —
#5

Thanks rouncer. I've been thinking a lot lately about using the GPU for raytracing. I know others have done it (Luxrender), so it's an interesting area to explore. One thing I would like to do is plug in OpenCL as a scripting language for my texture generator program to complement the Lua scripts, which often run too slow for my liking. It is totally made for that.

fireside, at present there are no downloads. I rushed to get it finished so I could go to sleep. If I have some time I'll add a UI and some fun features to poke around with. It's also built for the desktop right now, although when WebCL gets released I'll be all over that.

mcneilm, at 193 GFLOPS you better believe there be physics going on. The primary formula is Newton's law of universal gravitation. Each star is affected by the sum of the gravitational forces from all other stars in the system. With 32768 stars, that's 32768 * 32767 ≈ 1.07x10^9 comparisons.
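
For readers following along, the all-pairs calculation described above can be sketched in plain Python. This is only an illustration of direct-sum Newtonian gravity, not the demo's actual OpenCL kernel (which was not posted). The function names, the `SOFTENING` constant, and the use of SI units are assumptions made for the example.

```python
import math

G = 6.674e-11      # gravitational constant in SI units (m^3 kg^-1 s^-2)
SOFTENING = 1e-3   # assumed softening length to avoid division by zero; not from the post


def accelerations(positions, masses):
    """Direct-sum gravity: every star feels every other star, so the cost is O(n^2)."""
    n = len(positions)
    acc = [(0.0, 0.0, 0.0)] * n
    for i in range(n):
        xi, yi, zi = positions[i]
        ax = ay = az = 0.0  # accumulate in locals, store once per star
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            dist_sq = dx * dx + dy * dy + dz * dz + SOFTENING * SOFTENING
            # a_i += G * m_j * r_ij / |r_ij|^3
            s = G * masses[j] / (dist_sq * math.sqrt(dist_sq))
            ax += s * dx
            ay += s * dy
            az += s * dz
        acc[i] = (ax, ay, az)
    return acc


def pair_count(n):
    """Number of ordered (i, j) pairs with i != j that the double loop visits."""
    return n * (n - 1)
```

Calling `pair_count(32768)` gives 1073709056, which is the roughly 1x10^9 figure quoted in the post. On a GPU the outer loop would typically map to one work-item per star, with the inner loop unchanged.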

vrnunes
—
2011-04-11T17:55:11Z —
#6

So this simulated universe is expanding? With Newtonian physics? Regardless of accuracy, it already looks interesting. Please make a video for us to see that in motion. Have you seen kkapture? Very nice, it manipulates the clock so that your demo video records precisely at the configured framerate, even at insane HD resolutions/setups. =)

TheNut
—
2011-04-12T03:46:02Z —
#7

I uploaded a short video on YouTube. The compression unfortunately removed the colours in the video, but it gives you an idea of what it looks like running.

Reedbeta
—
2011-04-12T04:26:27Z —
#8

If you let that run for a little longer, I wonder if you'll start seeing galaxy formation... you should, if you're simulating the Newtonian physics accurately!

At 32768 stars, 193 GFLOPS, the demo runs at about 10 fps on my ATI 5750 HD. Simulating the environment with all 6 cores on my AMD Thuban, it takes 63 seconds to render a single frame.

That's ridiculous. Your CPU can do 134 GFLOPS, so you should get single-digit frames per second, not seconds per frame.

Is the OpenCL emulator really this slow, or are you using another method to run it on the CPU?
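
The back-of-envelope reasoning behind that single-digit estimate can be written out as a few lines of Python. It uses only the figures quoted in the thread (193 GFLOPS at 10 fps on the GPU, a claimed 134 GFLOPS peak on the CPU, 63 seconds per frame measured). It assumes the CPU could sustain its peak rate, which real code rarely does, so treat it as an upper bound and not a prediction.

```python
GPU_GFLOPS = 193.0         # quoted in the original post
GPU_FPS = 10.0             # quoted in the original post
CPU_PEAK_GFLOPS = 134.0    # peak figure claimed in the reply
CPU_SECONDS_PER_FRAME = 63.0  # measured CPU time quoted in the original post


def gflop_per_frame(gflops, fps):
    """Work done per frame, in GFLOP, at a given throughput and frame rate."""
    return gflops / fps


def ideal_fps(peak_gflops, work_per_frame):
    """Frame rate if the device ran at peak throughput the whole time."""
    return peak_gflops / work_per_frame


work = gflop_per_frame(GPU_GFLOPS, GPU_FPS)       # about 19.3 GFLOP per frame
cpu_ideal = ideal_fps(CPU_PEAK_GFLOPS, work)      # a little under 7 fps
cpu_measured = 1.0 / CPU_SECONDS_PER_FRAME        # well under 0.02 fps
shortfall = cpu_ideal / cpu_measured              # how far the CPU run is from peak

print(f"ideal CPU fps: {cpu_ideal:.2f}, measured: {cpu_measured:.4f}, "
      f"shortfall: {shortfall:.0f}x")
```

The gap of several hundred times between ideal and measured is what prompts the question about how the CPU version is being run.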

roel
—
2011-04-12T08:02:07Z —
#10

Nice, Nut. Your simulation must be O(n^2), right? Then 32768 stars is impressive.

TheNut
—
2011-04-12T12:08:34Z —
#11

Nick, I'm not using OpenCL's CPU implementation. This is a vanilla threaded app with whatever optimizations my C++ compiler can offer. I did fix some memory access issues by caching as much as I could in the registers. I reduced the time from 63s/f to 7s/f (lesson learned, slap wrist). Beyond this I am not sure. There could be other factors at play here, such as the leftover memory reads/writes, perhaps a better algorithm, and the probability of unoptimized compiled code. I remember using the Intel compiler once and seeing a vast improvement in performance. I'm still experimenting with the whole aspect of improving performance. My video card claims TFLOP capabilities, so I would like to see a number that comes close to that.

Reed, it's possible given sufficient time.

vrnunes
—
2011-04-12T16:05:29Z —
#12

Quite nice. Thanks for the video. Around 00:10, at bottom left, I can see a body coming in along a sinusoidal path. Weird. Is that body influenced by other bodies, or is that plain imprecision? Nice simulation, regardless.

tobeythorn
—
2011-04-12T17:44:01Z —
#13

TheNut, any chance you could share your code or write a tutorial? I too am interested in OpenCL, but found information limited.

TheNut
—
2011-04-12T22:48:47Z —
#14

vrnunes, I would wager something in the simulation is causing it to do that. When I add stars with varying mass, it seems to stabilize the system. I don't think it's a precision issue.

tobeythorn, I can't release source and I don't have enough time to finish an article or tutorial on the subject anytime soon. I do plan to write a couple of tutorials on varying subjects once my WebGL engine matures though. In the meantime, the link I posted above should get you jump-started with the OpenCL API. It shows you what you need to do from start to finish. If you have questions along the way, I could try to answer them for you. Having a background in parallel programming is helpful.

Nick, I'm not using OpenCL's CPU implementation. This is a vanilla threaded app with whatever optimizations my C++ compiler can offer. I did fix some memory access issues by caching as much as I could in the registers. I reduced the time from 63s/f to 7s/f...

Impressive speedup!

Beyond this I am not sure. There could be other factors at play here, such as the leftover memory reads/writes, perhaps a better algorithm, and the probability of unoptimized compiled code. I remember using the Intel compiler once and seeing a vast improvement in performance. I'm still experimenting with the whole aspect of improving performance.

You'd probably get another significant speedup from using SIMD.

But the biggest speedup of all would come from algorithmic optimization. Note that you can partition your stars. Every group of stars is observed (in a gravitational sense) by distant stars as a single heavy mass. So you can compute the center of gravity and total mass of each partition of stars, and then for each star individually add up the force from each other partition and from each of the other stars in the partition it belongs to. Using an octree probably makes sense.

Note that a GPU is much less suited for running 'intelligent' algorithms, so I wouldn't be surprised if the CPU can actually outrun the GPU...