Hey, guys. I know you're getting tired of this but I just can't stop it!

I managed to get 3.7 million particles running at 60 FPS!

This latest version is simply an improvement to fix a problem with the old transform feedback particle engine: it didn't work with SLI (multiple GPUs). The driver does not explicitly synchronize the buffer memory between GPUs after transform feedback, so trying to render particles from the last frame simply did nothing. I solved this by letting each GPU have its own particle buffer and update it twice (for 2 GPUs, that is) but only render it once. That way my two GPUs only ever touch their own feedback buffers, with no driver synchronization needed. It's not very efficient of course, since the updating now has to be done twice per GPU, but at least the rendering of the particles is only done once per GPU, which pretty much doubles fill-rate. Even with just my smoothed pixel-sized point particles I went from 3.0 million to 3.7 million particles, a roughly 23% increase in performance. Note that this was on an Nvidia GTX 295, which came out in January 2009; high-end at the time but not very spectacular today.

The main limitation at the moment is actually memory usage. My transform feedback code is extremely unoptimized when it comes to memory (I could reduce it by 25-30% with relative ease). The real problem, however, is that the driver isn't smart enough to figure out that each buffer is only used by one GPU, so every buffer is allocated on both GPUs. For 2 GPUs, I need 4 full particle buffers to be able to ping-pong between two of them on each GPU. 3 700 000 * 36 bytes * 2 * 2 = 508 MB of data... Of course, you don't need that many particles in a real game, so memory usage will be a much smaller problem there. The high efficiency of this technique, combined with the fact that I got it working at all on SLI/Crossfire systems, still makes it worth using even if you "only" have 100k particles or so.
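As a sanity check on those numbers, here's the allocation math as a quick sketch. The 36-byte stride, two GPUs, and two ping-pong buffers per GPU are taken from the post above; nothing else is measured:

```java
public class ParticleMemory {
    public static void main(String[] args) {
        long particles = 3_700_000L;
        int bytesPerParticle = 36;  // 9 floats, as used in this engine
        int gpus = 2;               // SLI pair
        int buffersPerGpu = 2;      // ping-pong pair per GPU

        // The driver allocates every buffer on every GPU, so each
        // per-GPU pair ends up duplicated across both cards.
        long totalBytes = (long) particles * bytesPerParticle * gpus * buffersPerGpu;
        System.out.println(totalBytes / (1024 * 1024) + " MB"); // 508 MB
    }
}
```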

As I wrote above, fill-rate is basically double what it was before, while the cost of updating particles is the same. With 3 million particles at a point size of 1 (4 pixels covered per particle due to smoothing) I "only" got a 23.3% increase in performance (60 --> 74), but with 100 000 particles at a point size of 43 (1 849 pixels per particle) the increase was 93.3% (60 --> 116). Fill-rate scales linearly with the number of GPUs, while particle update performance does not scale at all. My program can handle any number of GPUs (= up to 4, a limitation of SLI/Crossfire), but memory usage may become a problem on quad-SLI systems. =S

It should be possible to optimize this further by doing the update only once on each GPU but with twice as high a delta, but this may cause inconsistencies between the GPUs due to errors that build up over the lifetime of a particle. Might be worth investigating though, since particles generally live very short lives. I'd estimate such an implementation would reach at least 5 million particles on my graphics card, since it would scale perfectly with any number of GPUs.
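The divergence worry is easy to demonstrate: even ignoring rounding, one integration step with a doubled delta does not land on the same position as two smaller steps whenever velocity changes within the frame. A minimal sketch using semi-implicit Euler under a made-up constant gravity (all numbers here are hypothetical, not from the engine):

```java
public class DeltaDrift {
    public static void main(String[] args) {
        float dt = 1f / 60f;
        float gravity = -9.81f; // hypothetical constant acceleration

        // Two updates with dt, like the per-GPU double update:
        float v1 = 0f, p1 = 0f;
        for (int i = 0; i < 2; i++) {
            v1 += gravity * dt; // semi-implicit Euler
            p1 += v1 * dt;
        }

        // One update with 2 * dt, the proposed shortcut:
        float v2 = gravity * (2 * dt);
        float p2 = v2 * (2 * dt);

        // The single big step overshoots: p2 = 4*g*dt^2 vs p1 = 3*g*dt^2,
        // so the two GPUs would drift apart frame by frame.
        System.out.println(p1 + " vs " + p2);
    }
}
```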

However, I "solved" the floating point problems by simply doing the updates for the particles that needed them twice in the shader with a for-loop, instead of running the whole transform feedback pass twice. This turned out to be free (the pass is probably memory bottlenecked). I just hacked it all together, so it explodes for >2 GPUs and I honestly don't know exactly how it's working ^^', but I've compared it frame by frame with my original (non-SLI) version and it's identical. Performance speaks for itself: 5 790 000 particles at 62 FPS. If I increase the particle count any further I run out of VRAM (I only have 896 MB, minus the 41 MB Windows uses) and performance drops to 1-5 FPS due to swapping. Optimizing memory usage would probably improve performance too, since that seems to be the bottleneck.

My particle system can only get 100K particles at 23fps... although that's only one thread on the cpu.... and only uses a few megabytes and has point gravity.....

EDIT : My counter is better than yours! cough

I get around 2100-2200FPS (<0.5ms) with 100k particles...

javaw.exe uses 25 MB of RAM. VRAM usage is around 31 MB, including 6 MB for the 1080p framebuffer (theoretically the particle buffers use around 14 MB). CPU usage is close to 0%, since the only thing the CPU does is generate new particles and issue a few OpenGL commands per frame.

I discovered a stupid bug in my barely-working new SLI version: the ordering of the particles differs between GPUs. This doesn't affect performance, but it does cause some flickering. I didn't see it at first since the particles are so small they rarely overlap, and when they do it's so chaotic it's impossible to spot. When I increased the particle size I could easily see it. Well, it doesn't affect performance, so meh.

I wonder how many particles my GTX 580 with 1.5GB of VRAM could handle........

Not yet, I'll throw something together... Will be interesting to see how well it performs on other architectures. My old GTX 295 seems to work pretty well with transform feedback considering that it's very fast compared to my older shader/OpenCL implementations, but my laptop's GTX 460M takes a pretty big performance hit from it, probably because it has a lot less memory bandwidth. It seems like it's very tied to which architecture the GPU has, so it will be very interesting to see how it performs on later Nvidia cards and Radeon cards. I plan to release the single GPU version soon.

Note that transform feedback has nothing to do with OpenCL. It's just an extension to OpenGL that's available to OGL3 cards (core in OGL4).

Would be nice to compare the sources later and do real benchmarks. What kind of features does your particle system/simulation have?

Right now? Almost no features at all. Not even texturing. The point is that after updating the particles with transform feedback you have a perfectly formatted VBO with data that you can do whatever you want with. Want to draw an asteroid 3D model for each particle? Just use instancing. 2D sprites? Use a geometry shader.

Hmmm... this would take just about everything off of the CPU. The only question I have is how much of the GPU you take away. It's great for pure simulations, but when it comes to actually using it in a game you still have all those triangles you'll be needing to render.

Performance of the AFR version on my GPU was 358 980 000 particles per second (5 790 000 * 62), so around 350 million particles per second, including some cheap rendering. 1 million particles runs at around 350 FPS, so the math seems to work out correctly. Since particles are usually fragment limited, I think completely eliminating the updating and re-uploading on the CPU and doing it basically for free on the GPU is a big win. 100k particles should in theory run at about 3 500 FPS and therefore take around 0.28 ms to update and render. When I get home (I'm at uni) I can benchmark it with rasterization disabled to check the raw updating performance.
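Those throughput figures are easy to sanity-check; this sketch only re-derives the arithmetic from numbers already quoted in the thread:

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        // Particles per second from the AFR run: 5 790 000 particles at 62 FPS.
        long pps = 5_790_000L * 62;

        double fpsAt1M = pps / 1_000_000.0;                 // predicted FPS with 1 million particles
        double msAt100k = 100_000 / (double) pps * 1000.0;  // predicted frame cost for 100k particles

        System.out.println(pps);      // 358980000
        System.out.println(fpsAt1M);  // ~359, close to the observed ~350
        System.out.println(msAt100k); // ~0.28 ms
    }
}
```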

So, university is a bit boring atm, so I threw together a first version: ~4 million at 30 FPS, ~16 million at 5 FPS.

Running only on my laptop with an Nvidia 550, but without the proprietary driver, on Mesa on Linux. Have to test it on my desktop ^^

How are you updating and drawing your particles?

EDIT: Benchmark without any rendering (only updating with transform feedback): 5 000 000 particles at 110 FPS = 550 000 000 particles per second. One million particles take about 1.8 ms to update. That's definitely less than it would take to just upload that amount of data each frame.
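To put that 1.8 ms in perspective, compare it against a rough upload-cost estimate. The transfer rate below is purely an assumption (real PCIe throughput varies a lot); only the particle count, stride, and update time come from the posts:

```java
public class UpdateVsUpload {
    public static void main(String[] args) {
        long particles = 1_000_000L;
        int bytesPerParticle = 36;  // stride used earlier in the thread
        double updateMs = 1.8;      // measured transform feedback update cost

        // Hypothetical sustained host-to-GPU rate of ~4 GB/s (assumption,
        // roughly half of the theoretical PCIe 2.0 x16 peak).
        double bytesPerMs = 4.0 * 1024 * 1024 * 1024 / 1000.0;
        double uploadMs = particles * bytesPerParticle / bytesPerMs; // ~8.4 ms

        // Uploading alone already costs several times the on-GPU update,
        // before the CPU has even simulated anything.
        System.out.println(uploadMs > updateMs); // true
    }
}
```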

In some ways it might, but keep in mind that for me, OpenCL performed identically to OpenGL using textures to store the data and a shader to update them. Without transform feedback, you also have the huge problem of keeping track of which particles are alive. That approach has good peak performance when all particles are alive, but it's almost as expensive when you have no particles at all, since you still have to render a point for every allocated particle each frame to find the live ones. Transform feedback solves this: it compacts the live particles to the beginning of the VBO and lets you draw only the number that are actually alive, though it might be a little slower on some hardware. I also don't think multi-GPU rendering will work with OpenCL.

OpenCL was a bit disappointing. It's really only faster if you manage to use the clusters' shared (local) memory for your calculations (which you can't for particles), and even then it might not be faster. Just getting it up to OpenGL performance was hard, since you need to care about how you read memory so that accesses are aligned.

Java isn't the one that's "boss" in all this, it's GLSL, or in the case of that demo, HLSL (basically the same). Java is just the glue layer here, and should perform as well as nearly anything else. It's more like DX vs GL rather than anything vs Java.

That demo does use a compute shader, which only has a direct equivalent in OpenGL 4.3, but is otherwise morally equivalent to OpenCL (or perhaps a subset of it). The author does link to the source (I'll link it here too) so it would be interesting to see how much of it is directly portable.

My desktop doesn't have an OGL4 graphics card, so I can't test it on my computer. There's a big chance that compute shaders are faster, but they're still not as flexible as transform feedback. Sure, you might be able to squeeze out a few more particles, but it's extremely inefficient when you only have a few. He's probably getting better performance because his particles contain less information, most likely just 24 bytes vs my 36 bytes per particle. Besides, I just need a geometry shader to expand the points into quads, which is exactly what I did for that sprite engine. =S

I wouldn't say that OpenCL = compute shaders; compute shaders are much easier to use correctly (especially when it comes to handling memory). I should really make a version using OGL 4.3...

It is viable but hard to implement into a game from what I understand.

The big problem with things such as particle systems/physics is that they can be very computationally intensive. With particle systems, just having 100k particles means that if everything is done on the CPU, it has to calculate the position 100k times, calculate everything else the particles have 100k times, and still needs to send the updated particles to the GPU.

With OpenCL you can have the GPU do all the calculations, but you still need to send things to the GPU, which is where my particle system dies. theagentd's suggestion is very nice, as you get a prebuilt VBO that you can simply throw at the GPU, meaning the CPU does next to nothing. Also, it's usable on OpenGL 3 hardware, which is very nice. I'm now wondering if I should make a sprite batcher using this, since geometry shaders are core in OpenGL 3.2 anyway.

Transform feedback can only output 4-byte floats and ints, so instead of compressing stuff I just converted everything to floats, hence the inflated particle size.
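Given that 4-byte restriction, a 36-byte particle works out to exactly nine float components. The field names in this sketch are guesses for illustration; only the 36-byte total comes from the thread:

```java
public class ParticleStride {
    static final int FLOAT_BYTES = 4; // transform feedback captures 4-byte floats/ints

    // Hypothetical layout: position.xyz + velocity.xyz + life + maxLife + size
    static final int COMPONENTS = 9;
    static final int STRIDE = COMPONENTS * FLOAT_BYTES;

    public static void main(String[] args) {
        System.out.println(STRIDE + " bytes per particle"); // 36 bytes per particle
    }
}
```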

WTF IS UP WITH OPENCL AGAIN?! I am getting really tired of how sensitive OpenCL seems to be. On my laptop's GTX 460M, the exact same code performs the same as the GPU version (2.2 million particles). I think it's because the 400 series had extensive hardware changes compared to the 200 series, but I'm really not thrilled to start delving into that stuff again...

100k times isn't really that much when you have four cores each running 3 billion clock cycles per second. The problem is actually the insane amount of memory bandwidth needed. Just getting two RAM sticks and running them in dual channel gave me a 60% speed boost on a dual-core laptop compared to single channel. Most particles only need some basic math.

I wouldn't recommend a sprite batcher on the GPU. You'd need to run your whole game on the GPU to know HOW to move your sprites around.

Wouldn't that be awesome? Haha

Ah, I couldn't come up with a better name for maxLife. It's just how much life (= how many frames) the particle is supposed to last, while life is the amount of life left (= how many more frames it should last). I use them to calculate the alpha: alpha = life / maxLife. I do this on the CPU in the CPU version, hence the 4-byte RGBA color there. With life available on the GPU I only needed 3 bytes, but I padded it to 4 anyway to gain some performance.
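That fade in a few lines; in the engine this runs per particle in the shader, but plain Java shows the idea (the 120-frame lifetime is a made-up value):

```java
public class ParticleFade {
    public static void main(String[] args) {
        float maxLife = 120f; // total frames this particle lasts (hypothetical)
        float life = maxLife; // frames remaining, counted down each update

        for (int frame = 0; frame < 30; frame++) {
            life -= 1f;       // one frame of life spent
        }

        // Linear fade: fully opaque at spawn, fully transparent at death.
        float alpha = life / maxLife;
        System.out.println(alpha); // 0.75
    }
}
```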
