GPUs continue to get more powerful at an amazing rate. The performance boost is not evenly distributed though. While ALU power is shooting through the roof some other parts of the chip are lagging behind. Most notably lagging are the ROPs. A Radeon HD 4890 has 16 ROPs, which is the same as even an old X800XT. This means that the only performance boost is from higher clocks. So while the HD 4890 has roughly 15x raw ALU power (if my math is right) it only has 70% higher fillrate than the X800XT. This of course makes sense since we're no longer in a "more pixels" race, but it's all about "better pixels" these days. The monitor resolutions aren't going up very fast, in fact, we had a drop with the transistion from CRT to LCD and I suppose we're now roughly back to where we were before.

The relatively low ROP performance can be a problem though when dealing with things like particle systems which can fill large portions of the screen with a lot of overdraw and often only uses a trivial shader. To deal with this people have come up with ideas such as rendering to a smaller buffer and then merging the result with the main buffer, but there are also simpler solutions like simply using fewer and denser particles.

Particles are often rendered as simple quads. This is particularly wasteful of our limited ROPs. It may not seem all that bad, but a relatively normal particle texture can contain a surprisingly large amount of empty space. Take for instance this smoke puff I found in the darkest corners of my texture collection. It may look like less, but half of it is actually empty space, which means that we're rendering twice as many pixel as we have to.

So the first approach at dealing with this situation is to shrink the quad to the bounding box of the particle. It may looks like a small gain, but that's actually an over 30% save. However, most particles aren't shaped like aligned rectangles, so we can optimize this further. So this afternoon I wrote a tool that takes a particle image as input and generates an optimized polygon enclosing the particle. It can be downloaded from the Cool stuff section. How much difference does it make use an optimize polygon over the naive approach? Quite a lot actually. In fact, the triangle case above is nearly as efficient as the naive rectangle, whereas the optimized quad shaves off another 10 percentage points. There's of course a diminishing return from using more vertices, but depending on the shape of the particle, it could be worth going higher. This is the result from this particular smoke puff:

So how is the tool doing its magic? First the source image is thresholded so that you have a bitmap of zeros and ones. Then it loops through the pixels again and adds all points that are ones into a convex hull that's continuously updated as new points are added. In order to not throw tens of thousands of points at it I'm doing a basic check with neighboring pixels so I'm only adding potential corner pixels. If the number of neighbors that are set is 5 or more it's either in the middle of a straight edge or at a corner that's pointing inwards and thus is not added. This reduces it to hundreds of points to consider. After this pass we have a convex hull enclosing the particle. For this smoke puff it had 23 edges in the end, so it appears to not get overly complex in normal cases. Finally it goes over all permutations of the edges given a provided target vertex count, checks for each permutation that a valid fully enclosing polygon can be constructed with those edges, and for all valid ones finds the one that has the smallest area. As always, the source code is included.

I haven't tested this tool extensively yet, so it won't surprise me if there are bugs or if it's not 100% robust, but it seems to work very nicely and provides the resulting polygon as good as instantaneously with the textures I have tested. I also haven't proven that the result is actually the optimal solution, but if it's not, it is probably close to optimum in real world scenarios.

If anyone finds this tool useful and plans to use it, or have any feedback, just let me know. Name

Comment

Enter the code below

KeldorMonday, June 1, 2009

What about the cost of pushing the extra vertices down the PCIe bus? That's gained in performance even slower than the ROPs... Unless you're generating them in a geometry shader, of course.HumusTuesday, June 2, 2009

That cost is relatively small, unless you're shooting for tens of thousands of particles. But then again, you don't have to increase PCIe traffic at all, even if you don't use the geometry shader. In my implementation I store the offset coordinates in static constant buffers. The vertex and index buffers are larger though, but that's static GPU side data too. The per particle data I need to upload is unchanged.More pages:1 2