Did it ever die? There are specific use cases for which each type of renderer is better suited than the other. What's most interesting about this is the use of a CS for work that would more traditionally be done CPU-side, but I wonder how well that would balance out in a real in-game scene. Thanks for the heads-up anyway, I'll be checking this out in more detail later on.

The short answer is, it doesn't, not from a development point of view. Tile-based light culling was originally developed for deferred shading, which is only getting better. And with the next generation of consoles no doubt having plenty of RAM and memory bandwidth, there's no reason (at the moment) to waste time rendering the geometry twice when you can just shove everything you need into G-Buffers and render it once.

In fact deferred lighting is probably going the same way for the same reason: both forward rendering and deferred lighting were only ever used to minimize memory and bandwidth use, a precious commodity on the 360 and PS3. But with even mobile devices pushing past them now, what's the point? Might as well use that available RAM to double your polycount, throw in a ton of shadow-mapped lights, or something else of the kind.

As for AA, there was a recent paper (very recent) on doing MSAA with deferred shading at minimal cost, similar to the cost of forward rendering. And of course you can use temporal and/or morphological AA as well. I'm sure you could forward-render the transparent stuff using the same tile-based scheme while going deferred for everything else, but deferred shading definitely seems to me to be the way to go.

Let's forget for a moment that the G-Buffer is typically bigger. NV40 could render about 60 point lights per pass; I have difficulty understanding the need to render twice.

The only major advantage of deferred tech is that it decouples light processing from material rendering, but this comes at a considerable cost: no one I've talked to understood the need to write shaders that scatter their outputs across different buffers... and don't even get me started on packing.

So, in my opinion, the advantages are still unproven. I suppose UE3 and Samaritan make this clear. Flexible Forward can emulate Deferred at a reduced cost... not vice versa.

So I've downloaded the full demo and run it a few times. I deliberately chose a fairly low-specced machine to see how viable the technique is on the kind of hardware that would be considered commodity today, and the short answer is - it's not.

Reminds me of the time I first got a 3DFX and - naturally - immediately grabbed GLQuake to test it out. Of course I neglected to pop in the 3DFX opengl32.dll file so I ended up drawing through the Windows software implementation at 1fps.

Obviously AMD feel that they've got something special with their new kit, and they want to show off its capabilities to best effect by taking a sub-optimal technique and making it realtime. More power to them, I wish them well with it. Maybe in two years' time, when this level of hardware is the commonplace average, this might be an approach worth considering, but for now there seem to be better things to burn your GPU budget on.

"Specifically, this demo uses DirectCompute [...] per-pixel or per-tile list of lights that forward-render based shaders use for lighting each pixel."

Now call me a cynic, but is this not pretty much exactly what Damian Trebilco did 5 years ago, on hardware three generations older, using nothing but the CPU and a stencil buffer trick?

Admittedly, Damian's demo with that horse model inside a room was not quite as artistic. The ATI demo sure is kind of funny, with a nice story and well-done animations, and it looks quite good, but honestly I couldn't say it really looks a class better than a thousand other demos (in fact, all of the characters are quite "plastic-like", though of course the ATI guys will claim that this is intentional). Unlike those thousand other demos, however, it requires the latest, fastest hardware to run...

I've done some simple tests (Sponza with 128 unshadowed point lights) and it's definitely feasible. On a GTX 570 at 1920x1080 I get 1.1ms for filling the G-Buffer and 3.5ms for the lighting with a tiled deferred approach, while with an indexed deferred approach I get 0.75ms for computing the lights per-tile and 4.5ms for forward rendering (both using a z-only prepass). So at 4.6ms vs. 5.25ms it's not too far off in terms of performance. But of course you really need to know how well it scales.

You'd also want to compare how well a deferred renderer scales with lots of G-Buffer terms, especially with dense geometry. Unfortunately I don't have the time at the moment to thoroughly evaluate all of those things, but it does at least seem like a viable alternative to traditional deferred rendering. But I'm not sure if it would ever beat a good tiled deferred implementation outright. It would definitely make certain things a lot easier, since you wouldn't have to worry about packing things into a G-Buffer or special-case handling of different material types in the lighting shader.
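To make the "no packing" point concrete, the shading side of an indexed/tiled forward renderer ends up looking roughly like the sketch below (my own CUDA-flavoured illustration rather than anyone's real HLSL; every struct and name here is assumed). The material inputs stay live in registers and each pixel just walks its tile's light list, with no G-Buffer encode/decode in between.

// Hedged sketch of the per-pixel light loop in a light indexed / tiled
// forward renderer. All names are illustrative, not from the demo.
struct Light    { float3 pos; float radius; float3 color; };
struct Material { float3 albedo; float3 normal; };

__device__ float3 shadeSurface(Material m, float3 worldPos, int tile,
                               const int* tileLightIndices, const int* tileLightCounts,
                               const Light* lights, int maxLightsPerTile)
{
    float3 result = make_float3(0.0f, 0.0f, 0.0f);
    const int count = tileLightCounts[tile];
    for (int i = 0; i < count; ++i)
    {
        const Light l = lights[tileLightIndices[tile * maxLightsPerTile + i]];
        const float dx = l.pos.x - worldPos.x;
        const float dy = l.pos.y - worldPos.y;
        const float dz = l.pos.z - worldPos.z;
        const float dist2 = dx * dx + dy * dy + dz * dz;
        if (dist2 > l.radius * l.radius) continue;   // outside the light's range
        const float dist  = sqrtf(dist2);
        const float ndotl = fmaxf(0.0f, (m.normal.x * dx + m.normal.y * dy + m.normal.z * dz)
                                        / fmaxf(dist, 1e-6f));
        const float atten = 1.0f - dist / l.radius;  // simple linear falloff
        // Any material model can be evaluated right here, per light, with the
        // original inputs still in registers; nothing had to fit in a G-Buffer.
        result.x += m.albedo.x * l.color.x * ndotl * atten;
        result.y += m.albedo.y * l.color.y * ndotl * atten;
        result.z += m.albedo.z * l.color.z * ndotl * atten;
    }
    return result;
}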

@mhagain: there's a lot more going on in that demo than just indexed deferred rendering. For instance they use PTex, and a VPL-based GI solution.

@samoth: it's a variant of light indexed deferred, and they even say as much in their presentation. The technique just becomes a lot more practical when you can generate the light list per-tile in a compute shader, rather than having to do all of the nasty hacks required by the original implementation.
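For anyone who hasn't seen the technique, the per-tile list generation looks roughly like the sketch below, written in CUDA style rather than the demo's DirectCompute (the 16x16 tile size, the 256-light cap, and all names are my own assumptions). It assumes a conservative screen-space rect per light and per-tile min/max depth have already been computed, as discussed further down the thread.

// Hedged sketch of per-tile light list building, one thread block per
// 16x16 screen tile. Names and layout are illustrative only.
#define TILE_SIZE           16
#define MAX_LIGHTS_PER_TILE 256

__global__ void buildTileLightLists(const int4*   lightRects,       // per-light pixel rect (minX, minY, maxX, maxY)
                                    const float*  lightViewZ,       // view-space depth of each light's centre
                                    const float*  lightRadius,
                                    int           numLights,
                                    const float2* tileDepthBounds,  // per-tile (minZ, maxZ) from the depth prepass
                                    int           tilesX,
                                    int*          tileLightIndices, // MAX_LIGHTS_PER_TILE slots per tile
                                    int*          tileLightCounts)
{
    __shared__ int s_count;
    __shared__ int s_indices[MAX_LIGHTS_PER_TILE];

    const int tile = blockIdx.y * tilesX + blockIdx.x;
    const int tid  = threadIdx.y * TILE_SIZE + threadIdx.x;
    if (tid == 0) s_count = 0;
    __syncthreads();

    const float2 depth    = tileDepthBounds[tile];
    const int    tileMinX = blockIdx.x * TILE_SIZE, tileMaxX = tileMinX + TILE_SIZE - 1;
    const int    tileMinY = blockIdx.y * TILE_SIZE, tileMaxY = tileMinY + TILE_SIZE - 1;

    // Threads of the block cooperatively test a strided subset of the lights.
    for (int i = tid; i < numLights; i += TILE_SIZE * TILE_SIZE)
    {
        const int4 r = lightRects[i];
        const bool overlaps =
            r.x <= tileMaxX && r.z >= tileMinX &&            // screen rect vs tile rect
            r.y <= tileMaxY && r.w >= tileMinY &&
            lightViewZ[i] + lightRadius[i] >= depth.x &&     // light depth range vs tile depth bounds
            lightViewZ[i] - lightRadius[i] <= depth.y;
        if (overlaps)
        {
            const int slot = atomicAdd(&s_count, 1);         // append to the tile's list
            if (slot < MAX_LIGHTS_PER_TILE) s_indices[slot] = i;
        }
    }
    __syncthreads();

    // Flush the list to memory for the forward shading pass to index.
    const int count = min(s_count, MAX_LIGHTS_PER_TILE);
    if (tid == 0) tileLightCounts[tile] = count;
    for (int i = tid; i < count; i += TILE_SIZE * TILE_SIZE)
        tileLightIndices[tile * MAX_LIGHTS_PER_TILE + i] = s_indices[i];
}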

The bounced lights are the most interesting thing in it to me; my obsession with lighting is not shadows but brightness, and that one tickles my fancy.

It would seem to me that there is promise in a CS-powered light indexed deferred renderer. Even on AMD hardware the performance isn't bad, and apparently on nV it's better. Especially if you want MSAA and weren't planning on using thousands of lights.

Thanks for posting those phantom! I've noticed that the queries I use for profiling can get messed up a bit when VSYNC is enabled. For the numbers I gathered, I just used the total frame time with VSYNC disabled.

I also realized that it was pretty dumb and lazy of me to leave the number of lights hard-coded, so I uploaded a new version that lets you switch the number of lights at runtime.

Even on AMD hardware the performance isn't bad, and apparently on nV it's better. Especially if you want MSAA and weren't planning on using thousands of lights.

Well, the performance delta is better on the 680 vs tiled deferred, but I suspect that is down to a reduction in memory bandwidth (the 680 runs slightly higher clocked but has 2/3 the bus width of the 7970); overall the 7970 seems to like it better render-time wise (which is interesting, as most game benchmarks have the 680 winning across the board).

Anyway, I'm going to try out the new version and report back in a bit regarding the various light counts.

It would seem that without AA the Tiled Deferred has the edge, but as soon as you throw AA into the mix things swing towards the Index Deferred method (1024, 2xAA being the notable exception to that rule).

TD shows the normal deferred characteristic of stable G-Buffer pass times, but the tile lighting phase begins to get very expensive for it.
By contrast, ID has a pretty constant lighting phase, but the forward render phase shows the same kind of increase as TD's lighting phase.

I am noticing no change at all when enabling or disabling the Z prepass when in light indexed mode. I would have thought it would have a large impact on forward shading time.

The Z prepass is always enabled for light indexed mode, because a depth buffer is necessary to figure out the list of lights that intersect each tile. If you didn't do this you could build a list of lights just using the near/far planes of the camera, but I would suspect that the larger light lists + lack of good early z cull would cause performance to go right down the drain.

Thanks for the updated numbers phantom! It definitely seems as though light indexed removes a lot of bandwidth dependence, which is pretty cool. Ethatron posted some numbers on my blog where he showed that performance scaled pretty directly with shader core clock speed. I would suspect that overall light indexed would scale down pretty well to lower-class hardware with lower bandwidth ratings.

...If you didn't do this you could build a list of lights just using the near/far planes of the camera, but I would suspect that the larger light lists + lack of good early z cull would cause performance to go right down the drain.

I did look at that in my paper 'Tiled Shading' that someone posted a link to above. And the short answer is that no indeed, it does not end too well.

On the other hand, I imagine it would still be a useful technique simply for managing lights in an environment without too many lights in any given location and with limited views (e.g. an RTS camera or so), since the limited depth span makes the depth range optimization less effective anyway.

I've got an OpenGL demo too, which builds the grids entirely on the CPU (so it's not very high performance, it's just there to demo the techniques).

Btw, one thing I noticed that may affect your results is that you use atomics to reduce the min/max depth. Shared memory atomics on NVIDIA hardware serialize on conflicts, so using them to perform a reduction this way is less efficient than just using a single thread in the CTA to do the work (at least then you don't have to run the conflict detection steps involved). So this step gets a lot faster with a SIMD parallel reduction, which is fairly straightforward. I don't have time to dig out a good link, sorry, so I'll just post a CUDA variant I've got handy, for 32 threads (a warp), but it scales up with appropriate barrier syncs. sdata is a pointer to a 32-element shared memory buffer (is that local memory in compute shader lingo? Anyway, the on-chip variety).
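A minimal sketch of that kind of 32-thread min reduction (reconstructed here for illustration; the function name and fminf usage are my assumptions, not the original snippet):

// 32-thread shared-memory min reduction. sdata points to a 32-element
// shared ("groupshared" in compute shader lingo) buffer already holding
// each lane's depth value. The classic warp-synchronous version relied on
// implicit lockstep with a volatile pointer; on current CUDA, __syncwarp()
// is needed between steps. A max reduction is identical with fmaxf.
__device__ float warpReduceMin(volatile float* sdata, int lane)
{
    for (int offset = 16; offset > 0; offset >>= 1)
    {
        if (lane < offset)
            sdata[lane] = fminf(sdata[lane], sdata[lane + offset]);
        __syncwarp();            // keep the warp in lockstep on post-Volta parts
    }
    return sdata[0];             // every lane can now read the reduced minimum
}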

The same goes for the list building, where a prefix sum could be used. Here it'd depend on the rate of collisions. Anyway, I'm thinking this might be a difference between NVIDIA and AMD (where I don't have a clue how atomics are implemented).
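For illustration, a prefix-sum style append using warp ballots could look roughly like this (my own sketch, all names assumed; it has to be called by all 32 lanes of a warp together, with no divergence around the call):

// Each lane votes on whether its light overlaps the tile, a ballot gives all
// lanes the vote mask, and a popcount of the lower lanes gives each lane its
// write offset; only one atomicAdd per warp reserves space in the list.
__device__ void appendLightWarp(bool overlaps, int lightIndex,
                                int* s_count, int* s_indices, int maxPerTile)
{
    const int      lane  = (threadIdx.y * blockDim.x + threadIdx.x) & 31; // hardware lane id
    const unsigned votes = __ballot_sync(0xffffffffu, overlaps);          // who wants to append
    const int      rank  = __popc(votes & ((1u << lane) - 1u));           // appending lanes before me
    const int      total = __popc(votes);

    int base = 0;
    if (lane == 0 && total > 0)
        base = atomicAdd(s_count, total);          // one reservation for the whole warp
    base = __shfl_sync(0xffffffffu, base, 0);      // broadcast the base offset

    if (overlaps && base + rank < maxPerTile)
        s_indices[base + rank] = lightIndex;
}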

As a side note, it's much more efficient to work out the screen-space bounds of each light before running the per-tile checks; it saves constructing identical planes for tens of tiles, etc.
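A rough sketch of precomputing those screen-space bounds once per light (again my own illustrative code; it's conservative rather than exact, ignores the usual NDC-to-pixel y flip, and lights crossing the near plane would need more careful handling):

// One thread per light: compute a conservative pixel rect for the light's
// bounding sphere so the per-tile pass only needs rect-vs-rect tests.
// View space is assumed to have +z pointing away from the camera;
// projX/projY are the diagonal scale terms of the projection matrix.
__global__ void computeLightScreenRects(const float3* lightViewPos, const float* lightRadius,
                                        int numLights, float projX, float projY, float zNear,
                                        int width, int height, int4* rects)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numLights) return;

    const float3 c = lightViewPos[i];
    const float  r = lightRadius[i];

    if (c.z + r <= zNear) {                           // entirely behind the near plane
        rects[i] = make_int4(width, height, -1, -1);  // empty rect: overlaps nothing
        return;
    }
    const float zn = fmaxf(c.z - r, zNear);           // closest depth the sphere reaches
    const float zf = c.z + r;                         // farthest depth

    // For each bound, divide by whichever depth pushes it furthest outwards,
    // so the rect always contains the sphere's true projection.
    const float minX = projX * ((c.x - r) < 0.0f ? (c.x - r) / zn : (c.x - r) / zf);
    const float maxX = projX * ((c.x + r) > 0.0f ? (c.x + r) / zn : (c.x + r) / zf);
    const float minY = projY * ((c.y - r) < 0.0f ? (c.y - r) / zn : (c.y - r) / zf);
    const float maxY = projY * ((c.y + r) > 0.0f ? (c.y + r) / zn : (c.y + r) / zf);

    // NDC [-1, 1] to clamped pixel coordinates.
    rects[i] = make_int4(max(0,          (int)floorf((minX * 0.5f + 0.5f) * width)),
                         max(0,          (int)floorf((minY * 0.5f + 0.5f) * height)),
                         min(width  - 1, (int)ceilf ((maxX * 0.5f + 0.5f) * width)),
                         min(height - 1, (int)ceilf ((maxY * 0.5f + 0.5f) * height)));
}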

Anyway, fun to see some activity on this topic! And I'm surprised at the good results for tiled forward.