Yikes, reading/writing ray state to global memory every bounce sounds scary, but GPUs never cease to surprise me. I suppose you're also in a good position now for doing complex materials, as long as you're sorting for coherency like the megakernel paper does. Good work, keep us updated.

Yes, I was a bit surprised to get it this fast. There is a cost to all the I/O, but it's quite low; for any decent number of triangles, the gains from the improved occupancy outweigh it. I noticed before that the wavefront paper is not optimistic enough about this: even a very basic shader benefits from wavefronts, not just complex shaders.

By the way, a big factor was an SoA data layout: e.g. all ray origins are stored consecutively, and every read is a float4 read. That way a single warp-wide read becomes a 32*float4 = 512-byte consecutive read, which is optimal. Since this works so well, I made the target buffer SoA as well, so writing the red components also means writing 32*int = 128 bytes of consecutive memory. That saved another 5%. For light vertices this didn't work, because they are read in random order after the first bounce.

Also important: the wavefront approach requires keeping track of counters (e.g. the number of remaining extension/connection rays). I managed to keep all the counters on the device by using persistent kernels, so nothing ever gets copied to the CPU. Without this, the GTX980Ti suffers from very low GPU utilization (~45%) compared to the mobile Quadro 4100, which suggests it would be even worse on a 1080. Now that kernels execute back-to-back, GPU utilization is near-optimal.

By the way, a quick question for the CUDA gurus: I use persistent kernels where the number of threads is simply the number of SMs times the block size (128 in my case), and each block 'fights for food' until the work runs out (as in Aila & Laine, "Understanding the Efficiency..."). The strange thing is that this is more efficient if I start 4 or 8 times as many blocks as there are SMs. I can't figure out the reason for this behavior. Any ideas?

An SM can keep more than one block 'resident' at a time, as long as resources (register usage, shared memory, etc.) allow for it, and the warp scheduler can pull a warp from any of these resident blocks at a given time. So it may be that your kernel permits more than one block to be resident on an SM at a time, and using more blocks per SM then allows greater latency hiding, since the warp scheduler has more blocks to pull from.

I was actually kind of curious about persistent kernels, so I added them to my pathtracer and got about a 10% slowdown. I also noticed the same behavior, where using about 4 blocks per SM was optimal.

I think with modern cards it may just be best to throw everything at the GPU and let the hardware scheduler do its thing. I don't know how effective persistent threads are these days, and they also make it more difficult to run the same code across different cards with similar efficiency.

Did you see a performance boost with your implementation of persistent threads?

I didn't implement them for a performance boost, but to be able to keep counters on the GPU. If you produce N shadow rays in one kernel (shading), you either spawn N threads for the next kernel (tracing shadows), or you run SMCount*128*4 persistent threads executing the shadow tracing code. Spawning N threads requires that the host knows N; syncing this info was my primary bottleneck.

Just curious, when do you decide to present a frame? Are you trying to pull a full 8-bounce sample off per pixel before you present, or presenting at every bounce? I always struggle with this because frame display has a bit of overhead, but for interactivity you want to present quite often.

I present a frame after doing a full 8-bounce sample for every pixel (although most paths will be shorter due to Russian Roulette; the cap of max length 8 is for individual light paths, individual eye paths and combined paths, as in SmallVCM). Doing this in a reasonable amount of time is currently not a problem. I just switched to full scenes, for which I get ~150Mrays/s on the GTX980Ti. This yields real-time frame rates for a single BDPT sample per pixel (and even for multiple BDPT samples per pixel).

Presenting every bounce yields biased results for most frames; I'm not sure this would look good. Of course a cap on the path length (especially at a low value like 8) also introduces bias, but I suspect it is far less noticeable.

Due to Russian Roulette, the performance impact of longer paths is minimal, by the way; the only problem is that the buffers get very large.

Since the (compacted) buffers are mostly empty for the deeper bounces, it may be possible to allocate for depth = 8 and bounce to depth = 64, with some kind of a safety cap in case the RNG decides that every path should reach 64 for a particular frame. Theoretically, this situation has a non-zero probability; practically this should never happen of course.

Also note that when using CUDA/OpenGL interop the pixel data never leaves the device. Overhead of presenting results is very small that way.

You mentioned that you use surface-local space to compute your microfacet BRDF. Here's an optimization for converting between the two spaces: this paper describes a faster way to calculate the basis from your surface normal.