All layered techniques capture only 8 significant layers of transparency for a fair performance comparison on an Nvidia 1060 Mobile with 6GB of VRAM.

I will share details, perks and downsides of all techniques. The source code is on the GitHub page for IrrlichtBAW (example 16).

Methods 2,3,4 -- OpenGL 3.3

The Stencil Routed Family

Here we basically use the stencil buffer to route up to K pixels, in the order they arrive, to K different outputs. This necessitates the use of Multisample Textures and/or Layered Rendering to get a different stencil value per sample. Multiple Render Targets do not solve the issue, as they add extra channels, not samples, to the same pixel (i.e. an 8 attachment MRT still has only one depth and stencil buffer).

Essentially you initialize the stencil values of a given pixel's K samples to [S+0,S+1,...,S+K-1], and for drawing you always use either decrement or decrement_wrap for all stencil operations (stencil fail, depth fail, and pass), coupled with the Equal stencil test function with a reference value of S.

This means that for every sample the stencil counter gets decreased while only one sample gets written.
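To make the routing concrete, here is a minimal CPU-side sketch of a single pixel, not the actual GL code from the example. The function name and the list standing in for the per-sample stencil buffer are made up for illustration, and it assumes the wrapping decrement on an 8 bit stencil.

```python
def route_fragments(num_fragments, k, s=0):
    """Return, for each fragment in draw order, the sample index it lands in
    (None if no sample passes the stencil test)."""
    stencil = [s + i for i in range(k)]      # per-sample init: S+0 .. S+K-1
    routed = []
    for _ in range(num_fragments):
        # Stencil test: EQUAL against reference S, evaluated per sample.
        hits = [i for i, v in enumerate(stencil) if v == s]
        routed.append(hits[0] if hits else None)
        # All three stencil ops decrement, so every sample's counter drops
        # by one whether or not that sample was written.
        stencil = [(v - 1) % 256 for v in stencil]   # 8 bit wrapping decrement
    return routed


print(route_fragments(6, 4))   # fragments 0..3 get a sample each, the rest get none
```

In this toy model fragment n lands in sample n for as long as n < K, which is the "route in arrival order" behaviour described above.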

Different cards have different MSAA sample limits: 8, 16 or 32. So when you want to store more samples, you need to create an MSAA TextureArray and use layered rendering with a geometry shader writing gl_Layer, like in the cubemap shadow example. However this means the GPU has to rasterize additional copies of each triangle, one for each layer you want to output.

This is why an 8 sample MSAA texture is preferable to an 8 layer, 1 sample MSAA texture array in your setup. The other definite reason why it is preferable is hardware compression of the MSAA buffer and the DepthStencil buffer.

The main downside of this technique is that we need to render to MSAA textures, which are slower than renderbuffers; just doing a clear pass without rendering any triangles onto the screen, plus a 'discard' on the first line of the resolve shader, drops the performance down to 1200 FPS. However the resolve shader for all 3 methods is extremely fast, taking 81us to sort up to 8 samples in a typical scenario.

The Definite Downside of All The K-Buffer Methods is Unsuitability For Particle Rendering

This is because usually you have lots and lots of overdraw for a good particle system.

This technique captures the first K fragments that arrive, as it uses the saturating decrement stencil op; it also uses a start value of S=2. This is useful as you can draw an occlusion query with a fullscreen quad + a multisample mask selecting only the last sample + a stencil test of equal to 0, to get a count of pixels which overflowed. Combined with conditional (predicated) rendering you could automagically decide whether to do more passes.

It has the obvious downside that it captures the first K fragments that arrive, so if your scene has a high depth complexity you still need to sort it! It can also be quite unstable (pixels popping). So you kinda need to capture all layers for the technique to be usable (which eats GPU memory).
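The overflow check can be reasoned about with a small single-pixel model. This is only a sketch under assumptions: the helper names are hypothetical, the stencil is modelled as a plain list, and the decrement clamps at 0 as the saturating op does.

```python
def last_sample_stencil(num_fragments, k, s=2):
    """Stencil value left in the last of K samples after drawing the fragments."""
    stencil = [s + i for i in range(k)]              # init: S+0 .. S+K-1
    for _ in range(num_fragments):
        stencil = [max(v - 1, 0) for v in stencil]   # saturating decrement
    return stencil[-1]


def overflowed(num_fragments, k, s=2):
    """What the 'last sample, stencil equal to 0' query would report."""
    return last_sample_stencil(num_fragments, k, s) == 0


for n in (7, 8, 9):
    print(n, "fragments, K=8 ->", "overflow" if overflowed(n, 8) else "fits")
```

In this model, with S=2 the last sample only reaches 0 once more than K fragments have hit the pixel, while with S=1 a pixel that received exactly K fragments would already read 0 and be counted as overflowed.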

Downsides:
-- Kills EarlyZStencil on most GPUs as the fail stencil ops are not GL_KEEP
-- Capturing more fragments requires more memory (more MSAA textures); the paper mentions an insertion sort as a way to keep memory use down, but it's infeasible anyway
-- Captured fragments may not be important (high alpha)
-- Even with 32 sample 4+ layer MSAA texture arrays, we are limited to 254 layers as the stencil buffer is always 8 bit

Limitations:
-- Gets the first K pixels drawn
[Screenshot of this being a problem]
-- If a pixel fails the depth test it doesn't get written to the K-Buffer, but it still decrements our stencil counters, so it locks other pixels out of that slot (depth-failed pixels count against the K-limit)

Optimizations
1) Check the depth coordinate!=far to see whether you have to load the next sample for the resolve. (DONE)
2) Different sorting networks for different counts of actually captured elements (technique 2 and 3 resolve shader is shared and has this)
3) Initialize the ZBuffer of the MSAA texture with a full-screen quad outputting the depth value of the opaque scene pixel to all samples + enable Z Testing on transparents (would have to play around with the stencil func to not have depth-fail pixels count against the K fragment limit)
4) Can enable Early-fragment tests to force early killing of any pixel after the first K pixels (still slower than technique nr 2)
5) Actually use the ZBuffer to store depth instead of packing it together with the color into a 64bit 2 channel texture (Bavoil did this, but my example uses depth_stencil texture access).
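As an illustration of optimization 2, picking a fixed sorting network by the number of captured fragments could look roughly like this. It is a sketch, not the shared resolve shader: the table and function name are hypothetical, and only networks for up to 4 elements are listed to keep it short.

```python
# Compare-exchange pairs per element count (small, well-known networks).
NETWORKS = {
    0: [],
    1: [],
    2: [(0, 1)],
    3: [(0, 1), (1, 2), (0, 1)],
    4: [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)],
}


def sort_captured(depths):
    """Sort the captured depths using the fixed network for their count."""
    out = list(depths)
    for a, b in NETWORKS[len(out)]:
        if out[a] > out[b]:                  # compare-exchange
            out[a], out[b] = out[b], out[a]
    return out


print(sort_captured([0.7, 0.2, 0.9]))
```

The point of the per-count table is that a pixel holding 2 fragments pays for 1 comparison instead of running the full K element network.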

Technique 3 - Stencil Routed KBuffer with special DevSH sauce

Transparent Material:
-- No Blending
-- Enable ZWrite
-- Enable ZTest

Here I capture all fragments but only keep the closest pixel from each of the K sets of every Kth pixel. It's not the true first K pixels like in depth peeling, but if our rasterized pixels are F0,F1,F2, ... ,FN then samples S0, ... , S(K-1) hold the closest pixels from the disjoint sets, like this:

S(i) = closest({F(i),F(K+i),F(2K+i),...})

Your geometry would have to be really beyond repair (every 8th triangle would have to be in the same depth range neighbourhood) to mess up the results, and still the worst you'd get is the fragment closest to the camera followed by the (N/K)th fragment, the (2N/K)th, etc. Most of the time you could expect more or less the K closest. This is stable and does not benefit from sorting the scene (hence I don't suggest it), as opposed to the Bavoil method.

I initialize the stencil buffer samples to (0,1, ... ,K-1), which means the start value is S=0, then set the stencil operations to DECR_WRAP and the stencil function's comparison mask to K-1, which requires K to be a power of two.
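A single-pixel CPU model of this setup, as a sketch only: the function name is hypothetical, the stencil and depth buffers are plain lists, and for simplicity it assumes "smaller depth = closer" with a Less depth test.

```python
def capture_every_kth(depths, k):
    """depths: fragment depths in draw order (smaller = closer, an assumption).
    Returns the depth each of the K samples ends up holding."""
    assert k > 0 and k & (k - 1) == 0, "K must be a power of two"
    mask = k - 1
    stencil = list(range(k))          # per-sample init: 0 .. K-1
    kept = [None] * k                 # None = sample never written
    for d in depths:
        for i in range(k):
            # Stencil test: EQUAL, reference 0, comparison mask K-1.
            if (stencil[i] & mask) == 0:
                # Depth test with ZWrite: keep only the closest fragment.
                if kept[i] is None or d < kept[i]:
                    kept[i] = d
            # DECR_WRAP on stencil fail, depth fail and pass alike.
            stencil[i] = (stencil[i] - 1) % 256
    return kept


frags = [0.9, 0.1, 0.5, 0.3, 0.2, 0.8, 0.4, 0.7]
print(capture_every_kth(frags, 4))
```

Because 256 is a multiple of any power-of-two K, the 8 bit wrap never disturbs the masked counter, so in this model sample i always sees fragments i, K+i, 2K+i and so on, matching the S(i) formula.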

Downsides:
-- Kills EarlyZStencil on most GPUs as the fail stencil ops are not GL_KEEP, and there is depth testing involved with stencil masks

Limitations:
-- Every transparent pixel that does not fail the Z-test gets rasterized and shaded
-- No overflow detection

Optimizations
1) Check the depth coordinate!=far to see whether you have to load the next sample for the resolve. (DONE)
2) Different sorting networks for different counts of actually captured elements (DONE)
3) Initialize the ZBuffer of the MSAA texture with a full-screen quad outputting the depth value of the opaque scene pixel to all samples when initialising (could also do the stencil at once with AMD_shader_stencil_export)
4) Can enable Early-fragment tests to try to win performance from the actually used Z-Buffer (DONE - no help)
5) Use the left-over stencil to prevent resolving empty pixels with no transparency layers (using 'discard' because I'm blending the transparency buffer in)

Technique 3 - 750 FPS down from 2000

Technique 4 - Stencil Routed K-Buffer Minimal Transparency (FAIL)

Transparent Material:
-- No Blending
-- Enable ZWrite
-- Enable ZTest

Everything here works exactly the same as technique 3, stencils etc.

The difference is that I swap the output Alpha Channel with the most significant bits of gl_FragDepth, so that each subsample keeps track of the most opaque (I use a reverseZ buffer so I use the Greater comparison) instead of the closest fragment. This would have the theoretical advantage of giving higher retention priority to more opaque pixels, but in case of having identical opacity pixels (especially important in the case of alpha=0.0) the closest one would win out. It was done in hope of getting the stencil K-routed techniques to work with thick particle systems.

However this method exhibits similar problems as method 2.

Postmortem:

Small deviations in alpha cause overwrites, so in a row of 22 with 20 fragments of alpha=0.01, a fragment with depth=far but alpha=0.8 will overwrite a pixel with alpha=0.79 but with depth ~= near. In effect 7 captured pixels with alpha=0.01 would only composite together to give a single pixel of alpha=0.068, while the near pixel with alpha=0.79 would get overwritten by the far pixel, which would then poke through. In general opaque or close-to-opaque pixels have a tendency to poke out. This does not improve by using fewer bits (bins) of alpha in gl_FragDepth.
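The combined opacity of the seven faint layers is just one minus the product of their transmittances. A quick sketch of that arithmetic (the function name is made up for illustration):

```python
def combined_alpha(alphas):
    """Opacity of a stack of layers composited with the usual 'over' blending."""
    transmittance = 1.0
    for a in alphas:
        transmittance *= 1.0 - a      # light that still gets through
    return 1.0 - transmittance


print(round(combined_alpha([0.01] * 7), 4))   # seven layers of alpha=0.01
```

That is 1 - 0.99^7, a little under 0.07, so seven slots spent on alpha=0.01 fragments contribute far less than the single alpha=0.79 fragment that got evicted.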

Secondly, GPUs flush denormalized floats to 0.0, at least for the depth. Since the alpha is mostly stuffed in the exponent, this causes all samples with alpha in the last bin (less than 1/255, 1/127, 1/31, or 1/7) to round to

gl_FragDepth = (alpha<<(30-alphaBitsUsedForBinning))|(gl_FragDepthOriginalBits>>alphaBitsUsedForBinning) = 0.0

which makes all near-transparent samples disappear (be as if unwritten), hence the holes in the dwarf's shoes.
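The bit layout can be poked at on the CPU. This is a sketch of the packing formula above, with assumptions: the helper names are hypothetical, alpha is quantised as floor(alpha * (2^bits - 1)) (the shader may bin differently), and depth is a 32 bit float in [0,1].

```python
import struct


def float_bits(x):
    """Raw IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def pack_alpha_depth(alpha, depth, alpha_bits):
    """(alpha << (30 - bits)) | (depthBits >> bits), as in the formula above."""
    alpha_bin = int(alpha * ((1 << alpha_bits) - 1))   # assumed quantisation
    return (alpha_bin << (30 - alpha_bits)) | (float_bits(depth) >> alpha_bits)


def is_denormal_or_zero(bits):
    """True when the 8 bit exponent field is all zeros."""
    return ((bits >> 23) & 0xFF) == 0


faint = pack_alpha_depth(0.005, 0.5, 7)    # alpha falls in bin 0
solid = pack_alpha_depth(0.5, 0.5, 7)
print(hex(faint), is_denormal_or_zero(faint))
print(hex(solid), is_denormal_or_zero(solid))
```

With 7 alpha bits the alpha bin fills exponent bits 23..29, so a bin of 0 leaves the whole exponent field empty; hardware that flushes denormals would then see that packed value as 0.0, whatever depth bits remain in the mantissa.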

Lastly, the 3 closest layers of alpha=0.5 are more visible (contribute more to the final colour) than any layer further on, even one with full opacity (alpha=1), hence the depth dependency of blending.
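The depth dependency falls straight out of front-to-back compositing weights. A minimal sketch (hypothetical function name, standard 'over' blending assumed):

```python
def layer_weights(alphas):
    """alphas: layer opacities ordered front to back.
    Returns each layer's weight in the final colour."""
    weights = []
    transmittance = 1.0
    for a in alphas:
        weights.append(transmittance * a)   # what this layer adds
        transmittance *= 1.0 - a            # what is left for layers behind
    return weights


print(layer_weights([0.5, 0.5, 0.5, 1.0]))
```

The three front layers weigh 0.5, 0.25 and 0.125, together 0.875 of the final colour, which leaves at most 0.125 for anything behind them, even a fully opaque layer.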

Optimizations/Solutions
1) Stuff fewer alpha bits in gl_FragDepth (Tried 7,5,3)
2) Interleave/inject some significant bits of the alpha value into gl_FragDepth

Last edited by devsh on Mon Feb 12, 2018 6:56 pm, edited 14 times in total.

I had a long think about how to leverage the existing ZBuffer from the opaque pass in order not to waste our K slots for the transparent fragments which are behind opaque geometry.

In essence there are only 2 ways to do it:
1) Blit the non-MSAA opaque depth to the MSAA K-Buffer Depth attachment instead of clearing it, then in the resolve shader you need to read all the K depth samples, sort all K of them, and then you get a list of the color sample ids that were written to (due to the depth test some transparent writes are cancelled, and written samples are no longer consecutive).
2) Modify every transparent material shader to read from the opaque non-MSAA depth texture and perform its own depth test, then you can keep the standard resolve that is on github now.

Downsides

Solution 1
-- Writes all K samples to the ZBuffer every frame, independent of how many transparent pixels are on the screen.
-- Definitely lose EarlyZ on the transparent pass due to the above.
-- Can't skip reading depth samples on resolve
-- Can't skip the whole resolve shader for a pixel if the first depth is equal to the far value (meaning no samples written)

Solution 2
-- Have to modify every transparent shader
-- For every drawn pixel need to fetch the non-MSAA depth buffer

So the only thing that we need to keep in mind is the bandwidth wastage. Solution 1 guarantees at least 2K extra depth buffer operations per pixel. Solution 2 guarantees 1 extra depth read operation per pixel drawn, so it may give problems with high depth complexity (high overdraw), but for that the overdraw would have to be, on average, much much higher than 2K for every single pixel on the screen.
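As back-of-the-envelope arithmetic for that trade-off, counting only the extra depth operations named above (the function name and the overdraw figures are illustrative, not measurements):

```python
def extra_depth_ops(k, avg_overdraw):
    """Extra depth buffer operations per screen pixel for each solution."""
    solution1 = 2 * k           # fixed cost per pixel, regardless of overdraw
    solution2 = avg_overdraw    # one opaque-depth fetch per drawn transparent fragment
    return solution1, solution2


for overdraw in (2, 16, 40):
    s1, s2 = extra_depth_ops(8, overdraw)
    print("overdraw", overdraw, "-> solution 1:", s1, "solution 2:", s2)
```

With K=8 the break-even sits at an average transparent overdraw of 16 fragments per pixel across the whole screen; below that, Solution 2 does fewer extra depth operations.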