Open Source

Bouncing Back

Last time around, I showed that we could improve upon Imageworks’ multiple-scattering approximation by precalculating a new term, $\Ems$, from the Heitz model itself. This is, of course, a well-trodden path in rendering: if you can’t use something directly then try to tabulate or fit it instead!

To make this more practical, I also outlined a multiple-scattering variant of the split integral trick already in common use in real-time rendering, which allows the specular colour, $\Fr$, to be factored out.

I’ll now go into the details of this precomputation process, and also show how, with further crunching, we can reduce the storage and run-time cost even more.

Walking the Walk

A practical approach for precomputing the directional albedo of a single-scattering BRDF is via Monte Carlo integration with BRDF importance sampling.1 We can calculate $\Ems$ from the Heitz model in a similar way via its random-walk sampling process.

Conceptually, each random walk starts with a ray coming from the camera, which bounces one or more times on the microsurface before finally escaping it:

Figure 1: An illustration of the random-walk sampling process. (Courtesy of Eric Heitz.)

At each bounce, some energy is lost due to absorption and the rest is reflected.2 In the case of our copper conductor test subject, the final energy (or energy throughput) is simply the product of Fresnel reflection at each bounce.
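To make the mechanics a bit more concrete, here's a rough sketch of a single walk in shader-style HLSL; the sampling helpers are hypothetical stand-ins for the Heitz height and direction sampling routines, so treat this as an outline rather than a drop-in implementation:

// Sketch of one random walk for a conductor. RayEscapes, SampleVisibleNormal
// and F_Schlick are hypothetical helpers standing in for the Heitz machinery.
float3 WalkThroughput(float3 wo, float alpha, float3 Fr, out uint bounces)
{
    float3 energy = float3(1, 1, 1);
    float3 w = -wo;       // ray direction, heading into the microsurface
    bounces = 0;

    while (!RayEscapes(w, alpha))                        // has the ray left the surface?
    {
        float3 wm = SampleVisibleNormal(w, alpha);       // sampled microfacet normal
        energy *= F_Schlick(Fr, saturate(dot(-w, wm)));  // Fresnel reflection at this bounce
        w = reflect(w, wm);
        bounces++;
    }

    return energy; // product of the per-bounce Fresnel terms
}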

If we perform this process many times and average the final energy of every multi-bounce walk – i.e., treating single-scattering walks as having zero energy – then we obtain $\Ems$, as shown in the last post:

Figure 2: $\Ems$ for copper GGX material.

Doing the Splits

As I previously discussed, we don’t want to precompute $\Ems$ directly, since it bakes in $\Fr$. Instead, we’ll precalculate factors, $\wa \ldots \wnm$, which can be used to reconstruct $\Ems$ at run time for any $\Fr$:

To achieve this, we’re going to make two very simple modifications the random walk process:

The walk will now keep track of a vector of energy throughputs, $\ea \ldots \enm$, rather than a single value. At the beginning of the walk, $\ea$ is initialised to 1 and $\eb \ldots \enm$ to 0.

At each scattering event, the energy throughput was previously multiplied by a Fresnel factor, $F(\omega_m, \omega_r)$, where $\omega_m$ is a sampled microfacet normal and $\omega_r$ is the outgoing ray direction. If we use the Schlick Fresnel approximation for $F$, this is
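$\quad$ As a quick reminder, Schlick's approximation has the form $F(\omega_m, \omega_r) \approx \Fr + (1 - \Fr)(1 - \omega_m \cdot \omega_r)^5$, i.e. a lerp between $\Fr$ and 1 driven by the angle-dependent term $(1 - \omega_m \cdot \omega_r)^5$.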

$\quad$ We will now use $\ei$ to track the fraction of energy scaled by $\Fr^i$.

From Equation $\ref{eq:fres}$, we can see that after the first bounce, a factor $\sia$ of the energy is multiplied by $\Fr$ and the rest, $\ssa$, is left untinted, so we set $\ea = \ssa$ and $\eb = \sia$. On the next bounce, with a new Fresnel factor that splits into $\sib$ and $\ssb$, we will have $\ea = \ssa \ssb$, $\eb = \sia \ssb + \ssa \sib$ and $\ec = \sia \sib$. More energy has moved from order 0 to order 1, and some from order 1 to order 2.

Working things through, the update to our energy throughput vector at each bounce can be written as
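In symbols, writing $s$ for the untinted Schlick fraction $(1 - \omega_m \cdot \omega_r)^5$ at the current bounce (this is simply a restatement of the code below):

$\quad \ea \leftarrow s\,\ea, \qquad \ei \leftarrow s\,\ei + (1 - s)\,e_{i-1} \quad (i \geq 1)$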

Or in code:


for (int i = N - 1; i > 0; i--)
    e[i] = mix(e[i - 1], e[i], s); // order i keeps an s fraction of itself and gains (1 - s) of order i-1
e[0] *= s;                         // order 0 retains only the untinted fraction, s

The final $\wa \ldots \wnm$ factors that we’re after are simply the averages of $\ea \ldots \enm$ over multiple runs of the random walk process (again, ignoring single scattering):

Figure 3: Multiple-scattering Fresnel factors for GGX.

You can see the whole process in action in this WebGL demo, which reproduces Figure 2 as well as the decomposition shown in Figure 3.

Note: there’s quite a bit going on in this demo, but I didn’t want to get into the weeds in this post, so I’ll write up some separate notes at a later date.

Losing Weights

Although it may be hard to tell from Figure 3, there is some residual energy in the bottom right corner3 of $\wc$, and this is true for a few higher orders as well. This implies that we’d need two or three textures to store all of the factors for fully accurate results, as well as a fair amount of maths in the shader. It would be nice to slim this down!

Fortunately, since most of the energy is in the lower orders, the source polynomials for $\Ems$ can be accurately refitted to lower-order cubic curves. For instance, here’s a plot of $\Ems$ for the bottom right corner, together with the refit:

This means that, in practice, we only need a single four-channel texture and three MAD instructions (when Equation $\ref{eq:ms_sum}$ is written in Horner form) to reconstruct $\Ems$ in the shader:
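As a sketch of the run-time side – the LUT name, parameterisation and channel order here are purely illustrative – the reconstruction boils down to a single fetch and a Horner-form polynomial:

// Hypothetical 4-channel LUT holding the refitted cubic coefficients over (NoV, roughness)
Texture2D<float4> msLUT      : register(t0);
SamplerState      lutSampler : register(s0);

float3 MultiScatterEnergy(float NoV, float roughness, float3 Fr)
{
    float4 w = msLUT.SampleLevel(lutSampler, float2(NoV, roughness), 0);

    // Ems = w0 + w1*Fr + w2*Fr^2 + w3*Fr^3, evaluated with three MADs per channel
    return ((w.w * Fr + w.z) * Fr + w.y) * Fr + w.x;
}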

I did this refitting in Mathematica using MiniMaxApproximation to minimise relative error, which proved important for accurately reproducing the lower end of the curves.

You can see the end result for yourself in this second WebGL demo, which compares Imageworks, along with the improvements discussed so far (left half), against the Heitz reference (right half).

Note: I realise that not everyone has ready access to Mathematica, so I’m planning to rewrite the refitting step in either Python or C++ before publishing the complete end-to-end process on GitHub.

Bonus (Reading) Exercise

Up until now, I’ve been probing Imageworks’ energy-preservation solution, which approximates microfacet multiple scattering with a diffuse-like lobe. However, there are other potentially valid options here. For instance, concurrent to Imageworks’ investigations, Emmanuel Turquin developed a different solution while at Industrial Light & Magic, which instead approximates the multiple scattering via a rescaled single scattering lobe. This approach has since been adopted within Unity’s High Definition Render Pipeline4 as well Google’s Filament.

With permission, I have the privilege of hosting a technical report from Manu that goes into all of the details of his approach and various shortcuts, along with some comparisons to Imageworks and Heitz. That said, in conclusion he also states:

“Comparisons in this report have been largely of a qualitative nature. A more thorough and quantitative analysis of the differences between the three main methods is left for future work.”

In the next post, I hope to go some way towards providing this more detailed analysis. In the meantime, you should read Manu’s TR!

1. For an example of this, see the Image-Based Lighting section of [Karis 2013].

2. Since the focus is still on conductors, I'm ignoring refraction and transmission for now.

Open Source

A New Dimension

In the last post, I’d shown how a small tweak to Imageworks’ multiple-scattering Fresnel term, $\color{brown}{F_\mathrm{ms}}$, brought their approximation closer to the reference solution of Heitz et al. However, there was still a niggling difference at high roughness, most noticeable under uniform lighting:

At the end of the day, $\color{brown}{F_\mathrm{ms}}$ is based on a simple diffuse model, and its limitations become more apparent as multiple scattering increases with roughness.

A more accurate alternative would be to precalculate a multiple-scattering directional albedo LUT that incorporates Fresnel, directly from the Heitz model. This new term, which I’ll call $\color{teal}{E_\mathrm{Fms}}$, leads to the following minor change to the multiple-scattering lobe:

The good news is that this version produces results that match Heitz in a furnace environment:

The bad news is that we need a 3D LUT for $\Ems$, since it depends not only on the view angle ($\mu_o$) and roughness, but also $\Fr$ (or specular reflectance). Worse, for coloured metals – such as our lovely copper example – we’d need to do three separate 3D lookups for R, G and B. The only silver lining is that this cost can potentially be amortised across lights, but it’s still less than ideal for real-time applications.

Note: a secondary concern is that this change also breaks the reciprocity of the multiple-scattering model, since the results will be different if $\mu_o$ and $\mu_i$ are swapped. While this isn't a practical issue for real-time rendering, it could be a problem in other contexts (e.g. bidirectional path tracing).

Schlick Alternative

If we’re willing to restrict ourselves to Schlick’s Fresnel approximation1, then we can adapt a popular approach that’s been used with environmental lighting in games for a number of years [Drobot 2013; Karis 2013; Lazarov 2013]. In this context, the roughness and directional-dependent effects of microfacet shadowing and Fresnel reflection are factored out and preintegrated ahead of time for a given BRDF. At run time, this term is combined with separately prefiltered environmental maps to produce a cheap but effective approximation of the real integral of the BRDF and the lighting.

This idea goes back further (see the Ambient BRDF of [Gotanda 2010]), but the newer variants are more compact as they exploit the linearity of Schlick Fresnel, $\Fs$, to factor out $\Fr$, which reduces the dimensionality of the preintegrated table (or fit, in the case of [Lazarov 2013]). I'll quickly recap how this works, as it naturally extends to a solution for $\Ems$.

What these approaches are effectively calculating is a version of the single-scattering directional albedo, $\E$, that incorporates $\Fs$. Let’s call this $\Ess$:

where $\fssp$ is the single-scattering BRDF without Fresnel.

The key observation is that Schlick Fresnel’s additive form

allows $\Ess$ to be split into two parts:

one that will be tinted by $\Fr$ of the material at run time, and another that’s left untinted. This means that we can precompute a 2D LUT containing these two terms, rather than needing a 3D LUT.

A little more formally, we can view this as decomposing $\E$ into two factors, $\wa$ and $\wb$:

which are then multiplied by two orders of $\Fr$ and summed to form $\Ess$:
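That is, with the view angle and roughness arguments left implicit:

$\quad \Ess = \wa + \wb \, \Fr$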

Apologies if I have laboured the point, but hopefully you can see where this is going: we can do a similar decomposition with the multiple-scattering albedo, $1 - \E$.

This time I’ll present things visually, since I think we’ve seen enough integrals for now. First, here’s $1 - \E$ for GGX, which you may recognise from Imageworks’ slides:

Figure 3: $1 - \E$, for GGX.

and here is the decomposition into factors for the various orders of $\Fr$, over the ($\mu$, roughness) domain:

Figure 4: Multiple-scattering Fresnel factors for GGX.

Naturally we have more factors this time, since with multiple scattering there could be $1 \ldots N$ additional reflections before light leaves the microsurface. Given these factors, we can calculate $\Ems$ thusly:
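In other words, $\Ems$ is a power series in $\Fr$, with the factors above as its coefficients:

$\quad \Ems = \wa + \wb\,\Fr + \wc\,\Fr^{2} + \dots$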

As with the approach for environmental lighting, these factors can be precomputed and stored in 2D LUTs. At run time, we fetch the appropriate factors (based on view angle and roughness) and calculate $\Ems$ using Eq. $\ref{eq:ms_sum}$.

While this hopefully all makes sense, it’s a little abstract, so let’s visualise $\Ems$ for our copper material:

Figure 5: $\Ems$ for copper GGX material.

As we can see, it’s just like the multiple-scattering albedo shown in Figure 3, only now it’s been tinted by the different orders of Fresnel reflection. Note how the saturation increases from top left (low roughness, grazing angle) to bottom right (high roughness, incident view angle), as we would expect2.

Finally, here are the spheres again with this LUT-based solution, this time under direct lighting:

In this example, our revised multiple-scattering approximation (Eq. $\ref{eq:fms}$) is virtually indistinguishable from the Heitz model. The only slight difference, at roughness = 1, comes from the multiple-scattering lobe not being a perfect match to the ground truth, as we already saw in the last post:

A Random Walk to the Ground Truth

In the last post, we saw that significant energy was being lost due to the common single-scattering limitation of microfacet-based shading models:

An intuitive way to think about this is that these BRDFs are only modelling direct lighting (= single scattering) of the microsurface heightfield. Indirect lighting (= multiple scattering) is not simulated, and that is the cause of the missing energy in the image above.

So, why do we have this limitation? Well, in much the same way that direct lighting and shadowing is relatively straightforward to do in real-time compared to global illumination, the same is true for microfacet shading models.

Single scattering is solved efficiently by making certain simplifying assumptions about how microfacets of a microsurface are arranged. Through this, it’s possible to come up with analytic expressions for how light is directly reflected by the microsurface, while incorporating self shadowing.

This second aspect is modelled, naturally enough, by the shadowing term (part of the shadowing-masking term, $G$) of standard microfacet BRDFs. The Smith shadowing term [Smith 1967] is currently the most popular option here, and it has been shown by Heitz [2014] to produce results that are a close match to brute-force simulation. This is impressive considering that Smith's model is pretty simple in terms of the microsurface assumptions that it makes.

Given the desirable properties of the Smith model (simple yet plausible, and also widely used), Heitz et al. [2016] chose to use it as the foundation for a new multiple-scattering model. It derives directly from Smith’s microsurface assumptions and is evaluated through a random walk process. As a result, all orders of scattering are accounted for, and energy conservation is achieved as a natural consequence.

Let’s return now to our earlier spheres, this time rendered using the Heitz model:

The rougher spheres are certainly a lot brighter than before, and placing them under uniform lighting (which matches the background) confirms that energy is now completely conserved:

Heitz et al. also showed that, just as with Smith and single-scattering, their multiple-scattering model has similar behaviour to a brute-force simulation. Given that property, I will proceed to use their model as a ground truth reference to compare Imageworks’ approach against.

A First Approximation

In contrast to the Heitz model, Imageworks’ solution is an approximation that attempts to compensate for the missing energy, rather than actually simulate the physical process of multiple scattering. This is achieved by adding an extra multiple-scattering lobe, $\fms$ – based on [Kelemen and Szirmay-Kalos 2001] – to the existing single-scattering BRDF, $\fss$:

$\Emo$ is equivalent to what we saw previously with the furnace test result of a fully reflective, single-scattering material:

It’s the fraction of incoming light that leaves the microsurface after a single bounce, for the view angle $\mu_o$. The new lobe, $\fms$, is designed to account for the remainder, $1 - \Emo$, so that we get $\fss + \fms = 1$, i.e. perfect energy conservation.

At first glance, the render looks pretty similar to the Heitz model, and we can see with a furnace test that energy is again conserved:

With a side-by-side comparison for each sphere (left half: Imageworks, right half: Heitz), we can see that indeed the two methods are very close, with only a minor visual difference at roughness = 1 (far right):

Here’s a zoomed in view of that particular case:

This is a promising early result, and I think you would be hard-pressed to tell the difference between the two in real production scenarios containing more complex lighting and spatially varying roughness, for instance.

Disorderly Conduct(or)

Now let’s see how things fare with more general conductors. To handle this case, Imageworks adapted a multiple-scattering Fresnel term, $\Fms$, from [Jakob et al. 2014] (Expanded Technical Report, Section 5.6), which accounts for absorption/tinting as light bounces multiple times on the microsurface:

where $\Favg$ is the cosine-weighted average of the Fresnel function, $F$, over the hemisphere.
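Written out – my transcription, and also what the geometric series described below sums to – the term is:

$\quad \Fms = \dfrac{\Favg\,\Eavg}{1 - \Favg\,(1 - \Eavg)}$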

Here is a new side-by-side comparison (left half: Imageworks, right half: Heitz), with a copper material:

There’s now a bit more of a difference between the two approaches, which is easier to see in a furnace:

Evidently the Imageworks result is lighter and less saturated than Heitz at higher roughness.

When I first saw this difference, I was tempted to conclude that $\Fms$ is simply an approximation that fails to accurately model the complexities of real multiple scattering. However, a closer look at the derivation of this term revealed a problem.

The diffuse multiple-scattering model of Jakob et al. assumes that $\Eavg$ is the fraction of light that escapes the microsurface after each scattering event, leaving $1 - \Eavg$ to continue to bounce. Furthermore, each reflection is assumed to attenuate the light energy by $\Favg$. This means that after the first bounce, the fraction of light energy leaving the surface is $\Favg\,\Eavg$, followed by $\Favg\,\Eavg\,\Favg\,(1 - \Eavg)$, for the second bounce, etc. $\Fms$ is the total factor if we sum over all orders of scattering:

The problem is that this model is including single scattering events ($\Favg\,\Eavg$), which we’ve already accounted for with $\fss$. This suggests that we should instead use:

However, this now gives $\Fms = 1 - \Eavg$ when $\Favg = 1$, instead of $\Fms = 1$ previously. We want the latter behaviour because the $\fms$ lobe already has a magnitude of $1 - \Emo$ (as discussed earlier), so we should normalise our adjusted $\Fms$ by $1/(1 - \Eavg)$:
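Dropping the single-scattering term from the series and then applying that normalisation gives (my working):

$\quad \Fms = \dfrac{1}{1 - \Eavg} \cdot \dfrac{\Favg^{2}\,\Eavg\,(1 - \Eavg)}{1 - \Favg\,(1 - \Eavg)} = \dfrac{\Favg^{2}\,\Eavg}{1 - \Favg\,(1 - \Eavg)}$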

This is very close to what we had before (Eq. $\ref{eq:fms}$), except there’s $\Favg^2$ instead of $\Favg$ in the numerator.

With this simple change, the Imageworks solution gets closer to the ground truth:

The furnace test reveals that there is still a small difference: Imageworks is now a bit darker and more saturated compared to Heitz. Still, it’s an overall improvement.

This adjustment has been added to an updated version of Imageworks’ slides, along with numerical fits for $\Em$ and $\Eavg$ from Christopher Kulla. The new slides also contain several important corrections3 that are highlighted in the speaker notes.

In the next post, I discuss a further improvement to $\Fms$ that gets us even closer to the reference.

Introduction

As part of the Physically Based Shading Course at SIGGRAPH last year, Christopher Kulla and Alejandro Conty presented the latest iteration of Sony Pictures Imageworks’ core production shading models. One area of focus for them was to improve the energy conservation of their materials; in particular they wanted to compensate for the inherent lack of multiple scattering in common analytic BSDFs, which can be a major source of energy loss. (More on this later.)

A year prior, Heitz et al. [2016] had addressed this very issue with an accurate model1 for microfacet multiple scattering. Unfortunately, since it uses a stochastic process, it wasn’t a good fit for Imageworks’ renderer. Instead, Kulla and Conty adapted ideas from earlier work [Kelemen and Szirmay-Kalos 2001; Jakob et al. 2014] in order to develop practical solutions for conductors and dielectrics.

While the multiple-scattering term that Imageworks uses is energy conserving by design, it doesn’t actually simulate how light scatters between the facets of a given microsurface (in their case modelled by GGX) in the way that the Heitz model does. Still, in spite of this theoretical shortcoming, it is undoubtedly an improvement over doing nothing at all.

I remember being quite excited when I first saw Imageworks’ results, particularly because their approach appeared to be suitable for real-time use. At the same time, I was curious to see exactly how well it compared to the Heitz model as the gold standard. And beyond that, I was eager to explore the general topic of real-time-friendly approximations to microfacet multiple scattering. In the next few posts, I will share my findings, but first let’s start with a quick recap of the problem…

Scratching the Microsurface

The most popular microfacet-based BSDFs in use today have a common limitation: they only model single scattering. This means that any incoming light that doesn’t immediately leave the microsurface through a single reflection or refraction event is ignored by these models. For instance, light that hits one facet and is reflected onto another (and so on) is treated as though it has been completely absorbed.

It’s this restriction on single scattering that made the derivation of compact, analytic models possible in the first place. However, the lack of multiple scattering can lead to significant energy loss with rougher surfaces, which makes sense since there’s a higher probability of light bouncing several times within the microsurface before escaping.

To give a concrete example of this problem, let’s start with a very simple material model: a GGX-based conductor with a constant reflectance of 1. Here is a render of a set of spheres made of this material, with roughness2 varying from 0.125 to 1 (left to right):

At a first glance this result might seem reasonable, but the problem of energy loss becomes readily apparent when the same spheres are placed in a uniform lighting environment (a.k.a. a furnace test):

As we can see, though our material is supposed to be completely reflective, more and more energy is lost as roughness increases. In fact, at the very right, close to 60% of the light has vanished due to the absence of microfacet multiple scattering. Up until recently, we had largely been sweeping this problem under the rug, but it’s hard to argue with concrete numbers like that.

A practical consequence of this behaviour is that it makes life harder for artists doing texture painting and look development, and while it might be possible to manually compensate for the darkening effect in simple cases such as above, it soon becomes an impossible task with textured reflectance and roughness.

It’s clear that we are falling a little short on the physically based shading front and the promise of intuitive material parameters. Fortunately, our collective feeling of shame will be momentary, since help is at hand.

Next we’ll take an initial look at Imageworks’ approach and see how it measures up against the Heitz model.

Note: The ACM is also providing open access to all of the content1 for a limited time2, as they did last year. (This probably explains why it’s taking a while for some of the material to show up elsewhere.)

As before, I’m collecting links to SIGGRAPH content: courses, talks, posters, etc. I’ll continue to update the post as new stuff appears; if you’ve seen anything that’s not here, please let me know in the comments.

Update:
In a welcome change this year, conference content is freely available from the ACM Digital Library (albeit via Author-Izer, so there’s a tedious countdown timer for each link). Here are the most relevant pages:

Exhibitor Tech Talks

2pm, Wednesday 13th August. Located in the west building, rooms 211-214.

We’re back once again with the Physically Based Shading (in Theory and Practice) course at SIGGRAPH! You can find the details on the new course page, but I’ll copy the schedule here, for your convenience:

As you can see, the composition of this year's lineup is a little different from previous years. To start with, we've incorporated a bit more theory into the first half of the course, beyond Naty Hoffman's established and superlative introduction; Eric Heitz will be summarising his excellent JCGT paper on microfacet masking-shadowing functions, and Jonathan Dupuy will be distilling their recent work on LEADR Mapping. Jonathan also discusses a number of practical issues in his accompanying course notes.

Either side of the break, we have two game industry speakers, Yoshiharu Gotanda and Sébastien Lagarde. Yoshiharu will be covering his latest R&D at tri-Ace, targeting next-gen hardware; Sébastien will also be presenting some advances, along with sharing the Frostbite team’s experiences in bringing physically based rendering principles to their engine and a number of titles. Séb and Charles de Rousiers have also compiled a highly detailed and extensive set of course notes, which should be available in the coming days.

Arnold has fast become a (physically based) force to be reckoned with inside the VFX industry, so it’s high time that the renderer receive attention in the course. With that in mind, we have Anders Langlands (Solid Angle) talking about what makes his open-source shader library alShaders tick, the design decisions behind it, and how it plays to the strengths of Arnold.

Rounding off the session, we have Ian Megibben and Farhez Rayani from Pixar recounting the evolution of lighting over previous Toy Story films from an art perspective, as well as the challenges and benefits brought about by the switch to physically based rendering for Toy Story OF TERROR!

I hope to see you there!

One last thing…

We’ve been fortunate to have some really excellent presentations in the course over the past few years. One of the most enduring and influential has been Brent Burley’s Physically Based Shading at Disney, in 2012. Two years on, Brent has taken the time to update his course notes with a few additional details, complementing the shading model implementation that was added to Disney’s BRDF Explorer last November. Brent also revisits his “Smith G” roughness remapping, following the findings of Eric Heitz’ aforementioned paper. You can find Brent’s updated notes on the 2012 course page here.

Birds of a Feather

Keynote

Photos

This is a DX11 followup to an earlier article on quad ‘overshading’. If you’ve already read that, then feel free to skip to the meat of this post.

Recount

As you likely know, modern GPUs shade triangles in blocks of 2x2 pixels, or quads. Consequently, redundant processing can happen along the edges where there’s partial coverage, since only some of the pixels will end up contributing to the final image. Normally this isn’t a problem, but – depending on the complexity of the pixel shader – it can significantly increase, or even dominate, the cost of rendering meshes with lots of very small or thin triangles.

It’s hardly surprising, then, that IHVs have been advising for years to avoid triangles smaller than a certain size, but that’s somewhat at odds with game developers – artists in particular – wanting to increase visual fidelity and believability, through greater surface detail, smoother silhouettes, more complex shading, etc. (As a 3D programmer, part of my job involves the thrill of being stuck in the middle of these kinds of arguments!)

Traditionally, mesh LODs have helped to keep triangle density in check. More recently, deferred rendering methods have sidestepped a large chunk of the redundant shading work, by writing out surface attributes and then processing lighting more coherently via volumes or tiles. However, these are by no means definitive solutions, and nascent techniques such as DX11 tessellation and tile-based forward shading not only challenge the status quo, but also bring new relevancy to the problem of quad shading overhead.

Knowing about this issue is one thing, but, as they say: seeing is believing. In a previous article, I showed how to display hi-z and quad overshading on Xbox 360, via some platform-specific tricks. That’s all well and good, but it would be great to have the same sort of visualisation on PC, built into the game editor. It would also be helpful to have some overall stats on shading efficiency, without having to link against a library (GPUPerfAPI, PerfKit) or run a separate tool.

There are several ways of reaching these modest goals, which I’ll cover next. What I’ve settled on so far is admittedly a hack: a compromise between efficiency, memory usage, correctness and simplicity. Still, it fulfils my needs so far and I hope you find it useful as well.

Going To Eleven

First, let’s restate the problem: what we want, essentially, is to count up the number of times we shade a given screen quad. The trick is to only count each shading quad once.

The way I achieved this on Xbox 360 hinged on knowing whether a given pixel was ‘alive’ or not, and then only accumulating overdraw for the first live pixel in each shading quad. As far as I’m aware, there’s no official way of determining this on PC through standard graphics APIs, but some features of DX11 – namely Unordered Access Views (UAVs) and atomic operations – will allow us to arrive at the same result via a different route.

The right way

What I was after was an implementation that was as simple as before, involving three steps:

Render depth pre-pass (optional; do whatever the regular rendering path does for this)

Render scene (material/lighting passes) with special overdraw shader

Display results

A straightforward, safe option is to gather a list of triangles per screen quad, filtering by ID (a combination of SV_PrimitiveID and object ID). This filtering can be performed during the overdraw pass or as a post-process.

What’s unsatisfying with this approach is that it involves budgeting memory for the worst case, or accepting an upper bound on displayable overdraw. Whilst I can imagine that a multi-pass variation is doable, that just adds unwanted complexity to what ought to be a simple debug rendering mode.

The wrong way

So, in order to overcome these limitations, I started toying around with something a lot simpler:
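Something along these lines – a simplified sketch, with register assignments and naming purely illustrative:

// Naive sketch: track the most recent triangle ID per screen quad and count
// the first pixel that writes a new ID. (This is the flawed version.)
RWTexture2D<uint> idUAV       : register(u0);
RWTexture2D<uint> overdrawUAV : register(u1);

[earlydepthstencil]
void OverdrawPS(float4 vpos : SV_Position, uint id : SV_PrimitiveID)
{
    uint2 quad = vpos.xy * 0.5;

    uint prevID;
    InterlockedExchange(idUAV[quad], id, prevID);

    // First live pixel in the quad to write a new ID bumps the overdraw count
    if (prevID != id)
        InterlockedAdd(overdrawUAV[quad], 1);
}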

The intent here is to use a UAV to keep track of the current triangle per screen quad. Through InterlockedExchange, we both update the ID and use the previous state to determine if we’re the first pixel to write this ID (prevID != id). If so, we increment an overdraw counter in a second UAV. This is similar in spirit to the Xbox 360 version, in that we’re selecting one of the live pixels in a shading quad to update the overdraw count. Finally, we can display the results in a fullscreen pass:

On paper, this appears to elegantly avoid the storage and complexity of the previous approach. Alas, it relies on one major, dubious assumption: that quads are shaded sequentially! In reality, GPUs process pixels in larger batches of warps/wavefronts and there’s no guarantee that UAV operations are ordered between quads – hence the name: unordered. So, during the shading of pixels in a quad for one triangle, it’s perfectly possible for another unruly triangle to stomp over the quad ID and break the whole process!

The cheat’s way

Fortunately, we can get around this issue with a few modifications. The basic idea here is to loop and use InterlockedCompareExchange to attempt to lock the screen quad:


RWTexture2D<uint> lockUAV     : register(u0);
RWTexture2D<uint> overdrawUAV : register(u1);

[earlydepthstencil]
void OverdrawPS(float4 vpos : SV_Position, uint id : SV_PrimitiveID)
{
    uint2 quad = vpos.xy * 0.5;

    uint prevID;
    uint unlockedID = 0xffffffff;

    bool processed = false;
    int  lockCount = 0;

    for (int i = 0; i < 16; i++)
    {
        if (!processed)
            InterlockedCompareExchange(lockUAV[quad], unlockedID, id, prevID);

        [branch]
        if (prevID == unlockedID)
        {
            // Wait a bit, then unlock for other quads
            if (++lockCount == 2)
                InterlockedExchange(lockUAV[quad], unlockedID, prevID);

            processed = true;
        }

        if (prevID == id)
            processed = true;
    }

    if (lockCount)
        InterlockedAdd(overdrawUAV[quad], 1);
}

This leads to three outcomes for unprocessed pixels:

If prevID == unlockedID, then the pixel holds the lock for its shading quad

If prevID == id, another pixel in the shading quad holds the lock

Otherwise, no pixel in the shading quad holds the lock

In the first case we mark the pixel as processed and increment a lock counter. After an additional iteration, we release the lock. This ensures that pixels with the same ID see the state of the lock (second case), so that they can be filtered out. Finally, pixels that held the lock update the quad overdraw.

Ideally we’d loop until the pixel has been tagged as processed, but I haven’t had success with current NVIDIA drivers and UAV-dependent flow control, i.e.:

As a workaround, I’ve simply set the iteration count to a number that works well in practice across NVIDIA and AMD GPUs (those that I’ve had a chance to test, anyway).

Four, Three, Two, One

Now that we have a working system in place, it’s easy to gather other stats. For instance, although we can’t determine directly if a pixel is alive, we can count the number of live pixels in each shading quad, since Interlocked* operations are masked out for dead pixels. With this, we can tally up the number of quads with 1 to 4 live pixels in yet another UAV:

RWTexture2D<uint> lockUAV      : register(u0);
RWTexture2D<uint> overdrawUAV  : register(u1);
RWTexture2D<uint> liveCountUAV : register(u2);
RWTexture1D<uint> liveStatsUAV : register(u3);

[earlydepthstencil]
void OverdrawPS(float4 vpos : SV_Position, uint id : SV_PrimitiveID)
{
    uint2 quad = vpos.xy * 0.5;

    uint prevID;
    uint unlockedID = 0xffffffff;

    bool processed  = false;
    int  lockCount  = 0;
    int  pixelCount = 0;

    for (int i = 0; i < 64; i++)
    {
        if (!processed)
            InterlockedCompareExchange(lockUAV[quad], unlockedID, id, prevID);

        [branch]
        if (prevID == unlockedID)
        {
            if (++lockCount == 4)
            {
                // Retrieve live pixel count (minus 1) in quad
                InterlockedAnd(liveCountUAV[quad], 0, pixelCount);

                // Unlock for other quads
                InterlockedExchange(lockUAV[quad], unlockedID, prevID);
            }

            processed = true;
        }

        if (prevID == id && !processed)
        {
            InterlockedAdd(liveCountUAV[quad], 1);
            processed = true;
        }
    }

    if (lockCount)
    {
        InterlockedAdd(overdrawUAV[quad], 1);
        InterlockedAdd(liveStatsUAV[pixelCount], 1);
    }
}

To my surprise, incrementing a 4-wide UAV didn’t lead to a massive slowdown here. That said, one can certainly use a number of buckets for intermediate results (indexed by the lower bits of the screen position, for instance), if this proves to be a problem.

With these numbers, it’s trivial to add a pie chart to the final pass:

I’ve added a new article, Blending in Detail, written together with Colin Barré-Brisebois, on the topic of blending normal maps. We go through various techniques that are out there, as well as a neat alternative (“Reoriented Normal Mapping”) from Colin that I helped to optimise.

This is by no means a complete analysis – particularly as we focus on detail mapping – so we might return to the subject at a later date and tie up some loose ends. In the meantime, I hope you find the article useful. Please let us know in the comments!

Lately I’d been getting increasingly frustrated with the limitations of WordPress(.com), so I longed for a change. With the Easter weekend, I finally had a little extra time and energy to make the switch to Octopress, plus a dedicated web host. Hopefully that’ll encourage me to start posting again, or at least remove one major grumble. I’m also looking forward to such liberties as the ability to embed WebGL, though I can’t entirely promise that I’ll wield such power responsibly.

Existing post URLs remain the same, but if you’re one of the illustrious few who subscribe to the blog via RSS, I’m guessing that you’ll need to change over to the new feed.

Update: I’m redirecting the old feed URL now, so everything should be back to normal!

Speaking of RSS, as I’m now using MathJax for $\LaTeX$, it appears that I’ll need to implement a fallback there, in addition to tracking down a rendering issue with Chrome. Please let me know if you spot any other oddities.

I first tinkered with SH wrap shading (as described in part 1) for Splinter Cell: Conviction, since we were using a couple of models [1][2] for some character-specific materials. Unfortunately, due to the way that indirect character lighting was performed, it would have required additional memory that we couldn’t really justify at that point in development. Consequently, this work was left on the cutting room floor and I only got as far as testing out Green’s model [1].

Recently, however, I spotted that Irradiance Rigs [3] covers similar ground. At the very end of the short paper, they briefly present a generalisation of Valve’s Half Lambert model [2] and the SH convolution terms for the first three bands:

This tidily combines the tunability of [1] with the tighter falloff of [2], albeit at the cost of a few extra instructions in the case of direct lighting. It’s not energy-conserving though, so for kicks I went through the maths – see appendix – and made the necessary adjustments:

I would suggest this as a good workout if your calculus skills are a little on the rusty side; think of it as a much-needed trip to the maths gym: sure it’s going to hurt at first, but you’ll feel better afterwards!

The same authors have since written a more in-depth paper, Wrap Shading [4], which Derek Nowrouzezahrai has kindly made available here. I recommend checking it out, since there’s some nice analysis and plenty of background information. One notable insight is that their model is perfectly represented by 3rd-order SH when $a = 1$ (i.e., Half Lambert). This becomes clear when you consider that the model is effectively unclamped in that case, so appropriate scaling of the constant, linear and quadratic bands will match the function:

A similar observation can be made with Green’s model: it’s perfectly represented by 2nd-order SH when $a = 1$.

Shrink Wrap

But wait, at the end of the part 1, didn’t I promise that there would be a discussion of optimisation in this post? You’re quite right. Well, it just so happens that a snippet of reference shader code from this last paper makes for a neat little case study on improving shader performance.

Reference Version

This is pretty much the reference implementation for generating the normalised convolution terms of their generalised model:

The only thing that I’ve changed – beyond adding calling code – is to pass in the wrap parameter fA from the vertex shader. It was previously a user-supplied constant, which doesn’t make for a particularly credible example, since in that case all of the maths could simply be moved to the CPU and performed just the once!

Note that there’s been some attempt to pull out common terms, particularly for the final component, where instead of fA*fA - 2*fA + 3 (see $\mathbf{f}$) we now have fA*fA - t.x + 5.

Without further ado, let’s see how this stacks up in terms of ps_3_0 instructions:

Ouch! 16 is fairly substantial, but perhaps not all that surprising going by the HLSL. Since this is device-independent assembly, I decided to check the ALU count on Xbox 360 for comparison. In that case it’s a somewhat more reasonable 10 operations, because 5 scalar ops get dual-issued with vector ops. So, in summary, we have:

DX9: 16, X360: 10(+5) ALU ops

Cancellation

Immediately, a simple but very effective change we can make is to cancel through by the normalisation term, which leaves us with $\mathbf{\hat{f}}$ directly:

For instance, even seemingly ‘obvious’ opportunities like (a/b)/(b/a) will go unnoticed by FXC. This isn’t down to the compiler trying to maintain special-case behaviour such as divide by zero either, because it will happily replace a/a with 1 in the absence of any knowledge about the value of a.

Apologies if that was already perfectly clear and all I’ve done is insult your intelligence, but I’ve seen some people blithely leave everything up to the compiler and not scrutinise what it’s generating. Of course, high-level algorithmic optimisations are hugely important as well, but so is this lower-level stuff when a shader is being executed for millions of pixels!

Just look at what this small amount of effort has netted us:

DX9: 10, X360: 5(+3) ALU ops

Factorisation

Next we can factor fA*fA - 2*fA + 3 again – this time as (fA + 1)(fA + 3) - 6*fA – to reduce the numerator of the third term to a single multiply-add:

I’ve also taken the opportunity to manually vectorise the addition of fA, plus a subsequent pair of multiplications between resulting terms. In fact, the compiler does this anyway, as it’s relatively good at vectorising code. Still, one shouldn’t assume that it will always get things right!

Whether there’s a gain or not, manual vectorisation – which is often quick to do – makes it easier to sanity check the output assembly. Just scanning through, you might expect add, mul, mov, rcp, mul, mad, rcp, mul and you’d be pretty much spot on.

So, for DX9 we’ve reduced the op count by 2, but what about Xbox 360? Here, we’ve only succeeded in shaving off one paired scalar op. However, this may turn into a real gain once the function is part of a larger shader.

DX9: 8, X360: 5(+2) ALU ops

Rescaling

This next trick involves rescaling so that the second term becomes 1/t.y, or a single rcp:

This is a win for ps_3_0 but not for Xbox 360, as it removes the opportunity for pairing. It’s possible that some clever variation could fix this, but it doesn’t matter because we haven’t exhausted our optimisation options…

DX9: 6, X360: 6 ALU ops

Fitting

There are potentially significant gains to be had from numerical fitting, so it’s worth taking the time to familiarise yourself with the various techniques, maths packages and libraries out there.

In this instance, I’m performing a cubic fit – i.e. $ax^3 + bx^2 + cx + d$ – for the 2nd and 3rd bands. Polynomials are attractive for performance because they can be efficiently evaluated as a series of mad instructions when written in Horner form: $x(x(ax + b) + c) + d$

Xbox 360 does all this in one less operation because placing 1 into r.x can be achieved with a destination register modifier:

DX9: 4, X360: 3 ALU ops

I could present graphs showing how the cubic approximations fare, but take it from me that they are extremely close. In fact, we can arguably drop down to a quadratic fit and save a further mad in the process. This is still acceptable:

Figure 1: Comparison between original and quadratic fit for 2nd and 3rd bands (left, right)

In both cases – cubic and quadratic – I’ve actually constrained the fitting process so that the curves go through the endpoints. This reduces the worst case error a little and maintains the nice property of exactness when $a = 1$. Of course, something has to give and so the average error is a little higher.

In practice, this quadratic approximation has little effect on the end result. When lighting with a single directional source – a worst-case scenario – the difference is slight and far less significant than the error that comes from using 3rd-order SH in the first place.

Modifiers

And yet, we’re still not done! The DX9 figure suggests that we might pay the instruction cost of moving 1 into r.x with some GPUs, and although it could go away when the terms are actually used, it would be cute if we could get rid of it just in case.

Notice that the two curves are monotonically decreasing and within the range [0, 1]. If we negate the intermediate result of the first mad, saturate and then negate again, there will be no overall effect. By doing this, we can take r.x along for the ride and force it to 0 through one of the negative constants, then add 1 via the final mad:

Wrapping Up

Here’s a WebGL sample that encapsulates this mini-series on wrap shading.

In conclusion, shader optimisation is critical for video game rendering, so you shouldn’t defer to the compiler. To quote Michael Abrash: “The best optimizer is between your ears”. Don’t forget it, train it!

Appendix

Normalisation factor for generalised Half Lambert:

A while back, Steve McAuley and I were discussing physically based rendering miscellanea over a quiet pint – hardly a stretch of the imagination, since we’re English 3D programmers after all! Anyway, it turned out that we both had plans to write up a few thoughts in relation to wrap shading, and, following some gentle arm-twisting, Steve has posted his. I suggest that you go and read that first if you haven’t already, then return here for a continuation of the subject.

Bad Wrap

Wrap shading has its uses when more accurate techniques are too expensive, or simply to achieve a certain aesthetic, but common models [1][2] have some deficiencies out of the box. Neither of these is energy conserving and they don’t really play well with shadows either. On top of that, Valve’s Half Lambert model [2] has a fixed amount of wrap, so it can’t be tuned to suit different materials (or, perhaps, to limit shadow oddities). I’ll come back to the point about flexibility in part 2, but first I’d like to discuss another factor that’s easily overlooked: environmental lighting.

If you’re set on using some form of wrap shading, then it’s not just a matter of applying it to your standard direct sources – directional, point and spot lights, for instance – it ought to be carried through to environmental lighting as well! Naturally, the importance of this depends on how strong and directional your secondary lighting is; obviously if you’re only using constant ambient then there’s no problem, but these days it’s fairly common to encode indirect lighting in Spherical Harmonics (SH) [3] and perhaps some additional lights as well. Fortunately, wrap shading in the context of SH lighting is easy, and much like energy conservation it’s a relatively cheap or free addition, so it’s worth considering even if the results prove to be subtle.

Looking After the Environment

So, how do we accomplish this? Well, that’s best explained with a quick recap. If you recall, for diffuse SH lighting, we first project the lighting environment, $f$, into SH:
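(As a reminder, this projection is just integration of the signal against each basis function:)

$\quad f_l^m = \int_{S} f(s)\, y_l^m(s)\, \mathrm{d}s$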

(Of course, in practice, this is commonly performed offline as a numerical integration over a cube map.)

We then convolve this with the SH-projected cosine lobe, $h$, like so:

Next, we can evaluate the lighting (more specifically, irradiance) for a given surface direction, $s$:

Finally, a division by $\pi$ gives us outgoing (or exit) radiance. Personally, I find it convenient to roll these extra terms into $h$ itself. The nice thing about this is that the convolution kernel then boils down (via analytical integration) to easy to remember values for the first three SH bands:
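Namely, the standard result for the clamped cosine, with the $1/\pi$ rolled in:

$\quad \hat{h} = \left(1,\ \tfrac{2}{3},\ \tfrac{1}{4}\right)$ for the constant, linear and quadratic bands respectively.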

For further details, you can find a complete and approachable account in [4].

Now, back to wrap: adjusting things for our shading model of choice is simply a matter of replacing $\hat{h}$. Let’s try this for the simple wrap model from Green [1] that Steve already discussed:

From Steve’s post, we know that we need an additional normalisation factor of $1 + w$ for energy conservation, so the full formula for our new convolution, which I’ll call $\hat{g}$, is:

You can go through a similar process of analytical integration as Steve did, only now with the additional SH basis terms $y_{l}^{0}$, or if you’re lazy like me, you can throw the formula at a package like Mathematica. Either way, once you’re done, you’ll arrive at the following (or something equivalent):

We can clearly see that this reduces to the cosine convolution kernel, $\hat{h}$, when $w = 0$. In visual terms, the effect of changing $w$ is evident with a single directional light, as you would expect:

It’s a Wrap?

Okay, confession time: this post wasn’t just about wrap shading, since it also serves as a foundation for future posts. For instance, part 2 will conveniently segue into shader optimisations, which is a topic I’ve been planning to write – or, perhaps more accurately, rant – about in general, and I’ll also be returning to Spherical Harmonics down the line.

Figure 1: Major axes for original (left), swizzle (mid) and perpendicular (right) vectors

Introduction

Two months ago, there was a question (and subsequent discussion) on Twitter as to how to go about generating a perpendicular unit vector, preferably without branching. It seemed about time that I finally post something more complete on the subject, since there are various ways to go about doing this, as well as a few traps awaiting the unwary programmer.

Solution Quartet

Here are four options with various trade-offs. If you happen to know of any others, by all means let me know and I’ll update this post.

Note: in all of the following approaches, normalisation is left as an optional post-processing step.

Quick ‘n’ Dirty

A quick hack involves taking the cross product of the original unit vector – let’s call it $\mathbf{u}(x, y, z)$ – with a fixed ‘up’ axis, e.g. $(0, 1, 0)$, and then normalising. A problem here is that if the two vectors are very close – or equally, pointing directly away from each other – then the result will be a degenerate vector. However, it’s still a reasonable approach in the context of a camera, if the view direction can be restricted to guard against this. A general solution in this situation is to fall back to an alternative axis:
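In code, this might look something like the following sketch (the fallback threshold is an arbitrary choice for illustration):

// 'Quick and dirty' perpendicular vector with a fallback axis (illustrative sketch)
float3 Perpendicular(float3 u)
{
    float3 axis = abs(u.y) < 0.99 ? float3(0, 1, 0) : float3(1, 0, 0);
    return cross(u, axis); // normalise if required
}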

Hughes-Möller

In a neat little Journal of Graphics Tools paper [1], Hughes and Möller proposed a more systematic approach to computing a perpendicular vector. Here’s the heart of it:

Or, as the paper also states: “Take the smallest entry (in absolute value) of $\mathbf{u}$ and set it to zero; swap the other two entries and negate the first of them”.

Figure 2: Distribution of v over the sphere

However, there’s a problem with this as written: it doesn’t handle cases where multiple components are the smallest, such as $(0, 0, 1)$! I hit this a few years back when I needed to generate an orthonormal basis for some offline geometry processing, and it’s easily remedied by replacing $<$ with $\leq$. Here’s a corrected version in code form:

Stark

More recently, Michael M. Stark suggested some improvements to the Hughes-Möller approach [2]. Firstly, his choice of permuted vectors is almost the same – differing only in signs – but even easier to remember:

In plain English: the perpendicular vector $\mathbf{\bar{v}}$ is found by taking the cross product of $\mathbf{u}$ with the axis of its smallest component. (Note: the same care is needed when multiple components are the smallest.)

Figure 3 visualises this intermediate ‘swizzle’ vector over the sphere:

I know what you’re thinking, there’s a problem here too: it should be zm = 1^(xm | ym)! Although still robust, the effect of this error is that the nice property of even, symmetrical distribution over the sphere is lost:

Let me save you the trouble of parsing this fully (or consider it an exercise for later); if you boil it down, what you’re effectively left with is:

I’ve failed, thus far, to pinpoint the origin of or thought process behind this approach. That said, some insight can be gained from visualising the resulting vectors:

Figure 5: Covering one’s axis

Their maximum component is in the x axis, except close to the +/-ve x poles. Essentially, Microsoft’s solution ensures robustness without concern for distribution, much like the initial ‘quick’ approach.

Performance

I haven’t benchmarked these implementations, since in cases where I’ve needed to generate perpendicular vectors, absolute speed wasn’t important or the call frequency was vanishingly small. Even in performance-critical situations, it really depends on what properties/restrictions you can live with and your target architecture(s). Still, I can’t help but think that XMVector3Orthogonal is doing a little bit more than it needs to, so maybe there’s cause to revisit this subject at a later date.

Conclusion

I hope you’ve learnt something about generating perpendicular vectors, or that I’ve at least made you aware of some of the minor issues in previous work on the subject. On that note, if you spot any new errors here, please let me know!

I recently added two of my existing publications, one about high performance dynamic visibility and the other on how to display pixel quad overshading in real-time on Xbox 360.

The first of these was originally published in GPU Pro 2. Unfortunately, I missed some errors that crept into the typeset version, so I was pleased to finally correct those and I took the opportunity to rework a few sentences for greater clarity as well. Now that it’s online, I’ll also be able to refer directly to certain sections in follow-up blog posts on the subject.

The second took the form of a journal entry for the Microsoft Game Developer Network, which went up in the spring. It may have flown under your radar, as I’ve since spoken to a few developers who hadn’t seen it, yet were keen to have such a tool in their engine. For NDA reasons, I can’t go into all of the implementation details here, so think of it as a ‘graphical appetiser’.

In a way, the two topics are related: the primary goal of a visibility system is to efficiently remove parts of the world that can’t be seen from a given viewpoint, whereas the purpose of a debug overshading mode is to directly visualise pixel shader work, some of which can likewise have zero contribution to the final image.

I think it’s also fair to say that keeping both forms of redundancy in check is a critical part of optimising the rendering performance of most AAA titles. For that reason, I hope you find these articles useful, and as always, please let me know what you think!