Tuesday, July 19, 2016

One of the standard solutions that has emerged for physically based rendering (PBR) is to use pre-filtered mipmapped radiance environment maps (PMREMs).

Put in very imprecise terms, a PMREM is a cube map with a perfect mirror reflection of the environment in the highest-resolution (largest) mip and successively blurrier images in the smaller mips. Typically the image is convolved with your material's distribution function at various roughnesses, and a cosine lobe is used at the lowest resolution.

From this, specular lighting is sampled from the LOD level that matches the material's roughness, and the lowest-resolution LOD (the cosine-convolved one) is used for diffuse lighting.
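The roughness-to-LOD mapping can be sketched as follows; this is a hypothetical minimal version (the linear mapping and the function name are assumptions, not X-Plane's actual scheme), with mip 0 holding the mirror reflection and the last mip holding the cosine-convolved image:

```cpp
#include <cmath>

// Hypothetical sketch: map a material roughness in [0,1] to a mip LOD of a
// PMREM with mip_count levels. Mip 0 is the mirror reflection; the final
// (lowest-resolution) mip is the cosine-convolved, diffuse-like image.
float pmrem_lod_for_roughness(float roughness, int mip_count)
{
    return roughness * float(mip_count - 1);
}
```

In a shader this value would feed a `textureLod`-style fetch on the cube map along the reflection vector, while the diffuse term samples the last mip along the surface normal.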

What I haven't seen are any papers on games doing this entirely in-engine.

There's plenty written on baking these cube maps ahead of time, and it appears that quite a few engines augment PMREMs with screen-space reflections (with various heuristics to cope with rough materials and all of the weird failure cases SSR can have).

But the only work I've seen on real-time PMREMs is the old GPU Gems 2 chapter that projects the original cube map into spherical harmonics (and then back into a cube map) as a way of getting a reasonable low frequency or diffuse cube map. But this was written back when a clean reflection and a diffuse map were good enough; it doesn't handle rough specular reflections at all.*

The problem that X-Plane faces in adopting modern game-engine graphics is that we can't bake. Our "level" is the entire planet, and it is built out of user-installed scenery packages that can be mixed and matched in real-time. This includes adding a mix of surface objects onto a mesh from another pack. Baking is out of the question because the final assembly of the level only exists when the user starts the app.

So I have been experimenting with both SH-based convolution of cube maps and simply importance sampling a distribution function on the GPU. It appears we're reaching the point where both of these can potentially function in real-time...at least, if you have a big GPU, a relatively low-res cube map (e.g. 256x256, not 1024x1024) and only one cube map.**
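For the importance-sampling path, the standard approach is to importance-sample the NDF itself; a sketch of GGX half-vector sampling in the Karis/UE4 formulation follows (tangent space only, normal at +Z; the struct and function names are mine, not X-Plane's):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Sketch of GGX importance sampling (Karis/UE4 formulation) in tangent
// space with the normal at +Z. Each (xi1, xi2) pair in [0,1)^2 yields a
// half-vector H distributed according to the GGX NDF for this roughness.
Vec3 importance_sample_ggx(float xi1, float xi2, float roughness)
{
    const float pi = 3.14159265358979f;
    float a = roughness * roughness;                    // Disney-style remap
    float phi = 2.0f * pi * xi1;
    float cos_theta = std::sqrt((1.0f - xi2) / (1.0f + (a * a - 1.0f) * xi2));
    float sin_theta = std::sqrt(1.0f - cos_theta * cos_theta);
    return Vec3{ sin_theta * std::cos(phi),
                 sin_theta * std::sin(phi),
                 cos_theta };
}
```

On the GPU this runs per texel of the output mip, with xi1/xi2 from a Hammersley or similar low-discrepancy sequence; the sample count is exactly the budget problem described in the second footnote.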

My question is: is anyone else already doing this? Is this a thing?

* You can add a lot more spherical harmonic coefficients, but it doesn't scale well in image quality; the amazing thing about SH is that the artifacts from having a very small number of bands are, perhaps by luck, very acceptable for low frequency lighting. The problem is that, as coefficients are added in, things get worse. The original image isn't reconstructed well (for the number of bands we can hope to use on a GPU) and the artifacts become significantly less desirable.

** To be clear: importance sampling is only going to work for a very, very small number of samples. I believe that for "tight" distributions it should be possible to find filter kernels that are equivalent to the importance-sampled result that can run in realtime. For very wide distributions, this is out of the question, but in that case, SH convolution might provide a reasonable proxy. What I don't know yet is what goes "in the middle". My guess is: some kind of incorrect and hacky but tolerable blend of the two.

Can you see what's wrong with it? Don't over think it; it's not a relaxed vs sequential atomics issue or a bug in the RAII code. The logic is roughly this:

Decrement my reference count.

If I was the last one (and my count is now zero):

Lock the global table of art assets.

While the table is locked, clear out my entry.

Delete myself.
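The logic above, sketched as C++ (all names here are hypothetical, not X-Plane's actual code):

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <string>

// Illustrative sketch of the racy release path described above. The bug:
// between fetch_sub() observing that we were the last reference and the
// lock being acquired, another thread can find the asset in the table and
// resurrect it.
struct ArtAsset;
static std::mutex                        table_lock;
static std::map<std::string, ArtAsset*>  table;

struct ArtAsset {
    std::string      key;
    std::atomic<int> ref_count { 1 };

    void release()
    {
        if (ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
        {
            // RACE WINDOW: right here another thread can take table_lock,
            // look us up, and bump ref_count back to 1.
            std::lock_guard<std::mutex> guard(table_lock);
            table.erase(key);
            delete this;   // ...leaving that thread with a dangling pointer
        }
    }
};
```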

It would be "more optimal" to drop the lock before we delete ourselves, but that's not the issue either.

The issue is a data race! The race exists between the completion of the atomic decrement and acquiring the table lock. In this space another thread can come along and:

Try to load this same art asset; it will acquire the table lock, look up our art asset, find it, increase the reference count back to 1.

It will then drop the lock, leaving everything exactly how it was -- except that the reference count is now 1, while we still believe we were the last holder.

When we proceed to nuke the art asset we will leave that other client with a bad pointer and crash.

I found this because I actually caught it in the debugger -- what are the odds?

Being too clever for my own good, I thought: "let's just re-check the reference count after we take the global lock; in the rare load-during-delete case we can then abort the delete, and in the common case releasing our reference stays fast."

That fixes the above race condition but doesn't fix this other one. In the space between the decrement and the lock, another thread loads the asset (taking the lock, finding it, and bumping the count back to 1), then releases its reference, which hits zero; it takes the lock, deletes the art asset and the table entry, and releases the lock.

Now we are the ones with a bad pointer, and we crash re-deleting the art asset.

So I've come to the conclusion that the only safe thing to do is to take the table lock first, before doing any decrement. In the event that we hit zero reference count, since we did so while the table is locked, no one could have found our art asset via the table to increase the count (and clearly no one else already had it if we went from ref == 1 to ref == 0). So now we know it's safe to delete.
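A minimal sketch of that lock-first release (types and names are illustrative, not X-Plane's actual code):

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <string>

// Sketch of the fix: acquire the table lock *before* the decrement.
struct Asset {
    std::string      key;
    std::atomic<int> ref_count { 1 };
};
static std::mutex                     table_lock;
static std::map<std::string, Asset*>  table;

void release(Asset* a)
{
    std::lock_guard<std::mutex> guard(table_lock);
    // With the lock held, no loader can find `a` in the table and
    // re-increment it, so observing zero really is conclusive.
    if (a->ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
    {
        table.erase(a->key);
        delete a;
    }
}
```

The cost is exactly what the next paragraph describes: every decrement now pays for a lock, even when the asset is nowhere near death.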

This isn't great for performance; it means we take a global lock (per class of art asset) on any reference count decrement, but I think we can survive this; we have already moved not only most loading but most unloading to asynchronous worker threads, so we're eating the lock in a place where we can afford to take our time.

A More Asynchronous Design

I can imagine a more asynchronous design:

The table of art assets itself holds one reference count.

When we decrement the reference count to one, we queue up an asynchronous table "garbage collection" to run on a worker thread.

The garbage collection takes the table lock once to find and clean out unused art assets, blocking loads for a small amount of time while the table is inspected.

Here the reference count decrement doesn't need to do any extra work unless the asset is "probably" finished (e.g. we get down to one remaining reference, held by the table), so transient reference count changes (e.g. we go from 3 to 4 and back to 3) remain as fast as we can perform an atomic operation with some acq/rel semantics.
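The asynchronous design could look something like this (all names hypothetical; for brevity the sweep runs inline here, where the real design would queue it to a worker thread):

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <string>

// Sketch of the asynchronous design: the table itself owns one reference,
// so a client decrement that lands on 1 means "probably unused" and
// schedules a garbage-collection sweep.
struct Asset {
    std::string      key;
    std::atomic<int> ref_count;    // the table's own reference counts as 1
};
static std::mutex                     table_lock;
static std::map<std::string, Asset*>  table;

void garbage_collect()
{
    std::lock_guard<std::mutex> guard(table_lock);
    for (auto it = table.begin(); it != table.end(); )
    {
        // Only the table holds this asset; with the lock held no loader
        // can resurrect it, so it is safe to drop.
        if (it->second->ref_count.load(std::memory_order_acquire) == 1)
        {
            delete it->second;
            it = table.erase(it);
        }
        else
            ++it;
    }
}

void release(Asset* a)
{
    // Fast path: one atomic op. Only a decrement that leaves just the
    // table's reference triggers any locking, via the deferred sweep.
    if (a->ref_count.fetch_sub(1, std::memory_order_acq_rel) == 2)
        garbage_collect();    // real design: queue to a worker thread
}
```

Because the table holds a reference, the asset cannot die between the client's decrement and the sweep; the sweep's check under the lock is the only place deletion is decided.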

I have not had a chance to profile the simple solution since trying to make it correct; if it turns out to be a perf problem I'll post a follow-up. The only time we've seen a perf problem is when the table lock was required for all operations (this was a design that we fixed almost a decade ago -- a design that was okay when it was single threaded and thus lock free).

The rendering path in X-Plane requires no reference count changes at all, so it is unaffected by anything on this path.

Wednesday, July 06, 2016

I've been meaning to post this: this is ASAN (Address Sanitizer) in Xcode 7 catching a memory scribble in an X-Plane beta.

I don't usually drink the Kool-Aid when it comes to Apple development tools. If you say "Swift Playground" my eyes roll all the way into the back of my head like in The Exorcist.

My skepticism with Apple tools comes from X-Plane being a big heavy app that aims to use 100% of a gamer-class PC; when run with developer tools on the Mac, the results sometimes aren't pretty. Instruments took several versions to reach a point where we could trace X-Plane without the tool blowing up.

ASAN is something else. It's so fast that it can run X-Plane (fully unoptimized debug build) with real settings, like what users use, at 7-10 fps with full address checking. That's not even on the same planet as Valgrind. (When we tried Valgrind on Linux, we didn't have the patience to find the fps - it never finished auditing the load sequence.)

In this crash ASAN has not only shown me where I have utterly stomped on memory, but it has also provided a full backtrace to where I relinquished the memory that I am now stomping on and where I first allocated it. That's a huge amount of information to get in a real-time run.

In this case the underlying bug was: we have a geometry accumulator that takes a big pile of geometry and stuffs it in a VBO when done. (What makes the class interesting is that it takes input geometry from multiple sources and sequences them into one big VBO for efficiency.)

A participating client can't delete their chunk of the VBO until after accumulation is finished, but there was no assert protecting us from this programming mistake. When code does delete its geometry "too early", the removed reference isn't properly tracked and stale pointers are used during the VBO build-up process, trashing memory.

Suffice it to say, ASAN's view of this is about the best you can hope for in this kind of scribble situation.