Tuesday, March 14, 2017

Introduction

In the last post we looked at our systems for doing data-driven rendering in Stingray. Today I will go through the two default rendering pipes we ship as templates with Stingray. Both are entirely described in data using two render_config files and a bunch of shader_source files.

We call them the “stingray renderer” and the “mini renderer”

Stingray Renderer

The “stingray renderer” is the default rendering pipe and is used in almost all template and sample projects. It’s a fairly standard “high-end” real-time rendering pipe and supports the regular buzzword features.

The render_config file is approx 1500 lines of sjson. While 1500 might sound a bit massive it’s important to remember that this configuration is highly configurable, pretty much all features can be dynamically switched on/off. It also run on a broad variety of different platforms (mobile -> consoles -> high-end PC), supports a bunch of different debug visualization modes, and features four different stereo rendering paths in addition to the default mono path.

If you are interested in taking a closer look at the actual implementation you can download stingray and you’ll find it under core/stingray_renderer/renderer.render_config.

Going through the entire file and all the implementation details would require multiple blog posts, instead I will try to do a high-level break down of the default layer_configuration and talk a bit about the feature set. Before we begin, please keep in mind that this rendering pipe is designed to handle lots of different content and run on lots of different platforms. A game project would typically use it as a base and then extend, optimize and simplify it based on the project specific knowledge of the content and target platforms.

Here’s a somewhat simplified dump of the contents of the layer_configs/default array found in core/stingray_renderer/renderer.render_config in Stingray v1.8:

So what we have above is a fairly standard breakdown of a rendered frame, if you have worked with real-time rendering before there shouldn’t be much surprises in there. Something that is kind of cool with having the frame flow in this representation and pairing that with the hot-reloading functionality of render_configs, is that it really encourages experimentations: move things around, comment stuff out, inject new resource generators, etc.

Let’s go through the frame in a bit more detail:

Extension insertion points

First of all there are a bunch of extension_insertion_point at various locations during the frame, these are used by render_config_extensions to be able to schedule work into an existing render_config. You could argue that an extensions system to the render_configs is a bit superfluous, and for an in-house game engine targeting a specific industry that might very well be the case. But for us the extension system allows building features a bit more modular, it also encourages sharing of various rendering features across teams.

Shadows

We start off by rendering shadow maps. As we want to handle shadow receiving on alpha blended geometry there’s no simple way to reuse our shadow maps by interleaving the rendering of them into the lighting code. Instead we simply gather all shadow casting lights, try to prioritize them based on screen coverage, intensity, etc. and then render all shadows into two shadow maps.

One shadow map is dedicated to handle a single directional light which uses a cascaded shadow map approach, rendering each cascade into a region of a larger shadow map atlas. The other shadow map is an atlas for all local light sources, such as spot and point lights (interpreted as 6 spot lights).

Clustered shading

We separate local light sources into two kinds: “simple” and “custom”. Simple lights are either spot lights or point lights that don’t have a custom material graph assigned. Simple light sources, which tend to be the bulk of all visible light sources in a frame, get inserted into a clustered shading acceleration structure.

While simple lights will affect both opaque and transparent materials, custom lights will only affect opaque geometry as they run a more traditional deferred shading path. We will touch on the lighting a bit more soon.

Here we use the layer system to record a bind and a clear for a few render targets into a RenderContext generated by the LayerManager.

Then, depending on if the vr_supported render setting is true or not we kick a resource generator that marks in the stencil buffer any pixels falling outside of the lens region. This resource generator only does something if the renderer is running in stereo mode. Also note that the branch above is a static_branch so if vr_supported is set to false the execution of the vr_mask resource generator will get eliminated completely during boot up of the renderer.

Next we lay down the gbuffer. We are using a fairly fat “floating” gbuffer representation. By floating I mean that we interpret the gbuffer channels differently depending on material. I won’t go into details of the gbuffer layout in this post but everything builds upon a standard metallic PBR material model, same as most modern engines runs today. We also stash high precision motion vectors to be able to do accurate reprojection for TAA, RGBM encoded irradiance from light maps (if present, else irradiance is looked up from an IBL probe), high precision normals, AO, etc. Things quickly add up, in the default configuration on PC we are looking at 192 bpp for the color targets (i.e not counting depth/stencil). The gbuffer layout could use some love, I think we should be able to shrink it somewhat without losing any features.

We then kick a resource generator called stabilize_and_linerize_depth, this resource generator does two things:

It linearizes the depth buffer and stores the result in an R32F target using a fullscreen_pass.

It does a hacky TAA resolve pass for depth in an attempt to remove some intersection flickering for materials rendering after TAA resolve. We call the output of this pass stable_depth and use it when rendering editor selections, gizmos, debug lines, etc. We also use this buffer during post processing for any effects that depends on depth (e.g. depth of field) as those runs after AA resolve.

After that we have another more minimalistic gbuffer layer for splatting deferred decals.

Last but not least we kick another resource generator that calculates per pixel velocity for any pixels that haven’t been rendered to during the gbuffer pass (i.e skydome).

At this point we are fully done with the gbuffer population and are ready to do some lighting. We start by laying down the indirect specular / reflections into a separate buffer. We use a rather standard three-step fallback scheme for our reflections: screen-space reflections, falling back to localized parallax corrected pre-convoluted radiance cubemaps, falling back to a global pre-convoluted radiance cubemap.

The reflections layer is the target layer for all cubemap based reflections. We are naively rendering the cubemap reflections by treating each reflection probe as a light source with a custom material. These lights gets picked up by a resource generator performing traditional deferred shading - i.e it renders proxy volumes for each light. One thing that some people struggle to wrap their heads around is that the resource generator responsible for running the deferred shading modifier isn’t kicked until a few lines down (in the lighting resource generator). If you’ve paid attention in my previous posts this shouldn’t come as a surprise for you, as what we describe here is the GPU scheduling of a frame, nothing else.

When the reflection probes are laid down we move on and run a resource generator for doing Screen-Space Reflections. As SSR typically runs in half-res we store the result in a separate render target.

We then finally kick the lighting resource generator, which is responsible for the following:

Build a screen space mask for sun shadows, this is done by running multiple fullscreen_passes. The fullscreen_passes transform the pixels into cascaded shadow map space and perform PCF. Stencil culling makes sure the shader only runs for pixels within a certain cascade.

SSAO with a bunch of different quality settings.

A fullscreen pass we refer to as the “global lighting” pass. This is the pass that does most of the heavy lifting when it comes to the lighting. It handles mixing SSR with probe reflections, mixing of SSAO with material AO, lighting from all simple lights looked up from the clustered shading structure as well as calculates sun lighting masked with the result from sun shadow mask (step 1).

Run a traditional deferred shading modifier for all light sources that has a material graph assigned. If the shader doesn’t target a specific layer the lights proxy volume will be rendered at this point, else it will be scheduled to render into whatever layer the shader has specified.

At this point we have a fully lit HDR output for all of our opaque materials.

debug_visualization - Kick of a resource generator for doing debug rendering. When debug rendering is enabled, the post processing pipe is disabled so we can render straight to the output target / back buffer here. Note: This doesn’t need to be scheduled exactly here, it could be moved later down the pipe.

fog - Kick of a resource generator for blending fog into the accumulation target.

skydome - Layer for rendering anything skydome related.

hdr_transparent - Layer for rendering transparent materials, traditional forward shading using the clustered shading acceleration structure for lighting. VFX with blending usually also goes into this layer.

stream_capture_buffer - Arbitrary location for capturing various render targets and dumping them into system memory.

cubemap_capture - Capturing point for reflection cubemap probes.

selection - Layer for rendering selection outlines.

So basically a bunch of miscellaneous stuff that needs to happen before we enter post processing…

Post Processing

Up until this point we’ve been in linear color space accumulating lighting into a 4xf16 render target (hdr0). Now its time to take that buffer and push it through the post processing resource generator.

The post processing pipe in the Stingray Renderer does:

Temporal AA resolve

Depth of Field

Motion Blur

Lens Effects (chromatic aberration, distortion)

Bloom

Auto exposure

Scene Combine (exposure, tone map, sRGB, LUT color grading)

Debug rendering

All steps of the post processing pipe can dynamically be enabled/disabled (not entirely true, we will always have to run some variation of step 7 as we need to output our result to the back buffer).

Before we present we allow rendering of unlit geometry in LDR (mainly used for HUDs and debug rendering), potentially do some more debug rendering and if we’re in VR mode we kick a resource generator that handles left/right eye combining (if needed).

That’s it - a very high-level breakdown of a rendered frame when running Stingray with the default “Stingray Renderer” render_config file.

Mini Renderer

We also have a second rendering pipe that we ship with Stingray called the “Mini Renderer” - mini as in minimalistic. It is not as broadly used as the Stingray Renderer so I won’t walk you through it, just wanted to mention it’s there and say a few words about it.

The main design goal behind the mini renderer was to build a rendering pipe with as little overhead from advanced lighting effects and post processing as possible. It’s primarily used for doing mobile VR rendering. High-resolution, high-performance rendering on mobile devices is hard! You pretty much need to avoid all kinds of fullscreen effects to hit target frame rate. Therefore the mini renderer has a very limited feature set:

It’s a forward renderer. While it’s capable of doing per pixel lighting through clustered shading it rarely gets used, instead most applications tend to bake their lighting completely or run with only a single directional light source.

No post processing.

While all lighting is done in linear color space we don’t store anything in HDR, instead we expose, tonemap and output sRGB directly into an LDR target (usually directly to the back buffer).

The mini_renderer.render_config file is ~400 lines, i.e. less than 1/3 of the stingray renderer. It is still in a somewhat experimental state but is the fastest way to get up and running doing mobile VR. I also feel that it makes sense for us to ship an example of a more lightweight rendering pipe; it is simpler to follow than the render_config for the full stingray renderer, and it makes it easy to grasp the benefits of data-driven rendering compared to a more static hard-coded rendering pipe (especially if you don’t have source access to the full engine as then the hard-coded rendering pipe would likely be a complete black box for the user).

Wrap up

I realize that some of you might have hoped for a more complete walkthrough of the various lighting and post processing techniques we use in the Stingray renderer. Unfortunately that would have become a very long post and also it feels a bit out of context as my goal with this blog series has been to focus on the architecture of the stingray rendering pipe rather than specific rendering techniques. Most of the techniques we use can probably be considered “industry standard” within real-time rendering nowadays. If you are interested in learning more there are lots of excellent information available, to name a few:

Thursday, March 9, 2017

Introduction

With all the low-level stuff in place it’s time to take a look at how we drive rendering in Stingray, i.e how a final frame comes together. I’ve covered this in various presentations over the years but will try do go through everything again to give a more complete picture of how things fit together.

Stingray features what we call a data-driven rendering pipe, basically what we mean by that is that all shaders, GPU resource creation and manipulation, as well as the entire flow of a rendered frame is defined in data. In our case the data is a set of different json files.

These json-files are hot-reloadable on all platforms, providing a nice workflow with fast iteration times when experimenting with various rendering techniques. It also makes it easy for a project to optimize the renderer for its specific needs (in terms of platforms, features, etc.) and/or to push it in other directions to better suit the art direction of the project.

There are four different types of json-files driving the Stingray renderer:

.shader_node - shader source and meta data used by the graph based shader system.

Today we will be looking at the render_config, both from a user’s perspective as well as how it works on the engine side.

Meet the render_config

The render_config is a sjson file describing everything from which render settings to expose to the user to the flow of an entire rendered frame. It can be broken down into four parts: render settings, resource sets, layer configurations and resource generators. All of which are fairly simple and minimalistic systems on the engine side.

Render Settings & Misc

Render settings is a simple key:value map exposed globally to the entire rendering pipe as well as an interface for the end user to peek and poke at. Here’s an example of how it might look in the render_config file:

As you will see we have branching logics for most systems in the render_config which allows the renderer to take different paths depending on the state of properties in the render_settings. There is also a block called render_caps which is very similar to the render_settings block except that it is read only and contains knowledge of the capabilities of the hardware (GPU) running the engine.

On the engine side there’s not that much to cover about the render_settings and render_caps, keys are always strings getting murmur hashed to 32 bits and the value can be a bool, float, array of floats or another hashed string.

When booting the renderer we populate the render_settings by first reading them from the render_config file, then looking in the project specific settings.ini file for potential overrides or additions, and last allowing to override certain properties again from the user’s configuration file (if loaded).

The render_caps block usually gets populated when the RenderDevice is booted and we’re in a state where we can enumerate all device capabilities. This makes the keys and values of the render_caps block somewhat of a black box with different contents depending on platform, typically they aren’t that many though.

So that covers the render_settings and render_caps blocks, we will look at how they are actually used for branching in later sections of this post.

There are also a few other miscellaneous blocks in the render_config, most important being:

shader_pass_flags - Array of strings building up a bit flag that can be used to dynamically turn on/off various shader passes.

shader_libraries - Array of what shader_source files to load when booting the renderer. The shader_source files are libraries with pre-compiled shader libraries mainly used by the resource generators.

Resource Sets

We have the concept of a RenderResourceSet on the engine side, it simply maps a hashed string to a GPU resource. RenderResourceSets can be locally allocated during rendering, creating a form of scoping mechanism. The resources are either allocated by the engine and inserted into a RenderResourceSet or allocated through the global_resources block in a render_config file.

The RenderInterface owns a global RenderResourceSet populated by the global_resources array from the render_config used to boot the renderer.

So while the above example mainly shows how to create what we call DependentRenderTargets (i.e render targets that inherit its properties from another render target and then allow overriding properties locally), it can also create other buffers of various kinds.

We’ve also introduced the concept of a static_branch, there are two types of branching in the render_config file: static_branch and dynamic_branch. In the global_resource block only static branching is allowed as it only runs once, during set up of the renderer. (Note: The branch syntax is far from nice and we nowadays have come up with a much cleaner syntax that we use in the shader system, unfortunately it hasn’t made its way back to the render_config yet.)

So basically what this example boils down to is the creation of a set of render targets. The output_target is a bit special though, on PC and consoles we simply just setup an alias for an already created render target - the back buffer, while on gl based platforms we create a new separate render target. (This is because we render the scene up-side-down on gl-platforms to get consistent UV coordinate systems between all platforms.)

The other special case from the example above is the sun_shadow_map which grabs the resolution from a render_setting called sun_shadow_map_size. This is done because we want to expose the ability to tweak the shadow map resolution to the user.

When rendering a frame we typically pipe the global RenderResourceSet owned by the RenderInterface down to the various rendering systems. Any resource declared in the RenderResourceSet is accessible from the shader system by name. Each rendering system can at any point decide to create its own local version of a RenderResourceSet making it possible to scope shader resource access.

Worth pointing out is that the resources declared in the global_resource block of the render_config used when booting the engine are all allocated in the set up phase of the renderer and not released until the renderer is closed.

Layer Configurations

A render_config can have multiple layer_configurations. A Layer Configuration is essentially a description of the flow of a rendered frame, it is responsible for triggering rendering sub-systems and scheduling the GPU work for a frame. Here’s a simple example of a deferred rendering pipe:

Each line in the simple_deferred array specifies either a named layer that the shader system can reference to direct rendering into (i.e a renderable object, like e.g. a mesh, has shaders assigned and the shaders know into which layer they want to render - e.g gbuffer), or it can trigger a resource_generator.

The order of execution is top->down and the way the GPU scheduling works is that each line increments a bit in the “Layer System” bit range covered in the post about sorting.

On the engine side the layer configurations are managed by a system called the LayerManager, owned by the RenderInterface. It is a tiny system that basically just maps the named layer_config to an array of “Layers”:

sort_key - As mentioned above and in the post about how we do sorting, each layer gets a sort_key assigned from the “Layer System” bit range. By looking up the layer’s sort_key and using that when recording Commands to RenderContexts we get a simple way to reason about overall ordering of a rendered frame.

name - the shader system can use this name to look up the layer’s sort_key to group draw calls into layers.

depth_sort - describes how to encode the depth range bits of the sort key when recording a RenderJobPackage to a RenderContext. depth_sort is an enum that indicates if sorting should be done front-to-back or back-to-front.

render_targets - array of named render target resources to bind for this layer

depth_stencil_target - named render target resource to bind for this layer

resource_generator -

clear_flags - bit flag hinting if color, depth or stencil should be cleared for this layer

profiling_scope - used to record markers on the RenderContext that later can be queried for GPU timings and statistics.

When rendering a World (see: RenderInterface) the user passes a viewport to the render_world function, the viewport knows which layer_config to use. We look up the array of Layersfrom the LayerManager and record a RenderContext with state commands for binding and clearing render targets using the sort_keys from the Layer. We do this dynamically each time the user calls render_world but in theory we could cache the RenderContext between render_world calls.

The name Layer is a bit misleading as a layer also can be responsible for making sure that a ResourceGenerator runs, in practice a Layer is either a target for the shader system to render into or it is the execution point for a ResourceGenerator. It can in theory be both but we never use it that way.

Resource Generators

The Resource Generators is a minimalistic framework for manipulating GPU resources and triggering various rendering sub-systems. Similar to a layer configuration a resource generator is described as an array of “modifiers”. Modifiers get executed in the order they were declared. Here’s an example:

First modifier in the above example is a dynamic_branch. In contrast to a static_branch which gets evaluated during loading of the render_config, a dynamic_branch is evaluated each time the resource generator runs making it possible to take different paths through the rendering pipeline based on settings and other game context that might change over time. Dynamic branching is also supported in the layer_config block.

If the branch is taken (i.e if auto_exposure_enabled is true) the modifiers in the pass array will run.

The first modifier is of the type fullscreen_pass and is by far the most commonly used modifier type. It simply renders a single triangle covering the entire viewport using the named shader. Any resource listed in the inputs array is exposed to the shader. Any resource(s) listed in the outputs array are bound as a render target(s).

The second and third modifiers are of the type compute_kernel and will dispatch a compute shader. inputs array is the same as for the fullscreen_pass and uavs lists resources to bind as UAVs.

This is obviously a very basic example, but the idea is the same for more complex resource generators. By chaining a bunch of modifiers together you can create interesting rendering effects entirely in data.

Stingray ships with a toolbox of various modifiers, and the user can also extend it with their own modifiers if needed. Here’s a list of some of the other modifiers we ship with:

In Stingray we encourage building all lighting and post processing using resource generators. So far it has proved very successful for us as it gives great per project flexibility. To make sharing of various rendering effects easier we also have a system called render_config_extension that we rolled out last year, which is essentially a plugin system to the render_config files.

I won’t go into much detail how the resource generator system works on the engine side, it’s fairly simple though; There’s a ResourceGeneratorManager that knows about all the generators, each time the user calls render_world we ask the manager to execute all generators referenced in the layer_config using the layers sort key. We don’t restrain modifiers in any way, they can be implemented to do whatever and have full access to the engine. E.g they are free to create their own ResourceContexts, spawn worker threads, etc. When the modifiers for all generators are done executing we are handed all RenderContexts they’ve created and can dispatch them together with the contexts from the regular scene rendering. To get scheduling between modifiers in a resource generators correct we use the 32-bit “user defined” range in the sort key.

Future improvements

Before we wrap up I’d like to cover some ideas for future improvements.

The Stingray engine has had a data-driven renderer from day one, so it has been around for quite some time by now. And while the render_config has served us good so far there are a few things that we’ve discovered that could use some attention moving forward.

Scalability

The complexity of the default rendering pipe continues to increase as the demand for new rendering features targeting different industries (games, design visualization, film, etc.) increases. While the data-driven approach we have addresses the feature set scalability needs decently well, there is also an increasing demand to have feature parity across lots of different hardware. This tends to result in lots of branching in render_config making it a bit hard to follow.

In addition to that we also start seeing the need for managing multiple paths through the rendering pipe on the same platform, this is especially true when dealing with stereo rendering. On PC we currently we have 5 different paths through the default rendering pipe:

Mono - Traditional mono rendering.

Stereo - Old school stereo rendering, one render_world call per eye. Almost identical to the mono path but still there are some stereo specific work for assembling the final image that needs to happen.

Instanced Stereo - Using “hardware instancing” to do stereo propagation to left/right eye. Single scene traversal pass, culling using a uber-frustum. A bunch of shader patch up work and some branching in the render_config.

Nvidia Single Pass Stereo (SPS) - Somewhat similar to instanced stereo but using nvidia specific hardware for doing multicasting to left/right eye.

We estimate that the number of paths through the rendering pipe will continue to increase also for mono rendering, we’ve already seen that when we’ve experimented with explicit multi-GPU stuff under DX12. Things quickly becomes hairy when you aren’t running on a known platform. Also, depending on hardware it’s likely that you want to do different scheduling of the rendered frame - i.e its not as simple as saying: here are our 4 different paths we select from based on if the user has 1-4 GPUs in their systems, as that breaks down as soon as you don’t have the exact same GPUs in the system.

In the future I think we might want to move to an even higher level of abstraction of the rendering pipe that makes it easier to reason about different paths through it. Something that decouples the strict flow through the rendering pipe and instead only reasons about various “jobs” that needs to be executed by the GPUs and what their dependencies are. The engine could then dynamically re-schedule the frame load depending on hardware automatically… at least in theory, in practice I think it’s more likely that we would end up with a few different “frame scheduling configurations” and then select one of them based on benchmarking / hardware setup.

Memory

As mentioned earlier our system for dealing with GPU resources is very static, resources declared in the global_resource set are allocated as the renderer boots up and not released until the renderer is closed. On last gen consoles we had support for aliasing memory of resources of different types but we removed that when deprecating those platforms. With the rise of DX12/Vulkan and the move to 4K rendering this static resource system is in need of an overhaul. While we can (and do) try to recycle temporary render targets and buffers throughout the a frame it is easy to break some code path without noticing.

DX12 improvements

Today our system implicitly deals with binding of input resources to shader stages. We expose pretty much everything to the shader system by name and if a shader stage binds a resource for reading we don’t know about it until we create the RenderJobPackage. This puts us in a somewhat bad situation when it comes to dealing with resource transitions as we end up having to do some rather complicated tracking to inject resource barriers at the right places during the dispatch stage of the RenderContexts (See: RenderDevice).

We could instead enforce declaration of all writable GPU resources when they get bound as input to a layer or resource generator. As we already have explicit knowledge of when a GPU resource gets written to by a layer or resource generator, adding the explicit knowledge of when we read from one would complete the circle and we would have all the needed information to setup barriers without complicated tracking.