Wednesday, February 22, 2017

Today we will be looking at the RenderInterface. I’ve struggled a bit with deciding if it is worth covering this piece of the code or not, as most of the stuff described will likely feel kind of obvious. In the end I still decided to keep it to give a more complete picture of how everything fits together. Feel free to skim through it or sit tight and wait for the coming two posts that will dive into the data-driven aspects of the Stingray renderer.

The glue layer

The RenderInterface is responsible for tying together a bunch of rendering sub-systems. Some of which we have covered in earlier posts (like e.g the RenderDevice) and a bunch of other, more high-level, systems that forms the foundation of our data-driven rendering architecture.

The RenderInterface has a bunch of various responsibilities, including:

Tracking of windows and swap chains.

While windows are managed by the simulation thread, swap chains are managed by the render thread. The RenderInterface is responsible for creating the swap chains and keep track of the mapping between a window and a swap chain. It is also responsible for signaling resizing and other state information from the window to the renderer.

Managing of RenderWorlds.

As mentioned in the Overview post, the renderer has its own representation of game Worlds called RenderWorlds. The RenderInterface is responsible for creating, updating and destroying the RenderWorlds.

Owner of the four main building blocks of our data-driven rendering architecture: LayerManager, ResourceGeneratorManager, RenderResourceSet, RenderSettings

Will be covered in the next post (I’ve talked about them in various presentations before [1] [2]).

Owner of the shader manager.

Centralized repository for all available/loaded shaders. Controls scheduling for loading, unload and hot-reloading of shaders.

Owner of the render resource streamer.

While all resource loading is asynchronous in Stingray (See [3]), the resource streamer I’m referring to in this context is responsible for dynamically loading in/out mip-levels of textures based on their screen coverage. Since this streaming system piggybacks on the view frustum culling system, it is owned and updated by the RenderInterface.

The interface

In addition to being the glue layer, the RenderInterface is also the interface to communicate with the renderer from other threads (simulation, resource streaming, etc.). The renderer operates under its own “controller thread” (as covered in the Overview post), and exposes two different types of functions: blocking and non-blocking.

Blocking functions

Blocking functions will enforce a flush of all outstanding rendering work (i.e. synchronize the calling thread with the rendering thread), allowing the caller to operate directly on the state of the renderer. This is mainly a convenience path when doing bigger state changes / reconfiguring the entire renderer, and should typically not be used during game simulation as it might cause stuttering in the frame rate.

Typical operations that are blocking:

Opening and closing of the RenderDevice.

Sets up / shuts down the graphics API by calling the appropriate functions on the RenderDevice.

Creation and destruction of the swap chains.

Creating and destroying swap chains associated to a Window. Done by forwarding the calls to the RenderDevice.

Loading of the render_config / configuring the data-driven rendering pipe.

The render_config is a configuration file describing how the renderer should work for a specific project. It describes the entire flow of a rendered frame and without it the renderer won’t know what to do. It is the RenderInterface responsibility to make sure that all the different sub-systems (LayerManager, ResourceGeneratorManager, RenderResourceSet, RenderSettings) are set up correctly from the loaded render_config. More on this topic in the next post.

Loading, unloading and reloading of shaders.

The shader system doesn’t have a thread safe interface and is only meant to be accessed from the rendering thread. Therefor any loading, unloading and reloading of shaders needs to synchronize with the rendering thread.

Registering and unregistering of Worlds

Creates or destroys a corresponding RenderWorld and sets up mapping information to go from World* to RenderWorld*.

Non-blocking functions

Non-blocking functions communicates by posting messages to a ring-buffer that the rendering thread consumes. Since the renderer has its own representation of a “World” there is not much communication over this ring-buffer, in a normal frame we usually don’t have more than 10-20 messages posted.

Main interface for rendering of a world viewed from a certain Camera into a certain Viewport. The ShadingEnvironment is basically just a set of shader constants and resources defined in data (usually containing a description of the lighting environment, post effects and similar). swap_chain is a handle referencing which window that will present the final result.

When the user calls this function a RenderWorldMsg will be created and posted to the ring buffer holding handles to the rendering representations for the world, camera, viewport and shading environment. When the message is consumed by rendering thread it will enter the first of the three stages described in the Overview post - Culling.

Reflection of state from a World to the RenderWorld.

Reflects the “state delta” (from the last frame) for all objects on the simulation thread over to the render thread. For more details see [4].

Synchronization.

uint32_t create_fence();
void wait_for_fence(uint32_t fence);

Synchronization methods for making sure the renderer is finished processing up to a certain point. Used to handle blocking calls and to make sure the simulation doesn’t run more than one frame ahead of the renderer.

Presenting a swap chain.

void present_frame(uint32_t swap_chain = 0);

When the user is done with all rendering for a frame (i.e has no more render_world calls to do), the application will present the result by looping over all swap chains touched (i.e referenced in a previous call to render_world) and posting one or many PresentFrameMsg messages to the renderer.

Providing statistics from the RenderDevice.

As mentioned in the RenderContext post, we gather various statistics and (if possible) GPU timings in the RenderDevice. Exactly what is gathered depends on the implementation of the RenderDevice. The RenderInterface is responsible for providing a non blocking interface for retrieving the statistics. Note: the statistics returned will be 2 frames old as we update them after the rendering thread is done processing a frame (GPU timings are even older). This typically doesn’t matter though as usually they don’t fluctuate much from one frame to another.

Generic callback mechanics to easily inject code to be executed by the rendering thread.

Creation, dispatching and releasing of RenderContexts and RenderResourceContexts.

While most systems tends to create, dispatch and release RenderContexts and RenderResourceContexts from the rendering thread there can be use cases for doing it from another thread (e.g. the resource thread creates RenderResourceContexts). The RenderInterface provides the necessary functions for doing so in a thread-safe way without having to block the rendering thread.

Wrap up

The RenderInterface in itself doesn’t get more interesting than that. Something needs to be responsible for coupling of various rendering systems and manage the interface for communicating with the controlling thread of the renderer - the RenderInterface is that something.

In the next post we will walk through the various components building the foundation of the data-driven rendering architecture and go through some examples of how to configure them to do something fun from the render_config file.

The RenderDevice has a bunch of helper functions for initializing/shutting down the graphics APIs, creating/destroying swap chains, etc. All of which are fairly straightforward so I won’t cover them in this post, instead I will put my focus on the two dispatch functions consuming RenderResourceContexts and RenderContexts:

Resource Management

As covered in the post about RenderResourceContexts, they provide a free-threaded interface for allocating and deallocating GPU resources. However, it is not until the user has called RenderDevice::dispatch() handing over the RenderResourceContexts as their representation gets created on the RenderDevice side.

All implementations of a RenderDevice have some form of resource management that deals with creating, updating and destroying of the graphics API specific representations of resources. Typically we track the state of all various types of resources in a single struct, here’s a stripped down example from the DX12 RenderDevice implementation called D3D12ResourceContext:

As you might remember, the linking between the engine representation and the RenderDevice representation is done using the RenderResource::render_resource_handle. It encodes both the type of the resource as well as a handle. The resource_lut is an indirection to go from the engine handle to a local index for a specific type (e.g vertex_buffers or index_buffers in the sample above). We also track freed indices for each type (e.g. unused_vertex_buffers) to simplify recycling of slots.

The implementation of the dispatch function is fairly straight forward. We simply iterate over all the RenderResourceContexts and for each context iterate over its commands and either allocate or deallocate resources in the D3D12ResourceContext. It is important to note that this is a synchronous operation, nothing else is peeking or poking on the D3D12ResourceContext when the dispatch of RenderResourceContexts is happening, which makes our life a lot easier.

Unfortunately that isn’t the case when we dispatch RenderContexts as in that case we want to go wide (i.e. forking the workload and process it using multiple worker threads) when translating the commands into API calls. While we don’t allow allocating and deallocating new resources from the RenderContexts we do allow updating them which mutates the state of the RenderDevice representations (e.g. a D3D12VertexBuffer).

At the moment our solution for this isn’t very nice, basically we don’t allow asynchronous updates for anything else than DYNAMIC buffers. UPDATABLE buffers are always updated serially before we kick the worker threads no matter what their sort_key is. All worker threads access resources through their own copy of something we call a ResourceAccessor, it is responsible for tracking the worker threads state of dynamic buffers (among other things). In the future I think we probably should generalize this and treat UPDATABLE buffers in a similar way.

(Note: this limitation doesn’t mean you can’t update an UPDATABLE buffer more than once per frame, it simply means you cannot update it more than once per dispatch).

Shaders

Resources in the D3D12ResourceContext are typically buffers. One exception that stands out is the RenderDevice representation of a “shader”. A “shader” on the RenderDevice side maps to a ShaderTemplate::Context on the engine side, or what I guess we could call a multi-pass shader. Here’s some pseudo code:

The pseudo code above is essentially the RenderDevice representation of a shader that we serialize to disk during data compilation. From that we can create all the necessary graphics API specific objects expressing an executable shader together with its various state blocks (Rasterizer, Depth Stencil, Blend, etc.).

As discussed in the last post the sort_key encodes the shader pass index. Using Shader::sort_mode, we know which bit range to extract from the sort_key as pass index, which we then use to look up the ShaderPass from Shader::passes. A ShaderPass contains one ShaderProgram per active shader stage and each ShaderProgram contains the byte code for the shader to compile as well as “bind info” for various resources that the shader wants as input.

We will look at this in a bit more detail in the post about “Shaders & Materials”, for now I just wanted to familiarize you with the concept.

Render Context translation

Let’s move on and look at the dispatch for translating RenderContexts into graphics API calls:

This function basically just takes the RenderContext::Commands from all RenderContexts and merges them into a new array, runs a stable radix sort, and returns the sorted commands in output. To avoid memory allocations the RenderDevice implementation owns the memory of the output buffer.

Now we have all the commands nicely sorted based on their sort_key. Next step is to do the actual translation of the data referenced by the commands into graphics API calls. I will explain this process with the assumption that we are running on a graphics API that allows us to build graphics API command lists in parallel (e.g. DX12, GNM, Vulkan, Metal), as that feels most relevant in 2017.

Before we start figuring out our per thread workloads for going wide, we have one more thing to do; “instance merging”.

Instance Merging

I’ve mentioned the idea behind instance merging before [1,2], basically we want to try to reduce the number of RenderJobPackages (i.e. draw calls) by identifying packages that are similar enough to be merged. In Stingray “similar enough” basically means that they must have identical inputs to the input assembler as well as identical resources bound to all shader stages, the only thing that is allowed to differ are constant buffer variables. (Note: by todays standards this can be considered a bit old school, new graphics APIs and hardware allows to tackle this problem more aggressively using “bindless” concepts. )

The way it works is by filtering out ranges of RenderContexts::Commands where the “instance bit” of the sort_key is set and all bits above the instance bit are identical. Then for each of those ranges we fork and go wide to analyze the actual RenderJobPackage data to see if the instance_hash and the shader are the same, and if so we know its safe to merge them.

The actual merge is done by extracting the instance specific constants (these are tagged by the shader author) from the constant buffers and propagating them into a dynamic RawBuffer that gets bound as input to the vertex shader.

Depending on how the scene is constructed, instance merging can significantly reduce the number of draw calls needed to render the final scene. The instance merger in itself is not graphics API specific and is isolated in its own system, it just happens to be the responsibility of the RenderDevice to call it. The interface looks like this:

Pass in a reference to the sorted RenderContext::Commands in merged_commands and after the instance merger is done running you hopefully have fewer commands in the array. :)

You could argue that merging, sorting and instance merging should all happen before we enter the world of the RenderDevice. I wouldn’t argue against that.

Prepare workloads

Last step before we can start translating our commands into state / draw / dispatch calls is to split the workload into reasonable chunks and prepare the execution contexts for our worker threads.

Typically we just divide the number of RenderContext::Commands we have to process with the number of worker threads we have available. We don’t care about the type of different commands we will be processing and trying to load balance differently. The reasoning behind this is that we anticipate that draw calls will always represent the bulk of the commands and the rest of the commands can be considered as unavoidable “noise”. We do, however, make sure that we don’t do less than x-number of commands per worker threads, where x can differ a bit depending on platform but is usually ~128.

For each execution context we create a ResourceAccessors (described above) as well as make sure we have the correct state setup in terms of bound render targets and similar. To do this we are stuck with having to do a synchronous serial sweep over all the commands to find bigger state changing commands (such as RenderContext::set_render_target).

This is where the Command::command_flags bit-flag comes into play, instead of having to jump around in memory to figure out what type of command the Command::head points to, we put some hinting about the type in the Command::command_flags, like for example if it is a “state command”. This way the serial sweep doesn’t become very costly even when dealing with large number of commands. During this sweep we also deal with updating of UPDATABLE resources, and on newer graphics APIs we track fences (discussed in the post about Render Contexts).

The last thing we do is to set up the execution contexts with create graphics API specific representations of command lists (e.g. ID3D12GraphicsCommandList in DX12),

Translation

When getting to this point doing the actual translation is fairly straight forward. Within each worker thread we simply loop over its dedicated range of commands, fetch its data from Command::head and generate any number of API specific commands necessary based on the type of command.

Note: In DX12 we also track where resource barriers are needed during this stage. After all worker threads are done we might also end up having to inject further resource barriers between the command lists generated by the worker threads. We have ideas on how to improve on this by doing at least parts of this tracking when building the RenderContexts but haven’t gotten around looking into it yet.

Execute

When the translation is done we pass the resulting command lists to the correct queues for execution.

Note: In DX12 this is a bit more complicated as we have to interleave signaling / waiting on fences between command list execution (ExecuteCommandList).

Next up

I’ve deliberately not dived into too much details in this post to make it a bit easier to digest. I think I’ve manage to cover the overall design of a RenderDevice though, enough to make it easier for people diving into the code for the first time.

With this post we’ve reached half-way through this series, we have covered the “low-level” aspects of the Stingray rendering architecture. As of next post we will start looking at more high-level stuff, starting with the RenderInterface which is the main interface for other threads to talk with the renderer.

Tuesday, February 14, 2017

Introduction

This post will focus on ordering of the commands in the RenderContexts. I briefly touched on this subject in the last post and if you’ve implemented a rendering engine before you’re probably not new to this problem. Basically we need a way to make sure our RenderJobPackages (draw calls) end up on the screen in the correct order, both from a visual point of view as well as from a performance point of view. Some concrete examples,

Make sure g-buffers and shadow maps are rendered before any lighting happens.

Make sure opaque geometry is rendered front to back to reduce overdraw.

Make sure transparent geometry is rendered back to front for alpha blending to generate correct results.

Make sure the sky dome is rendered after all opaque geometry but before any transparent geometry.

All of the above but also strive to reduce state switches as much as possible.

All of the above but depending on GPU architecture maybe shift some work around to better utilize the hardware.

There are many ways of tackling this problem and it’s not uncommon that engines uses multiple sorting systems and spend quite a lot of frame time getting this right.

Personally I’m a big fan of explicit ordering with a single stable sort. What I mean by explicit ordering is that every command that gets recorded to a RenderContext already has the knowledge of when it will be executed relative to other commands. For us this knowledge is in the form of a 64 bit sort_key, in the case where we get two commands with the exact same sort_key we rely on the sort being stable to not introduce any kind of temporal instabilities in the final output.

The reasons I like this approach are many,

It’s trivial to implement compared to various bucketing schemes and sorting of those buckets.

We only need to visit renderable objects once per view (when calling their render() function), no additional pre-visits for sorting are needed.

The sort is typically fast, and cost is isolated and easy to profile.

Parallel rendering works out of the box, we can just take all the Command arrays of all the RenderContexts and merge them before sorting.

To make this work each command needs to know its absolute sort_key. Let’s breakdown the sort_key we use when working with our data-driven rendering pipe in Stingray. (Note: if the user doesn’t care about playing nicely together with our system for data-driven rendering it is fine to completely ignore the bit allocation patterns described below and roll their own.)

Nothing to see here, moving on… (Not really sure why these 2 bits are unused, I guess they weren’t at some point but for the moment they are always zero) :)

7 bits - Layer System

This 7-bits range is managed by the “Layer system”. The Layer system is responsible for controlling the overall scheduling of a frame and is set up in the render_config file. It’s a central part of the data-driven rendering architecture in Stingray. It allows you to configure what layers to expose to the shader system and in which order these layers should be drawn. We will look closer at the implementation of the layer system in a later post but in the interest of clarifying how it interops with the sort_key here’s a small example:

Above we have three layers exposed to the shader system and one kick of a resource_generator called lighting (more about resource_generators in a later post). The layers are rendered in the order they are declared, this is handled by letting each new layer increment the 7 bits range belonging to the Layer System with 1 (as can be seen in the sort_key comments above).

The shader author dictates into which layer(s) it wants to render. When a RenderJobPackage is recorded to the RenderContext (as described in the last post) the correct layer sort_keys are looked up from the layer system and the result is bitwise ORed together with the sort_key value piped as argument to RenderContext::render().

3 bits - Shader System (Pass Deferred)

The next 3 bits are controlled by the Shader System. These three bits encode the shader pass index within a layer. When I say shader in this context I refer to our ShaderTemplate::Context which is basically a wrapper around multiple linked shaders rendering into one or many layers. (Nathan Reed recently blogged about “The Many Meanings of “Shader””, in his analogy our ShaderTemplate is the same as an “Effect”)

Since we can have a multi-pass shader rendering into the same layer we need to encode the pass index into the sort_key, that is what this 3 bit range is used for.

32 bits - User defined

We then have 32 user defined bits, these bits are primarily used by our “Resource Generator” system (I will be covering this system in the post about render_config & data-driven rendering later), but the user is free to use them anyway they like and still maintain compatibility with the data-driven rendering system.

1 bit - Instance bit

This single bit also comes from the Shader System and is set if the shader implements support for “Instance Merging”. I will be covering this in a bit more detail in my next post about the RenderDevice but essentially this bit allows us to scan through all commands and find ranges of commands that potentially can be merged together to fewer draw calls.

16 bits - Depth

One of the arguments piped to RenderContext::render() is an unsigned normalized depth value (0.0-1.0). This value gets quantized into these 16 bits and is what drives the front-to-back vs back-to-front sorting of RenderJobPackages. If the sorting criteria for the layer (see layer example above) is set to back-to-front we simply flip the bits in this range.

3 bits - Shader System (Pass Immediate)

A shader can be configured to run in “Immediate Mode” instead of “Deferred Mode” (default). This forces passes in a multi-pass shader to run immediately after each other and is achieved by moving the pass index bits into the least significant bits of the sort_key. The concept is probably easiest to explain with an artificial example and some pseudo code:

Take a simple scene with a few instances of the same mesh, each mesh recording one RenderJobPackages to one or many RenderContexts and all RenderJobPackages are being rendered with the same multi-pass shader.

In “Deferred Mode” (i.e pass indices encoded in the “Shader System (Pass Deferred)” range) you would get something like this:

foreach (pass in multi-pass-shader)
foreach (render-job in render-job-packages)
render (render-job)
end
end

If shader is configured to run in “Immediate Mode” you would instead get something like this:

foreach (render-job in render-job-packages)
foreach (pass in multi-pass-shader)
render (render-job)
end
end

As you probably can imagine the latter results in more shader / state switches but can sometimes be necessary to guarantee correctly rendered results. A typical example is when using multi-pass shaders that does alpha blending.

Wrap up

The actual sort is implemented using a standard stable radix sort and happens immediately after the user has called RenderDevice::dispatch() handing over n-number of RenderContexts to the RenderDevice for translation into graphics API calls.

Next post will cover this and give an overview of what a typical rendering back-end (RenderDevice) looks like in Stingray. Stay tuned.

Friday, February 10, 2017

Render Contexts Overview

In the last post we covered how to create and destroy various GPU resources. In this post we will go through the system we have for recording a stream of rendering commands/packages that later gets consumed by the render backend (RenderDevice) where they are translated into actual graphics API calls. We call this interface RenderContext and similar to RenderResourceContext we can have multiple RenderContexts in flight at the same time to achieve data parallelism.

Let’s back up and reiterate a bit what was said in the Overview post. Typically in a frame we take the result of the view frustum culling, split it up into a number of chunks, allocate one RenderContext per chunk and then kick one worker thread per chunk. Each worker thread then sequentially iterates over its range of renderable objects and calls their render() function. The render() function takes the chunk’s RenderContext as one of its argument and is responsible for populating it with commands. When all worker threads are done the resulting RenderContexts gets “dispatched” to the RenderDevice.

So essentially the RenderContext is the output data structure for the second stage Render as discussed in the Overview post.

The RenderContext is very similar to the RenderResourceContext in the sense that it’s a fairly simple helper class for populating a command buffer. There is one significant difference though; the RenderContext also has a mechanics for reasoning about the ordering of the commands in the buffer before they get translated into graphics API calls by the RenderDevice.

Ordering & Buffers

We need a way to reorder commands in one or many RenderContexts to make sure triangles end up on the screen in the right order, or more generally speaking; to schedule our GPU work.

There are many ways of dealing with this but my favorite approach is to just associate one or many commands with a 64 bit sort key and when all commands have been recorded simply sort them on this key before translating them into actual graphics API calls. The approach we are using in Stingray is heavily inspired by Christer Ericsson’s blog post “Order your graphics draw calls around!”. I will be covering our sorting system in more details in my next post, for now the only thing important to grasp is that while the RenderContext records commands it does so by populating two buffers. One is a simple array of a POD struct called Command:

sort_key - 64 bit sort key used for reordering commands before being consumed by the RenderDevice, more on this later.

head - Pointer to the actual data for this command.

command_flags - A bit flag encoding some hinting about what kind of command head is actually pointing to. This is simply an optimization to reduce pointer chasing in the RenderDevice, it will be covered in more detail in a later post.

Render Package Stream

The other buffer is what we call a RenderPackageStream and is what holds the actual command data. The RenderPackageStream class is essentially just a few helper functions to put arbitrary length commands into memory. The memory backing system for RenderPackageStreams is somewhat more complex than a simple array though, this is because we need a way to keep its memory footprint under control. For efficiency, we want to recycle the memory instead of reallocating it every frame, but depending on workload we are likely to get some RenderContexts becoming much larger than others. This creates a problem when using simple arrays to store the commands as the workload will shift slightly over time causing all arrays having to grow to fit the worst case scenario, resulting in lots of wasted memory.

To combat this we allocate and return fixed size blocks of memory from a pool. As we know the size of each command before writing them to the buffer we can make sure that a command doesn’t end up spanning multiple blocks; if we detect that we are about to run out of memory in the active block we simply allocate a new block and move on. If we detect that a single command will span multiple blocks we make sure to allocate them sequentially in memory. We return a block to the pool when we are certain that the consumer of the data (in this case the RenderDevice) is done with it. (This memory allocation approach is well described in Christian Gyrling’s excellent GDC 2015 presentation Parallelizing the Naughty Dog Engine Using Fibers)

You might be wondering why we put the sort_key in a separate array instead of putting it directly into the header data of the packages written to the RenderPackageStream, there are a number of reasons for that:

The actual package data can become fairly large even for regular draw calls. Since we want to make the packages self contained we have to put all data needed to translate the command into an graphics API call inside the package. This includes handles to all resources, constant buffer reflections and similar. I don’t know of any way to efficiently sort an array with elements of varying sizes.

Since we allocate the memory in blocks, as described above, we would need to introduce some form of “jump label” and insert that into the buffer to know how and when to jump into the next memory block. This would further complicate the sorting and traversal of the buffers.

It allows us to recycle the actual package data from one draw call to another when rendering multi-pass shaders as we simply can inject multiple Commands pointing to the same package data. (Which shader pass to use when translating the package into graphic API calls can later be extracted from the sort_key.)

We can reduce pointer chasing by encoding hints in the Command about the contents of the package data. This is what we do in command_flags mentioned earlier.

Render Context interface

With the low-level concepts of the RenderContext covered let’s move on and look at how it is used from a users perspective.

If we break down the API there are essentially three different types of commands that populates a RenderContext:

While state commands primarily are used for doing bigger graphics pipeline state changes (like e.g. changing render targets) they are also used for some miscellaneous things like clearing of bound render targets, pushing/poping timer markers, and some other stuff. There is no obvious reasoning for grouping these things together under the name “state commands”, it’s just something that has happened over time. Keep that in mind as we go through the list of commands below.

A bit more exotic commands

resource - Which RenderResource to bind to that slot (has to point to a VertexStream)

offset - A byte offset describing where to begin writing in the buffer pointed to by resource.

set_instance_multiplier(uint32_t multiplier);
Allows the user to scale the number instances to render for each render() call (described below). This is a convenience function to make it easier to implement things like Instanced Stereo Rendering.

Markers

push_marker(const char *name)
Starts a new marker scope named name. Marker scopes are both used for gathering RenderDevice statistics (number of draw calls, state switches and similar) as well as for creating GPU timing events. The user is free to nestle markers if they want to better group statistics. More on this in a later post.

First argument piped to render() is a pointer to a RenderJobPackage, and as you can see the function also returns a pointer to a RenderJobPackage. What is going on here is that the RenderJobPackage piped as argument to render() gets copied to the RenderPackageStream, the copy gets patched up a bit and then a pointer to the modified copy is returned to allow the caller to do further tweaks to it. Ok, this probably needs some further explanation…

The RenderJobPackage is basically a header followed by an arbitrary length of data that together contains everything needed to make it possible for the RenderDevice to later translate it into either a draw call or a compute shader dispatch. In practice this means that after the RenderJobPackage header we also pack RenderResource::render_resource_handle for all resources to bind to all different shader stages as well as full representations of all non-global shader constant buffers.

Since we are building multiple RenderContexts in parallel and might be visiting the same renderable object (mesh, particle system, etc) simultaneously from multiple worker threads, we cannot mutate any state of the renderable when calling its render() function.

Typically all renderable objects have static prototypes of all RenderJobPackages they need to be drawn correctly (e.g. a mesh with three materials might have three RenderJobPackages - one per material). Naturally though, the renderable objects don’t know anything about in which context they will be drawn (e.g. from what camera or in what kind of lighting environment) up until the point where their render() function gets called and the information is provided. At that point their static RenderJobPackages prototypes somehow needs to be patched up with this information (which typically is in the form of shader constants and/or resources).

One way to handle that would be to create a copy of the prototype RenderJobPackage on the stack, patch up the stack copy and then pipe that as argument to RenderContext::render(). That is a fully valid approach and would work just fine, but since RenderContext::render() needs to create a copy of the RenderJobPackage anyway it is more efficient to patch up that copy directly instead. This is the reason for RenderContext::render() returning a pointer to the RenderJobPackage on the RenderPackageStream.

Before diving into the RenderJobPackage struct let’s go through the other arguments of RenderContext::render():

shader_context

We will go through this in more detail in the post about our shader system but essentially we have an engine representation called ShaderTemplate, each ShaderTemplate has a number of Contexts.

A Context is basically a description of any rendering passes that needs to run for the RenderJobPackage to be drawn correctly when rendered in a certain “context”. E.g. a simple shader might declare two contexts: “default” and “shadow”. The “default” context would be used for regular rendering from a player camera, while the “shadow” context would be used when rendering into a shadow map.

What I call a “rendering pass” in this scenario is basically all shader stages (vertex, pixel, etc) together with any state blocks (rasterizer, depth stencil, blend, etc) needed to issue a draw call / dispatch a compute shader in the RenderDevice.

interleave_sort_key

RenderContext::render() automatically figures out what sort keys / Commands it needs to create on it’s command array. Simple shaders usually only render into one layer in a single pass. In those scenarios RenderContext::render() will create a single Command on the command array. When using a more complex shader that renders into multiple layers and/or needs to render in multiple passes; more than one Command will be created, each command referencing the same RenderJobPackage in its Command::head pointer.

This can feel a bit abstract and is hard to explain without giving you the full picture of how the shader system works together with the data-driven rendering system which in turn dictates the bit allocation patterns of the sort keys, for now it’s enough to understand that the shader system somehow knows what Commands to create on the command array.

The shader author can also decide to bypass the data-driven rendering system and put the scheduling responsibility entirely in the hands of the caller of RenderContext::render(), in this case the sort key of all Commands created will simply become 0. This is where the interleave_sort_key comes into play, this variable will be bitwise ORed with the sort key before being stored in the Command.

shader_pass_branch_key

The shader system has a feature for allowing users to dynamically turn on/off certain rendering passes. Again this becomes somewhat abstract without providing the full picture but basically this system works by letting the shader author flag certain passes with a “tag”. A tag is simply a string that gets mapped to a bit within a 64 bit bit-mask. By bitwise ORing together multiple of these tags and piping the result in shader_pass_branch_key the user can control what passes to activate/deactivate when rendering the RenderJobPackage.

job_sort_depth

A signed normalized [0-1] floating point value used for controlling depth sorting between RenderJobPackages. As you will see in the next post this value simply gets mapped into a bit range of the sort key, removing the need for doing any kind of special trickery to manage things like back-to-front / front-to-back sorting of RenderJobPackages.

gpu_affinity_mask

Same as the gpu_affinity_mask parameter piped to begin_state_command().

Most of these are self explanatory, I think the only thing worth pointing out is the front_face enum. This is here to dynamically handle flipping of the primitive winding order when dealing with objects that are negatively scaled on an uneven number of axes. For typical game content it’s rare that we see content creators using mesh mirroring when modeling, for other industries however it is a normal workflow.

struct ComputeInfo
{
uint32_t thread_count[3];
bool async;
};

So while BatchInfo mostly holds the parameters needed to render something, ComputeInfo hold the parameters to dispatch a compute shader. The three element array thread_count containing the thread group count for x, y, z. If async is true the graphics API’s “compute queue” will be used instead of the “graphics queue”.

resource_offset

Byte offset from start of RenderJobPackage to an array of n_resources with RenderResource::Handle. Resources found in this array can be of the type VertexStream, IndexStream or VertexDeclaration. Based on the their type and order in the array they get bound to the input assembler stage in the RenderDevice.

shader_resource_data_offset

Byte offset from start of RenderJobPackage to a data block holding handles to all RenderResources as well as all constant buffer data needed by all the shader stages. The layout of this data blob will be covered in the post about the shader system.

instance_hash

We have a system for doing what we call “instance merging”, this system figures out if two RenderJobPackages only differ on certain shader constants and if so merges them into the same draw call. The shader author is responsible but not required to implement support for this feature. If the shader supports “instance merging” the system will use the instance_hash to figure out if two RenderJobPackages can be merged or not. Typically the instance_hash is simply a hash of all RenderResource::Handle that the shader takes as input.

resource_tag & object_tag & batch_tag

Three levels of debug information to make it easier to back track errors/warning inside the RenderDevice to the offending content.

3. Resource updates

The last type of commands are for dynamically updating various RenderResources (Vertex/Index/Raw buffers, Textures, etc).

This function basically returns a pointer to the first byte of the buffer that will replace the contents of the resource. map_write() figures out the size of the buffer by casting the resource to the correct type (using the type information encoded in the RenderResource::render_resource_handle). It then allocates memory for the buffer and a small header on the RenderPackageStream and returns a pointer to the buffer.

sort_key & shader_context & shader_pass_branch_key

In some rare situations you might need to update the same buffer with different data multiple times within a frame. A typical example could be the vertex buffer of a particle system implementing some kind of level-of-detail system causing the buffers to change depending on e.g camera position. To support that the user can provide a bunch of extra parameters to make sure the contents of the GPU representation of the buffer is updated right before the graphics API draw calls are triggered for the different rendering passes. This works in a similar way how RenderContext::render() can create multiple Commands on the command array referencing the same data.

Unless you need to update the buffer multiple times within the frame it is safe to just set all of the above mentioned parameters to 0, making it very simple to update a buffer:

void *buf = rc.map_write(resource, 0);
// .. fill bits in buffer ..

Note: To shorten the length of this post I’ve left out a few other flavors of updating resources, but map_write is the most important one to grasp.

GPU Queues, Fences & Explicit MGPU programming

Before wrapping up I’d like to touch on a few recent additions to the Stingray renderer, namely how we’ve exposed control for dealing with different GPU Queues, how to synchronize between them and how to control, communicate and synchronize between multiple GPUs.

New graphics APIs such as DX12 and Vulkan exposes three different types of command queues: Graphics, Compute and Copy. There’s plenty of information on the web about this so I won’t cover it here, the only thing important to understand is that these queues can execute asynchronously on the GPU; hence we need to have a way to synchronize between them.

To handle that we have exposed a simple fence API that looks like this:

Here’s a pseudo code snippet showing how to synchronize between the graphics queue and the compute queue:

uint64_t sort_key = 0;
// record a draw call
rc.render(graphics_job, graphics_shader, sort_key++);
// record an asynchronous compute job
// (ComputeInfo::async bool in async_compute_job is set to true to target the graphics APIs compute queue)
rc.render(async_compute_job, compute_shader, sort_key++);
// now lets assume the graphics queue wants to use the result of the async_compute_job,
// for that we need to make sure that the compute shader is done running
rc.wait_fence(IdString32("compute_done"), sort_key++, GRAPHICS_QUEUE);
rc.signal_fence(IdString32("compute_done"), sort_key++, COMPUTE_QUEUE);
rc.render(graphics_job_using_result_from_compute, graphics_shader2, sort_key++);

As you might have noticed all methods for populating a RenderContext described in this post also takes an extra parameter called gpu_affinity_mask. This is a bit-mask used for directing commands to one or many GPUs. The idea is simple, when we boot up the renderer we enumerate all GPUs present in the system and decide which one to use as our default GPU (GPU_DEFAULT) and assign that to bit 1. We also let the user decide if there are other GPUs present in the system that should be available to Stingray and if so assign them bit 2, 3, 4, and so on. By doing so we can explicitly direct control of all commands put on the RenderContext to one or many GPUs in a simple way.

As you can see that is also true for the fence API described above, on top of that there’s also a need for a copy interface to copying resources between GPUs:

Even though this work isn’t fully completed I still wanted to share the high-level idea of what we are working towards for exposing explicit MGPU control to the Stingray renderer. We are actively working on this right now and with some luck I might be able to revisit this with more concrete examples when getting to the post about the render_config & data-driven rendering.

Next up

With that I think I’ve covered the most important aspects of the RenderContext. Next post will dive a bit deeper into bit allocation ranges of the sort keys and the system for sorting in general, hopefully that post will become a bit shorter.

Wednesday, February 1, 2017

Render Resources

Before any rendering can happen we need a way to reason about GPU resources. Since we want all graphics API specific code to stay isolated we need some kind of abstraction on the engine side, for that we have an interface called RenderDevice. All calls to graphics APIs like D3D, OGL, GNM, Metal, etc. stays behind this interface. We will be covering the RenderDevice in a later post so for now just know that it is there.

We want to have a graphics API agnostic representation for a bunch of different types of resources and we need to link these representations to their counterparts on the RenderDevice side. This linking is handled through a POD-struct called RenderResource:

Any engine resource that also needs a representation on the RenderDevice side inherits from this struct. It contains a single member render_resource_handle which is used to lookup the correct graphics API specific representation in the RenderDevice.

The most significant 8 bits of render_resource_handle holds the type enum, the lower 24 bits is simply an index into an array for that specific resource type inside the RenderDevice.

Various Render Resources

Let’s take a look at the different render resource that can be found in Stingray:

Texture - A regular texture, this object wraps all various types of different texture layouts such as 2D, Cube, 3D.

RenderTarget - Basically the same as Texture but writable from the GPU.

DependentRenderTarget - Similar to RenderTarget but with logics for inheriting properties from another RenderTarget. This is used for creating render targets that needs to be reallocated when the output window (swap chain) is being resized.

BackBufferWrapper - Special type of RenderTarget created inside the RenderDevice as part of the swap chain creation. Almost all render targets are explicitly created by the user, this is the only exception as the back buffer associated with the swap chain is typically created together with the swap chain.

ShaderConstantBuffer - Shader constant buffers designed for explicit update and sharing between multiple shaders, mainly used for “view-global” state.

VertexStream - A regular Vertex Buffer.

VertexDeclaration - Describes the contents of one or many VertexStreams.

IndexStream - A regular Index Buffer.

RawBuffer - A linear memory buffer, can be setup for GPU writing through an UAV (Unordered Access View).

Shader - For now just think of this as something containing everything needed to build a full pipeline state object (PSO). Basically a wrapper over a number of shaders, render states, sampler states etc. I will cover the shader system in a later post.

Most of the above resources have a few things in common:

They describe a buffer either populated by the CPU or by the GPU

CPU populated buffers has a validity field describing its update frequency:

STATIC - The buffer is immutable and won’t change after creation, typically most buffers coming from DCC assets are STATIC.

UPDATABLE - The buffer can be updated but changes less than once per frame, e.g: UI elements, post processing geometry and similar.

DYNAMIC - The buffer frequently changes, at least once per frame but potentially many times in a single frame e.g: particle systems.

They have enough data for creating a graphics API specific representation inside the RenderDevice, i.e they know about strides, sizes, view requirements (e.g should an UAV be created or not), etc.

Render Resource Context

With the RenderResource concept sorted, we’ll go through the interface for creating and destroying the RenderDevice representation of the resources. That interface is called RenderResourceContext (RRC).

We want resource creation to be thread safe and while the RenderResourceContext in itself isn’t, we can achieve free threading by allowing the user to create any number of RRC’s they want, and as long as they don’t touch the same RRC from multiple threads everything will be fine.

Similar to many other rendering systems in Stingray the RRC is basically just a small helper class wrapping an abstract “command buffer”. On this command buffer we put what we call “packages” describing everything that is needed for creating/destroying RenderResource objects. These packages have variable length depending on what kind of object they represent. In addition to that the RRC can also hold platform specific allocators that allow allocating/deallocating GPU mapped memory directly, avoiding any additional memory shuffling in the RenderDevice. This kind of mechanism allows for streaming e.g textures and other immutable buffers directly into GPU memory on platforms that provides that kind of low-level control.

Handing it over directly to the RenderDevice requires the caller to be on the controller thread for rendering as RenderDevice::dispatch() isn’t thread safe. If the caller is on any other thread (like e.g. one of the worker threads or the resource streaming thread) RenderInterface::dispatch() should be used instead. We will cover the RenderInterface in a later post so for now just think of it as a way of piping data into the renderer from an arbitrary thread.

Wrap up

The main reason of having the RenderResourceContext concept instead of exposing allocate()/deallocate() functions directly in the RenderDevice/RenderInterface interfaces is for efficiency. We have a need for allocating and deallocating lots of resources, sometimes in parallel from multiple threads. Decoupling the interface for doing so makes it easy to schedule when in the frame the actual RenderDevice representations gets created, it also makes the code easier to maintain as we don’t have to worry about thread-safety of the RenderResourceContext.

In the next post we will discuss the RenderJobs and RenderContexts which are the two main building blocks for creating and scheduling draw calls and state changes.

Introduction

When we started writing Bitsquid back in mid 2009 all platforms we intended to run on were already multi-core architectures. This and the fact that we had some prior experience trying to get our last engine to run efficiently on the PS3 answered the question how not to architecture an efficient renderer that scales to many cores. We knew we needed more than functional parallelism, we wanted data-parallelism.

To solve that we divide the CPU view of a rendered frame into three stages:

Culling - Filter out visible renderable objects with respect to a camera from a potentially huge set of different type of objects (meshes, particle systems, lights, etc).

Render - Iterate over the filtered result from Culling and “record” an intermediate representation of draw calls/state switches to a command buffer.

Dispatch - Take result from Render and translate that into actual render API calls (D3D, OGL, Metal, GNM, etc).

As you can see each stage pipes its result into the next. Rendering is typically very simple in that sense; we tend to have a one way flow of our data: [[user input or time affects state, state propagates into changes of the renderable objects (transforms, shader constants, etc), figure out what need to be rendered, iterate over that and finally generate render API calls. Rinse & Repeat :]]

If we ignore the problem of ordering the final API calls in the rendering backend it’s fairly easy to see how we can achieve data parallelism in this scenario. Just fork at each stage splitting the workload into a n-chunks (where n is however many worker threads you can throw at it). When all workers are done for a stage take the result and pipe into the next stage.

In essence this is how all rendering in Stingray works. Obviously I’ve glanced over some rather important and challenging details but as you will see they are not too hard to solve if you have good control over your data flows and are picky about when mutation of the data happens.

Design Philosophies & Concepts

The rendering code in Stingray tends to be heavily influenced by Data Oriented Programming principles. When designing new systems our biggest efforts usually goes into structuring our data efficiently and thinking about its flow through the systems, more so than writing the actual code that transforms the data from one form to another.

To achieve data-parallelism throughout the rendering code the first thing to realize is that we have to be very picky about when mutation of the renderable objects happens. Multiple worker threads will run over our objects and its not unlikely that more than one thread visits the same object at the same time, hence we must not mutate the state of our objects in its render function. Therefore all of our render() functions are const.

To further guard ourselves from the outer world (i.e gameplay, physics, etc) the renderer operates in complete isolation from the game logics. It has its own representation of the data it needs, and only the data relevant for rendering. While the gameplay logics usually wants to reason about high-level concepts such as game entities (which basically groups a number of meshes, particle systems, lights, etc together), we on the rendering side don’t really care about that. We are much more interested in just having an array of all renderable objects in a game world, in a memory layout that makes it efficient to access.

Another nice thing with decoupling the representation of the renderable objects from the game objects is that it allows us to run simulation in parallel with rendering (functional parallelism). So while simulation is updating frame n the renderer is processing frame n-1. Some of you might argue that overlaying rendering on top of simulation doesn’t give any performance improvements if the work in all systems is nicely parallelized. In reality though this isn’t really the case. We still have systems that don’t go wide, or have certain sections where they need to do synchronous processing (last generation graphics APIs: e.g DX11, OpenGL are good examples). This creates bubbles in the frame slowing us down.

By overlaying simulation and rendering we get a form of bubble filling among the worker threads which in most cases gives a big enough speed improvement to justify the added complexity that comes from this architecture. More specifically:

Double buffering of state - since the simulation might mutate the state of an object for frame n at the same time as the renderer is processing frame n-1 any mutable state needs to be double buffered.

Life scope tracking of immutable data - while immutable/read only state such as static vertex and index buffers are safe to read by both simulation and renderer we still need to be careful not pulling the rug under the renderers feet by freeing anything still being in use by the renderer.

Here’s a conceptual graph showing the benefits of overlaying simulation and rendering:

So basically what we got here is two “controller threads”: simulation and render both offloading work to the worker threads. In the case that a controller thread is blocked waiting for some work to finish it will assist the worker threads striving to never sit idle. One thing to note is that to prevent frames from stacking up, we never allow the simulation thread to run more than one frame ahead of the render thread.

As a comparison here’s the same workload with simulation and rendering running in sequence.

As you can see we get significantly more idle time (bubbles) on the worker threads due to certain parts of both the simulation and rendering not being able to go wide.

Next up

I think this pretty much covers the high level view of the core rendering architecture in Stingray. Now lets go into some more detail.

Welcome

To simplify knowledge transferring inside the Autodesk development teams and in an attempt to improve my writing skills I’ve decided to do a walkthrough of the Stingray rendering architecture. The idea is to do this as a series of blog posts over the coming weeks starting from the low-level aspects of the renderer chewing my way up to more high-level concepts as I go.

I’ve covered some of these topics before in various presentations over the years but those have been more focused on how our data driven aspects of the renderer works and less on the core architecture behind it. This is an attempt to do a more complete walk-through of the entire rendering architecture.

When I started thinking about this it felt like an almost impossible undertaking considering how much slower I am at expressing myself in text than in code, but after spending a couple of days going through the entire stingray code base doing some spring cleaning it felt a bit more manageable so I’ve now decided to give it a try.

(Note: this has nothing at all to do with me feeling the pressure from Niklas Frykholm who’s currently doing a complete walk-through of the entire Stingray engine code base (well everything except rendering) as a series of youtube videos [1]. Not at all… I feel no pressure, no guilt, nothing… I promise… Thanks Niklas for pushing me!)

Outline

Below is some kind of outline of what I intend to cover and in what order, I might swap things around as I go if I discover it makes more sense. This post will work as an index and I will link to the posts as they come online.