Catflier

Wednesday, 22 February 2017

I did not post for a while; that doesn't mean I've been doing nothing, actually quite the opposite.

Lately I've been reworking a bit of my graph internals (basically, decoupling some bits), and I had a lot of ideas and implemented quite a few, so I decided I wanted to share some theory again.

First, I already spoke about the fact that multi-core evaluation is tricky to get right (basically, you need your nodes to do enough work to make it worthwhile, otherwise you'll have serious issues or just end up with something slower).

A second thought was to eventually have a compiled form of the graph. While that's a bit of a buzzword, I also believe that compiled graphs will not always be faster.

Why? Wouldn't removing evaluation overhead make things faster?

Yes, but by still having access to your model, there's a serious lot of optimizations you can perform, so the answer is: not always.

Let's take the following simple patch. As you can see, nothing really fancy, but it will fit perfectly.

1/A very basic graph

A/Lazy recursion

This is by far the simplest to implement (to be honest, you can probably do it in less than 100 lines of code).

Pseudo code:

Start with NodeFinal

function processnode(node)
    if node is already processed
        return
    for each input pin
        get upwards node (info is likely stored in a link structure)
        processnode(parent node connected to input pin)
    end for
    process node (some form of update/evaluate call)
    mark node as processed
end function

Yes, that's about it really, simple to understand and very idiomatic.
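Here's how that lazy recursion might look in Python (node names and the graph layout are just illustrative, not from any real patch; appending to a list stands in for the update call so the traversal order is visible):

```python
def process_node(node, processed, order):
    # Lazy recursion: walk upwards first, then evaluate, then mark processed.
    if node in processed:
        return
    for parent in graph[node]:          # parents found via the input pins/links
        process_node(parent, processed, order)
    order.append(node)                  # stands in for update(node)
    processed.add(node)

# Hypothetical dependencies mirroring the example patch (Time feeds 3 nodes).
graph = {
    "Time": [],
    "Node5": ["Time"],
    "Node2": ["Time", "Node5"],
    "Node9": ["Time"],
    "Node1": ["Node2", "Node9"],
    "Node12": ["Node1"],
    "NodeFinal": ["Node12"],
}

processed, order = set(), []
process_node("NodeFinal", processed, order)
# order: Time, Node5, Node2, Node9, Node1, Node12, NodeFinal
```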

Now first things that we can notice:

We will call processnode several times on some nodes (for example, Time is used by 3 different nodes)

In our case (since I did not introduce any form of "graph cutting" technique), we can also see that the order of execution is fixed (our graph will always run and process nodes in the same order).

So let's optimize that

B/Reversed List

Taking the above graph (I add the screenshot again):

We can deduce our dependency order (if I consider I'll scan for pins in a left to right fashion)

This list is correct, but replicates some elements; the filtered version is:

Time -> Node 5 -> Node 2 -> Node 9 -> Node 1 -> Node 12 -> Node final

To build the list, we perform a first lazy evaluation pass as seen above, but instead of calling Update, we simply add the node to a list.

Once we get that first list, building the reverse order is done as follows:

var reversedList = reverse(dependencyList)
var executionList = new EmptyList
foreach node in reversedList
    if not executionList.Contains(node) then executionList.Add(node)

Store that list somewhere, and then graph evaluation becomes:

foreach node in executionList
    update(node)
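The two passes above can be sketched like this in Python (using a hypothetical graph; the first pass deliberately records duplicates, which the reverse/dedupe pass then filters out):

```python
def collect(node, out):
    # First pass: record the node, then recurse upwards (duplicates allowed).
    out.append(node)
    for parent in graph[node]:
        collect(parent, out)

# Hypothetical dependencies, one entry per node listing its parents.
graph = {
    "Time": [],
    "Node5": ["Time"],
    "Node2": ["Time", "Node5"],
    "Node9": ["Time"],
    "Node1": ["Node2", "Node9"],
    "Node12": ["Node1"],
    "NodeFinal": ["Node12"],
}

dependency_list = []
collect("NodeFinal", dependency_list)

# Reverse and keep only the first occurrence of each node.
execution_list = []
for node in reversed(dependency_list):
    if node not in execution_list:
        execution_list.append(node)

# Evaluation is now a flat loop: for node in execution_list: update(node)
```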

As you can see, that removes recursion, and this is actually really easy to compile (just generate llvm/il/source code and compile it).

The issue, of course, is that each time the graph changes, you need to pre-parse it again, which is not always advisable at runtime (the compilation step might severely hurt).

Now let's add some features in our graph

2/Lazy pins (and virtual links)

Let's take the following graph.

Here we consider that the LargeData and Generator1 nodes are really heavy; they feed some data into Builder, but do not need frequent updates.

Builder has been built with its first pin Lazy (eg: only evaluate upwards if the second pin, something like Apply, is true, which is dealt with somehow by the conditions upwards).

The graph is then decomposed into the following path:

Builder -> And -> Condition1 -> UserInput -> Condition2

If the second pin (eg: the result of the And node) is true:

LargeData -> Generator1 -> Time

As you can see, our evaluation is now split into 2 different parts (basically, everything non-lazy first, then eventually the lazy part if required).

This is suddenly harder to convert into list (and kinda difficult to auto compile).

So suddenly the idiomatic purely lazy version above is not so bad after all. You only need to add a condition on the node to tell if you want to go upwards or not.
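A quick Python sketch of a lazy pin, modeling the lazy input as a thunk (all names here are made up for illustration):

```python
calls = []

def heavy_branch():
    # Stands in for the LargeData/Generator1 chain: only runs when pulled.
    calls.append("heavy")
    return 42

def builder(apply_flag, lazy_pin):
    # The lazy pin is a callable: we only walk upwards when Apply is true.
    return lazy_pin() if apply_flag else None
```

So the heavy branch is only ever evaluated in the frames where the condition holds, with zero cost otherwise.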

Another version is to have some form of internal link eg:

Here the switch node will decide (depending on SwitchValue result) to run either from Part1 or Part2

This can be implemented in 2 ways:

Use lazy as above.

Rebuild the graph structure (basically, if SwitchValue is 0, the internal representation of the graph looks like this):

As you can see, the user "sees" the first version of the graph, but internally the link is removed, so the Part2 node (and everything above it) is now orphaned and never runs.

This is also really efficient, but of course if you use one of the optimization techniques above, every time the internal representation changes (which can be every frame), you need to rebuild your structure.

So deciding to do so means you need to decide whether it's worth it (is the optimization technique + optimized evaluation still faster than the idiomatic version?).

Please note those techniques are always resolved at runtime (eg: at some point, some node is granted the "power" to cut the graph).

As we have seen, they add some complexity to the evaluation part.

Now let's separate our usage into 2 parts :

Authoring : When user is effectively building the patch

Runtime : When we are running

In case you want to deploy your application, you will of course see effective gains by optimizing the runtime version.

But in many cases we still need fast authoring. Even if those are niche scenarios, I always have cases where I need to modify something very early before a show (or actually, during the show).

Some people will say it's bad, but in cases where it's unavoidable (positioning some element properly, adding a quick animation, or whatever some client urgently requires), it becomes critical not to disrupt the render output too much (basically, no freezes or lag).

So the whole key is to balance both (of course you can also provide several graph evaluation strategies, so the user can decide depending on his use case).

3/Dual graph

One very easy technique to balance this part is to maintain two evaluators, one idiomatic version, and one optimized version.

When the patch structure changes, we swap to the idiomatic one (since it has close to zero rebuild time), and build a new optimized version in the background.

When the optimized version is ready, we switch back to it.

One key point if you use c#: keep a reference to your old optimized evaluators somewhere, otherwise it's likely you'll trigger a Gen 1/2 GC collection, which is not desirable either; a small memory increase might be preferred instead.
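A minimal sketch of this dual-evaluator swap, in Python (the API and class here are assumptions for illustration, not my actual implementation):

```python
import threading

class DualEvaluator:
    # Keeps an idiomatic evaluator always ready; an optimized one is rebuilt
    # in the background whenever the patch structure changes.
    def __init__(self, idiomatic, build_optimized):
        self.idiomatic = idiomatic
        self.build_optimized = build_optimized
        self.optimized = None
        self.lock = threading.Lock()

    def on_graph_changed(self, graph):
        with self.lock:
            self.optimized = None            # swap to idiomatic right away
        def rebuild():
            evaluator = self.build_optimized(graph)
            with self.lock:
                self.optimized = evaluator   # switch back once ready
        worker = threading.Thread(target=rebuild)
        worker.start()
        return worker

    def evaluate(self, graph):
        with self.lock:
            current = self.optimized or self.idiomatic
        return current(graph)
```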

4/Determinism

Now let's introduce something interesting, since all the above was fairly basic.

Basically, let's consider a node as "Take some input and produce some outputs"

We could consider 3 cases of "output production"

Deterministic : This is identical to "pure functions" in functional paradigm. Same inputs will always produce same outputs.

Non deterministic : This will produce varying results (for example, an animation filter which keeps the previous frame's result and depends on it)

User input, non deterministic: This is the same as above, but is a special case (for example a node that has an editable value)

So let's consider the following patch:

Let's say that Constant (and Constant2) nodes are immutable (basically you set the value when you create the node and you are not allowed to modify it).

Other operators are basic (math operators are fully deterministic).

This patch only needs to run once, it will always produce the same results.

You will of course say: I need to animate my data to show on screen, and I agree. But I can be reasonably sure that in large patches, there are a lot of sections like that.

You could of course use a lazy pin as described above (below FinalValue), but we should be able to just automate this.

Let's take a second example:

In that case, only the Constant node is constant; since the Time node is non deterministic, there is not much we can do. But then, having:

Here we can see that all the section in gray can be run only once and cached; it never needs to run several times.

If now we add user input to the mix:

Here the group on the right only changes when the user input changes, so we can also really easily cache results (as user data will not change very often).

We do not need to check for user changes manually (generally editors tend to trigger an event).

Pseudo code for evaluation is a tad more complex.

First when graph is updated, we need to propagate some information downwards as per:

Walk through nodes recursively, check if a node has a fully non deterministic parent (or a user-input based one), and mark that in a flag (in the case of user input, store some reference to it).

void ProcessNode(node)
    if node has no non deterministic parents, stop here (unless it's the first frame or the graph has changed)
    if node has a fully non deterministic parent (anything that depends on the Time node in our case), process it
    if node has a "user signal" parent, check if the user has changed the value (some form of dirty flag), and only process it if required

In our example patch, we can see that unless the user is editing a value, only FinalValue, +, Add, Divide and Time need to be processed; the rest can be completely ignored most of the time, and this can be automated!
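A toy Python version of this determinism-aware evaluation (the node setup, names and caching policy are illustrative; a real implementation would precompute the determinism flags instead of recursing every frame):

```python
class Node:
    def __init__(self, name, fn, parents=(), deterministic=True):
        self.name = name
        self.fn = fn
        self.parents = list(parents)
        self.deterministic = deterministic
        self.cache = None

def fully_deterministic(node):
    # True when the node and its whole upstream chain are deterministic.
    return node.deterministic and all(fully_deterministic(p) for p in node.parents)

def evaluate(node, run_counts):
    if fully_deterministic(node) and node.cache is not None:
        return node.cache                 # constant subgraph: computed once
    inputs = [evaluate(p, run_counts) for p in node.parents]
    run_counts[node.name] = run_counts.get(node.name, 0) + 1
    node.cache = node.fn(*inputs)
    return node.cache

ticks = [0]
def clock():                              # stands in for the Time node
    ticks[0] += 1
    return ticks[0]

constant = Node("Constant", lambda: 5)
constant2 = Node("Constant2", lambda: 3)
add = Node("Add", lambda a, b: a + b, [constant, constant2])
time_node = Node("Time", clock, deterministic=False)
final = Node("Final", lambda a, t: a * t, [add, time_node])
```

On the second "frame", the Add/Constant section is served straight from cache, while Time and Final still run.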

Again, in our example, if we deploy our application, the UserValue node can automatically be changed to a constant node (since the user will not have access to the editor anymore). So we can even completely remove the "User input, non deterministic" case, and stay with only "Deterministic or not".

As a side note, in case of deployment, we can even store the result values we need and just remove the nodes completely.

For example :

Can be transformed to:

Here we store all results that are needed (Sine and Multiplies), and don't need those processors anymore.

Of course, in case nodes operate on large lists, storing results can require some large files, so we can keep the option to keep the nodes and run them only once (and replace them by constants in the graph right after).
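That deploy-time folding can be sketched as a pass where fully deterministic subtrees collapse into stored constants (again a toy model, not real engine code; the Node layout is an assumption):

```python
class Node:
    def __init__(self, name, fn, parents=(), deterministic=True, value=None):
        self.name = name
        self.fn = fn
        self.parents = list(parents)
        self.deterministic = deterministic
        self.value = value                 # set => node is a baked constant

def fold(node):
    # Deploy-time pass: any fully deterministic subtree collapses into a
    # single constant node holding its precomputed result.
    if node.value is not None:
        return node
    folded = [fold(p) for p in node.parents]
    if node.deterministic and all(p.value is not None for p in folded):
        result = node.fn(*[p.value for p in folded])
        return Node(node.name, None, deterministic=True, value=result)
    return Node(node.name, node.fn, folded, node.deterministic)

constant = Node("Constant", None, value=5)
constant2 = Node("Constant2", None, value=3)
multiply = Node("Multiply", lambda a, b: a * b, [constant, constant2])
time_node = Node("Time", lambda: 0.0, deterministic=False)
final = Node("Final", lambda a, t: a + t, [multiply, time_node])

deployed = fold(final)
```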

Where is the catch then?

There is one: every node must be annotated. We need to provide metadata indicating a node's determinism, which can easily be achieved using attributes (in the c# case, for example).

What's important then is that each node must be "curated": a Deterministic node wrongly marked as Non Deterministic will prevent downstream optimization, and the opposite scenario is worse, as it will lead to incorrect results (data will not change whereas it should). But adding those annotations is a small cost versus the reward.

5/Conclusion

Compiling graphs as flattened structures can feel tempting at first sight, but the graph data structure can hold a large amount of "knowledge" which is then lost. So before deciding to do so, it can be interesting to take a step back and see if it's worthwhile at all (that, for sure, is another debate, and there will be pros and cons in each case).

Since this post probably feels long to digest, I'll keep the next section for later, which will include shaders in the mix, so stay tuned.

Sunday, 25 October 2015

In the previous post, I spoke about the ability to perform hit detection using analytical functions.

This works extremely well when we can restrict our use case to it, but now we have some other cases where this is not as ideal:

Perform detection on arbitrary shape/3d model.

User input is not a pointer anymore, but can also be arbitrary (thresholded camera texture, Kinect Body Index)

Both previous cases combined together

While we can often perform detection on a 3d model using triangle raycast (I'll keep that one for the next post), it can be pretty expensive (especially if we perform 10-touch hit detection: we need to raycast 10 times).

So instead, one easy technique is to use an ID map.

The concept is extremely simple: instead of performing hits with a function, we render our scene into a UInt texture, where each pixel holds an object ID.

Of course it means you have to render your scene another time, but in that case you can also easily use the following:

The great thing with this technique: our depth buffer already makes sure we have the closest object ID stored (so we get that for "free").

So now we have our ID map, and picking the object ID under the pointer is trivial:

Code Snippet

Texture2D<uint> ObjectIDTexture;
RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;

float2 MousePosition;
int width = 512;
int height = 424;

[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint w, h;
    ObjectIDTexture.GetDimensions(w, h);

    float2 p = MousePosition;
    p = p * 0.5f + 0.5f;
    p.y = 1.0f - p.y;
    p.x *= (float)w;
    p.y *= (float)h;

    uint obj = ObjectIDTexture.Load(int3(p, 0));
    RWObjectBuffer[0] = obj;
}

Not much more is involved: we grab the pixel ID and store it in a buffer that we can retrieve via staging.

In case we need multiple pointers, we only need to grab N pixels instead, so the process stays pretty simple (and we don't need to render the scene for each pointer).
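For reference, the coordinate mapping done in the shader, as a quick Python sketch (the [-1,1] range and Y-flip are as in the snippet above):

```python
def mouse_to_texel(mouse_x, mouse_y, width, height):
    # Same mapping as the compute shader: [-1,1] range to texel coordinates,
    # with Y flipped (texture origin is top-left).
    x = (mouse_x * 0.5 + 0.5) * width
    y = (1.0 - (mouse_y * 0.5 + 0.5)) * height
    return int(x), int(y)
```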

Now as mentioned before, we might need to perform detection against arbitrary texture.

As a starter, for simplicity, I will restrict the use case to single user texture.

So first we render the user into an R8_Uint texture, where 0 means no active user and anything else means active.

We render our object map next in the same resolution.

We create a buffer (uint, same size as the object count) that will store how many user pixels hit each object's pixels.

Dispatch to perform this count.

Use another Append buffer that selects elements over a minimum amount of pixels (this is generally important to avoid noise with camera/Kinect textures).

Accumulating pixel hit count is done this way:

Code Snippet

Texture2D<uint> ObjectIDTexture;
Texture2D<float> InputTexture;
RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;

float Minvalue;
int maxObjectID;

[numthreads(8,8,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint obj = ObjectIDTexture[tid.xy];
    float value = InputTexture[tid.xy];
    if (value > Minvalue && obj < maxObjectID)
    {
        uint oldValue;
        InterlockedAdd(RWObjectBuffer[obj], 1, oldValue);
    }
}

Make sure you use InterlockedAdd, as you need an atomic operation in this case.
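To show the counting logic on its own, here is a CPU reference of that accumulation, with toy data (texture layout and values are made up for the example):

```python
def accumulate_hits(object_ids, input_tex, min_value, max_object_id):
    # CPU reference of the accumulation shader: for every pixel where the
    # input is active, bump the hit counter of the object under that pixel.
    counts = [0] * max_object_id
    for obj_row, val_row in zip(object_ids, input_tex):
        for obj, val in zip(obj_row, val_row):
            if val > min_value and obj < max_object_id:
                counts[obj] += 1
    return counts
```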

Next we can filter elements:

Code Snippet

StructuredBuffer<uint> HitCountBuffer;
AppendStructuredBuffer<uint> AppendObjectIDBuffer : BACKBUFFER;

int minHitCount;

[numthreads(64,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint c, stride;
    HitCountBuffer.GetDimensions(c, stride);
    if (tid.x >= c)
        return;

    int hitcount = HitCountBuffer[tid.x];
    if (hitcount >= minHitCount)
    {
        AppendObjectIDBuffer.Append(tid.x);
    }
}

It's that easy. Of course, instead of only rendering the object ID in the map, we can easily add some extra metadata (triangle ID, closest vertex ID) for easier lookup.

Now, in order to perform multi-user detection (for example, using the Kinect2 Body Index texture), the process is not much different.

Instead of having a buffer of size ObjectCount, we create it with size ObjectCount*UserCount.

Accumulator becomes:

Code Snippet

Texture2D<uint> ObjectIDTexture;
Texture2D<uint> UserIDTexture;
RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;

float Minvalue;
int maxObjectID;
int objectCount;

[numthreads(8,8,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint obj = ObjectIDTexture[tid.xy];
    uint pid = UserIDTexture[tid.xy];
    if (pid != 255 && obj < maxObjectID) // 255 = no user
    {
        uint oldValue;
        InterlockedAdd(RWObjectBuffer[pid * objectCount + obj], 1, oldValue);
    }
}

And filtering becomes:

Code Snippet

StructuredBuffer<uint> HitCountBuffer;
AppendStructuredBuffer<uint2> AppendObjectIDBuffer : BACKBUFFER;

int minHitCount;
int objectCount;

[numthreads(64,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint c, stride;
    HitCountBuffer.GetDimensions(c, stride);
    if (tid.x >= c)
        return;

    int hitcount = HitCountBuffer[tid.x];
    if (hitcount >= minHitCount)
    {
        uint2 result;
        result.x = tid.x % objectCount; // object id
        result.y = tid.x / objectCount; // user id
        AppendObjectIDBuffer.Append(result);
    }
}

We now have a user id / object id tuple instead, as shown in the following screenshot:
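The index packing/unpacking used by those two shaders, sketched in Python for clarity:

```python
def encode(user_id, object_id, object_count):
    # Flat index into the ObjectCount*UserCount hit buffer.
    return user_id * object_count + object_id

def decode(index, object_count):
    # Recover the (object_id, user_id) tuple appended by the filter shader.
    return index % object_count, index // object_count
```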

Please also note this technique can also easily be optimized with stencil, setting a bit per user. You are then limited to 8 users tho (7 users in case you also want to reserve one bit for the object itself).

You will also need one pass per user (so 6 passes with the proper depth stencil state/reference value).

If you're lucky enough to run on Windows10/DirectX11.3, and have a card that supports it, you can also simply do:

Code Snippet

Texture2D<uint> BodyIndexTexture : register(t0);

uint PS(float4 p : SV_Position) : SV_StencilRef
{
    uint id = BodyIndexTexture.Load(int3(p.xy, 0));
    if (id == 255) // No-user magic value provided by Kinect2
        discard;
    return id;
}

Here is a simple stencil test rig, to show all of the intermediates:

That's it for part 2 (that was simple no?)

For the next (and last) part, I'll explain a few more advanced cases (triangle raycast, scene precull....)

So last month I've been working on my latest commercial project (nothing involving any extreme creative skills), and now I'm back into research mode.

I got plenty of new ideas for rendering, and quite some parts of my engine are undergoing some reasonable cleanup (mostly new binding model to ease dx12 transition later on).

There's different areas in my tool that I'm keen on improving, many new parts will be for other blog posts, but one has lately drawn my attention and I really wanted to get this one sorted.

As many of you know (or don't), I've been working on many interactive installations around, from small to (very) large.

One common requirement for those is some form of hit detection: you have some input device (Kinect, camera, mouse, touch, Leap...), and you need to know if you hit some object in your scene in order to have those elements react.

After many years in the industry, I've developed a lot of routines in that area, so I thought it would be nice to have all of that as a decent library (to just pick from when needed).

After a bit of conversation with my top coder Eric, we wanted to draw up a bit of a feature list (what do we expect of an intersection engine?), and the following came up:

We have various scenarios, and some routines are a better fit for some use cases, so we don't want a "one mode to rule them all". For example, if our objects are near-spherical, we don't want to raycast mesh triangles; a raycast on the bounding sphere is appropriate (and of course much faster).

We want our routines sandboxed, so as a 4v/flaretic subpatch it should be one node with inputs/outputs, with cooking done properly inside and optimized. That saves us load time, reduces compilation times for shaders (or allows precompiled ones), and gives an easier-to-control workflow (if a routine is not needed, it costs 0).

We want our library minimal, so hit routines should not even create data themselves; they are a better fit as pure behaviours (it also helps to have those routines working in different environments).

We don't want to be gpu-only: if a case fits better on the CPU, then we should use the CPU (if preparing buffers costs more time than performing the test directly, let's just do it directly on the CPU).

Next we wanted to decide which types of outputs we needed; this is what came out:

bool/int flag, which indicates if object is hit or not

filtered version for hit objects

filtered version for non hit objects

Then here are the most important hit detection features we require (they cover a large part of our use cases in general)

Arbitrary texture to shape (the most common scenario for this is an infrared camera, or the Kinect body index texture). In that case we also want the ability to differentiate between user id as well as object id.

In any 3d scenario, we also eventually want either the closest object or all objects that pass the test.

Now our buffer also contains our distance to the object; the only thing left is to grab the closest element.

We have 2 ways to work that out:

Use a compute shader (use InterlockedMin to find the closest element; since distance is generally positive there are no float-to-uint tricks to apply), then perform another pass to check if an element's distance equals the minimum.

Use the pipeline; the depth buffer is pretty good at keeping the closest element, so we might as well let it do the work for us ;)

Using the pipeline is extremely easy as well; the process is as follows:

Create a 1x1 render target (uint), Associated with a 1x1 depth buffer

Prepare an indirect draw buffer (from the UAV counter), and draw as a point list; write to pixel 0 in the vertex shader, and pass the distance so it's written to the depth buffer. Since code speaks more, here it is:

Wednesday, 12 August 2015

So here we go, Windows 10 is out, and so is "officially" DirectX12 (first official samples are finally now available).

Since I was not part of early access program, I could not see much samples, but setting up pipeline for tests was reasonably easy, even tho you have no idea about "best practices".
I helped a bit fixing some of the remaining issues in SharpDX (and also integrated DirectX11.3 support in there on the way, welcome volume tiled resources and conservative rasterization).

So way before official samples I managed to have most features I needed running, but official samples helped to finally nail down a bit the "last pieces of the puzzle" (mostly swap chain handling and descriptor heaps best practices).

The first thing we obviously do is build a small utility library (to remove the boilerplate and be able to prototype fast), and then play :)

So new API means changes, new approaches, new possibilities, everything that I love.

As I'm generally not dealing with the "general case" (game engines with tons of assets), but rather with lots of procedural geometry, tons of different effect permutations, many different scenes... I will of course speak about what changes for those scenarios.

So.... let's go

1/Resources

Finally (and I really mean it), a resource is just a resource, eg: some place in memory; there are no restrictions anymore about binding.

Before, in DirectX11, you could not, for example, create a resource usable both as a Vertex/Index buffer and as a StructuredBuffer.

For vertex buffers it's okay (you can just bind as a StructuredBuffer and fetch in the Vertex Shader using VertexID), but an index buffer can only be bound within the pipeline, so you either had to copy to a byte address buffer, or mimic Append/Counter features with a ByteAddressBuffer.

Now I can create a resource, create an index buffer view for it, and an append or counter structured view as well; no problem, the way it should be.

(Note: from DirectX11.2 it was possible in some ways using Tiled resources: you create 2 "empty" resources (one for index, one for structured) and allocate the same tile mapping to each of them. It works, but it's a bit hacky.)

This is a huge step forward in my opinion, since now I can construct geometry once and seamlessly use it in pipeline or compute.

2/Heaps

As well as having resources as plain memory locations, we now have several ways to create them.

Committed : this is roughly equivalent to the DirectX11 style; you create a resource and the runtime allocates backing memory for it

Placed : Now we can create a Heap (eg: some memory), and place resources in there, so several resources can share the same heap (and even overlap, so for small temporary resources this is rather useful)

Reserved : Reserved resources is the Tiled resources equivalent

Heaps can also be of 3 types:

Default : same as previously

Upload : Cpu accessible for writing

Readback : Cpu Accessible for reading

As a side note, the way you upload resources is now totally up to you. The recommended way is to allocate an upload resource, write data there, and use a copy command to copy the data to a default resource, since it's mentioned that gpu access to Upload resources is slower (which I actually confirmed with some benchmarks).

So you have pretty much full control over how you organize/load your memory (especially as you can have dedicated copy Command Queues, but that's likely for another post).

3/Queries

Queries in DirectX11 were an utter pain (and often some form of disaster waiting to happen).

You had to wait for some point later in the frame, loop until data was available, then get the data. A stall party, in brief.

Now query handling is much simpler: you just create a query heap (per query type), which can contain backing memory for several queries, then use begin/end (except for time, which now only uses end as it reads the gpu timer).

Then when you need to access the data, you resolve the query into a resource (which can be of any type), so you can either wait for the end of the frame and read back, or even use that data right away (stream output to indirect draw, I was waiting for this for so long... bye bye DrawAuto, will not miss you anytime ;)

4/Uav Counters

In the same fashion, uav counters have simply... (drum roll)... disappeared.

Now uav counters are simply backed by a resource, which of course means you have full read/write control over them from both cpu and gpu (you can even share a counter between different resources/uavs; I'm pretty sure I'll find a weird use case for it at some point).

Previously, in DirectX11, you could only set the initial count from the Cpu before a dispatch, which was sometimes quite limiting (quite often I just ended up using a small buffer and interlocked functions to mimic append/counter). Now everything can be used seamlessly; also, a Stream Output query result can now be set as the initial count for an append buffer, which is something I already see myself abusing.

5/Pipeline State Objects

PSOs are an obvious huge difference in how you set up your pipeline.

Before, in DirectX11, you had to set several states individually (blend/rasterizer, shaders...), which of course provided a reasonable level of flexibility (really easy to switch from solid to wireframe, for example), but at the expense of a performance hit (as well as having to implement state tracking).

Now the whole pipeline (shaders/states/input layout, all except resource binding and render targets) is created up front and sent in a single call.

This offers the obvious advantage of a very big performance gain (since the card now knows up front what to do and doesn't need to "reconstruct" the pipeline every draw), but pipeline states are expensive to create, so don't expect to tweak small states and have immediate feedback (pso creation costs some tenths of milliseconds; please note they can of course be created async).

PSO can also be cached (hence also serialized to disk), so this can improve loading times drastically for applications in runtime mode.

So that's it for the first overview part, there's of course many more new features and changes, but let's keep those for another post.

Monday, 15 June 2015

After upgrading my particle system, the next part that needs my attention is my deferred renderer.

It has all types of lights (using compute shaders when possible, pixel shaders if the light also has shadows), and all the standard usual suspects (HBAO, Dof, Bokeh, Tonemap...)

Now I started to upgrade my shaders for the next level:

Msaa support with subpixel (sample shading)

Allow better AO usage (AO Factor can be set as low quality or each light can have an AO factor)

Better organization of some calculations (to avoid doing them twice).

Some other cool things ;)

Before I start revamping the glue code to handle all this: as soon as you start to use Msaa targets (this is no news of course), your memory footprint grows quite drastically.

In my use case, since I'm not dealing with the usual "single level game scenario", I can also have several full HD scenes (or more, or less), which all need to be rendered every frame and composited.

I looked a bit at my memory usage, and while it's not too bad (reasonably efficient usage of pools and temporary resources), I thought I should have a proper think about it before starting to code ;)

So generally when we render scenes, we have several types of resource lifetimes:

"Scene lifetime" : Those are resources which live with your scene, so until you decide to unload your whole scene, those resources must live. A good example is particle buffers: as they are read/write, they need to be persisted across frames.

"Frame lifetime" : Those are the ones we use for a single frame, often some intermediate results that need to be persisted across a sufficiently long part of the frame. For example, Linear Depth is quite often required for a long part of your post processing pipeline, since it's used by a decent number of post processors.

"Scoped lifetime" : Those have a very short lifetime (generally within a unit of work/function call)

When I did a first memory profile test, I could see that actually a lot of my memory footprint is caused by those Scoped resources, so I decided to first focus on those.

So as a starter, here are my local resources for some of my post processors

While this is a natural pattern in c#, it is not designed to work well with real-time graphics (resource creation is expensive, and creating/releasing gpu resources all the time is not such a good idea, with memory fragmentation looming).

Resource Pool : instead of creating resources all the time, we create a small wrapper around each resource that keeps an isLocked flag. This looks this way in c#:

Code Snippet

var buffer = Device.ResourcePool.LockStructuredBuffer<float>(1024);
// do something with buffer : buffer.Element.ShaderView;
buffer.UnLock(); // Mark as free to reuse

When we request a resource, the pool checks if a buffer with the required flags/stride is available; if so, it marks it as "in use" and returns it. If no buffer matching the specification is found, it creates and returns a new one.
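That lookup logic, sketched in Python (a toy model, of course; the real pool wraps GPU resources and uses flags/stride/size as the key):

```python
class PooledBuffer:
    def __init__(self, key):
        self.key = key             # eg: (size, stride, flags)
        self.is_locked = True
    def unlock(self):
        self.is_locked = False     # mark as free to reuse

class ResourcePool:
    def __init__(self):
        self.buffers = []
        self.created = 0
    def lock(self, key):
        # Reuse a free buffer with matching specs, or create a new one.
        for buf in self.buffers:
            if not buf.is_locked and buf.key == key:
                buf.is_locked = True
                return buf
        self.created += 1
        buf = PooledBuffer(key)
        self.buffers.append(buf)
        return buf
```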

This scheme is quite popular (I've seen it in Unity and many other code bases); it has the advantage of being simple, but also has some issues.

First, you need to keep a pool per resource type, eg: one for textures (of each type), one for buffers.

Second, and the biggest disadvantage (especially for Render Targets): we need an exact size and format.

We can certainly optimize format support using Typeless resources (but you still need to create views in that case), but for size that's a no go (or at least not practical), since we would often need to implement our own sampler (which is not a big deal, except for Anisotropic, but that would badly pollute our shader code base). We would also need a bit of viewport/scissor gymnastics. Again, not that hard, but really not convenient.

So if you render 3 scenes, each with a different resolution, your pool starts to collect a lot of resources of different sizes, and your resource lists become bigger...

Of course you can clear unused resources from time to time (eg: Dispose anything that has not been used recently), track the number of frames since last use and threshold that (yay, let's write a garbage collector for GPU resources, hpmf....).

Nevertheless, I find that for "Frame lifetime" (and eventually a small subset of "Scene lifetime") resources, this model fits reasonably well, so I'll definitely keep it for a while (I guess DX12 Heaps will change that part, but let's keep DirectX12 for later posts ;)

So now we have clearly seen the problem. For my scoped resources, if I go back to the ao->dof->bokeh case, I have to create 6 targets + one buffer (one of them can be reused, but a lot of intermediates are in different formats depending on which post processing I'm currently applying).

Adding a second scene with a different resolution, that's of course 12 targets.

One main thing is that all this post processing is not applied at the same time (since your GPU serializes commands anyway), so all that memory could happily be shared. But we don't have fine-grained enough access to gpu memory for this (again, resources pointing to the same locations are trivial in dx12, but here we're still in dx11). So it looks like a dead end.

Meanwhile, in the Windows 8.1 world, some great Samaritan(s) introduced a few new features, one of them called Tiled Resources.

Basically, a resource created as tiled has no initial memory; you have to map tiles (64k chunks of memory) to it. Memory tiles are provided by a buffer created with a tile pool attribute.

Code Snippet

BufferDescription bdPool = new BufferDescription()
{
    BindFlags = BindFlags.None,
    CpuAccessFlags = CpuAccessFlags.None,
    OptionFlags = ResourceOptionFlags.TilePool,
    SizeInBytes = memSize,
    Usage = ResourceUsage.Default
};

So you can create a huge resource with no memory, and assign tiles to some parts of it depending on your scene.

This of course has wide uses in games (terrain/landscape streaming, large shadow map rendering), and most examples follow that direction (look up sparse textures if you want documentation about those).

Then I noticed in one slide (I forgot whether it was from NVidia or Microsoft): "Tiled resources can eventually be used for more aggressive memory packing of short lived data".

There was no further explanation, but that sounds like my use case, so obviously, let's remove the "eventually" and try it (here understand: get it working).

So in that case we think about it in reverse: instead of one large resource backed by a pool, partly updated/cleared, we provide a backing pool and map the same tiles to different resources (they must of course belong to different units of work; resources that need to be used at the same time must not overlap).

So let's do a first try, and create two Tiled buffers, which point to the same memory location.

Next, a simple test: create an immutable resource with some random data, the same size as our tiled buffers (not the pool).

Use copy resource on either Buffer1 or Buffer2 (not both).

Create two staging resources s1 and s2

Read back data from Buffer1 into s1 and Buffer2 into s2.

Surprise: the data uploaded from our immutable buffer ends up in both s1 and s2.
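The steps above can be sketched in SharpDX-style C#. This is a hedged sketch from memory, not the post's actual code: `device`, `context`, a `DeviceContext2` called `context2`, `tilePool`, `immutableSource`, `s1` and `s2` are assumed to already exist, and the exact `UpdateTileMappings` overload and struct field names (e.g. `TileCount`) should be checked against the SharpDX / D3D11.2 documentation.

```csharp
// Two tiled buffers, both mapped onto tile 0 of the same tile pool,
// so they alias the same 64k of physical memory.
var bdTiled = new BufferDescription()
{
    BindFlags = BindFlags.ShaderResource,
    CpuAccessFlags = CpuAccessFlags.None,
    OptionFlags = ResourceOptionFlags.Tiled,
    SizeInBytes = 65536, // one tile
    Usage = ResourceUsage.Default
};
var buffer1 = new SharpDX.Direct3D11.Buffer(device, bdTiled);
var buffer2 = new SharpDX.Direct3D11.Buffer(device, bdTiled);

foreach (var tiled in new[] { buffer1, buffer2 })
{
    // ID3D11DeviceContext2::UpdateTileMappings, exposed on DeviceContext2
    context2.UpdateTileMappings(tiled, 1,
        new[] { new TiledResourceCoordinate() },          // region starts at tile 0
        new[] { new TileRegionSize() { TileCount = 1 } }, // one 64k tile
        tilePool, 1,
        new[] { TileRangeFlags.None },
        new[] { 0 },  // start offset into the pool, in tiles
        new[] { 1 },  // tiles in this range
        TileMappingFlags.None);
}

// Upload through buffer1 only...
context.CopyResource(immutableSource, buffer1);

// ...then read back both: s1 and s2 receive the same data,
// since both tiled buffers share the same physical tile.
context.CopyResource(buffer1, s1);
context.CopyResource(buffer2, s2);
```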

So now that we have the proof of concept working, we create a small tile pool (with an initial memory allocation).

Now, for each post processor, we register our resources (updating the tile offset as we go to avoid overlap; we also need to pad buffers/textures, since each start location must be aligned to a full tile).
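The offset bookkeeping itself is simple arithmetic: round each resource's size up to a whole number of 64k tiles, and start the next resource at the padded offset. A minimal sketch (`TileMath` is a hypothetical helper name):

```csharp
using System;

static class TileMath
{
    // D3D11 tiled resources use 64k tiles.
    public const long TileSizeInBytes = 65536;

    // Round a resource size up to a whole number of tiles, since
    // every resource must start on a tile boundary inside the pool.
    public static long AlignToTiles(long sizeInBytes)
    {
        return (sizeInBytes + TileSizeInBytes - 1) / TileSizeInBytes * TileSizeInBytes;
    }
}
```

So a 100 000 byte buffer reserves two tiles (131 072 bytes), and the next registered resource starts at the current offset plus that padded size.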

Before that, we check whether our pool is big enough and resize it if required. The beauty of it: this is not destructive (increasing the pool size will allocate new tiles but will preserve the mappings of existing ones).
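A sketch of that grow step (hedged: `ResizeTilePool` is the D3D11.2 call, exposed on `DeviceContext2` in SharpDX as far as I know; `context2`, `tilePool`, `requiredSize` and `currentPoolSize` are assumed names):

```csharp
// Grow the pool if the registered resources no longer fit.
// Non-destructive: existing tile mappings are preserved.
if (requiredSize > currentPoolSize)
{
    // requiredSize is assumed already tile-aligned (multiple of 64k)
    context2.ResizeTilePool(tilePool, requiredSize);
    currentPoolSize = requiredSize;
}
```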

Now of course I ran a performance check, pool vs pool, which ended in a draw (so I keep the same performance, no penalty, always a good thing), and here is a small memory profile.

I render one scene at full HD + HDR, and a second scene at half HD + HDR.

Case 1 (old pool only):

Global pool: 241 459 200 bytes
Tiled pool: 0

Case 2 (starting to use the shared pool):

Global pool: 109 670 400 bytes
Tiled pool: 58 064 896 bytes
Total memory: 167 735 296 bytes

So OK, some will say a ~70 meg gain is meaningless in a day when a Titan card has 12 gigs of RAM, but well:

Most people don't have a Titan card (I normally plan for 4 GB cards when doing projects).

Adding a new scene will not change the tiled pool size at all, and will grow the global pool much less than before.

If you start to add MSAA or render 3x full HD, you can expect a larger gain.

When you start to have a few assets in the mix (like a 2+ gig car model; never happened to me, did it? ;), a hundred megs can make a huge difference.

For cards that don't support tiled resources, the technique is easily swappable, so falling back to the global pool when the feature is missing is not a big deal (or you can leave the choice to the user).

I applied it quickly as a proof of concept, on only 3 effectors; now that this works, I can also optimize the post-processing chain more aggressively in the same way (and in fact anything that needs temporary resources in my general rendering, and there's a lot of that).

That's it for now. As a side note, I have some other pretty cool use cases for this; they will likely end up here once I've implemented them.