Monday, 15 June 2015

After upgrading my particle system, the next part that needs my attention is my deferred renderer.

It supports all types of lights (using compute shaders when possible, pixel shaders when a light also has shadows), and all the usual suspects (HBAO, DoF, bokeh, tonemapping...)

Now I've started to upgrade my shaders for the next level:

MSAA support with subpixel shading (sample shading)

Better AO usage (the AO factor can be set globally as a low-quality option, or each light can have its own AO factor)

Better organization of some calculations (so they aren't done twice)

Some other cool things ;)

Before I start revamping the glue code to handle all this, one thing to keep in mind: as soon as you start using MSAA targets (no news there, of course), your memory footprint grows quite drastically.

In my use case, since I'm not dealing with the usual "single level game scenario", I can also have several full HD (or larger, or smaller) scenes which all need to be rendered every frame and composited.

I looked a bit at my memory usage, and while it's not too bad (reasonably efficient usage of pools and temporary resources), I thought I should have a proper think about it before starting to code ;)

So generally when we render scenes, we have several types of resource lifetimes:

"Scene lifetime": resources which live with your scene; until you decide to unload the whole scene, they must stay alive. A good example is particle buffers: since they are read/write, they need to persist across frames.

"Frame lifetime": resources used within a single frame, often intermediate results that need to persist across a sufficiently long part of the frame. For example, linear depth is quite often required for a long stretch of your post processing pipeline, since it's used by a decent number of post processors.

"Scoped lifetime" : Those have a very short lifetime (generally within a unit of work/function call)

When I did a first memory profile, I could see that a lot of my memory footprint actually comes from those scoped resources, so I decided to focus on them first.

So as a starter, here are my local resources for some of my post processors.

While this is a natural pattern in C#, it is not designed to work well with real time graphics (resource creation is expensive, and creating/releasing GPU resources all the time is not a good idea, with memory fragmentation looming).

Resource pool: instead of creating resources all the time, create a small wrapper around each resource that keeps an isLocked flag. It looks this way in C#:

Code Snippet

var buffer = Device.ResourcePool.LockStructuredBuffer<float>(1024);
// do something with buffer: buffer.Element.ShaderView
buffer.UnLock(); // Mark as free to reuse

When we request a resource, the pool checks if a buffer with the required flags/stride is available; if so, it marks it as "in use" and returns it. If no buffer matching the specification is found, it creates and returns a new one.
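To make the scheme concrete, here is a minimal CPU-side sketch of such a lock/unlock pool in Python (the class names are mine, not the engine's; a real pool would key on the full resource description and create actual GPU objects):

```python
class PooledResource:
    """Hypothetical wrapper: a GPU resource plus an in-use flag."""
    def __init__(self, description):
        self.description = description  # exact (type, stride, size, flags) key
        self.in_use = False

    def unlock(self):
        # Mark as free to reuse; the underlying resource stays alive.
        self.in_use = False


class ResourcePool:
    def __init__(self):
        self._resources = []

    def lock(self, description):
        # Reuse the first free resource whose description matches exactly...
        for r in self._resources:
            if not r.in_use and r.description == description:
                r.in_use = True
                return r
        # ...otherwise create a new one (expensive on a real device) and keep it.
        r = PooledResource(description)
        r.in_use = True
        self._resources.append(r)
        return r
```

The exact-match lookup is the weak spot discussed below: every distinct size/format combination grows the list.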

This scheme is quite popular (I've seen it in Unity and many other code bases); it has the advantage of being simple, but it also has some issues.

First, you need to keep a pool per resource type, e.g. one for textures (of each type), one for buffers.

Second, the biggest disadvantage (especially for render targets): we need an exact size and format.
We can certainly mitigate the format issue using typeless resources (though you still need to create views in that case), but for size that's a no go (or at least impractical), since we would often need to implement our own sampling (not a big deal, except for anisotropic filtering, but it would badly pollute our shader code base). We would also need a bit of viewport/scissor gymnastics. Again, not that hard, but really not convenient.

So if you render 3 scenes, each at a different resolution, your pool starts to collect a lot of resources of different sizes, and your resource lists keep growing...

Of course you can clear unused resources from time to time (e.g. dispose anything that hasn't been used for a number of frames, tracked against a threshold... yay, let's write a garbage collector for GPU resources, hmpf).

Nevertheless, I find that for "Frame lifetime" resources (and possibly a small subset of "Scene lifetime" ones) this model fits reasonably well, so I'll definitely keep it for a while (I guess DX12 heaps will change that part, but let's keep DirectX 12 for later posts ;)

So now that we have clearly seen the problem: for my scoped resources, if I go back to the AO -> DoF -> bokeh case, I have to create 6 targets + one buffer (one of them can be reused, but a lot of intermediates are in different formats depending on which post processor I'm currently applying).

Adding a second scene with a different resolution, that's of course 12 targets.

One key point: all this post processing is not applied at the same time (your GPU serializes commands anyway), so all that memory could happily be shared. But we don't have fine-grained enough access to GPU memory for this (again, resources pointing to the same memory location are trivial in DX12, but we're still in DX11 here). So it looks like a dead end.

Meanwhile, in the Windows 8.1 world, some good Samaritan(s) introduced a few new features, one of them called Tiled Resources.

Basically, a resource created as tiled has no initial memory; you have to map tiles (64 KB chunks of memory) to it. The memory tiles are provided by a buffer created with the tile pool attribute.

Code Snippet

BufferDescription bdPool = new BufferDescription()
{
    BindFlags = BindFlags.None,
    CpuAccessFlags = CpuAccessFlags.None,
    OptionFlags = ResourceOptionFlags.TilePool,
    SizeInBytes = memSize,
    Usage = ResourceUsage.Default
};

So you can create a huge resource with no memory, and assign tiles to some parts of it depending on your scene.

This of course has a wide use for games (terrain/landscapes streaming, large shadow map rendering), and most examples follow that direction (check for sparse textures if you want documentation about those).

Then I noticed in one slide (forgot if it was from NVidia or Microsoft), "Tiled resources can eventually be used for more aggressive memory packing of short lived data".

There was no further explanation, but that sounds like my use case, so obviously, let's remove the word "eventually" and try it (here, understand: get it working).

So in that case we think about it in reverse. Instead of having a large resource backed by a pool and partly updated/cleared, we provide a backing pool and assign the same tiles to different resources (of course they need to belong to different units of work; resources that are used at the same time must not overlap).

So let's do a first try, and create two Tiled buffers, which point to the same memory location.

Next, a simple test for it: create an immutable resource with some random data, the same size as our tiled buffers (not the pool).

Use copy resource on either Buffer1 or Buffer2 (not both).

Create two staging resources s1 and s2

Readback data from buffer1 into s1 and Buffer2 into s2.

Surprise: the data uploaded from our immutable buffer ends up in both s1 and s2.
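A crude CPU analogy of what this proof of concept demonstrates: two "buffers" that are just views over the same backing memory, so a write through one is visible through the other (only an analogy, of course; on the GPU the sharing goes through tile mappings):

```python
# The "tile pool": one backing allocation.
pool = bytearray(65536)

# Two "tiled buffers" mapped onto the same tiles.
buffer1 = memoryview(pool)
buffer2 = memoryview(pool)

# Copy data into buffer1 only...
buffer1[:4] = b"\x01\x02\x03\x04"

# ...and read it back through both views.
assert bytes(buffer1[:4]) == b"\x01\x02\x03\x04"
assert bytes(buffer2[:4]) == b"\x01\x02\x03\x04"
```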

So now that the proof of concept works, we create a small tile pool (with an initial memory allocation).

Now for each post processor, we register our resources (updating tile offsets accordingly to avoid overlap; we may also need to pad buffers/textures, since each start location needs to be aligned to a full tile).
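The registration step boils down to giving each resource a start offset rounded up to a whole 64 KB tile. A sketch of that bookkeeping (my own helper, not the engine's API):

```python
TILE_SIZE = 65536  # D3D11 tiled resources work in 64 KiB tiles

def register_resources(sizes_in_bytes):
    """Return a tile-aligned start offset (in tiles) for each resource,
    plus the total number of tiles the pool must provide."""
    offsets, next_tile = [], 0
    for size in sizes_in_bytes:
        offsets.append(next_tile)
        # Pad the resource up to a whole number of tiles.
        next_tile += (size + TILE_SIZE - 1) // TILE_SIZE
    return offsets, next_tile
```

For example, register_resources([100_000, 65_536]) gives offsets [0, 2] and a total of 3 tiles, since the 100 KB resource gets padded to two full tiles.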

Before that, we need to check if our pool is big enough, and resize it if required. The beauty of it: this is not destructive (increasing the pool size allocates new tiles but preserves the mappings of existing ones).

Now of course I did a performance check, pool vs pool, which ended in a draw (same performance, no penalty, always a good thing), and here is a small memory profile.

I render one scene in full hd + hdr, and a second scene in half hd + hdr

Case 1 (old pool only):

Global pool : 241 459 200

Tiled Pool: 0

Case 2 (starting to use shared pool):

Global pool: 109 670 400

Tiled pool : 58 064 896

Total memory: 167 735 296

So OK, a 70 MB gain, in a day when some people will say that the 12 GB of RAM in a Titan card makes it meaningless. But:

Most people don't have a Titan card (I normally plan for 4 GB cards when doing projects).

Adding a new scene will not change the tiled pool size, and increase the global pool in a much smaller fashion.

If you start to add Msaa or render 3x full HD, you can expect a larger gain

When you start to have a few assets in the mix (like a 2+ GB car model; never happened to me, did it? ;), a hundred megs can make a huge difference.

For cards that don't support tiled resources, the technique is easily swappable, so it's not a big deal to fall back to the global pool if the feature is not supported (or to let the user decide).

I applied it quickly as a proof of concept and only on 3 effectors; now that this works, I can also optimize the post processing chain more aggressively in the same way (and actually anything that needs temporary resources in my general rendering, and there's a lot of that).

That's it for now. As a side note, I have some other pretty cool use cases for this; they will likely end up here once I've implemented them.

Tuesday, 9 June 2015

In the last 2 posts, I explained how I moved my particle system to GPU-managed counters and added regions for better emission/behaviour control.

This already offers a pretty great deal of flexibility, but then there are 2 parts that are missing:

Now with regions, my valid particles can be scattered within the main buffer, which means the emit count is no longer for the full particle system, but per region.

Many times I want to apply behaviours only to particles that satisfy a particular condition (within an area, only particles that have collided at least once, any particle under a mouse raycast...).

Funnily enough, both problems can be solved with a similar technique... selectors.

The idea is really simple: you have a predicate function (e.g. anything that takes a particle and returns true or false); if a particle satisfies the predicate, apply the effectors, else do nothing.

As the amount of buffer copying would be huge, we do it in the simplest way: if a particle satisfies the condition, add its index to a selection buffer, then create another dispatch and only process the particles that satisfied the condition.
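On the CPU side, the whole trick is plain stream compaction: build the list of indices that pass, then iterate only over that list. A Python sketch of the idea (not the actual shader code):

```python
def select(particles, predicate):
    # Pass 1 (the selector): append the index of every particle
    # that satisfies the predicate into a "selection buffer".
    return [i for i, p in enumerate(particles) if predicate(p)]

def apply_behaviour(particles, selection, behaviour):
    # Pass 2 (the behaviour): dispatch over the selection only,
    # so untouched particles cost nothing.
    for i in selection:
        particles[i] = behaviour(particles[i])
```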

The only difference now is that behaviours must be aware that they operate either on plain buffer, or on a selection.

Next we do a little cooking with our append buffer to generate new dispatch calls.

Now one difference is that our behaviours also need to fetch data either in a linear fashion (no selector) or through that lookup table, so a couple of helper functions and a few defines do a great job for this purpose:

Code Snippet

#if PARTICLE_USE_SELECTION == 1
uint GetParticleIndex(uint tid)
{
    return SelectionIDBuffer[tid];
}

bool IsOverBounds(uint tid)
{
    uint selectionCount = SelectionCountBuffer.Load(0);
    return tid >= selectionCount;
}
#else
uint GetParticleIndex(uint tid)
{
    return tid;
}

bool IsOverBounds(uint tid)
{
    return tid >= EmitCount;
}
#endif

So for each behaviour, we now have 2 shaders, one to process linearly, one to process using lookup table.

So now here is an example of a "within sphere" selector:

Code Snippet

float4 sphere : SPHERE;

[numthreads(64,1,1)]
void CS(uint3 i : SV_DispatchThreadID)
{
    if (IsOverBounds(i.x))
        return;

    int idx = GetParticleIndex(i.x);
    float3 p = PositionBuffer[idx];
    float d = length(sphere.xyz - p) - sphere.w;
    AppendToSelection(idx, d < 0.0f);
}

As any astute reader will notice, I'm also using the helper functions inside the selector, which means a selector can operate on a selection!

This means you can stack selectors (in which case you get an "AND" operator). I did not implement that yet, since it might make things overcomplex, but there is also another rationale behind it.

You'll remember I said that with regions my emitted particles are scattered inside the buffer, so any selector/global behaviour needs to operate only on emitted particles.

For global elements, you could call them once per region, but if you have 20 regions that multiplies your dispatch count a lot.

So instead, let's think simpler: for each region we can easily build a global lookup table that only contains emitted particles.

Code Snippet

cbuffer cbRegionInfo : register(b11)
{
    uint StartRegionOffset;
    uint RegionElementCount;
};

// This is called once per region
[numthreads(64,1,1)]
void CS_CopyRegionIndices(uint3 tid : SV_DispatchThreadID)
{
    if (tid.x >= EmitCount)
        return;

    // Get location in global buffer
    uint id = tid.x + StartRegionOffset;
    // Push particle ID
    AppendSelectionIDBuffer.Append(id);
}

Yes, it is that simple: we call this shader once per region (clearing the append buffer counter on the first call, but preserving it on subsequent iterations), and we have our global emission set.
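The CPU equivalent of running CS_CopyRegionIndices once per region (clearing the counter only on the first call) is simply concatenating each region's emitted index range; a sketch with hypothetical region tuples:

```python
def build_emitted_indices(regions):
    """regions: list of (start_region_offset, emit_count) pairs.
    Returns the global indices of all emitted particles."""
    ids = []
    for start_offset, emit_count in regions:
        # Each region contributes [start_offset, start_offset + emit_count).
        ids.extend(range(start_offset, start_offset + emit_count))
    return ids
```

For example, two regions at offsets 0 and 512 with 3 and 2 emitted particles give [0, 1, 2, 512, 513]: global indices, not local ones.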

So now we can select from within this buffer (which contains global indices, not local ones), and perform our behaviours/display only on emitted elements.

So now selectors can be owned by 2 "containers":

Within the particle system: We can select particles and apply effectors

Outside of the particle system : So we can select particles to route them to a custom render shader.

Since my selectors don't own any buffers, but are provided contexts, this is really easy.

And of course selectors can also live within a region, or be global, as this screenshot shows:

You can see that the Filter node grabs a particle selector for custom display, while the Selector node selects elements within the particle system; they both use the same "WithinSphere" node.

So this provides a huge amount of flexibility while keeping performance pretty high (stream compaction makes sure we only operate on the needed elements).

But notice that we operate in "immediate mode", e.g.: if the particle satisfies the condition, apply; else, don't. This is done every frame.

In some cases this is not ideal; we want to say: if the particle has satisfied the condition once (for example, mouse raycast), then apply the effector until we release the button.

In that case the switch is simple. As you've seen previously, when a particle satisfies a condition, it is pushed into an append buffer.
So now let's change the code to:

Code Snippet

#if PARTICLE_SELECTION_RETAINED == 0
void AppendToSelection(uint id, bool predicate)
{
    if (predicate)
    {
        AppendSelectionIDBuffer.Append(id);
    }
}
#else
void AppendToSelection(uint id, bool predicate)
{
    if (predicate)
    {
        RWSelectionIDBuffer[id] = 1;
    }
}
#endif

We just write a flag to record that the condition was satisfied at least once.

Then our selector becomes:

Code Snippet

[numthreads(64,1,1)]
void CS_Reset(uint3 tid : SV_DispatchThreadID)
{
    if (IsOverBounds(tid.x))
        return;

    int idx = GetParticleIndex(tid.x);
    RWSelectionIDBuffer[idx] = 0;
}

[numthreads(64,1,1)]
void CS_Append(uint3 tid : SV_DispatchThreadID)
{
    if (IsOverBounds(tid.x))
        return;

    int idx = GetParticleIndex(tid.x);
    uint isFlagged = SelectionIDBuffer[idx];
    if (isFlagged)
    {
        AppendSelectionIDBuffer.Append(idx);
    }
}

So we add a second pass, which grabs every particle that has been flagged at least once (until we reset), and we now have our retained selector.
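The retained behaviour is easiest to see on the CPU: one pass ORs new hits into a persistent flag buffer, a second pass compacts the flagged indices. A sketch of that logic (mine, not the shader):

```python
def mark(flags, hit_indices):
    # Retained AppendToSelection: remember every particle that
    # satisfied the predicate at least once.
    for i in hit_indices:
        flags[i] = 1

def compact(flags):
    # CS_Append equivalent: gather all flagged indices into the selection.
    return [i for i, f in enumerate(flags) if f]
```

Marking [1, 4] on one frame and [4, 5] on the next still compacts to [1, 4, 5]: the selection accumulates until the flags are reset.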

Here we go. Revamping this particle system was actually really fun, allowed a reasonable code cleanup alongside the improvements, and now I have a lot of new funky effectors/emitters and selectors to write ;)

For the next posts, I'm not too sure yet, maybe some pipeline tricks, but there is something else coming that is much more interesting (hint for readers: read a bit of F# code, that will help you) :)

Monday, 8 June 2015

In the previous post I explained how I moved count management to be handled purely on the GPU.

That's of course the initial prerequisite to some more interesting concepts.

Now, one common issue is that I need some emitters not to overlap with each other, some behaviours to apply only to some emitters, and some to apply globally.

So that means I need to define locations where to emit, and also restrict some behaviours/colliders to those locations.

I could of course manage offset tables in all my compute shaders, but that means changing code in every one of them to take that into account.

So for example, for a simple gravity:

Code Snippet

[numthreads(64,1,1)]
void CS_Accumulate(uint3 i : SV_DispatchThreadID)
{
    if (i.x >= EmitCount) { return; }
    RWForceBuffer[i.x] += g;
}

Is now replaced by:

Code Snippet

cbuffer cbParticleRangeData : register(b5)
{
    uint startOffset;
};

[numthreads(64,1,1)]
void CS_Accumulate(uint3 i : SV_DispatchThreadID)
{
    if (i.x >= EmitCount) { return; }
    RWForceBuffer[i.x + startOffset] += g;
}

This is not a big change, but it needs to be done in every behaviour, so that takes a bit of time.

Also, for emitters the logic is a bit more complex, since you also need to enforce that they don't span regions (e.g. don't write a particle into another region's location).

So maybe there must be a way to be able to do that in a more elegant way....

Of course there is!

So first, all my particle data is stored in a ParticleDataStore, which contains all buffers, SRV and UAVs. So any particle effector can request to attach any attribute for read or write, depending on use case.

Also, every effector receives a dedicated context which restricts what the effector has access to (for example, a collider can't attach attributes for writing, except the collision buffer).

So now the idea is to create a new data store which operates on the same global buffers as the main particle system, but is only allowed a subset of each buffer.

This is easily possible in DirectX11, which allows creating views over only part of a buffer.

When we create a buffer (let's say 1024 elements), we can also create default views covering the whole buffer.

To restrict access, we keep the same buffer, but specify which range of elements our view operates on.

So for example, we can say:
This view operates from element 512, and has 200 elements.

The pretty neat thing is, now in our shader, when we say for example

RWForceBuffer[0] = force;

We are actually operating on element 512!
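A tiny Python stand-in for such a restricted view: index 0 of the view maps to the view's first element in the underlying buffer, so code using the view stays completely offset-unaware (conceptual only; in D3D11 the offset and count come from the view description):

```python
class BufferView:
    """View over [first_element, first_element + element_count) of a buffer."""
    def __init__(self, buffer, first_element, element_count):
        self._buffer = buffer
        self._first = first_element
        self._count = element_count

    def __getitem__(self, i):
        if not 0 <= i < self._count:
            raise IndexError("out of view range")
        return self._buffer[self._first + i]

    def __setitem__(self, i, value):
        if not 0 <= i < self._count:
            raise IndexError("out of view range")
        self._buffer[self._first + i] = value
```

With a view starting at element 512 with 200 elements, writing view[0] really writes element 512 of the underlying buffer, and anything outside the 200-element window is rejected.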

Pretty neat no?

So instead of adding all this crazy offset logic, we just create a new data store (which operates on the same buffers) but builds restricted views, and pass that to our effectors, which don't even know they operate on a subset of the data (and they shouldn't even know it).

The next part was simply to add a region handler (which can accept its own effectors) and a new particle system which can also accept global effectors (a particle system operating with regions can't accept emitters anymore; we build regions to avoid overlap, so emitters must be restricted to regions).

The particle system, instead of dispatching globals directly, now has to apply locals per region, then globals; a little bit of extra work but nothing too hard.

So now here is a small example (super basic rendering but gives the idea):

Here a random emitter and a sphere emitter each operate on a separate region (you can still stack several emitters in the same region of course).

Each one has its own color palette; the sphere emitter has damping, but the random emitter doesn't.

Random emitter has one extra collider.

Gravity, and the 2 colliders are linked to the particle system, so they apply to all regions.

As simple as this example is, I'm pretty sure it's easy to see the potential and flexibility it provides, and the great thing is that I did not have to change a single line of my effector/behaviour/collider shader code (life is good at times).

Here we are for part 2, for the next one I'll explain another cool feature (codename: Selectors)

Friday, 5 June 2015

I haven't posted for a little while; I've been attending several events in the meantime:

Kinecthack London: Worked on network Kinect stream and real time point cloud alignment

Revision : Did my first production in a demoparty (ranked 8th in demo category)

Node15: workshops and a quick sneaky DX12 presentation (note: I'm not part of the Early Access Program; I just figured out the API myself and managed to get a render system up and running, fighting with drivers, so I'm not under NDA ;)

OK, all those events were great fun, and I should explain some of the technical parts at some point, but for now let's get into some other technical matters.

I've wanted to revamp my particle system for a while. At some point I thought a from-scratch rewrite would be fine since it's not an insane code base, but as usual my sensible side came back and I decided to just improve and refactor, which is always the better decision ;)

I already have a lot of nice features: tons of emitters (position, distance field, texture, mesh, Kinect2, point clouds...), some nice interaction parts including advanced effectors like SPH (accelerated with spatial grids), plenty of force fields, and a collider system (mostly distance field based; after all, a plane collider is just a distance field check with a specific function).

Many effectors can also be controlled via "micro curves" (basically a 1D texture rendered from a track in my timeline and driven by particle age).

The whole simulation part is entirely managed on the GPU (compute shaders) and is pretty fast, so I didn't feel that part needed a major rewrite (it will get improvements though; that's for the next post).

The particle counter is basically a ring buffer, currently managed on the CPU, which is not a big deal for effectors but a real problem for emitters. And since most of my machines are now fully DX11.1 enabled (e.g. I have access to the "UAV at every stage" feature), this became quite a blocker (it opens a lot of possibilities that I'll explain later).

So I decided to revamp (here understand : improve, not rewrite) this part, but first let's explain the problem.

So we have a small structure that maintains the emission counters, like:

Code Snippet

[StructLayout(LayoutKind.Sequential)]
public struct EmitData
{
    public uint EmitCount;
    public uint MaxParticles;
    public uint EmitOffset;
    public uint ThisFrameCount;
    public uint ThisFrameStartOffset;
    private Vector3 dummy; // padding so the struct size is a multiple of 16 bytes for the constant buffer
}

As you see it's just an evil mutable struct (it's only used to copy to constant buffer, don't worry ;)

Now an emitter just implements a simple interface:

Code Snippet

public interface IParticleEmitterObject : IParticleEffectObject
{
    int Emit(RenderContext context, ParticleSystem particles);
}

As you see, an emitter returns how many particles it emitted; the particle system then, after each emitter has been processed, updates the counter structure so the next emitter starts at the right location.

This works really well for simple emitters (32 particles randomly placed), but now I also have emitters that take their data from the GPU.

So let's take another example, emit from texture.
This is done in the following steps:

Create a buffer (position + color) with an append flag.

Dispatch a compute shader that reads each pixel.

If a pixel satisfies a condition (luminance for example, but it can be anything else), append the pixel position + color into the append buffer.

Copy the buffer counter.

Dispatch N particles (where N is decided on the CPU), each grabbing a random pixel that satisfied the condition and pushing it into the particle buffers.
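Sketching those steps on the CPU makes the coming problem easy to see (the predicate and names are illustrative, not the engine's code):

```python
import random

def emit_from_texture(pixels, passes, emit_count, rng=random):
    """pixels: list of (position, color)-like items; passes: predicate."""
    # Steps 1-3: "append" every pixel that satisfies the predicate.
    selected = [px for px in pixels if passes(px)]
    # Step 4: len(selected) plays the role of the append-buffer counter.
    if not selected:
        return []  # without this guard, the "no pixel passed" case blows up
    # Step 5: emit_count is fixed on the CPU, regardless of len(selected).
    return [rng.choice(selected) for _ in range(emit_count)]
```

Note that emit_count is chosen before we know len(selected): that mismatch is exactly the set of edge cases described next.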

Now, as you can clearly see, this creates a problem: the number of particles to emit is set on the CPU, but we have no idea how many pixels satisfied the condition. So we can have 3 edge cases (one of which can be nasty); consider our emit count is 32:

20,000 pixels pass the test in frame 1 and we emit 32 particles from that; in the next frame, we might emit 32 particles from a 200-pixel buffer. This means we have poor coverage control; I'd like to emit more particles when more pixels pass the test.

Fewer than 32 pixels passed the test (say 10), so some elements get emitted several times.

Worst case of all, no pixels at all pass the test, so result can be... unpredictable.

So to handle those cases, we have several options:

In the compute shader emitter, if the thread ID is >= the number of elements that passed the test, push a "degenerate particle" (e.g. position = float.MaxValue). This is of course ugly and needs to be repeated in every shader that takes a GPU counter, but at least it works (even though it can cause problems at simulation level).

Read the counter back on the CPU: this is simple (just copy the counter into a small staging buffer and read it back), but it creates a stall, as we need to flush the command buffer and wait for it to be fully executed before getting back our 4 precious bytes. So again, not ideal.

Do it properly :) Move the counter data to the GPU, move the code that maintains it into compute shaders, and profit :)

So first, since the counter data lives in a constant buffer, let's not change all the shaders and logic: we use a small structured buffer which contains an exact copy of the data (we update it in the structured buffer, then use CopyResource to copy from the structured buffer to the constant buffer). An emitter's result now also describes where its emit count lives, with the following cases:

Did not run: the emitter did not run at all. I decided to make this an explicit case instead of returning 0 (so you can also explain why it did not run).

Static: the same as our previous CPU-side case.

UnorderedView: the counter is located in a UAV, which is the case when we use Append/Counter buffers to push particles. For example, if we want to emit every pixel that passed the test in our previous case, we do an indirect dispatch and return the view (which contains the counter).

Buffer: the count is in a GPU buffer (we also need to provide its location in that case). This is very useful for coverage-based emitters (for example, we could say: emit 50% of the elements that passed the test every frame; in that case we need to process the counter in a small compute shader to generate a custom dispatch call).
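For that coverage case, the small compute shader just turns the counter into dispatch arguments; the arithmetic is nothing more than this (hypothetical helper, 64-thread groups as in the shaders above):

```python
def dispatch_args(counter, coverage=0.5, group_size=64):
    """Turn a GPU-side element count into indirect dispatch arguments:
    emit `coverage` of the elements that passed the test this frame."""
    emit_count = int(counter * coverage)
    # Ceil division: one thread group per group_size emitted particles.
    groups_x = (emit_count + group_size - 1) // group_size
    return (groups_x, 1, 1), emit_count
```

For example, a counter of 200 at 50% coverage emits 100 particles in 2 groups of 64. On the GPU the resulting triple is written into the indirect arguments buffer consumed by DispatchIndirect.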

From there, we can easily update our structured buffer above (using a compute shader; I actually kept the readback version for debug purposes).

Now the only small difference is that when processing effectors, we don't know the particle count anymore, so instead of Dispatch we use DispatchIndirect (which is trivial to implement), and we use indirect buffers for drawing as well: DrawInstancedIndirect to draw as sprites, and DrawIndexedInstancedIndirect to render particles as geometry.

So here we go: from there we have a fully fledged counter system on the graphics card, which also means that for any type of emitter where we want to push every element that passes the test, we can now do it in a single pass (no more need for an intermediate buffer; use a counter buffer or InterlockedAdd).

And since we now have access to UAVs at every stage, it's possible to load balance using the tessellation/domain shaders.

Here is an example of a hybrid particle emitter (which doesn't draw anything on screen, but pushes an adaptive number of particles depending on triangle size).