Sunday, 25 October 2015

In the previous post, I spoke about the ability to perform hit detection using analytical functions.

This works extremely well when we can restrict our use case to it, but there are other cases where it is less ideal:

Perform detection on arbitrary shape/3d model.

User input is not a pointer anymore, but can also be arbitrary (threshold camera texture, Kinect Body Index)

Both previous cases combined together

While we can often perform detection on a 3d model using a triangle raycast (I'll keep that one for the next post), it can be pretty expensive (especially if we perform 10-touch hit detection, we need to raycast 10 times).

So instead, one easy technique is to use an ID map.

The concept is extremely simple: instead of performing the hit test with a function, we render our scene into a UInt texture where each pixel stores an object ID.

Of course it means you have to render your scene one more time.

The great thing with this technique: the depth buffer already makes sure the closest object ID is stored (so we get that for "free").

So now that we have our ID map, picking the object ID under the pointer is trivial:

Code Snippet

Texture2D<uint> ObjectIDTexture;

RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;

float2 MousePosition;

[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint w, h;
    ObjectIDTexture.GetDimensions(w, h);

    float2 p = MousePosition;
    p = p * 0.5f + 0.5f; // [-1,1] -> [0,1]
    p.y = 1.0f - p.y;    // flip Y
    p.x *= (float)w;
    p.y *= (float)h;

    uint obj = ObjectIDTexture.Load(int3(p, 0));
    RWObjectBuffer[0] = obj;
}

Not much more is involved: we grab the pixel's ID and store it in a buffer that we can read back through a staging resource.

In case we need multiple pointers, we only need to grab N pixels instead, so the process stays pretty simple (and we don't need to render the scene for each pointer).
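The lookup logic can be sketched on the CPU like this (an illustrative Python sketch of the same coordinate math, not the actual shader; all names are made up):

```python
# CPU-side sketch of the ID-map lookup: map pointers from [-1, 1] NDC
# to texel coordinates and read the object ID for each one.

def pick_object_ids(id_map, pointers):
    """id_map: 2D list [h][w] of object IDs; pointers: (x, y) in [-1, 1]."""
    h, w = len(id_map), len(id_map[0])
    hits = []
    for x, y in pointers:
        px = (x * 0.5 + 0.5) * w          # NDC -> [0, w)
        py = (1.0 - (y * 0.5 + 0.5)) * h  # NDC -> [0, h), flip Y
        px = min(int(px), w - 1)          # clamp to the texture edge
        py = min(int(py), h - 1)
        hits.append(id_map[py][px])
    return hits
```

Reading N pointers is just N lookups against the same map, which is exactly why the scene only needs to be rendered once.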

Now, as mentioned before, we might need to perform detection against an arbitrary texture.

As a starter, for simplicity, I will restrict the use case to a single user texture.

So first we render the user into an R8_Uint texture, where 0 means no active user and anything else means active.

We render our object map next, at the same resolution.

We create a buffer (same size as the object count, uint) that will store how many user pixels hit each object's pixels.

Dispatch to perform this count.

Use another Append buffer that selects elements over a minimum count of pixels (this is generally important to avoid noise with camera/Kinect textures).

Accumulating pixel hit count is done this way:

Code Snippet

Texture2D<uint> ObjectIDTexture;
Texture2D<float> InputTexture;

RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;

float Minvalue;
int maxObjectID;

[numthreads(8,8,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint obj = ObjectIDTexture[tid.xy];
    float value = InputTexture[tid.xy];
    if (value > Minvalue && obj < maxObjectID)
    {
        uint oldValue;
        InterlockedAdd(RWObjectBuffer[obj], 1, oldValue);
    }
}

Make sure you use InterlockedAdd, as you need an atomic operation in this case.

Next we can filter elements:

Code Snippet

StructuredBuffer<uint> HitCountBuffer;
AppendStructuredBuffer<uint> AppendObjectIDBuffer : BACKBUFFER;

int minHitCount;

[numthreads(64,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint c, stride;
    HitCountBuffer.GetDimensions(c, stride);
    if (tid.x >= c)
        return;

    int hitcount = HitCountBuffer[tid.x];
    if (hitcount >= minHitCount)
    {
        AppendObjectIDBuffer.Append(tid.x);
    }
}
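The accumulate and filter dispatches can be sketched on the CPU like this (an illustrative Python sketch of the same logic, not the actual shaders; names are made up):

```python
# CPU-side sketch: count how many "active" input pixels land on each
# object ID, then keep only the objects whose hit count passes the
# noise threshold.

def accumulate_hits(object_ids, input_values, max_object_id, min_value):
    counts = [0] * max_object_id
    for obj, value in zip(object_ids, input_values):
        if value > min_value and obj < max_object_id:
            counts[obj] += 1          # the InterlockedAdd equivalent
    return counts

def filter_hits(counts, min_hit_count):
    # Equivalent of the append-buffer pass: emit IDs over the threshold.
    return [i for i, c in enumerate(counts) if c >= min_hit_count]
```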

It's that easy. Of course, instead of only rendering the object ID into the map, we can easily add some extra metadata (triangle ID, closest vertex ID) for easier lookup.

Now, in order to perform multi-user detection (for example, using the Kinect2 body index texture), the process is not much different.

Instead of creating a buffer of size ObjectCount, we create one of size ObjectCount * UserCount.

Accumulator becomes:

Code Snippet

Texture2D<uint> ObjectIDTexture;
Texture2D<uint> UserIDTexture;

RWStructuredBuffer<uint> RWObjectBuffer : BACKBUFFER;

int maxObjectID;
int objectCount;

[numthreads(8,8,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint obj = ObjectIDTexture[tid.xy];
    uint pid = UserIDTexture[tid.xy];
    if (pid != 255 && obj < maxObjectID) // 255 = Kinect2 "no user" value
    {
        uint oldValue;
        InterlockedAdd(RWObjectBuffer[pid * objectCount + obj], 1, oldValue);
    }
}

And filtering becomes:

Code Snippet

StructuredBuffer<uint> HitCountBuffer;
AppendStructuredBuffer<uint2> AppendObjectIDBuffer : BACKBUFFER;

int minHitCount;
int objectCount;

[numthreads(64,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint c, stride;
    HitCountBuffer.GetDimensions(c, stride);
    if (tid.x >= c)
        return;

    int hitcount = HitCountBuffer[tid.x];
    if (hitcount >= minHitCount)
    {
        uint2 result;
        result.x = tid.x % objectCount; // object id
        result.y = tid.x / objectCount; // user id
        AppendObjectIDBuffer.Append(result);
    }
}

We now have a user id / object id tuple instead, as shown in the following screenshot:
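The index packing used by the two shaders above boils down to this (illustrative Python mirroring pid * objectCount + obj and the modulo/divide unpacking):

```python
# The hit-count buffer is ObjectCount * UserCount long, indexed
# user-major: one contiguous run of object counters per user.

def flat_index(user_id, object_id, object_count):
    return user_id * object_count + object_id

def unpack_index(index, object_count):
    # Mirrors the filter shader: x = object id, y = user id.
    return (index % object_count, index // object_count)
```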

Please also note this technique can easily be optimized with stencil, setting a bit per user. You are then limited to 8 users though (7 users in case you also want to reserve one bit for the object itself).

You will also need one pass per user (so 6 passes for the Kinect2's tracked bodies, with the proper depth-stencil state/reference value).
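The bit-per-user idea is simple mask arithmetic (an illustrative Python sketch; in the real version this lives in the stencil write mask and reference value):

```python
# Each user owns one bit of the 8-bit stencil, so membership tests on a
# stencil value become plain mask operations.

def user_bit(user_id):
    assert 0 <= user_id < 8   # 8-bit stencil -> at most 8 users
    return 1 << user_id

def users_in_mask(stencil_value):
    # Recover which users wrote into this stencil value.
    return [u for u in range(8) if stencil_value & (1 << u)]
```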

If you are lucky enough to run on Windows 10 / DirectX 11.3, with a card that supports it, you can also simply do:

Code Snippet

Texture2D<uint> BodyIndexTexture : register(t0);

uint PS(float4 p : SV_Position) : SV_StencilRef
{
    uint id = BodyIndexTexture.Load(int3(p.xy, 0));
    if (id == 255) // no-user magic value provided by Kinect2
        discard;
    return id;
}

Here is a simple stencil test rig, to show all of the intermediates:

That's it for part 2 (that was simple, no?)

For the next (and last) part, I'll explain a few more advanced cases (triangle raycast, scene pre-culling....)

So last month I was working on my latest commercial project (nothing involving any extreme creative skills), and now I'm back into research mode.

I got plenty of new ideas for rendering, and quite a few parts of my engine are undergoing some reasonable cleanup (mostly a new binding model to ease the dx12 transition later on).

There are different areas in my tool that I'm keen on improving; many new parts will be for other blog posts, but one has lately drawn my attention and I really wanted to get it sorted.

As many of you know (or don't), I've been working on many interactive installations, from small to (very) large.

One common requirement for those is some form of Hit Detection: you have some input device (Kinect, Camera, Mouse, Touch, Leap....), and you need to know if you hit some object in your scene in order to have those elements react.

After many years in the industry, I've developed a lot of routines in that area, so I thought it would be nice to have all of that as a decent library (to just pick from when needed).

After a bit of conversation with my top coder Eric, we wanted to draft a feature list of what we expect from an intersection engine, and the following came up:

We have various scenarios, and some routines are a better fit for some use cases, so we don't want a "one mode to rule them all". For example, if our objects are near-spherical, we don't want to raycast mesh triangles; a raycast on the bounding sphere is appropriate (and of course much faster).

We want our routines sandboxed: no 4v/flaretic subpatch, it should be one node with inputs/outputs, with cooking done properly inside and optimized. That saves us load time, reduces compilation times for shaders (or allows precompiled ones), and makes the workflow easier to control (if a routine is not needed it costs 0).

We want our library minimal, so the actual hit routines should not even create data themselves; they are a better fit as pure behaviours (it also helps to have those routines working in different environments).

We don't want to be gpu only: if a case fits better on the CPU, then we should use the CPU (if preparing buffers costs more time than performing the test directly, then let's just do it directly on the cpu).

Next we wanted to decide which type of outputs we needed, this came out:

bool/int flag, which indicates if object is hit or not

filtered version for hit objects

filtered version for non hit objects

Then here are the most important hit detection features we require (they cover a large part of our use cases in general)

Arbitrary texture to shape (the most common scenario for this is an infrared camera, or the Kinect body index texture). In that case we also want the ability to differentiate between user ids as well as object ids.

In any 3d scenario, we also eventually want either the closest object or all objects that pass the test.

Now our buffer also contains our distance to the object; the only thing left is to grab the closest element.

We have 2 ways to work that out:

Use a compute shader (use InterlockedMin to keep the closest element; since distance is always positive, a plain asuint reinterpretation preserves the ordering, so no sign-flipping float-to-uint tricks are needed), then perform another pass to check if an element's distance is equal to the minimum.
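That ordering claim is easy to check (a hedged Python sketch of the bit reinterpretation, mirroring what HLSL's asuint would give you):

```python
import struct

# For non-negative IEEE-754 floats, reinterpreting the bits as an
# unsigned int preserves ordering, so an integer atomic min on the raw
# bits finds the smallest distance directly.

def float_bits(x):
    """Rough equivalent of HLSL asuint() for a 32-bit float."""
    return struct.unpack('<I', struct.pack('<f', x))[0]
```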

Use the pipeline; the depth buffer is pretty good at keeping the closest element, so we might as well let it do the work for us ;)

Using the pipeline is extremely easy as well; the process is as follows:

Create a 1x1 render target (uint), associated with a 1x1 depth buffer.

Prepare an indirect draw buffer (from the UAV counter), and draw as a point list: write to pixel 0 in the vertex shader, and pass the distance along so it's written to the depth buffer.
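What the 1x1 depth trick computes can be sketched on the CPU like this (an illustrative Python sketch, not the actual pipeline code; names are made up):

```python
# Mimic a 1x1 uint render target with a 1x1 depth buffer: each "draw"
# of (object_id, distance) only lands if its depth is nearer than what
# is already stored, so the closest candidate survives.

def closest_hit(candidates):
    """candidates: list of (object_id, distance). Returns -1 if empty."""
    best_id, best_depth = -1, float('inf')
    for object_id, distance in candidates:
        if distance < best_depth:     # the depth test (LESS)
            best_id, best_depth = object_id, distance
    return best_id
```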

Wednesday, 12 August 2015

So here we go: Windows 10 is out, and so is "officially" DirectX12 (the first official samples are now finally available).

Since I was not part of the early access program, I could not see many samples, but setting up a pipeline for tests was reasonably easy, even though you have no idea about "best practices".
I helped a bit with fixing some of the remaining issues in SharpDX (and also integrated DirectX11.3 support there along the way; welcome, volume tiled resources and conservative rasterization).

So way before official samples I managed to have most features I needed running, but official samples helped to finally nail down a bit the "last pieces of the puzzle" (mostly swap chain handling and descriptor heaps best practices).

First thing we obviously do is to build a small utility library (to remove the boilerplate and be able to prototype fast), and then play :)

So new API means changes, new approaches, new possibilities, everything that I love.

As I'm generally not dealing with the "general case" (game engines with ton of assets), eg: lot of procedural geometry, tons of different effects permutations, many different scenes... I will of course speak about what it changes for those scenarios.

So.... let's go

1/Resources

Finally (and I really mean it), a resource is just a resource, eg: some place in memory; there are no restrictions anymore about binding.

Before, in DirectX11, you could not, for example, create a resource usable both as a Vertex/Index buffer and as a StructuredBuffer.

For a vertex buffer it's okay (you can just bind it as a StructuredBuffer and fetch it in the Vertex Shader using the VertexID), but an index buffer can only be bound within the pipeline, so you either had to copy to a ByteAddressBuffer, or mimic the Append/Counter features with a ByteAddressBuffer.

Now I can create a resource, create an index buffer view for it, and an append or counter structured view as well; no problem, the way it should be.

(Note: from DirectX11.2 it was possible to do this in some ways using Tiled resources: you create 2 "empty resources" (one for index, one for structured) and allocate the same tile mapping to each of them. It works, but it's a bit hacky.)

This is a huge step forward in my opinion, since now I can construct geometry once and seamlessly use it in pipeline or compute.

2/Heaps

As well as resources being just locations, we now have several ways to create them.

Committed: this is roughly equivalent to the DirectX11 style; you create a resource and the runtime allocates backing memory for it.

Placed: now we can create a Heap (eg: some memory) and place resources in it, so several resources can share the same heap (and even overlap, which is rather useful for small temporary resources).

Reserved: reserved resources are the Tiled resources equivalent.

Heaps can also be of 3 types:

Default : same as previously

Upload : Cpu accessible for writing

Readback : Cpu Accessible for reading

As a side note, how you upload resources is now totally up to you. The recommended way is to allocate an upload resource, write data into it, and use a copy command to copy the data to a default resource, since it's mentioned that gpu access to Upload resources is slower (which I actually confirmed with some benchmarks).

So you have pretty much full control over how you organize/load your memory (especially as you can have dedicated copy Command Queues, but that one is likely for another post).

3/Queries

Queries in DirectX11 were an utter pain (and often some form of disaster waiting to happen).

You had to wait for some point later in the frame, loop until the data was available, then get the data. A stall party, in brief.

Now query handling is much simpler: you just create a query heap (per query type), which can also contain backing memory for several queries, then use Begin/End (except for time, which now only uses End as it reads the gpu timer).

Then when you need to access the data, you resolve the query into a resource (which can be of any type), so you can either wait for the end of the frame and read back, or even use that data right away (stream output to indirect draw; I was waiting for this for so long... bye bye DrawAuto, I will not miss you ;)

4/Uav Counters

In the same fashion, uav counters have simply.... (drum roll) ... disappeared.

Now uav counters are simply backed by a resource, which of course means you have full read/write control over them from both cpu and gpu (you can even share a counter between different resources/uavs; I'm pretty sure I'll find a weird use case for it at some point).

Previously in DirectX11, you could only set the initial count from the Cpu before a dispatch, which was sometimes quite limiting (quite often I just ended up using a small buffer and interlocked functions to mimic append/counter). Now everything can be used seamlessly; also, a Stream Output query result can now be set as the initial count for an append buffer, which is something I already see myself abusing.

5/Pipeline State Objects

PSOs are an obvious huge difference in how you set up your pipeline.

Before, in DirectX11, you had to set several states individually (blend/rasterizer, shaders...), which of course provided a reasonable level of flexibility (really easy to switch from solid to wireframe, for example), but at the expense of a performance hit (as well as having to implement state tracking).

Now the whole pipeline (shaders/states/input layout, everything except resource binding and render targets) is created up front and sent in a single call.

This offers the obvious advantage of a very big performance gain (since the card now knows up front what to do and doesn't need to "reconstruct" the pipeline every draw), but pipeline states are expensive to create, so don't expect to tweak small states and have immediate feedback (pso creation costs some tenths of a millisecond; note they can of course be created async).

PSO can also be cached (hence also serialized to disk), so this can improve loading times drastically for applications in runtime mode.

So that's it for the first overview part, there's of course many more new features and changes, but let's keep those for another post.

Monday, 15 June 2015

After upgrading my particle system, the next part that needs my attention is my deferred renderer.

It has all type of lights (using compute shader when possible, pixel shaders if light also has shadows), and all the standard usual suspects (HBAO, Dof, Bokeh, Tonemap...)

Now I started to upgrade my shaders for the next level:

Msaa support with subpixel (sample shading)

Allow better AO usage (AO Factor can be set as low quality or each light can have an AO factor)

Better organization of some calculations (to avoid them to be done twice).

Some other cool things ;)

Before I start to revamp the glue code to handle all this: as soon as you use Msaa targets (this is no news of course), your memory footprint grows quite drastically.

In my use case, since I'm not dealing with the usual "single level game scenario", I can also have several full HD scenes (or more, or less), which all need to be rendered every frame and composited.

I looked a bit at my memory usage, and while it's not too bad (reasonably efficient usage of pools and temporary resources), I thought I should have a proper think about it before starting to code ;)

So generally when we render scenes, we have several type of resources lifetimes:

"Scene lifetime" : Those are resources which live with your scene, so until you decide to unload your whole scene, those resources must live. A good example is some particle buffers, as they are read write, they need to be persisted across frames.

"Frame lifetime" : Those are the ones that we use for a single frame, often some intermediate results, that needs to be persisted across a sufficiently long part in the frame duration. For example, Linear Depth is quite often required for a long part in your post processing pipeline, since it's used by a decent amount of post processors.

"Scoped lifetime" : Those have a very short lifetime (generally within a unit of work/function call)

When I did a first memory profile test, I could see that actually a lot of my memory footprint is caused by those Scoped resources, so I decided to first focus on those.

So as a starter, here are my local resources for some of my post processors

While this is a natural pattern in c#, it is not designed to work well with real time graphics (resource creation is expensive, and creating/releasing gpu resources all the time is not such a good idea, with memory fragmentation looming).

Resource Pool: instead of creating resources all the time, we create a small wrapper around each resource that keeps an isLocked flag. It looks this way in c#:

Code Snippet

var buffer = Device.ResourcePool.LockStructuredBuffer<float>(1024);
// do something with the buffer: buffer.Element.ShaderView;
buffer.UnLock(); // Mark as free to reuse

When we request a resource, the pool checks if a buffer with the required flags/stride is available; if so, it marks it "In Use" and returns it. If no buffer matching the specification is found, it creates and returns a new one.
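That lookup can be sketched like this (a minimal Python sketch; the class and names are illustrative, not the actual c# implementation):

```python
# Minimal lock/unlock resource pool: find a free entry matching the
# requested spec, mark it in use, or create a new one on a miss.

class ResourcePool:
    def __init__(self):
        self._entries = []

    def lock(self, spec, create):
        for entry in self._entries:
            if entry['spec'] == spec and not entry['locked']:
                entry['locked'] = True    # reuse an existing resource
                return entry
        entry = {'spec': spec, 'resource': create(spec), 'locked': True}
        self._entries.append(entry)       # cache miss: create a new one
        return entry

    def unlock(self, entry):
        entry['locked'] = False           # free for reuse, kept alive

    def size(self):
        return len(self._entries)
```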

This scheme is quite popular (I've seen it in Unity and many other code bases); it has the advantage of being simple, but it also has some issues.

First, you need to keep a pool per resource type, eg: one for textures (of each type), one for buffers.

Second, the biggest disadvantage (especially for Render Targets): we need an exact size and format.

We can certainly optimize format support using Typeless resources (but you still need to create views in that case), but for size that's a no-go (or at least not practical), since we would often need to implement our own sampler (which is not a big deal, except for Anisotropic, but that would badly pollute our shader code base). We would also need a bit of viewport/scissor gymnastics. Again, not that hard, but really not convenient.

So if you render 3 scenes, each at a different resolution, your pool starts to collect a lot of resources of different sizes, and your resource lists become bigger....

Of course you can clear unused resources from time to time (eg: Dispose anything that has not been used for a threshold number of frames; yay, let's write a garbage collector for GPU resources, hpmf....).

Nevertheless, I find that for "Frame lifetime" (and eventually a small subset of "Scene lifetime") resources, this model fits reasonably well, so I'll definitely keep it for a while (I guess DX12 Heaps will change that part, but let's keep DirectX12 for later posts ;)

So now we have clearly seen the problem. For my scoped resources, if I go back to the ao->dof->bokeh case, I have to create 6 targets + one buffer (one of them can be reused, but lots of intermediates are in different formats depending on which post processor I'm currently applying).

Adding a second scene with a different resolution, that's of course 12 targets.

One main thing is that all this post processing is not applied at the same time (since your GPU serializes commands anyway), so all that memory could happily be shared. But we don't have fine-grained enough access to gpu memory for this (again, resources pointing to the same locations are trivial in dx12, but here we're still in dx11). So it looks like a dead end.

In the meantime, in the Windows 8.1 world, some good Samaritans have introduced a few new features, one of them called Tiled Resources.

Basically, a resource created as Tiled has no initial memory; you have to map tiles (which are 64k chunks of memory) to it. Memory tiles are provided by a buffer created with a tile pool attribute.

Code Snippet

BufferDescription bdPool = new BufferDescription()
{
    BindFlags = BindFlags.None,
    CpuAccessFlags = CpuAccessFlags.None,
    OptionFlags = ResourceOptionFlags.TilePool,
    SizeInBytes = memSize,
    Usage = ResourceUsage.Default
};

So you can create a huge resource with no memory, and assign tiles to some parts of it depending on your scene.

This of course has a wide use for games (terrain/landscapes streaming, large shadow map rendering), and most examples follow that direction (check for sparse textures if you want documentation about those).

Then I noticed in one slide (forgot if it was from NVidia or Microsoft), "Tiled resources can eventually be used for more aggressive memory packing of short lived data".

There was no further explanation, but that sounds like my use case, so obviously, let's remove the "eventually" word and try it (here understand : have it working).

So in that case we think about it in reverse. Instead of having one large resource backed by a pool and partly updated/cleared, we provide a backing pool and allocate the same tiles to different resources (of course those need to belong to different units of work; the ones that need to be used at the same time must not overlap).

So let's do a first try, and create two Tiled buffers, which point to the same memory location.

Next, a simple test for it: create an Immutable resource with some random data, the same size as our tiled buffers (not the pool).

Use copy resource on either Buffer1 or Buffer2 (not both).

Create two staging resources, s1 and s2.

Read back data from Buffer1 into s1 and from Buffer2 into s2.

Surprise: we have the data uploaded and copied from our immutable buffer in both s1 and s2.

So now that we have the proof of concept working, we create a small tile pool (with an initial memory allocation).

Now, for each post processor, we register our resources (updating the tile offset accordingly to avoid overlap; we also eventually need to pad buffers/textures, since each start location needs to be aligned to a full tile).
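The registration step can be sketched as follows (a hedged Python sketch; the tile size comes from the 64k tiles mentioned above, everything else is illustrative):

```python
# Resources inside one unit of work are packed at 64 KB tile boundaries
# so they never overlap; separate units of work can all start at offset
# 0 and alias the same tiles.

TILE_SIZE = 64 * 1024

def align_up(size, alignment=TILE_SIZE):
    # Pad a resource size up to a whole number of tiles.
    return (size + alignment - 1) // alignment * alignment

def pack_unit_of_work(resource_sizes):
    """Returns (per-resource tile offsets, total pool bytes needed)."""
    offsets, offset = [], 0
    for size in resource_sizes:
        offsets.append(offset)
        offset += align_up(size)
    return offsets, offset
```

The pool then only needs to be as large as the biggest unit of work, which is where the memory saving comes from.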

Before that, we need to check if our pool is big enough, and resize it if required. The beauty of it: this is not destructive (increasing the pool size will allocate new tiles but preserve the mappings of existing ones).

Now of course I did a performance check, pool vs pool, which ended in a draw (so I keep the same performance, no penalty, always a good thing). And here is a small memory profile.

I render one scene in full hd + hdr, and a second scene in half hd + hdr

Case 1 (old pool only):

Global pool : 241 459 200

Tiled Pool: 0

Case 2 (starting to use shared pool):

Global pool: 109 670 400

Tiled pool : 58 064 896

Total memory: 167 735 296
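For reference, the totals above check out (plain arithmetic on the profiled numbers):

```python
# Numbers taken from the memory profile above, in bytes.
old_total = 241_459_200                  # case 1: global pool only
new_total = 109_670_400 + 58_064_896     # case 2: global + tiled pool
saved = old_total - new_total            # roughly 70 MB
```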

So ok, a 70 meg gain, in a day and age where some people will say you have 12 gigs of ram on a Titan card, might seem meaningless, but well:

Most people don't have a Titan card (I normally plan for 4gb cards when doing projects).

Adding a new scene will not change the tiled pool size, and will increase the global pool in a much smaller fashion.

If you start to add Msaa or render 3x full HD, you can expect a larger gain.

When you start to have a few assets in the mix (like a 2+ gig car model, never happened to me did it? ;), a hundred megs can make a huge difference.

For cards that don't support tiled resources, the technique is really easy to swap out, so it's not a big deal to fall back to the global pool if the feature is not supported (or let the user decide).

I applied it quickly as a proof of concept, and only on 3 effectors; now that this works, I can also optimize the post processing pipeline chain more aggressively in the same way (and actually anything that needs temporary resources in my general rendering, and there's a lot).

That's it for now, as a side note, I have some other pretty cool use cases for this, they will likely end up here when I'll have implemented them.