Understanding Structured Buffer Performance

Structured Buffers were a new addition to DirectX 11. They offer expanded compute
capabilities, making them useful for techniques like tile-based deferred shading.
They are also a very convenient way to represent data structures on the GPU
that are more than simply colors or 4-component vectors. As such, they are a great
tool to use in GPU programming.

As titles designed to a D3D11 baseline have recently become more common, we’ve
noticed a pitfall that developers should be wary of when writing their code.
Structured Buffers are by definition tightly packed. This means that a structure
holding, for example, a float4 Position followed by a single float produces a
buffer with a stride of 20 bytes.
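
As a minimal sketch of such a layout, the C++ below mirrors it on the CPU side. It
assumes a four-component Position followed by one scalar; the names Float4,
PackedElement and Radius are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>  // offsetof

// CPU-side stand-in for an HLSL float4: four 32-bit floats, 16 bytes.
struct Float4 { float x, y, z, w; };

// Hypothetical element type (the field name Radius is an assumption).
struct PackedElement
{
    Float4 Position;  // bytes 0..15
    float  Radius;    // bytes 16..19
};

// Tightly packed: nothing is added after Radius, so the stride is 20 bytes,
// which is not a multiple of 128 bits (16 bytes).
static_assert(sizeof(PackedElement) == 20, "expected a 20-byte stride");
static_assert(sizeof(PackedElement) % 16 != 0, "stride is not 128-bit aligned");
static_assert(offsetof(PackedElement, Radius) == 16, "Radius follows Position");
```

On the host, sizeof(PackedElement) is the kind of value that would typically be
supplied as StructureByteStride in D3D11_BUFFER_DESC when creating the buffer.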

That may not seem terrible, but it has performance implications that may not be
immediately obvious. Because the structure is not naturally aligned to a 128-bit
stride, the Position element often spans cache lines, which makes the structure more
expensive to read. A single inefficient read is unlikely to hurt your performance
much, but the cost adds up quickly. A shader iterating over a list of complex
lights, with more than 100 bytes of data per light, can be a serious pothole. In fact,
we recently found prerelease code where whole-frame performance was penalized by over
5% by just such a difference.
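
To get a feel for how often a field ends up straddling a boundary, the sketch below
counts the affected elements for a given stride. CountStraddles is a hypothetical
helper, not part of any API, and the boundary size is a parameter because actual
cache-line sizes depend on the hardware.

```cpp
#include <cstddef>  // std::size_t

// Counts how many of the first `count` elements have a field occupying
// bytes [fieldOffset, fieldOffset + fieldSize) that crosses a boundary
// of `boundary` bytes. Assumes fieldSize > 0 and boundary > 0.
std::size_t CountStraddles(std::size_t stride,
                           std::size_t fieldOffset,
                           std::size_t fieldSize,
                           std::size_t boundary,
                           std::size_t count)
{
    std::size_t straddles = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        const std::size_t first = i * stride + fieldOffset;  // first byte of the field
        const std::size_t last  = first + fieldSize - 1;     // last byte of the field
        if (first / boundary != last / boundary)
        {
            ++straddles;  // the field starts and ends in different blocks
        }
    }
    return straddles;
}
```

For a 16-byte field at offset 0, CountStraddles(20, 0, 16, 16, 4) returns 3: with a
20-byte stride, three of every four elements cross a 128-bit boundary. With a 32-byte
stride, CountStraddles(32, 0, 16, 16, 4) returns 0.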

To avoid these pitfalls in your own code, only a couple of simple steps are required:

Aim for structures with sizes divisible by 128 bits (the size of a float4)

Pay attention to internal alignment so that vector types are ‘naturally’ aligned
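
Both steps can be sketched with a CPU-side mirror in C++. The type and field names
here (Float4, ScalarFirst, PaddedElement, Radius, Pad) are hypothetical, and the
compile-time checks simply encode the two rules above.

```cpp
#include <cstddef>  // offsetof

// CPU-side stand-in for an HLSL float4: four 32-bit floats, 16 bytes.
struct Float4 { float x, y, z, w; };

// What to avoid: the scalar comes first, so Position starts at byte 4
// and is not naturally aligned.
struct ScalarFirst
{
    float  Radius;    // bytes 0..3
    Float4 Position;  // bytes 4..19
};
static_assert(offsetof(ScalarFirst, Position) % 16 != 0, "Position is misaligned");

// One possible fix: vector first, then explicit padding up to 32 bytes.
struct PaddedElement
{
    Float4 Position;  // bytes 0..15, naturally aligned
    float  Radius;    // bytes 16..19
    float  Pad[3];    // bytes 20..31, unused padding
};
static_assert(sizeof(PaddedElement) % 16 == 0, "size must be a multiple of 128 bits");
static_assert(offsetof(PaddedElement, Position) % 16 == 0, "Position must be 16-byte aligned");
```

The padding costs 12 bytes per element in this example. Where possible, those bytes
can hold other useful data instead of being left unused.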

You might waste some memory or need slightly more complex code to accomplish these goals,
but the costs are generally small compared to the 20% or more of a shader's performance
that can be lost by hitting these pitfalls.

If you found this topic interesting, please stay tuned, as we have some additional
structured buffer tips coming in a follow-up post.