Native Shadow Mapping

Native support for shadow map sampling & filtering was introduced ages ago (GeForce 3) by NVIDIA.
It turns out AMD also implemented the same feature for DX10 level cards. Intel supports it too,
on the Intel 965 (aka GMA X3100, the Shader Model 3.0 part) and later (G45/X4500/HD) cards.

The usage is quite simple: create a texture with a regular depth/stencil format and render into
it. When reading from the texture, one extra texture coordinate component carries the depth
to compare against; the compared & filtered result is returned.
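
A minimal sketch of that setup, assuming a D3D9 device pointer (variable names here are illustrative):

    // Create a depth/stencil format texture to render shadow casters into.
    IDirect3DTexture9* shadowMap = NULL;
    device->CreateTexture(1024, 1024, 1, D3DUSAGE_DEPTHSTENCIL,
        D3DFMT_D24X8, D3DPOOL_DEFAULT, &shadowMap, NULL);
    IDirect3DSurface9* shadowSurface = NULL;
    shadowMap->GetSurfaceLevel(0, &shadowSurface);
    device->SetDepthStencilSurface(shadowSurface);
    // ... render the shadow casters ...

    // Later, bind shadowMap to a sampler and sample it with the comparison
    // depth in the projected Z texture coordinate. HLSL side:
    //     float shadow = tex2Dproj(shadowSampler, shadowUVZW).r;
    // The value returned is the compared & filtered (PCF) result.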

Depth Buffer as Texture

For some rendering schemes (anything with “deferred” in the name) or some effects (SSAO, depth of field,
volumetric fog, …), access to the depth buffer is needed. If the native depth buffer
can be read as a texture, this saves both memory and a rendering pass (or an extra output when using MRTs).

Depending on hardware, this can be achieved via INTZ, RAWZ, DF16 or DF24 formats:

INTZ is for recent (DX10+) hardware. With recent drivers, all three major IHVs expose it.
According to AMD [1],
it also allows using the stencil buffer while rendering, as well as reading from the depth texture
while it’s still being used for depth testing (but not depth writing). This looks like
it applies to NVIDIA & Intel parts as well.

RAWZ is for GeForce 6 & 7 series only. Depth is specially encoded into the
four channels of the returned value and has to be reconstructed in the pixel shader.

DF16 and DF24 are for AMD and Intel cards, including older cards that don’t support INTZ.
Unlike INTZ, these do not allow using the stencil buffer, nor using the surface for both
sampling & depth testing at the same time.
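
For reference, a minimal creation sketch for INTZ (the same FOURCC pattern applies to the other
formats); variable names are illustrative:

    #define FOURCC_INTZ ((D3DFORMAT)MAKEFOURCC('I','N','T','Z'))

    // Check that the driver exposes the format...
    HRESULT hr = d3d->CheckDeviceFormat(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
        D3DFMT_X8R8G8B8, D3DUSAGE_DEPTHSTENCIL, D3DRTYPE_TEXTURE, FOURCC_INTZ);
    // ...then create a depth/stencil texture with it. Set its level 0 surface
    // as the depth/stencil target when rendering; bind the texture itself
    // when sampling.
    IDirect3DTexture9* depthTex = NULL;
    if (SUCCEEDED(hr))
        device->CreateTexture(width, height, 1, D3DUSAGE_DEPTHSTENCIL,
            FOURCC_INTZ, D3DPOOL_DEFAULT, &depthTex, NULL);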

Using INTZ for both depth/stencil testing and sampling at the same time
seems to have performance problems on AMD cards (checked on Radeon HD 3xxx to 5xxx,
Catalyst 9.10 to 10.5). A workaround is to render to an INTZ depth/stencil surface first,
then use RESZ to “blit” it into another surface. Then sample from one surface,
and depth test against the other.
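
The resolve sequence, roughly as AMD documents it [1] (the magic POINTSIZE value triggers the copy):

    #define RESZ_CODE 0x7fa05000

    // Bind the destination depth texture to sampler 0 and issue a dummy
    // draw so the binding is in place when the resolve triggers.
    device->SetVertexShader(NULL);
    device->SetPixelShader(NULL);
    device->SetFVF(D3DFVF_XYZ);
    device->SetTexture(0, destDepthTex);
    device->SetRenderState(D3DRS_ZENABLE, FALSE);
    device->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
    device->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
    float dummyPoint[3] = { 0.0f, 0.0f, 0.0f };
    device->DrawPrimitiveUP(D3DPT_POINTLIST, 1, dummyPoint, sizeof(float) * 3);
    device->SetRenderState(D3DRS_ZENABLE, TRUE);
    device->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
    device->SetRenderState(D3DRS_COLORWRITEENABLE, 0x0F);

    // The magic render state value performs the depth "blit".
    device->SetRenderState(D3DRS_POINTSIZE, RESZ_CODE);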

Depth Bounds Test
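
NVIDIA exposes the depth bounds test through another FOURCC hack: check for “NVDB” format
support, then set D3DRS_ADAPTIVETESS_X to that FOURCC, passing the min/max bounds as floats
reinterpreted as DWORDs. The test discards pixels whose value already stored in the depth buffer
falls outside the [min, max] range; handy for stenciled shadow volumes or deferred light volumes.
A sketch based on NVIDIA’s documented usage:

    #define FOURCC_NVDB ((D3DFORMAT)MAKEFOURCC('N','V','D','B'))

    // Reinterpret a float's bit pattern as a DWORD for SetRenderState.
    inline DWORD F2DW(float f) { return *(DWORD*)&f; }

    // Enable the test with [zMin, zMax] bounds...
    device->SetRenderState(D3DRS_ADAPTIVETESS_X, (DWORD)FOURCC_NVDB);
    device->SetRenderState(D3DRS_ADAPTIVETESS_Z, F2DW(0.1f)); // min
    device->SetRenderState(D3DRS_ADAPTIVETESS_W, F2DW(0.9f)); // max
    // ... draw ...
    // ...and disable it again.
    device->SetRenderState(D3DRS_ADAPTIVETESS_X, D3DFMT_UNKNOWN);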

Transparency Anti-Aliasing

NVIDIA exposes two controls: transparency multisampling (ATOC) and transparency supersampling (SSAA) [4].
The whitepaper does not explicitly say it, but for the ATOC render state
(D3DRS_ADAPTIVETESS_Y set to the ATOC FOURCC) to actually take effect, D3DRS_ALPHATESTENABLE must also be set to TRUE.
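
A sketch of toggling transparency multisampling around alpha-tested geometry, per the whitepaper [4]:

    #define FOURCC_ATOC ((D3DFORMAT)MAKEFOURCC('A','T','O','C'))

    // Alpha test must be enabled for ATOC to take effect.
    device->SetRenderState(D3DRS_ALPHATESTENABLE, TRUE);
    device->SetRenderState(D3DRS_ALPHAREF, 128);
    device->SetRenderState(D3DRS_ALPHAFUNC, D3DCMP_GREATER);
    device->SetRenderState(D3DRS_ADAPTIVETESS_Y, (DWORD)FOURCC_ATOC);
    // ... draw alpha-tested geometry ...
    device->SetRenderState(D3DRS_ADAPTIVETESS_Y, D3DFMT_UNKNOWN); // off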

Render Into Vertex Buffer

This is AMD’s “R2VB” feature, similar to “stream out” or “memexport” in other APIs/platforms. See [2] for
more information. Apparently some NVIDIA GPUs (or drivers?) support it as well.
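
A rough sketch of detecting and enabling it; the exact check parameters here are an assumption
based on the other FOURCC hacks, and the full stream-binding setup is more involved (see [2]):

    #define FOURCC_R2VB ((D3DFORMAT)MAKEFOURCC('R','2','V','B'))

    // Assumed support check, following the common FOURCC pattern:
    bool supported = SUCCEEDED(d3d->CheckDeviceFormat(D3DADAPTER_DEFAULT,
        D3DDEVTYPE_HAL, adapterFormat, 0, D3DRTYPE_SURFACE, FOURCC_R2VB));
    // Enable the extension (again via the POINTSIZE render state):
    if (supported)
        device->SetRenderState(D3DRS_POINTSIZE, (DWORD)FOURCC_R2VB);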

Geometry Instancing

Instancing is supported on all Shader Model 3.0 hardware by Direct3D 9.0c, so no extra hacks
are necessary there. AMD has exposed a capability to enable instancing on their Shader Model 2.0 hardware
as well: check for “INST” support, and do dev->SetRenderState (D3DRS_POINTSIZE, kFourccINST);
at startup to enable instancing.
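
A sketch of that check and setup; once the hack is enabled, the regular D3D9 instancing API
works even on SM2.0 parts:

    #define FOURCC_INST ((D3DFORMAT)MAKEFOURCC('I','N','S','T'))

    // Check for the hack, then turn it on once at startup:
    if (SUCCEEDED(d3d->CheckDeviceFormat(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
            adapterFormat, 0, D3DRTYPE_SURFACE, FOURCC_INST)))
        device->SetRenderState(D3DRS_POINTSIZE, (DWORD)FOURCC_INST);

    // Regular instancing setup: stream 0 holds per-vertex data drawn once
    // per instance, stream 1 holds per-instance data.
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | instanceCount);
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);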

I can’t find any document on instancing from AMD now. Other references: [6] and [7].

ATI1n & ATI2n Compressed Texture Formats

Compressed texture formats:

ATI1n, also called 3Dc+, or BC4 in DirectX 10 and later. This is single channel, 4 bits per pixel;
basically a DXT5/BC3 alpha block.

ATI2n, also called 3Dc, and almost BC5 (see below) in DirectX 10 and later. This is two channels, 8 bits per pixel;
basically two DXT5/BC3 alpha blocks one after the other.

Since they are more or less just DX10 formats, support is quite widespread: NVIDIA has exposed them
for a while, and Intel exposes them on recent drivers (15.17 or higher, since 2011 or so).
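
Checking for and creating these follows the usual FOURCC pattern; a minimal sketch:

    #define FOURCC_ATI1N ((D3DFORMAT)MAKEFOURCC('A','T','I','1'))
    #define FOURCC_ATI2N ((D3DFORMAT)MAKEFOURCC('A','T','I','2'))

    // Check support, then create like any other texture and fill the levels
    // with BC4/BC5-style block data (mind the channel order, see below).
    if (SUCCEEDED(d3d->CheckDeviceFormat(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
            adapterFormat, 0, D3DRTYPE_TEXTURE, FOURCC_ATI2N)))
        device->CreateTexture(width, height, mipCount, 0, FOURCC_ATI2N,
            D3DPOOL_MANAGED, &texture, NULL);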

The “almost BC5” part: ATI2n/3Dc has the red & green channels swapped compared to BC5. This is seemingly
not clearly documented anywhere, but it ends up working like that. The ATI Compressonator
source code seems to agree (for the ATI2N format, it puts X channel data after Y), even if the
header comment says that BC5 is identical to ATI2N :)

Compression tools like Compressonator have something called “A2XY” (CMP_FORMAT_ATI2N_XY there), which actually matches
the BC5 layout. However, neither NVIDIA nor AMD drivers (as of mid-2016) expose this FOURCC format at runtime. So if you want
your DX9 runtime to match what DX11/GL/Metal is doing with BC5, you’ll have to use the ATI2n format and swizzle the texture
data yourself at upload time (for each 16-byte block, swap the two 8-byte halves).
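
A sketch of that swizzle, assuming tightly packed BC5 block data:

    #include <algorithm>
    #include <cstddef>

    // Convert BC5 blocks to ATI2n layout in place: within each 16 byte
    // block, swap the two 8 byte single-channel sub-blocks.
    void SwizzleBC5ToATI2N(unsigned char* data, size_t sizeInBytes)
    {
        for (size_t i = 0; i + 16 <= sizeInBytes; i += 16)
            std::swap_ranges(data + i, data + i + 8, data + i + 8);
    }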

Caveat: when DX9 allocates the mip chain, it checks whether the format is a
known compressed format and allocates the appropriate space for the smallest mip levels. For
example, a 1x1 DXT1 compressed level actually takes up 8 bytes, as the block size is fixed
at 4x4 texels; this is true for all block compressed formats. With the hacked
formats, DX9 doesn’t know it’s a block compressed format and will only allocate the number of
bytes the mip level would have taken if it weren’t compressed. For example, a 1x1 ATI1n level will
only have 1 byte allocated. What you need to do is stop the mip chain before
either dimension shrinks below the block dimensions, otherwise you risk memory
corruption.
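
A sketch of clamping the mip count accordingly:

    // Only generate mip levels down to 4x4; smaller levels would be
    // under-allocated by D3D9 for these hacked block formats.
    unsigned SafeMipCount(unsigned width, unsigned height)
    {
        unsigned mips = 1;
        while (width >= 8 && height >= 8) // keep both dimensions >= 4
        {
            width /= 2;
            height /= 2;
            ++mips;
        }
        return mips;
    }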

Another thing to keep in mind: with the Vista+ (WDDM) driver model, textures in these formats will
still consume application address space, whereas most regular texture formats like DXT5 don’t take up
additional address space under WDDM. For some reason ATI1n and ATI2n textures on D3D9 are deemed lockable.