Nvidia: bindless textures - first experiences and bugs

Hi,
I am working on virtualization and visualization methods for large image and volume data sets. My current approach is to use large texture atlases to store the page data on the GPU. With volume rendering, a problem with large 3D textures is their cache inefficiency under most viewing angles (caused by their GPU-internal layout, which is optimized more for 2D textures). So I was stoked by the bindless extension, because it allows me to store the pages of the virtual textures in individual smaller textures and access them in the shaders through a uniform block or, better, a larger texture buffer holding the 64-bit resident texture handles. So my experiments began...

First experiment: Simple volume ray caster using a single 3D volume texture. You can find the complete source code of the experiment here.

1. Store the texture handle in a uniform block

2. As uniform buffer storage can get pretty limited when trying to pass a large number of page textures into a single shader (64KiB of storage at most, shared with other uniform data...), I tried to store the texture handles in a texture buffer with an RG32UI format. In the shader, the two components are converted back to a uint64_t, which can then be interpreted as a sampler:
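A minimal host-side sketch of the packing step described above, not the code from my experiment. It assumes the low 32 bits of the handle go into the R component and the high 32 bits into G; the names HandleTexel, packHandle, unpackHandle and buildHandleTable are made up for illustration. The shader would have to reassemble the two components in the same order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Two 32-bit words as they would sit in one RG32UI texel.
struct HandleTexel {
    std::uint32_t r;  // assumption: low 32 bits of the handle
    std::uint32_t g;  // assumption: high 32 bits of the handle
};

// Split a 64-bit resident texture handle into one RG32UI texel.
HandleTexel packHandle(std::uint64_t handle) {
    return { static_cast<std::uint32_t>(handle & 0xFFFFFFFFull),
             static_cast<std::uint32_t>(handle >> 32) };
}

// Rebuild the 64-bit handle; mirrors what the shader does after its texel fetch.
std::uint64_t unpackHandle(HandleTexel t) {
    return (static_cast<std::uint64_t>(t.g) << 32) | t.r;
}

// Build the CPU-side staging array that would be uploaded to the buffer
// object backing the texture buffer (the upload itself is not shown).
std::vector<HandleTexel> buildHandleTable(const std::vector<std::uint64_t>& handles) {
    std::vector<HandleTexel> table;
    table.reserve(handles.size());
    for (std::uint64_t h : handles) {
        HandleTexel t = packHandle(h);
        assert(unpackHandle(t) == h);  // the round trip must be lossless
        table.push_back(t);
    }
    return table;
}
```

Each handle occupies exactly one texel, so the page index stored in the indirection data can be used directly as the texel index.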

Results: Quite unexpectedly, this works perfectly! (This use case is not described in the bindless textures spec.)

3. Next I tried to fetch and translate the sampler only once at the beginning of the shader and store the sampler in a global variable:

Problem: The GLSL compiler does not allow assigning values to global sampler variables (the spec implied this should work).
So I just stored the uint64_t handle in a global variable and translated this value to a sampler right before taking a texture sample:

Results: This works, but it is _much_ slower than the previous version, which fetches the handle for every sample!

These tests were made using a smaller volume of 501x401x576 voxels with 8-bit scalar values and a simple transfer function, rendered to a 1600x1024 viewport. The plain non-bindless version ran at 2.5ms per frame and the uniform block bindless version at 2.8ms per frame. The first texture buffer version ran at 3.2ms per frame, and the texture buffer version that tries to fetch the sampler just once ran at 3.8ms per frame.
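To put the frame times in relation to each other, here is a small sketch that turns them into relative overhead against the plain non-bindless baseline. The function and constant names are only illustrative; the numbers are the ones reported above.

```cpp
#include <cassert>
#include <cmath>

// Relative slowdown of a variant against the baseline, in whole percent.
long overheadPercent(double baselineMs, double variantMs) {
    assert(baselineMs > 0.0);
    return std::lround((variantMs - baselineMs) / baselineMs * 100.0);
}

// Frame times reported above, in milliseconds.
const double kPlainMs         = 2.5;  // non-bindless 3D texture
const double kUniformBlockMs  = 2.8;  // handle in a uniform block
const double kTextureBufferMs = 3.2;  // handle fetched from the texture buffer per sample
const double kCachedHandleMs  = 3.8;  // handle fetched once, translated per sample
```

With these inputs the uniform block version is about 12% slower than the baseline, the texture buffer version about 28%, and the fetch-once version about 52%.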

Second experiment: Modify my virtualization renderer to use a texture buffer containing individual page texture handles instead of a large volume texture containing the page textures (not openly available at this point in time).

I changed my data structure containing the indirection information into the texture atlas to contain a simple index into the texture buffer. Additional changes were made to the methods retrieving the actual texture samples from the atlas. What is important is that the octree traverser I use to retrieve the volume brick for the ray traversal stores its temporary data in the following struct, which is filled out by the traversal function when querying the octree for a texture coordinate:

Problem: Simply put, the temporarily stored sampler does not work; I always get vec4(0.0) back from this lookup.
So I checked where the problems start and found that the uint64 handle was OK, so I additionally stored this handle and used it during the lookup:

Results: This works, but it is much slower than my current texture atlas approach, by a huge factor (3x to 6x slower in my experiments).

I used 512MiB worth of smaller 64³ volume page textures for my experiments (resulting in 2048 resident 3D textures). I ran all tests under Windows 7 x64 using a GeForce GTX 680 with the 301.32 driver.
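As a sanity check on these numbers, here is a short sketch of the arithmetic. It assumes one byte per voxel (8-bit scalars, as in the first experiment), which is the only setting under which 512MiB of 64³ pages comes out to 2048 textures; the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one cubic page; bytesPerVoxel = 1 is an assumption (8-bit scalars).
constexpr std::uint64_t pageBytes(std::uint64_t edge, std::uint64_t bytesPerVoxel) {
    return edge * edge * edge * bytesPerVoxel;
}

// Number of whole pages that fit into a memory budget.
std::uint64_t pageCount(std::uint64_t budgetBytes, std::uint64_t edge,
                        std::uint64_t bytesPerVoxel) {
    assert(edge > 0 && bytesPerVoxel > 0);
    return budgetBytes / pageBytes(edge, bytesPerVoxel);
}

// Size of a tightly packed table of 64-bit handles, one per page.
std::uint64_t handleTableBytes(std::uint64_t pages) {
    return pages * sizeof(std::uint64_t);
}
```

One 64³ page at one byte per voxel is 256KiB, so 512MiB holds 2048 pages, and the handle table for them is only 16KiB when packed as one 64-bit value per page.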

I know that this functionality is pretty new and the drivers surely need to mature. I was surprised how far I got with my experiments, and I hope that the performance situation can be improved drastically, because I see this way of handling virtual volume textures overcoming a lot of our problems with larger 3D texture atlases.


Btw you might want to use NV_shader_buffer_load to pack your texture handles into buffers instead of using a samplerBuffer; it saves the unpack work.

e.g.
struct entry {
  sampler3D tex;   // stored as a 64-bit bindless handle
  ..
};

uniform entry* entries;   // pointer uniform (NV_shader_buffer_load)

..
texture(entries[idx].tex, ...)

(Just beware of the memory layout rules, which are slightly different from the UBO std140 layout: NV allows dense packing of scalars instead of the vec4 expansion.)
It also makes it possible to use pointers for hopping around in more complex data structures.
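To illustrate the layout difference mentioned above, here is a rough host-side sketch comparing array strides. It assumes the std140 rule that the stride of an array of scalars is rounded up to the size of a vec4 (16 bytes), and it takes "dense packing" to mean that elements sit back to back; check the extension spec for the exact rules. All names are made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Round value up to the next multiple of alignment.
constexpr std::size_t roundUp(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// std140: the stride of an array of scalars is rounded up to vec4 size (16 bytes).
constexpr std::size_t std140ArrayStride(std::size_t elementBytes) {
    return roundUp(elementBytes, 16);
}

// Dense packing as described above (assumption): elements sit back to back.
constexpr std::size_t denseArrayStride(std::size_t elementBytes) {
    return elementBytes;
}

// Byte offset of element `index` for a given stride.
std::size_t elementOffset(std::size_t index, std::size_t stride) {
    assert(stride > 0);
    return index * stride;
}
```

Under these assumptions, the fourth 8-byte handle of an array starts at byte 48 in a std140 block but at byte 24 in a densely packed buffer, so the host-side struct used to fill the buffer has to follow whichever layout the shader expects.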