How do you guys handle large numbers of textures on the GPU? You can pass them all in as separate arguments and select between them with a huge switch/if statement, you can pack them into image arrays or 3D textures (which wastes memory), or you can pack them into a large texture atlas, which may be the best method.

If you build a global illumination renderer such as a path tracer, using a dedicated texture type doesn't seem plausible, because you cannot know in advance which texture will be used, so a lot of textures would all have to be bound to the kernel.

In my OpenCL renderer, I used a single uchar* kernel argument as a pointer to the texture storage. The renderer samples a texture at an index specified by a material descriptor (passed as another argument).

This way it is impossible to use HW-accelerated texture filtering (though I think view-dependent texture filtering makes the renderer biased/inconsistent, so only simple bilinear filtering is valid anyway). Additionally, I don't know whether this approach performs well (maybe not so well).

However, this approach seems to be the most generic one.
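A minimal host-side C sketch of this scheme (the names TexDesc and fetch_nearest are hypothetical, and the buffer lives in plain memory here rather than on the device):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-texture descriptor: a material stores an index into an
 * array of these, and `offset` points into one big shared texel buffer. */
typedef struct {
    size_t offset;   /* byte offset into the shared texel buffer */
    int    width;
    int    height;
} TexDesc;

/* Nearest-neighbor fetch of an RGBA8 texel at normalized (u, v),
 * clamping to the texture edges. */
static const uint8_t *fetch_nearest(const uint8_t *texels,
                                    const TexDesc *d,
                                    float u, float v)
{
    int x = (int)(u * d->width);
    int y = (int)(v * d->height);
    if (x < 0) x = 0;
    if (x >= d->width)  x = d->width  - 1;
    if (y < 0) y = 0;
    if (y >= d->height) y = d->height - 1;
    /* 4 bytes per RGBA8 texel, row-major layout inside each texture. */
    return texels + d->offset + 4 * (size_t)(y * d->width + x);
}
```

On the device the same arithmetic would run against the __global uchar* argument instead of a host pointer.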

Last edited by shocker_0x15 on Thu Dec 08, 2016 11:40 am, edited 2 times in total.

From CUDA 6.0 or so onwards, and on Kepler and newer, you can use texture objects. I'd guess that they are quite efficient, just as bindless textures in OpenGL are. You can have a dynamic number of texture objects and can pass a list (e.g. a thrust::device_vector) of them to the kernel.

For legacy architectures or with OpenCL I'd probably go with a texture atlas and references into that. I however decided to limit my ray tracing lib to Kepler+ just because of the availability of texture objects.

shocker_0x15: I don't understand, I thought in OpenCL you had to use an image2d_t type in order to get hardware filtering/sampling. How are you getting filtering using a global char*?

szellmann: yes, for new hardware that makes the most sense. I have to support old hardware, so I've gone with the texture atlas option and it works well. Kind of a pain packing textures and keeping up with offsets, but the solution is efficient in the kernel.
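The offset bookkeeping for the atlas route boils down to remapping a texture's local UVs into the atlas's UV space. A rough C sketch (AtlasEntry and its fields are made up for illustration):

```c
/* Hypothetical placement record for one texture packed into the atlas.
 * All values are in texels; atlas_w/atlas_h are the atlas dimensions. */
typedef struct {
    int x, y;        /* top-left corner of this texture inside the atlas */
    int w, h;        /* size of this texture */
} AtlasEntry;

/* Remap a local (u, v) in [0,1]^2 into atlas UV space. */
static void atlas_uv(const AtlasEntry *e, int atlas_w, int atlas_h,
                     float u, float v, float *au, float *av)
{
    *au = (e->x + u * e->w) / (float)atlas_w;
    *av = (e->y + v * e->h) / (float)atlas_h;
}
```

One known caveat with this approach: bilinear filtering near a sub-texture's border will blend in texels from its neighbors, so packed textures usually need a gutter of padding texels around them.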

"I don't understand, I thought in OpenCL you had to use an image2d_t type in order to get hardware filtering/sampling. How are you getting filtering using a global char*?"

I think the idea is that the __global char* textureData is just one huge array of all texels for all textures.

The idea with __global char* textureData is the following. You would usually have something like a TextureHeader struct "__global TextureHeader* textureHeaders" which holds some information about the texture like: width, height, offset, textureType, filteringType, etc. Then, some material or shader stores the index to the correct TextureHeader.

At runtime in your kernel you would get something like:

    TextureHeader texHeader = textureHeaders[shader.texid]; // texid indexes the correct texture header, which provides all the data we need
    TextureType texType = texHeader.type;                   // let's say it was an RGBA float, i.e. float4
    float2 texDimensions = make_float2(texHeader.width, texHeader.height);
    __global float4* myTexture = (__global float4*)(textureData + texHeader.offset); // here's the float4*

With myTexture you can do anything you like. This texture-loading code goes wherever you want it to go, e.g. into functions with switches for filterType, textureType, etc. For a (bi/uni)directional path tracer it is beneficial to sort the materials/hit points to get optimal performance, i.e. utilization and cache coherence for texture lookups. In my personal experience the __global char* approach is very fast even if it's not hardware accelerated.

My rt lib has a dedicated CPU API where you can call texture access "intrinsics" like tex2D, etc., which emulate CUDA behavior. So I did the comparison once, both with CUDA and on the CPU. I compared 3D texture accesses (so 1 memory access for nearest, 8 for linear!). I remember something along the lines of the CPU linear filter being 5x slower than nearest filtering. In contrast, on the GPU with HW-accelerated filtering, I found there was virtually no difference in performance between nearest and linear filtering. So I'm curious how emulating textures in GPU DDR memory, which is basically what you propose, compares to that.

szellmann wrote: I remember something along the lines of the CPU linear filter being 5x slower than nearest filtering. In contrast, on the GPU with HW-accelerated filtering, I found there was virtually no difference in performance between nearest and linear filtering.

How did you lay out your textures in memory? It probably wouldn't make much of a difference if you're performing nearest-neighbor sampling incoherently. But if you're performing linear filtering, then I would expect tiling and/or swizzling the textures to make the memory fetches more coherent for a single texture sample (at the cost of more computation).

I used tiles/bricks with Morton curves, and my texturing functions also support SoA access with SSE and AVX. Of course the outcome of the tests depends on overall coherence. The bottom line was that linear filtering had a significant impact when texturing was emulated, while there was virtually no difference with HW support.
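The Morton/Z-order swizzle mentioned here interleaves the bits of x and y so that 2D-adjacent texels land near each other in memory, which is what makes the neighboring fetches of a linear filter cache-friendly. A small C sketch of the index computation (for coordinates up to 16 bits):

```c
#include <stdint.h>

/* Spread the low 16 bits of v so that a zero bit separates each
 * original bit: ...b2 b1 b0 -> ...0 b2 0 b1 0 b0. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Z-order (Morton) index of texel (x, y): bits of x and y interleaved. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```

Within a tiled layout the swizzled address would then be something like tileBase + morton2d(x & tileMask, y & tileMask), with tileBase computed from the tile's row-major position.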