slow local arrays in GLSL

Hi,
I'm surprised at the cost of declaring arrays in GLSL programs.
The array doesn't even need to be initialized: all I do is write to a random element and read back from the same one (so the compiler can't optimize the array out). Simply declaring a larger array causes a significant slowdown, which seems to grow linearly with the array size.

I would quite like to know why this happens. What is the GPU doing that takes longer? Are there any tricks for circumventing this (the array cannot be a constant or live in a uniform buffer)?

Re: slow local arrays in GLSL

The reason is that a single GPU processing core can usually keep only as many threads resident as its register memory has room for. The threads don't actually run in parallel, but when a thread is scheduled out, e.g. to hide the latency of a texel fetch, another thread is executed in its place. This way the cores are kept busy all the time.

However, if you have large local memory usage, i.e. register memory usage, fewer threads can be executed concurrently on a single core, so performance decreases.

As a general guideline, a shader should always use as little register memory as possible to ensure high performance.
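To make that trade-off concrete, here is a rough sketch of the arithmetic in Python. The register-file size and the thread cap are illustrative assumptions rather than the specification of any particular GPU, the function name is hypothetical, and real hardware allocates registers in coarser units than this simple division suggests.

```python
def resident_threads(regs_per_thread, regs_per_core=32768, max_threads=1536):
    """Estimate how many threads fit on one core at once.

    regs_per_core and max_threads are assumed, illustrative figures.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    # The register file is split evenly between resident threads,
    # and the scheduler also caps the thread count.
    return min(max_threads, regs_per_core // regs_per_thread)


for regs in (16, 32, 64, 128):
    print(regs, "registers/thread ->", resident_threads(regs), "resident threads")
```

Under these assumed figures, doubling the per-thread register count halves the number of threads available to hide latency once the thread cap is no longer the limiting factor.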


Re: slow local arrays in GLSL

I only worked on a single cell to stop the compiler from optimizing the array out, so I could confirm that the array size was my problem and not the operations I was performing on it.

Just to clarify, from what you say there are two issues. First, threads use a shared pool of register memory. When the total memory used overflows, threads get dropped. I assume this means cores/"stream processors" become inactive (I'm not sure of the terminology). I would assume the amount of memory is hardware specific, but it should be possible to work out the maximum amount of memory usable before the number of concurrent threads is reduced.
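A back-of-the-envelope version of that calculation might look like the sketch below. Every number in it is an assumption made for illustration: the register-file size, the count of registers the rest of the shader needs, and the idea that each array element occupies exactly one register and that the compiler keeps the whole array in registers (which it may not). The function name is hypothetical.

```python
def max_array_elements(target_threads, other_regs=8, regs_per_core=32768):
    """Largest per-thread array that still lets target_threads stay resident.

    Assumes one register per element; other_regs and regs_per_core
    are illustrative guesses, not measured or documented values.
    """
    if target_threads <= 0:
        raise ValueError("target_threads must be positive")
    # Per-thread share of the register file at the desired thread count.
    budget = regs_per_core // target_threads
    # Whatever the rest of the shader does not need is left for the array.
    return max(0, budget - other_regs)


for threads in (1536, 512, 256):
    print(threads, "threads ->", max_array_elements(threads), "array elements")
```

The real limits would have to come from the vendor's documentation for the specific GPU, so this only shows the shape of the calculation.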

The second issue is the thread memory block being copied back and forth during global memory operations, such as an imageLoad(). In the example above, this shouldn't happen. However, if I were to fill the local array from global memory these copy operations would delay the next set of threads from running.

Is there a way I could manipulate the cache to store my array data, avoiding these issues? For example, if OpenGL knows each thread operates on its own block of global data, that data does not need to be coherent and can be stored/operated on in cache. Should this happen anyway, if I simply operate on the data directly with image load/store (I've tested with various image unit modifiers and had no luck so far)?

Re: slow local arrays in GLSL

I also ran into this problem several days ago and spent several days optimizing my algorithm to minimize register usage. By chance I noticed that NVIDIA had updated the GeForce driver (301.24, beta), so I downloaded and installed it, and suddenly my program ran 5 times faster on a GTX 570 than the original version did. Try it.

It is still slow on other cards (FX 3800 and Quadro 4000), though, so my optimization work is still useful.