Can you guess what happens? The elapsed time does not depend on the complexity of the function at all. There is a fixed setup portion (about 14.7 us on my laptop) and a portion that scales directly with the number of vertices (about 22.5 ms for 1e7 vertices).

Does anybody have any suggestion on measuring GLSL function execution time?

In fact, I need to compare the efficiency of several implementations, so absolute values are not important. On the other hand, I don't want to measure the execution time of the whole application with them applied, since that is quite specific and subject to optimizations tied to a particular implementation.

Measuring the performance of an isolated function in a vacuum is pointless. GLSL is not C, where you could expect the performance of a particular function to stay the same regardless of the surrounding code. During shader compilation, functions are inlined, instructions are statically reordered to hide the latencies of various operations, and so forth.

You can never assume that a function X which is faster than function Y in your vacuum test will always be faster in your application. Once you put it in your real shader(s), it may be faster or it may be slower.

For example, let's say you have some function that does purely math stuff. So it has some particular performance X. And let's say you have another function that does a texture fetch, then a small number of math computations. It has some particular performance Y.

It is entirely possible that, when you call one after the other in your real shader, the overall performance is not X + Y. It could be as little as max(X, Y), because the latency of one can be hidden by the independent work of the other.
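To make that concrete, here is a contrived GLSL sketch (the texture, names and loop count are invented for the example): the fetch result is not needed until the very end, so the scheduler can execute the independent math while the fetch is still in flight.

```glsl
#version 330 core
// Contrived sketch: one texture fetch plus independent ALU work.
// While the fetched value is still in flight, the hardware can run the
// math below it, so the combined cost can approach max(X, Y), not X + Y.
uniform sampler2D uTex;   // placeholder names, just for illustration
in vec2 vUV;
out vec4 fragColor;

void main()
{
    vec4 fetched = texture(uTex, vUV);        // "function Y": fetch + a bit of math
    fetched *= 0.5;

    vec3 computed = vec3(vUV, 0.0);           // "function X": pure math, independent of the fetch
    for (int i = 0; i < 16; ++i)
        computed = computed * 1.0001 + vec3(0.001);

    // Both results feed the output so neither gets optimized away.
    fragColor = fetched + vec4(computed, 1.0);
}
```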

So the exercise you propose is simply not useful. If you want to optimize a shader, you're going to have to do so in the actual context of the overall code you're trying to make faster. The only thing you can meaningfully test is how long that whole shader takes to execute.

Also, if you're measuring shader performance, why are you using transform feedback?

My question was a consequence of late-night desperate thinking.
In fact, the case is quite clear.

If there are no dependencies or pipeline stalls, the only parameter I can measure is the single-step interval.
Let's assume we have M processing units and want to execute N function calls.
The whole processing time is then roughly

T = T_setup + ceil(N / M) * t_step

where T_setup is the fixed setup cost and t_step is the single-step interval. That matches what I measured: a fixed portion plus a portion proportional to the number of vertices.

If you don't use Nsight, I'd suggest using a fragment shader on a fullscreen triangle (to avoid high primitive-setup costs, the <= 4 primitives-per-cycle setup limit, and transform-feedback setup/memory writes), with manually-unrolled loops and care not to let the compiler optimize the work away.
Things that can skew results are texture fetches (the longest stall), access to limited ALU units (trigonometry), and register-bank clashes (e.g. fmad r0, r4, r8, r12). In perfect circumstances (if the other warps happen not to use those resources), I'd guess they will appear to cost just one extra cycle.
Still, the vast majority of instructions will be fmad-like and effectively execute in a single cycle (even if they have a latency of 10-20 cycles), so you can infer how many simple instructions a GLSL function boils down to.
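A minimal sketch of such a test shader, assuming a fullscreen triangle and a seed uniform so the compiler can neither constant-fold the work nor discard it (names and counts are just placeholders):

```glsl
#version 330 core
// Fragment shader for a fullscreen triangle. The work is seeded from a
// uniform and the result is written to the output, so the compiler cannot
// fold the loop into a constant or eliminate it as dead code.
uniform float uSeed;      // set from the application; placeholder name
out vec4 fragColor;

void main()
{
    float x = uSeed + gl_FragCoord.x * 1e-6;

    // Manually unrolled fmad-like work; repeat the block to scale the load.
    x = x * 1.0001 + 0.0001;
    x = x * 1.0002 + 0.0002;
    x = x * 1.0003 + 0.0003;
    x = x * 1.0004 + 0.0004;
    x = x * 1.0005 + 0.0005;
    x = x * 1.0006 + 0.0006;
    x = x * 1.0007 + 0.0007;
    x = x * 1.0008 + 0.0008;

    fragColor = vec4(x);  // keep the result live
}
```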

You can often measure the minimum effective execution time of high-latency, limited-resource instructions fairly accurately by padding them with simple ALU ops in a fixed ratio, e.g. loop(10){ 1 fsin, 8 fmad } if there are 8x fewer trigonometry units than ALUs. The same goes for texture fetches, except that you have to make the texture small enough to fit in cache and use cache-friendly access patterns.
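For the 1:8 ratio, the shader could look something like this (a sketch only; the ratio and iteration count are guesses that would need tuning for the actual GPU):

```glsl
#version 330 core
// Interleaves one sin() with eight fmad-like ops per iteration, matching an
// assumed 1:8 SFU-to-ALU ratio. Seeded from a uniform so nothing is folded
// away at compile time.
uniform float uSeed;
out vec4 fragColor;

void main()
{
    float x = uSeed;
    for (int i = 0; i < 10; ++i)
    {
        x = sin(x);              // scarce transcendental (SFU) op
        x = x * 1.001 + 0.001;   // 8 cheap fmad-like ops as padding
        x = x * 1.002 + 0.002;
        x = x * 1.003 + 0.003;
        x = x * 1.004 + 0.004;
        x = x * 1.005 + 0.005;
        x = x * 1.006 + 0.006;
        x = x * 1.007 + 0.007;
        x = x * 1.008 + 0.008;
    }
    fragColor = vec4(x);
}
```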

If the difference is too small to measure, you could loop inside the shader (but I would make sure something changes every iteration, or the compiler might optimise the loop away).
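One way to do that is to drive the loop count from a uniform, so the compiler cannot evaluate the loop at compile time and the same shader can be timed at several workload sizes to separate fixed overhead from per-iteration cost. A rough sketch (names are placeholders):

```glsl
#version 330 core
// Loop count comes from a uniform, so the loop cannot be folded away, and
// timing the shader at different uIterations values isolates the cost of
// one iteration from the fixed overhead.
uniform int uIterations;   // set from the application
uniform float uSeed;
out vec4 fragColor;

void main()
{
    float x = uSeed;
    for (int i = 0; i < uIterations; ++i)
        x = x * 1.0001 + 0.0001;   // the value changes every iteration
    fragColor = vec4(x);           // keep the result live
}
```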

Interesting idea, but I'm going on a trip so I will try it next week.

Originally Posted by Ilian Dinev

If you don't use Nsight, I'd suggest using a fragment shader on a fullscreen triangle (to avoid high primitive-setup costs, the <= 4 primitives-per-cycle setup limit, and transform-feedback setup/memory writes), with manually-unrolled loops and care not to let the compiler optimize the work away.

In the testing shader I don't have any drawing; I even explicitly call glEnable(GL_RASTERIZER_DISCARD). The whole transformation is done in the vertex shader.
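Stripped down, the setup is just a vertex shader that calls the function being compared and writes the result to an output captured with transform feedback, so the call cannot be optimized away. Roughly (functionUnderTest and the other names are only placeholders for whatever implementation is being timed):

```glsl
#version 330 core
// Vertex-shader-only test: rasterization is disabled with
// GL_RASTERIZER_DISCARD and the output is captured via transform feedback.
in vec3 inPosition;
out vec3 tfResult;   // registered with glTransformFeedbackVaryings in the app

vec3 functionUnderTest(vec3 p)   // placeholder for the implementation under test
{
    return normalize(p) * dot(p, p);
}

void main()
{
    // Writing the result to a captured output keeps the call from being
    // eliminated by the GLSL compiler.
    tfResult = functionUnderTest(inPosition);
    gl_Position = vec4(tfResult, 1.0);   // unused with rasterizer discard, but harmless
}
```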

I'm not sure I have understood the rest of your suggestion. First, I want to use GLSL, not assembly, so I have no control over which instructions are actually executed. Furthermore, GLSL compilers are very aggressive with optimizations.

What you call a limited ALU unit is actually the SFU (special function unit), which handles transcendental functions. Addition, multiplication and logical operations are done on the SPU/DPU (some GPUs use the same logic for both single and double precision, like Fermi, while others have separate DP units, like Kepler). The SFU count is pretty high on modern GPUs, so I don't think they will cause any trouble.