Your own functions will never be faster, since the built-in functions are optimized. You could write dot3 yourself as well, but as a built-in function on “dedicated” hardware it takes far fewer cycles to execute. Always avoid writing your own code when a built-in exists (reflect falls into the same category, as do many others, e.g. ftransform). Besides, on a different GPU the speed of the built-in functions might increase.

Not only are clamp/min/max implemented in hardware with no branching, but if you have simple code like this:
if (var1 > 1.33) {
    var2 = 7;
    var3 = var4 - 5.0;
}

There will be no branching either, thanks to conditional execution (a flag specifying whether the instruction should actually take effect).
x86 CPUs have CMOVxx instructions that do the same (but are limited to “mov”), and ARM CPUs (in the classic 32-bit instruction set) have the same kind of condition flags on almost every instruction.
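
As a rough CPU-side model of what conditional execution does with the snippet above, here is a minimal C sketch. The struct and function names are hypothetical, and the variable types are assumed to be float. Both versions give the same result; the second one has no jump in its logic, only a flag and two selects.

```c
/* Hypothetical stand-in for the two values the snippet may overwrite. */
typedef struct {
    float var2;
    float var3;
} Result;

/* Branching form, written the same way as the snippet above. */
static Result with_branch(float var1, float var2, float var3, float var4) {
    Result r = { var2, var3 };
    if (var1 > 1.33f) {
        r.var2 = 7.0f;
        r.var3 = var4 - 5.0f;
    }
    return r;
}

/* Predicated form: the work is always done, and a flag decides
   whether each result is kept or the old value survives. */
static Result predicated(float var1, float var2, float var3, float var4) {
    int flag = var1 > 1.33f;        /* the condition flag */
    float new_var3 = var4 - 5.0f;   /* computed unconditionally */
    Result r;
    r.var2 = flag ? 7.0f : var2;    /* select, comparable to a conditional move */
    r.var3 = flag ? new_var3 : var3;
    return r;
}
```

Whether a C compiler really emits CMOV for the second function depends on the compiler and its optimization settings; the sketch only shows that the result does not need a jump.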
Also, if all GPU cores take a real branch at the same instruction (coherent branching), it only takes 2 GPU cycles. Coherent branching is obviously guaranteed if you loop uniform_N times. The slowness of uniform looping comes mostly from the extra loop-preparation instructions that compilers still don’t optimize well enough.

Ouch. No need for arithmetic like that.
GPU hardware is not as ridiculous as a 386 CPU. The logic needed in silicon for min/max/clamp is really simple; it was just missing from Intel CPUs until SSE came along.
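
To show how little logic is involved, here is a small C sketch with hypothetical helper names. min and max are each one compare plus one select, and clamp is just the two chained, which matches how GLSL defines clamp: min(max(x, minVal), maxVal).

```c
/* One compare and one select each; on hardware with min/max
   support this is a single instruction, not a branch. */
static float min_f(float a, float b) { return a < b ? a : b; }
static float max_f(float a, float b) { return a > b ? a : b; }

/* clamp as min(max(x, lo), hi); assumes lo <= hi. */
static float clamp_f(float x, float lo, float hi) {
    return min_f(max_f(x, lo), hi);
}
```

In a shader you would of course call the built-in clamp directly; the sketch is only there to show what the hardware has to do.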