I'm not sure how it works in the world of shaders, but in CUDA, when you have branches and all threads take the same branch, there is no slowdown as far as I understand. The slowdown comes from divergence: some of the processors have to sit idle while the others take the other branch.

So say I was doing a renderer in CUDA.

I could say, enable light 0 and make it a point light. Disable light 1. Disable light 2.

Then I'd have the entire program loaded and just run something like this:

There are branches, but all processors take the same path in the branches and don't cause divergence.
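The snippet the post refers to is not present in the text. As a purely illustrative stand-in (not the poster's code), here is a minimal CPU-side C++ sketch of what such a per-thread body might look like. The names `Light`, `LightType`, `kMaxLights` and `shade` are hypothetical. The point is that every condition depends only on per-draw state, so all threads would follow the same path.

```cpp
#include <array>
#include <cassert>

enum class LightType { Point, Directional };

// Hypothetical per-draw light state: identical for every thread in the draw.
struct Light {
    bool enabled = false;
    LightType type = LightType::Point;
    float intensity = 0.0f;
};

constexpr int kMaxLights = 3;

// Stand-in for the per-thread body of a kernel or shader.
// `distance` is the only per-thread input; the light state is uniform.
float shade(const std::array<Light, kMaxLights>& lights, float distance) {
    assert(distance > 0.0f);
    float result = 0.0f;
    for (const Light& light : lights) {
        if (!light.enabled) continue;          // uniform: same outcome for all threads
        if (light.type == LightType::Point) {  // uniform as well
            result += light.intensity / (distance * distance);
        } else {
            result += light.intensity;
        }
    }
    return result;
}
```

With light 0 enabled as a point light and lights 1 and 2 disabled, every invocation skips the same two lights and takes the same side of the type check.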

In the world of Shaders I've seen people recommending something like uber shaders to handle different permutations of rendering states to avoid branches since they are supposed to be slower. But is it really slower in shaders when all of them take the same path in the code? You end up having to compile many different shaders and changing which one is loaded based on what you are rendering, which causes some slowdown.

Is there a reason to not just have these much bigger shaders with some if statements?

all threads take the same branch, as I understand there is no slow down

Right, when all threads take the same branch, there is no extra slowdown on top of the regular number of cycles it takes to process the condition and branch instructions. Every instruction still has a cost, unless it's optimised out. N.B. some graphics drivers can optimise out this kind of non-divergent branch when you issue your draw call, if they are able to determine that all threads will take the same path.

You also have to watch out for increased register pressure from having too many branches, since the compiler will need to allocate enough registers to handle both sides of the branch.

Anyway, you have to realize that a lot of advice regarding graphics comes from the era before DX10/CUDA-capable GPUs. This is because old information hangs around the internet instead of dying out, and also because that level of hardware is still prevalent in consoles, mobile devices, and PCs. Before DX10 hardware, branching was generally a much less appealing proposition.

On older hardware such as the Xbox 360 you can go so far as to tell the compiler exactly what you are trying to do with the branching.
It's far safer to go the permutation route with older hardware.

You also have to watch out for increased register pressure from having too many branches, since the compiler will need to allocate enough registers to handle both sides of the branch.

Just highlighting this. If you mix shaders that do a lot of complex lighting math with shaders that are relatively simple, the register requirements of the complex case will kill your warp occupancy (and hence performance) in the simple case.

There is the increased instruction count, but I was thinking maybe it would be balanced out by not having to constantly switch the loaded shader as you are drawing different materials.

I can see the register count being a problem... But if you have a complex shader using many registers, and a branch that is simpler and uses fewer, is that really any worse, since there are times when you would have the more complex shader loaded and be using a lot of registers anyway?

When you take a branch, I'm pretty sure it uses the same set of registers. I'm not sure how it is on the GPU, but on the CPU branch A won't have registers 1-5 reserved and branch B won't have registers 6-10 reserved.

With good optimizations it should work something more like, branch A would say, I need 3 registers, branch B would say I need 10 registers. So if you take branch A you use registers 1-3, if you take branch B you use registers 1-10.

So the provided example may not issue branch instructions at all. This is a candidate for uniform branching, in which case the runtime or driver may choose to produce multiple compilations of the shader where all of the branches have been resolved and the loops unrolled. This was really common before we had hardware branching, in the Shader Model 2.x days. I'm not sure to what extent it's still used now, but you can hint to the compiler to unroll loops and avoid branches.

Uber shaders give you much more precise control over compilation though.
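To make the idea of "multiple compilations where the branches have been resolved" concrete, here is a rough C++ analogy. It is a sketch, not how any particular driver works: `Material`, `shadeUber` and `shadePermutation` are made-up names. The runtime version tests flags on every call, while each template instantiation has its conditions fixed at compile time, so the compiler is free to drop the untaken side.

```cpp
#include <cassert>

// Hypothetical material parameters, used only for this illustration.
struct Material {
    float albedo;
    float normalMapScale;
};

// "Uber" version: one body, flags tested at run time on every call.
float shadeUber(const Material& m, float nDotL, float specular,
                bool useNormalMap, bool useSpecular) {
    assert(nDotL >= 0.0f);
    float result = m.albedo * nDotL;
    if (useNormalMap) result *= m.normalMapScale;
    if (useSpecular) result += specular;
    return result;
}

// "Resolved" version: each <UseNormalMap, UseSpecular> pair is a separate
// instantiation whose conditions are compile-time constants.
template <bool UseNormalMap, bool UseSpecular>
float shadePermutation(const Material& m, float nDotL, float specular) {
    assert(nDotL >= 0.0f);
    float result = m.albedo * nDotL;
    if (UseNormalMap) result *= m.normalMapScale;  // constant condition
    if (UseSpecular) result += specular;           // constant condition
    return result;
}
```

Both forms compute the same value for the same flags. The difference is only in when the flags are resolved.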

The way registers work on the GPU is that on each hardware unit there is a fixed-size register file, and the number of registers used by a shader determines how many threads can be in flight simultaneously. So for example, if you had 10k registers and you were running a shader that used 10 registers, then you could have 1k threads in flight. Those threads don't all run concurrently of course, but having lots of threads allows the hardware to swap out threads stalled on memory accesses for other threads that can perform ALU work.

The reason this is a problem with branching is that the shader has to allocate for the worst case. So if you have a complex path that's rarely taken and requires 20 registers and a simple path that only requires 4, each thread will allocate 20 registers for its entire lifetime even if none of the threads ever take that branch. This means your occupancy is always determined by your worst case. In a permutation scenario you could draw your simple case objects with a simple shader and they would have good occupancy, and only the objects requiring the complex shader path would suffer the performance effects of having high register pressure.
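The arithmetic in the two paragraphs above can be written out as a small sketch. This uses the simplified model from the post (register file size divided by registers per thread) and the post's own numbers. Real hardware has further limits and allocation granularity that this ignores, and the function names are made up for illustration.

```cpp
#include <cassert>

// Threads that can be resident at once under the simplified model:
// a fixed register file shared equally by all threads of the shader.
int threadsInFlight(int registerFileSize, int registersPerThread) {
    assert(registerFileSize > 0 && registersPerThread > 0);
    return registerFileSize / registersPerThread;
}

// A shader with branches is allocated for its most expensive path,
// whether or not any thread ever takes it.
int worstCaseRegisters(int simplePathRegisters, int complexPathRegisters) {
    return simplePathRegisters > complexPathRegisters ? simplePathRegisters
                                                      : complexPathRegisters;
}
```

With a 10,000-entry register file, a simple standalone shader using 4 registers allows 2,500 threads in flight, while a branching shader whose worst path needs 20 registers allows only 500, even when every thread takes the 4-register path.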

That sounds really nice. So it basically knows which compiled version to use depending on what arguments I send it?

I would say, glUseProgram(someShader), and based on the parameters I set, it would actually select the real shader I want? In many cases, the branches are super obvious, like this material is not using normal maps, or this light is not casting specular reflections...

Right now I have a system that uses bit masks to figure out the permutations of the uber shader to load and it's cool and all, but if I can have something much simpler, that would be awesome.
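For readers who have not seen the bit mask approach, here is a minimal sketch of one way it could look. It is not the poster's system: the feature names, the `#define` strings and the `PermutationCache` class are all assumptions, and the integer handle stands in for a real compile-and-link step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical feature bits; one bit per optional shader feature.
enum Feature : std::uint32_t {
    NormalMap = 1u << 0,
    Specular  = 1u << 1,
    Shadows   = 1u << 2,
};
constexpr std::uint32_t kAllFeatures = NormalMap | Specular | Shadows;

// Builds the preprocessor header that would be prepended to the uber shader
// source before compiling one permutation.
std::string buildDefines(std::uint32_t mask) {
    assert((mask & ~kAllFeatures) == 0);
    std::string defines;
    if (mask & NormalMap) defines += "#define USE_NORMAL_MAP 1\n";
    if (mask & Specular)  defines += "#define USE_SPECULAR 1\n";
    if (mask & Shadows)   defines += "#define USE_SHADOWS 1\n";
    return defines;
}

// Caches one handle per mask so each permutation is built only once.
class PermutationCache {
public:
    int get(std::uint32_t mask) {
        auto it = cache_.find(mask);
        if (it != cache_.end()) return it->second;
        int handle = nextHandle_++;  // stand-in for compiling with buildDefines(mask)
        cache_.emplace(mask, handle);
        return handle;
    }
    std::size_t size() const { return cache_.size(); }

private:
    std::unordered_map<std::uint32_t, int> cache_;
    int nextHandle_ = 1;
};
```

At draw time the material's feature mask is the lookup key, so two materials with the same features share one compiled permutation.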

And yeah I can see the problem with using too many registers now. It's always good to know how something actually works.

Let me add one further note on the register allocation problem. If we can decide on one branch before we issue the draw call, Dx has some sweet candy for us. Formerly we would have branched depending on some constant buffer value and probably uniform branching would have kicked in. But now, Dx11 brought us interfaces to HLSL. With those we can define methods, which can be implemented by multiple classes. Before issuing a draw call we can assign a particular class that should be used for an interface variable. The good news is that the driver inlines the hardware native shader code of the methods - declared in the interface and implemented by the selected class - at bind time (!), thereby choosing the optimal register count.

This is supposed to be the solution to the dilemma: ubershaders vs. many specialized shader files. It has two upsides: We can stop worrying about the register allocation (since we’re not branching) and the code becomes cleaner (neither huge branch trees nor dozens of shader files for the permutations).
Of course on the downside it can only optimize the function bodies independently. :-/ But still, it's a very helpful tool.

Allison Klein (Gamefest 2008; slides and audio track are online on MSDN) and Nicolas Thibieroz (GDC 09) talked a little about this.
(Edit: In OpenGL the concept is called shader subroutines and does basically the same thing.)
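HLSL interfaces and classes look a lot like their C++ counterparts, so the shape of the idea can be sketched in C++. This is only an analogy with made-up names (`LightModel`, `DiffuseOnly`, `DiffuseWithAmbient`, `shadePixel`): C++ resolves the call through virtual dispatch at run time, whereas the post describes the driver inlining the selected implementation at bind time.

```cpp
#include <cassert>

// The "interface": the shader body is written against this only.
struct LightModel {
    virtual ~LightModel() = default;
    virtual float illuminate(float nDotL) const = 0;
};

// Two interchangeable implementations ("classes" in the HLSL sense).
struct DiffuseOnly : LightModel {
    float illuminate(float nDotL) const override { return nDotL; }
};

struct DiffuseWithAmbient : LightModel {
    float illuminate(float nDotL) const override {
        return 0.25f + 0.75f * nDotL;  // arbitrary ambient term for illustration
    }
};

// Stand-in for the shader: no branch on a material flag, just a call
// through whichever implementation was selected before the "draw".
float shadePixel(const LightModel& model, float nDotL) {
    assert(nDotL >= 0.0f && nDotL <= 1.0f);
    return model.illuminate(nDotL);
}
```

Choosing the implementation before the draw plays the role of assigning a class instance to the interface variable. The shader body itself stays free of permutation logic.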

This is also available in NVIDIA's Cg library and works on a much wider array of hardware. It was even working on the old GeForce 6800s, way back when GPU Gems (1!) was the hot new thing.

Just FYI

clb: At the end of 2012, the positions of jupiter, saturn, mercury, and deimos are aligned so as to cause a denormalized flush-to-zero bug when computing earth's gravitational force, slinging it to the sun.

Nice, thanks a lot! That’s very good to know! GPU Gems 1 is indeed quite antique. Kind of cool that those things were possible for so long.

How does Cg handle this? Does it compile and optimize the function bodies individually, too, and inline them at bind time? Or does it compile all permutations completely? How can I, as a programmer, decide which permutation to pick for the execution?

Can you tell me what the whole thing is called in Cg terminology, so I can find it more easily? I was curious and started browsing through the Cg specification to find out more. Do you mean “Overloading of functions by profile” (page 170)? That's also a nice feature, but that's not it, is it? It doesn't seem to solve the permutation issue, or does it?

EDIT: If they don't mention it in the new language manuals, I may stand corrected here. Wonder if it's been removed/deprecated somehow.

EDIT 2: Based on the API descriptions provided, it's probably the first approach, inlining/AST substitution.

I've been working with OpenGL for a while and recently got into GLSL, and I've never even touched DirectX yet. Does Cg run pretty well with OpenGL? I think I remember seeing an extension or something for Cg. Is it a good idea to switch over to Cg and try to make use of that feature?

Knowing OpenGL seems very useful since it pretty much runs on every device I've tried, Windows, Mac, Linux, iPhone, Android...

Yes, very much so. It can be pretty accurately described by the phrase 'HLSL for OpenGL,' in fact.