Scalar binary/intermediate shader code

I think it's high time we got a standard binary shader format. But it should be scalar, NOT vectored! Why scalar? Here are my reasons:

1) Let's define the term "general scalar" to mean anything that can be (relatively) easily converted to and from simple scalar form, including any kind of instruction-level parallelism but excluding any kind of data-level parallelism.
As it turns out, "general scalar" GPU architectures are more efficient than vectored ones - they are able to utilize the hardware resources better. For this reason all major GPU architectures have been "general scalar" for 10+ years now. On them, any vectored code is converted to their native "general scalar" code before it is executed. Thus vectored code remains useful only as a syntactic convenience, and that applies only to high-level languages intended to be used by people. The binary code in question is not intended for writing shaders in directly.

2) Converting code from vectored to scalar form is easy and incurs no code quality degradation. (In contrast, efficient conversion from scalar to vectored code is a very hard problem.) This means a scalar binary code would not place any additional burden on the compilers. Actually, it's just the other way around, because:

3) Scalar code is much easier for optimization algorithms to analyze and process. This reason makes scalar ultimately better than vectored.
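To make points 2 and 3 concrete, here is a minimal sketch of the vectored-to-scalar conversion. The tuple-based toy IR and the function names are invented for this illustration; the point is only that each vector instruction expands mechanically into one scalar instruction per channel, with nothing lost:

```python
# Toy illustration: scalarizing a vec4 multiply-add "c = a * b + d".
# The tuple "IR" (op, dest, sources) is made up for this sketch.

CHANNELS = "xyzw"

def scalarize(op, dst, srcs, width=4):
    """Expand one vectored instruction into `width` scalar instructions,
    one per channel.  Purely mechanical - no analysis needed."""
    return [(op, f"{dst}.{CHANNELS[i]}", [f"{s}.{CHANNELS[i]}" for s in srcs])
            for i in range(width)]

vec_mad = ("mad", "c", ["a", "b", "d"])   # vec4 c = a * b + d
scalar_code = scalarize(*vec_mad)
for inst in scalar_code:
    print(inst)
# First instruction: ('mad', 'c.x', ['a.x', 'b.x', 'd.x'])
```

The reverse direction - recognizing that four independent scalar instructions can be fused back into one vector instruction - is the hard part, which is exactly the asymmetry point 2 is about.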

I have been watching how badly Microsoft's HLSL shader compiler performs. The code it generates is awful, mainly because it has to deal with the extreme burden that is the vectored model.

Depends on the hardware. Both Intel and AMD are SIMD based. Intel hardware is SIMD8 based. For fragment shading, the scalar story is fine, since the hardware will invoke a SIMD8, SIMD16, or SIMD32 fragment shader to handle 8, 16, or 32 fragments in one go. However, for vertex and geometry shaders on Ivy Bridge and Sandy Bridge the hardware does 2 runs per invocation, so it really wants the code vectorized as much as possible. For the tessellation evaluation shader stage, that performance can be important, since it might be invoked a great deal.

I think that vectorizing code is hard. It is a heck of a lot easier to optimize scalar code, just run with scalars, and then try to vectorize afterwards. The issue is that various optimizations on the scalars can then prevent a vectorizer from doing its job. I am not saying it is impossible, but it is really freaking hard at times.

However, the entire need to vectorize will become moot as SIMD-based hardware shifts to invoking N vertex, geometry, or tessellation instances per shot, where N is the width of the SIMD. Once we are there, we can stop worrying about vectorization entirely. Naturally, NVIDIA can be giggling the entire time, since their SIMT-based architecture has been scalar since the GeForce 8 series, over 7 years ago.

IMHO optimizing the generic shader code (except maybe for size) is a bad idea, because a GPU vendor will do HW-specific optimization of the code anyway. If the generic optimizer decides to unroll a loop, but a loop would be faster on the target hardware, the optimizer would have to detect that a loop had been unrolled and un-unroll (re-roll?) it.

My personal pet preference would be for the chosen IR to be scalar based and LLVM. That way, all the optimization passes embodied in the LLVM project become available. However, then a very hard, nasty part comes in: writing an LLVM backend for a GPU. That is hard, and really hard for SIMD GPUs, because there are so many different ways to access the registers. The machinery in LLVM meant to help write a backend (CodeGen or TableGen or SomethingGen) really is not up to handling backends for SIMD-based GPUs. So GPU vendors need to roll their own in that case.

kRogue, AMD hardware is not SIMD; it uses a large "meta instruction" (I'm not sure what terminology they use) that is actually a packet of several independent scalar instructions. This is instruction-level parallelism. In practical compiler-optimization terms it is as good as plain scalar: real scalar code is almost trivially converted to this model without degradation. As for Intel, I read some of their specs, and it looks like their architecture is somewhat of a mix. It can operate both as scalar and as vectored. It was mentioned that their execution units run in scalar mode when processing fragments but run in vectored mode while processing vertices. So I guess a scalar binary format would do well with their hardware too.
I checked the PowerVR hardware (it is the most popular in the mobile space) and, sure enough, it is scalar too.

The reason the scalar GPU architecture is better is that in real-world shaders a big percentage of the code is often scalar (simply because the shader logic/algorithm is such), and when it runs on vectored hardware only one of the 4 channels does useful work. This is a great waste.
This problem does not exist on a scalar architecture. Unlike CPUs, GPUs have another source of parallelism - they process many identical items (fragments, vertices) simultaneously. So the hardware architects figured that instead of staying idle, 3 of the 4 channels could process other fragments instead. Thus they arrived at the conclusion that a scalar GPU is better than a vectored one. This is not true for CPUs, because there you have no easy source of parallelism, so they are better off providing SIMD instructions. Even if those are not often used, the few uses they have are still beneficial.
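The waste argument can be put into rough numbers. A back-of-the-envelope sketch (the 70% scalar-code fraction is an invented figure, purely for illustration):

```python
# Suppose 70% of a shader's operations are inherently scalar.
# (This fraction is made up for illustration.)
scalar_fraction = 0.7

# Vectored GPU: vector ops fill all 4 channels, but each scalar op
# occupies only 1 of the 4 channels while the other 3 sit idle.
vectored_utilization = (1 - scalar_fraction) * 1.0 + scalar_fraction * 0.25

# Scalar GPU: each lane processes a *different* fragment, so every lane
# does useful work no matter how scalar the shader's logic is.
scalar_utilization = 1.0

print(f"vectored ALU utilization: {vectored_utilization:.3f}")
print(f"scalar   ALU utilization: {scalar_utilization:.3f}")
```

On these assumed numbers the vector ALU spends less than half its capacity on useful work, which is the "great waste" being described.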

IME, the optimizing GLSL compiler takes less time to parse+optimize than the backend, which further optimizes and converts to HW instructions. Thus, a binary format would speed up compilation 2x at most. It would also introduce the need to slowly try to UNDO optimizations that are bad for the specific GPU.
What needs to be done is to spread awareness that if you don't immediately query the results of glGetShaderiv()/glGetProgramiv() after glCompileShader() and glLinkProgram(), the driver can offload compilation/linkage of multiple shaders to multiple threads, speeding up the process 8-fold. Also, use shader binaries.
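The scheduling pattern being described - issue all the compiles first, query statuses only afterwards - looks roughly like this. This sketch uses a Python thread pool to stand in for the driver's internal compiler threads; the real GL entry points appear only in comments, and the 8-shader workload and sleep time are invented:

```python
# Sketch of the "compile everything, then ask" pattern.  A real program
# would call glCompileShader()/glLinkProgram() here; this stand-in just
# simulates a slow compile so the ordering effect is visible.
from concurrent.futures import ThreadPoolExecutor
import time

def compile_shader(name):            # stand-in for glCompileShader(...)
    time.sleep(0.05)                 # pretend compilation takes a while
    return (name, True)              # stand-in for GL_COMPILE_STATUS

shaders = [f"shader{i}" for i in range(8)]

# Bad: compile, then immediately query status -> everything serializes.
t0 = time.time()
serial = [compile_shader(s) for s in shaders]
serial_time = time.time() - t0

# Good: issue all compiles up front; only collect statuses at the end,
# so the work can proceed on multiple threads.
t0 = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(compile_shader, shaders))
parallel_time = time.time() - t0

print(f"serial:   {serial_time:.2f}s")
print(f"parallel: {parallel_time:.2f}s")
```

The GL analogue is that an immediate glGetShaderiv() forces the driver to finish that one compile before your code continues, while deferring the query leaves it free to batch the work.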

I think you misunderstood the new AMD architecture. It is vectored, but each channel of the vector is a separate work item (fragment, vertex). From the point of view of a single work item, the architecture is scalar. At least that's what I gather from the pdf you pointed to. Unless I misunderstood something, it will be perfectly happy with scalar-only code.

Note that the vector width of the "vectored instructions" is 64, not 4. Also note that among the "scalar instructions" there are no floating-point ones, only integer, which means they are not intended for general-purpose calculations but mostly for control-flow logic. Also, as you pointed out, there are no swizzles and no write masks. This too speaks volumes. How do you imagine real-world shaders would fit such a model, when they always rely heavily on swizzles?
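For what it's worth, swizzles and write masks are exactly what disappears first under scalarization: a swizzled, masked vector op is just a handful of scalar ops with renamed operands, so an ISA without swizzle support is no obstacle to scalar code. A toy sketch (invented IR, as a pure illustration):

```python
# Toy sketch: a masked, swizzled vec4 op such as "r.xz = v.wy * w.xx"
# becomes plain scalar ops; the swizzle survives only as operand renaming.
def expand(op, dst, dst_mask, srcs_with_swizzles):
    out = []
    for i, d in enumerate(dst_mask):
        operands = [f"{src}.{swz[i]}" for src, swz in srcs_with_swizzles]
        out.append((op, f"{dst}.{d}", operands))
    return out

for inst in expand("mul", "r", "xz", [("v", "wy"), ("w", "xx")]):
    print(inst)
# Emits ('mul', 'r.x', ['v.w', 'w.x']) and ('mul', 'r.z', ['v.y', 'w.x'])
```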

Actually, AMD switched from VLIW to pure scalar in the sense I was talking about. So did NVIDIA.
Please distinguish between "internal" vectored operations that are part of the shader math (e.g. a vec4 sum) and "external" vectored operations that span separate fragments. The former are written manually by the shader author, but the latter can be done automatically by the GPU, because it processes many similar items at the same time.
Regarding the "internal" ones, the new AMD architecture doesn't have them, so it is scalar. This is all that matters for us here. How it optimizes its work by processing many fragments at once is not our concern.

There are both HW-specific (low-level) and general (high-level) optimizations. Both are very important. The high-level ones deal with things like replacing one math expression with an equivalent one that is faster to execute, common subexpression elimination, copy propagation, dead code removal, etc. The high-level optimizations are much easier and more powerful to do on scalar code.

The idea is that a re-vectorizer is not needed, because all modern GPUs are scalar. So my suggestion is to support the vectored format only in the high-level language, have the parser convert it to scalar, and from then on work only with the easy scalar format.