What do you suppose would generate that binary? From what? How would new features and extensions be added?
How would you distribute the compiler? Who will create it and manage it, and under what license? How will compiler bugs be reported, fixed and then distributed to end-users?

All this was solved automatically for shader-binary.

05-26-2014, 01:23 PM

l_belev

Well, one way is to use the existing infrastructure of glGetProgramBinary/glProgramBinary with a special new enum for the binaryFormat parameter. In this case the compiler will be built-in (as it is now) but any external compiler will also be allowed, since the binary format is standard. A slightly tricky detail is that glProgramBinary is about the whole program object and not about a specific shader. Maybe the new format will only be allowed for separable program objects that contain a single shader. Or maybe allow any program object if it's not a problem.

Ah, since "binaryFormat" is an output parameter for glGetProgramBinary, then in order to tell this function to generate the standard format we want, we may have a new program object parameter, e.g.

While this parameter is not GL_ANY (the default), glGetProgramBinary will return the binary in the specified format. Something like that.
This function may fail with GL_INVALID_OPERATION if the given program object can't be retrieved in the requested format. For example, if it was loaded from another binary format (and the driver can't recompile/convert) or when the program object contains more than one shader.
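To make the proposed rules concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described above, not real OpenGL: the names FORMAT_ANY, FORMAT_STANDARD, Program and get_program_binary are hypothetical stand-ins, and no such program parameter exists in any GL version or extension.

```python
# Toy model only: none of these names exist in OpenGL.
FORMAT_ANY = "ANY"            # stands in for the proposed default (GL_ANY in the post)
FORMAT_STANDARD = "STANDARD"  # hypothetical vendor-neutral binary format


class InvalidOperation(Exception):
    """Stands in for GL_INVALID_OPERATION."""


class Program:
    def __init__(self, shaders, separable=True, loaded_from=None):
        self.shaders = list(shaders)
        self.separable = separable
        self.loaded_from = loaded_from      # binary format the program was loaded from, if any
        self.requested_format = FORMAT_ANY  # the proposed new program object parameter


def get_program_binary(program, native_format="VENDOR"):
    """Return (binary_format, payload) following the rules sketched in the post."""
    if program.requested_format == FORMAT_ANY:
        # Default behaviour: the driver picks its own format.
        return native_format, b"<driver-specific bytes>"
    # A specific format was requested, so the proposed restrictions apply.
    if not program.separable or len(program.shaders) != 1:
        raise InvalidOperation("needs a separable program with a single shader")
    if program.loaded_from not in (None, program.requested_format):
        raise InvalidOperation("cannot convert from the format it was loaded in")
    return program.requested_format, b"<standard bytes>"
```

Whether multi-shader programs should be rejected or allowed is left open in the post; the model picks the stricter option.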

06-17-2014, 04:52 AM

kRogue

Quote:

kRogue, AMD hardware is not SIMD, it uses a large "meta instruction" (i'm not sure what terminology they use) that is actually a packet of several independent scalar instructions. This is "instruction-level" parallelism. In practical compiler-optimizing terms it is as good as plain scalar: a real scalar code is almost trivially converted to this model without degradation. As for Intel, i read some of their specs and it looks like their architecture is somewhat of a mix. It can operate both as scalar and as vectored. It was mentioned that their execution units run in scalar mode when processing fragments but run in vectored mode while processing vertices. So i guess a scalar binary format would do well with their hardware too. I checked for the PowerVR hardware (it is the most popular in the mobile space) and sure enough, it is scalar too.

I am going to put my little bits in on the hardware differences between NVIDIA, AMD and Intel. Caveat: I've spent far less time with AMD than the other two, and with Intel I've spent way too much time.

Here goes.

Intel is SIMD based all the way. The EU is a SIMD8-thing. Issuing a scalar instruction means that the results of 7 of the 8 slots are ignored completely. You can see this real easily when looking at the advertised GFLOPS, clock speeds and number of EUs [beware when talking GFLOPS everyone counts MADD as 2-flops]. Just to be clear: in Gen7 and before, vertex, geometry and tessellation shaders process 2 vertices per invocation, so it really wants, so badly wants, vec4 ops in the code; vecN operations, for N<4, have N/4 utilization for vertex, geometry and tessellation shaders. So for Gen7 and before, the compiler works hard to vectorize and that sucks. Starting in Gen8, there is 8-wide dispatch, so 8 invocations are done at a time and utilization is 100%. For fragment shading there are several modes: SIMD8, SIMD16 and SIMD32. The benefit of the higher modes is that more fragments are handled per instruction; additionally, since the EU really is SIMD8, SIMD32 is great since instruction scheduling almost does not matter [think of one SIMD32 instruction as 4 SIMD8's, by the time the last SIMD8 is started, the first SIMD8 finishes].
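The utilization figures above are simple lane counting. A minimal sketch in Python, assuming only what the post states (an 8-lane EU, and two vertices per invocation on Gen7 and before); the function names are made up for illustration:

```python
EU_WIDTH = 8  # lanes per EU instruction, as stated in the post


def lane_utilization(active_lanes, width=EU_WIDTH):
    """Fraction of SIMD lanes doing useful work for one instruction."""
    return active_lanes / width


def gen7_vertex_utilization(n_components):
    """Gen7 and before: two vertices share one invocation, each owning 4 of the 8 lanes.

    A vecN operation fills N of the 4 lanes that belong to each vertex.
    """
    return lane_utilization(2 * n_components)


def simd8_passes(dispatch_width):
    """How many SIMD8 passes one wider fragment instruction breaks down into."""
    return dispatch_width // EU_WIDTH


print(lane_utilization(1))         # a lone scalar result in an 8-lane instruction
print(gen7_vertex_utilization(4))  # fully vectorized vec4 vertex code
print(gen7_vertex_utilization(1))  # scalar vertex code on Gen7
print(simd8_passes(32))            # SIMD32 fragment dispatch
```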

NVIDIA is SIMT. An excellent article about it is here: http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html. When one first looks at SIMT vs SIMD it seems the difference is hair splitting. However, SIMT makes divergence in the source of data so much easier to handle, whereas in SIMD it is a giant headache [scatter-scatter-read anyone?]. Additionally, it dramatically simplifies the compiler backend work.

AMD is, as far as I know, SIMD with similar magicks as Intel does [though I think that on AMD each shader invocation handles many vertices just like Intel Gen8, but the details are different].

What this auto-SIMD'ing in the Intel and AMD drivers does is make it look like everything is scalar based, but the hardware still is a SIMD thing.

Lastly, that is really about int and float multiply and add. The other operations (reciprocal, exp, ln, trig functions) are usually handled by a dedicated unit that operates on scalars, so those are expensive, i.e. rather than say 8 reciprocals per N clock cycles (where N is the number of iteration steps), it is just 1 reciprocal per N clock cycles. I do not know what NVIDIA really does, but I do not think each CUDA core has anything beyond an ALU to handle int and float multiplies and adds, so those iterative operations are also much more expensive.

But now getting back to a nice IR form, the purpose of the thread. NVIDIA wants a scalar IR form as much as possible. Intel for Gen7 (Ivy Bridge and before) wants vectorized for everything but the fragment shader, and scalar for the fragment shader. For AMD, I am pretty sure it would want scalar too. However, for low power things, where float16 is important, the desire for vectorized code will come back. The reason is that a SIMD-N unit that does N floats per op will also do 2N fp16's per op, so the compiler backend will need to vec2-vectorize fp16 at the shader level to get maximum utilization.
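The fp16 argument is again slot counting. A rough sketch, assuming 32-bit lanes that can each hold two fp16 values (the function names are hypothetical, and real hardware may pack differently):

```python
def values_per_op(simd_width, bits):
    """Values one SIMD op can hold, assuming each lane is 32 bits wide."""
    return simd_width * 32 // bits


def fp16_utilization(simd_width, vector_width):
    """Fraction of fp16 slots filled when each of simd_width invocations
    supplies vector_width fp16 values to the same instruction."""
    slots = values_per_op(simd_width, 16)
    return min(simd_width * vector_width, slots) / slots


print(values_per_op(8, 32))    # fp32 values per SIMD8 op
print(values_per_op(8, 16))    # fp16 values per SIMD8 op
print(fp16_utilization(8, 1))  # purely scalar fp16 code
print(fp16_utilization(8, 2))  # vec2-vectorized fp16 code
```

Under these assumptions, scalar fp16 code leaves half of the slots empty, which is the reason a vec2 vectorizer is needed to get the full fp16 rate.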

It would be really neat to have what D3D has had for ages: the ability to send byte code to the driver rather than source, where that byte code does not depend on hardware or driver. The main issue, as someone already pointed out, is who would create and maintain the dedicated compiler to that byte code format? Personally, I am all for an LLVM based solution that is scalar based, but it won't be trivial. Even with the LLVM battle, making a backend is diamond-rock-hard. To put it mildly, using LLVM CodeGen does not go well and so, life is still hard.

06-19-2014, 02:14 AM

l_belev

Quote:

Originally Posted by kRogue

Intel is SIMD based all the way. The EU is a SIMD8-thing. Issuing a scalar instruction means that the results of 7 of the 8 slots are ignored completely.

If issuing a scalar instruction means 7/8 of the hardware resources are wasted then issuing a vec4 instruction means 4/8 (50%) of the resources are wasted. But 4 is the biggest vector size available in glsl (and i don't suppose their compiler is super-humanly smart as to be able to convert any and all 4-or-less vectored code to 8-vectored) so at all times at least 50% of the hardware resources are wasted? Unless the intel engineers are complete and utter idiots (which obviously is not true), what you state is simply impossible. You must have got something wrong.

Please pay attention and DON'T mix these two notions: 1) inter-work-item-SIMD and 2) intra-work-item-SIMD. All modern GPUs are the first but NOT the second! Being the first doesn't mean they are not scalar from the POV of a single work item, but being the second means they are not scalar. My argument is that the assumption that no modern GPU has property 2) is good enough. If this assumption can really be made then we can have a scalar-only binary code standard, which would greatly help the compilers.
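The two notions can be pictured as two ways of filling the lanes of one 8-wide instruction. This is only an illustrative sketch in Python; the lane layouts and function names are assumptions made for the picture, not a description of any particular GPU:

```python
SIMD_WIDTH = 8


def inter_work_item_lanes(work_items, component):
    """Inter-work-item SIMD: one scalar op per work item.

    Lane i holds the same component of work item i, so all lanes are busy
    even though each work item only sees scalar code.
    """
    return [item[component] for item in work_items[:SIMD_WIDTH]]


def intra_work_item_lanes(work_item):
    """Intra-work-item SIMD: one vector op for a single work item.

    Lane i holds component i of that one work item; the remaining lanes idle
    (shown as None) unless the code itself is wide enough.
    """
    return list(work_item) + [None] * (SIMD_WIDTH - len(work_item))


items = [(float(i), 0.0, 0.0, 1.0) for i in range(SIMD_WIDTH)]
print(inter_work_item_lanes(items, 0))  # x of eight different work items
print(intra_work_item_lanes(items[0]))  # xyzw of one work item plus idle lanes
```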

I am pretty sure that i read in some of intel pdfs that their GPU execution units can be configured to work either as scalar or as vectored and their driver uses the scalar mode for fragment shaders but vectored mode for vertex shaders.
As for amd and nvidia, i am 100% sure their architectures are scalar all the way, and this has been so for a long time now. Of course I mean scalar from the POV of single work item, which is what concerns us here.

For nvidia and amd i can confirm this by actual performance tests: I have written a converter from microsoft's binary shader code for dx9/dx8 (that code is an unbelievably disgusting mess beyond words, full of exceptions, exceptions from the exceptions, nasty patches and hacks and so forth) to glsl and i implemented the converter in 2 variants, one that preserves the vectored operations and another that converts vectored to scalar. On both nvidia and amd both perform equally. There is no detectable slowdown for the scalar code. Haven't tested on intel though because currently their opengl drivers are too buggy to run the application in question.
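For readers who have not seen such a conversion, the vectored-to-scalar variant amounts to expanding each vector op into one op per written component. The sketch below is not the converter described above; it is a minimal Python illustration on a made-up tuple format (opcode, dst, src_a, src_b, write mask) with no swizzles:

```python
def scalarize(vector_ops):
    """Expand each (opcode, dst, src_a, src_b, mask) vector op into scalar ops.

    Assumes every operand uses the same component as the destination
    (no swizzles), so the per-component ops are independent of each other.
    """
    scalar_ops = []
    for opcode, dst, src_a, src_b, mask in vector_ops:
        for comp in mask:
            scalar_ops.append(
                (opcode, f"{dst}.{comp}", f"{src_a}.{comp}", f"{src_b}.{comp}")
            )
    return scalar_ops


for op in scalarize([("mul", "r0", "r1", "r2", "xyz")]):
    print(op)
```

Real shader bytecode also has swizzles, modifiers and ops such as dot products that mix components, which need more care than this.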

I don't know what nvidia does but indeed their program "binaries" appear to be textual ARB assembly plus some binary metadata.
On the other hand their CUDA/OpenCL use their own assembly language called "PTX" which is much closer to their architecture and is scalar.
I wonder why they don't use it with OpenGL too. Smells like their opengl code contains thick layers of history of the kind that no one has the guts to attempt to dig into.

06-20-2014, 10:32 AM

elFarto

I've been looking at this very issue the past few days, and I think the best thing for OpenGL would be to ARBify NV_gpu_program{4,5} and friends (with a few tweaks) and use that as the base for all features. You can then modify the reference GLSL compiler to output ARB_gpu_program{4,5} programs, or have whatever middleware you're using generate it directly.

I've also noticed that the D3D10 HLSL bytecode maps almost perfectly to NV_gpu_program4 (minus the differences in samplers/textures D3D has).

Also, an ARB_gpu_program{4,5} plus an ARB_separate_shader_samplers would make porting D3D games over to OpenGL very easy.

Regards
elFarto

06-29-2014, 11:21 PM

kRogue

Quote:

Originally Posted by l_belev

If issuing a scalar instruction means 7/8 of the hardware resources are wasted then issuing a vec4 instruction means 4/8 (50%) of the resources are wasted. But 4 is the biggest vector size available in glsl (and i don't suppose their compiler is super-humanly smart as to be able to convert any and all 4-or-less vectored code to 8-vectored) so at all times at least 50% of the hardware resources are wasted? Unless the intel engineers are complete and utter idiots (which obviously is not true), what you state is simply impossible. You must have got something wrong.

Please pay attention and DON'T mix these two notions: 1) inter-work-item-SIMD and 2) intra-work-item-SIMD. All modern GPUs are the first but NOT the second! Being the first doesn't mean they are not scalar from the POV of a single work item, but being the second means they are not scalar. My argument is that the assumption that no modern GPU has property 2) is good enough. If this assumption can really be made then we can have a scalar-only binary code standard, which would greatly help the compilers.

I am pretty sure that i read in some of intel pdfs that their GPU execution units can be configured to work either as scalar or as vectored and their driver uses the scalar mode for fragment shaders but vectored mode for vertex shaders.
As for amd and nvidia, i am 100% sure their architectures are scalar all the way, and this has been so for a long time now. Of course I mean scalar from the POV of single work item, which is what concerns us here.

You are missing my point. Let's first look at the hardware, pure hardware, and then state how it is used in implementing an API. Here goes. Intel is a SIMD8 beast. It has a really flexible way to address registers, but at the end of the day the ALU is a SIMD8 thing at the ISA level. There are ways to send instructions that operate on more than 8 things with one instruction, coming from the flexible addressing system it has.

Now, how that is used for the implementation of graphics. For Gen7 and before, one vertex/geometry ISA invocation can do -2- vertices at a time. So if the GL implementation can vectorize everything to fully used vec4 operations, then one gets 100% ALU utilization. For fragment shading, there are several modes: SIMD8, SIMD16 and SIMD32, which means that 8, 16, or 32 fragments are processed per fragment ISA invocation. The punchline is that the GL implementation does not need to vectorize for fragment shading at all. As a side note, the registers in Intel Gen are 8-floats per register and there are 128 registers.

Don't take my word for it: open up the i965 open source driver implementation from Intel within Mesa, at src/mesa/drivers/dri/i965/, and see for yourself. For a -user- of Intel hardware this means that functionally, fragment shading is scalar based and vertex shading is vec4 based (for Gen7 and before).

Talking about "work items" and such is really talking about the software API, not what the hardware actually is.

I agree with you, once the API makes it scalar looking it does not matter (mostly) to a software developer. However, for Gen7 and before on Intel, a scalar based IR for vertex and geometry shaders will mean something has to vectorize it back to vec4 operations, which is not pleasant work. This is my point. There is hardware out there for which a scalar based IR is not all cupcakes and cookies, though at least that hardware is older.

Worse, once we get to fp16, even the fragment shader will want to be vec2 vectorized at least. So a purely scalar based IR is not going to be ideal when fp16 support is wanted. The reason one wants fp16 is that one can get twice as many ops per clock compared to fp32.

Quote:

For nvidia and amd i can confirm this by actual performance tests: I have written a converter from microsoft's binary shader code for dx9/dx8 (that code is an unbelievably disgusting mess beyond words, full of exceptions, exceptions from the exceptions, nasty patches and hacks and so forth) to glsl and i implemented the converter in 2 variants, one that preserves the vectored operations and another that converts vectored to scalar. On both nvidia and amd both perform equally. There is no detectable slowdown for the scalar code. Haven't tested on intel though because currently their opengl drivers are too buggy to run the application in question.

NVIDIA has this SIMT thing, which means they are very, very scalar happy. Also, that test does NOT prove anything. Indeed, since the scalarization is machine generated and code is not optimized out, a half-decent vectorizer could re-vectorize the code. Though, for NVIDIA I know they don't. For AMD I suspect they do not need to vectorize; though AMD is the one that contributed lots of vectorization magicks to the LLVM project.

Again, for Intel Gen7, try to keep your vertex shaders vec4-y to keep ALU utilization higher. However, the vast majority of applications are not geometry limited, so even if the ALU utilization for vertex shading is at 25%, it won't matter. Indeed, most of the time Intel Gen is not even limited by float operations at all, it is limited by bandwidth. To get an idea of why: Intel Gen uses the same memory as system RAM, so that is DDR3 with bandwidth around 20-30 GB/s (higher numbers for newer hardware), shared with the CPU. In comparison a dedicated video card, even a midrange one, using GDDR5 gets 200-300 GB/s.

But what I know from intel's documentation is that the issue with vertex shaders is not actually hardware but software-related, that is, it is their driver that puts the hardware in vectored mode for vertex shaders and scalar mode for fragment shaders.
In other words the hardware can actually work in scalar mode for vertex shaders too, it's up to the driver. They could change this behavior by driver update, which would be needed anyway in order to support the hypothetical new binary shader format.

When the vectored mode is left unused, they could clean up their hardware from this redundant "flexibility", which would save power consumption and die area. That's what all other GPU vendors figured out already, some of them a long time ago.
That would also ease the job of their driver team.
One would expect that they should have learned from their long history as a chip maker that over-engineering stuff does not result in more powerful hardware but in weaker hardware (remember itanium?).
Nvidia also learned a hard lesson with geforce 5 when they made it too "flexible" for supporting multiple precisions.

07-01-2014, 07:13 AM

elFarto

Quote:

Originally Posted by l_belev

In other words the hardware can actually work in scalar mode for vertex shaders too, it's up to the driver.

Yes and no. Yes they could change it, but no, they still might be limited by hardware, specifically bandwidth.

Quote:

Originally Posted by l_belev

They could change this behavior by driver update, which would be needed anyway in order to support the hypothetical new binary shader format.

A new binary shader format that doesn't support already existing hardware isn't a particularly good format.

Also, I'm not sure I follow your reasoning in your first post. You say a new shader format should be scalar, but then proceed to say that it's easy and loss-less to convert from vectored code to scalar code. So wouldn't keeping a vector format be preferable since it's compatible with either hardware setup? Of course it's not quite true to say that going from vectored to scalar code is loss-less, as you do lose the semantics, which are as you mentioned expensive to recover.
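To illustrate why the lost semantics are expensive to recover, here is a deliberately naive re-vectorizer in Python. It is a toy under stated assumptions, not how any driver works: ops are made-up tuples (opcode, dst, src_a, src_b, component) with no swizzles, and only neighbouring ops are merged.

```python
def revectorize(scalar_ops):
    """Greedily merge adjacent scalar ops that share opcode and registers.

    Because all operands use the same component, lanes are independent and
    merging neighbours is safe. Anything that is not adjacent stays scalar.
    """
    merged = []
    for opcode, dst, src_a, src_b, comp in scalar_ops:
        key = (opcode, dst, src_a, src_b)
        if merged and merged[-1][:4] == key and comp not in merged[-1][4]:
            # Same op on a new component: widen the previous op's mask.
            merged[-1] = key + (merged[-1][4] + comp,)
        else:
            merged.append(key + (comp,))
    return merged


ops = [
    ("mul", "r0", "r1", "r2", "x"),
    ("mul", "r0", "r1", "r2", "y"),
    ("add", "r3", "r0", "r4", "x"),  # an unrelated op lands in between
    ("mul", "r0", "r1", "r2", "z"),
]
for op in revectorize(ops):
    print(op)
```

The first two multiplies merge into one op with mask "xy", but the third stays separate because another instruction sits between them. Getting it back requires dependency analysis and reordering, which is the costly part.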

But as I said before, the NV_gpu_program{4,5} format is perfect for an intermediate format. Not only does the extension already exist, but all the drivers have partial implementations of it as it's based on ARB_{fragment,vertex}_program so there wouldn't be as much work to do for them to support it.