Questions and more bugs in OpenGL driver (II)

Also, two questions before the second bug report..
1. Is multi_draw_indirect fully supported in hardware on current Nvidia GPUs (including Fermi), or is some part implemented in software with CPU support in the driver? I ask because this functionality originated as an AMD extension, so it seems AMD really supports it in hardware..
2. It seems the fbo_no_attachments extension is equivalent to the D3D 11.1 TIR (target-independent rasterization) feature. If so, how is it implementable on Nvidia's first-generation D3D11 hardware (Fermi), when I have read on some sites that TIR was the one thing current D3D11 hardware doesn't support? What are the differences, then?
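For reference, a minimal sketch of how the GL 4.3 framebuffer-no-attachments API is used (assuming a current GL 4.3 context; error checking omitted):

```c
/* Sketch: rasterize at 1024x512 with 8x samples but no color/depth
   attachments. Assumes an OpenGL 4.3 context is already current. */
GLuint fbo;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferParameteri(GL_FRAMEBUFFER, GL_FRAMEBUFFER_DEFAULT_WIDTH,   1024);
glFramebufferParameteri(GL_FRAMEBUFFER, GL_FRAMEBUFFER_DEFAULT_HEIGHT,  512);
glFramebufferParameteri(GL_FRAMEBUFFER, GL_FRAMEBUFFER_DEFAULT_SAMPLES, 8);
/* Fragment shader side effects (image stores/atomics) replace the
   usual framebuffer outputs. */
```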
The second issue is in the 430-compute shader sample: it seems I got it fixed by removing the readonly and writeonly qualifiers from the storage buffers and then fixing a bug by g-truc,
i.e. changing:

More compute shader bugs and questions!

One question: is the compute shader stage going to be exposed in the Nvidia Cg language (where should I ask?), similar to how the tessellation stages were added in Cg 3.0? If not, another question is whether the Nvidia Cg compiler cgc.exe will at least get support for it, so we can compile GLSL compute shaders to NV_compute_shader assembly similar to what I get from the driver when assembly errors occur (see below)..
So this is the kind of compute shader bug that I'm afraid Nvidia will not fix, for security/hang-avoidance reasons:
I was happy to see I could port the Nvidia matmul CUDA sample to an OpenGL compute shader without much effort, basically doing two things:
*Adding defines:

First, let's be clear: this code works in CUDA and is recommended by Nvidia for its GPUs. Second, it relies on a fine detail of GPU execution: barriers may appear inside control flow as long as all warps in a block follow the same path through that control flow..
Why not in OpenGL compute shaders?..
Also, note that the "flow control" is caused by this loop:

Code :

for (uint a = aBegin, b = bBegin;
a <= aEnd;
a += aStep, b += bStep)

This is not a branch, and moreover it executes the same number of iterations for every thread, but in the form stated that is not trivial for the compiler to infer. So I thought, hey, fix it: I changed it to the following (in my sample, matrix w=h=1024 and block size=32, so iters=1024/32):
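A hedged sketch of the kind of rewrite described, with a compile-time-constant trip count (exact variable names are assumptions, not the sample's code):

```glsl
// Sketch (assumed names): same loop with a constant trip count, so every
// invocation provably executes the same number of iterations.
const uint ITERS = 1024u / 32u;   // matrix width / block size = 32
for (uint i = 0u, a = aBegin, b = bBegin;
     i < ITERS;
     ++i, a += aStep, b += bStep)
{
    // ... tile loads, barrier(), partial products, barrier() ...
}
```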

Even with this I get the same error, so the only way to fix it is to unroll the loop manually, which is ugly to say the least.. I even added #pragma unroll to the loop to force the compiler to expand it for me, but that didn't work.. Is there an Nvidia way to force unrolling, like #pragma unroll in CUDA?
Full shader:

From the post below I want to ask some questions/suggestions:
*Similar to Nvidia exposing the NV_CG_shader extension for allowing "some" Cg functions (and syntax?) in GLSL code, it would be good to expose a (call it) NV_CUDA_shader extension relaxing things in compute shaders to allow easier porting of CUDA kernels to GLSL (am I smoking crack?):
first, defines giving the CUDA grid/thread-ID syntax and the like their GLSL equivalents, such as:
#define blockIdx gl_WorkGroupID
#define __syncthreads() barrier()
#define threadIdx gl_LocalInvocationID
#define __shared__ shared
and typing them as int rather than uint, unlike the GLSL built-ins, for the portability reasons mentioned in the previous post;
and second, allow __shared__ declarations inside compute shader functions, which the compiler would hoist out internally if needed..
With that, the matrix mul code could be ported immediately!..
*The specification doesn't state any restriction on the use of barriers inside control flow, nor a requirement, as in CUDA, that all threads follow the same path to reach a barrier.. is this going to be fixed in the spec, or is it IHV-dependent?..
If some restriction does apply in the spec, Nvidia could allow a relaxation similar to CUDA's via an extension, GL_NV_barriers_in_control_flow or similar, to advertise that support..
*Also, sorry, but what are the DirectCompute restrictions in that case? Are there any?
*Finally, allow the work group size to be set programmatically, e.g. by letting the local_size_x variables be non-constant (via a new extension like GL_NV_uniform_local_size),

or better, a new function
DispatchComputeSetLocalSize(globx, globy, globz, locx, locy, locz)
equivalent to CUDA's
kernelname<<<globalsize, localsize>>>(..)
I know that DirectCompute doesn't support setting the local group size at dispatch time, but CUDA does..
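For contrast, a sketch of what GL 4.3 requires today: the local size is baked into the shader source, and only the group count is chosen at dispatch time (DispatchComputeSetLocalSize above is the poster's hypothetical API, not a real entry point; the program name is an assumption):

```c
/* Today in GL 4.3 the local size is fixed in the GLSL source:
       layout(local_size_x = 32, local_size_y = 32) in;
   Only the number of work groups is chosen at dispatch time. */
glUseProgram(matmulProgram);                 /* assumed program object */
glDispatchCompute(1024 / 32, 1024 / 32, 1);  /* counts are groups, not threads */
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
```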

One more question..

I have fixed it all using [Code] tags.. thanks Alfonse..

More bad news, although this would be Nvidia-only anyway for now: pointers in compute shaders are not working on Nvidia..

I have tried to answer another question about easy portability of GPGPU code to GLSL compute shaders:
using pointers in compute programs, of course via GL_NV_shader_buffer_load..
If I don't enable the extension, the parser correctly points out:

But is GL_NV_shader_buffer_load not supported in every stage? Why not in compute?!
So, Nvidia, is this a limitation of the implementation for now, or forever (well, more or less)?.. Anyway, this is supported in CUDA, so surely it has been supported in the GPU hardware for ages..
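For reference, a sketch of the kind of pointer usage NV_shader_buffer_load enables (buffer names and layout are assumptions; the GPU address is supplied by the app via glUniformui64NV after making the buffer resident with glMakeBufferResidentNV):

```glsl
#version 430
#extension GL_NV_shader_buffer_load : enable   // pointer reads
layout(local_size_x = 64) in;

// GPU address set from the app with glUniformui64NV after
// glMakeBufferResidentNV on the source buffer.
uniform float *src;
layout(std430, binding = 0) buffer Dst { float dst[]; };

void main()
{
    // NV_shader_buffer_load only allows reads through pointers;
    // NV_gpu_shader5 adds pointer stores as well.
    dst[gl_GlobalInvocationID.x] = src[gl_GlobalInvocationID.x];
}
```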

Hi oscar,
I ran into the barrier problem during development as well. Currently it seems like an oversight in the spec itself; if you look into the GLSL 4.3 spec:

The barrier() function may only be placed inside the function main() of the tessellation control shader
and may not be called within any control flow. Barriers are also disallowed after a return statement in the
function main(). Any such misplaced barriers result in a compile-time error.

This hasn't been specced in detail for compute yet. Ideally the spec and implementation can be relaxed accordingly!

In general, things are at an early stage yet; it will take a couple of revisions until the implementation has all the features it should have.

Regarding your tweet on NV_shader_buffer_load, a quick compile test using

Code :

#extension GL_NV_gpu_shader5 : enable

did not give any errors when declaring some pointers, although use of pointers to shared memory currently won't be possible.

Thanks for the bug reports. We have looked at the g-truc OpenGL 4.3 samples and have a few driver fixes pending. There are also a few bugs in the OpenGL 4.3 samples, which I have reported to g-truc.

1) The issue with calling imageSize() on an image declared with "coherent" has been fixed in the driver. The beta will be updated soon with this fix. Until then you can workaround the issue by removing "coherent" from the declaration.
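A sketch of that workaround (image format and name are assumptions):

```glsl
// Workaround sketch: drop "coherent" until the driver fix ships.
// layout(rgba8) coherent uniform image2D img;  // triggers the imageSize() bug
layout(rgba8) uniform image2D img;              // compiles; imageSize(img) works
```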

More bad news, although this would be Nvidia-only anyway for now: pointers in compute shaders are not working on Nvidia..

Sorry I'm behind on looking at these forum updates.

The error with #extension GL_NV_shader_buffer_load is a compiler bug that I was able to reproduce. The bug was an oversight in the logic handling which "#extension" directives are supported in which shader types/architectures. I'll work on a fix for the next beta driver update.

As Christoph Kubisch mentions, one workaround for this is to enable the extension GL_NV_gpu_shader5, which also includes the ability to store to pointers. GL_NV_shader_buffer_load only allows pointer reads.

Many thanks for the responses, Christoph and Pat!
Some updates on my findings:
First, I think the comment about it not currently being possible to use code that has barriers inside control flow is right to post here, because if it's not allowed in the spec, this is the place to motivate Nvidia to relax that restriction..
The reasons to allow it are big, first and foremost:
1. The D3D11 compute shader allows it!.. sorry to be pedantic, but otherwise the OpenGL ARB can't claim OGL 4.3 is a superset of D3D11..
For example, in the D3D SDK ComputeShaderSort11 sample, the BitonicSort kernel has
this code:

Well, this code can't work without control flow, which I verified by getting the compiled DX IL assembly from fxc: it has barriers inside "loop" keywords, in DX IL parlance. Even more, I compiled it down to AMD 5850 ISA using the AMD GPU ShaderAnalyzer, and yes, the barrier is there inside control flow..
2. Even if D3D11 didn't allow it, lots of *basic* GPGPU algorithms are fully optimized under these conditions, like sorts and perhaps even scans and many more; I think searching CUDA code over the web will give a grasp of the importance of this support in CUDA..
3. I have found that AMD is going to allow it! Well, details follow..
Using some leaked AMD drivers (9.00) and headers, I found an AMD_compute_shader extension in the fglrx drivers. Looking inside the driver binary to see which keywords are implemented from AMD_compute_shader and which from ARB_compute_shader, I successfully compiled a compute shader with a barrier inside control flow that the compiler cannot remove (the control flow).. I can post the full code and binary to check if you don't believe me..
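A minimal sketch of the pattern in question: a barrier inside a uniform loop, bitonic-sort style (names, sizes, and the compare-exchange step are illustrative, not the SDK sample's code; the point is only the barrier placement):

```glsl
#version 430
layout(local_size_x = 512) in;
shared uint sdata[512];
layout(std430, binding = 0) buffer Data { uint keys[]; };

void main()
{
    uint tid = gl_LocalInvocationID.x;
    sdata[tid] = keys[gl_GlobalInvocationID.x];
    barrier();

    // Uniform loop: every invocation runs the same iterations,
    // yet the barrier sits inside control flow.
    for (uint stride = 256u; stride > 0u; stride >>= 1u)
    {
        if (tid < stride && sdata[tid + stride] < sdata[tid])
        {
            uint t = sdata[tid];
            sdata[tid] = sdata[tid + stride];
            sdata[tid + stride] = t;
        }
        barrier();   // barrier inside the loop body
    }
    keys[gl_GlobalInvocationID.x] = sdata[tid];
}
```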

*Allow selective unrolling of loops, similar to D3D's [unroll] (i.e. better control over #pragma optionNV(unroll all))
Motivation:
Another point: the matmul code, after all, has a loop with a fixed iteration count (assuming it is tuned for specific matrix dimensions), so we can unroll it. I have found that using #pragma optionNV(unroll all) solves the problem and the kernel compiles, *BUT* it produces a big binary. So the question: is there a pragma in the NV compiler to allow unrolling of selected loops, similar to D3D's [unroll]? Hey, that's another place where D3D11 compute shaders are better than OpenGL.. With that support I could unroll the outer loop of the matmul code posted before while the loop between barriers stays rolled..
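A sketch of the workaround described: the pragma applies shader-wide, which is exactly the complaint (function name and loop bounds are illustrative):

```glsl
// Applies to the whole shader, not a single loop - hence the request
// for a per-loop equivalent of D3D's [unroll].
#pragma optionNV(unroll all)

void accumulate()   // illustrative function, not the matmul sample's code
{
    for (int k = 0; k < 32; ++k)   // gets fully unrolled...
    {
        // ...and so does every other fixed-count loop in the file,
        // which is what blows up the binary size.
    }
}
```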

I have found that using GL_NV_gpu_shader5 works perfectly and allows using more CUDA code with less massaging.. I happily found that OpenCL code (CUDA code uses the same pattern) that passes shared memory pointers as function arguments compiles correctly up to the assembly compile, which then finds something like an LDS.? line and returns with an error..
So, from your comment "use of pointers on shared memory currently won't be possible", I have hope that this will be possible in the future.. Please allow that use case!
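A sketch of the pattern that currently fails at the assembly stage (assuming NV_gpu_shader5 pointers; all names are illustrative):

```glsl
#version 430
#extension GL_NV_gpu_shader5 : enable
layout(local_size_x = 64) in;
shared float tile[64];
layout(std430, binding = 0) buffer Out { float result[]; };

// OpenCL/CUDA-style helper taking a pointer into shared memory.
// Per the thread, this parses and compiles down to assembly,
// which then rejects the shared-memory (LDS) access.
float sum2(float *p) { return p[0] + p[1]; }

void main()
{
    uint tid = gl_LocalInvocationID.x;
    tile[tid] = float(tid);
    barrier();
    if (tid + 1u < 64u)
        result[tid] = sum2(&tile[tid]);
}
```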

First, I think the comment about it not currently being possible to use code that has barriers inside control flow is right to post here, because if it's not allowed in the spec, this is the place to motivate Nvidia to relax that restriction

Presumably by releasing an extension, right? Because we don't want anyone, AMD, NVIDIA, Intel, etc., just ignoring what the spec says because it's not what they want. Your AMD_compute_shader example is a separate specification, which is what allows AMD to expose whatever functionality they choose.