OpenGL Compute Shaders vs OpenCL

Hi,

I've been working on computing an image histogram using OpenGL compute shaders, but it's very slow. What I do is divide the image into rows between threads, and each thread computes the histogram of its respective rows. I use the imageLoad() function to read pixels from an image texture.

I tried to measure OpenGL compute shader performance just by summing up a constant value, but it's still very slow.

I want to know whether OpenGL compute shaders run in the OpenGL rendering pipeline or on the CUDA multiprocessors. Right now the code above seems to run as slowly as fragment shader code.

On my GTX 460 I have 7 CUDA multiprocessors (OpenCL compute units) running at 1526 MHz, for 336 shader units in total. It should be possible to execute the above loop extremely fast on such a multiprocessor, shouldn't it?

Please clarify for me the difference between OpenGL compute shaders and OpenCL. Where do they run? What's the cost of switching between OpenCL and OpenGL?

Re the difference between OpenCL and OpenGL compute shaders: compute shaders are easier to use if you need to add a bit of compute to an OpenGL application, because you don't need to deal with all the complications of sharing devices and resources between OpenGL and OpenCL. Depending on the GPU vendor, compute shaders may also have less run-time overhead (particularly if the vendor doesn't implement ARB_cl_event, which I don't think NVIDIA did last time I checked). On the other hand, compute shaders are a bit more limited since it's just GLSL with a few extras bolted on to support workgroups, rather than a language designed for compute.

Regarding the number of threads: you definitely need more threads, e.g. a thread per pixel rather than per row. Depending on how many histogram bins you have, you might get reasonable performance just by using atomic image operations to write to a single histogram.
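As a hedged sketch of that suggestion (one invocation per pixel, atomics straight into a single global histogram stored in an r32ui image; the binding points, the 256-bin luminance histogram, and all names are my assumptions, not from the original post):

```glsl
#version 430
layout(local_size_x = 16, local_size_y = 16) in;

// Source image (read-only) and a 256x1 r32ui histogram image,
// which must be cleared to zero before the dispatch.
layout(binding = 0, rgba8) readonly uniform image2D srcImage;
layout(binding = 1, r32ui) uniform uimage2D histogram;

void main()
{
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(coord, imageSize(srcImage))))
        return; // guard against partial work groups at the image edges

    // Compute an 8-bit luminance bin for this pixel.
    vec3 rgb = imageLoad(srcImage, coord).rgb;
    uint bin = uint(dot(rgb, vec3(0.299, 0.587, 0.114)) * 255.0);

    // One atomic increment per pixel into the global histogram.
    imageAtomicAdd(histogram, ivec2(bin, 0), 1u);
}
```

You would dispatch this with roughly glDispatchCompute((width+15)/16, (height+15)/16, 1) and issue a glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT) before reading the histogram back. Note there is no loop in the shader: each invocation touches exactly one pixel.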

Here is the code. Please tell me if something is wrong performance-wise. I want to compute the histogram of a 640x480 image. I'm putting the histogram in shared memory, as access to it should be a lot faster than using imageAtomicAdd, which operates on global memory.

I need further clarification. GPU Caps Viewer says that the GTX 460 has 336 shaders and 7 multiprocessors with a warp size of 32. From what I understand, vertex and fragment shaders should run on the 336 shader units and CUDA/OpenCL/OpenGL compute shaders on the 7 multiprocessors. Please correct me if I'm wrong.

I then did a performance test with this code, as I suspected that imageLoad and shared memory might be slow. I removed the imageLoad operations and the accesses to shared memory, but it is still very slow.

I also tried to compute the histogram inside a fragment shader using imageAtomicAdd, but it is slow (around 1 ms) because it accesses global memory, which is very slow. I then moved to compute shaders, as they give access to shared memory, which should be faster than global memory.

Those 336 shader cores are practically the same hardware as the 7 multiprocessors; each multiprocessor simply contains 48 shader cores. It's just defined at a different granularity, kind of like OpenCL processing elements vs. compute units. Believe me, no matter whether you run a D3D, OpenGL or any other graphics shader, or you run CUDA, OpenCL or OpenGL compute shaders, it will all run on the same piece of hardware. That's why it is called a unified architecture.

The main problem with your code, anyway, ends up being that you have a tight loop in the shader.

Also, while performing atomic global memory operations on images is in fact slower than accessing shared memory, the problem is that if you want to use shared memory, all the shader invocations (or kernel instances) that touch it must run on the same compute unit. In your case that means you can only use one of your multiprocessors, as shared memory cannot be shared between compute units; so even with thousands of threads and properly parallel code, you would practically serialize your work onto a single compute unit.

On the other hand, using atomic global memory operations (i.e. image load/store), you can use all the computing resources of your hardware.

Actually, there are far better ways to exploit both the power of shared memory and global memory atomics, but then you must learn more about parallel computing and GPU parallelism in general. For now, I think image load/store is your best bet. You should still be able to compute the histograms of thousands of textures per second anyway.

Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
Technical Blog: http://www.rastergrid.com/blog/


OK, I will then dispatch 7 work groups, each with its own shared memory. Each work group has to compute 640/7 ≈ 91 rows. Each work group will have 45 threads, and each thread will compute 91/45 ≈ 2 rows. At the end, one thread from each work group will update the global histogram using imageAtomicAdd. I will use barrier() to synchronize the threads before updating the global histogram.


Smart GPU folks at (probably) all of NVIDIA, AMD, and Intel have already written fast histogram reduction code (and fast reduction code in general) -- I know NVIDIA has -- so check out their GPU computing SDKs for samples with histogram and/or reduc in the filename.

One thing you can do to accelerate your development is look at their OpenCL, CUDA, and/or D3D compute shader code for histogram reduction and convert it to an OpenGL compute shader. That should give you a leg up on seeing how to employ shared memory and threading to maximum benefit.

How would that be any good? You still have a two-level nested loop inside your shader. Good would mean you have none.

The shader code you write is executed for each work item in each work group; thus the shader itself should practically not deal with anything other than a single pixel of the image (again, no loops).

Also, don't use image variables if you don't need to. A read-only image does not necessarily perform the same as a texture; at least I can guarantee you that it doesn't perform the same on all hardware. If you only need read-only access, use texture fetches instead of image loads.

A simple pseudocode to demonstrate (assuming a work group size of 256):
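(The pseudocode itself seems to be missing from the post; below is my hedged GLSL reconstruction of what is being described: a 256-invocation work group builds a per-group histogram in shared memory, then flushes it with one global atomic per bin. It uses texture fetches for the read-only source, per the advice above. The bindings, the 256-bin luminance binning and all names are my assumptions.)

```glsl
#version 430
layout(local_size_x = 16, local_size_y = 16) in; // 256 invocations per group

layout(binding = 0) uniform sampler2D srcTex;          // read-only source
layout(binding = 1, r32ui) uniform uimage2D histogram; // 256x1, zero-cleared

shared uint localBins[256]; // per-work-group histogram in shared memory

void main()
{
    // Each of the 256 invocations clears exactly one shared bin.
    localBins[gl_LocalInvocationIndex] = 0u;
    barrier();

    // One pixel per invocation: fetch, bin, shared-memory atomic.
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    if (all(lessThan(coord, textureSize(srcTex, 0)))) {
        vec3 rgb = texelFetch(srcTex, coord, 0).rgb;
        uint bin = uint(dot(rgb, vec3(0.299, 0.587, 0.114)) * 255.0);
        atomicAdd(localBins[bin], 1u);
    }
    barrier();

    // Each invocation flushes one bin: 256 global atomics per work group
    // instead of one global atomic per pixel.
    imageAtomicAdd(histogram, ivec2(gl_LocalInvocationIndex, 0),
                   localBins[gl_LocalInvocationIndex]);
}
```

Again, note that there is no loop anywhere: the domain (total work item count) covers the whole image, and the shared-memory traffic replaces most of the global atomics.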

So your work group size will be 256 (let it be 16x16), and then your total work item count (the domain) should practically be the size of the texture you calculate the histogram of.

Sure, you could use other work group sizes, e.g. 64 or 128, but then each shader invocation has to take over multiple exports to the final histogram image.

First of all, forget about aligning your work group size to your particular hardware; it won't help. Always use a work group size of at least 64 (power-of-two values preferred), but even more wouldn't hurt, especially for more complicated compute shaders (unlike this one).

Anyway, the inefficiency of your approach DOES NOT come from the fact that you use image load/store, but from the fact that your code is serial, not parallel. Once you remove all the loops, you'll see that no matter whether you use shared memory, or a work group size of A or B, it will still be at least an order of magnitude faster than what you have now.


I have a framework with OpenGL 4.2 & OpenCL. Now I'm updating it to OpenGL 4.3 and trying to use compute shaders. But compute shaders are ~20% slower than my OpenCL kernels in this case.

Did anybody "seriously" compare compute shaders vs OpenCL?
What is the strategy for working with global memory in compute shaders (i.e. shader_storage_buffer, imageStore/imageLoad, or imageStore/texelFetch)?
Do I always need glMemoryBarrier() after glDispatchCompute()?
Is a "fullscreen quad" better than a compute shader?

You've practically hijacked the thread, because this isn't really related to the original question, but I'll still try to answer, since you brought up an interesting topic.

So here are my answers:

Originally Posted by megaes

But compute shaders are ~20% slower than my OpenCL kernels in this case.

It is possible that there is a completely different compiler behind OpenGL compute shaders and OpenCL kernels.
Also, not all compute capabilities present in OpenCL are available in OpenGL.
Not to mention that the way synchronization is handled in OpenCL is wildly different than that of OpenGL.
Finally, it may even vary from hardware to hardware.

Originally Posted by megaes

What is the strategy for working with global memory in compute shaders (i.e. shader_storage_buffer, imageStore/imageLoad, or imageStore/texelFetch)?

All of these are options. Personally, for read-only data I'd prefer texture fetches, simply because some hardware might have a different path for storage buffers or load/store images, as those are R/W data sources.
Also, there could be a difference between the storage buffer and load/store image implementations as well, as the latter has a fixed element size while the former doesn't really have the notion of an element at all; thus dynamic indexing in particular could result in different performance in the two cases.
Another thing is that storage buffers, image buffers and texture buffers access linear memory, while other images and textures usually access tiled memory, so there can be a huge difference in performance because of this as well.

Originally Posted by megaes

Do I always need glMemoryBarrier() after glDispatchCompute()?

No, why would you? Unless you plan to consume data written by the compute shader through image stores, storage buffer writes or atomic counter writes, you don't have to. The memory barrier rules are the same as before.
Also note that while calling glMemoryBarrier is not free and people are afraid of its cost, don't think that other write-to-read hazards, like those involved in framebuffer writes or transform feedback writes, are free either; they are just implicit. No additional API call is made, but synchronization still happens behind the scenes, which is arguably worse than the new mechanism, where at least the app developer has explicit control over whether a sync is needed or not.

Originally Posted by megaes

Is a "fullscreen quad" better than a compute shader?

Maybe on some hardware, maybe not on others. Fragment shaders are somewhat different from compute shaders. They are instantiated by the rasterizer, which means that the granularity (work group size, in compute shader terminology) might be different. Compute shaders provide more explicit behavior: if you specify a work group size of 16x16, you are guaranteed that those invocations will be on the same compute unit, as they may share memory, while the number of fragment shader instances issued on a single compute unit, and which fragments they process, is determined by the rasterizer and can vary wildly between different GPUs.

Also, the individual shader instances might be submitted to the actual ALUs in a different pattern for compute shaders than for fragment shaders, so access to various types of resources (linear or tiled) also results in different access patterns; thus one can be worse than the other. But all this depends on the GPU design, the type of resource you access and the access pattern of your shader.

A benefit of using fragment shaders is that you can use framebuffer writes to output data which is almost guaranteed to be faster than writing storage buffers or performing image writes. You can even perform limited atomic read-modify-write when doing framebuffer writes thanks to blending, color logic op and stencil operations.

Finally, note that GPUs don't rasterize quads; if you do compute with a fragment shader, you are actually rendering two triangles, which means that across the diagonal edge where the two triangles meet, on some hardware you might end up with half-full groups of shaders being executed on a compute unit, which on its own will already result in a slight drop in overall performance.

To sum it up, there is no general answer whether OpenCL is better than OpenGL compute shaders, or that OpenGL compute shaders are better than fragment shaders. It all depends on the hardware, driver, your shader code, and the problem you want to solve.

What I can suggest, based on what I've heard from developers, is that if you want to do some compute work in a graphics application that already uses OpenGL, you are better off not using OpenCL-GL interop, as the interop performance seems to be pretty bad regardless of GPU generation or vendor.

Last edited by aqnuep; 01-05-2013 at 10:31 PM.
