The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


Optimisation tips for fetch intensive kernel on ATI

Dear OpenCL users, I recently ported a kernel from CUDA to OpenCL.
The kernel processes a 2D image (~512²) and, for each pixel, fetches ~8000 coordinates from global memory.
Then, for each pixel, it samples the 2D image ~8000 times using those coordinates.

The profiler says the bottleneck is memory fetches, not ALU ops.
On an NVIDIA GTX 570, the kernel performs identically in CUDA and OpenCL.
On a Radeon HD 7850 (which I would expect to perform close to the GTX 570), the code is 5 times slower.

I changed my code to use local (shared) memory and reduce the number of global memory fetches.
Now the profiler says the bottleneck is ALU ops.
But the 7850 is still 2.5x slower than the GTX 570.
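For readers following along, the local-memory version presumably looks something like the sketch below. This is a guess at the structure, not the poster's actual code: the names (`sample_pixels`, `coords`, `taps`), the tile size, and the per-tap operation (a simple sum) are all invented for illustration.

```c
// Sketch: stage chunks of the shared coordinate table in __local memory
// so each work-group reads each chunk from global memory only once.
__kernel void sample_pixels(__read_only image2d_t img,
                            __global const int2 *coords,  // ~8000 entries
                            __global int *out,
                            const int taps)               // ~8000
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
    __local int2 tile[256];                       // staged chunk of the table
    int2 base = (int2)(get_global_id(0), get_global_id(1));
    int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
    int lsz = get_local_size(0) * get_local_size(1);
    int sum = 0;

    for (int start = 0; start < taps; start += 256) {
        // Cooperative load: each work-item copies a few table entries.
        for (int i = lid; i < 256 && start + i < taps; i += lsz)
            tile[i] = coords[start + i];
        barrier(CLK_LOCAL_MEM_FENCE);

        int n = min(256, taps - start);
        for (int i = 0; i < n; ++i)
            sum += read_imagei(img, smp, base + tile[i]).x;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    out[base.y * get_image_width(img) + base.x] = sum;
}
```

If the real kernel has this shape, the remaining ~2×10⁹ image reads all go through the texture cache, which would explain why the profiler now reports an ALU bottleneck instead.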

Any tips regarding:
- why ATI is slower for this kind of kernel
- how to optimize this kernel for ATI (my coordinate array is constant across all kernel launches)
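On the constant-array point: since the coordinate table never changes between launches, declaring it `__constant` is worth trying so it can be served from the dedicated constant cache. A hypothetical signature (names invented); note that 8000 `int2` entries is 64,000 bytes, which just fits under the 64 KB minimum that OpenCL guarantees for `CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE`:

```c
// Hypothetical signature: the coordinate table is marked __constant so
// repeated reads can be served from the constant cache rather than DRAM.
// 8000 int2 entries = 64,000 bytes, just under the 64 KB guaranteed minimum.
__kernel void sample_pixels(__read_only image2d_t img,
                            __constant int2 *coords,
                            __global int *out,
                            const int taps);
```

The constant cache is particularly good when all work-items in a wavefront read the same `coords[i]` in lockstep (a broadcast), which is exactly the access pattern described here.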

PS: the 2D image is in fact a 32-bit greyscale picture.
I'm currently using a CL_R / CL_SIGNED_INT32 image format.
Could this explain the poor performance of my read_imagei() calls?

PPS: I changed this to CL_ARGB and updated the kernel to handle 4 consecutive pixels per work-item. Same performance.

Re: Optimisation tips for fetch intensive kernel on ATI

I had one bit of code - a face detector - which took a big performance hit on AMD hardware; going from a GTX 480 to an HD 6950, the slowdown was of a similar order to yours.

And I think with some work I ended up about 2x faster than that - but still 3x slower than the 480.

I can't remember the details fully, but I think it was a fight between memory bandwidth and empty slots in the VLIW instructions (and too many non-unrollable loops/branches). IIRC I couldn't do more work per work-item because the memory bandwidth was already saturated, but doing just one lot of work left a lot of empty ALU slots.

I'll be getting a 7970 soon, so I'll see how that compares. I expect it should 'solve' the problem, but it might not. This was only hobby code - fortunately none of my work stuff suffered, and some of it ran faster.

Apart from local memory, trying to use the constant cache is about the only other big thing to try. If it's ALU-bound, remove loop unrolling or work with narrower data sizes (although you probably already are).

PS: actually, I did have some monstrous performance (and correctness) problems with quite a lot of code, but those all came from using #pragma unroll, which triggered some bugs in the AMD compiler at the time (which may be fixed by now?): removing them ALL was the easiest solution.

Re: Optimisation tips for fetch intensive kernel on ATI

Many things can cause such a performance fallback.

AMD (or ATI) cards suffer when switching between FETCH and ALU clauses at the binary level. Try to group your FETCH, ALU, and WRITE operations together, respectively. This lets the compiler reach a higher VLIW pack ratio. This efficiency can be inspected with the offline AMD APP Kernel Analyzer, as well as the runtime AMD APP Profiler, which can shed a lot more light on why your kernel performs poorly.
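As an illustration of grouping fetches (a made-up inner-loop fragment, not the poster's code): pull a small batch of texels into registers first, then do all the arithmetic, so the compiler can emit one long FETCH clause followed by one ALU clause instead of alternating per tap.

```c
// Illustrative fragment: batch the fetches, then batch the ALU work.
int4 t0 = read_imagei(img, smp, base + coords[i + 0]);
int4 t1 = read_imagei(img, smp, base + coords[i + 1]);
int4 t2 = read_imagei(img, smp, base + coords[i + 2]);
int4 t3 = read_imagei(img, smp, base + coords[i + 3]);
sum += t0.x + t1.x + t2.x + t3.x;   // ALU work grouped after the fetches
```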

Cache hit ratio could also be examined as a source of performance degradation. Reorganizing your work-items' READ/WRITE operations might help there.

If you have to switch between clauses too often and memory operations cannot be hidden by multiple wavefronts in a work-group, you will get a high ALU Stall ratio.

Using only a single channel is not the best-case scenario, since the texture fetch units are tuned for 128-bit loads, though there are some optimizations for scalar 32-bit fetches starting with the HD 6xxx series.
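This is essentially what the original poster's CL_ARGB experiment was aiming at: pack four 32-bit greyscale samples into one texel so that each texture request moves a full 128 bits. Roughly (illustrative fragment only):

```c
// With four greyscale samples packed per texel, one texture request
// returns 128 bits instead of 32.
int4 four = read_imagei(img, smp, packed_coord);  // one 128-bit fetch
sum += four.x + four.y + four.z + four.w;         // four samples consumed
```

Whether this helps in practice depends on the access pattern: if the ~8000 coordinates are scattered, the four packed samples per fetch must actually be the four the kernel needs, otherwise the extra bits are wasted bandwidth.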

Just a few ideas. Download AMD APP Profiler and profile your program to find the real bottleneck. Let us know if any of these helped.

Re: Optimisation tips for fetch intensive kernel on ATI

If you are running on an AMD GPU you surely have the AMD OpenCL runtime installed, so that will do for the profiler. The tool installs a tab inside Visual Studio, somewhere next to your Solution Explorer. You will have to launch your application there and run it until a few of your kernels have been launched.

Try dumping the kernel binary (or just compile your kernel with the AMD APP Kernel Analyzer) and see how long the FETCH and ALU clauses are. I fear it will simply alternate between the two, which is quite painful. Using somewhat more registers could help, but indeed, the reduction cannot be made many times faster.

I do not know how often you have to do this, or what percentage of the run time this summation takes, but do take a look at the ISA binary.