Second OpenCL Test: PostFX

What is new in this demo is the choice between local and global memory. GPU Caps has a command-line option (/cl_demo_use_local_mem) that enables the use of local memory. It is disabled by default because using local memory crashes on Radeon cards.

Local memory is a very fast on-chip memory (also called scratchpad memory) and is an order of magnitude faster than global memory (which resides in graphics memory, outside the GPU chip). NVIDIA’s original implementation uses only local memory with an explicit work-group size. I added another code path that can disable the use of local memory and enable or disable the explicit work-group size.
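To make the local versus global distinction concrete, here is a rough CPU-side emulation in plain C. It is not the kernel from the demo or from NVIDIA’s SDK: the 1D box blur, the names blur_global and blur_tiled, and the GROUP_SIZE and RADIUS values are all made up for illustration. The first function reads every filter tap straight from the big array, as a global-memory kernel would. The second copies one tile plus its halo into a small scratch buffer first, which is the role local memory plays on the GPU.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 8   /* hypothetical work-group size */
#define RADIUS     1   /* hypothetical filter radius */

/* Clamp an index into [0, n-1] so border pixels are repeated. */
static size_t clamp_index(long i, size_t n)
{
    if (i < 0) return 0;
    if ((size_t)i >= n) return n - 1;
    return (size_t)i;
}

/* "Global memory" path: every tap is read directly from src. */
void blur_global(const float *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (long k = -RADIUS; k <= RADIUS; ++k)
            sum += src[clamp_index((long)i + k, n)];
        dst[i] = sum / (2 * RADIUS + 1);
    }
}

/* "Local memory" path: each group stages its pixels plus a halo into a
 * small scratch buffer (standing in for __local memory), then filters
 * from that buffer only. */
void blur_tiled(const float *src, float *dst, size_t n)
{
    float tile[GROUP_SIZE + 2 * RADIUS];

    assert(src != NULL && dst != NULL);
    for (size_t base = 0; base < n; base += GROUP_SIZE) {
        /* Stage: one read from src per tile element. */
        for (size_t t = 0; t < GROUP_SIZE + 2 * RADIUS; ++t)
            tile[t] = src[clamp_index((long)base + (long)t - RADIUS, n)];

        /* In an OpenCL kernel, a barrier(CLK_LOCAL_MEM_FENCE) would
         * separate the staging step from the filtering step. */
        for (size_t t = 0; t < GROUP_SIZE && base + t < n; ++t) {
            float sum = 0.0f;
            for (size_t k = 0; k <= 2 * RADIUS; ++k)
                sum += tile[t + k];
            dst[base + t] = sum / (2 * RADIUS + 1);
        }
    }
}
```

Both functions produce the same output. The difference is only in how often the large array is touched: the tiled version reads each source pixel about once per group, while the direct version reads it once per tap. The larger the filter radius, the more the staging step pays off, at least on hardware where the scratchpad is much faster than graphics memory.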

As with the surface deformer demo, NVIDIA’s first OpenCL drivers showed a huge gain in performance when an explicit work-group size was used. This is no longer true.
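One practical detail of the explicit work-group size path: in OpenCL 1.x, when a non-NULL local_work_size is passed to clEnqueueNDRangeKernel, each global size must be a multiple of the corresponding local size (passing NULL lets the driver pick a size itself). A small helper like the one below is the usual way to pad the global size. The name round_up_global is hypothetical and is not taken from the demo.

```c
#include <assert.h>
#include <stddef.h>

/* Round global_size up to the next multiple of local_size.
 * Hypothetical helper, not part of the OpenCL API. */
size_t round_up_global(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}
```

For example, with an assumed 16x16 work-group and a 1920x1080 image, the width stays at 1920 while the height is padded to 1088. The kernel then has to skip the extra work-items that fall outside the image.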

These graphs show an interesting fact: local memory has a huge impact on the GeForce GTS 250, only a small one on the GTX 280, and no visible effect on Radeon cards.

This post-processing kernel is quite demanding, and we clearly see the difference between an HD 5870 and an HD 5770. We also see that the GTX 280 dominates the test. The kernel comes from NVIDIA’s OpenCL SDK, and I imagine that it has been optimized for NVIDIA hardware.

WARNING for Radeon users under Windows XP and Windows 7: the use of local memory crashes the demo, and VPU Recover resets the graphics driver. Vista users are not affected.