The Khronos Group is a non-profit industry consortium that develops, publishes, and promotes open, royalty-free standards for media authoring and acceleration on desktop and handheld devices, combined with conformance programs for platform and device interoperability.


Re: how fast is it as compared to OpenMP?

Unfortunately it's impossible to answer that question. Here are a few parameters to consider:

- If your algorithm maps well to a GPU, then you can see 10x-100x speedup
- If your algorithm maps well to the vector units on a CPU and you're not already using them, then you can see a 2-4x speedup
- If the OpenCL implementation has performance bugs, then you can see a slowdown

So without knowing more about your algorithm there's no way to answer that question. The first things I would ask are:
- How data-parallel is your algorithm? (the more the better)
- How much inter-thread synchronization do you need? (the less the better)
- What is your computation-to-communication ratio? (the higher the better)
- Do you need double precision? (only on a few GPUs, and slower currently)
- How much data do you need? (GPUs generally have <2GB of storage)
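To make the computation-to-communication ratio concrete, here is a back-of-the-envelope sketch for a 2-D convolution. The function name and all sizes are my own illustration, not from this thread:

```python
# Rough arithmetic-intensity estimate (FLOPs per byte moved) for a
# dense 2-D convolution with a k x k kernel. Sizes are hypothetical.

def conv2d_intensity(width, height, k, bytes_per_elem=4):
    """FLOPs per byte transferred for one full convolution pass."""
    flops = 2 * width * height * k * k  # one multiply + one add per tap
    # Data moved: read the input image, write the output, read the kernel.
    bytes_moved = bytes_per_elem * (2 * width * height + k * k)
    return flops / bytes_moved

# A 5x5 kernel on a 1024x1024 image gives roughly 6 FLOPs per byte;
# a larger kernel raises the ratio, which favors an accelerator.
print(conv2d_intensity(1024, 1024, 5))
```

The higher this number, the more work the device does per byte shipped over the bus, and the easier it is to hide the host-to-device transfer cost.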

Re: how fast is it as compared to OpenMP?

On a purely data-parallel operation (such as convolution) there is no reason OpenCL on a CPU should be any slower than OpenMP on the same CPU. They should both be able to split the work into large chunks and therefore have negligible overhead. If you are seeing OpenCL running significantly slower than OpenMP on such code it is likely to be due to performance issues with the OpenCL implementation.
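The "large chunks" idea can be sketched with Python's standard library standing in for OpenMP's static scheduling; function names and sizes are mine, and (because of CPython's GIL) this illustrates the chunking structure rather than an actual speedup:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def conv1d_chunk(signal, kernel, start, stop):
    """Valid-mode 1-D convolution over output indices [start, stop)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(start, stop)]

def conv1d_parallel(signal, kernel, workers=None):
    workers = workers or os.cpu_count() or 1
    n_out = len(signal) - len(kernel) + 1
    # One large contiguous chunk per worker, so the per-task scheduling
    # overhead is amortized over many output elements -- the same idea
    # as an OpenMP static schedule or a coarse-grained OpenCL launch.
    bounds = [(w * n_out // workers, (w + 1) * n_out // workers)
              for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: conv1d_chunk(signal, kernel, *b), bounds)
    return [x for part in parts for x in part]
```

With a handful of big chunks, each worker runs a long, tight loop; the fixed cost of dispatching a task is negligible, which is why neither OpenMP nor a good CPU OpenCL implementation should lose on this pattern.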

One thing to consider is that the local work-group size is an artificial construct on today's CPUs. By that I mean CPU cores only run one work-item (i.e., one thread) at a time*, so there will be overhead from having multiple work-items in a work-group, as that has to be handled either through multiple threads (inefficient) or compiler tricks that approximate threading (complicated). GPUs physically execute multiple threads of a work-group concurrently, so this is a natural concept for them. It might be worth investigating a local size of 1 and a global size of 2-4 * the number of cores to see if you get better performance. That configuration should give you the best performance on current CPUs.

*SMT does run multiple threads on a core at once, but the OS sees them as separate threads which implies much more costly synchronization than the work-items on a GPU.
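The launch configuration suggested above can be sketched in plain Python standing in for the OpenCL host-side size calculation; the function name and the default `oversubscription=4` (taken from the 2-4x range) are my own choices:

```python
import os

def cpu_launch_config(total_items, oversubscription=4):
    """Launch sizes for a CPU device following the advice above:
    local size 1 (no intra-group synchronization to emulate) and a
    global size of only a few work-items per core, with each
    work-item looping over its own contiguous slice of the data."""
    cores = os.cpu_count() or 1
    global_size = min(total_items, oversubscription * cores)
    local_size = 1  # one work-item per work-group
    items_per_work_item = -(-total_items // global_size)  # ceiling division
    return global_size, local_size, items_per_work_item
```

For example, on an 8-core machine with a million elements this yields a global size of 32, a local size of 1, and roughly 31,250 elements per work-item, so each OS thread runs one long loop instead of being dispatched per element.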

Re: how fast is it as compared to OpenMP?

Ultimately, performance depends on the particular OpenCL or OpenMP implementation.

On the AMD implementation for CPUs:
- the work-item to OS-thread mapping is many-to-one.
- use a large work-group size (definitely greater than 1) to get good performance.
- use vectorized operations to get better performance than OpenMP.