The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


I noticed a few strange things and I hope somebody can help explain them.
The Size parameter was set to Exp2(1…

1.) Local size was fixed at 128 because performance was best with this setting.
2.) I started with the Global_Size parameter equal to Size and reduced it in steps of 2 while measuring the speed. The speed remained constant until global_size was 32x smaller than the vector Size, specifically 8192. Decreasing it further dramatically increased the processing time. Why?
3.) Related to question #2: why does a stride step of 8192 inside a loop executing 32 times work so fast compared to a stride step of 1? I simply couldn't measure any benefit, but rather a penalty, when the stride step was 1 in the following kernel:

The execution time in this case was proportional to the value of the Len parameter on the GPU.

On a CPU, one would expect exactly the opposite. Is no vectorization optimization possible for loops inside kernels?
4.) The first kernel ran 20% slower than the version without the for-loop, but it allows the local_work_size to be set arbitrarily. Are NVIDIA OpenCL drivers still sensitive to local_work_size and require it to be set by the user? I tried a local_work_size of 1 on AMD and the speed degraded sharply, but leaving it automatic works fine (fastest). Intel also recommends leaving local_work_size automatic with their drivers.

Re: Local work size!

Size is equal to the length of the vectors or arrays. BlockLen describes how the long vector is broken down into short vectors (into many pieces, each of BlockLen size). The idea with BlockLen is (or was) that a thread-internal for-loop would have lower overhead than a thread-based for-loop (one thread per item).

>Why does i increase in steps of get_global_size

It is a way to achieve two things:
a.) User-controlled distance between the items being read, regardless of the amount of data that needs to be processed (Size).
b.) It lets the user set local_work_size and global_work_size independently of Size.

The two kernels compare two approaches:
1.) Strided address reads within each thread (step larger than 1).
2.) Consecutive address reads within each thread (step of 1).

I was expecting kernel2 to work fast on the GPU as well. Instead, kernel2 is fastest on the CPU and kernel1 is fastest on the GPU, even though the for-loop in each kernel does the same number of iterations, the only difference being the stride. (The differences are in the range of 10x.)

Note that you can't test this with kernel #1 alone, because if the stride were set to 1 (global_work_size = 1), the local_work_size would also be 1 and OpenCL can launch only one thread per compute unit (instead of 32 or 64 or 12…

Re: Local work size!

Sorry, I'm still confused about what the code is doing. Can you post all the kernels and the values of the kernel arguments you used? Is this some sort of micro-benchmark to see how strides affect performance?

> I started with Global_Size parameter equal to Size and reduced in steps by 2 measuring the speed. The speed remained constant down to global_size being 32x less than vector Size. Specifically 8192. Further decreasing dramatically increased the time of processing. Why?

Surely, if you reduce the global size that much, the amount of work that can potentially be executed in parallel is also reduced accordingly? That's what the global size represents, which is why I find it so hard to understand why you are intentionally reducing it.

Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

Result: The speed is highest for BlockLen = 1, and kernel1 is always faster than kernel2.
The only real difference between the two approaches is the stride between accessed array elements: the second example has a stride of 1 and the first has a stride of at least 8192. The tests were run on an AMD Juniper.

Re: Local work size!

> The only real difference between the two approaches: the stride between accessed array elements.

I must be misunderstanding something. The global size changes dramatically between the two examples. If BlockLen = 1, then get_global_size(0) is equal to Size, the length of the vector. That naturally has a performance benefit.

As for kernel #1 always being faster than kernel #2, notice that kernel #1 has only a loop, while kernel #2 has additional code to handle edge cases. Given that the kernel is extremely simple, with only three memory operations and one ALU operation, the code outside the loop may be more significant than you might think at first. There could also be something else going on that I haven't thought of.

How does the performance of these kernels compare with a naïve version like this?