The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


Executing a large kernel

Hi everyone.
I have been writing a raytracing kernel in OpenCL. The kernel is big because it includes many procedures, and it takes an argument that specifies the number of iterations to run inside the kernel.

When I set the argument to 1, the program works fine. But when I set it to a larger value, for example 16, the program never returns from queue.finish().
Things get complicated because sometimes the program works fine even when the value is large.

There are various ways of extending or removing the OS timeout / driver reset, but they are workarounds for the larger issue. Ideally, kernels should not run longer than a few dozen milliseconds if you want your system to remain interactive (windows, buttons, etc.). In the special case of a pure compute machine, perhaps hundreds of milliseconds. But if you're approaching seconds, I'd recommend finding a way to divide your work into smaller units so your kernels don't kill the system; the GPU is also needed for the UI. Alternatively, add a second GPU and use it only for compute (no monitor attached). NVIDIA Tesla cards have a mode for this, on Windows at least; they can compute for hours if need be.
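To make the advice above concrete, here is a minimal host-side sketch of splitting one long-running launch into many short ones, so each launch stays well under the watchdog limit. `launch_chunk`, `render_in_chunks`, and the chunking scheme are hypothetical illustrations, not part of the original poster's code; a real host would enqueue a kernel and call `queue.finish()` where the comments indicate.

```cpp
#include <cassert>

// Hypothetical stand-in for enqueueing one short kernel launch that
// processes samples [first, first + count) for every pixel.
static int g_samples_done = 0;
void launch_chunk(int first, int count) {
    g_samples_done += count;  // a real host would enqueue the kernel here
}

// Split `spp` samples into launches of at most `chunk` samples each,
// so no single launch runs long enough to trip the OS watchdog.
void render_in_chunks(int spp, int chunk) {
    for (int first = 0; first < spp; first += chunk) {
        int count = (spp - first < chunk) ? (spp - first) : chunk;
        launch_chunk(first, count);
        // a real host would call queue.finish() (or at least flush)
        // here so the driver can schedule UI work between chunks
    }
}
```

The key point is that the total work is unchanged; it is only the per-launch duration that shrinks.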

Thank you for your reply.
I had a misconception: the problem is not the execution time of each work-item, it is the total time of all of them combined.
So the problem can be solved by reducing the amount of work done per enqueueNDRangeKernel() call, right?

I tried dividing the image plane into a few tiles, and now the program works fine even when I set the argument to a large value.
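For reference, a minimal sketch of how such image-plane tiling might be computed on the host. The `Tile` struct and `make_tiles` helper are assumptions for illustration; each tile's origin and size could feed the global work offset and global work size of a per-tile enqueueNDRangeKernel() call.

```cpp
#include <cassert>
#include <vector>

struct Tile { int x, y, w, h; };  // origin and size in pixels

// Cut a width x height image into tiles of at most tw x th pixels;
// edge tiles are clipped when the image size is not an exact multiple.
std::vector<Tile> make_tiles(int width, int height, int tw, int th) {
    std::vector<Tile> tiles;
    for (int y = 0; y < height; y += th)
        for (int x = 0; x < width; x += tw)
            tiles.push_back({x, y,
                             (width  - x < tw) ? width  - x : tw,
                             (height - y < th) ? height - y : th});
    return tiles;
}
```

Enqueuing one kernel per tile gives the driver natural break points, which is exactly what keeps the watchdog happy.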

So here is something to think about. You can slice up your data and execution many ways, but ultimately the total amount of computation remains the same; what changes is the memory access pattern and the load-balancing potential. Let me say that a different way: the total number of instructions you have to execute for your algorithm is pretty much the same no matter how you structure your code. Some code structures, however, are going to be slower for memory accesses and some are going to be faster. It is also the case that a few big chunks of work and many small chunks of work are going to take roughly the same time to complete.

Now that I have probably confused you, here is something to consider. Let your uint spp parameter dictate the number of "jobs" to compute, rather than using it internally in a for-loop within the work-item. One option is a 2D execution of num_items x spp; another is one big pool of num_items x spp calculations. Either way, you need to be careful about memory accesses to maintain good performance.
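One way to read the "big pool" suggestion is to give each (pixel, sample) pair its own global id. The decoding below is a hypothetical sketch (the names `Job` and `decode_job` are not from the thread); the same arithmetic would appear inside the kernel using get_global_id(0).

```cpp
#include <cassert>

// One global id per (pixel, sample) pair instead of an spp loop
// inside each work-item. With pixel index varying fastest, adjacent
// ids touch consecutive pixels, which keeps buffer accesses coalesced.
struct Job { int pixel; int sample; };

Job decode_job(int gid, int num_items) {
    return { gid % num_items,    // pixel index varies fastest
             gid / num_items };  // sample index
}
```

The 2D-NDRange variant expresses the same mapping directly: get_global_id(0) is the pixel and get_global_id(1) is the sample, with no decoding arithmetic needed.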

You might also want to structure your code in such a way that you can reduce the amount of work you give to a single kernel. This can help eliminate issues with kernel execution limits, and probably won't hurt your performance too much.