>> Tuesday, September 27, 2011

I've been experimenting with OpenCL work-group sizes and I've learned something I should have known before: for a given kernel launch, every work-group is the same size, because OpenCL 1.x requires the global size to be evenly divisible by the work-group size. To test this, I called clEnqueueNDRangeKernel to execute a kernel with 991 work-items and set the local_work_size argument to NULL so that OpenCL would choose the work-group size itself. It created 991 work-groups with 1 work-item each. I've tried this with other prime numbers and it does the same thing.

I'm surprised. How can 991 separate work-groups execute faster than, for example, four groups of 200 work-items each and one group of 191 work-items? As another example, when I specify 651 work-items, OpenCL creates three groups of 217 items each. But when I specify 653 work-items, OpenCL creates 653 groups containing one item each.
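One way to make sense of these numbers is to model the runtime as picking the largest work-group size that divides the global size evenly without exceeding the device maximum. This is only a guess that happens to match the observations above, not documented behavior, and the function name guess_local_size is made up for the sketch. The limit of 256 used in the comments is the maximum on my GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the runtime's choice (an assumption, not the
   documented algorithm): the largest size that divides global_size
   evenly and does not exceed max_group_size. */
size_t guess_local_size(size_t global_size, size_t max_group_size)
{
    assert(global_size > 0 && max_group_size > 0);

    size_t limit = global_size < max_group_size ? global_size : max_group_size;

    /* Walk downward so the first divisor found is the largest one. */
    for (size_t candidate = limit; candidate > 1; --candidate) {
        if (global_size % candidate == 0)
            return candidate;
    }

    /* No divisor above 1 fits, e.g. a prime larger than the limit. */
    return 1;
}
```

With a limit of 256, this model gives 217 for 651 work-items (651 = 3 x 217) and 1 for the primes 653 and 991, which is what I saw.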

Now that I'm working with OpenCL-OpenGL interoperability, dealing with arbitrary data sizes is a frequent concern. My work-groups can't communicate with one another, so my host application receives one result from each group. But the host can't allocate memory for the result array if it doesn't know in advance how many work-groups will execute the kernel.

The only solution I can see is to pad the number of work-items until the total is a multiple of the maximum work-group size (or a similar number obtained by querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE with clGetKernelWorkGroupInfo). This means that, if the data size requires 991 work-items, I should pad the value to 1024, the next multiple of 256, which is the maximum number of items per group on my GPU.
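The padding arithmetic is simple enough to sketch on the host side. The helper names pad_to_multiple and group_count are hypothetical, and the sketch assumes a one-dimensional range where the host passes the same group size explicitly when it enqueues the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Round item_count up to the next multiple of group_size. */
size_t pad_to_multiple(size_t item_count, size_t group_size)
{
    assert(group_size > 0);
    return ((item_count + group_size - 1) / group_size) * group_size;
}

/* Number of work-groups that a padded launch will create. */
size_t group_count(size_t item_count, size_t group_size)
{
    return pad_to_multiple(item_count, group_size) / group_size;
}
```

For 991 items and a group size of 256, pad_to_multiple gives 1024 and group_count gives 4, so the host knows before the launch that the result array needs four entries.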

But now there's another problem: the padded work-items execute the kernel just like the regular ones, which means they may distort the result, for example by reading past the end of the data. To get around this, I'm using the following code in my kernel:

In this code, max_items is the unpadded size of the data. That is, if there are 991 data points to process, max_items is set to 991. This is not elegant code, but I can't see an alternative. I'll keep thinking.
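A guard of this kind can be sketched as follows. This is an illustration under assumptions, not the exact kernel from my project: the kernel name scale_items and the doubling operation are placeholders, and everything inside the #ifndef block (the empty qualifier macros, current_gid, the get_global_id stub, and emulate_launch) is a hypothetical host-side stand-in so the sketch can be compiled and tried as plain C. An OpenCL C compiler defines __OPENCL_VERSION__ and would see only the kernel function.

```c
#ifndef __OPENCL_VERSION__
/* Host-side stand-ins (assumptions for testing only). */
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
static size_t current_gid;   /* ID of the work-item being emulated */
static size_t get_global_id(unsigned int dim) { (void)dim; return current_gid; }
#endif

/* Placeholder kernel: only work-items below max_items touch the data. */
__kernel void scale_items(__global const float *in,
                          __global float *out,
                          unsigned int max_items)
{
    size_t gid = get_global_id(0);

    /* Padded work-items (gid >= max_items) skip the body entirely. */
    if (gid < max_items) {
        out[gid] = 2.0f * in[gid];
    }
}

#ifndef __OPENCL_VERSION__
/* Emulates a 1-D launch by running the kernel once per work-item ID. */
void emulate_launch(const float *in, float *out,
                    size_t padded_items, unsigned int max_items)
{
    assert(max_items <= padded_items);
    for (current_gid = 0; current_gid < padded_items; ++current_gid)
        scale_items(in, out, max_items);
}
#endif
```

I wrapped the work in the if block rather than returning early because, if the kernel later calls barrier(), every work-item in the group still has to reach it, including the padded ones.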