The Khronos Group - a non-profit industry consortium to develop, publish and promote open standard, royalty-free media authoring and acceleration standards for desktop and handheld devices, combined with conformance qualification programs for platform and device interoperability.


GL-CL interoperability performances

Hi.
I'm developing a sample app that has to do the following:

1) render a single frame to an offscreen framebuffer
2) analyze the generated image pixel by pixel (in several different ways).

I developed a standard OpenGL app, which draws offscreen, calls glReadPixels to retrieve the rendering result and then does its processing. This one takes about 0.15 seconds to perform 100 runs on a small rendering (300x300).

Then I developed an OpenCL app that:
1) prepares an OpenGL context
2) prepares an OpenCL buffer from the framebuffer
3) processes the rendered image on the GPU side (there is no explicit data copy between RAM and VRAM)
4) retrieves the result of the evaluation from GPU memory (this is just one float number)

Re: GL-CL interoperability performances

It looks like your global work size is 1, which means you're only using 1/16th of 1 of the streaming processors on the GPU. (Which is far slower than 1 core on a CPU.)

You should set your global work size to 3000, 3000 and remove the for-loops in your kernel. (BTW -- do you really mean 3k by 3k or do you want 300 by 300? The image is only 300x300.) At the end you then use an atomic add to increment the number matching if there is a match. Actually, since the atomic add is going to be very slow, you might want to set your global size to be much smaller (say 100, 100 for a 300x300 image) and then count up to 9 matches per work-item. That would reduce the number of atomic adds you'd have to do.
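The tiling idea above (a 100x100 global range over a 300x300 image, with each work-item counting up to 9 matches) can be sketched on the CPU as follows. This is only an emulation under assumptions: the original kernel is not shown in the thread, so "match" is taken to mean "pixel equals a target value", and the names `count_tile` and `count_matches` are hypothetical. In a real OpenCL kernel, `gx` and `gy` would come from `get_global_id(0)` and `get_global_id(1)`, and the `total +=` line would be the single atomic add per work-item.

```c
#include <stddef.h>

/* Image and tile sizes taken from the numbers in the post (assumption:
 * one byte per pixel, which the thread does not specify). */
enum { IMG_W = 300, IMG_H = 300, TILE = 3 };

/* Body of one "work-item": count pixels equal to `target` in its 3x3 tile.
 * gx, gy stand in for get_global_id(0) and get_global_id(1). */
static int count_tile(const unsigned char *img, int gx, int gy,
                      unsigned char target)
{
    int matches = 0;
    for (int dy = 0; dy < TILE; ++dy) {
        for (int dx = 0; dx < TILE; ++dx) {
            int x = gx * TILE + dx;
            int y = gy * TILE + dy;
            if (img[(size_t)y * IMG_W + x] == target)
                ++matches;
        }
    }
    return matches;
}

/* CPU emulation of a (IMG_W/TILE) x (IMG_H/TILE) global range.
 * `total +=` stands in for one atomic add per work-item. */
static int count_matches(const unsigned char *img, unsigned char target)
{
    int total = 0;
    for (int gy = 0; gy < IMG_H / TILE; ++gy)
        for (int gx = 0; gx < IMG_W / TILE; ++gx)
            total += count_tile(img, gx, gy, target);
    return total;
}
```

With this layout the number of atomic adds drops from 90,000 (one per pixel) to 10,000 (one per work-item).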

Make sure you verify that atomics are supported on the device you're using, though!
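One way to do that check on an OpenCL 1.0 device, where 32-bit global atomics are the optional extension `cl_khr_global_int32_base_atomics` (they became core in OpenCL 1.1), is to look for that name in the device's extension string. The sketch below only covers the string search. The helper name `has_extension` is hypothetical, and the extension list is assumed to come from `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`.

```c
#include <string.h>

/* Returns 1 if `name` appears as a whole space-separated token in
 * `ext_list`, 0 otherwise. A plain strstr() is not enough, because one
 * extension name can be a prefix of another. */
static int has_extension(const char *ext_list, const char *name)
{
    size_t len = strlen(name);
    const char *p = ext_list;

    if (len == 0)
        return 0;

    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == ext_list) || (p[-1] == ' ');
        int ends_token = (p[len] == '\0') || (p[len] == ' ');
        if (starts_token && ends_token)
            return 1;
        p += len; /* keep searching after this partial match */
    }
    return 0;
}
```

If the check fails, the fallback is to avoid atomics entirely, for example by writing one partial result per work-item and summing on the host.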

Re: GL-CL interoperability performances

Hi,
I tried several work sizes as you suggested.
I got the best performance using a small computation area per work-item, such as 100 pixels per work-item for a 300x300 image.
Now I have execution times of about 0.3 seconds in OpenCL vs 0.1 seconds for the CPU algorithm, which uses getPixels to retrieve the whole framebuffer from VRAM.
Do you think this is right or not?

As you can see, each work-item saves to a different location in the results array, so there is no atomic code in the kernels.
The final sum is made on the CPU side, after reading back the data.
Note that this is not actually a bottleneck (even with this part of the code commented out, performance is mostly the same).
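The scheme described here (one result slot per work-item, final sum on the CPU) can be sketched as a CPU emulation. It assumes the numbers mentioned in the thread, 900 work-items of 100 pixels each for a 300x300 image, and it assumes that each work-item scans a contiguous run of one-byte pixels against a target value, since the actual kernel is not shown. The names `work_item`, `run_all` and `host_sum` are hypothetical.

```c
#include <stddef.h>

enum { N_PIXELS = 300 * 300, N_ITEMS = 900, PER_ITEM = N_PIXELS / N_ITEMS };

/* One "work-item": scan its own run of PER_ITEM pixels and store the
 * count in its private slot. gid stands in for get_global_id(0). */
static void work_item(const unsigned char *img, unsigned char target,
                      int *results, int gid)
{
    size_t base = (size_t)gid * PER_ITEM;
    int matches = 0;
    for (int i = 0; i < PER_ITEM; ++i)
        if (img[base + i] == target)
            ++matches;
    results[gid] = matches; /* no other work-item writes this slot */
}

/* Emulates launching N_ITEMS work-items. */
static void run_all(const unsigned char *img, unsigned char target,
                    int *results)
{
    for (int gid = 0; gid < N_ITEMS; ++gid)
        work_item(img, target, results, gid);
}

/* Host-side reduction after the results array has been read back. */
static int host_sum(const int *results, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += results[i];
    return total;
}
```

Only 900 integers are read back and summed, which is consistent with the observation that the host-side sum is not the bottleneck.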

Re: GL-CL interoperability performances

Originally Posted by whites11

For now I have set the number of global work-items to 900 and the local size to 1.
What should I set?

A local work size of 1 seems far too small to me. On NVIDIA architectures, it should be a multiple of 32 (the warp size) for maximum efficiency. When no collaboration is required between the work-items of a work-group (i.e. no communication through shared memory), the local work size should be tuned using the CUDA occupancy calculator (an Excel spreadsheet available as part of the SDK). What you want to do is maximize the multiprocessor occupancy factor.
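One practical consequence: in OpenCL 1.x the global work size passed to `clEnqueueNDRangeKernel` must be evenly divisible by the local work size, and 900 is not a multiple of 32. A common workaround is to round the global size up and have the kernel return early for the padding work-items. A minimal sketch of the rounding, with the hypothetical helper name `round_up`:

```c
#include <stddef.h>

/* Smallest multiple of `local` that is >= `global`.
 * Assumes local > 0. */
static size_t round_up(size_t global, size_t local)
{
    return ((global + local - 1) / local) * local;
}
```

For 900 work-items and a local size of 32 this gives 928, so the kernel would need a guard along the lines of `if (get_global_id(0) >= 900) return;` to skip the 28 extra work-items.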