
Re: why CPU execution time is less than GPU time?

The problem you are running into is that GPUs are designed to handle huge amounts of work, not small amounts. To give some more detail, there is a certain amount of overhead present in copying data from host memory to device memory, then there is a certain amount of overhead in launching a kernel, and finally some overhead in copying the results back. The kernel launch overheads are fairly constant, while the transfer overheads depend on the size of the data plus a constant overhead from the driver.

If you send a lot of work to the GPU then these overheads account for a proportionally smaller part of the processing time.
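To make that amortisation concrete, here is a toy cost model of a single kernel invocation. Every constant in it (launch cost, driver cost, bandwidth, throughput) is an illustrative assumption, not a measurement from any particular GPU:

```python
# Toy model: total GPU time = fixed overheads + size-dependent transfers + compute.
# All constants are made-up illustrative values, not measurements.

LAUNCH_OVERHEAD_S = 10e-6      # fixed cost to launch a kernel (assumed)
DRIVER_OVERHEAD_S = 20e-6      # fixed driver cost per host<->device transfer (assumed)
PCIE_BYTES_PER_S = 5e9         # sustained PCIe bandwidth (assumed)
GPU_ELEMS_PER_S = 100e9        # elements the GPU can process per second (assumed)

def gpu_time(n_elems, bytes_per_elem=4):
    # One transfer in, one transfer out, each paying a fixed driver cost
    # plus a size-dependent cost over the PCIe bus.
    transfer = 2 * (DRIVER_OVERHEAD_S + n_elems * bytes_per_elem / PCIE_BYTES_PER_S)
    compute = n_elems / GPU_ELEMS_PER_S
    return LAUNCH_OVERHEAD_S + transfer + compute

def overhead_fraction(n_elems):
    # Share of the total time spent on the fixed (size-independent) overheads.
    fixed = LAUNCH_OVERHEAD_S + 2 * DRIVER_OVERHEAD_S
    return fixed / gpu_time(n_elems)

print(overhead_fraction(1_000))        # tiny problem: almost all overhead
print(overhead_fraction(100_000_000))  # huge problem: overhead is negligible
```

With these assumed numbers, a thousand-element kernel spends well over 90% of its wall time on fixed overheads, while a hundred-million-element kernel spends a fraction of a percent; the crossover is what makes small problems slower on the GPU than on the CPU.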

The other problem is that your problem size is so small that it does not even fully occupy a single streaming multiprocessor. To fully utilise a GPU, each work group should contain a few hundred threads and there should be a few hundred work groups.
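As a back-of-the-envelope check on those rules of thumb (the multiprocessor count below is an assumed example figure, not something queried from your device):

```python
# Rough thread-count target for keeping a GPU busy:
# a few hundred threads per work group, and a few hundred work groups.
# The SM count is an assumed example figure, not a queried value.

num_multiprocessors = 8          # assumed for illustration
resident_threads_per_sm = 1536   # threads an SM can keep track of (per the post)

work_group_size = 256            # "a few hundred" threads per group
num_work_groups = 256            # "a few hundred" groups

total_work_items = work_group_size * num_work_groups
print(total_work_items)                                # 65536 threads submitted
print(resident_threads_per_sm * num_multiprocessors)   # 12288 resident at once
```

Submitting more threads than can be resident at once is deliberate: the surplus groups queue up and keep every multiprocessor fed as earlier groups finish.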

With regard to our previous discussion on how many threads run in parallel, I should probably elaborate further. Each streaming multiprocessor executes a certain number of threads in parallel: 48 in your case, in the best-case scenario. Let's say these threads reach a memory access instruction. It doesn't matter whether it targets global or local memory; both take a certain amount of time to return data to the thread. During that time, these threads all block, waiting for the data from memory. Rather than letting the hardware sit idle, the thread scheduler switches to threads from a different thread group, called a warp in Nvidia terminology. Each multiprocessor can keep track of the execution status of several hundred threads, up to 1536 on your GPU if memory serves. That is why you need so many threads: to make sure the GPU never sits idle.
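A toy calculation shows why so many resident threads are needed to hide memory latency. The latency and per-instruction cost figures below are rough assumptions for illustration only, not specifications of any real GPU:

```python
# Toy latency-hiding model: while one warp waits on memory, the scheduler
# issues instructions from other resident warps. All latency figures are
# illustrative assumptions.

warp_size = 48                   # threads executed in parallel per SM (per the post)
memory_latency_cycles = 400      # assumed global-memory latency
math_instrs_between_loads = 10   # assumed instruction mix per warp
cycles_per_instr = 4             # assumed issue cost per warp instruction

# Each warp keeps the SM busy for this many cycles before blocking on memory:
busy_cycles_per_warp = math_instrs_between_loads * cycles_per_instr

# Warps needed so other work is always available while a load is in flight
# (ceiling division, since a partial warp still counts as one):
warps_needed = -(-memory_latency_cycles // busy_cycles_per_warp)
threads_needed = warps_needed * warp_size
print(warps_needed, threads_needed)  # 10 warps -> 480 resident threads
```

So even in this optimistic sketch, hundreds of resident threads per multiprocessor are needed just to paper over one outstanding memory access; real kernels with more frequent loads need correspondingly more.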

Now for a comment on your chosen problem, squaring each element of a vector. This is a poor fit for a GPU because the amount of maths done per element is small compared to the number of memory access operations. PCIe bandwidth also hampers you here; it is actually the biggest shortcoming of your code. Using plain reads and writes, your maximum observed PCIe bandwidth will be about 5 GB/s. Since a float is 4 bytes, you can send about 10^9 floats to the GPU each second, and since each element results in one floating-point operation, that works out to about 10^9 floating-point operations per second, or 1 GFLOPS. Even a single core of your CPU can beat that rate, and it has slightly lower overheads too, because the "transfer" from host memory to device memory is just a copy from one location in RAM to another, which is faster than sending data over the PCIe bus. You need many more operations per element of data before a GPU becomes worthwhile.
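The bandwidth arithmetic above can be checked directly:

```python
# PCIe bandwidth arithmetic from the paragraph above.
pcie_bytes_per_s = 5e9      # ~5 GB/s observed PCIe bandwidth
bytes_per_float = 4

floats_per_s = pcie_bytes_per_s / bytes_per_float   # floats delivered per second
flops = floats_per_s                                # squaring = 1 op per element

print(f"{flops / 1e9:.2f} GFLOPS")  # 1.25 GFLOPS, i.e. roughly 10^9 ops/s
```

1.25 GFLOPS is generous too, since it ignores the return transfer of the squared values, which eats into the same bus.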