CUDA Support/Measuring kernel runtime

There are several ways to measure the runtime of a CUDA kernel (or of any other operation). A portable option is to use the timer functions from the CUTIL utility library (cutil.h) distributed with the CUDA SDK samples, which work across different operating systems. To measure runtime, you need to add two steps to your code:

1. Before launching the kernel, create and start a timer:

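#include <cutil.h>          // CUTIL timer functions from the CUDA SDK

unsigned int timer;         // handle that identifies the timer
cutCreateTimer(&timer);     // allocate a new timer and store its handle
cutStartTimer(timer);       // start measuring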

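2. After launching the kernel, wait until the kernel has finished, then stop and read the timer:

cudaThreadSynchronize();                     // block until the kernel has completed
cutStopTimer(timer);                         // stop measuring
float elapsed_ms = cutGetTimerValue(timer);  // elapsed time in milliseconds
cutDeleteTimer(timer);                       // release the timer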

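The call to cudaThreadSynchronize is necessary because CUDA kernel launches are non-blocking: control returns to the host as soon as the kernel has been queued, so the statement immediately following the launch can execute before the kernel has completed. Without the call, the timer would measure little more than the launch overhead. cudaThreadSynchronize forces the host to wait until the kernel has completed. In CUDA 4.0 and later this function is deprecated in favor of cudaDeviceSynchronize, which serves the same purpose here.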

Note that cutGetTimerValue returns the elapsed time in milliseconds, so divide by 1,000 to get the time in seconds. Also, a timer accumulates time across runs: if you call cutStartTimer and cutStopTimer several times on the same timer without calling cutResetTimer or cutDeleteTimer, cutGetTimerValue returns the total time over all runs, not the time of the most recent run. To get the average time per run, call cutGetAverageTimerValue instead.
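
The unit conversion and the difference between a total and an average over several runs can be sketched in plain C. This is only an illustration of the arithmetic: RunStats, stats_add, stats_total_ms, stats_average_ms and ms_to_seconds are hypothetical names introduced here and are not part of CUTIL or CUDA, and the readings passed in are assumed to be millisecond values obtained from a timer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of CUTIL): accumulates per-run timings
   so that both the total and the per-run average can be reported. */
typedef struct {
    float total_ms;    /* sum of all recorded run times, in milliseconds */
    unsigned int runs; /* number of recorded runs */
} RunStats;

/* Record one timed run, given its elapsed time in milliseconds. */
void stats_add(RunStats *s, float elapsed_ms) {
    assert(s != NULL);
    s->total_ms += elapsed_ms;
    s->runs += 1;
}

/* Total time over all recorded runs, in milliseconds. */
float stats_total_ms(const RunStats *s) {
    assert(s != NULL);
    return s->total_ms;
}

/* Average time per run, in milliseconds; 0 if nothing was recorded. */
float stats_average_ms(const RunStats *s) {
    assert(s != NULL);
    return s->runs ? s->total_ms / (float)s->runs : 0.0f;
}

/* Convert milliseconds to seconds. */
float ms_to_seconds(float ms) {
    return ms / 1000.0f;
}
```

For example, recording two made-up readings of 500 ms and 1500 ms gives a total of 2000 ms (2 seconds) and an average of 1000 ms (1 second) per run. Keeping the total and the run count separately makes it explicit which of the two figures is being reported.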