The problem may come from the last cudaMemcpy2D(): cudaMemcpy2D(h_output[0], host_orig_pitch, d_array, pitch, N2 * sizeof(Complex), M2, cudaMemcpyHostToDevice); This copies data from host to device, but my guess is that you intended to copy from device to host, just as you did a few lines above: cudaMemcpy2D(h_output[0], host_orig_pitch,...
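If device-to-host is indeed the intent, a minimal sketch of the corrected call might look like this (the names `h_output`, `d_array`, `pitch`, `host_orig_pitch`, `N2`, `M2`, and `Complex` are taken from the question; the only substantive change is the direction flag):

```cuda
// Copy M2 rows of N2 complex elements from the pitched device allocation
// back into the host buffer. Destination comes first, then source, and the
// direction flag must agree with that order.
cudaError_t err = cudaMemcpy2D(h_output[0], host_orig_pitch,  // dst (host), dst pitch
                               d_array, pitch,                // src (device), src pitch
                               N2 * sizeof(Complex), M2,      // width in bytes, height in rows
                               cudaMemcpyDeviceToHost);       // was cudaMemcpyHostToDevice
if (err != cudaSuccess)
    fprintf(stderr, "cudaMemcpy2D failed: %s\n", cudaGetErrorString(err));
```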

The cuFFT library has grown considerably larger from CUDA 4.2 to 7.0, which results in substantially more initialization time. If you remove this initialization time as a factor, I think you will find far less than a 1000x difference in execution time. Here's a modified code demonstrating this:...
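One way to factor the one-time library initialization out of a benchmark is to create and destroy a throwaway plan before starting the timer. A sketch of that pattern (the warm-up size, `SIGNAL_SIZE`, and `d_signal` are illustrative or taken from the surrounding code):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Warm-up: the first cuFFT call pays the library initialization cost
// (much larger in CUDA 7.0 than in 4.2), so trigger it before timing.
cufftHandle warmup;
cufftPlan1d(&warmup, 1024, CUFFT_C2C, 1);
cufftDestroy(warmup);
cudaDeviceSynchronize();

// Now time only the work you actually care about.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);

cufftHandle plan;
cufftPlan1d(&plan, SIGNAL_SIZE, CUFFT_C2C, 1);
cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
```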

My blind guess is that, although the TK1 has a more modern core, the memory bandwidth available to the 144 cores of your 635M is significantly higher than that of the Tegra. Furthermore, CUDA performance is always somewhat sensitive to warp/block/grid sizes, so it's perfectly possible that the...

You say: "What I'm doing is to create and launch a new CUDA stream as a result of a complete pulse transmission." Re-use the streams, rather than creating a new stream each time. Then you can re-use the plan created for that stream ahead of time, and you have no need...
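A minimal sketch of the reuse pattern, assuming a fixed pool of streams with one cuFFT plan bound to each via `cufftSetStream` (the pool size, `FFT_SIZE`, `pulse_index`, and the buffer names are illustrative):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define NUM_STREAMS 4                 // illustrative pool size

cudaStream_t streams[NUM_STREAMS];
cufftHandle  plans[NUM_STREAMS];

// One-time setup: create the streams and plans ahead of time,
// and bind each plan to its stream.
for (int i = 0; i < NUM_STREAMS; ++i) {
    cudaStreamCreate(&streams[i]);
    cufftPlan1d(&plans[i], FFT_SIZE, CUFFT_C2C, 1);
    cufftSetStream(plans[i], streams[i]);
}

// Per-pulse work: pick the next stream round-robin instead of
// creating a new stream and plan every time.
int s = pulse_index % NUM_STREAMS;
cufftExecC2C(plans[s], d_in, d_out, CUFFT_FORWARD);  // runs asynchronously in streams[s]
```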

This code sequence is illegal: for (unsigned int i = 0; i < SIGNAL_SIZE; ++i) { d_signal[i].x = 2*d_signal[i].x; d_signal[i].y = 2*d_signal[i].y; } d_signal has been previously created as a device pointer (via cudaMalloc), and it is illegal to directly use, i.e. dereference, a device pointer in host code. One possible...
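One possible fix is to move the loop into a kernel, so the dereference happens in device code where it is legal (a sketch; `Complex` and `SIGNAL_SIZE` come from the question, the launch configuration is illustrative):

```cuda
// Scale each complex sample by 2 on the device, where dereferencing
// d_signal is legal.
__global__ void scaleSignal(Complex *d_signal, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d_signal[i].x = 2 * d_signal[i].x;
        d_signal[i].y = 2 * d_signal[i].y;
    }
}

// Host side: replace the illegal host loop with a kernel launch.
scaleSignal<<<(SIGNAL_SIZE + 255) / 256, 256>>>(d_signal, SIGNAL_SIZE);
```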

If you use the Advanced Data Layout, the idist parameter lets you set an arbitrary offset between the starting points of two successive transform input sets. For the 1D case, the input will be selected as follows, based on the parameters you pass: input[ b * idist...
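A sketch of a batched 1D plan using the advanced data layout, where idist sets the element offset between the start of successive input sets (`FFT_SIZE`, `BATCH`, and the 50% overlap are illustrative choices, not from the original):

```cuda
#include <cufft.h>

int n[1]       = { FFT_SIZE };    // transform length
int inembed[1] = { FFT_SIZE };    // input storage dimensions
int onembed[1] = { FFT_SIZE };    // output storage dimensions
int istride = 1, ostride = 1;     // contiguous elements within one transform
int idist = FFT_SIZE / 2;         // e.g. 50% overlap between successive input sets
int odist = FFT_SIZE;             // outputs packed back-to-back

cufftHandle plan;
cufftPlanMany(&plan, 1, n,
              inembed, istride, idist,
              onembed, ostride, odist,
              CUFFT_C2C, BATCH);
// For batch b, element x, the plan reads input[b * idist + x * istride].
```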