I have a Fortran code (it does a lot of fluid mechanics calculations) that I think I can speed up using CUDA Fortran's parallel computation. I was able to write a code that multiplies two-dimensional (400,400) matrices in almost 1/80 of the time the CPU needs to perform the same multiplication.
But, unfortunately, when you want to copy from device memory to host memory (which you need to do if you want to write the result to a file, or do further processing on the CPU), it takes so ridiculously long that performing the task on the GPU is no longer worth it! I hope this is not the case.
Does anyone know what mistake I made? Can you please help? I found that there is little (or almost no) written guidance on how to program in CUDA Fortran, unlike CUDA C.

How long is "ridiculously long", and how much is "almost 1/80 of the time needed"? How are you measuring your performance?

How are you copying your arrays? Copying whole arrays is often faster than copying sub-arrays. A whole array can be copied in one contiguous block, while sub-arrays need to be copied in many small chunks. Given the high overhead of data movement, I find that minimising the number of copies is more important than the amount of data being copied. For sub-arrays, it's better to gather the data into a single contiguous block before copying. I touch upon this topic in my article Multi-GPU Programming Using CUDA Fortran, MPI, and GPUDirect.
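To illustrate the difference (a sketch with hypothetical array names, using CUDA Fortran's array-assignment copies):

```fortran
program copy_example
  use cudafor
  implicit none
  real, allocatable :: a_host(:,:)
  real, device, allocatable :: a_dev(:,:)

  allocate(a_host(400,400), a_dev(400,400))

  ! Whole array: one contiguous device-to-host transfer
  a_host = a_dev

  ! Sub-array (a row of a column-major array): the elements are strided,
  ! so this becomes many small transfers and is far slower per byte
  a_host(1,:) = a_dev(1,:)
end program copy_example
```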

If you are copying the whole array using array syntax (i.e. Arr_host = Arr_device), add the "pinned" attribute to your host array. This requests, but does not guarantee, that the OS place the array in physical, non-pageable memory, i.e. "pinned" memory. In order to perform a DMA transfer (i.e. copy data to or from the device), the memory must be pinned. So without the "pinned" attribute, the data must first be copied from pageable to pinned memory and then transferred to the device. The "pinned" attribute eliminates the need for this extra copy. The caveat of using "pinned" is that this memory is managed by the CUDA device driver; hence, if you destroy your CUDA context, this memory will be destroyed as well.
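For example (a sketch; the PINNED= specifier on ALLOCATE lets you check whether the pinning request actually succeeded):

```fortran
program pinned_example
  use cudafor
  implicit none
  real, pinned, allocatable :: a_host(:,:)
  real, device, allocatable :: a_dev(:,:)
  logical :: got_pinned

  ! Request page-locked host memory; the OS may refuse, so check the flag
  allocate(a_host(400,400), pinned=got_pinned)
  allocate(a_dev(400,400))
  if (.not. got_pinned) print *, 'warning: host array is pageable'

  a_host = a_dev   ! DMA transfer with no intermediate staging copy
end program pinned_example
```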

If neither of these helps, then you're stuck. Your only other options are to reduce the frequency of the copies or increase your computation on the device.

Another thing to check is that your device is performing as expected. I've seen instances where the card was installed in the wrong PCI slot, with a single channel as opposed to a quad channel, giving 1/4th the memory performance. You can check this by running the "pgaccelinfo" utility and looking at the bandwidth test. For comparison, here's the output from my C2070 system.

Thanks Mat for your prompt reply. I have two questions to ask you:
1) How can I run pgaccelinfo? Do I run a special command in cmd?
2) How can I check that the memory bandwidth is at its optimal value? I am new to this technology (NVIDIA and CUDA).
I don't have a Tesla; I wish our budget could afford one. Instead, I thought I should start with something simple like an NVIDIA GeForce 460 graphics card, installed in a Dell XPS 8300 (8 i7 cores), and I hope to run a test to see what my GPU is capable of.
Also, regarding your question about the time needed to copy: I have developed the code below. Please look at it; it might make it easier for you to understand what I am trying to do. I made it by myself with my little knowledge of CUDA Fortran, since there are no books available other than some on CUDA C.
I used the intrinsic CALL DATE_AND_TIME to measure the run time, to see how many milliseconds (1/1000 of a second) the multiplication needs. The GPU kernel took 1 msec; the same task done by the CPU took 80 msec, so that's a big achievement for me. But when I try to copy from GPU to CPU memory, I get the big shock: it takes 227 msec to copy! That's almost three times longer than the CPU takes to do the multiplication. I did not use the pinned attribute, though; that will be my next task.
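For reference, my DATE_AND_TIME timing looks roughly like this (a sketch; it ignores hour and day rollover):

```fortran
integer :: t0(8), t1(8), elapsed_ms

call date_and_time(values=t0)
! ... section being timed (kernel call, copy, etc.) ...
call date_and_time(values=t1)

! values(6)=minute, values(7)=second, values(8)=milliseconds
elapsed_ms = (t1(6)-t0(6))*60000 + (t1(7)-t0(7))*1000 + (t1(8)-t0(8))
print *, 'elapsed (msec): ', elapsed_ms
```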

You can run "pgaccelinfo" from a command line shell. Are you using PVF? If so, then open a PGI DOS command shell from the "Start" menu.

Quote:

2) How can I check that the memory bandwidth is at its optimal value?

Actually, I'm not really sure. The product spec page lists the memory bandwidth, but this is the on-chip bandwidth, not the host-to-device bandwidth. I suspect it will depend upon a lot of factors besides the card itself.

Quote:

here is the code:

First off, your timing code is incorrect. Kernels are launched asynchronously from the host code. Hence, after the call, the host continues until it reaches a synchronisation point, which in this case is the data transfer of the C array. So here you're timing the data transfer to the device plus a little overhead of launching the kernel, but not the kernel itself. To fix this, either add a call to "cudaThreadSynchronize" just after your kernel call, or better yet, use CUDA events to time the code. In my article Tuning a Monte Carlo Algorithm on GPUs, I show an example of using CUDA events.
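A sketch of event-based timing (the kernel and array names here are hypothetical placeholders for yours):

```fortran
use cudafor
type(cudaEvent) :: startEv, stopEv
real :: time_ms
integer :: istat

istat = cudaEventCreate(startEv)
istat = cudaEventCreate(stopEv)

istat = cudaEventRecord(startEv, 0)
call matmul_kernel<<<grid, tBlock>>>(A_d, B_d, C_d)  ! hypothetical kernel
istat = cudaEventRecord(stopEv, 0)

! The elapsed time is only valid after the stop event has completed
istat = cudaEventSynchronize(stopEv)
istat = cudaEventElapsedTime(time_ms, startEv, stopEv)
print *, 'kernel time (msec): ', time_ms

istat = cudaEventDestroy(startEv)
istat = cudaEventDestroy(stopEv)
```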

Did you mean to only use a single block? This will give you very poor performance. So I suspect that the problem here is not the data copy, but rather how you are scheduling your kernel. Note that PVF ships with an example CUDA Fortran Matmul project, and the Workstation products ship with an optimised example, sgemm.cuf.
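For a 400x400 matrix, the launch configuration should cover the whole matrix with a grid of blocks, something like this (a sketch; matmul_kernel and the device arrays are hypothetical names):

```fortran
use cudafor
type(dim3) :: grid, tBlock

! 16x16 threads per block; round the grid up so it covers all 400x400 elements
tBlock = dim3(16, 16, 1)
grid   = dim3((400 + tBlock%x - 1)/tBlock%x, &
              (400 + tBlock%y - 1)/tBlock%y, 1)

call matmul_kernel<<<grid, tBlock>>>(A_d, B_d, C_d)
```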