Just thought I would ping this thread and ask if there is anything special that we should be doing with the Intel compiler?

I have been getting similar messages, but only for VERY LARGE problems (see the 64-bit integer thread), and I am also using the Intel compiler. Unfortunately, simply switching to GCC is not an option for me (for example, we also use the Intel compiler under Windows).

Upon further investigation, we found that the problem lies with Intel's compiler. This CUDA release note summarizes the issue:

There is a known bug in ICC with respect to passing 16-byte aligned types by value to GCC-built code such as the CUDA Toolkit libraries (e.g., CUBLAS). At this time, passing a double2 or cuDoubleComplex or any other 16-byte aligned type by value to GCC-built code from ICC-built code will pass incorrect data. Intel has been informed of this bug. As a workaround, a GCC-built wrapper function that accepts the data by reference from the ICC-built code can be linked with the ICC-built code; the GCC-built wrapper can then, in turn, pass the data by value to the CUDA Toolkit libraries.
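The workaround described in the release note can be sketched in plain C. This is a minimal, hypothetical illustration, not MAGMA or CUBLAS code. `complex16` stands in for a 16-byte aligned type such as `double2` or `cuDoubleComplex`. `lib_scale_by_value` stands in for a GCC-built library routine that takes such a value by value, and `lib_scale_by_ref` is the GCC-built wrapper that ICC-built code would call instead. In a real build, both functions would sit in a source file compiled with GCC and linked into the ICC-built application, so the by-value call never crosses the ICC/GCC boundary.

```c
#include <assert.h>

/* Hypothetical stand-in for a 16-byte aligned complex type such as
   double2 / cuDoubleComplex. _Alignas forces 16-byte alignment (C11). */
typedef struct {
    _Alignas(16) double x; /* real part */
    double y;              /* imaginary part */
} complex16;

_Static_assert(_Alignof(complex16) == 16, "complex16 must be 16-byte aligned");

/* Hypothetical stand-in for a GCC-built library routine that receives a
   16-byte aligned value BY VALUE (the kind of call the ICC bug corrupts).
   Computes z = alpha * z in place. */
void lib_scale_by_value(complex16 alpha, complex16 *z)
{
    double re = alpha.x * z->x - alpha.y * z->y;
    double im = alpha.x * z->y + alpha.y * z->x;
    z->x = re;
    z->y = im;
}

/* Hypothetical GCC-built wrapper. ICC-built code passes only pointers, so
   no 16-byte aligned struct is passed by value across the ICC/GCC boundary;
   the by-value call happens entirely inside GCC-compiled code. */
void lib_scale_by_ref(const complex16 *alpha, complex16 *z)
{
    assert(alpha && z);
    lib_scale_by_value(*alpha, z);
}
```

For example, calling `lib_scale_by_ref` with `alpha = 2 + i` and `z = 3 - i` leaves `z = 7 + i`. The same pattern would apply to a legacy CUBLAS routine: the wrapper takes `alpha` and `beta` through pointers and dereferences them when making the by-value library call.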

Until we implement and release the suggested workaround, we recommend using GCC to compile MAGMA.
Stan

I also recall seeing that message in the release notes a while back, but since it is not mentioned in the CUDA 4.1 or 4.2 release notes (I have not checked 4.0), I assumed it was no longer an issue. Could it be that it is no longer mentioned because the new CUBLAS interface passes the scaling factors (for example, alpha in ZGEMM) by reference? The legacy API still passes them by value.

One thing that makes me doubt that the launch failures I am experiencing (see viewtopic.php?f=2&t=536) are simply due to compiler incompatibility is that they also occur on Windows. Am I right in assuming that the Windows CUDA binaries are not built with GCC?