I am getting NaN values for the norm result with testing_dpotrf and testing_dpotrf_gpu.

The results are inconsistent and confusing, as follows:

1. spotrf, cpotrf, and zpotrf do not show the problem.
2. dpotrf shows it sometimes, and it depends on what has run before it. If I have just run zpotrf I get some good values.
3. They worked OK yesterday when I made some initial tests and changed from a single-threaded BLAS to GotoBLAS. I have some files to prove it!

I have explored a number of things but cannot get consistent behaviour.

I saw something about NVIDIA drivers which may need updating. I am currently using driver 260.19.21 and have 260.19.26 available.

I have installed MAGMA 1.0 RC2 on a system running Ubuntu Linux 10.4 (64-bit) and have CUDA 3.2 installed. I have an 8-core CPU and 8 GB of main memory. The GPU reports as follows:

device 0: GeForce GTX 460, 1400.0 MHz clock, 2047.2 MB memory

I am using gcc and gfortran and have GotoBLAS2 installed.

Is there a command I can run to restore the GPU to a consistent start state?

Thanks for this bug report. We haven't been able to reproduce it so far (we tried a GTX 480 and a C2050) and I was wondering if you still get that error. The four precisions are generated from double complex, so if some of them work, a possible source of the problem is the BLAS being used. I see that in double precision we use magmablas_dtrsm and magmablas_dgemm. If you comment out the redefinitions at the beginning of the files, i.e.,
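    /* at the top of src/dpotrf.cpp and src/dpotrf_gpu.cpp
       (quoted from memory; the exact lines may differ slightly between versions) */
    #define cublasDgemm magmablas_dgemm
    #define cublasDtrsm magmablas_dtrsm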

and recompile, you would be using CUBLAS. I guess this would work. If so, can you please check whether the problem comes from magmablas_dgemm or magmablas_dtrsm (or both) by trying the different combinations of redefining the CUBLAS routines with these MAGMA BLAS routines.
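For example, to test magmablas_dgemm in isolation you would keep its line and comment out the other one (and vice versa for dtrsm), along these lines:

    #define cublasDgemm magmablas_dgemm       /* MAGMA BLAS dgemm stays in        */
    /* #define cublasDtrsm magmablas_dtrsm */ /* dtrsm now falls back to CUBLAS   */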

Thank you for this and other replies. I have made the changes to comment out both MAGMA BLAS redefinitions in dpotrf and dpotrf_gpu and rerun the tests. The NaNs are replaced by the correct small values. I don't regard this as final, because in my experience the problem has been intermittent.

Here is my make.inc file. I am currently using GotoBLAS2 on an Ubuntu 10.4 (64-bit) system (for details see my earlier post).
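In outline it follows the make.inc.goto example shipped with MAGMA, with my local paths filled in; the paths shown here are illustrative rather than exact:

    CC         = gcc
    NVCC       = nvcc
    FORT       = gfortran

    OPTS       = -O3 -DADD_
    FOPTS      = -O3 -DADD_ -x f95-cpp-input
    NVOPTS     = --compiler-options -fno-strict-aliasing -DUNIX -O3 -DADD_

    LIB        = -lgoto -lpthread -lcublas -lcudart -llapack -lm

    CUDADIR    = /usr/local/cuda
    LIBDIR     = -L/path/to/GotoBLAS2 -L$(CUDADIR)/lib64
    INC        = -I$(CUDADIR)/include

    GPU_TARGET = 1    # 0 = Tesla, 1 = Fermi (my GTX 460 is Fermi)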

With CUBLAS I get this, with lower GPU Gflop/s. I have some cases where I get correct answers with lower Gflop/s and some wrong answers with higher Gflop/s, implying that the intermittence is caused by a different choice of BLAS routine being made. I don't know how that could come about.

Unfortunately, I cannot get this to reproduce at the moment, even after shutting down and powering off, and I had lost it off the top of the screen before I could capture it.

In both cases I have modified the functions to use CUBLAS, and the error has not been seen with those versions.

Incidentally, I think it would help to indicate for some of the tests how much memory is needed to run them. testing_zgetrf_gpu needed most of my 8 GB machine to run, and I think most of the 2 GB on the GPU. As I use the GPU also for screen output and run the system monitor, I notice that the screen display update slows down when the GPU is working on the numerical tasks. I notice that GotoBLAS normally runs 4 of my 8 cores at 100% when working hard; only when I was building GotoBLAS did I see more of them running.
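The arithmetic for a rough estimate is simple: an N x N double-complex matrix is 16 bytes per element, and the tester keeps at least one host copy as well as the device copy for the residual check. A minimal sketch in C, assuming the largest size from the test runs in this thread:

    #include <stdio.h>

    int main(void)
    {
        long long n = 10112;             /* largest N in the runs above      */
        long long bytes = 16LL * n * n;  /* sizeof(double complex) = 16      */
        /* one copy lives on the GPU; the tester also keeps at least one
           host copy of the matrix for the residual check */
        printf("N = %lld: %.2f GB per matrix copy\n", n, bytes / 1e9);
        return 0;
    }

For N = 10112 that is about 1.64 GB per copy, which is consistent with filling most of a 2 GB GPU once workspace is included.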

Now I can confirm that something weird may be happening. I had done experiments on a GTX 480 (which is the closest we have to yours) before, and as I indicated in my previous post in this thread, I couldn't reproduce any of the problems you mentioned. After your latest posts I went back to that machine and could reproduce the problems you mention, and many others! I requested exclusive use of the machine immediately after it was rebooted. I saw similar problems on the first several runs, but then it went into some stable state and now everything works.

I'll do some more experimentation and probably talk to NVIDIA people to see if they have a suggestion as to what may cause this type of behavior. In our case, our system administrator mentioned that he has been getting frequent requests to reboot the machine because the numerical libraries we use get into some problem with the hardware.

Otherwise, I didn't see any problems in your make.inc file. I also compiled with GotoBLAS and have it running right now. For example, testing dgesv_gpu I get this

Here is the strange output from dgetrf, which I can get starting from cold in the morning. I could not get it to repeat last night.

Is there a software tool I could run to see what is happening on the GPU? I usually run with System Monitor on the main display and use NX to run the tests from another computer. I have only ever seen the strange behaviour with the d versions, never s, c, or z.

      M      N    CPU GFlop/s    GPU GFlop/s    ||PA-LU||/(||A||*N)
    ============================================================
    can not bind to texture
    can not bind to texture
      (... same message repeated many more times ...)
     1024   1024      20.90           46.07     nan
    Argument 7 of dgetrf had an illegal value.
     2048   2048      33.02       520411.45     1.766772e-01
    Argument 7 of dgetrf had an illegal value.
     3072   3072      35.84      1756603.11     1.767735e-01
    Argument 7 of dgetrf had an illegal value.
     4032   4032      37.05      3971886.55     1.767382e-01
    Argument 7 of dgetrf had an illegal value.
     5184   5184      37.86      8442055.40     1.767285e-01
    Argument 7 of dgetrf had an illegal value.
     6016   6016      39.56     13194270.78     1.767584e-01
    Argument 7 of dgetrf had an illegal value.
     7040   7040      39.49     19382027.38     1.767586e-01
    Argument 7 of dgetrf had an illegal value.
     8064   8064      38.59     31778048.19     1.767457e-01
    Argument 7 of dgetrf had an illegal value.
     9088   9088      37.90     45486777.31     1.767460e-01
    Argument 7 of dgetrf had an illegal value.
    10112  10112      38.19     62660668.82     1.767458e-01

I have some information that may shed some light on this. I'm working with a quantum molecular dynamics code called RMG that does subspace diagonalizations. It works well with LAPACK and ScaLAPACK, but I ran into the same sort of problems described here when I tried to use MAGMA: wrong, inconsistent, or erroneous results, including illegal-parameter reports from DLASCL and texture errors. I had been doing my tests on my workstation, which has a GTX 560 card, but since that does not support ECC I decided to try it on Blue Waters, which has K20x cards, in order to rule out memory errors.

Well, things were different. Still not right, but I noticed something interesting. RMG uses some random starting vectors. On Blue Waters the initial iteration with magma_dpotrf_gpu differed by a large amount from the LAPACK version, but the results started to converge after a few more iterations. So I then tried a starting position that did not include random vectors, and I got identical results from MAGMA and LAPACK. In the random case the matrix passed to dpotrf will not be diagonally dominant, while in the second case it will be. It appears to me that magma_dpotrf_gpu is not handling things correctly when the matrix is not well conditioned. I'll run some additional tests to try to confirm this.
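To make the distinction concrete, here is a small CPU-side sketch, using plain LAPACKE rather than the RMG or MAGMA code (size and seed are arbitrary). It contrasts a Gram matrix built from random vectors, which is SPD in exact arithmetic but typically far from diagonally dominant and possibly poorly conditioned, with a diagonally shifted version that is strictly diagonally dominant:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>
    #include <lapacke.h>

    int main(void)
    {
        const int n = 512;
        double *B = malloc((size_t)n * n * sizeof *B);
        double *A = malloc((size_t)n * n * sizeof *A);
        double *W = malloc((size_t)n * n * sizeof *W);

        /* Random "starting vectors": the rows of B. */
        srand(42);
        for (int i = 0; i < n * n; i++)
            B[i] = (double)rand() / RAND_MAX - 0.5;

        /* Gram matrix A = B * B^T: symmetric positive definite in exact
           arithmetic, but typically NOT diagonally dominant, and it can
           be poorly conditioned. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += B[i * n + k] * B[j * n + k];
                A[i * n + j] = s;
            }

        /* Case 1: factor the raw Gram matrix. */
        memcpy(W, A, (size_t)n * n * sizeof *W);
        int info1 = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', n, W, n);

        /* Case 2: add each row's off-diagonal absolute sum to the
           diagonal, making the matrix strictly diagonally dominant
           (and therefore comfortably SPD). */
        memcpy(W, A, (size_t)n * n * sizeof *W);
        for (int i = 0; i < n; i++) {
            double r = 0.0;
            for (int j = 0; j < n; j++)
                if (j != i) r += fabs(W[i * n + j]);
            W[i * n + i] += r;
        }
        int info2 = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', n, W, n);

        printf("dpotrf, raw Gram matrix:           info = %d\n", info1);
        printf("dpotrf, diagonally dominant shift: info = %d\n", info2);

        free(B); free(A); free(W);
        return 0;
    }

Both cases should factor on the CPU; the point is just that the first kind of input is the one that seems to trip up magma_dpotrf_gpu.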

Thank you. I have not been active here for a long time; I am just restarting. I can see some light being shed on the reasons for the strange inconsistencies. If they are specific to cheaper hardware such as mine, that is worth knowing.

I just came across this as I was going to post something similar myself. I experienced some NaN errors when using magmablas_dgemm. I switched to cublasDgemm and this rectified the issue. The data, arguments and it's position in the code was the same for both. I am using a Tesla C2075 and shared version of the magma(blas) libraries (1.3). I'm not sure if this helps, but I could also try and reproduce the error if that would be more use.
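For what it's worth, the swap was a drop-in change of this shape (pointer and variable names here are illustrative, not from my actual code, and this assumes I have the MAGMA 1.x prototype right, where magmablas_dgemm takes the same argument list as the legacy CUBLAS call):

    #include <cublas.h>   /* legacy CUBLAS API, as used by MAGMA 1.x */
    #include "magma.h"

    /* The two calls take identical argument lists, so the switch is a
       one-line change. dA, dB, dC are device pointers. */
    static void my_gemm(int m, int n, int k, double alpha,
                        const double *dA, int lda,
                        const double *dB, int ldb,
                        double beta, double *dC, int ldc)
    {
    #ifdef USE_MAGMABLAS
        magmablas_dgemm('N', 'N', m, n, k, alpha, dA, lda, dB, ldb, beta, dC, ldc);
    #else
        cublasDgemm('N', 'N', m, n, k, alpha, dA, lda, dB, ldb, beta, dC, ldc);
    #endif
    }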