Hello, I have found something strange: when I use dpotrf to decompose a matrix and then multiply the factors back, I don't get the original matrix. What is going on? Maybe my call is wrong. Can anybody confirm that the decomposition works correctly? What should the call look like for a lower-triangular factorization of a square matrix?

Yes, I can confirm it works. You can look at testing_dpotrf.cpp. The examples there compare the results from LAPACK and MAGMA. If you want to do the test by computing A - L L^T, using the notation in testing_dpotrf.cpp, you can do something like

Oh, I found out what is wrong by looking at testing_dpotrf.cpp and printing the input matrix. If I have a lower triangular matrix, I must pass the "U" argument instead of "L". I don't know why, but it works. Is it possible that there is such a bug in the MAGMA library?

Stan Tomov wrote: Yes, I can confirm it works. You can look at testing_dpotrf.cpp. The examples there compare the results from LAPACK and MAGMA. If you want to do the test by computing A - L L^T, using the notation in testing_dpotrf.cpp, you can do something like

First thing: this code has a bug, because the h_R matrix cannot be allocated via the cudaMalloc function. h_R should be allocated on the host, not on the device.

Second thing: you don't understand me. Give me code that lets me see that the last 10x10 block of the input matrix and the matrix obtained by multiplying the factors after decomposition are the same. Try printing some elements from the end of the matrix. My point is that the decomposition has a bug or very low precision. I found that at around the 3000th element, the matrix after multiplication differs from the input by more than 1e-15. At the end of the matrix, the difference is about 1e-3. If you want to prove to me that MAGMA works fine, give me code that shows these matrices are the same.

Regarding the first remark, note that the allocation is done using cudaMallocHost (not cudaMalloc), so h_R is allocated on the CPU (host) as intended.

Regarding the second remark, can you please post the code that you think gives you the wrong result and we will look into it. All our tests are passing without problems. The code that I posted above gives you the residual, so you can print the last 10x10 block directly from there.

On my machine, the code with cudaMallocHost doesn't work (segmentation fault), but I don't think that is the reason. I'm doing a standard make from the testing folder provided with MAGMA. Please point out my mistake in the code. Thanks.

I get the correct result with your code, just changing h_R to be allocated with cudaMallocHost. This is needed because the code assumes the memory is pinned (and uses some asynchronous transfers that work only with pinned memory). What card is device 3 (in your case)?

Stan