I'm having trouble figuring out why the size of the matrix A on the device (d_A) in the example testing_sgesv_gpu.cpp is set to ((N+32)*(N+32) + 32*maxnb + lwork + 2*maxnb*maxnb). The manual says that the rest of the space will be used as workspace. I have an N x N matrix A in device memory where I need to update only the on-diagonal elements (i.e., row 0 col 0, row 1 col 1, row 2 col 2, ...) before feeding it into the MAGMA sgesv solver. Currently I have a simple kernel to do this:
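(The kernel source isn't shown above; the update described, writing the diagonal entries at indices 0, offset, 2*offset, ..., amounts to the following host-side sketch. The function name and the source of the new values are placeholders, not the poster's actual code.)

```c
#include <stddef.h>

/* Host-side sketch of the diagonal update the kernel performs.
 * A is column-major with leading dimension lda; for a tightly
 * packed N x N matrix, lda == N, so the stride between
 * consecutive diagonal entries (the "offset") is lda + 1 == N + 1. */
void update_diagonal(float *A, int N, int lda, const float *new_vals)
{
    int offset = lda + 1;                     /* N + 1 when lda == N */
    for (int i = 0; i < N; ++i)
        A[(size_t)i * offset] = new_vals[i];  /* element (i, i) */
}
```

A CUDA kernel doing the same thing would typically map one thread per diagonal element and compute the same `i * offset` index.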

where offset = N+1. How do the indices of the on-diagonal elements correspond to positions in d_A of size ((N+32)*(N+32) + 32*maxnb + lwork + 2*maxnb*maxnb)? I tried updating the on-diagonal elements of my matrix A and then copying it to d_A with cudaMemcpy using cudaMemcpyDeviceToDevice, but that does not seem to work. So instead I would like to copy the elements of A to d_A and update d_A directly, but I cannot figure out how the indexing corresponds in d_A. When I print out ((N+32)*(N+32) + 32*maxnb + lwork + 2*maxnb*maxnb) it gives me 25937, where N = 73. I am in the process of comparing the performance of MAGMA, CULA (commercial), and CUSP (a sparse matrix solver being developed by people from NVIDIA). So far MAGMA seems to have the best results, but if I can figure this out I will be able to present complete results. Thanks!

A strip around the matrix is required for padding (to make the leading dimension divisible by 32). This can increase the leading dimension of the matrix; see, e.g., testing_sgesv_gpu.cpp.
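Concretely, the testers round the leading dimension of the device matrix up to the next multiple of 32. A sketch of that arithmetic (the exact expression used in testing_sgesv_gpu.cpp may be written differently, but the result is the same):

```c
/* Round a dimension up to the next multiple of 32, as the MAGMA
 * testers do when choosing the leading dimension of a GPU matrix. */
int round_up_32(int n)
{
    return ((n + 31) / 32) * 32;
}
/* For N = 73 this gives a padded leading dimension of 96, so
 * column j of d_A starts at index j*96, not j*73. */
```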

Thus, if you have allocated enough memory, your program should work, but to update elements on the diagonal (and in general any element of the matrix) you must call d_update_matrix with offset equal to dlda+1 instead of N+1 (i.e., to jump across columns you have to increase your indices by multiples of dlda, not N). Please let us know if this doesn't work. Thanks, Stan
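To illustrate the indexing: in a column-major matrix with leading dimension dlda, element (i, j) sits at index i + j*dlda, so consecutive diagonal entries are spaced dlda + 1 apart. A small host-side check of that arithmetic (the device kernel would use the same index math; 96 is the padded leading dimension for N = 73 assumed above):

```c
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix whose
 * leading dimension (stride between columns) is dlda. */
size_t col_major_idx(int i, int j, int dlda)
{
    return (size_t)j * dlda + i;
}

/* Index of the k-th diagonal element: stepping by dlda + 1
 * from index 0 lands exactly on element (k, k). */
size_t diag_idx(int k, int dlda)
{
    return (size_t)k * (dlda + 1);
}
```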