If you want this on the GPU, a CUDA kernel would be fairly easy to write. Something like the code below should work (completely UNTESTED CODE). I'm not sure what you mean by "j"; here I assumed you meant the row index, starting from i = 0, where v is a column vector. Adjust as desired. Compile with nvcc, the CUDA compiler. I based this on the magma/magmablas/zaxpycp.cu code.