I'm a bit short of time, so I can't test this right now. What I typically do in these situations is strip the code back: remove the Launch command and replace it with a CopyOnDevice command, then check that you get back what you wrote to the device. If that works, add an empty kernel method that just does a copy, using the same launch parameters; it should give the same result as CopyOnDevice. If that also works, reintroduce your corner-turning operations, and so on.
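To make those stages concrete, here is a rough, untested sketch of the first two. It assumes a flat float[] buffer rather than your matrices, and the names StagedDebug, CopyKernel, N and the 256-thread block size are made up for illustration; swap in your own launch parameters.

```csharp
using System;
using System.Linq;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public static class StagedDebug
{
    // Hypothetical buffer size, chosen as a multiple of the block size below.
    const int N = 1024;

    // Stage 2 kernel: no real work, it only copies src to dst.
    [Cudafy]
    public static void CopyKernel(GThread thread, float[] src, float[] dst)
    {
        int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        if (i < src.Length)
            dst[i] = src[i];
    }

    public static void Main()
    {
        CudafyModule km = CudafyTranslator.Cudafy();
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
        gpu.LoadModule(km);

        float[] host = Enumerable.Range(0, N).Select(i => (float)i).ToArray();
        float[] result = new float[N];

        float[] devSrc = gpu.Allocate<float>(host);
        float[] devDst = gpu.Allocate<float>(result);
        gpu.CopyToDevice(host, devSrc);

        // Stage 1: no kernel at all, just a device-to-device copy.
        gpu.CopyOnDevice(devSrc, devDst);
        gpu.CopyFromDevice(devDst, result);
        Console.WriteLine("CopyOnDevice matches: " + host.SequenceEqual(result));

        // Reset the destination so stage 2 cannot pass on stale data.
        Array.Clear(result, 0, N);
        gpu.CopyToDevice(result, devDst);

        // Stage 2: copy-only kernel (assumed grid/block sizes; use your own).
        gpu.Launch(N / 256, 256).CopyKernel(devSrc, devDst);
        gpu.CopyFromDevice(devDst, result);
        Console.WriteLine("Copy kernel matches: " + host.SequenceEqual(result));

        gpu.FreeAll();
    }
}
```

If stage 1 fails, the problem is in allocation or the host/device transfers. If only stage 2 fails, look at the launch parameters or the indexing before touching the real kernel body.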

I followed your suggestions. The main problem was that my graphics card doesn't support double precision. After fixing some small issues in the code above, I changed the data types from double to float, and now it works.

// Loop over all the sub-matrices of A and B that are required to compute Csub.
// Multiply each pair of sub-matrices together and accumulate the results.
for (int m = 0; m < (widthA / BLOCK_SIZE); m++)
{
    // Get sub-matrix Asub of A
    float[,] Asub = thread.AllocateShared<float>("Asub", BLOCK_SIZE, BLOCK_SIZE);