There are two things that will make this compile. One is to reduce all the matrix dimensions (M, N and W) to 8 or smaller. The other is to use a hard-coded loop bound, so in the above example replacing i < W with i < 10.
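
To make the difference between the two loop forms concrete, here is a plain C++ sketch (CPU only, no C++ AMP, since the kernel being discussed is not quoted in this post). The function names, the row-major layout and the assumption that A is M x W and B is W x N are mine, not taken from the original code.

```cpp
#include <cstddef>
#include <vector>

// Assumed layout: row-major, A is M x W, B is W x N, result C is M x N.

// Variant 1: the inner loop bound W is only known at run time.
// This mirrors the "i < W" form of the loop.
inline std::vector<float> multiply_runtime(const std::vector<float>& a,
                                           const std::vector<float>& b,
                                           std::size_t M, std::size_t N,
                                           std::size_t W) {
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t row = 0; row < M; ++row) {
        for (std::size_t col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (std::size_t i = 0; i < W; ++i) {
                sum += a[row * W + i] * b[i * N + col];
            }
            c[row * N + col] = sum;
        }
    }
    return c;
}

// Variant 2: the inner loop bound is a compile-time constant.
// This mirrors the hard-coded "i < 10" form of the loop.
template <std::size_t W>
std::vector<float> multiply_fixed(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t M, std::size_t N) {
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t row = 0; row < M; ++row) {
        for (std::size_t col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (std::size_t i = 0; i < W; ++i) {
                sum += a[row * W + i] * b[i * N + col];
            }
            c[row * N + col] = sum;
        }
    }
    return c;
}
```

Both variants compute the same result; the only difference is whether the compiler can see the trip count of the inner loop, which is presumably what changes the generated shader code in the failing case.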

Yes, it's nice and simple code. I first encountered the problem while working on something different, which ran OK in the DP but threw an exception in the Beta, so I decided to start with some simple examples to see whether I got the same or similar errors. Adding arrays was too easy, but the matrix multiplication did reproduce the driver crash.

Back to your questions/suggestions:

1) Using direct3d\ref as the accelerator, the code works fine.
2) Sorry, I currently do not have a DirectX11 AMD device available, so I cannot test this.
3) Using 1024 (or 1000, to make it a non-2^n number) works fine. I then decided to try a few more small numbers, and the behaviour is erratic but consistent: 9 and 10 fail, 11 and 12 work, 13 and 14 fail, 15 and 16 work, 17 and 18 fail, and so on. The same is true for numbers around 1024.
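
The sizes listed in 3) fit a simple rule: a size fails when its remainder modulo 4 is 1 or 2, and works when the remainder is 3 or 0. The helper below is only a hypothetical summary of those observations (the name `observed_to_fail` is made up), and it assumes that sizes of 8 or smaller are fine, as described for the compile workaround. It is an extrapolation from the handful of values tested, not a statement about what the driver actually does.

```cpp
// Hypothetical predicate summarising the sizes reported in this thread.
// Assumption: sizes of 8 or smaller are treated as working.
constexpr bool observed_to_fail(unsigned n) {
    if (n <= 8) {
        return false;
    }
    // Reported: 9, 10 fail; 11, 12 work; 13, 14 fail; 15, 16 work; ...
    const unsigned r = n % 4;
    return r == 1 || r == 2;
}
```

This agrees with 1000 and 1024 working (both are multiples of 4), and it would predict that 1025 and 1026 fail while 1027 and 1028 work.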