Regarding the read/write performance: if I run the numbers, I'm getting around 160 MB/sec for a read or write of 128*128*128 elements * 3 floats (float3) * 4 bytes = ~25 MB. That does seem a little slow. There are a couple of things we could look at: picking different data formats, trying async copies, and the like. If you like, we could take this offline; just send email to streamdeveloper@amd.com.
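
For reference, here is the arithmetic behind those figures as a small Python sketch. The function names are my own, and I'm assuming "MB" means 10^6 bytes and that the 160 MB/sec rate applies to a single one-way copy; neither is measured here.

```python
def buffer_bytes(nx, ny, nz, components=3, bytes_per_component=4):
    """Size in bytes of an nx*ny*nz grid of float3 values (3 x 4-byte floats)."""
    return nx * ny * nz * components * bytes_per_component


def transfer_seconds(num_bytes, bandwidth_mb_per_s):
    """Time for one one-way copy, assuming 1 MB = 10**6 bytes."""
    return num_bytes / (bandwidth_mb_per_s * 1e6)


size = buffer_bytes(128, 128, 128)
one_way = transfer_seconds(size, 160.0)

print(size)     # 25165824 bytes, i.e. ~25 MB
print(one_way)  # roughly 0.157 seconds per copy
```

So if the buffer really is copied both in and out every frame at this rate, the transfers alone would cost about 0.31 seconds per frame.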

When you say you're copying in and out of the GPU every frame, is that inherent in your algorithm? I can see how you'd need to get the data back from the GPU, but shouldn't it be possible to leave each time step's result on the card as the input for the next frame?
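
To illustrate what I mean, here is a rough sketch in plain Python, not actual stream code. Everything in it is hypothetical: `step` stands in for your simulation kernel, Python lists stand in for buffers on the card, and `list(...)` copies stand in for host/GPU transfers. The idea is to upload once, swap two resident buffers between frames, and only read results back.

```python
def step(src, dst):
    """Hypothetical stand-in for one simulation kernel: reads src, writes dst."""
    for i, value in enumerate(src):
        dst[i] = value * 0.5 + 1.0


def run_roundtrip(initial, frames):
    """Copy the state in and out of the 'card' on every frame."""
    transfers = 0
    host = list(initial)
    for _ in range(frames):
        device_in = list(host)           # upload (host -> card)
        transfers += 1
        device_out = [0.0] * len(host)
        step(device_in, device_out)
        host = list(device_out)          # readback (card -> host)
        transfers += 1
    return host, transfers


def run_resident(initial, frames):
    """Upload once, keep the state on the 'card', read back each frame."""
    transfers = 0
    current = list(initial)              # single upload
    transfers += 1
    scratch = [0.0] * len(current)
    host = list(initial)
    for _ in range(frames):
        step(current, scratch)
        current, scratch = scratch, current  # swap roles, no copy
        host = list(current)             # readback only
        transfers += 1
    return host, transfers


a, n_roundtrip = run_roundtrip([1.0, 2.0, 3.0], 10)
b, n_resident = run_resident([1.0, 2.0, 3.0], 10)
print(a == b, n_roundtrip, n_resident)   # True 20 11
```

Both versions produce the same result, but the resident one does roughly half as many copies. Whether this applies depends on whether the host actually modifies the data between frames.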