Terrible prefix sum (scan) performance on HD5850 cs_5_0

I've been benchmarking my own prefix sum and radix sort shaders, as part of a big D3D11 compute application. Unfortunately the performance has been terrible, way below naive single-threaded CPU performance for prefix sum. I've already posted the problem here:

After posting that, I updated from Catalyst 9.11 to 9.12, which improved throughput by about 75%, but it is still two orders of magnitude lower than what I'm hoping for. I'm sure this application is not memory-bandwidth bound; it is probably some issue with groupshared memory. I'm trying to rewrite the shader as a cs_4_0 shader to run on my notebook's NVIDIA chip for comparison, but it's very hard to express the algorithm when groupshared writes can only be indexed by SV_GroupIndex. That makes the downsweep pass a real trick to write.
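To make the downsweep problem concrete, here is a minimal serial C++ sketch of the usual work-efficient (up-sweep then down-sweep) exclusive scan. It is not my actual shader: the function name `exclusive_scan_blelloch` and the power-of-two size restriction are assumptions made for illustration, and each inner-loop iteration stands in for one GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial sketch of a work-efficient exclusive scan (up-sweep, then down-sweep).
// Assumption: the input length is a non-zero power of two.
// Each inner-loop iteration plays the role of one GPU thread.
std::vector<unsigned> exclusive_scan_blelloch(std::vector<unsigned> data) {
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Up-sweep (reduce): each "thread" writes one element, the right child.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }

    // Down-sweep: clear the root, then push partial sums back down the tree.
    data[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const unsigned left = data[i - stride];
            data[i - stride] = data[i];  // write 1: left child
            data[i] += left;             // write 2: right child
        }
    }
    return data;
}
```

Note that every down-sweep step writes two different elements per "thread" (the left and right child), and neither index is the thread's own flat index. That is the part that does not map cleanly onto a model where each thread may only write the groupshared slot selected by SV_GroupIndex.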