Are you doing scattered writes? If so, figure out whether you can restructure them as sequential writes. If you have to, you can write sequentially now and put things right later with a pass that does scattered reads and sequential, in-order writes.
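
To make the scatter-versus-gather idea concrete, here's a rough CPU-side sketch in Python (not shader code). It assumes the destination mapping is a permutation, so every output slot gets written exactly once, and that you can build or already have the inverse map. The names `scatter_write`, `invert_permutation`, and `gather_write` are made up for illustration only.

```python
def scatter_write(values, dst):
    """Scattered writes: element i lands at out[dst[i]] (reads are sequential)."""
    out = [None] * len(values)
    for i, v in enumerate(values):
        out[dst[i]] = v
    return out


def invert_permutation(dst):
    """Build src such that src[dst[i]] == i. Assumes dst is a permutation."""
    src = [0] * len(dst)
    for i, d in enumerate(dst):
        src[d] = i
    return src


def gather_write(values, src):
    """Sequential writes: slot j is filled, in order, from values[src[j]] (reads are scattered)."""
    return [values[src[j]] for j in range(len(src))]


values = [10, 20, 30, 40]
dst = [2, 0, 3, 1]              # hypothetical mapping: where each element should end up
src = invert_permutation(dst)   # where each output slot should read from

# Both approaches produce the same result; only the access pattern differs.
assert scatter_write(values, dst) == gather_write(values, src)
```

On a GPU the equivalent would be having thread j write only to output slot j and fetch from wherever it needs to. Whether that's actually faster for your case is something you'd have to measure.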

You might also consider whether you need a compute shader at all. Perhaps you can use transform feedback or standard rasterization to write out your data stream(s). I'm not saying the compute shader is a perf problem; I don't have much experience with them. I'm just suggesting something else you might try as a comparison data point.