Either we bounce one cacheline per cpu per tick, yielding n^2 bounces, or we just bounce a single..

Also, using per-cpu allocations for the thread-groups complicates the per-cpu allocator: it's currently aimed to be a fixed-size allocator, and the only possible extension to that would be vmap based, which is seriously constrained on 32-bit archs.