I still don't understand why. Both the DMA controller and the CPU can saturate the memory bus, but the CPU can do more than simple copying. How resources (above all memory access) are distributed between the main CPU and the DMA controller or secondary CPU is a matter of implementation; it does not depend on whether the unit attached to the memory interface is a DMA controller or a CPU. If the DMA controller or DMA HT is allowed to be "nasty", it can starve the main CPU. In the end there is a fixed memory bandwidth that has to be shared between the CPU that AOS runs on and the second unit, whether that is a DMA controller or a CPU.
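
To make the starvation point concrete, here is a minimal toy model, not a description of any real chip. It assumes one bus that moves one transfer per cycle and two masters that both request it on every cycle. The policy names `dma_priority` and `round_robin` are made up for illustration.

```python
def simulate(policy, cycles=1000):
    """Toy bus arbiter: one transfer per cycle, two always-requesting masters."""
    granted = {"cpu": 0, "dma": 0}
    last = "dma"  # master that was granted the bus most recently
    for _ in range(cycles):
        if policy == "dma_priority":
            # "Nasty" second unit: it wins every cycle it requests.
            winner = "dma"
        elif policy == "round_robin":
            # Fair arbiter: alternate between the two requesters.
            winner = "cpu" if last == "dma" else "dma"
        else:
            raise ValueError(f"unknown policy: {policy}")
        granted[winner] += 1
        last = winner
    return granted


starved = simulate("dma_priority")
fair = simulate("round_robin")
print(starved)  # {'cpu': 0, 'dma': 1000}
print(fair)     # {'cpu': 500, 'dma': 500}
```

The total is 1000 transfers in both runs; only the split changes. Nothing in the model cares whether the second master is a DMA controller or another CPU.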

Well... look at the RPi3, which is a 4-core ARM, compared to the single-core RPi1: the memory subsystem architecture is almost the same. So yes, 4 cores sharing the bus with the GPU and a few other blocks (DMA included) will unavoidably saturate it. This can be partially reduced or worked around by aggressive cache design (large caches, complex architecture), but at some point you hit the bottleneck: insufficient bus throughput will affect performance.
You can always use faster RAM, increase the bus width, raise the clocks or use tricks (interleaving etc.), but all of this comes at a cost and never provides a 100% satisfactory solution.
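
As a rough back-of-the-envelope sketch of that trade-off: theoretical peak bandwidth is the bus width in bytes times the transfer rate, and every busy master takes a slice of it. The numbers below (32-bit bus, 400 MT/s, six masters) are hypothetical and are not actual RPi figures. The even split is also a simplifying assumption that ignores caches and arbitration overhead.

```python
def peak_bandwidth_mb_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak in MB/s: bytes per transfer times transfers per second."""
    return bus_width_bits // 8 * mega_transfers_per_s


def fair_share_mb_s(total_mb_s, busy_masters):
    """Bandwidth per master if the bus is split evenly (idealised)."""
    return total_mb_s / busy_masters


base = peak_bandwidth_mb_s(32, 400)  # hypothetical 32-bit bus at 400 MT/s
wide = peak_bandwidth_mb_s(64, 400)  # same clock, bus width doubled

for masters in (1, 4, 6):  # e.g. 1 core, 4 cores, 4 cores + GPU + DMA
    share = fair_share_mb_s(base, masters)
    print(f"{masters} busy master(s): {share:.0f} MB/s each of {base} MB/s")

print(f"doubling the bus width: {base} -> {wide} MB/s")
```

Doubling the width doubles the peak, but going from one busy master to six cuts each share to a sixth, so the wider bus only buys back part of the loss.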