david@l8s.co.uk said:
> However adding an extra memory read:
> 10: ldrb r4, [r1]
> ldrb r4, [r0], #1
> strb r4, [r1], #1
> subs r2, r2, #1
> bne 10b
> puts the destination into the data cache and speeds it up
> to 470
But that will slow things down on a machine with write-through caches,
since we will now fetch a line into the cache (taking many cycles) that
will never be used.
I'm not sure that there is a generic solution here. Even the ARMv5 PLD
instruction wouldn't help much.
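For anyone who prefers to read it in C, the extra load in the quoted loop
amounts to something like the sketch below. The name copy_touch_dest is
made up for illustration, and this is only a rough rendering of the idea,
not a tuned routine. Whether the dummy read helps or hurts depends on the
cache's write policy, as discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-at-a-time copy that loads each destination byte before storing
 * to it, mirroring the extra "ldrb r4, [r1]" in the quoted loop.
 * Hypothetical helper, for illustration only. */
static void copy_touch_dest(unsigned char *dst, const unsigned char *src,
                            size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    while (n--) {
        /* Dummy read of the destination. The volatile cast stops the
         * compiler from discarding a load whose value is unused. */
        (void)*(volatile unsigned char *)dst;
        *dst++ = *src++;
    }
}
```

The copied data is the same with or without the dummy read; only the
cache traffic differs.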
david@l8s.co.uk said:
> For aligned copies using ldmia/stmia loops forcing a read doesn't help
> large copies. However short copies (ie ones where the source and
> destination stay in the cache) speed up by a factor of 4 if the
> destination is in the data cache. (the source was always cached during
> this test.)
What happens if you 'prefetch' from address + 15? Don't forget that
unless every block you write is aligned to a cache line, the prefetch
for the store will only fetch part of the area you are writing to. If
you 'prefetch' from a higher address, then over a long copy you will
ensure that most of the write data goes into the cache.
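To put rough numbers on that, here is a toy model in C. Everything in it
is an assumption made for illustration: a 32-byte cache line, 16 bytes
stored per loop iteration, a dummy read that always allocates its line,
stores that never allocate, and no evictions. The function covered_bytes
is hypothetical; it counts how many stored bytes land in a line that an
earlier dummy read has already pulled in.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 32u  /* assumed cache line length in bytes */
#define CHUNK     16u  /* assumed bytes stored per loop iteration */
#define MAX_LINES 64u  /* size of the toy model's line table */

/* Count the destination bytes that fall in an already-touched line.
 * dst_off   : offset of the destination from a line boundary
 * len       : number of bytes copied
 * touch_off : offset of the dummy read from the start of each chunk */
static size_t covered_bytes(size_t dst_off, size_t len, size_t touch_off)
{
    unsigned char touched[MAX_LINES] = {0};
    size_t covered = 0;

    /* Keep every line index inside the table. */
    assert(dst_off + len + touch_off < (size_t)LINE_SIZE * MAX_LINES);

    for (size_t pos = 0; pos < len; pos += CHUNK) {
        size_t start = dst_off + pos;

        /* The dummy read brings in exactly one line. */
        touched[(start + touch_off) / LINE_SIZE] = 1;

        /* Stores only count if their line is already present. */
        for (size_t i = 0; i < CHUNK && pos + i < len; i++) {
            if (touched[(start + i) / LINE_SIZE])
                covered++;
        }
    }
    return covered;
}
```

In this model, a 256-byte copy to a destination 24 bytes past a line
boundary gets 192 bytes into touched lines when reading from the chunk
start (covered_bytes(24, 256, 0)), and 248 when reading from offset 15
(covered_bytes(24, 256, 15)). A real cache will behave differently in
detail, but the direction of the effect is the point.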
R.