Hmm... I would argue calling BlockMove is in itself an expression of simplicity. When you implement your memory allocation just once, you can afford to invest time and effort optimizing that single implementation. See also this article about the danger of naive implementations, which are much more likely to arise in one-off code than in something relied upon by many.

The naive implementation article is a good argument against simplicity.

I disagree. Simplicity only matters for things that actually work. If I can remove a line of code which then prevents my code from compiling, it's not simpler, it's broken.

So to the extent that correctness is necessary, the naive version is broken, and therefore its apparent simplicity doesn't matter.

However, it's much simpler to have one authoritative version of the function, well tested and analyzed and therefore both correct and singular, than to have multiple equivalent implementations written with varying levels of naiveté and correctness, which violates the DRY principle.

Also, memcpy has simpler requirements than memmove (which is what BlockMove is), so it's a case of comparing apples to... erm... something non-appley.
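
To make the difference in requirements concrete, here is a rough sketch in C. memcpy is allowed to assume the two regions don't overlap, while memmove has to give the right answer even when they do. The names copy_forward and move_bytes are made up for illustration; this is not BlockMove's actual code, and comparing addresses as integers assumes a flat address space.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memcpy-style copy: correct only if the regions do not overlap. */
static void *copy_forward(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) {
        *d++ = *s++;
    }
    return dst;
}

/* Hypothetical memmove-style copy: picks a direction so that overlapping
 * regions are handled correctly. Comparing the addresses as integers is an
 * assumption that holds on flat-address-space machines. */
static void *move_bytes(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d <= (uintptr_t)s) {
        /* Destination starts at or before the source: copying forward never
         * overwrites a source byte before it has been read. */
        return copy_forward(dst, src, n);
    }

    /* Destination starts after the source: copy backward from the end. */
    d += n;
    s += n;
    while (n--) {
        *--d = *--s;
    }
    return dst;
}
```

The extra work in move_bytes is one comparison and a second loop, which is the kind of requirement the memcpy-only version gets to skip.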

The alignment issue (source operand not being 32-bit aligned) is irrelevant to the total running time.

My plan was that if I always guaranteed that the source and destination addresses were properly aligned, then I could avoid all the special-case address checks and have a simple loop reading and writing 32 bits at a time.

In the case where he guarantees all addresses are aligned, he saves a single conditional jump (and a few dozen code bytes). The inner loop still copies 32 bits at a time either way, and the preamble is at most a handful of instructions when the source isn't aligned. This is just not worth it.
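
For comparison, a minimal sketch of the two approaches in C. The function names copy_words and copy_bytes are hypothetical, and neither is the code from the article. Both assume non-overlapping regions. The fixed-size memcpy calls in the inner loop are just the portable way to spell a 32-bit load and store; compilers typically turn each one into a single instruction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The "always aligned" plan: the caller guarantees both pointers are
 * 32-bit aligned, so the whole routine is one word-sized loop. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords) {
    while (nwords--) {
        *dst++ = *src++;
    }
}

/* The general version: a short byte preamble, then the same kind of
 * word-sized loop, then a short byte tail. */
static void copy_bytes(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Preamble: at most 3 byte copies until the source is 32-bit aligned. */
    while (n > 0 && ((uintptr_t)s & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Inner loop: 32 bits per iteration, same as in copy_words. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, 4); /* aligned 32-bit load */
        memcpy(d, &w, 4); /* 32-bit store, valid even if d is unaligned */
        s += 4;
        d += 4;
        n -= 4;
    }

    /* Tail: at most 3 leftover bytes. */
    while (n--) {
        *d++ = *s++;
    }
}
```

For a copy of any real size, both versions spend almost all of their time in the word loop. The preamble and tail add a bounded, small number of byte copies no matter how large n is.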