On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst <m.b.lankhorst@gmail.com> wrote:
>
> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> and I finally figured out why. I also extended the test to an optimized avx memcpy,
> but I think the kernel memcpy will always win in the aligned case.

"rep movs" is generally optimized in microcode on most modern Intel CPU's for some easyish cases, and it will outperform just about anything.

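For anyone who wants to try this from userland, a minimal sketch of a "rep movs"-based copy might look like the following. The name rep_movs_copy is made up for illustration, this is not the kernel's memcpy, and it assumes GCC or Clang extended asm on x86-64 (anything else falls back to plain memcpy()).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel's implementation: copy n bytes
 * with a single "rep movsb" on x86-64, and fall back to memcpy() on
 * any other target or compiler.
 */
void *rep_movs_copy(void *dst, const void *src, size_t n)
{
	void *ret = dst;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	/* rdi = destination, rsi = source, rcx = remaining byte count */
	__asm__ volatile("rep movsb"
			 : "+D"(dst), "+S"(src), "+c"(n)
			 :
			 : "memory");
#else
	memcpy(dst, src, n);
#endif
	return ret;
}
```

Whether this beats a hand-written SSE/AVX loop depends on the microarchitecture and on the alignment of the two pointers, so it needs to be measured rather than assumed.
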
Atom is a notable exception, but if you expect performance on any general loads from Atom, you need to get your head examined. Atom is a disaster for anything but tuned loops.

The "easyish cases" depend on microarchitecture. They are improving, so long-term "rep movs" is the best way regardless, but for most current ones it's something like "source aligned to 8 bytes *and* source and destination are equal "mod 64"".
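
Spelled out as code, that rule of thumb could be checked with something like the sketch below. The function name is hypothetical, and the condition only encodes the approximate rule quoted above, not anything a CPU manual guarantees.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: returns true when a copy matches the rough
 * "easyish case" described above. This is a rule of thumb only and
 * varies by microarchitecture.
 */
bool rep_movs_easy_case(const void *dst, const void *src)
{
	uintptr_t d = (uintptr_t)dst;
	uintptr_t s = (uintptr_t)src;

	/* source must be aligned to 8 bytes */
	if (s % 8 != 0)
		return false;

	/* source and destination must be equal "mod 64" */
	return (d % 64) == (s % 64);
}
```

Two page-aligned buffers pass trivially, which is why the page copy falls into the fast case.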

And that's true in a lot of common situations. It's true for the page copy, for example, and it's often true for big user "read()/write()" calls (but "often" may not be "often enough" - high-performance userland should strive to align read/write buffers to 64 bytes, for example).
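
On the userland side, getting a 64-byte-aligned buffer is cheap. A minimal sketch, assuming a C11 libc that provides aligned_alloc() (the helper name and the IO_ALIGN constant are invented for this example):

```c
#include <stdint.h>
#include <stdlib.h>

#define IO_ALIGN 64	/* the alignment suggested above */

/*
 * Hypothetical helper: allocate a buffer of at least "size" bytes
 * aligned to IO_ALIGN. Returns NULL on failure; release with free().
 */
void *alloc_io_buffer(size_t size)
{
	size_t rounded;

	if (size == 0)
		size = 1;
	if (size > SIZE_MAX - (IO_ALIGN - 1))
		return NULL;	/* rounding up would overflow */

	/* C11 wants the size to be a multiple of the alignment */
	rounded = (size + IO_ALIGN - 1) / IO_ALIGN * IO_ALIGN;
	return aligned_alloc(IO_ALIGN, rounded);
}
```

The returned pointer can be handed straight to read() or write(). That only takes care of the user side of the copy; where the kernel's end of the data sits is not something userland controls.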

Many other cases of "memcpy()" are the fairly small, constant-sized ones, where the optimal strategy tends to be "move words by hand".
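
To illustrate what "move words by hand" can mean for a small fixed size, here is a sketch with a made-up 16-byte struct. The type and function names are hypothetical. The untyped variant relies on the compiler turning constant-size memcpy() calls into plain loads and stores, which GCC and Clang typically do with optimization enabled, but that is an assumption about the compiler, not a guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte record, used only for illustration. */
struct pair {
	uint64_t lo;
	uint64_t hi;
};

/* Copy a small fixed-size object as two 64-bit words. */
void copy_pair(struct pair *dst, const struct pair *src)
{
	dst->lo = src->lo;
	dst->hi = src->hi;
}

/*
 * Same idea for untyped, possibly unaligned 16-byte buffers: load two
 * words, then store two words. The constant-size memcpy() calls avoid
 * alignment and aliasing problems.
 */
void copy16(void *dst, const void *src)
{
	uint64_t w0, w1;

	memcpy(&w0, src, sizeof(w0));
	memcpy(&w1, (const unsigned char *)src + 8, sizeof(w1));
	memcpy(dst, &w0, sizeof(w0));
	memcpy((unsigned char *)dst + 8, &w1, sizeof(w1));
}
```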