Recall

Before we move on, let's recall some existing knowledge. Nowadays, at home and even in industry, we mostly use 32-bit processors. General-purpose registers such as eax, ebx, and so on are 32 bits wide, and sizeof(int) = 4 (bytes). But not all registers are 32 bits; some registers are longer. About a decade ago, Intel introduced the MMX extension, which added eight 64-bit registers, mm0 through mm7. After that, Intel introduced the SSE extension, which added another eight registers, xmm0 through xmm7, each 128 bits long. If you want to know more details, please go to my Links section and look for Intel.

Requirement

Ask yourself first what machine you are using. It should be an Intel P3 or newer. You must bear in mind that this optimization method is machine dependent, which means that if your hardware does not support it, you won't be able to see the difference.

Code

The sample I created is purposely simple and runs in console mode. Please don't just cut and paste; I would rather the reader understand it and try it themselves. Here's how the sample starts..

The demo code will let you see the difference between two functions that serve the same purpose. From here on I won't explain much; you will be on your own, so please read the comments within the code. I'm sure you will be able to catch up. =)

Wait! Get your breakpoints ready first and sit tight. When you debug, please step through both functions; you will notice the difference.

"DataTransferTypical" copies one int per loop iteration (sizeof(int) = 4 bytes), whereas "DataTransferOptimised" copies four ints per loop iteration (4 * sizeof(int) = 16 bytes).
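For readers without the demo project at hand, here is a minimal sketch of the two routines. It is my own reconstruction, not the article's actual code: the demo uses MSVC++ inline assembly, while this sketch uses the equivalent SSE2 intrinsics (movdqa corresponds to _mm_load_si128/_mm_store_si128), and it assumes 16-byte-aligned buffers whose length is a multiple of four ints.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Plain copy: one int per iteration, like DataTransferTypical.
void DataTransferTypical(int* piDst, const int* piSrc, int iCount)
{
    for (int i = 0; i < iCount; ++i)
        piDst[i] = piSrc[i];
}

// SSE copy: four ints (16 bytes) per iteration through an xmm register,
// like DataTransferOptimised. Both pointers must be 16-byte aligned and
// iCount a multiple of 4, matching the simple demo's assumptions.
void DataTransferOptimised(int* piDst, const int* piSrc, int iCount)
{
    for (int i = 0; i < iCount; i += 4)
    {
        __m128i xmm = _mm_load_si128(reinterpret_cast<const __m128i*>(piSrc + i)); // movdqa load
        _mm_store_si128(reinterpret_cast<__m128i*>(piDst + i), xmm);               // movdqa store
    }
}
```

In real code you would allocate the buffers with _aligned_malloc (or declare them alignas(16)) to satisfy movdqa's alignment requirement; a misaligned address faults.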

Setting up your Watch window: in the Watch window, watch "piDst,101". You will then see how the destination buffer changes...

P.S.: You need to install the Processor Pack for MSVC++ to compile this code. See the Links section.

Finally

This is my first article on The Code Project, so please bear with me if something is not right. I hope the demo I uploaded here is simple enough for beginners. Nothing fancy. Learning is fun, right? =)

Links

History

I will only update this article when people request it. The sample code will not be maintained.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

He started programming in dBase, Pascal, C, and then assembly. He actively works on image-processing algorithms and customised vision applications. His major is actually in control engineering, motion control, machine vision, and statistics.
He likes to work on projects that require careful analytical methods.
He can be reached at albertoycc@hotmail.com.

Comments and Discussions

I made some changes to the DataTransferOptimised function: I deleted some instructions and reordered the rest to partially avoid read-after-write (RAW) hazards. Here are the timings for some vector sizes; I have included my latest DataTransferOptimised function below. Sorry about my bad English.

First of all, did you install the Processor Pack?
Secondly, if you turn on optimization in a modern compiler, it will optimise the typical transfer code too, because the code is too simple not to be optimised.
If you modify the code slightly, you may see the difference.
Another way is to turn off optimisation in your Release build options;
then you will see the difference too.

The article fails to show an absolute difference between the typical and optimised versions because of a few factors. One of them is that a modern compiler does the job for you, since the sample code is too simple. We would have to add some complication to the code, for example some encryption or a formula, to prevent it from being optimised automatically by the compiler.

Using the movntdq instruction instead of movdqa for writing back to main memory gives a significant speed improvement.
movntdq writes directly back to main memory, bypassing the cache.
The code below with movntdq moves data at about 91% of the theoretical maximum on my 2.4 GHz P4 Celeron with DDR333 RAM.

Note that using movntdq is slower than movdqa when the entire array will fit in the cache but about 40% faster when the array is much larger than the cache.

The code when using movdqa runs at about the same speed as the memcpy function.
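The streaming-store variant described in this comment can be sketched with SSE2 intrinsics (movntdq corresponds to _mm_stream_si128). The function name DataTransferStreaming is mine, not from the posted code, and the same alignment and count assumptions apply as in the article's demo.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; _mm_stream_si128 emits movntdq

// Non-temporal copy: streaming stores bypass the cache, which pays off
// when the destination is much larger than the cache (and hurts when it
// fits). Pointers must be 16-byte aligned, iCount a multiple of 4.
void DataTransferStreaming(int* piDst, const int* piSrc, int iCount)
{
    for (int i = 0; i < iCount; i += 4)
    {
        __m128i xmm = _mm_load_si128(reinterpret_cast<const __m128i*>(piSrc + i)); // movdqa
        _mm_stream_si128(reinterpret_cast<__m128i*>(piDst + i), xmm);              // movntdq
    }
    _mm_sfence(); // streaming stores are weakly ordered; fence before the data is read elsewhere
}
```

In production code you would pick the routine at run time: plain movdqa (or memcpy) for buffers small enough to stay in cache, the streaming version for large ones.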

Yes, unicon..
Practically, if you wish to apply this to production code, you may need to rewrite those functions to handle the unaligned header and trailer bytes separately.. =) Then the whole body can be transferred with the 128-bit registers...

Oh, and "add esi,eax" should be faster than "add esi,16".
But how many clock cycles are needed for register-to-register versus memory-to-register operations? Maybe the same?
Anyway, can we make this run faster than memcpy? Please refer to the feedback below. Why does the SIMD version perform so poorly? Because of its latency? As was said, movdqa takes 6 clock cycles to complete. Zzz..
Nice try, my friend... =)

From a register it's faster, because the opcode is much smaller. For add edi,eax, only one or two bytes are fetched (I have to look up whether it's one or two). The add edi,16 form is longer, since the immediate value has to be encoded in the instruction as well.

I didn't test whether the SIMD version is slower.. I will check it. Maybe it has a big opcode?

If I recall correctly, in debug builds memcpy is implemented as a loop (though a bit more optimized than the one in the article). In release builds with intrinsic functions enabled (the /Oi switch), the memcpy call is replaced by a few assembly instructions using 'rep movsd/b' to copy data in 32-bit blocks (and then any trailing bytes). Note that the actual copying is done internally by the CPU within the 'rep movsd/b' instruction, and can thus be expected to be about as fast as you get on the regular instruction set.

That having been said, I have not benchmarked it against MMX or SSE. Since the real strength of MMX/SSE is the SIMD features, faster memory access would merely be a secondary bonus. Has anyone benchmarked against intrinsic memcpy, and learned which method is faster for straight data transfer?

At least we know Microsoft's engineers are not overpaid. They did their job!
I say that because recently I read an article claiming the Intel compiler optimises better than MSVC++. Is that because they are the chip maker?

By the way, could you get the source code of memcpy in version 7.1?
And I did check: the movdqa instruction takes 6 cycles to complete. A bit of latency there.

Anyway, thanks for trying!
Learning is fun.. Let me see whether I can come up with a better example to demonstrate the SSE stuff.. There are lots more new instructions introduced by Intel, most of them dealing with the longer registers like xmm.

What machine are you using? As stated in the article, it should be an Intel P3 or newer.
When you debug, please try to step through both functions; you will notice the difference.

The typical version copies one int per loop (sizeof(int) = 4 bytes),
whereas the optimised one copies four ints per loop (4 * sizeof(int) = 16 bytes).
In your Watch window, watch "piDst,101"; then you will see how it is changing...

You must bear in mind that this optimisation method is machine dependent, which means that if your hardware does not support it, you won't be able to see the difference.