Does anyone know of a routine that is faster than memset, or even better for that matter?
ZeroMemory, as far as I know, uses the same thing.
Not that I'd be calling this a million times, but it's always good to know you have the fastest thing out there.
Dun mes wit me!


Guest Anonymous Poster

Yes.. first, you must decide whether you want the area you are filling to be loaded into the memory caches. Typically, when you are filling memory you do NOT want to also fill the caches, but what you are doing there most certainly does.

For when you can tolerate (or want) the cache pollution, and the destination is 32-bit aligned and 512 bytes or less:

    sub eax, eax
    mov edi, StartAddress
    mov ecx, BytesToFill/4
    rep stosd

In all other cases, use MMX instructions. The AMD optimisation guide (available online) has a nice memory-fill routine for MMX that is also near-optimal on Intel machines.
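For reference, the dword-fill idea behind "rep stosd" can be sketched in portable C (the function name fill_dwords is mine, not from any library). A modern compiler will usually turn this loop into rep stosd or vectorized stores; for the cache-avoiding case, the AMD guide's MMX routine uses non-temporal stores (movntq) instead, which plain C like this cannot express directly:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill `count` 32-bit words starting at `dst` with `value`.
   A portable sketch of "sub eax, eax / rep stosd" above;
   `dst` is assumed to be 4-byte aligned. */
static void fill_dwords(uint32_t *dst, uint32_t value, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = value;       /* one aligned 32-bit store per word */
}
```

Note this fills whole dwords, so like the assembly above it only handles byte counts that are a multiple of four.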


When we first start thinking about it, unrolling loops so we don't have the extra increment, compare, and jump every iteration sounds like it'd be faster (fewer ops have to be done), but in most cases this doesn't work out.

The CPU has two caches: a data cache and an instruction (program) cache. As our program runs, the CPU loads chunks of it into the instruction cache and then executes the instructions from there. We get a speed increase from this because reading from cache is much faster than reading from RAM.

When we unroll a loop we may lower the number of ops the CPU has to perform, but we also greatly increase the size of the program code. With more program code we can end up with a much higher number of cache misses, which happen when the code needed isn't in the cache.

Every time there is a cache miss, the CPU has to fetch the needed code from RAM, which we know is quite slow, and while it waits for the cache line to be filled it has no choice but to sit and do nothing. To make matters worse, the increase in program size means there is more code that must be pulled into the cache over the course of the program.

We know that we are going to have cache misses. A couple dozen cache misses per frame isn't a big deal, but it isn't hard to imagine how bad it would be if we caused a cache miss for every pixel we plot, every polygon we render, or anything else we do thousands of times per frame. Typically games are made of a couple of small pieces of code that are run thousands of times in a row each frame, and with a little care we can get them to fit in the cache, giving us some great performance.

Now back to unrolling loops... The only time we tend to gain anything from unrolling a loop is when we have a small loop (only a few lines of code and few iterations) that gets run a large number of times (such as a loop that does some simple op for each vertex in a triangle and gets run for every triangle in the scene). Large loops, or loops with a large number of iterations (such as copying memory byte by byte), almost always incur more penalties from cache misses and code size than they could ever hope to gain by unrolling.
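To make the trade-off concrete, here is what a 4x manual unroll looks like (a sketch; the function name is mine). The unrolled body issues four stores per compare-and-branch, at the cost of roughly four times the code size for the loop body, which is exactly the instruction-cache pressure described above:

```c
#include <stddef.h>
#include <stdint.h>

/* 4x-unrolled byte fill: four stores per loop test, but larger code. */
static void fill_bytes_unrolled(uint8_t *dst, uint8_t value, size_t count)
{
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {   /* unrolled body */
        dst[i]     = value;
        dst[i + 1] = value;
        dst[i + 2] = value;
        dst[i + 3] = value;
    }
    for (; i < count; i++)             /* leftover tail, 0-3 bytes */
        dst[i] = value;
}
```

Whether this beats the plain one-store-per-iteration loop depends entirely on whether the extra code still fits in the instruction cache alongside everything else running that frame.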

And a bit off topic... Optimizing by unrolling loops is one of the last things you want to do. If you find you are spending a lot of time in a particular loop, then switching to a better algorithm, cleaning up the code inside the loop, and minimizing the number of times the loop is called will have a much greater effect.