Loop Unrolling Optimization Question

I have two versions of an iterative procedure, one that has rolled-up loops and one that is unrolled. The computation involves three nested loops. By my calculations, the rolled-up version requires 90,460 cycles, whereas the unrolled loop version has a calculated latency of 56,400 cycles.

However, in real execution, the rolled version is running 2-2.5 times faster. My question is why is the rolled version, with the higher latency, executing faster? Am I getting 'beaten' by instruction caching?

For example, the rolled version has much short code, and each time through the loops, the instructions are the same....only register contents change. For the unrolled version, the actual memory references are changing, so the much longer code cannot be cached.

By the way, this is running on a 1200 MHz Athlon (Thunderbird), with ABIT KT7E mainboard.

Any comments regarding this optimization issue are welcome....if it is a caching issue, how much 'unrolling' can be done before caching beats the unrolling?

Comments

I am guessing you are running under windows, if so, it can screw everything up. Especially when and how you are accessing memory. Take a look at the number of page faults, etc... I guess what I am saying is, you need to really analyze what your code is doing and what you want to accomplish. If there is branches, then branch prediction prefetch can take the wrong path and can cause a stall...who knows? Show the code...

: I have two versions of an iterative procedure, one that has rolled-up loops and one that is unrolled. The computation involves three nested loops. By my calculations, the rolled-up version requires 90,460 cycles, whereas the unrolled loop version has a calculated latency of 56,400 cycles.: : However, in real execution, the rolled version is running 2-2.5 times faster. My question is why is the rolled version, with the higher latency, executing faster? Am I getting 'beaten' by instruction caching?: : For example, the rolled version has much short code, and each time through the loops, the instructions are the same....only register contents change. For the unrolled version, the actual memory references are changing, so the much longer code cannot be cached.: : By the way, this is running on a 1200 MHz Athlon (Thunderbird), with ABIT KT7E mainboard.: : Any comments regarding this optimization issue are welcome....if it is a caching issue, how much 'unrolling' can be done before caching beats the unrolling?: : Thanks,: : John:

If you do a read access at the beginning of each cacheline that loads the content of a part of the next cacheline into an unused register, the next line will probably be loaded to the L2 cache without losing much speed, because the memory access has no dependencies.

I don't know if this will bring much speed, or any at all. But if it works, please write back...

: If you do a read access at the beginning of each cacheline that loads the content of a part of the next cacheline into an unused register, the next line will probably be loaded to the L2 cache without losing much speed, because the memory access has no dependencies.:

Actually, I was hoping to do both runs without any caching....just to be sure I am comparing apples and apples. One of the routines (it is a 20x20 matrix multiplication) is very amenable to caching benefits and the other is not....

[b][red]This message was edited by anqe at 2002-12-4 4:27:34[/red][/b][hr]If your compiler/assembler is generating poor machine code then unrolling may not be of any benefit.

visit AMD's website and download their code analyst software and the "AMD Athlon x86 Code Optimization Guide", some machine code instructions may need to be enterred manually to overrride the assembler generated instructiuons.

Like the manual says, avoid vector path instr. and arrange the code out of order to hide any memory latency as much as possible.

Ensure that any constant moves (that can be) are replaced with a clear and add imediate structure.

Don't always optimize based on execution latency, take into account the decoder and register dependancies. Also note that optimizing for your hardware is far more productive than optimizing for other peoples hardware

If you have a K7 don't worry about the Pentium 4, most K7 optmizations have no effect or a negative effect on the P4 (especially shift and (i)mul).