In his talk at Facebook a few days ago (slides, video), Andrei Alexandrescu talks about common intuitions that might prove us wrong. For me, one very interesting point came up on slide 7, where he states that the assumption "fewer instructions = faster code" is not true, and that more instructions will not necessarily mean slower code.

Here comes my problem: the audio quality of his talk (around 6:20 min) is not that good and I don't understand the explanation very well, but from what I gather, he is comparing retired instructions with the optimality of an algorithm on a performance level.

However, from my understanding this cannot be done, because these are two independent structural levels. The number of instructions (especially the actually retired instructions) is one very important measure and basically gives you an idea of the cost of achieving a goal. If we leave out the latency of individual instructions, we can generalize that fewer retired instructions = faster code. Now, of course there are cases where an algorithm that performs extra work inside a loop will yield better performance, because that extra work lets it break out of the loop earlier (think graph traversal). But wouldn't it be more useful to compare two algorithms at the complexity level, rather than saying this loop has more instructions and is better than the other? From my point of view, the better algorithm will retire fewer instructions in the end.
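To make the loop point concrete, here is a minimal sketch of the kind of trade-off I mean (the function names and data are made up for illustration): both functions count occurrences of a key in a sorted array; the second has a "bigger" loop body with an extra comparison per iteration, but it can exit early, so it often retires fewer instructions overall.

```cpp
#include <vector>

// Plain linear scan: one comparison per element, always visits everything.
int count_matches_full(const std::vector<int>& sorted, int key) {
    int count = 0;
    for (int v : sorted)
        if (v == key) ++count;
    return count;
}

// One extra comparison per iteration, but because the input is sorted the
// loop can stop as soon as it passes `key` -- often fewer retired
// instructions in total despite more instructions per iteration.
int count_matches_early(const std::vector<int>& sorted, int key) {
    int count = 0;
    for (int v : sorted) {
        if (v > key) break;     // extra work that enables the early exit
        if (v == key) ++count;
    }
    return count;
}
```

This is still a case where the "better" version wins precisely by retiring fewer instructions, which is my point.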

Can someone please help me understand where he was going with his example, and how there can be a case where (significantly) more retired instructions lead to better performance?

Branch prediction, for example. In the linked example, adding instructions (sorting the array before iterating over it) actually produces faster code, because branch prediction works MUCH better on the sorted array.
–
amit Dec 20 '12 at 13:22

2 Answers

The quality is indeed bad, but I think he is getting at the fact that CPUs are good at calculations, but suffer from bad performance on memory access (RAM is much slower than the CPU) and on branches (because the CPU works as a pipeline, and mispredicted branches cause the pipeline to be flushed).

Here are some cases where more instructions are faster:

Branch prediction - even if we need to execute more instructions, if they lead to better branch prediction, the CPU's pipeline will be full more of the time and fewer ops will be "thrown out" of it, which ultimately leads to better performance. This thread, for example, shows how doing the same thing, but sorting the array first, improves performance.
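The linked thread's hot loop boils down to something like this (paraphrased, not the exact code from the thread): the branch on `v >= 128` is essentially random on unsorted data, but flips exactly once on sorted data, so the predictor becomes nearly perfect - the sort adds retired instructions, yet the loop runs faster.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Sums only the elements >= 128; the conditional inside the loop is the
// hot branch. On random byte-range data it is mispredicted roughly half
// the time; on sorted data it is taken in one contiguous run.
long long sum_above_threshold(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data)
        if (v >= 128)
            sum += v;
    return sum;
}
```

Calling `std::sort(data.begin(), data.end())` before the loop retires strictly more instructions in total, yet on many x86 machines the sorted run is several times faster - the result, of course, is identical either way.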

CPU cache - if your code is cache optimized and follows the principle of locality, it is likely to be faster than code that isn't, even if the non-optimized code executes only half the number of instructions. This thread gives an example of a small cache optimization - the same number of instructions can result in much slower code if the access pattern is not cache friendly.
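A standard illustration of this locality point (my own sketch, not the code from the linked thread): both functions below perform exactly the same additions over a flat rows x cols matrix, but the row-major pass touches memory sequentially while the column-major pass strides by `cols` elements, which is much harsher on the cache.

```cpp
#include <cstddef>
#include <vector>

// Flat rows x cols matrix stored row-major. Both functions execute the
// same number of additions; only the memory access order differs.
long long sum_row_major(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];   // consecutive addresses: cache friendly
    return sum;
}

long long sum_col_major(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];   // stride of `cols` ints: cache hostile
    return sum;
}
```

For matrices that don't fit in cache, the row-major version is typically several times faster even though the instruction counts are identical.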

It also matters which instructions are executed. Some instructions are slower to perform than others - for example, integer division is typically much slower than integer addition.
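As a small sketch of this per-instruction cost (the helper names are made up): on most CPUs an integer divide has far higher latency than an add or a shift, which is why compilers rewrite division by a constant power of two as a shift, and the matching modulo as a mask.

```cpp
#include <cstdint>

// Same results, very different cost on most hardware: a hardware divide
// can take tens of cycles, while a shift or mask takes about one.
std::uint32_t div8_with_div(std::uint32_t x)   { return x / 8u; }
std::uint32_t div8_with_shift(std::uint32_t x) { return x >> 3; }
std::uint32_t mod8_with_mask(std::uint32_t x)  { return x & 7u; }
```

With optimizations on, a compiler will usually emit the shift/mask form for both, so the difference mainly matters when the divisor is not a compile-time constant.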

Note: all of the above are machine dependent, and how/if they actually change performance may vary from one architecture to another.

With regards to branch prediction: I'm not sure this is still valid. Yes, there were times when branch prediction had a serious impact (depending on the architecture; I'm on x86 here), but if I run the example from the thread on my Xeon 7560, the unsorted version is a few percent faster. So back to square one?
–
grundprinzip Dec 21 '12 at 9:15

With regard to your point 2: the example you mention does not change the number of retired instructions, but rather the ordering of the data to exploit the cache structure. Any other example?
–
grundprinzip Dec 21 '12 at 9:19

@grundprinzip: Branch prediction is constantly changing and, as I said, differs from one architecture to another. To find a killer case for your hardware, you might need an expert on that specific branch predictor. I'm afraid I am not one of them.
–
amit Dec 21 '12 at 9:20

@grundprinzip: Regarding #2: you can add a dummy instruction to each iteration of the cache-optimized loop; you will get more instructions and probably still faster code in the optimized version (again, architecture dependent and, of course, compiler-optimization dependent). There is also a variation of this trade-off with disks: code that does a lot of disk seeks, but sequential ones, might be faster than code with fewer random disk seeks, since a disk is not truly random access, and sequential reads are much faster than random reads.
–
amit Dec 21 '12 at 9:28