What's really happening here? AFAIK cvtsd2si is a big instruction and should make things slower, not faster, once added. I am thinking of taking up a new hobby (cooking, painting, ballet classes, piano, mountain climbing) if what I've been reading about performance all this time is proven wrong altogether.

No, aligning the spin won't help the first version either. You just have to add one more instruction before or after mulsd to make it faster. A simple nop should do, actually. What I don't understand is, WHY?

To be honest, you should first make sure your program runs correctly (including verifying with a test suite), then improve the overall (high-level) algorithm, reduce unnecessary slow disk access, and only then worry about micro-optimizations at the CPU level. (Oftentimes speed isn't important anyway; the code is only executed occasionally. There are other concerns, like readability and portability.)

As much as I hate to say it, sometimes hand-written assembly is slower than HLLs. GCC is fairly complicated and smart by now.

(For instance, I still haven't proven it, but I think one silly program of mine is tons slower because it's not disk-buffered behind the scenes automatically. So the C version is always tons faster, even with the oldest and dumbest compilers. But mine is smaller, of course.)

Quote:

No, aligning the spin won't help the first version either. You just have to add one more instruction before or after mulsd to make it faster. A simple nop should do, actually. What I don't understand is, WHY?

AFAIK, "pause" is an alias for "rep nop", meant for spin loops. Though I don't know if that will help much here.

Thanks for the great answers rugx, although I don't understand half of what you're saying because I'm a complete noob at optimization.

What puzzles me is that the second snippet runs a lot faster than the first despite the fact that:

1. there's an obvious dependency on XMM0
2. cvtsd2si is a 23-clocker

So just by looking at the code, most people would believe that the first one runs faster than the other (who can blame them?), while it's actually the opposite.

It breaks my heart to know that at some point, all else equal, adding more instructions can actually make your code run faster, contrary to my previous belief. Nobody told me this before -_- Even the honorable Mr Agner Fog never mentioned this, or never really put enough emphasis on this type of optimization technique. I guess he's getting too bald and old to keep up with the latest offerings. hehehe

Welcome to the world of optimisation. Optimisation is hard. Really hard. And sometimes counter-intuitive.

But your test above is not representative of real world usage. You can't substitute a contrived benchmark and expect it to relate to actual use in a real app.

My suggestion is to not bother with specialised "performance" loops and similar things; they will tell you nothing useful when you want to apply the results to the final product.

Also, testing on only one system tells you nothing about performance on other systems.

I guess all of the above comes across as negative, but really, this type of testing gives no valuable results in almost all situations.

Your statement above, "adding more instructions can actually make your code run faster", is an example of the internal complexity of contemporary CPUs. But it is not always true, of course; it will only be true in certain specific situations. Once again, contriving a specific example is not too hard, but it won't necessarily still be true when the code is transplanted into another situation. So in conclusion: always do your testing on the real-world application.

Again, pipelining roughly means that the CPU can start one instruction and finish it in the background while also starting newer ones.

Or maybe "loop" is flushing the instruction queue prematurely?

Quote:

2. cvtsd2si is a 23-clocker

Which is still blazingly fast on multi-ghz cpus.

Quote:

So just by looking at the code, most people would believe that the first one runs faster than the other (who can blame them?), while it's actually the opposite.

Supposedly the 8086 was more efficient with small and tight code, same as even the 386. But for the 486 (pipelining!), it was faster to use simpler RISC-y instructions (mov [di],al // inc di) instead of CISC instructions (stosb). Even the Pentium / 586 (or should I say 587?) was pipelined, allegedly working faster if you interleaved FMULs and FADDs (or whatever) to let them cooperate, not to mention the U (full) and V (weak) pipes, which were a big deal for compilers back then (e.g. GCC 2.8.x), requiring a recompile to really give significant speedups. The 486 itself was allegedly very sensitive to alignment.

(I can also guarantee you that ENTER and LEAVE are much much slower on this [Westmere] Core i5 than simple 8086 instructions. Supposedly they were faster on an actual 186 [clone?], but not anymore.)

Quote:

It breaks my heart to know that at some point, all else equal, adding more instructions can actually make your code run faster, contrary to my previous belief.

Modern cpus are very very sophisticated. They try insanely hard to figure things out on their own.

I found an old (1999) Dr. Dobbs article the other day on MMX. Here's just a small excerpt to prove my point:

Max I. Fomitchev wrote:

Both Intel's Pentium II and AMD's K6-2 are sophisticated CPUs with complex internal structures. Both CPU families employ superscalar pipelining, dynamic execution, and branch prediction -- and both can execute up to 6 µ-operations per cycle.

Of course, there is a difference in the internal architecture. The Pentium II, for instance, has three instruction decoders and the K6-2 has two.

See? Even back then it was complicated. Nowadays it's even MORE complicated! Ugh.

Quote:

Nobody told me this before -_- Even the honorable Mr Agner Fog never mentioned this, or never really put enough emphasis on this type of optimization technique. I guess he's getting too bald and old to keep up with the latest offerings. hehehe

There are a lot of reasons. It just takes further study. Don't stress too hard; most things aren't so extremely timing-sensitive.

Treat it as a hobby, learn as you go, and have fun. Just explore and investigate.

At worst, like I said, you learn new instructions or better ways to optimize for size, which (unless it horribly slows down everything, but that's rare) is always good in a pinch.

It's very implementation-specific, but most recent CPUs have pipelines around 20~30 stages deep and, depending on the implementation, any branch can incur a flush delay on the order of as many cycles. Hence, you can add an instruction with a similar latency without hurting the overall time.

In this particular case, the initial tight loop on the same instruction could be hiding further delays: unusually frequent instruction fetches cause additional stalls and saturate the fetch bus, which is likely wide enough for several instructions, so adding more instructions has no negative effect (possibly even a positive one).

It might be something about the fact that complex CISC-style instructions generally tend to have a worse impact on modern processors' pipelining than simpler RISC-style instructions.

On the other hand, smaller code might be better in terms of caches and the processors' internal instruction prefetch buffers. Add to this the fact that most pieces of code are not bottlenecks, and in most cases we won't see any visible difference.

Revolution/AsmGuru62
'loop' is very handy, but unfortunately it seems to be significantly slow on many, many platforms: http://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently
I am writing at this very moment on a typical, cheap, and relatively recent mass-market laptop (Lenovo, Silvermont N2830) with a similar experience: the 'loop' instruction on this computer is ~100% slower than the alternative instruction sequences. My performance test code is roughly compatible with Intel's, which, unlike Agner Fog's, also uses the newer RDTSCP/CPUID combination. But anyway, it doesn't really matter in this case: even when I just started and stopped the test loops manually by keypress while watching the kitchen clock, the general pattern was obvious. However, on my other computer (a Core i7-860, Lynnfield/Nehalem) the 'loop' instruction behaves similarly to the others — so no huge problem there ...

My conclusion so far is that if you want to write more optimized code, you should avoid using 'loop'. It never seems to be the fastest (at best no slower) on the computers that matter, and on many it is definitely slower ... and I might have overlooked it, but if I recall correctly, neither Intel ("64-ia-32-architectures-optimization-manual") nor Agner Fog in his opti-guides ever use 'loop'. People like Mark Larson explicitly advise not to use it: http://www.mark.masmcode.com/ (please look for: "3. complex instructions")

Welcome to the forum!
When we talk about performance, we should always consider the need for optimization. I will give you two examples:

#1:
Let's say I am displaying a dialog box with a list inside, filled with some items. Depending on the number of items, I will use LOOP or the more performant option:

Code:

@@:
    ...
    sub ecx, 1
    jnz @b

#2:
You are writing a text editor and you need to open hundreds of files in one shot. Those files must all be parsed line-by-line for some features, etc. In this case I will definitely be optimizing right away, without even measuring the code. I will align all labels and will not use any LOOPs, because the amount of processing is large.

In cases where you suspect that LOOP is slowing you down, you need to measure how much time your code takes and make a decision based on that, just like revolution pointed out.

Edited:
I must also add that your figure of ~100% is probably not correct. I once measured my code using LOOP vs SUB/JNZ and came up with a ~15% slowdown. In the dialog case I mentioned, this is the time taken for the dialog to be filled with items and shown to the user. Human perception will fail to notice a difference for a small number of items.

If you're considering optimizing your code, treat all complex, CISC-style instructions with extreme prejudice and replace them with their plain counterparts.

Instructions like LOOP, STOS, PUSH, POP, and even CALL and RET are generally slower on modern CPUs. Wait, did I mention CALL / RET?

Reasons:

- These high-level instructions are just wrappers around their plain siblings, from a time looong ago when memory was very scarce. So Intel decided to come up with shorter instruction encodings to save space at the expense of speed.

- These high-level instructions share the same circuitry with their plain RISC-style siblings. Unless AMD/Intel dedicate special circuitry to them, there's no reason to favor them over the others.

- Complex instructions spend more time in MICROCODE compared to their plain RISC siblings. Here's the path taken by the RET instruction. So while people talk about the instruction cache, they tend to forget about MICROCODE.


It's time for the good guys like us to enlighten our sinful brethren like AsmGuru and revolution.

That isn't the microcode, it is the logical description in the manual. At the microcode level it will be very different.

Anyhow, as mentioned above, test the code to make sure you get what you expect. There is a lot of old, and just plain wrong, advice on the Internet. Don't blindly trust it. CPUs are constantly changing in their internal designs, so you never know what is now better and what is now worse.
