Odd assembly problem

Ah, now this is a nice inner loop.
Code:
.L2:
movsd %xmm1, %xmm0
decl %eax
mulsd %xmm1, %xmm0
addsd %xmm2, ...

Exactly 100 cycles (again, not counting the "loop" part), a saving of 7 cycles per iteration - or about 6%. Not much, because the sqrt and divide aren't in parallel.

Numbers in brackets are for "new generation AMD" - the Quad Core processors. Now we are talking - 57 clocks for twice the processing [and the same speed for the original code].

If single precision is enough, we could go even faster. First, we could double the number of calculations per instruction again. Second, we could change the final calculation to use RSQRTPS, which calculates 1/sqrt(x) in 5 cycles instead of 27, and mulss instead of divps at the end, which is 4 cycles instead of 20. That makes the total count 19 for the loop, instead of 57. But the inverted square root is only available in single precision [and probably not quite as precise].

How can you double (or even quadruple) the instructions if one loop iteration depends on the result of the previous?

True, you can't. However, I think the actual code [as opposed to this benchmark code] isn't iteratively calculating the value of x = sin(arctan(x)) 16 million times, but rather calculating y = sin(arctan(x)) for a great number of x (and y) values. In that case you'd be gaining a fair bit by doubling or quadrupling.
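To make the distinction concrete, here is a minimal sketch in plain C++ rather than SSE assembly. The names `sin_atan`, `iterate_chain` and `map_pairs` are made up for illustration, and the sketch assumes an even element count. In the first loop every step needs the previous result, so nothing can overlap; in the second, the two elements handled per iteration are independent of each other, which is what would let them share one packed instruction.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: the function discussed in this thread.
inline double sin_atan(double x) { return std::sin(std::atan(x)); }

// Benchmark-style loop: each step depends on the previous result,
// so iterations cannot be overlapped or packed together.
double iterate_chain(double x, std::size_t steps) {
    for (std::size_t i = 0; i < steps; ++i) x = sin_atan(x);
    return x;
}

// Vector-style loop: y[i] depends only on x[i], so the two elements
// handled in one iteration are independent of each other.
void map_pairs(const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() % 2 == 0 && y.size() == x.size());  // assumed even count
    for (std::size_t i = 0; i < x.size(); i += 2) {
        y[i]     = sin_atan(x[i]);
        y[i + 1] = sin_atan(x[i + 1]);
    }
}
```

Whether a compiler actually packs the second loop into two-wide SSE2 instructions depends on the compiler and on how sin/atan are implemented; the sketch only shows where the independence comes from.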

Yes, the final code will of course be performing this calculation on vectors of data with several hundred thousand entries, but I need double precision; otherwise I'd use single and get twice as many (and faster) ops per instruction.

So as you can see, the time taken to perform the sin and tan is trivial when compared to the overall calculation. I can't really use the sqrt method, as it only holds approximately in the positive domain, and there will be results in the negative domain. For those of you that haven't picked up on what the above code does, it's an SSE implementation of a simple perceptron, using sin(atan(x)) as the sigmoid.

Until you can build a working general purpose reprogrammable computer out of basic components from radio shack, you are not fit to call yourself a programmer in my presence. This is cwhizard, signing off.

Assuming the inputs to your function aren't mainly negative, I think the gain from using sqrt() for positive numbers is big enough that it's worth doing an "if (input >= 0) ... else ..." for the two options. [Unless we can do "temp = sgn(input); input = abs(input); <calculate using sqrt>; output *= temp;" to "fix up" the sign - I don't know if -sin(atan(x)) == sin(atan(-x)) or not].
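On the sign question: for t = atan(x), t lies in (-pi/2, pi/2), so cos(t) = 1/sqrt(1 + x*x) is positive and sin(t) = tan(t)*cos(t) = x/sqrt(1 + x*x). That makes sin(atan(x)) an odd function, with the numerator carrying the sign. The sketch below checks this numerically. It assumes the "sqrt method" mentioned above is this algebraic form, which the thread does not confirm, and the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>

// Reference form using the library trig functions.
inline double sigmoid_trig(double x) { return std::sin(std::atan(x)); }

// Algebraic form: one multiply, one add, one sqrt, one divide.
// The numerator carries the sign, so no branch or sign fix-up is needed.
inline double sigmoid_sqrt(double x) { return x / std::sqrt(1.0 + x * x); }

// Largest absolute difference between the two forms over [-limit, limit].
double max_difference(double limit, int samples) {
    assert(samples > 1);
    double worst = 0.0;
    for (int i = 0; i < samples; ++i) {
        const double x = -limit + 2.0 * limit * i / (samples - 1);
        const double d = std::fabs(sigmoid_trig(x) - sigmoid_sqrt(x));
        if (d > worst) worst = d;
    }
    return worst;
}
```

Calling `max_difference(100.0, 2001)` sweeps both negative and positive inputs; the two forms should agree to within rounding error across the whole range.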

Actually I use __declspec(align(16)) static; I just snipped a lot of the function to show only the part of concern.

Isn't an xmm register 16 bytes??

Yes it is, but it's exactly the same as 2 doubles right next to each other or 4 floats, depending on which functions you use. This is what makes it so fast at handling vectors: there is no conversion process needed.
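A small illustration of that layout, as a sketch only: `Xmm16`, `pack` and `lane` are hypothetical names standing in for 16 bytes of aligned memory, not anything from the original code, and the static_asserts assume the usual 8-byte double and 4-byte float.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for one xmm register's worth of aligned memory.
struct alignas(16) Xmm16 { unsigned char bytes[16]; };

static_assert(sizeof(Xmm16) == 2 * sizeof(double), "two doubles fill 16 bytes");
static_assert(sizeof(Xmm16) == 4 * sizeof(float), "four floats fill 16 bytes");

// Store two adjacent doubles into the 16-byte block, with no conversion.
Xmm16 pack(double lo, double hi) {
    const double pair[2] = {lo, hi};
    Xmm16 r;
    std::memcpy(r.bytes, pair, sizeof pair);
    return r;
}

// Read one of the two double lanes back out.
double lane(const Xmm16& r, int i) {
    assert(i == 0 || i == 1);
    double out;
    std::memcpy(&out, r.bytes + i * sizeof(double), sizeof out);
    return out;
}
```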

Start eax with dwcount, and count it down instead of a two-step add/compare.

In which case it just becomes a 2-step subtract and compare, and it causes cache issues with the read-ahead logic. Remember, we are processing more than one vector member at a time, so a simple LOOP won't work here. I will, however, optimize the add by putting 0x00000002 into ebx and doing a register-to-register add, with 0x00000008 in ecx and dwCount in edx.

Skip tempout++ and used tempout+4 instead?

Yeah, I missed that one; I forgot that the r/m field has several common increments encoded in it. That will also get rid of the sloppy C++ in the middle of an asm block.

__asm fadd
__asm fstp st(0)

Unnecessary operation??

For some reason my compiler refuses to accept the faddp opcode without complaining, so I added this as a quick fix until I figure out what the problem is. It was late when I wrote that code.

As for the proportion of positive to negative inputs and weights, it's completely unknown: it could be all negative, all positive, or somewhere in between.

Eh, no, that's not any better. There is no performance difference between a small constant and a register in adding - it's better to let the compiler use those registers for addressing data, etc.

But dwCount _IS_ the number of items to do, right? So if you count down to zero/negative, you don't need a compare at all - just a "jump not zero" or "jump not signed" - saves one instruction. Also, try to shuffle this up a few instructions, so that the subtract has time to finish before the branch is taken.

You can also improve the speed by unrolling the loop [calculating more than one set of values each time].
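As a rough sketch of both suggestions in C++ rather than inline assembly: the loop below counts a remaining-items value down to zero and handles two elements per pass. `weighted_sum`, `input` and `weight` are hypothetical names, and an even `count` is assumed. At the assembly level a subtract sets the zero flag, so a compiler is free to branch on it directly without a separate compare, though whether it does so is up to the compiler.

```cpp
#include <cassert>
#include <cstddef>

// Weighted sum over `count` elements, two per iteration.
// `count` is assumed to be even (precondition for this sketch only).
double weighted_sum(const double* input, const double* weight, std::size_t count) {
    assert(count % 2 == 0);
    double sum0 = 0.0, sum1 = 0.0;  // two accumulators, two independent chains
    // Count down to zero instead of counting up and comparing to a limit.
    for (std::size_t remaining = count; remaining != 0; remaining -= 2) {
        sum0 += input[0] * weight[0];
        sum1 += input[1] * weight[1];
        input += 2;
        weight += 2;
    }
    return sum0 + sum1;
}
```

Using two accumulators keeps the two multiply-add chains independent, which is the point of unrolling here; note that it also changes the order of floating-point additions compared with a single running sum.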

My comment on xmm registers being 16 bytes:
Are you actually intending to calculate

Although I expect that MOVAPD would actually fail if you add 8, because the next read is now unaligned.

And the same applies to pWeight, obviously.

Oh, and you still don't need to push and pop eax - the compiler will figure that out itself.

--
Mats

dwCount can be an odd number, and while the vector will always contain the proper even number of members, the actual dwCount must be preserved elsewhere in the program. You cannot count down, for two reasons. One, an odd dwCount would cause eax to underrun to 0xffffffff and cause a serious memory problem: since eax could never be zero, it would keep going until it produced a protection fault. Two, there would still be the compare, since LOOP uses ECX and decrements it by one on its own, so there would still be two instructions to decrement it by 2. Not to mention that INC/DEC are deprecated in x64, which would make this a non-portable solution. The primary reason for putting the constants into registers isn't to speed up the operation, but to reduce the memory bandwidth usage, which in large vector calculations is the REAL bottleneck.

And I do need to push and pop eax because these are __asm lines, not a single __asm {} block. I do this because it improves performance, since an __asm block preserves the entire machine state, whereas I can just preserve the part I'm going to modify.

You are correct about the 8; it should be 0x00000010, not 0x00000008. Again, sorry, it was late when I wrote that code.

I'm not suggesting you change the content or meaning of dwCount itself. But if you decrement instead of increment, then you do not need a compare instruction - the processor sets the condition flags on the decrement. If you have an odd value, you would have to "round up" before you start the loop [assuming it's valid to use the final, uninitialized element - otherwise your code is broken anyway], so that you always have an even value. If you do this, you can finish on zero, so you don't have to compare. Saving one instruction that is dependent on some other instruction is definitely a win - even if the individual instruction makes little difference.
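A minimal sketch of that "round up" step, assuming (as stated above) that the buffer really is padded to an even number of elements. `round_up_to_even` and `pair_iterations` are hypothetical helpers; the caller's dwCount is passed by value and is never modified.

```cpp
#include <cassert>
#include <cstdint>

// Round an element count up to the next even number.
// The caller's own count is left untouched (passed by value).
inline std::uint32_t round_up_to_even(std::uint32_t count) {
    assert(count < 0xFFFFFFFFu);  // avoid wrapping around to 0
    return (count + 1u) & ~std::uint32_t{1};
}

// Number of two-element passes a count-down loop would execute.
inline std::uint32_t pair_iterations(std::uint32_t count) {
    std::uint32_t remaining = round_up_to_even(count);
    std::uint32_t iterations = 0;
    while (remaining != 0) {  // lands exactly on zero, so no underrun
        remaining -= 2;
        ++iterations;
    }
    return iterations;
}
```

With an odd count of 5 the loop counter starts at 6 and reaches zero after three passes, so the 0xffffffff underrun cannot happen.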

You do not use any memory bandwidth by using register instead of constant for add or subtract - aside from POSSIBLY one byte extra code-space, but the code is in cache once it's been read anyways, so it does not compete with your data fetches.

Also, to clarify your misunderstanding: the INC/DEC instructions are still available in x86_64, but the simple single-byte opcode form is no longer available. [There are two ways to produce exactly the same instruction in the x86 instruction set. In the 64-bit version, one of them was removed and turned into a prefix instead, as there was no other "long sequence of single byte opcodes that can be changed", and double-byte prefixes quickly become a mess in the instruction decoder, code generators, etc.] The assembler/compiler can still generate INC/DEC instructions, they just start with 0xFF ... [if my memory serves right].
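For the record, a small sketch of the two encodings as I understand them from the Intel manuals: in 32-bit mode 0x40+r is INC r32 and 0x48+r is DEC r32, in 64-bit mode the bytes 0x40..0x4F are REX prefixes, and FF /0 and FF /1 are INC and DEC r/m32 in both modes. The `classify` and `encode_inc_dec` helpers are made up for illustration and only cover the register-direct case.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

enum class Mode { Bits32, Bits64 };
enum class Meaning { IncReg, DecReg, RexPrefix, Other };

// What a lone opcode byte in the range 0x40..0x4F means in each mode.
Meaning classify(std::uint8_t byte, Mode mode) {
    if (byte < 0x40 || byte > 0x4F) return Meaning::Other;
    if (mode == Mode::Bits64) return Meaning::RexPrefix;     // reused as REX
    return byte < 0x48 ? Meaning::IncReg : Meaning::DecReg;  // 0x40+r / 0x48+r
}

// Two-byte form valid in both modes: FF /0 = INC r/m32, FF /1 = DEC r/m32.
// `reg` is the 3-bit register number (0 = eax, 1 = ecx, ...).
std::array<std::uint8_t, 2> encode_inc_dec(bool is_dec, unsigned reg) {
    assert(reg < 8);
    const unsigned modrm = 0xC0u | ((is_dec ? 1u : 0u) << 3) | reg;  // mod = 11
    return {0xFF, static_cast<std::uint8_t>(modrm)};
}
```

So `dec eax` comes out as FF C8 in either mode, one byte longer than the old 0x48 form.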

> And I do need to push and pop eax because these are __asm lines, not a single __asm {} block. I do this because it improves performance, since an __asm block preserves the entire machine state, whereas I can just preserve the part I'm going to modify.

Ok, fair enough - although "saves entire" state is not correct - it saves what it has to.

Even if the const is in the cache, it's still 4 bytes (it's a DWORD const) of memory bandwidth that gets used, plus it competes with other instructions in the pipeline for access to the cache, whereas a register/register add only competes for the ALU. It could use the ADD r/m32, imm8 opcode, but that's still 3 bytes versus 2 for the register-to-register form. Plus it would be easier to realign other ops if the offset was 2 versus 3. True, I could use a NOP to realign, but I'd rather use some opcode that would improve performance by doing actual work. Both ways are valid; it would depend on the specific tradeoffs you wanted to make. In this case I think I'm going to rewrite it once again to improve performance even further. I'll post the code later.
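To put numbers on the sizes being compared, here is a sketch that builds the three register-direct ADD encodings byte by byte (01 /r, 83 /0 ib and 81 /0 id, per the Intel instruction reference). The helper names are invented for this example. Note that in every form the immediate is part of the instruction bytes themselves, so it arrives through instruction fetch rather than as a separate data load.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// add r/m32, r32   -> 01 /r     (2 bytes, register-direct)
std::vector<std::uint8_t> add_reg_reg(unsigned dst, unsigned src) {
    assert(dst < 8 && src < 8);
    return {0x01, static_cast<std::uint8_t>(0xC0u | (src << 3) | dst)};
}

// add r/m32, imm8  -> 83 /0 ib  (3 bytes, immediate is sign-extended)
std::vector<std::uint8_t> add_reg_imm8(unsigned dst, std::int8_t imm) {
    assert(dst < 8);
    return {0x83, static_cast<std::uint8_t>(0xC0u | dst),
            static_cast<std::uint8_t>(imm)};
}

// add r/m32, imm32 -> 81 /0 id  (6 bytes, little-endian immediate)
std::vector<std::uint8_t> add_reg_imm32(unsigned dst, std::uint32_t imm) {
    assert(dst < 8);
    std::vector<std::uint8_t> out = {0x81, static_cast<std::uint8_t>(0xC0u | dst)};
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<std::uint8_t>(imm >> shift));
    return out;
}
```

With eax = 0 and ebx = 3, `add eax, ebx` is 01 D8 and `add eax, 2` in the imm8 form is 83 C0 02, which matches the 2-versus-3-byte comparison above.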
