nisse at lysator.liu.se (Niels Möller) writes:
> https://static.docs.arm.com/101398/0200/arm_cortex_a75_software_optimization_guide_v2.pdf
> If I manage to read those footnotes correctly, the only multiply
> instruction with that data dependency is "Multiply accumulate X-form",
> footnote 10, page 15. Maybe we can just avoid that instruction?
I am not aware of any other instruction which does a 64b x 64b -> 64b
lowhalf. One gets the plain non-accumulate variant by supplying x31
alias xzr for the input accumulation register.
> But I don't quite get the numbers (and I'm not familiar with the
> instruction set, so I'm not really sure which of the instructions are
> relevant for GMP loops). UMADDL looks reasonable (3 cycles latency for
> the factors, but only 1 for the addend). But UMULHI looks much slower,
> so maybe I'm missing something?
There is no UMULHI instruction. UMULH is our 64b x 64b -> 64b highhalf
instruction.
The other instructions have 32-bit operands.
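
For readers less familiar with the instruction set, the lowhalf and
highhalf operations discussed above can be sketched in C. This is only an
illustration of the arithmetic, not GMP code: the function names are made
up for this example, and unsigned __int128 is a GCC/Clang extension
assumed to be available.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a*b + c, the arithmetic of the 64-bit MADD form.
   Passing c = 0 models supplying xzr as the accumulator, i.e. a plain
   non-accumulating lowhalf multiply. */
static uint64_t mul_lo_acc(uint64_t a, uint64_t b, uint64_t c)
{
    return a * b + c;   /* unsigned arithmetic wraps mod 2^64 */
}

/* High 64 bits of the full 128-bit product, the arithmetic of UMULH. */
static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

Whether a compiler actually emits MADD and UMULH for these functions
depends on the compiler and its options.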
> Does any arch have a multiply-accumulate-high instruction, producing
> floor((a*b+c)/B)? That would be very useful if the latency for the c
> input is small.
The armv7-a, IA-64 and POWER 3.0B ISAs have such instructions. (I
recently committed code for the Power9 CPU, which implements POWER 3.0B.)
Unfortunately, Arm64 only does lowhalf accumulation, which is quite
useless. It is clear from the documentation that UMULH's encoding is
prepared for an accumulating variant. (The Ra field is 11111, i.e. x31
alias xzr.)
Late accumulation input read is useful, but even with early read these
instructions are very useful to us.
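
As an illustration of the multiply-accumulate-high operation asked about
above, floor((a*b+c)/B) with B = 2^64, here is a minimal C sketch. The
function names are hypothetical and unsigned __int128 is a GCC/Clang
extension assumed to be available. The second function shows one way to
get the same value from separate lowhalf and highhalf products: the carry
out of the low half has to be recovered with an extra compare and add.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: floor((a*b + c) / 2^64) via a 128-bit intermediate.
   a*b + c always fits in 128 bits, since (B-1)^2 + (B-1) < B^2. */
static uint64_t mul_acc_hi_ref(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 t = (unsigned __int128)a * b + c;
    return (uint64_t)(t >> 64);
}

/* Same value built from separate low and high products plus a carry. */
static uint64_t mul_acc_hi_split(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t lo = a * b;                                         /* lowhalf */
    uint64_t hi = (uint64_t)(((unsigned __int128)a * b) >> 64);  /* highhalf */
    uint64_t sum = lo + c;        /* may wrap mod 2^64 */
    return hi + (sum < c);        /* add the carry out of the low half */
}
```

Both functions return the same value for all inputs; the second only
makes the extra carry handling visible.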
--
Torbjörn
Please encrypt, key id 0xC8601622