Interesting anomaly of INV_NEWTON_THRESHOLD

At http://gmplib.org/devel/INV_NEWTON_THRESHOLD.html we see that
ultrasparc machines get a very low INV_NEWTON_THRESHOLD.
What is going on, why does mpn_invert become faster than plain division
for such small operands, i.e., why does mpn_sbpi1_divappr_q run slower
than the newton loop?
It turns out that this is due to a know "flaw" in schoolbook division:
It uses mpn_submul_1 for updating the partial remainder. Schoolbook
multiplication on the other hand uses mpn_addmul_2. On some machines,
and ultrasparc in particular, the latter is much faster per cross
product than the former.
The solution is to base schoolbook division on a multi-limb inverse, and
develop multiple quotient limbs at a time.
--
Torbjörn