Integer division by constants other than positive powers of two is slow. It can be sped up by replacing the idiv instruction with an imul followed by a shift. The same problem applies to mod.

Steps To Reproduce

Divide by any constant that is not a positive power of 2: the generated assembly will include an idiv instruction, which is very slow. To put the speed difference in perspective, on Nehalem idiv has a latency of 37-100 cycles, whereas imul has a latency of 3.

Additional Information

I have finished implementing the algorithm that produces the "magic number" by which the numerator is multiplied, along with the amount to shift by. I have completed the assembly code generation in the backend, and am currently testing various cases and making the same change for mod.

See the attached files for the changes (amd64). I added a new Iintop type, Iintop_pow2, which covers immediate divisions that only require rdx and rax to compute (mainly powers of 2). Iintop_imm is now used for all constant divisions, and the algorithm is in emit.ml, along with the code generation. The algorithm can be found in "Hacker's Delight" by Henry S. Warren, Jr.
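For readers unfamiliar with the technique, here is a minimal sketch of the multiply-and-shift idea in OCaml. It is not the code from the attached patch. It uses one well-known formulation (round the multiplier up, then add 1 to the quotient for negative dividends), restricted to signed 32-bit dividends and divisors 2 <= d < 2^31 so that the product fits in an Int64. The names magic, div_by_const and mod_by_const are hypothetical, and a 64-bit OCaml is assumed.

```ocaml
(* Sketch only: signed 32-bit dividends, divisor 2 <= d < 2^31,
   64-bit OCaml assumed. All names here are illustrative. *)

(* Smallest l such that d <= 2^l. *)
let ceil_log2 d =
  let rec go l = if 1 lsl l >= d then l else go (l + 1) in
  go 0

(* Returns (m, shift): the "magic" multiplier and the right-shift amount,
   with m = floor(2^shift / d) + 1 and shift = 31 + ceil(log2 d). *)
let magic d =
  if d < 2 || d >= 1 lsl 31 then invalid_arg "magic";
  let shift = 31 + ceil_log2 d in
  let m =
    Int64.succ (Int64.div (Int64.shift_left 1L shift) (Int64.of_int d))
  in
  (m, shift)

(* n / d computed with one multiply and one arithmetic shift, plus a
   correction of +1 for negative n so the result truncates toward zero. *)
let div_by_const (m, shift) n =
  let q = Int64.shift_right (Int64.mul (Int64.of_int n) m) shift in
  Int64.to_int (if n < 0 then Int64.succ q else q)

(* n mod d recovered from the quotient: n - (n / d) * d. *)
let mod_by_const d mg n = n - d * div_by_const mg n
```

In a compiler the pair returned by magic would be computed once at compile time, so only the multiply, shift and sign correction remain in the emitted code.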

Also, I am unable to change the argument register for Iintop_imm(Idiv|Imod, _) from rdx to rcx (or any other register for that matter). I get the following:

Preliminary tests of this optimization for both division and mod (by constants that aren't powers of 2) show a speedup of about 3.9 times. The change to division by a power of 2 (3 shifts and an add vs 1 shift, 1 test, 1 conditional move and 1 add) shows a marginal improvement, about 1.1 times faster.
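As an illustration of the "3 shifts and an add" shape, here is one common branch-free sequence for signed division by 2^k, written in OCaml. That this is the exact sequence used in the patch is an assumption; the name div_pow2 is hypothetical, and Sys.int_size requires a reasonably recent OCaml.

```ocaml
(* Sketch only: truncating division of n by 2^k without a branch.
   Assumes 1 <= k <= Sys.int_size - 2. *)
let div_pow2 k n =
  if k < 1 || k > Sys.int_size - 2 then invalid_arg "div_pow2";
  (* Shift 1: all ones if n is negative, zero otherwise.
     Shift 2: keep only the low k bits, giving 2^k - 1 or 0. *)
  let bias = (n asr (Sys.int_size - 1)) lsr (Sys.int_size - k) in
  (* Add the bias, then shift 3 performs the division itself. *)
  (n + bias) asr k
```

The bias is needed because an arithmetic shift alone rounds toward negative infinity, whereas integer division rounds toward zero.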

It would be good if the ocamlopt backend could do this optimization. Do you have a practical example (not just a micro-benchmark) where this is a performance bottleneck?

However, I have some comments concerning your code:
- Could you please provide a file usable with the Unix patch utility (typically generated with diff -u)?
- emit.ml is a generated file. Please edit emit.mlp instead.
- You have to give some information to the register allocation algorithm to avoid emitting the moves you are speaking about. You have to modify destroyed_at_oper in proc.ml, and pseudoregs_for_operation in selection.ml
- Is there a good reason to perform this optimization while emitting the final code? I think it would be preferable to do this during instruction selection (typically in selectgen.ml). This would make it easier to implement for other architectures. You would have to add to Mach a multiplication operation that preserves the higher-order bits.

I just added the diff, in which everything is refactored so that emit.mlp is edited (instead of emit.ml). The numbers are now generated in selectgen.ml and selection.ml and then passed along in two types that are in mach.ml.

A practical example is converting a time to its individual parts (hours, minutes, seconds, etc.). This function is used when converting times to strings, so it is called in any sort of timestamping.
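A minimal sketch of that kind of function, to show where the constant divisions come from. The name split_time and the choice of whole seconds since midnight as input are assumptions made for illustration, not taken from the code in question.

```ocaml
(* Sketch only: split a count of seconds since midnight into
   (hours, minutes, seconds). Every / and mod has a constant divisor. *)
let split_time secs =
  let hours = secs / 3600 in
  let minutes = secs mod 3600 / 60 in
  let seconds = secs mod 60 in
  (hours, minutes, seconds)

(* Format as HH:MM:SS, as a timestamping routine might. *)
let time_to_string secs =
  let h, m, s = split_time secs in
  Printf.sprintf "%02d:%02d:%02d" h m s
```

Each call performs four operations with a constant divisor, none of which is a power of 2.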

Optimization (of integer division and modulus when the divisor is a constant) implemented in SVN trunk, commits 14254-14256. Will be in OCaml 4.02. Currently supported for amd64 and i386 targets. Several simplifications compared with the proposed patch, e.g. negative divisors are not optimized (they never occur in practice), and there is no need to add new operators to the Mach intermediate language.