Old geezer moment: sometimes these optimisations make me feel like I did back when I was writing assembly code (inner loops of texture mappers slinging pixels… unrolling carefully, rejoicing when a clever trick got your loop down from 11 to 9 cycles per pixel – those were the days) and moved from the Pentium to the Pentium Pro. Suddenly it was no longer possible to measure how many clock cycles any one instruction took: an innocuous change somewhere up or down the line would sometimes dramatically change the timing of an unrelated instruction, occasionally a loop iteration would randomly take more or fewer cycles than average, and sometimes instructions would appear to take a fractional number of cycles. It was like wading through quicksand.

Of course, there was no way to inspect the output of the PPro’s µ-op translator. Thank goodness for use re 'debug'; :-)