Provide support for ARMv4, lacking bx and clz. Unroll the
test-and-subtract loop and compute the initial block as address,
shaving off between 5% and 10% on Cortex A9 and 30%+ a Raspberry Pi.
Code written by Matt Thomas and Joerg Sonnenberger.