Here's a faster implementation of generic_fls that I discovered accidentally, by not noticing 2.5 already had a generic_fls, and so I rolled my own. Like the incumbent, it's O(log2(bits)), but there's not a lot of resemblance beyond that. I think the new algorithm is inherently more parallelizable than the traditional approach. A processor that can speculatively evaluate both sides of a conditional would benefit even more than the PIII I tested on.

The algorithm works roughly as follows: to find the highest bit in a value of a given size, test the higher half for any one bits, then recursively apply the algorithm to one of the two halves, depending on the test. Once down to 8 bits, enumerate all the cases:
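The posting's own code isn't shown here, but a sketch of that straightforward base case might look like the following, assuming the Linux fls convention that fls(0) == 0 and fls(1) == 1 (the function name is illustrative):

```c
#include <assert.h>

/* Sketch: binary search within 8 bits, enumerating every case.
 * Each test splits the remaining range in half; the leaves name
 * the bit position directly.  fls8(0) == 0, fls8(1) == 1. */
static inline unsigned fls8(unsigned n)
{
	return n & 0xf0 ?
		n & 0xc0 ? (n & 0x80 ? 8 : 7) : (n & 0x20 ? 6 : 5) :
		n & 0x0c ? (n & 0x08 ? 4 : 3) : (n & 0x02 ? 2 : n);
}
```

The final leaf returns n itself, which is correct for the only two values that can reach it, 0 and 1.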

The above expression can be considerably optimized by noting that once we get down to just two bits, at least one of which is known to be a one bit, it's faster to shift out the higher of the two bits and add it directly to the result than to evaluate a conditional.
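For instance (a sketch, with a hypothetical helper name): once the test n & 0xc0 has established that the high bit is in one of the top two positions, the branch `n & 0x80 ? 8 : 7` can become straight-line arithmetic:

```c
#include <assert.h>

/* Sketch: with at least one of bits 7..6 known set, shift out
 * bit 7 and add it to the base position instead of branching:
 *   n & 0x80 ? 8 : 7   becomes   (n >> 7) + 7
 * Only valid when the caller has already checked n & 0xc0 != 0. */
static inline unsigned fls_top2(unsigned n)
{
	return (n >> 7) + 7;
}
```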

A sneaky optimization is possible for the lowest two bits: the four values {0, 1, 2, 3} map directly onto three of the four wanted results {0, 1, 2, 2}, so a little bit bashing takes care of both the conditional mentioned above and the test that would otherwise be needed for the zero case. The resulting optimized code is sufficiently obfuscated for an IOCCC entry, but it's fast:
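A sketch of what that optimized base case might look like: every leaf uses the shift-and-add trick, and the last leaf's expression `n - ((n + 1) >> 2)` maps {0, 1, 2, 3} onto {0, 1, 2, 2} in one go, so the zero case falls out for free:

```c
#include <assert.h>

/* Sketch of the optimized 8-bit base case.  Each leaf shifts the
 * deciding bit out and adds it to the base position; the final
 * leaf, n - ((n + 1) >> 2), maps {0, 1, 2, 3} to {0, 1, 2, 2},
 * covering the zero case without a separate test. */
static inline unsigned fls8(unsigned n)
{
	return n & 0xf0 ?
		n & 0xc0 ? (n >> 7) + 7 : (n >> 5) + 5 :
		n & 0x0c ? (n >> 3) + 3 : n - ((n + 1) >> 2);
}
```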

In short, to find the high bit of a 32 bit value, the algorithm enumerates all 32 possibilities using a binary search. Inlines clarify the recursion, and as a fringe benefit, give us a handy set of 8, 16, 32 and 64 bit function flavors, the shorter versions being a little faster if you can use them.
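The wider flavors might be layered like this sketch, each one testing its upper half and recursing into an inline of half the width (the 8-bit base case is repeated here so the fragment stands alone):

```c
#include <assert.h>

/* Sketch: each wider flavor tests its upper half for one bits,
 * then hands the winning half to the next narrower inline. */
static inline unsigned fls8(unsigned n)
{
	return n & 0xf0 ?
		n & 0xc0 ? (n >> 7) + 7 : (n >> 5) + 5 :
		n & 0x0c ? (n >> 3) + 3 : n - ((n + 1) >> 2);
}

static inline unsigned fls16(unsigned n)
{
	return n & 0xff00 ? fls8(n >> 8) + 8 : fls8(n);
}

static inline unsigned fls32(unsigned n)
{
	return n & 0xffff0000 ? fls16(n >> 16) + 16 : fls16(n);
}
```

A 64-bit flavor would follow the same pattern on top of fls32.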

The thing that makes it fast (I think) is that the expressions at the leaves can be evaluated in parallel with the conditional tests - that is, it's possible to compute the results before we know exactly which one is needed. Another thing that may contribute to the speed is that the algorithm is doing relatively more reading than writing, compared to the current version. Though I did not test it, I believe the speedup will carry over to assembly implementations as well.

There are still some optimization possibilities remaining. For example, in some of the evaluation cases the zero case doesn't have to be evaluated, so a little arithmetic can be saved. But then the helper functions wouldn't be usable as sub-functions in their own right any more, so I don't think the small speedup is worth it for the decreased orthogonality.

The improvement on a PIII varies from about 1.43x with gcc -O2 to 2.08x at -O3. The benchmark runs 2**32 iterations, evaluating all 32 bit cases. Roughly speaking, at -O3 it takes about 10 cycles to do the job:
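The benchmark loop might be sketched as below (not the posting's harness); it sweeps a range of values and accumulates the results so the calls can't be optimized away. The limit is a parameter here for illustration; the posting's run swept all 2**32 values:

```c
#include <assert.h>

/* Sketch of a benchmark harness for fls32 (definitions repeated so
 * the fragment stands alone).  Summing the results keeps the
 * compiler from discarding the calls. */
static inline unsigned fls8(unsigned n)
{
	return n & 0xf0 ?
		n & 0xc0 ? (n >> 7) + 7 : (n >> 5) + 5 :
		n & 0x0c ? (n >> 3) + 3 : n - ((n + 1) >> 2);
}

static inline unsigned fls16(unsigned n)
{
	return n & 0xff00 ? fls8(n >> 8) + 8 : fls8(n);
}

static inline unsigned fls32(unsigned n)
{
	return n & 0xffff0000 ? fls16(n >> 16) + 16 : fls16(n);
}

/* Evaluate fls32 over [0, limit); the original run used 2**32. */
static unsigned long run_bench(unsigned long limit)
{
	unsigned long n, sum = 0;

	for (n = 0; n < limit; n++)
		sum += fls32(n);
	return sum;
}
```

Wrapping such a loop with a cycle or wall-clock timer, once per implementation, gives the relative figures quoted above.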