Round to next Power of 2

The task to optimize today is to calculate the next power of two greater than or equal to a given number, i.e. round a number up to the next power of two. The first step when given such a task is to see if there is an obvious way of doing it. Having correct code is more important than having fast, incorrect code.

We can test to see how long this takes. Unfortunately, the compiler is a little too smart for us, and inlines the routine we want to benchmark into main(). Skipping the function-call overhead means that the results are slightly tainted. To fix this, add __attribute__((noinline)) to the definition of next_pow2(). The result of this is a program that runs in 8.3s on my machine.

To go faster we need a better algorithm. The current algorithm works fairly well if x is small. However, the number of loop iterations scales logarithmically, leading to longer times for larger inputs. Is there a constant-time algorithm we could use instead?

The trick is to tell the compiler more information. The more information it has, the better it optimizes. The compiler has intrinsic functions that we may use, and one in particular, __builtin_clz(), looks promising. This function counts the leading zeros in the binary expansion of a number. A version of next_pow2() using this function is:

Benchmarking this yields an immediate improvement. The program now takes a total of 0.000s to run. This is too good to be true. The compiler is very smart, and has realized that next_pow2() does not modify memory. Since its output is ignored by the benchmarking loop, the loop itself can be elided. The result is the equivalent of just return 0; for main().

To prevent this, modify the benchmarking loop into something which does not ignore the results:

This runs in 1.911s, a bit over four times faster than the original code, even though it is doing more work by summing the results. Why is this so fast? What is gcc doing for __builtin_clz()? The assembly for the function is:
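A representative listing (regenerated from a function of this shape with gcc -O2 on x86-64; exact registers and labels may differ from the article's original output):

```asm
next_pow2:
	movl	%edi, %eax
	cmpl	$2, %edi
	jbe	.L1
	subl	$1, %edi
	movl	$32, %ecx
	bsrl	%edi, %eax      # position of the leading 1 of x - 1
	xorl	$31, %eax       # convert that position into a leading-zero count
	subl	%eax, %ecx      # 32 - clz
	movl	$1, %eax
	sall	%cl, %eax       # 1 << (32 - clz)
.L1:
	ret
```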

As we can see, gcc is using the bsr instruction to find the leading 1, and then manipulates it to get the count of leading zeros. The problem is that it isn't smart enough to realize that it is really the position of the leading 1 that we are after. To get better code we'll need to use some inline assembly:

The resulting code is an instruction shorter. However, it isn't faster, still taking 1.911s. The move and xor instructions can overlap in execution, so we haven't gained anything yet.

Perhaps we can gain something by going branchless? Mispredicted branches are notoriously slow, so we may be able to go faster by eliminating the check for x <= 2 somehow. This is possible once we realize that the non-branching version of the code above still gives the correct answer if x == 0 or x == 2, making x == 1 the only special case. We can check for that case without branching, yielding the code:

This is indeed slightly faster, clocking in at 1.908s, but the difference is well within the noise, so it is debatable whether this trick helped beyond making the generated code smaller by two bytes.

Can we go faster? The next trick is to do some research. The book "Hacker's Delight" is full of useful algorithms for problems like this. On page 48 it offers an improvement using the fact that if you set all the bits below the highest set bit of a number, the result is one less than a power of 2.

So is this the fastest code? On this processor, no. It is marginally faster, running in 1.403s, if we use the unsigned long long type instead of unsigned int in the next_pow2() function. However, this trick probably only applies to this particular machine. Likewise, the observation that the bsr instruction is too slow to help may not hold on other machines, such as those using the Core 2 architecture. On such machines, the branchless bsr-based function may be the fastest.

In short, even with a task as simple as rounding up to the next power of two, there are multiple solutions, each vying to be the fastest. There may be no clear winner, as in this case, where different machines favour different styles of code. However, we can see that a little optimization can make a huge difference: the resulting 1.4s is nearly six times faster than the original routine on this machine.