Posts Tagged ‘ARM Cortex-M0’

Counting leading zeros in an integer number is a critical operation in many DSP algorithms, such as normalization of samples in sound or video processing, as well as in real-time schedulers to quickly find the highest-priority task ready-to-run.

In most such algorithms, it is important that the count-leading zeros operation be fast and deterministic. For this reason, many modern processors provide the CLZ (count-leading zeros) instruction, sometimes also called LZCNT, BSF (bit scan forward), FF1L (find first one-bit from left) or FBCL (find bit change from left).

Of course, if your processor supports CLZ or equivalent in hardware, you definitely should take advantage of it. In C you can often use a built-in function provided by the embedded compiler. A couple of examples below illustrate the calls for various CPUs and compilers:

However, what if your CPU does not provide the CLZ instruction? For example, ARM Cortex-M0 and M0+ cores do not support it. In this case, you need to implement CLZ() in software (typically an inline function or as a macro).

The Internet offers a lot of various algorithms for counting leading zeros and the closely related binary logarithm (log-base-2(x) = 32 – 1 – clz(x)). Here is a sample list of the most popular search results:

But, unfortunately, most of the published algorithms are either incomplete, sub-optimal, or both. So, I thought it could be useful to post here a complete and, as I believe, optimal CLZ(x) function, which is both deterministic and outperforms most of the published implementations, including all of the “Hacker’s Delight” algorithms.

This algorithm uses a hybrid approach of bi-section to find out which 8-bit chunk of the 32-bit number contains the first 1-bit, which is followed by a lookup table clz_lkup[] to find the first 1-bit within the byte.

The CLZ1() function is deterministic in that it completes always in 13 instructions, when compiled with IAR EWARM compiler for ARM Cortex-M0 core with the highest level of optimization.

If the ROM footprint is too high for your application, at the cost of running the bi-section for one more step, you can reduce the size of the lookup table to only 16 bytes. Here is the CLZ2() function that illustrates this tradeoff: