Bit-tricks and other nifty little snippets

1
Introduction

Here are a few nifty little items involving bit operations for various
purposes like branchless signs and bit conversions. These things should
be faster than more standard algorithms for the same thing. Because
they came from GBA-experience, they're particularly apt for ARM cores,
though they should work for other platforms as well. Routines are
given in C, ARM or Thumb assembly. I'm assuming 32 bits per word
(BPW = 32) here, but the code can be extended to other word sizes.

Just a bit of warning, though: I am not saying these tricks should be
used; I'm saying these can be used. They're mostly just for fun and
to show alternative solutions. Whether they're faster than the regular
implementations depends on many things, including processor and compiler.
I would recommend some of these if you're already working in assembly
though – once you're doing that, you probably have found that you
need more speed than the compiler can give you (if not, you're doing
something wrong), at which time these entries can actually help you.

2
Math items

A few simple math functions like abs() and
sign() use branched code that could just as well be done
with bit operations, which are often faster than branching. Most of the
items here (alright, pretty much all of them) come from
S.E. Anderson's bithacks, a collection that I discovered just
after I figured out a few of them myself.

2.1
Branchless 2-way signs

There are three ways of indicating a 2-way sign:
negative-based (−1,0), positive-based (0,+1) and sign-based
(−1,+1). The basic form is negative-based, which in two's
complement is a one-instruction algorithm: simply sign-extend
into all bits by right-shifting by BPW−1. For a
positive-based sign, add one. The last one is a little trickier, but
can be calculated by multiplying the negative-sign by 2 and adding
one, or by shifting right by BPW−1 and setting bit 0 (note that shifting by BPW−2 works as well).

Most of the routines in this section use the `asr 31' trick in
some way or another. It's amazing how much you can do with it.
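
The three 2-way signs described above can be sketched in C as follows. This is a minimal sketch, not necessarily the original listing: the function names are made up for illustration, and it assumes 32-bit two's complement integers with an arithmetic (sign-extending) right shift, which C leaves implementation-defined but GCC and Clang provide.

```c
#include <stdint.h>

/* Assumption: >> on a negative int32_t sign-extends (arithmetic shift). */

/* Negative-based sign: -1 if x < 0, else 0. */
static inline int32_t sign2_neg(int32_t x)
{
    return x >> 31;
}

/* Positive-based sign: 0 if x < 0, else +1. */
static inline int32_t sign2_pos(int32_t x)
{
    return (x >> 31) + 1;
}

/* Sign-based sign: -1 if x < 0, else +1 (zero counts as positive). */
static inline int32_t sign2_pm(int32_t x)
{
    return (x >> 31) | 1;
}
```

Note that all three treat zero as non-negative; the 3-way sign is a separate routine.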

2.2
Branchless 3-way signs

This can also be used for a 3-way sign: ±1 for positive/negative
and 0 for zero. The negative and zero parts can be found from the
2-way negative sign. For the positive part, shift −x,
rather than x. This will give −1 for positive numbers, yet
(−0)>>31 will still be 0. Subtracting that from the
earlier sign gives the whole thing.
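
As a rough C sketch of that description (the name `sign3` is hypothetical, and an arithmetic right shift on signed values is assumed):

```c
#include <stdint.h>

/* 3-way sign: -1, 0 or +1.
   Assumption: >> on a negative int32_t sign-extends.
   The negation is done in unsigned arithmetic to avoid signed overflow. */
static inline int32_t sign3(int32_t x)
{
    int32_t neg = x >> 31;                            /* -1 if x < 0 */
    int32_t pos = (int32_t)(0u - (uint32_t)x) >> 31;  /* -1 if x > 0 */
    return neg - pos;
}
```

One caveat: the most negative value has no positive counterpart, so for INT32_MIN both terms are −1 and the result comes out as 0 rather than −1.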

2.3
Branchless abs

Of course you don't need a branchless abs() in ARM because
you can do the same thing with a rsblt, but it's nice that
it exists.
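
For reference, one common formulation of a branchless abs() built on the same sign-extension looks like this in C. It is a sketch and may differ from the original listing; the function name is hypothetical and an arithmetic right shift is assumed.

```c
#include <stdint.h>

/* Branchless absolute value.
   m is 0 for x >= 0 and -1 (all ones) for x < 0, so (x ^ m) - m is
   either x unchanged or (~x) + 1, which is -x in two's complement.
   Like abs(), this is not meaningful for INT32_MIN. */
static inline int32_t abs_branchless(int32_t x)
{
    int32_t m = x >> 31;
    return (x ^ m) - m;
}
```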

Reference:
S.E. Anderson : abs.
He also points out that apparently this approach is patented >_>.

2.4
Branchless min/max

For min/maxing, you have to compare two values, say a and
b. This generally takes the form of a subtraction:
a−b. The point is that whether
a>b or vice versa depends on the sign of the
difference. If you then mask this difference with the −1-sign,
you have either 0 (if a≥b) or the signed
difference (if a<b, so this will always be
negative), depending on which variable was bigger, and this
value can be used to ‘correct’ the compared variables.
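
In C, the masked-difference idea could look like the sketch below. The function names are illustrative, an arithmetic right shift is assumed, and the difference a−b must not overflow.

```c
#include <stdint.h>

/* d & (d >> 31) is d when a < b (d negative) and 0 otherwise.
   Assumption: a - b does not overflow. */

static inline int32_t min_branchless(int32_t a, int32_t b)
{
    int32_t d = a - b;
    return b + (d & (d >> 31));   /* a < b: b + (a - b) = a; else b */
}

static inline int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t d = a - b;
    return a - (d & (d >> 31));   /* a < b: a - (a - b) = b; else a */
}
```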

Whether these are really much use is debatable, though: they use
3 registers and are one instruction longer than the standard min/max
routines, which use a compare, a branch and a move. That said, the
branch is quite costly. EDIT, 20071110: but not costly
enough. The standard routine costs, on average, 3S+½N
cycles; the branchless version costs 4S. Thanks Dennis.

There is, however, one special case where this version is definitely
better: max(a,0), (and presumably
min(a,0) too, but that should be rare). In this case, the
thing basically reduces to `a-(a&a>>31)'.
This can be simplified to `a&~(a>>31)'
which is two instructions in Thumb and only one in ARM code.
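
A small C sketch of the zero-clamping special case. The names are hypothetical, an arithmetic right shift is assumed, and the min(a,0) variant is my own extrapolation of the same idea rather than something spelled out above.

```c
#include <stdint.h>

/* max(a, 0): clear everything when a is negative. */
static inline int32_t max0(int32_t a)
{
    return a & ~(a >> 31);
}

/* min(a, 0): keep a only when it is negative (extrapolated variant). */
static inline int32_t min0(int32_t a)
{
    return a & (a >> 31);
}
```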

2.5
A very sneaky bit-field clamp

It is sometimes necessary to keep values within a certain range, like
[0, A〉. The standard approach would be to check for
min/max and act accordingly: `y = min(max(x,0),A−1)'.
Normally, this would cost 4 ARM instructions, or about 6 Thumb (with
branches, ick); however, if A is a power of two, you can use this rather
shifty technique to do it faster.

First, define A = 2^n. In other words, we
want the result to be confined to n bits. If x is
outside the desired range (either too large or too small), then some
of the bits outside the range will be non-zero. This gives us our test
condition. Even better, the specific bits set indicate on which side
we are and how to correct for it.
If x<0, then x is negative (well, duh), meaning that
the sign-bit is set. If x≥A, then the sign-bit is
clear. Now, the values to clamp to are (all) 0 and
A−1 (all ones for n bits); which are essentially
the complements of the extended signbit. In other words, once you
know you're outside the valid range, sign-extend
(x>>31), flip all bits and mask by A−1
(or shift-right by 32−n).

Interestingly, you can combine a few of these operations. The
range-test can be performed by a right-shift, which also acts as a
sign-extension. The flip and shift then proceed as before. Using a shift-test,
rather than a mask-test also clears up a potential problem, namely
that the out-of-range values could have extended into the very high
bitrange (right underneath the sign-bit), and so complicating the
final shift.
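
Putting those steps together, a C version might look like the sketch below. The function name is hypothetical, the signed right shift is assumed to be arithmetic, and the final shift is done on an unsigned value so that it is a logical shift.

```c
#include <stdint.h>

/* Clamp x to [0, 2^n - 1], for 1 <= n <= 31.
   Assumption: >> on a negative int32_t sign-extends. */
static inline int32_t clamp_pot(int32_t x, unsigned n)
{
    int32_t y = x >> n;        /* range test and sign-extension in one */
    if (y)                     /* non-zero: x is outside [0, 2^n) */
        x = (int32_t)(~(uint32_t)y >> (32 - n));  /* flip, logical shift */
    return x;
}
```

For example, with n = 8 every negative input comes out as 0 and everything above 255 comes out as 255.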

If the clamp is in a loop, you can actually cut the ARM version down to two instructions. Instead of `~y>>(32−n)',
pre-load B = A−1 and use
`B^(y>>(32−n))'. One extra note on the
C version: currently, gcc still uses an extra (superfluous) cmp, so
compiled code would not be optimal, but it'd still be faster than the standard version.

    ldr     ip, =(1<<n)-1               @ prep mask before loop

    @ in the loop:
    movs    r1, r0, asr #n
    eorne   r0, ip, r1, lsr #(32-n)
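
The same pre-loaded-mask variant, sketched in C for comparison. The name and the `mask` parameter are illustrative; the caller is assumed to pass mask = 2^n − 1, and the signed right shift is assumed to be arithmetic.

```c
#include <stdint.h>

/* Clamp x to [0, 2^n - 1] using a pre-computed mask B = 2^n - 1.
   For x < 0 the shifted value equals B, so B ^ B = 0;
   for x >= 2^n the shifted value is 0, so B ^ 0 = B. */
static inline int32_t clamp_pot_xor(int32_t x, unsigned n, uint32_t mask)
{
    int32_t y = x >> n;
    if (y)
        x = (int32_t)(mask ^ ((uint32_t)y >> (32 - n)));
    return x;
}
```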

2.6
Cheap power-of-two math

You must know these already, but I'm going to point them out again
anyway. For powers-of-two (PoT), you can replace multiplies, divisions
and modulos by shifts.

Base form        Bitop version
-----------      -------------------
y = x * 2^n      y = x << n
y = x / 2^n      y = x >> n
y = x % 2^n      y = x & (2^n − 1)

I should point out, though, that the division and modulo aren't
quite correct for negative numbers. The problem is that a division
rounds towards 0 and a right-shift rounds towards −infinity. Sometimes
this is what you want anyway, but for an exact signed-division
replacement you'd need to do a correction.
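
One way to do that correction, sketched in C: add 2^n − 1 to negative inputs before shifting, so the shift ends up rounding towards zero. The function name is hypothetical, an arithmetic right shift is assumed, and n is limited to 0..30 here.

```c
#include <stdint.h>

/* Signed division by 2^n that rounds towards zero, like C's '/'.
   Assumptions: 0 <= n <= 30, and >> on a negative int32_t sign-extends. */
static inline int32_t div_pot(int32_t x, unsigned n)
{
    /* bias is 2^n - 1 for negative x and 0 otherwise */
    int32_t bias = (x >> 31) & ((INT32_C(1) << n) - 1);
    return (x + bias) >> n;
}
```

For instance, a plain −7 >> 1 gives −4, while the corrected version gives −3, matching −7 / 2.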

3
Bit unpacking

Fair enough, bit-unpacking doesn't happen too often anymore, but it
may be useful for creating masks for bitmaps or collision detection, or
for font rendering, if the font is initially stored as a 1bpp
bitmap – and even when it isn't you can still do ... interesting
things with bit patterns.

3.1
8-fold 1→4 bitunpack via look-up

There are two ways of doing this: putting the masks in a table and
looking them up, or using straight bit-manipulation. Version A is the
pussy version, but in Thumb it seems to work best. What it does is
read a source byte, use the nybbles of that byte to look-up
the two halves of the word from a 16-entry LUT, then string them
together to make the final mask.
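
A C sketch of the look-up approach, under a few assumptions: bit 0 of the source byte is the left-most pixel and ends up in the lowest nybble, each set bit becomes a nybble with value 1, and the table and function names are made up for illustration.

```c
#include <stdint.h>

/* 16-entry LUT: bit k of the index becomes nybble k of the entry. */
static const uint16_t unpack_lut[16] = {
    0x0000, 0x0001, 0x0010, 0x0011, 0x0100, 0x0101, 0x0110, 0x0111,
    0x1000, 0x1001, 0x1010, 0x1011, 0x1100, 0x1101, 0x1110, 0x1111
};

/* Unpack 8 bits into 8 nybbles (1 -> 4 bits per pixel). */
static inline uint32_t unpack1to4_lut(uint8_t raw)
{
    if (raw == 0)               /* empty line: nothing to look up */
        return 0;
    return unpack_lut[raw & 15]
         | ((uint32_t)unpack_lut[raw >> 4] << 16);
}
```

In a real renderer the zero test would typically skip the whole write for that line rather than just return early.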

Note that there is a check to see whether raw is zero. If
it is, then you know the line is empty and you don't need to render
anything. If your font has lower-case letters, they will often have
empty (all-zero) lines (‘e’, for example). Skipping the render can make
a big difference. If not, the one extra S-cycle isn't too much compared
to the rest.

3.2
8-fold 1→4 bitunpack via bitops

You can also AND/OR/SHIFT your way to the solution, which is actually
faster in ARM code because shifts are free. With shifted ORRs, you can
double the number of bits processed per iteration, giving something like
log2(M) speed. The only problem is that at some
point you'll start overwriting previous results, so you have to clear
bits from time to time as well.

The basic algorithm is as follows. The changes are given in red, and
the ‘correct’ bits are bold.
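
The colour-coded listing is not reproduced here, but one shift-and-mask sequence in this spirit can be sketched in C. It is not necessarily the exact sequence of the original; the function name is hypothetical, and as before bit 0 is assumed to map to the lowest nybble.

```c
#include <stdint.h>

/* Unpack 8 bits (hgfedcba) into 8 nybbles, each 0 or 1.
   Every step ORs in a shifted copy, then masks away the stale bits. */
static inline uint32_t unpack1to4_bitops(uint8_t raw)
{
    uint32_t x = raw;
    x = (x | (x << 12)) & 0x000F000Fu;  /* 4 bits per halfword  */
    x = (x | (x <<  6)) & 0x03030303u;  /* 2 bits per byte      */
    x = (x | (x <<  3)) & 0x11111111u;  /* 1 bit per nybble     */
    return x;
}
```

Three OR/AND pairs handle eight pixels, which is where the log2 behaviour comes from.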

3.3
8-fold 1→4 bitunpack with reverse

The previous subsection was about 1 to 4 bitunpacking where the bits
were indexed little-endian inside the bytes: the left-most pixel at
bit 0 and so on. This is nice, because both bits and bytes are then
arranged in a left-to-right reading order. Many graphics formats,
however, have the left-most pixel in the high-bit of a byte, which to
some is also nice, because it corresponds to a more conventional
number-reading order.

Yes, those are RORs. They're needed because some bits have to move up, while
others move down. I think some other variations are possible as well, but finding
those is left as an exercise for the reader.

4
Misc

4.1
Writing bytes into a 16-bit bus

This is intended for GBA-VRAM-like memory, which doesn't allow byte
writes. Well, it does, but it'll fill both bytes of that
particular halfword with the byte you write, which is generally
undesirable.

The standard solution is to read the halfword, see which byte
you need, then bic/orr the appropriate bits and write the halfword
back. However, there is a slightly faster way. It starts by not reading
the whole halfword, but by reading only the byte that should not
be written to. The rest is fairly simple.
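
A C sketch of the idea, assuming a little-endian machine (as on the GBA) and a destination that lives in halfword-aligned memory. The function name is hypothetical and this is not necessarily how the original listing is written.

```c
#include <stdint.h>

/* Store one byte using only a halfword write.
   Assumption: little-endian byte order. */
static inline void write_byte16(void *dst, uint8_t val)
{
    uintptr_t addr = (uintptr_t)dst;

    /* Read only the neighbouring byte, the one that must survive. */
    uint32_t keep = *(volatile uint8_t *)(addr ^ 1);

    /* Odd address: val is the high byte; even address: the low byte. */
    uint32_t hw = (addr & 1) ? (keep | ((uint32_t)val << 8))
                             : (val | (keep << 8));

    *(volatile uint16_t *)(addr & ~(uintptr_t)1) = (uint16_t)hw;
}
```

Flipping bit 0 of the address selects the other byte of the halfword, which saves the masking that the read-modify-write version needs.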

Yes, I know how awful the C version looks, but that cannot be helped
– you're not supposed to do stuff like that to pointers.
That's what assembly is for :P. It really is a little faster
than the standard version though. Of course, this is only beneficial
if you require random-access. For sequential accesses, there are
better ways.

4.2
Bit reversing

There is a really nifty method of reversing all bits (or power-of-two clumps
of bits) of a byte/word. The standard method will have you loop over all
bits for O(N) time, but it's also possible to do it in O(log N).
The basic trick is to first flip all single bits, then groups of 2, then 4 and
so on. Here's an example of a byte-reverse, but it works for longer
pieces as well.
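
A byte-reverse along those lines, as a minimal C sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* Reverse the 8 bits of a byte in three swap steps. */
static inline uint8_t rev8(uint8_t x)
{
    x = (uint8_t)(((x & 0x55) << 1) | ((x >> 1) & 0x55)); /* swap single bits */
    x = (uint8_t)(((x & 0x33) << 2) | ((x >> 2) & 0x33)); /* swap bit pairs   */
    x = (uint8_t)((x << 4) | (x >> 4));                   /* swap nybbles     */
    return x;
}
```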

But wait there's more! The example above is bit reversing all bits.
But if you look carefully, you'll see that reversing groups of bits can be
done by skipping the earlier reverses. This works out to a switch-block
with fallthroughs.

The following is a C routine to reverse 1, 2, 4, 8 and 16 bits within
a 32-bit word. It probably won't be compiled quite as it should, but
it's a good start. The cases are correctly compiled to a jump-table, and
the ROR macro works. If you're working on ARMv5 or higher, you might
be able to use a CLZ here as well somehow.
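
A plain C sketch of such a switch-with-fallthrough routine is given below. It uses ordinary unsigned constants rather than any compiler workaround, writes the final half-word swap as two shifts instead of a ROR macro, and the function name and `size` parameter are hypothetical.

```c
#include <stdint.h>

/* Reverse the order of 'size'-bit groups inside a 32-bit word.
   size must be 1, 2, 4, 8 or 16; other values return x unchanged.
   Each case deliberately falls through to the larger swaps. */
static inline uint32_t rev_groups(uint32_t x, unsigned size)
{
    switch (size) {
    case 1:  x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
             /* fall through */
    case 2:  x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
             /* fall through */
    case 4:  x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
             /* fall through */
    case 8:  x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
             /* fall through */
    case 16: x = (x << 16) | (x >> 16);   /* rotate by 16 */
             break;
    default: break;
    }
    return x;
}
```

With size 1 this is a full bit reverse; with size 8 it amounts to a byte swap.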

For those wondering about the
const volatile: no I have not gone insane; it's just that
the GCC version I tested with cannot AND literals properly. It'll use 4
byte-sized ANDs, rather than a single word-sized AND, which completely
kills the routine. The awkward formulation here is the only way I've found
that ‘fixes’ this. It probably (hopefully) only applies to
GCC for ARM, and only current versions (at time of writing, GCC 4.3.3).
For reversing inside bytes, this is not an issue.

I was playing with implementing the CLAMP routine under android today (ndk r8b). The code is compiled as Thumb-2, so it is not as restrictive as Thumb. I found that the "asr" instruction misbehaves when Rd != Rs. The workaround was to replace the "asr" with a "movs" like this: