I'm working on a Cortex-M0 CPU, which doesn't have hardware division, so every time I divide something, the GCC library function is used. One of the divisions I do most often is dividing by 256, to convert shorts into bytes. Is there some way I can do this more efficiently (for example by bit-shifting) than the default GCC library will do it?

You would be interested in chapter 10, "Integer Division by Constants", of the book Hacker's Delight. But the GCC implementers have probably read it. Did you look at the assembly before assuming there was a better way? (Note: this comment assumes that you do mean 255.)
– Pascal Cuoq, Feb 3 '13 at 16:48

If you divide by 255 to get the high-order 8 bits, I've got bad news for you.
– Jan Dvorak, Feb 3 '13 at 16:49

@JanDvorak No, it has to be 256; you're right.
– Muis, Feb 3 '13 at 16:57

The other commenters have noted that you can do that divide-by-256 with a right-shift or various other tactics, which is true -- a reasonable version of this would be:

unsigned char c = (unsigned char) ((s + 32768) >> 8);

However, there is no need for such optimizations. GCC is very smart about converting divide-by-constant operations into special-case implementations, and in this case it compiles both of these into exactly the same code (tested with -O2 -mcpu=cortex-m0 -mthumb and GCC 4.7.2):

mov r3, #128
lsl r3, r3, #8
add r0, r0, r3
lsr r0, r0, #8
uxtb r0, r0

If you try to be too clever (as with the union or pointer-cast examples in other answers), you are likely to just confuse it and get something worse -- especially since those work by memory loads, and adding 32768 means you already have the value in a register.
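To make "both of these" concrete, here is a minimal sketch of the two forms side by side. The function names are hypothetical, and it assumes `s` is a signed 16-bit value and that `int` is wider than 16 bits (as on the Cortex-M0), so `s + 32768` always lands in 0..65535. Under those assumptions the division and the shift give the same result for every input, which is why the compiler is free to emit a shift for either one.

```c
#include <stdint.h>

/* Hypothetical helper: offset a signed 16-bit value into 0..65535,
   then divide by 256 to keep the top 8 bits. */
static uint8_t to_byte_div(int16_t s)
{
    /* s is promoted to int, so the sum is non-negative and cannot overflow. */
    return (uint8_t)((s + 32768) / 256);
}

/* Hypothetical helper: the same conversion written as an explicit shift. */
static uint8_t to_byte_shift(int16_t s)
{
    /* Right-shifting a non-negative int by 8 equals dividing it by 256. */
    return (uint8_t)((s + 32768) >> 8);
}
```

Comparing the output of `gcc -O2 -S` for the two functions on your own toolchain is the simplest way to confirm that they compile to the same instructions.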

You'd be surprised... I once saw a GCC port which dealt with explicit shift operations in the C code by emitting multiplication opcodes - someone probably assumed a hardware multiplier and hence equivalent cost, though the actual experimental hardware it was trying to run on did not have the multiply instruction implemented. Fortunately, being an FPGA, it was easy to add. Hopefully the case with division is better handled.
– Chris Stratton, Feb 4 '13 at 19:59

@ChrisStratton: Yup -- which is why it's useful to read the generated assembly every so often, just to make sure what you think is happening is actually happening! Although the case you describe is pretty clearly a bug in GCC.
– Brooks Moses, Feb 4 '13 at 23:20

@JanDvorak Yes, but the answer states that the result depends on endianness, so it is in no way incorrect. Also, the OP knows exactly which architecture he is developing for, so he knows the endianness beforehand.
– Andreas Grapentin, Feb 3 '13 at 17:01

There's little justification for writing platform-dependent code when the platform-independent code is cleaner and likely to be faster. Programs don't necessarily spend their entire lifetime on the target for which they were originally written. There's a fairly high chance that in the shift case the variable could be held in a register (rather than memory) by an optimizing compiler; in the pointer case there might be some optimizing compiler smart enough to figure out that this is the only reason you are using a pointer and implement it with registers, but that seems substantially less likely.
– Chris Stratton, Feb 4 '13 at 20:03

Would there be a performance gain from using the union vs bit-shifting? It seems the union would be faster, since there are no extra operations necessary, but maybe it has other overhead?
– Muis, Feb 3 '13 at 17:11

@Joshua With a union, there is nothing more than mov instructions at the assembly level.
– BSH, Feb 3 '13 at 17:20

This will give the wrong answer, as it assumes a big-endian processor but the target in question is actually little-endian. You could fix it, but the risk of a mistake remains, especially if the code is ever ported to something else.
– Chris Stratton, Feb 4 '13 at 20:18
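The endianness point in these comments can be illustrated with a short sketch. It uses `memcpy` to look at the stored bytes rather than the union or pointer cast from the answers being discussed (those are not reproduced here), and all function names are hypothetical. The shift works on the value, so it returns the high byte on any target; the memory-based version returns the high byte only if you pick the index that matches the target's byte order.

```c
#include <stdint.h>
#include <string.h>

/* Endianness-independent: operates on the value, not on how it is stored. */
static uint8_t high_byte_shift(uint16_t v)
{
    return (uint8_t)(v >> 8);
}

/* Endianness-dependent: reads one byte of the in-memory representation.
   The high byte is at index 1 on little-endian targets and at index 0
   on big-endian targets. */
static uint8_t byte_from_memory(uint16_t v, int index)
{
    uint8_t bytes[sizeof v];
    memcpy(bytes, &v, sizeof v);
    return bytes[index];
}

/* Runtime probe: returns 1 if the lowest-addressed byte holds the low bits. */
static int is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

Hard-coding index 0 in `byte_from_memory` is the kind of big-endian assumption the comment above warns about: on a little-endian Cortex-M0 it would return the low byte instead.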