Efficient C Tip #13 – use the modulus (%) operator with caution

Tuesday, February 8th, 2011 by Nigel Jones

This is the thirteenth in a series of tips on writing efficient C for embedded systems. As the title suggests, if you are interested in writing efficient C, you need to be cautious about using the modulus operator. Why is this? Well a little thought shows that C = A % B is equivalent to C = A – B * (A / B). In other words the modulus operator is functionally equivalent to three operations. As a result it’s hardly surprising that code that uses the modulus operator can take a long time to execute. Now in some cases you absolutely have to use the modulus operator. However in many cases it’s possible to restructure the code such that the modulus operator is not needed. To demonstrate what I mean, some background information is in order as to how this blog posting came about.

Converting seconds to days, hours, minutes and seconds

In Embedded Systems Design there is an increasing need for some form of real time clock. When this is done, the designer typically implements the time as a 32 bit variable containing the number of seconds since a particular date. When this is done, it’s not usually long before one has to convert the ‘time’ into days, hours, minutes and seconds. Well I found myself in just such a situation recently. As a result, I thought a quick internet search was in order to find the ‘best’ way of converting ‘time’ to days, hours, minutes and seconds. The code I found wasn’t great and as usual was highly PC centric. I thus sat down to write my own code.

Attempt #1 – Using the modulus operator

My first attempt used the ‘obvious’ algorithm and employed the modulus operator. The relevant code fragment appears below.

This approach has a nice looking symmetry to it. However, it contained three divisions and three modulus operations. I thus was rather concerned about its performance and so I measured its speed for three different architectures – AVR (8 bit), MSP430 (16 bit), and ARM Cortex (32 bit). In all three cases I used an IAR compiler with full speed optimization. The number of cycles quoted are for 10 invocations of the test code and include the test harness overhead:

AVR: 29,825 cycles

MSP430: 27,019 cycles

ARM Cortex: 390 cycles

No that isn’t a misprint. The ARM was nearly two orders of magnitude more cycle efficient than the MSP430 and AVR. Thus my claim that the modulus operator can be very inefficient is true for some architectures – but not all. Thus if you are using the modulus operator on an ARM processor then it’s probably not worth worrying about. However if you are working on smaller processors then clearly something needs to be done – and so I investigated some alternatives.

Attempt #2 – Replace the modulus operator

As mentioned in the introduction, C = A % B is equivalent to C = A – B * (A / B). If we compare this to the code in attempt 1, then it should be apparent that the intermediate value (A/B) computed as part of the modulus operation is in fact needed in the next line of code. Thus this suggests a simple optimization to the algorithm.

void compute_time(uint32_t time)
{
uint32_t days, hours, minutes, seconds;
days = time / (24UL * 3600UL);
time -= days * 24UL * 3600UL;
/* time now contains the number of seconds in the last day */
hours = time / 3600UL;
time -= (hours * 3600UL);
/* time now contains the number of seconds in the last hour */
minutes = time / 60U;
seconds = time - minutes * 60U;
}

In this case I have replaced three mods with three subtractions and three multiplications. Thus although I have replaced a single operator (%) with two operations (- *) I still expect an increase in speed because the modulus operator is actually three operators in one (- * /). Thus effectively I have eliminated three divisions and so I expected a significant improvement in speed. The results however were a little surprising:

AVR: 18,720 cycles

MSP430: 14,805 cycles

ARM Cortex: 384 cycles

Thus while this technique yielded a roughly order of two improvements for the AVR and MSP430 processors, it had essentially no impact on the ARM code. Presumably this is because the ARM has native support for the modulus operation. Notwithstanding the ARM results, it’s clear that at least in this example, it’s possible to significantly speed up an algorithm by eliminating the modulus operator.

I could of course just stop at this point. However examination of attempt 2 shows that further optimizations are possible by observing that if seconds is a 32 bit variable, then days can be at most a 16 bit variable. Furthermore, hours, minutes and seconds are inherently limited to an 8 bit range. I thus recoded attempt 2 to use smaller data types.

All I have done is change the data types and to add casts where appropriate. The results were interesting:

AVR: 14,400 cycles

MSP430: 11,457 cycles

ARM Cortex: 434 cycles

Thus while this resulted in a significant improvement for the AVR & MSP430, it resulted in a significant worsening for the ARM. Clearly the ARM doesn’t like working with non 32 bit variables. Thus this suggested an improvement that would make the code a lot more portable – and that is to use the C99 fast types. Doing this gives the following code:

All I have done is change the data types to the C99 fast types. The results were encouraging:

AVR: 14,400 cycles

MSP430: 11,595 cycles

ARM Cortex: 384 cycles

Although the MSP430 time increased very slightly, the AVR and ARM stayed at their fastest speeds. Thus attempt #4 is both fast and portable.

Conclusion

Not only did replacing the modulus operator with alternative operations result in faster code, it also opened up the possibility for further optimizations. As a result with the AVR & MSP430 I was able to more than halve the execution time.

Converting Integers for Display

A similar problem (with a similar solution) occurs when one wants to display integers on a display. For example if you are using a custom LCD panel with say a 3 digit numeric field, then the problem arises as to how to determine the value of each digit. The obvious way, using the modulus operator is as follows:

– The #define was added to get a 16-bit constant, which then can be casted to a 32-bit one when needed.
– All operations that can be made on 16-bit operands are forced to be 16 bit.
– Explicit typecast on “minutes” in the last line, to avoid implicit integer promotions. It shouldn’t affect execution speed, but my mind is damaged by MISRA C.

Speaking of modulos and optimizers, the IAR optimizer for MSP 430 does not always optimize modulo by powers of 2 and make them mask operations.
Even in high-speed and high-size mode, the optimizer transforms a ` % 8` to a call to `DivMod16s`.
I can’t see the advantage of doing this. The result is much slower, and actually uses more memory (2 words more, besides the DivMod16s function itself).

I used the modulo (checking that the divisor was a power of 2 at compile time), because I took for granted that the optimizer would replace them with masks, and I felt it was more readable and logical.

Maybe there is an explanation behind this behaviour, but until I understand this I’ll keep using masks whenever possible.

In C99, a mask and a modulo don’t get the same result for negative numbers (eg -3 % 2 == -1).

So (on a two’s complement machine) the compiler can only replace a modulo-power-of-2 with an logical AND if it can prove that the operand is always positive (obviously, the easiest way for the compiler to prove this is if it is an unsigned type).

The “usual arithmetic promotions” cause the `uint8_t’ to be promoted to `int’ before the modulus is applied. Obviously, it will always be a *positive* int – but perhaps the compiler isn’t smart enough to see that. See if an explicit cast to `(unsigned)` before the modulus makes a difference?

You are correct in that a %8 on an uint16_t is optimized away to a mask, but I think the problem is more likely to be between 8 / 16 than signed / unsigned.

In the case of a uint8_t, the value is stored in a 16 bits register anyway (16 bits being the native word size), and I believe that the promotion from uint8_t to uint16_t wouldn’t do anything. It would probably result in mov.w R15,R15 (unnecessary since R15 is already written to by a mov.b operation, which zeroes the high byte). So the register situation would be the same, but the optimizer would replace the mod to a mask.

Furthermore, the uint8_t variable I apply the modulo to is incremented by a inc.w instead of inc.b.
(here is the C code: `index = (index + 1) % 8;`, with index an uint8_t). I can see why the compiler would do that, some type of casting to its native word size, but then why calling a division function? The optimizer does know how to use a mask instead of %, if the dividend is uint16_t, but not in this case?

The compiler does not seem to be smart enough indeed, that is what disappoints me. If it manages to optimize on uint16_t, and makes the operations on uint8_t as if they were on uint16_t, why doesn’t it manage to optimize the uint8_t case?

Casting to (unsigned) does work. You seem to be correct, thank you. A variable of type uint8_t should be a hint obvious enough for the optimizer to know the value is unsigned, shouldn’t it? It is enough when the variable is uint16_t, so why not uint8_t? Especially since the uint8_t is created by the same operations as the uint16_t…

Thanks for the hint on usual arithmetic conversions. The problem was in the increment: `a = (a + 1) % 8;`. 1 is a signed int, a is uint8_t but was converted to a signed int according to C99’s 6.3.1.8. As you mentioned, modulo on a signed dividend may not be optimized to a mask operation.

That’ll teach me to use 1U instead of 1.
Separating the increment and the modulo to two different lines also works, as well as casting to (unsigned) as per your suggestion.

With a dividend of type uint8_t (instead of int, as (a+1) turned out to be), the optimizer is in fact smart enough to see that it is always positive, and does use a mask. My bad.

Here is my attempt for the 3-digits problem. I do two divisions by 10 and keep both remainders. I used an optimization for divisions by constant from Hacker’s Delight book. I futher optimized by knowing the limited input bound and using remainders from a table from the computations.
This is not very readable but should be pretty fast!

The problem is that C provides to access to the underlying instructions. All hardware integer dividers produce the modulus as a side effect of doing the division. This is why most instruction sets have an instruction that returns both div and mod. But this efficiency is not captured in most languages operators, including C, perhaps because built in language operators generally return an atom of data not a dual. However, you can get it through a function call, div() or ldiv() which are in stdlib.h. ANd that is the answer! Not this crazy long essay.

Another alternative to try is the div() function – this is a standard C function that calculates a division and modulus together. In theory, it should assist the compiler to perform the same optimisation that you manually do.

modulus sometimes come for free if the two operators (% and /) are in the same block in the source and use the same operands, . I looked at some code generated by GCC and it appears that the divide routine produces the modulus simultaneously.

The processor is irrelevant. Modulo remainder is one of the two results that any reasonable processor or compiler library outputs for a division operation. Only terribly unoptimized code or awful compiler optimization requires TWO divides to produce the modulo AFTER the identical division is performed.

The following code should optimize to one CPU instruction or compiler library call, if they are grouped close together in executon:

time_hrs = time_sec / 3600;
time_sec = time_sec % 3600;

To be fair, I just tested using IAR EWARM and it required medium optimization settings to generate one call to the divide library function:

To be further fair, it didn’t generate one divide call in every spot that I had similar divide then modulo operations. That confuses me.

This should be a compiler optimization issue. If two identical divide and modulo operations are performed very near to one another, then the compiler should only have to perform the division AND get the remainder once.

It would be awful to have to circumvent poor compiler by getting the remainder through the above suggested subtraction of a product.

How about not using a single variable for keeping time?
Depending on how often the display function runs, I’d often rather store seconds, minutes, hours, and days in 4 separate variables.
Some microprocessors even do that automatically (BCD format: DoW, hour, min and sec take each a nibble of a 32 bit register, for example).
If you have to do it by hand it cost one comparison (besides the increments) per second, one comparison per minute, per hour, and so on… If you have to update your display once a second anyway (displaying time), you’d certainly win. Especially without HW multiplication or division.

In the case of integer for display, I believe you could use the double-dabble algorithm:

1. Shift the binary number left one bit.
2. If 8 shifts have taken place, the BCD number is in the Hundreds, Tens, and Units column.
3. If the binary value in any of the BCD columns is 5 or greater, add 3 to that value in that BCD column.
4. Go to 1

Gauthier,
keeping days, hours, minutes and seconds in separate variables is fine if all you want to do is display the time. But if you need to do a lot of time comparisons in your code, or do temporal arithmetic (e.g. adding 5000 seconds to a time), a single 32-bit counter makes for much clearer and less buggy code, which is surely the most important consideration. I would think it also makes for smaller and faster object code too, though I don’t currently have the means to check that.

In general I’d rather you emailed me these sorts of questions rather than put them in the forum. Anyway the answer to your question is that they are both right (and in some ways both wrong). I always configure my compilers to give the maximal amount of warnings. This is what my IAR AVR compiler had to say (line 5 is the assignment to b, line 6 is the printf):
Line 5: Warning[Pa079]: undefined behavior: variable “a” (or a value reached by some form of indirection through it) is modified more than once without an intervening sequence point in this statement
Line 6: Warning[Pa079]: undefined behavior: variable “a” (or a value reached by some form of indirection through it) is modified more than once without an intervening sequence point in this statement
Line 6: Warning[Pa081]: undefined behavior: the order of read and modification of variable “a” (or a value reached by some form of indirection through it) is undefined in this statement

Thus you are writing code in which the compiler’s behavior is undefined. When the compiler’s behavior is undefined, it is is free to do just about anything. Thus an output of “My brain hurts” would in fact be ‘correct’.

Incidentally, just to prove that it isn’t IAR being fussy. This is what PC-Lint had to say:

IAR gives the correct answer and Lint gives an incomplete one. I would strongly question the quality of Lint if it can’t even find this classic, well-known example of undefined behavior.

Modifying a variable twice without sequence points in between is undefined behavior (ISO 9899:1999 6.5 §2). Anything relying on undefined behavior is usually to be regarded as a severe bug, since the compiler is free to give -any- result.

But were it not for the undefined behavior, the same expression would also rely on the order of evaluation, which is unspecified behavior (ISO 9899:1999 6.5 §3). Unspecified behavior means that the compiler is free to do something in an implementation-specific way without documenting how. Relying on unspecified behavior can be a bug as well, even though the result is guaranteed to be deterministic. Someone who never intends to port their source code (or reuse it in other projects) may rely on unspecified behavior without getting any problems.

To sum it up, there is a severe bug in the code: accessing the variable twice between sequence points, and there is also a possible bug in the code, reliance on the order of evaluation of an expression. Lint did not find the most severe one.

As a sidenote, any static code analyzer tool with MISRA-C will find both of these bugs implicitly, as the bugs are covered by the MISRA-C rules. Perhaps if Lint is set to check for MISRA-C violations it would give a better output.

from attempt 1 to attempt 2 you change the orientation of your code. In the first attempt, you first compute the seconds then minutes, hours, days. In the second attempt you first compute the days, then hours, minutes, seconds.

Unfortunatly, I have no embedded compiler to try it (my work mostly is not coding) but perhaps your program performs better this way (?) :

In assembly language, the divide operation produces the result of the division and a remainder – the number that you want with mod. If a function is available to give you both, this could speed you code
-

I think this is only true for some 32 bit assemblers. For example most (all?) 8 /16 bit architectures lack a divide instruction. However your point that the remainder (modulus) inherently falls out of the division operator is valid. An earlier commenter had implied this in mentioning the standard div() function.

I think that any dececnt compiler should compile ‘compute_time’ to code that take zero cycles to execute. ‘compute_time’ doesn’t return anything, and only uses local variables; that is, it has no side-effects at all.

Full marks for paying close attention Ralf. I do quite a lot of this type of benchmarking and one of the problems you immediately run into is that your test code gets optimized away. In this case what I did was to include a printf() statement to generate output. This allowed me to check that the code was producing the correct answers. I recorded the number of cycles including the printf(). I then stripped out the test code and had printf() output constants. I recorded the number of cycles in this second case. I then subtracted one from the other to get the published results.

This is ridiculous. First, it is quite obvious that any division method, hardware or software will generate remainder as a by-product. Therefore, no conversion described in this article is required. This is not pc-specific. All 3 CPUs divide by software, but ARM, being 32-bit, does it much faster, wrote division code myself in assembly, it is very easy, by the way.
For 8 and 16 bit MCUs using 32-bit math is wrong approach altogether, and benchmarks show it quite clearly, no matter what “optimizations” you try to apply.

While I agree that 32 bit operations on small processors is painful, I don’t agree with you that this means that one should not do them. After all one is usually using an 8/16 bitter for a reason. What would you propose that someone who is using an 8/16 bitter should do when faced with this problem?

Write code which takes platform limitations into account, of course. Most older 8bit software probably uses custom representation for date-time stuff, allocating, for example, 1 byte for each component and instead using custom addition-subtraction subroutines instead of more general approach.
If you insist on conversion approach, and still care about speed, you would probably need a custom division code, optimized for dividing 32-bit uint by an unsigned byte. or at least try library div/mod functions first.
Writing a single piece of code for all three architectures and and expect it to give optimal performance is a little naive, IMHO.

By the way, sorry for placing this comment into the wrong place. I also forgot to add that division-capable MCUs actually do exist and would need custom code too. As far as I know, there are instruction set extensions with division for some members of different MCU families.

I have come across this issue more than once. Most uCs provide a remainder after a divide in hardware and as painful as it is, I bite the bullet and simply move the remainder contents to a working register with a short assembly sequence . I do not know of a “C” compiler that provides access to the remainder. Admittedly, this does not allow for portability, but it is the most efficient way of solving this issue that I know of.
‘David

The standard div() function is intended to allow the compiler to produce the quotient and remainder from a single division operation. Some compilers can do this for open-coded use of / and % too (gcc, at least recent-ish versions, does so).

i have an issue with your equation. Explain to me please how if C=A%B, C is also equal to A – B*(A/B). Let’s say A was 6 and B was 4. C = 2 in the first operation. But does 2 = 6 – 4*(6/4)? It doesn’t does it.

btw i don’t mean to sound rude, but i’m doing a math project on the modulus operator (which is like the coolest operator i’ve seen yet other than powers) and it’s usefulness. I know what the graphs look like, but i still need to know why exactly it is useful for converting time. Also, I believe that the second equation given, C = A – B *(A/B), will always produce zero if you were to write it out. If you could, could you please help me find in what ways is the operator useful for real world applications? Thank you mr. Nigel.

No offense taken. This is a blog about embedded systems and the modulus operator crops up a lot. To give you examples from my work unfortunately requires you to know a fair amount about embedded systems. Anyone out there have some simple real world examples for Lucas?

I think the issue is that you are treating what I wrote as an equation rather than as C code using integer operations. Perhaps this is clearer A – B * INT(A/B) where INT(A/B) is taken to mean perform integer arithmetic. In which case INT(A/B) = INT(6/4) = 1. Thus the modulus becomes 6 – 4 * 1 = 2, which I believe is the correct answer. Incidentally one of the things I did do is gloss over what happens if one of the operands is negative. To illustrate how interesting that gets, consider Microsoft’s definition of the .NET modulus operator, which is: A Mod B is the equivalent of A – Int(A / B) * B + CLng(Math.Sign(A) Math.Sign(B)) * B

If you run your code under the simulator then you will almost certainly find a cycle counter. I use IAR’s compilers – and all their simulators I’ve used have the feature. I used Keil’s 8051 compiler years ago, and its simulator also had a cycle counter. My recollection is that it was a pseudo register located under the register pull down. You could zero it at any time e.g. on the call to a function, then step over the function and get a direct read of how many cycles the function took.

If you want to quickly convert a 32-bit number of seconds into d,h,m,s, on an ARM M3 you can do so with one add, five multiplies and 3 subtracts, maybe 20 cycles total.

Of course the article was there to illustrate pitfalls of the modulo operator, and not to provide the best algorithm.

The method I describe here makes use of the fact that the M3 can multiply 32-bit integers and give the most significant 32 bits of the result. The C language does not support this directly, but my compiler will generate good code if you declare a 64-bit integer and immediately right-shift it. The method works by avoiding the divide operation: rather than dividing by a constant K, you multiply by a pre-calculated ciel(2^32 / K ) and then look at the more significant 32-bits of the result. With a little thought and care, you can write a function that always gives the correct integer. The key to this is to round UPWARDS when pre-calculating the constants.

The most critical part is to get the number of days. We need to effectively divide by (24*60*60).

Prepare to multiply by 2^32 / (24 * 60*60) which we can pre-calculate. This comes to 49710.269629… which, if truncated to an integer, is less than 16 bits and it will not be accurate enough, since (24*60*60) is more than 16 bits. So we can split it into an integer and a decimal part.

Again we multiply the decimal part (0.269629..) by 2^32 and round it upwards. If rounded downwards, we might get a number of days that is one too low, but there is no risk of getting one day to many since the closest input we can get is 1 second short of a whole day.
SecondsInDay = 24*60*60
Now we have DaysIntMul = 49710 and DaysDecMul = 1158050442

Altogether 5 multiplies, three subtracts, one add, and if the compiler is good, the shifts are just an address selection and do not take machine cycles at all. In practice, depending where the results go, about 20 cycles on an ARM M3.

Jacob is correct to point out that the division by 60 or modulo 60 operation can also be transformed into a multiplication by the reciprocal of 60 with scaling. No division required!

You can also scale the factor (2^32 / K) up by a power of 2 for a more accurate result with larger divisors. Find the largest n where (2^n / K) < 1, then scale the result by (2^-n), which is just a right shift by n bits (for unsigned operands).

I have found that the IAR Embedded Workbench for ARM compiler does exactly this. I wrote out the algorithm below in C and compiled it (with optimizations turned on), along with the usual y = x / 60 and r = x % 60 calculations. The assembly code the compiler produced for both code segments was identical!

The algorithm:

Consider a division by 60 or modulo 60 operation.

A useful scale factor is ceiling(2^32 * 2^n / 60), where n is the largest integer such that (2^n / 60) < 1. We find n = 5 for a scale factor of (32/60).

The scale factor of ceiling(2^32 * 32 / 60) can be considered a 32-bit unsigned binary fraction representing (32/60).

Multiplying x by this scale factor always gives a product that fits in 64 bits without overflowing, since (32/60) > 32) >> 5.

If you want to quickly convert a 32-bit number of seconds into d,h,m,s, on an ARM M3 you can do so with one add, six multiplies and 3 subtracts, maybe 20 cycles total.

Of course the article was there to illustrate pitfalls of the modulo operator, and not to provide the best algorithm.

The method I describe here makes use of the fact that the M3 can multiply 32-bit integers and give the most significant 32 bits of the result. The C language does not support this directly, but my compiler will generate good code if you declare a local 64-bit integer and immediately right-shift it. The method works by avoiding the divide operation: rather than dividing by a constant K, you multiply by a pre-calculated ciel(2^32 / K ) and then look at the more significant 32-bits of the result. With a little thought and care, you can write a function that always gives the correct integer. The key to this is to round UPWARDS when pre-calculating the constants.

The most critical part is to get the number of days. We need to effectively divide by (24*60*60).

Prepare to multiply by 2^32 / (24 * 60*60) which we can pre-calculate. This comes to 49710.269629… which, if truncated to an integer, is less than 16 bits and it will not be accurate enough, since (24*60*60) is more than 16 bits. So we can split it into an integer and a decimal part.

Again we multiply the decimal part (0.269629..) by 2^32 and round it upwards. If rounded downwards, we might get a number of days that is one too low, but there is no risk of getting one day to many since the closest input we can get is 1 second short of a whole day.
SecondsInDay = 24*60*60
Now we have DaysIntMul = 49710 and DaysDecMul = 1158050442

Altogether six multiplies, three subtracts, one add, and if the compiler is good, the shifts are just an address selection and do not take machine cycles at all. In practice, depending where the results go, about 20 cycles on an ARM M3.

4 multiplies, 2 subtracts, maybe 2 shifts if your compiler isn’t too smart.
Again, note how this works correctly only when the constants are rounded upwards. This is because the right shift , just like any other integer division, truncates, and thus rounds downwards.

I think this is flawless up to 102300 but above 999 the hundreds will not be a single digit, of course.

Here’s one more idea for converting a 32-bit integer of seconds into days, hours, minutes and seconds. This time the algorithm uses only 32-bit integers with no intermediate 64-bit results, so it can be written in straightforward C with no typecasts etc. It can take any 32-bit, signed or unsigned. Here then is my original algorithm.

Whenever one needs to do integer multiplications starting with an integer whose range is not limited, if the incoming number is high, multiplying it by any number greater than 1 will cause an overflow. Here we can prevent this by first shifting away seven bits from the incoming 32-bit number of seconds. This causes no loss of accuracy, because the number of seconds in a day, which we normally think of as 60*60*24 = 86400, can also be written as 128 * 675. So we can divide the incoming number by 128, and then work on dividing it by 675.

The algorithm works by approximating the value of 1/675 by (97 + 93/1024 -59/131072) /65536.

The second line has an unexpected start: a constant 7 is added. This is another example of upward rounding that makes these algorithms work. The 7 will be lost by the large right shift unless the incoming number is very large., in which case it compensates for the imprecision of the integer approximation.

One can predict the general result one expects from such a comparison. It depends on how costly a division operation is compared to a multiply or a shift. On an ARM M3, the div operation is good: 2 to 12 cycles depending on the data, in which case, the line I wrote:

a = seconds >> 7; days = (7 +97*a + ((93*a)>>10) –((59*a)>>17) ) >>16

would run no faster than this:

days = seconds / 86400

On the other hand, if you had something like an ARM M0 (e.g. NIOS), with a single cycle multiply, and no div instruction, the library div function might take several hundred cycles, you would see plenty of benefit from avoiding division even at the high cost of the line I just quoted.

Speaking of NIOS, and the FPGA world, most FPGAs have plentiful and excellent hardware multipliers. When writing firmware (e.g. VHDL) the case is strongest for avoiding a div operation at all costs.

The most important results to remember from all this are:

1) One can easily derive integer approximations that can deliver flawless exact integer results in all cases provided that the constant divisor is small.
2) This becomes easier if you first remove all factors of powers of 2 from the divisor and pre-shift the dividend accordingly.
3) In general you replace a/b with (a * b) >> c, where b is pre-calculated as ceil( (1<<c) /((float) b) and c is preferably the word size (16 or 32) of your processor

Most, if not all modern CPU have hardware support for % operator (as a mater of fact, most will have a moddiv instruction which calculates division and modulo at the same time). % is, in most architectures absolutely equivalent in term of computation time to division.
Even for CPU that do not have a native / or %, the cost is the same as the % is calculated intrisicly, at no extra cost, as the / is performed.

HOWEVER, why you need to be carefull is that % behavior on signed numbes is UNDEFINED in C!!!!

So, on 2 different architectures (ARM and Intel for example), % will give different results depending on the sign of the inputs! and that is a VERY hard bug to find in your code!

“When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded. If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a.”

Things are different in C++ where it was left up to the implementation, but a similar standard to C99 was recommended as of C++98 and it has since been defined in C++11.