While in practice this is (almost) always the case, introducing the
extent (and particularly the limitations) to which integer math is easy
(or, in fact, ‘doable’) is the first necessary step to understanding,
later in this series of articles, some of the limitations of fixed-point
math.

We start from the basics: since we are assuming a binary computer, we
know that n bits can represent 2^n distinct values. So an 8-bit byte
can represent 256 distinct values, a 16-bit word can represent 65536
distinct values, a 32-bit word can represent 4,294,967,296, and a 64-bit
word a whooping 18,446,744,073,709,551,616, over 18 (short) trillion. Of
course the question now is: which ones?

Representation of unsigned integers

Let's consider a standard 8-bit byte. The most obvious and natural
interpretation of a byte (i.e. 8 consecutive bits) is to interpret it
as a (non-negative, or unsigned) integer, just like we would interpret a
sequence of consecutive (decimal) digits. So binary 00000000 would be 0,
binary 00000001 would be (decimal) 1, binary 00000010 would be (decimal)
2 binary 00000011 would be (decimal) 3 and so on, up to (binary)
11111111 which would be (decimal) 255. From 0 to 255 inclusive, that's
exactly the 256 values that can be represented by a byte (read as an
unsigned integer).

Unsigned integers can be trivially promoted to wider words (e.g. from
8-bit byte to 16-bit word, preserving the numerical value) by padding
with zeroes.

This is so simple that it's practically boring. Why are we even going
through this? Because things are not that simple once you move beyond
unsigned integers. But before we do that, I would like to point out that
things aren't that simple even if we're just sticking to non-negative
integers. In terms of representation of the numbers, we're pretty
cozy: n bits can represent all non-negative integers from 0 to
2^n-1, but what happens when you start doing actual math on them?

Modulo and saturation

Let's stick to just addition and multiplication at first, which are the
simplest and best defined operations on integers. Of course, the trouble
is that if you are adding or multiplying two numbers between 0 and 255,
the result might be bigger than 255. For example, you might need to do
100 + 200, or 128*2, or even just 255+1, and the result is not
representable in an 8-bit byte. In general, if you are operating on
n-bits numbers, the result might not be representable in n bits.

So what does the computer do when this kind of overflow happens? Most
programmers will now chime in and say: well duh, it wraps! If you're
doing 255+1, you will just get 0 as a result. If you're doing 128*2,
you'll just get 0. If you're doing 100+200 you'll just get 44.

While this answer is not wrong, it's not right either.

Yes, it's true that the most common central processing units we're used
to nowadays use modular arithmetic, so that operations that would
overflow n-bits words are simply computed modulo 2^n (which is easy
to implement, since it just means discarding higher bits, optionally
using some specific flag to denote that a carry got lost along the way).

Programmers used to the standard modular arithmetic can find saturation
arithmetic ‘odd’ or ‘irrational’ or ‘misbehaving’. In particular, in
saturation arithmetic (algebraic) addition is not associative, and
multiplication does not distribute over (algebraic) addition.

So, when the final result is representable, modular arithmetic gives
the correct result in the case of a static sequence of operations.
However, when the final result is not representable, saturation
arithmetic returns values that are closer to the correct one than
modular arithmetic: 300 is clamped to 255, in contrast to the severely
underestimated 44.

Being as close as possible to the correct results is an extremely
important property not just for the final result, but also for
intermediate results, particularly in the cases where the sequence of
operations is not static, but depends on the magnitude of the values
(for example, software implementations of low- or high-pass filters).

In these applications (of which DSP, be it audio, video or image
processing, is probably the most important one) both modular and
saturation arithmetic might give the wrong result, but the modular
result will usually be significantly worse than that obtained by
saturation. For example, modular arithmetic might miscompute a frequency
of 300Hz as 44Hz instead of 255Hz, and with a threshold of 100Hz this
would lead to attenuation of a signal that should have passed unchanged,
or conversely. Amplifying an audio signal beyond the representable
values could result in silence with modular arithmetic, but it will just
produce the loudest possible sound with saturation.

We mentioned that promotion of unsigned values to wider data types is
trivial. What about demotion? For example, knowing that your original
values are stored as 8-bit bytes and that the final result has to be
again stored as an 8-bit byte, a programmer might consider operating
with 16-bit (or wider) words to (try and) prevent overflow
during computations. However, when the final result has to be demoted
again to an 8-bit byte, a choice has to be made, again: should we just
discard the higher bits (which is what modular arithmetic does), or
return the highest representable value when any higher bits are set
(which is what saturation arithmetic does)? Again, this is a choice for
which there is no “correct” answer, but only answers that depend on the
application.

To conclude, the behavior that programmers used to standard modular
arithmetic might find ‘wrong’ is actually preferable in some
applications (which is why it is has been supported in hardware in the
multimedia and vector extensions (MMX and onwards) of the x86
architecture).

Thou shalt not overflow

Of course, the real problem in the examples presented in the previous
section is that the data type used (e.g. 8-bit unsigned integers)
was unable to represent intermediate or final results.

One of the most important things programmers should consider, maybe the
most important, when discussing doing math on the computer, is precisely
choosing the correct data type.

For integers, this means choosing a data type that can represent
correctly not only the starting values and the final results, but also
the intermediate values. If your data fits in 8 bits, then you want to
use at least 16 bits. If it fits in 16 bits (but not 8), then you want
to use at least 32, and so on.

Having a good understanding of the possible behaviors in case of
overflow is extremely important to write robust code, but the main point
is that you should not overflow.

Relative numbers: welcome to hell

In case you are still of the opinion that integer math is easy, don't
worry. We still haven't gotten into the best part, which is how to deal
with relative numbers, or, as the layman would call them, signed
integers.

As we mentioned above, the ‘natural’ interpretation of n bits is to
read them as natural, non-negative, unsigned integers, ranging from 0 to
2^n-1. However, let's be honest here, non-negative integers are pretty
limiting. We would at least like to have the possibility to also
specify negative numbers. And here the fun starts.

Although there is no official universal standard for the representation
of relative numbers (signed integers) on computers, there is undoubtedly
a dominating convention, which is the one programmers are nowadays
used to: two's complement. However, this is just one of many (no less
than four) possible representations:

sign bit and mantissa;

ones' complement;

two's complement;

offset binary aka biased representation.

Symmetry, zeroes and self-negatives

One of the issues with the representation of signed integers in binary
computers is that binary words can always represent an even number of
values, but a symmetrical amount of positive and negative integers, plus
the value 0, is odd. Hence, when choosing the representation, one has to
choose between either:

having one (usually negative) non-zero number with no representable
opposite, or

having two representations of the value zero (essentially, positive
and negative zero).

Of the four signed number representations enumerated above, the sign bit
and ones' complement representations have a signed zero, but each
non-zero number has a representable opposite, while two's complement and
bias only have one value for zero, but have at least one non-zero number
that has no representable opposite. (Offset binary is actually very
generic and can have significant asymmetries in the ranges of
representable numbers.)

Having a negative zero

The biggest issue with having a negative zero is that it violates a
commonly held assumption, which is that there is a bijective
correspondence between representable numerical values and their
representation, since both positive and negative 0 have the same
numerical value (0) but have distinct bit patterns.

Where this presents the biggest issue is in the comparison of two words.
When comparing words for equality, we are now posed a conundrum: should
they be compared by their value, or should they be compared by their
representation? If a = -0, would a satisfy a == 0? Would it
satisfy a < 0? Would it satisfy both? The obvious answer would be that
+0 and -0 should compare equal (and just that), but how do you tell them
apart then? Is it even worth it being able to tell them apart?

And finally, is the symmetry worth the lost of a representable value?
(2^n bit patterns, but two of them have the same value, so e.g. with
8-bit bytes we have 256 patterns to represent 255 values instead of the
usual 256.)

Having non-symmetric opposites

On the other hand, if we want to keep the bijectivity between value and
representation, we will lose the symmetry of negation. This means, in
particular, that knowing that a number a satisfies a < 0 we cannot
deduce that -a > 0, or conversely, depending on whether the value with
no opposite is positive or negative.

Consider for example the case of the standard two's complement
representation in the case of 8-bit bytes: the largest
representable positive value is 127, while the largest (in magnitude)
representable negative value is -128. When computing opposites, all
values between -127 and 127 have their opposite (which is the one we
would expect algebraically), but negating -128 gives (again) -128 which,
while algebraically wrong, is at least consistent with modular
arithmetic, where adding -128 and -128 actually gives 0.

A brief exposition of the representations

Let's now see the representations in some more detail.

Sign bit and mantissa representation

The conceptually simplest approach to represent signed integers, given a
fixed number of digits, is to reserve one bit to indicate the sign, and
leave the other n-1 bits to indicate the mantissa i.e magnitude i.e.
absolute value of the number. By convention, the sign bit is usually
taken to be the most significant bit, and (again by convention) it is
taken as 0 to indicate a positive number and 1 to indicate a negative
number.

With this representations, two opposite values have the same
representation except for the most significant bit. So, for example,
assuming our usual 8-bit byte, 1 would be represented as 00000001, while
-1 would be represented as 10000001.

In this representation, the highest positive value that can be
represented with n bits is 2^{n-1} - 1, and the lowest (largest in
magnitude) negative value that can be represented is its opposite. For
example, with an 8-bit byte the largest positive integer is 127, i.e.
01111111, and the largest (in magnitude) negative integer is its
opposite -127, i.e. 11111111.

As mentioned, one of the undersides of this representation is that it
has both positive and negative zero, respectively represented by the
00000000 and 10000000 bit patterns.

While the sign bit and mantissa representation is conceptually obvious,
its hardware implementation is more cumbersome that it might seem at
first hand, since operations need to explicitly take the operands' signs
into account. Similarly, sign-extension (for example, promoting an 8-bit
byte to a 16-bit word preserving the numerical value) needs to ‘clear
up’ the sign bit in the smaller-size representation before replicating
it as the sign bit of the larger-size representation.

Ones' complement representation

A more efficient approach is offered by ones' complement representation,
where negation maps to ones' complement, i.e. bit-flipping: the opposite
of any given number is obtained as the bitwise NOT operation of the
representation of the original value. For example, with 8-bit bytes, the
value 1 is as usual represented as 00000001, while -1 is represented as
11111110.

The range of representable numbers is the same as in the sign bit and
mantissa representation, so that, for example, 8-bit bytes range from
-127 (10000000) to 127 (01111111), and we have both positive zero
(00000000) and negative zero (11111111).

(Algebraic) addition in modular arithmetic with this representation is
trivial to implement in hardware, with the only caveat that carries and
borrows ‘wrap around’.

As in the sign-bit case, it is possible to tell if a number is positive
or negative by looking at the most-significant bit, and 0 indicates a
positive number, while 1 indicates a negative number (whose absolute
value can then be obtained by flipping all the bits). Sign-extending a
value can be done by simply propagating the sign bit of the smaller-size
representation to all the additional bits in the larger-size
representation.

Two's complement

While ones' complement representation is practical and relatively easy
to implement in hardware, it is not the simplest, and it's afflicted by
the infamous ‘negative zero’ issue.

Because of this, two's complement representation, which is simpler to
implement and has no negative zero, has gained much wider adoption. It
also has the benefit of ‘integrating’ rather well with the equally
common modular arithmetic.

In two's complement representation, the opposite of an n-bit value is
obtained by subtracting it from 2^n, or, equivalently, from flipping
the bits and then adding 1, discarding any carries beyond the n-th
bit. Using our usual 8-bit bytes as example, 1 will as usual be
00000001, while -1 will be 11111111.

The largest positive representable number with n bits is still
2^{n-1}-1, but the largest (in magnitude) negative representable
number is now -2^{n-1}, and it's represented by a high-bit set to 1
and all other bits set to 0. For example, with 8-bit bytes the largest
positive number is 127, represented by 01111111, whose opposite -127 is
represented by 10000001, while the largest (in magnitude) negative
number is -128, represented by 10000000.

In two's complement representation, there is no negative zero and the
only representation for 0 is given by all bits set to 0. However, as
discussed earlier, this leads to a negative value whose
opposite is the value itself, since the representation of largest (in
magnitude) negative representable number is invariant by two's
complement.

As in the other two representations, the most significant bit can be
checked to see if a number is positive and negative. As in ones'
complement case, sign-extension is done trivially by propagating the
sign bit of the smaller-size value to all other bits of the larger-size
value.

Offset binary

Offset binary (or biased representation) is quite different from the
other representations, but it has some very useful properties that have
led to its adoption in a number of schemes (most notably the IEEE-754
standard for floating-point representation, where it's used to encode
the exponent, and some DSP systems).

Before getting into the technical details of offset binary, we look at a
possible motivation for its inception. The attentive reader will have
noticed that all the previously mentioned representations of signed
integers have one interesting property in common: they violate the
natural ordering of the representations.

Since the most significant bit is taken as the sign bit, and negative
numbers have a most significant bit set to one, natural ordering (by bit
patterns) puts them after the positive numbers, whose most significant
bit is set to 0. Additionally, in the sign bit and mantissa
representation, the ordering of negative numbers is reversed with
respect to the natural ordering of their representation. This means that
when comparing numbers it is important to know if they are signed or
unsigned (and if signed, which representation) to get the ordering
right. The biased representation is one way (and probably the most
straightforward way) to circumvent this.

The basic idea in biased representation or offset binary is to ‘shift’
the numerical value of all representations by a given amount (the bias
or offset), so that the smallest natural representation (all bits 0)
actually evaluates to the smallest representable number, and the largest
natural representation (all bits 1) evaluates to the largest
representable number.

The bias is the value that is added to the (representable) value to
obtain the representation, and subtracted from the representation to
obtain the represented value. The minimum representable number is then
the opposite of the bias. Of course, the range of representable numbers
doesn't change: if your data type can only represent 256 values, you can
only choose which 256 values, as long as they are consecutive
integers.

The bias in an offset binary representation can be chosen arbitrarily,
but there is a ‘natural’ choice for n-bit words, which is 2^{n-1}:
halfway through the natural representation. For example, with 8-bit
bytes (256 values) the natural choice for the bias is 128, leading to a
representable range of integers from -128 to 127, which looks distinctly
similar to the one that can be expressed in two's
complement representation.

In fact, the 2^{n-1} bias leads to a representation which is
equivalent to the two's complement representation, except for a
flipped sign bit, solving the famous signed versus unsigned comparison
issue mentioned at the beginning of this subsection.

As an example, consider the usual 8-bit bytes with a bias of 128: then,
the numerical values 1, 0 and -1 would be represented by the ‘natural’
representation of the values 129, 128 and 127 respectively, i.e.
10000001, 10000000 and 01111111: flipping the most significant bits, we
get 00000001, 00000000 and 11111111 which are the two's complement
representation of 1, 0 and -1.

Of course, the ‘natural’ bias is not the only option: it is possible to
have arbitrary offsets, which makes offset binary extremely useful in
applications where the range of possible values is strongly asymmetrical
around zero, or where it is far from zero. Of course, such arbitrary
biases are rarely supported in hardware, so operation on offset binary
usually requires software implementations of even the most common
operations, with a consequent performance hit. Still, assuming the
hardware uses modular arithmetic, offset binary is at least trivial to
implement for the basic operations.

One situation in which offset binary doesn't play particularly well is
that of sign-extension, which was trivial in ones' and two's complement
represnetations. The biggest issue in the case of offset binary is,
obviously, that the offsets in the smaller and larger data types are
likely going to be different, although usually not arbitrarily
different (biases are often related to the size of the data type).

At least in the case of the ‘natural’ bias (in both the smaller and
larger data types), sign extension can be implemented straightforwardly
by going through the two's complement equivalent representation: flip
the most significant bit of the smaller data type, propagate it to all
the remaining bits of the larger data type, and then flip the most
significant bit of the larger data type. (In other words: convert to
two's complement, sign extend that, convert back to offset binary with
the ‘natural’ bias.)

What does a bit pattern mean?

We're now nearing the end of our discussion on integer math on the
computers. Before getting into the messy details of the first common
non-integer operation (division), I would like to ask the following
question: what do you get if you do 10100101 + 01111111?

Divide and despair

To conclude our exposition of the joys of integer math on the computers,
we now discuss the beauty of integer division and the related modulus
operation.

Since division of the integer e by the integer o only gives an
integer (mathematically) if e is a multiple of o, the concept of
‘integer division’ has arised in computer science as a way to obtain an
integer d from e/o even when o does not divide e.

The simple case

Let's start by assuming that e is non-negative and o is (strictly)
positive. In this case, integer division gives the largest integer d
such that d*o ≤ e. In other words, the result of the division of e
by o is truncated, or ‘approximated by defect’, however small the
remainder might be: 3/5=0 and 5/3=1 with integer division, even though
in the latter case we would likely have preferred a value of 2 (think of
2047/1024, for example).

The upside of this choice is that it's trivial to implement other forms
of division (that round up, or to the nearest number, for example), by
simply adding appropriate correcting factors to the dividend. For
example, round-up division is achieved by adding the divisor diminished
by a unit to the divident: integer divisoin (e + o - 1)/o will give
you e/o, rounded up: (3+5-1)/5 = 7/5 = 1, and (5 + 3 - 1)/3 = 7/3 =
2.

Division by zero

What happens when o is zero? Mathematically, division by zero is not
defined (although in some context where infinity is considered a valid
value, it may give infinity as a result —as long as the dividend is
non-zero). In hardware, anything can happen.

There's hardware that flags the error. There's hardware that produces
bogus results without any chance of knowing that a division by zero
happened. There's hardware that produces consistent results (always
zero, or the maximum representable value), flagging or not flagging the
situation.

‘Luckily’, most programming languages always treat a division by zero as
an exception, which by default causes a program termination. Of course,
this means that to write robust code it's necessary to sprinkle the code
with conditionals to check that divisions will successfully complete.

Negative numbers

If the undefined division by zero may not be considered a big issue per
se, the situation is much more interesting when either of the operands
of the division is a negative number.

First of all, one would be led to think that at least the sign of the
result would be well defined: negative if the operands have opposite
sign, positive otherwise. But this is not the case for the widespread
two's complement representation with modular arithmetic,
where the division of two negative numbers can give a negative number:
of course, we're talking about the corner case of the largest (in
magnitude) negative number, which when divided by -1 returns itself,
since its opposite is not representable.

But even when the sign is correct, the result of integer division is not
uniquely determined: some implementations round down, so that -7/5 = -2,
while others round towards zero, so that -7/5 = -1: both the choices are
consistent with the positive integer division, but the results are
obviously different, which can introduce subtle but annoying bugs when
porting code across different languages or hardware.

Modulo

The modulo operation is perfectly well defined for positive integers,
as the reminder of (integer) division: the quotient d and the reminder
r of (integer) division e/o are (non-negative) integers such that
e = o*d + r and r < o.

Does the same hold true when either e or o are negative? It depends
on the convention adopted by the language and/or hardware. While for
negative integer division there are ‘only’ two standards, for the modulo
operation there are three:

a result with the sign of the dividend;

a result with the sign of the divisor;

a result that is always non-negative.

In the first two cases, what it means is that, for example, -3 % 5 will
have the opposite sign of 3 % -5; hence, if one would satisfy the
quotient/reminder equation (which depends on whether integer
division rounds down or towards zero), the other obviously won't. In the
third case, the equation would only be satisfied if the division rounds
down, but not if the division rounds towards zero.

This could lead someone to think that the best choice would be a
rounding-down division with an always non-negative modulo. Too bad that
rounding-down division suffers from the problem that -(e/o) ≠ (-e)/o.

Summary

Integer math on a computer is simple only as far as you never think
about dealing with corner cases, which you should if you want to write
robust, reliable code. With integer math, this is the minimum of what
you should be aware of: