Floating point numbers. Why do we need them?

Since computer memory is limited, you cannot store numbers with infinite precision, no matter whether you use binary fractions or decimal ones: at some point you have to cut off. Floating point numbers are one possible way to represent real numbers that trades off range against precision.

What does this mean?
It means that each floating point number, according to the IEEE 754 standard, can be represented in the following form:
(-1)^sign × mantissa × base^exponent

Details about number representation.

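The standard defines several formats. However, we will consider only one of them, namely single precision, which stores values with an accuracy of roughly 7 significant decimal digits (normalized magnitudes range from about 1.18 × 10^-38 to about 3.4 × 10^38).
Here is a little more about how a single precision floating point number is organized.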

It occupies 32 bits (4 bytes): 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa.
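The layout above can be inspected directly. The following is a minimal sketch in Python (the helper name `float32_fields` is made up for this illustration) that uses the standard `struct` module to reinterpret the 4 bytes of a single precision value as an integer and then cuts out the three fields with shifts and masks.

```python
import struct

def float32_fields(value):
    """Split a number, stored as IEEE 754 single precision, into its bit fields."""
    # Pack as a big-endian float32, then read the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits, fraction part only
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(5.125)
print(f"{sign:01b} {exponent:08b} {mantissa:023b}")
# 0 10000001 01001000000000000000000
```

The rest of the article derives these same three fields by hand.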

How does the conversion happen?

I will take some number (let it be 5.125) and convert it step by step, to show the whole transition from decimal to binary format.

Now take a look at 5.125 and define the following points:
Sign = 0 (means positive number)
Fractional part = 0.125 (it will go into the mantissa)
Exponent = 2 (power), you will see later how we get this
Base = 2 (binary representation)
So eventually we will see the number in exponential form and understand how the computer stores it in binary format.

Step1 (conversion of the fractional part)

In a normalized binary mantissa the integer part always equals 1, so we put only the fraction part into the mantissa.
Consider our 5.125 and take the fractional part = 0.125.

Now we need to convert it into a binary fraction:

Multiply the fraction by 2

Write down the integer part of the result (0 or 1) as the next binary digit, then drop it

Check whether the new fraction equals zero
If NO, multiply the new fraction by 2 again (note: you can repeat until the precision limit of 23 fraction digits is reached). If YES, finish.

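Following the scheme above, we get:
0.125 × 2 = 0.25, digit 0
0.25 × 2 = 0.5, digit 0
0.5 × 2 = 1.0, digit 1, and the remaining fraction is 0, so we terminate here
So the fraction 0.125 can be represented in binary as 0.001.
Therefore 5.125 in binary is 101.001 (the integer part 5 is 101).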
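The multiply-by-2 procedure from this step can be sketched in a few lines of Python. The function name `fraction_to_binary` and the `max_digits` parameter are assumptions made for this illustration; the limit of 23 digits comes from the size of the single precision mantissa.

```python
def fraction_to_binary(fraction, max_digits=23):
    """Convert a decimal fraction in [0, 1) to a string of binary digits."""
    digits = []
    while fraction != 0 and len(digits) < max_digits:
        fraction *= 2
        integer_part = int(fraction)    # 0 or 1: the next binary digit
        digits.append(str(integer_part))
        fraction -= integer_part        # keep only the fractional part
    return "".join(digits)

print(fraction_to_binary(0.125))  # 001
```

For 0.125 the loop stops after three digits because the fraction becomes zero. For a value such as 0.1 it never reaches zero, so the digit limit is what ends the loop.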

Step2 (normalize the number)

It means that we need to represent the number in exponential form. You can read more details here.
In general, you need to shift the binary point so that the number has the form 1.xxx × 2^exponent, with a single 1 before the point.

So first we need to shift the point left or right, depending on what we already have.
In our case we have 101.001, so the point is shifted to the left by 2 digits and the number becomes 1.01001 × 2^2.

Step3 (find the biased exponent)
The exponent is not stored as is: we need to add the single precision bias, which is 127, to it:

Biased exponent = 127 + 2 = 129
After converting this to binary we get 10000001
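Steps 2 and 3 can be sketched together in Python. This is only an illustration under a stated assumption: the helper `normalize` (a made-up name) handles positive binary strings whose integer part is at least 1, which is all the 5.125 example needs.

```python
BIAS = 127  # exponent bias for single precision

def normalize(binary):
    """Normalize a binary string such as '101.001' to (fraction_bits, exponent).

    Sketch only: assumes a positive value with an integer part >= 1.
    """
    integer_part, _, fraction_part = binary.partition(".")
    integer_part = integer_part.lstrip("0")
    exponent = len(integer_part) - 1                   # places the point moves left
    fraction_bits = integer_part[1:] + fraction_part   # digits after the leading 1
    return fraction_bits, exponent

fraction_bits, exponent = normalize("101.001")
biased = exponent + BIAS
print(fraction_bits, exponent, format(biased, "08b"))  # 01001 2 10000001
```

The leading 1 is dropped from `fraction_bits` because, as noted in Step1, the integer part of a normalized mantissa is always 1 and is not stored.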

Final Result.

So what exactly do we have? Our number 5.125 looks like this in exponential form: 1.01001 × 2^2
and is represented in binary like this: 0 10000001 01001000000000000000000 (sign, exponent, mantissa)
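As a quick check of the hand conversion, the three fields can be glued together and compared with what Python's standard `struct` module produces for 5.125 as a single precision value. The variable names are chosen for this sketch only.

```python
import struct

sign_bit = "0"                            # positive number
exponent_bits = format(127 + 2, "08b")    # biased exponent, 10000001
mantissa_bits = "01001".ljust(23, "0")    # fraction digits padded to 23 bits
bits = sign_bit + exponent_bits + mantissa_bits

# Reference encoding of 5.125 as a big-endian float32.
(expected,) = struct.unpack(">I", struct.pack(">f", 5.125))
print(bits)
print(format(expected, "032b") == bits)   # True

# Decode our 32 bits back into a number.
(value,) = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))
print(value)                              # 5.125
```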

I hope this information was helpful for you. Feel free to correct me, I would appreciate it.