Query about Information Entropy formula

I'm a computer programmer with a bit of rusty maths under my belt, but am now returning to education and was hoping for a little help getting to grips with Information Entropy. Specifically, I'm trying to understand this formula:

Which as I understand it is a measure of information entropy for a random variable x with a probability mass function p(x), or said differently - the average number of bits needed to describe the random variable.

I understand the elements that make up this formula, but I don't see how it is the average number of bits needed to describe the random variable.

Let's take, for example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64]. The formula gives the information entropy as 2 bits; meaning that on average you will need 2 bits to represent which of the set has been chosen, correct?

If we assign these bit sequences to the set to describe each variable: [0,1,10,11,100,101,110,111], then we can calculate:

0.75 of the time, we'll need 1 bit:

0.75 * 1 = 0.75

Then the same logic for how often we'll need 2 and 3 bits:

0.1875 * 2 = 0.375
0.0725 * 3 = 0.1875

Add the 3 values: 0.75 + 0.375 + 0.1875 = 1.3125 bits on average.

I don't understand why my method gives a different answer and can only assume I am misunderstanding how exactly information entropy is measured.

As a sidenote, which may be the point I'm losing this on, I can understand that log(p(x)) gives how many bits I would need to describe X for a uniform distribution, but can't see how it is relevant for a random distribution.

For example,

log(1/4) = 2 bits, which makes perfect sense if I had 4 options each with an equal probability, but when the distribution is random, how is it ok to use this?

I'm a computer programmer with a bit of rusty maths under my belt, but am now returning to education and was hoping for a little help getting to grips with Information Entropy. Specifically, I'm trying to understand this formula:

Which as I understand it is a measure of information entropy for a random variable x with a probability mass function p(x), or said differently - the average number of bits needed to describe the random variable.

I understand the elements that make up this formula, but I don't see how it is the average number of bits needed to describe the random variable.

Let's take, for example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64]. The formula gives the information entropy as 2 bits; meaning that on average you will need 2 bits to represent which of the set has been chosen, correct?

If we assign these bit sequences to the set to describe each variable: [0,1,10,11,100,101,110,111], then we can calculate:

0.75 of the time, we'll need 1 bit:

0.75 * 1 = 0.75

Then the same logic for how often we'll need 2 and 3 bits:

0.1875 * 2 = 0.375
0.0725 * 3 = 0.1875

Add the 3 values: 0.75 + 0.375 + 0.1875 = 1.3125 bits on average.

I don't understand why my method gives a different answer and can only assume I am misunderstanding how exactly information entropy is measured.

As a sidenote, which may be the point I'm losing this on, I can understand that log(p(x)) gives how many bits I would need to describe X for a uniform distribution, but can't see how it is relevant for a random distribution.

For example,

log(1/4) = 2 bits, which makes perfect sense if I had 4 options each with an equal probability, but when the distribution is random, how is it ok to use this?

I think your sidenote and example explains whats going on. If p(X) = 1/4 for a uniform distribution, log(1/4) = 2 bits are required to encode the 4 possible values for X. So the assumption is that 2 bits are always required to encode an X value with p(X) = 1/4. But that's not a direct assumption. Rather it follows from the assumptions used to derive the formula: Continuity, Symmetry, Maximum and Additivity (these are listed in the above link). Specifically, I think Additivity drives this. I'd say if you want to add up subsystems, you have to treat X values with equal p(X) the same. See Entropy encoding - Wikipedia, the free encyclopedia, which comes from the above link.

I'm a computer programmer with a bit of rusty maths under my belt, but am now returning to education and was hoping for a little help getting to grips with Information Entropy. Specifically, I'm trying to understand this formula:

Which as I understand it is a measure of information entropy for a random variable x with a probability mass function p(x), or said differently - the average number of bits needed to describe the random variable.

I understand the elements that make up this formula, but I don't see how it is the average number of bits needed to describe the random variable.

Let's take, for example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64]. The formula gives the information entropy as 2 bits; meaning that on average you will need 2 bits to represent which of the set has been chosen, correct?

That checks out.

If we assign these bit sequences to the set to describe each variable: [0,1,10,11,100,101,110,111], then we can calculate:

This is not an admissible code, as knowlege of the code and the code-text
is not sufficient to recover the source-text. Consider the message 111100
this could be geneated from "he" or from "ddaa".

Now consider the code (maybe not the most efficient but admissible):

[0, 10, 110, 1110, 11110, 111110, 1111110, 11111110]

The average number of bits per charater is now 2.03125 >= 2 as required.

This is not an admissible code, as knowlege of the code and the code-text
is not sufficient to recover the source-text. Consider the message 111100
this could be geneated from "he" or from "ddaa".

Now consider the code (maybe not the most efficient but admissible):

[0, 10, 110, 1110, 11110, 111110, 1111110, 11111110]

The average number of bits per charater is now 2.03125 >= 2 as required.

RonL

I think I get what you're saying. Consider the following example, which I think is the simplest.

Suppose the characters are A, B and C with probabilities 1/2, 1/4, 1/4. According to entropy encoding, log(2) = 1 bit is needed to encode A, and log(4) = 2 bits are needed to encode B and C.

To achieve this requires the use of a prefix-free code to tell you whether a 1-bit or a 2-bit code is being used. Prefix-free means the code for any character is never a prefix for the code for any other character. So use A = 0, B = 10 and C = 11. 'ABC' = 01011. To decode, if 0 is read, the character is A. If 1 is read, then next bit is a 0 for B or 1 for C. This encoding is optimal.

If we tried to use fewer bits in the encoding, such as A = 0, B = 1, C = 00, the code for A is a prefix for the code for C and there would be confusion reading 'ABC' = 0100 = 'ABAA' too. Then a separator such as "," would be needed to separate the codes, so 'ABC' = 0,1,00 requiring extra bits for the separator.

The code given by CaptainBlack (the unary encoding), uses 0 as the separator. That encoding would be A = 0, B = 10, C = 110 here. It is not optimal.

If you have a uniform distribution on the characters, every code has the same length and you don't have to worry about prefixes.

It is interesting that this logic falls out of the entropy formula.

PS: Going back to TigerTom's original example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64], a prefix-free encoding for this would be

[0, 10, 110, 1110, 111100, 111101, 111110, 111111]

which uses the optimal average of 2 bits compared to the 2.03125 bits of the unary encoding given by CaptainBlack:

Thanks - beginning to make a lot more sense...

This is not an admissible code, as knowlege of the code and the code-text is not sufficient to recover the source-text. Consider the message 111100 this could be geneated from "he" or from "ddaa".

Ahhh, ok. After reading the links about entropy encoding and prefix codes, this is beginning to make sense.

Originally Posted by CaptainBlack

Now consider the code (maybe not the most efficient but admissible):

[0, 10, 110, 1110, 11110, 111110, 1111110, 11111110]

The average number of bits per charater is now 2.03125 >= 2 as required.

The book I am reading gives this code [0, 10, 110, 1110, 111100, 111101, 111110, 111111]. On reading this I had no clue how it was reached and thus how they decided it meant an average description length of 2 bits.

Now I can calculate the same answer myself, and see that the bit lengths match up with -log2(p(x)) for the probabilities and this is optimal code length according to Shannons source coding theorem.

Bits and pieces are starting to fit together!

Originally Posted by JakeD

PS: Going back to TigerTom's original example, the set [a,b,c,d,e,f,g,h] with probabilities [1/2,1/4,1/8,1/16,1/64,1/64,1/64,1/64], a prefix-free encoding for this would be

[0, 10, 110, 1110, 111100, 111101, 111110, 111111]

which uses the optimal average of 2 bits compared to the 2.03125 bits of the unary encoding given by CaptainBlack

Yes - the same as is used in my book.

Thanks so much for helping me along the way. I've not had to do any heavy or new maths for 7 years, since high school, so diving back in with this has meant I have so many questions it is hard to tie things together. Both of you have really helped.