I learned about entropy in my information theory classes. The definition my textbooks gave was the average information content of a message sequence. But in one of the MIT videos on information theory, the professor said entropy is the information we do not have about the message. Are these the same thing? Another viewpoint is that entropy is the amount of disorder associated with a message. My doubts are the following:

If we say that the entropy of the English language is 2 bits and that of Hindi is 3, what does that convey?

Compressed data normally has less entropy. Does that mean the disorder associated with compressed data is less?

What is the significance of entropy related to genes (in biology) and music etc?

Lastly, how is the strength of a password related to entropy?

Any help or links or references are appreciated.

VERY IMPORTANT NOTE: Answers to my 2nd question are creating some confusion. First of all, I should have specified the compression method (lossy or lossless). I was discussing this question with a friend, and his argument is this (and I am happy to accept it, because it seems more logical than the other explanations here): losslessly compressed data and the original data have the same amount of entropy, since both have the same information content. But if the compression is lossy (like JPEG), the result has less entropy than the original data, because lossy compression loses some amount of information in the process. I invite clarifications/corrections in the form of an answer if anyone has a different opinion or can give a better answer.

3 Answers

The entropy of a message is a measurement of how much information it carries.

One way of saying this (per your textbook) is to say that a message has high entropy if each word (message sequence) carries a lot of information. Another way of putting it is saying that if we don't get the message, we lose a lot of information; i.e., entropy is a measure of the number of different things that message could have said. All of these definitions are consistent, and in a sense, the same.

To your first question: the entropy of each letter of the English language is about two bits, as opposed to a Hindi letter, which apparently carries about $3$.

The question this measurement answers is essentially the following: take a random sentence in English or Hindi, and delete a random letter. On average, how many possible letters might we expect to fill that blank? In English, there are on average about $2^2 = 4$ possibilities; in Hindi, about $2^3 = 8$.

EDIT: the simplest way to explain these measurements is that it would take, on average, $2$ yes/no questions to deduce a missing English letter and $3$ yes/no questions to deduce a missing Hindi letter. On average, there are in fact twice as many Hindi letters (on "average", you'd have $2^3=8$ letters) that can fill in a randomly deleted letter in a Hindi passage as English letters (on "average", you'd have $2^2=4$ letters). See also Chris's comment below for another perspective.
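To make the $2^H$ counting concrete, here is a small Python sketch (my own illustration, not from the answer): a blank that could be filled by $2^H$ equally likely letters carries exactly $H$ bits of entropy.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 4 equally likely letters for a blank -> 2 bits of entropy,
# 8 equally likely letters -> 3 bits.
print(entropy_bits([1 / 4] * 4))  # 2.0
print(entropy_bits([1 / 8] * 8))  # 3.0
```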

For a good discussion of this stuff in the context of language, I recommend taking a look at this page.

As for (2), I don't think I can answer that satisfactorily.

As for (3), there's a lot to be done along the same lines as language. Just as we measure the entropy per word, we could measure the entropy per musical phrase or per base pair. This could give us a way of measuring the importance of damaged/missing DNA, or the number of musically appealing ways to end a symphony. An interesting question to ask about music is: will we ever run out? (video).

Password strength comes down to the following question: how many passwords does a hacker have to guess before he can expect to break in? This is very much answerable via entropy.
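As a rough illustration of that connection (my own sketch, not part of the answer): a password with $H$ bits of entropy is one of roughly $2^H$ equally likely possibilities, so an attacker trying them one by one needs about $2^H/2$ guesses on average.

```python
import math

def expected_guesses(entropy_bits):
    # With 2**H equally likely passwords, an attacker hits the
    # right one after about half of them, on average.
    return 2 ** entropy_bits / 2

# Hypothetical example: a uniformly random 8-character
# lowercase password has 8 * log2(26) ~ 37.6 bits of entropy.
H = 8 * math.log2(26)
print(expected_guesses(H))  # about 1e11 guesses
```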

Letter, not word; that was a mistake. Another way to think of it: assuming both languages are equally compressed, Hindi sentences should be shorter to write.
– Omnomnomnom, Aug 22 '13 at 4:45

I like the answer, but I disagree with your explanation of the entropy of languages. It's not true that if the entropy of a language is N bits per letter, there are about N different letters that could fill in the blank if you deleted a random letter (for a start, it depends what units you measure entropy in: a dice roll has 2.59 bits of entropy, 1.79 nats, or 0.77 dits). Instead, you should think in terms of compression ratios. Since we use 8 bits to represent English text in ASCII, and it has 2 bits of entropy per letter, we should be able to compress it by a factor of 8/2 = 4.
– Chris Taylor, Sep 2 '13 at 15:49


The entropy of a phenomenon is a measurement of the average information content that we don't have, but would gain by observing said phenomenon.
– Omnomnomnom, Dec 15 '13 at 20:15


@Omnomnomnom I really appreciate the great effort done by you.
– dexterdev, May 6 '14 at 14:44

If you can compress a message, it means that it can be conveyed in a shorter way: some of the bits are not needed. In the compressed form, the message contains the same amount of information using fewer bits, so each bit carries more entropy (every remaining bit matters more to conveying the message).

The "disorder" you refer to isn't disorder in a physical sense: it's a fluffy way of talking about randomness. Chemists and physicists talk about entropy a lot, meaning how spread-out or randomly distributed is the energy in a system. It's related mathematically to the information-theoretic sense of entropy, but of course you need to think in terms of different analogies.

So, now think about randomness instead of disorder. A random sequence has high entropy because, unlike English, it's very difficult to guess the next symbol/number/letter in the sequence. When you compress data, you try to remove the redundancies. This raises the entropy per symbol, because it becomes harder to guess the next symbol. It also makes the data look more like random data. The more compressed the data are, the more random they look.

Similarly, a randomly chosen password is hard to guess. There are many equally likely possibilities for the password: it has high entropy. But if the password is much more likely to be a dictionary word, it has lower entropy, because some possibilities are much more likely than others.

To make a simpler example, let's take a "password" that consists of a single digit 0-9. If the password is equally likely to be any digit, then the Shannon entropy $-\sum_i p_i \log_2 p_i$ comes out as $-10\times(0.1 \log_2 0.1) = \log_2 10 \approx 3.3$ bits.

Now, let's say people choose the prettiest digit. Half of the time, they choose 0, and the other half of the time, they choose one of the other digits at random. That is, one outcome has probability $\tfrac{1}{2}$, and nine outcomes have probability $\tfrac{1}{18}$ each. This password is much easier to guess: if you guess 0, you'll be right half of the time. And this time, the Shannon entropy comes out as $\tfrac{1}{2}\log_2 2 + 9\cdot\tfrac{1}{18}\log_2 18 \approx 2.58$ bits. This is lower, reflecting how much easier the password is to guess.
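These two numbers are easy to check with a few lines of Python (my own sketch, just restating the arithmetic above):

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.1] * 10            # any digit equally likely
skewed = [0.5] + [1 / 18] * 9   # 0 half the time, the rest uniform

print(round(entropy_bits(uniform), 2))  # 3.32
print(round(entropy_bits(skewed), 2))   # 2.58
```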

Of course, just like randomness, entropy depends on how you're modelling the input: that is, what probability you think each symbol has. If an attacker didn't know that the password was 0 half the time, he'd still find it just as hard to guess as a random password.