There seem to be many different 'kinds' of entropy. I've come across two different concepts:

A) The XKCD example of correcthorsebatterystaple. It has 44 bits of entropy because four words randomly chosen from a list of 2048 words is 4 * log2(2048) = 44 bits of entropy. This I understand.

B) The Shannon entropy of the actual string i.e. the entropy is calculated based on frequencies of the letters/symbols. Applying the Shannon formula on correcthorsebatterystaple the result is 3.36 bits of entropy per character.
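For reference, both numbers can be reproduced directly. The sketch below computes the per-character figure from the empirical letter frequencies of this one string (not English-wide frequencies), and the 44-bit figure from the word-list assumption:

```python
import math
from collections import Counter

def shannon_bits_per_char(s: str) -> float:
    """Shannon entropy of the empirical character distribution of s."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

pw = "correcthorsebatterystaple"
print(round(shannon_bits_per_char(pw), 2))  # ≈ 3.36 bits per character (concept B)
print(round(4 * math.log2(2048)))           # 44 bits (concept A)
```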

It is important to realize the difference between the entropy of a set of possible outcomes, and the entropy of a particular outcome. A single toss of a fair coin has an entropy of one bit, but a particular result (e.g. "heads") has zero entropy, since it is entirely "predictable".
-- Wikipedia: Entropy (information theory)

I don't quite understand the distinction between the entropy of the toss (generation) and the entropy of the result (the string).

When is B used and for what purpose?

Which concept accurately reflects the entropy of the password?

Is there terminology to differentiate between the two?

True randomness could give us correctcorrectcorrectcorrect. Using
A we still have 44 bits. Using B the entropy would be the same as
that of correct. When is the difference between the two important?

If a requirement specifies that a string needs to have 20 bits of
entropy—do I use A or B to determine the entropy?

5 Answers

The Wikipedia article explains mathematical entropy, which isn't identical to what people mean when they talk about password entropy. Password entropy is about how hard a password is to guess under certain assumptions, which is a related but distinct idea.

A and B are not different concepts of password entropy; they just use different assumptions about how the password is built.

A treats correcthorsebatterystaple as a string of English words and assumes that words are randomly selected from a collection of 2048 words. Under these assumptions each word contributes exactly 11 bits of entropy, for a total of 44 bits for correcthorsebatterystaple.

B treats correcthorsebatterystaple as a string of characters and assumes that each character appears with the same probability as it does in English text. Under these assumptions correcthorsebatterystaple has about 84 bits of entropy.

So which definition you use really depends on what assumptions you make about the password. If you assume the password is an XKCD-style passphrase (and that each word indeed has a one-in-2048 chance of appearing in it), then A is the correct way to calculate entropy. If you don't assume the password is built from words, but do assume that each character appears with the same probability as it does in English text, then B is the correct way to calculate entropy.

In the real world, neither of these assumptions is correct. So if you have a "requirement that specifies that a string needs to have 20 bits of entropy" and it applies to user-generated passwords, it's very difficult to give a precise definition of entropy. For more on this see Calculating password entropy?.

If, on the other hand, you can use computer-generated strings (and a good cryptographically secure RNG), then each alphanumeric character (a-z, A-Z, 0-9) contributes almost 6 bits of entropy, since log2(62) ≈ 5.95.
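As a sketch of that last point (using Python's `secrets` module as an example of a cryptographically secure source), each character drawn uniformly from the 62 alphanumerics contributes log2(62) bits:

```python
import math
import secrets
import string

ALPHANUMERIC = string.ascii_letters + string.digits  # 62 characters

def random_password(length: int) -> str:
    # Each uniformly, independently chosen character adds log2(62) bits.
    return "".join(secrets.choice(ALPHANUMERIC) for _ in range(length))

bits_per_char = math.log2(len(ALPHANUMERIC))
print(round(bits_per_char, 2))        # 5.95
print(math.ceil(20 / bits_per_char))  # characters needed for >= 20 bits: 4
```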

What it means

Coin-toss entropy assumes that the result of one toss does not affect the result of the next. So each toss adds one bit of entropy.

Shannon entropy assumes that the value of the next letter is in fact partially determined by the value of the previous letter (and perhaps others). Facts like "h" often follows "t" and "e" often follows "h" are taken into consideration, so common patterns are assigned a lower entropy value. With an English dictionary, the string "the" would have a much lower Shannon entropy value than the string "exu".
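A rough sketch of why common transitions carry less information. The conditional probabilities below are invented purely for illustration; real values would come from an English corpus:

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

# Hypothetical transition probabilities, made up for illustration only:
p_h_after_t = 0.30   # 'h' is a very common successor of 't'
p_e_after_h = 0.40   # 'e' is a very common successor of 'h'
p_u_after_x = 0.01   # an unusual transition

print(round(surprisal_bits(p_h_after_t), 2))  # 1.74 bits -- low surprise
print(round(surprisal_bits(p_u_after_x), 2))  # 6.64 bits -- high surprise
```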

What it means to you

The direct implication of this with respect to passwords is pretty insignificant. The real (and only) important question with respect to passwords is this:

What dictionary is your password in?

That is to say, if you were to construct a list of potential passwords to conduct a brute-force attack, how big would the dictionary have to be to contain your password?

For example:

Your password is in the top 500 most commonly-used passwords

Your password is in the dictionary of lowercase English words

Your password is in the list of lowercase or title-case English words with a one-digit or two-digit suffix

Your password is in the list of random-case English words with haxor numeric substitutions (i.e. A=>4, L=>1, S=>5)

Your password is in the list of all strings of 8 characters or fewer using numbers and upper- and lower-case letters

All of the above are examples of frequently used real-world password cracking dictionaries.

In other words

The purpose of password complexity is to stand up against a brute-force attack. The size of the smallest available dictionary that contains your password determines the amount of time required to crack your password. We can guess at what dictionaries will be available to the attacker, but we can't know for certain. Therefore, as a proxy for dictionary size, we instead use entropy. It's a poor substitute because it doesn't reflect the actual attack mechanics, but it's potentially better than nothing.

Comparisons of passwords based on entropy calculations may potentially be fruitful, but you should be careful to avoid ascribing too much value to a number which is, in the end, only indirectly related to how well the password will hold up.

Let's say we have a random number generator with a provable output entropy of 3 bits per digit of output. That generator's "toss" entropy is 3 bits per digit. Now, let's say you run it for 20 digits and, despite the ridiculously small probability, every digit in the stream comes out as 6. The "toss" entropy is still 3 bits per digit, so 60 bits in total. The actual "result" entropy of the password is tiny; one could argue that it's as low as 3 or 4 bits.

The difference is that the "toss" entropy represents the expected entropy of the output, based on probabilistic modelling of the generator, whereas the "result" entropy represents the actual information entropy of the data it produced in a real case.
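One way to see the gap is to compare the generator's modelled entropy with a frequency-based measurement of the one output it happened to produce. (A sketch: the naive frequency formula assigns the all-sixes string 0 bits; the answer's "3 or 4 bits" reflects a looser description-length intuition rather than this formula.)

```python
import math
from collections import Counter

def empirical_entropy_bits(s: str) -> float:
    """Shannon entropy of the empirical symbol distribution of s, in bits."""
    counts = Counter(s)
    n = len(s)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

toss_entropy = 20 * 3                  # generator model: 3 bits/digit, 20 digits
result = "6" * 20                      # the astronomically unlikely output
print(toss_entropy)                    # 60 bits of "toss" entropy
print(empirical_entropy_bits(result))  # 0.0 -- one repeated symbol, zero surprise
```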

Thanks Polynomial. Are there (formal) terms as to distinguish between the different kinds of entropy? Or does it solely depend on context?
– mds, Oct 14 '12 at 8:53

I'd say "statistical entropy" is a good term for the provable calculated entropy of the generator or a set sample. Shannon entropy is a particular type of entropy calculation that falls under this category. In terms of a formal term for the real entropy of a sample, I'm not sure I know of one.
– Polynomial, Oct 14 '12 at 9:00

A single byte can contain up to 8 bits of entropy; this is the upper limit. As you learn more about your data, the amount of entropy in those 8 bits goes down. Oh, all your bytes are ASCII characters? That means the highest bit must be a 0; you're down to 7 bits of entropy. No control characters? Of the ASCII set, 0-31 are control characters (tab, enter, bell, end-of-file), which reduces the options further. Alphabetic, lower case only? Now you're reducing the available options hugely. An English word? There aren't many of those: an entire English word, RANDOMLY selected, may carry only about, say, 12 bits altogether, even though the word may be 5 or more characters long.
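The narrowing described above can be sketched as successive log2 calculations (the 4,096-word dictionary at the end is an assumed size, chosen only to reproduce the "about 12 bits" figure):

```python
import math

print(math.log2(256))            # 8.0 bits: any of 256 byte values
print(math.log2(128))            # 7.0 bits: ASCII only, top bit is 0
print(round(math.log2(95), 2))   # 6.57 bits: printable ASCII (no control chars)
print(round(math.log2(26), 2))   # 4.7 bits: lowercase letters only
print(math.log2(4096))           # 12.0 bits: a whole word from an assumed
                                 # 4,096-word dictionary, however long it is
```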

Passwords chosen by humans are even worse; not because the possibilities are fewer, but because some are chosen more frequently than others. If certain passwords are common, they are easier to guess, which affects the entropy. If 10% of your users have "password", that will reduce the entropy of your list of passwords, i.e. it becomes easier to guess.
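A sketch of how a skewed choice lowers entropy. The 1,000-password universe and the exact spread of the remaining 90% are illustrative assumptions; the source only gives the 10% "password" figure:

```python
import math

def entropy_bits(probs) -> float:
    # log2(1/p) form; skip zero-probability outcomes
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

n = 1000
uniform = [1 / n] * n                       # everyone picks uniformly at random
# 10% of users pick "password"; the rest spread evenly over the other 999
skewed = [0.10] + [0.90 / (n - 1)] * (n - 1)

print(round(entropy_bits(uniform), 2))  # 9.97 bits
print(round(entropy_bits(skewed), 2))   # 9.44 bits -- lower, hence easier to guess
```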

So the more information you have about a password, the lower you can calculate the entropy to be. The Shannon formula assumes the biases of natural language and calculates the entropy at 3.36 bits × 25 characters ≈ 84 bits. When you get the additional information (4 words, each from a list of 2048), that drops to 44 bits.

Look at it this way: if someone cracking this password knew only that it was some natural language, and then suddenly found out it was 4 words from a list of 2048 (and knew the list), they would suddenly find their work a LOT easier.

The thing you're missing here is that an entropy measurement is tied to some specific probability distribution. You can't talk about the entropy of a password without making some assumption, explicit or implicit, about which probability distribution the password is randomly drawn from, and that ultimately comes down to the nature of the process whereby the password is generated.

In the XKCD comic, Munroe is telling you that he generated the passphrase by successively and independently picking four words at random from a dictionary of about 2^11 words. This tells you precisely the probability distribution that the password is being drawn from: the discrete uniform distribution over a set of 2^44 distinct passphrases.

But then when you do this:

B) The Shannon entropy of the actual string i.e. the entropy is calculated based on frequencies of the letters/symbols. Applying the Shannon formula on correcthorsebatterystaple the result is 3.36 bits of entropy per character.

...you're choosing a different probability distribution than what Munroe used, so you are going to get a different estimate, one that assigns about 84 bits of entropy to the string (25 × 3.36).

If you're evaluating the entropy of passwords for which you don't know how they were generated, one fruitful and intuitive perspective is to adopt the idea that the probability distribution in play is the attackers' knowledge and hypotheses about how users select passwords. If you can form yourself a reasonable idea of what this distribution looks like, then a password's entropy is its message length in an optimal code for that distribution. This is, very roughly, the approach in the better password strength meters like zxcvbn (although they formulate it in terms of average number of guesses to hit the password).

When you look at the 44- vs. 84-bit estimates from your question in this light, what's going on is this: Munroe gets a much lower entropy estimate for the password because he assumes that the attacker has a 100% accurate hypothesis about how the passphrase was generated, so they can rule out a priori the huge number of strings that are not concatenations of four words from the dictionary. The Shannon entropy calculation you illustrate gets a much higher estimate because it isn't as "smart" a strategy for guessing XKCD-style passphrases. zxcvbn is cleverer, though: it estimates a password cracker would crack correcthorsebatterystaple in about 10^14.43696 guesses, which is about 14.43696 × log2(10) + 1 ≈ 49 bits of entropy. (The calculation is a base-10 to base-2 logarithm conversion, plus one bit to convert from a number of guesses to entropy.) That's a bit more than Munroe's estimate, but zxcvbn is coded to attack other kinds of passwords besides XKCD-style passphrases.
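The conversion in that last step can be sketched directly, using the exact log2(10) rather than a 3.3 approximation. The 10^14.43696 guess count is the zxcvbn figure quoted above; the extra bit reflects that an attacker expects to search about half of a 2^bits keyspace before hitting the password:

```python
import math

def guesses_to_bits(guesses: float) -> float:
    # log2 converts a guess count to bits; +1 because on average an attacker
    # searches about half of a 2^bits space before finding the password.
    return math.log2(guesses) + 1

zxcvbn_guesses = 10 ** 14.43696
print(round(guesses_to_bits(zxcvbn_guesses), 1))  # 49.0 bits

print(guesses_to_bits(2 ** 43))  # 44.0 -- 2^43 expected guesses for a
                                 # uniformly chosen 44-bit passphrase
```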