The first line shows the count of unique words found in the dictionary, rounded down to a power of two (2^entropy). The second line shows the initial count of unique words, and the theoretical entropy computed from it.

Each output line begins with two values. The first is the Shannon entropy; I'm not clear about its meaning and usage here. The second is based on the number of characters in the whole line, assuming a probability of 1/26 for each of them.

Computing entropy reduction

The answer from David Cary confirms that this calculation is very approximate and hard to pin down, but it gives some good estimates and a way of thinking about it:

This makes it easier to see how human selection would reduce entropy. For example, if I dislike one, two, or up to 10 letters, a final entropy of 4 bits per letter (out of 26 letters, since log2(26 - 10) = 4) is still maintained...

So by extension, if out of a pool of 56947 words I exclude no more than 24179, that leaves 32768 = 2^15 words, so the final entropy of 15 bits/word is still maintained. However:

If the human won't choose, for example, words longer than a certain number of characters, the number of words excluded may be as low as 2186. Worse: if the human refuses to use words longer than 7 characters (with my personal dict file), this will drop the overall entropy:
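The arithmetic behind "entropy is still maintained" can be checked directly; a quick sketch using the counts from this question:

```python
import math

# Entropy per symbol is log2(number of equally likely choices).

# Letters: excluding up to 10 of the 26 letters still leaves 16,
# so log2(16) = 4 bits per letter is maintained.
print(math.log2(26 - 10))  # 4.0

# Words: excluding 24179 of the 56947 words leaves exactly 32768 = 2^15,
# so log2(32768) = 15 bits per word is maintained.
total_words = 56947
excluded = 24179
print(math.log2(total_words - excluded))  # 15.0
```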

2 Answers

Entropy is a measure of the password generation process. Suppose that you have a list of 32768 words to choose from. You select 5 words randomly and uniformly from that list (the words are chosen independently of each other, so you might get the same word twice). Then you have exactly 75 bits of entropy. Your password generation process can result in precisely 2^75 distinct passphrases, which all have exactly the same probability of being selected. That's 75 bits of entropy, without any ambiguity.
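A minimal sketch of such a process, using Python's secrets module for uniform selection; the word list here is a placeholder standing in for a real 2^15-word dictionary file:

```python
import math
import secrets

# Placeholder word list of exactly 2^15 entries; in practice you
# would load a real dictionary file of that size.
wordlist = [f"word{i:05d}" for i in range(2**15)]

# Pick 5 words independently and uniformly (duplicates allowed).
passphrase = " ".join(secrets.choice(wordlist) for _ in range(5))

# Entropy of the *process*: 5 * log2(32768) = 75 bits,
# regardless of which particular passphrase comes out.
entropy_bits = 5 * math.log2(len(wordlist))
print(passphrase)
print(entropy_bits)  # 75.0
```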

That some words are short and others are long, or some are plurals, or some are sexier than others, has no influence whatsoever on your entropy. Entropy is a property of the generation process, and your generation process pays no attention to the length or sexiness of words. Average human users, when left to their own meat-based devices (their brains), tend to select some words more often than others; they are bad at uniform randomness. For them, computing the actual "entropy" is hard because we don't really know how biased they are. But this has no bearing on your generator, which does not use a brain, but /dev/random. /dev/random does not find some words more attractive than others; its tastes are simpler.

(Speaking of which, you should use /dev/urandom, not /dev/random. See this.)

No, I'm pretty sure that in my second sample, using 450 bits to let a human choose 5 words would ultimately drop the entropy below 75 bits. That seems paradoxical, but... there must exist a mathematical demonstration.
– F. Hauri Apr 7 '13 at 21:07


While I could use this to generate a passphrase only occasionally (no more than 10 per day), needing only 10 bytes each time, is there a good reason not to use /dev/random and to prefer /dev/urandom?
– F. Hauri Apr 7 '13 at 21:13

/dev/random may block for indefinite amounts of time, which is always inconvenient, and it does so because whoever implemented it thinks of entropy as if it was gasoline, which is, at best, a flawed way of thinking.
– Tom Leek Apr 8 '13 at 12:25

Isn't "real" entropy (as opposed to pseudo-random made-up entropy) similar to gasoline, in that you can only "pump" a finite amount of it out of your system before you need to go searching for more, which can take some time?
– Johnny May 13 '13 at 21:10

As Tom Leek correctly noted, entropy is a property of the generation process; it's not a property of any particular passphrase generated by the process.

What's the minimal length for one word? Is 4 chars sufficient? How to compute entropy for a 4-letter word?

When you pull 15 bits from a random number generator, and use those bits to uniformly select one of 2^15 unique words from a dictionary, every word has exactly 15 bits of entropy -- it doesn't matter how many characters long it is.

Yes, using a dictionary that only has 4-letter English words would result in passphrases with less than 15 bits of entropy per 4-letter word -- that's another way of saying there are fewer than 2^15 four-letter words in an English dictionary.
But that's irrelevant in this case -- there's no reason to arbitrarily exclude short words from your dictionary.

Likewise, there are fewer than 2^15 words that start with "be" in an English dictionary -- so words that start with "be" have less than 15 bits of entropy per word.
Likewise, that fact is irrelevant -- there's no reason to arbitrarily exclude words starting with "be" from your dictionary.

... run this several time to obtain some choice:
... and choose from this bunch 5 words, which is human
... This will reduce entropy ...

Yes, if you let a human reject some words, then it will reduce entropy.

One way to estimate this is to assume that there is some "real" list of words that the human will accept, and that he will reject all others.
If that list has only 2^N words, then each word in the actual human-selected password will contribute (at best) only N bits of entropy.
Alas, it is difficult to find out what "N" really is.
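The effect of an accepted subset can be sketched in a few lines; the subset size 2^11 below is purely an assumption for illustration, since N is unknown:

```python
import math

# If the human silently rejects every word outside some "acceptable"
# subset of the dictionary, the effective entropy per word is
# log2(size of the accepted subset), not log2(dictionary size).
dictionary_size = 2**15  # 32768 words offered
accepted_size = 2**11    # hypothetical: the human accepts only 2048 of them

per_word_before = math.log2(dictionary_size)  # 15.0 bits/word
per_word_after = math.log2(accepted_size)     # 11.0 bits/word
print(per_word_before, per_word_after)
print(5 * per_word_after)  # a 5-word passphrase: 55.0 bits, not 75
```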

Sometimes the reason humans reject certain words is because they don't know how to spell them.
One way to avoid this involves a much shorter dictionary that only includes common, easy-to-spell words.
For example, compared to the best case of 15 bits/word * 5 words = 75 bits, you get slightly more entropy (77 bits) by uniformly picking 7 words from a much shorter dictionary of 2^11 common words, such as the S/Key 2048-word dictionary.
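The comparison works out as follows (dictionary sizes as in the S/Key example above):

```python
import math

# 5 words from a 2^15-word dictionary vs. 7 words from a 2^11-word one.
five_from_long = 5 * math.log2(2**15)   # 5 * 15 = 75 bits
seven_from_short = 7 * math.log2(2**11) # 7 * 11 = 77 bits
print(five_from_long, seven_from_short)
```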

Perhaps a better option is for the computer to pick 5-word passphrases as an all-or-nothing list.
If you show your user 8 such passphrases and force the user to pick one of them (rather than mixing-and-matching any 5 words from a list of 30 words), it can be shown that: in the worst case, this reduces the strength of the selected passphrase by 3 bits (to 72 bits); in the best cases (the user always picks the first one, or uses 3 fair coin flips to choose one of the 8), the selected passphrase retains the full entropy of 15 bits * 5 words = 75 bits.
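A sketch of this all-or-nothing scheme, again with a placeholder word list standing in for a real 2^15-word dictionary:

```python
import math
import secrets

# Placeholder dictionary of exactly 2^15 entries.
wordlist = [f"word{i:05d}" for i in range(2**15)]

def make_passphrase():
    """Generate one complete 5-word passphrase, uniformly at random."""
    return " ".join(secrets.choice(wordlist) for _ in range(5))

# Offer 8 whole passphrases; the user must take one as-is.
candidates = [make_passphrase() for _ in range(8)]
for i, p in enumerate(candidates, 1):
    print(i, p)

# Worst case: a fully predictable choice among 8 costs log2(8) = 3 bits,
# leaving 75 - 3 = 72 bits.
print(75 - math.log2(8))  # 72.0
```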

1/3 words are plurals

As long as each word in your dictionary of 2^15 words is unique and uniformly chosen, it is irrelevant whether it is plural.
As in the above "be" case, there is no reason to arbitrarily exclude words ending in "s" from your dictionary.

(And I doubt that 1/3 of the words are really true plurals -- your quick test catches words like "abacus" and "bus" and "boss" that are not true plurals).
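A naive trailing-"s" check, like the quick test mentioned, flags exactly such non-plurals:

```python
# Counting "plurals" by checking for a trailing "s" overcounts:
# all of these end in "s", but only "cats" and "dogs" are plurals.
words = ["abacus", "bus", "boss", "cats", "dogs"]
flagged = [w for w in words if w.endswith("s")]
print(flagged)
```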