Suppose you have a system X that can be in any of 8 states, which are all equally probable (relative to your current state of knowledge), and a system Y that can be in any of 4 states, all equally probable.

The entropy of X, as defined yesterday, is 3 bits; we'll need to ask 3 yes-or-no questions to find out X's exact state. The entropy of Y, as defined yesterday, is 2 bits; we have to ask 2 yes-or-no questions to find out Y's exact state. This may seem obvious since 23 = 8 and 22 = 4, so 3 questions can distinguish 8 possibilities and 2 questions can distinguish 4 possibilities; but remember that if the possibilities were not all equally likely, we could use a more clever code to discover Y's state using e.g. 1.75 questions on average. In this case, though, X's probability mass is evenly distributed over all its possible states, and likewise Y, so we can't use any clever codes.

What is the entropy of the combined system (X,Y)?

You might be tempted to answer, "It takes 3 questions to find out X, and then 2 questions to find out Y, so it takes 5 questions total to find out the state of X and Y."

But what if the two variables are entangled, so that learning the state of Y tells us something about the state of X?

In particular, let's suppose that X and Y are either both odd, or both even.

Now if we receive a 3-bit message (ask 3 questions) and learn that X is in state 5, we know that Y is in state 1 or state 3, but not state 2 or state 4. So the single additional question "Is Y in state 3?", answered "No", tells us the entire state of (X,Y): X=X5, Y=Y1. And we learned this with a total of 4 questions.

Conversely, if we learn that Y is in state 4 using two questions, it will take us only an additional two questions to learn whether X is in state 2, 4, 6, or 8. Again, four questions to learn the state of the joint system.

The mutual information of two variables is defined as the difference between the entropy of the joint system and the entropy of the independent systems: I(X;Y) = H(X) + H(Y) - H(X,Y).

Here there is one bit of mutual information between the two systems: Learning X tells us one bit of information about Y (cuts down the space of possibilities from 4 to 2, a factor-of-2 decrease in the volume) and learning Y tells us one bit of information about X (cuts down the possibility space from 8 to 4).

What about when probability mass is not evenly distributed? Yesterday, for example, we discussed the case in which Y had the probabilities 1/2, 1/4, 1/8, 1/8 for its four states. Let us take this to be our probability distribution over Y, considered independently - if we saw Y, without seeing anything else, this is what we'd expect to see. And suppose the variable Z has two states, 1 and 2, with probabilities 3/8 and 5/8 respectively.

Then if and only if the joint distribution of Y and Z is as follows, there is zero mutual information between Y and Z:

Z1Y1: 3/16

Z1Y2: 3/32

Z1Y3: 3/64

Z1Y3: 3/64

Z2Y1: 5/16

Z2Y2: 5/32

Z2Y3: 5/64

Z2Y3: 5/64

This distribution obeys the law:

p(Y,Z) = P(Y)P(Z)

For example, P(Z1Y2) = P(Z1)P(Y2) = 3/8 * 1/4 = 3/32.

And observe that we can recover the marginal (independent) probabilities of Y and Z just by looking at the joint distribution:

P(Y1) = total probability of all the different ways Y1 can happen= P(Z1Y1) + P(Z2Y1)= 3/16 + 5/16= 1/2.

So, just by inspecting the joint distribution, we can determine whether the marginal variables Y and Z are independent; that is, whether the joint distribution factors into the product of the marginal distributions; whether, for all Y and Z, P(Y,Z) = P(Y)P(Z).

In English, "After you learn Zj, your belief about Yi is just what it was before."

So when the distribution factorizes - when P(Y,Z) = P(Y)P(Z) - this is equivalent to "Learning about Y never tells us anything about Z or vice versa."

From which you might suspect, correctly, that there is no mutual information between Y and Z. Where there is no mutual information, there is no Bayesian evidence, and vice versa.

Suppose that in the distribution YZ above, we treated each possible combination of Y and Z as a separate event—so that the distribution YZ would have a total of 8 possibilities, with the probabilities shown—and then we calculated the entropy of the distribution YZ the same way we would calculate the entropy of any distribution:

You would end up with the same total you would get if you separately calculated the entropy of Y plus the entropy of Z. There is no mutual information between the two variables, so our uncertainty about the joint system is not any less than our uncertainty about the two systems considered separately. (I am not showing the calculations, but you are welcome to do them; and I am not showing the proof that this is true in general, but you are welcome to Google on "Shannon entropy" and "mutual information".)

What if the joint distribution doesn't factorize? For example:

Z1Y1: 12/64

Z1Y2: 8/64

Z1Y3: 1/64

Z1Y4: 3/64

Z2Y1: 20/64

Z2Y2: 8/64

Z2Y3: 7/64

Z2Y4: 5/64

If you add up the joint probabilities to get marginal probabilities, you should find that P(Y1) = 1/2, P(Z1) = 3/8, and so on - the marginal probabilities are the same as before.

But the joint probabilities do not always equal the product of the marginal probabilities. For example, the probability P(Z1Y2) equals 8/64, where P(Z1)P(Y2) would equal 3/8 * 1/4 = 6/64. That is, the probability of running into Z1Y2 together, is greater than you'd expect based on the probabilities of running into Z1 or Y2 separately.

Which in turn implies:

P(Z1Y2) > P(Z1)P(Y2)P(Z1Y2)/P(Y2) > P(Z1)P(Z1|Y2) > P(Z1)

Since there's an "unusually high" probability for P(Z1Y2) - defined as a probability higher than the marginal probabilities would indicate by default - it follows that observing Y2 is evidence which increases the probability of Z1. And by a symmetrical argument, observing Z1 must favor Y2.

As there are at least some values of Y that tell us about Z (and vice versa) there must be mutual information between the two variables; and so you will find—I am confident, though I haven't actually checked—that calculating the entropy of YZ yields less total uncertainty than the sum of the independent entropies of Y and Z. H(Y,Z) = H(Y) + H(Z) - I(Y;Z) with all quantities necessarily nonnegative.

(I digress here to remark that the symmetry of the expression for the mutual information shows that Y must tell us as much about Z, on average, as Z tells us about Y. I leave it as an exercise to the reader to reconcile this with anything they were taught in logic class about how, if all ravens are black, being allowed to reason Raven(x)->Black(x) doesn't mean you're allowed to reason Black(x)->Raven(x). How different seem the symmetrical probability flows of the Bayesian, from the sharp lurches of logic—even though the latter is just a degenerate case of the former.)

"But," you ask, "what has all this to do with the proper use of words?"

All [mortal, ~feathers, bipedal] are mortal.Socrates is a [mortal, ~feathers, bipedal].Therefore, Socrates is mortal.

Why, then, would you even want to have a word for "human"? Why not just say "Socrates is a mortal featherless biped"?

Because it's helpful to have shorter words for things that you encounter often. If your code for describing single properties is already efficient, then there will not be an advantage to having a special word for a conjunction - like "human" for "mortal featherless biped" - unless things that are mortal and featherless and bipedal, are found more often than the marginal probabilities would lead you to expect.

In efficient codes, word length corresponds to probability—so the code for Z1Y2 will be just as long as the code for Z1 plus the code for Y2, unless P(Z1Y2) > P(Z1)P(Y2), in which case the code for the word can be shorter than the codes for its parts.

And this in turn corresponds exactly to the case where we can infer some of the properties of the thing, from seeing its other properties. It must be more likely than the default that featherless bipedal things will also be mortal.

Of course the word "human" really describes many, many more properties - when you see a human-shaped entity that talks and wears clothes, you can infer whole hosts of biochemical and anatomical and cognitive facts about it. To replace the word "human" with a description of everything we know about humans would require us to spend an inordinate amount of time talking. But this is true only because a featherless talking biped is far more likely than default to be poisonable by hemlock, or have broad nails, or be overconfident.

Having a word for a thing, rather than just listing its properties, is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like "red", that we would use to send an entirely uncompressed description of our sensory experiences. But by the time you encounter a bug, or even a rock, you're dealing with nonsimple property collections, far above the primitive level.)

So having a word "wiggin" for green-eyed black-haired people, is more useful than just saying "green-eyed black-haired person", precisely when:

Green-eyed people are more likely than average to be black-haired (and vice versa), meaning that we can probabilistically infer green eyes from black hair or vice versa; or

Wiggins share other properties that can be inferred at greater-than-default probability. In this case we have to separately observe the green eyes and black hair; but then, after observing both these properties independently, we can probabilistically infer other properties (like a taste for ketchup).

One may even consider the act of defining a word as a promise to this effect. Telling someone, "I define the word 'wiggin' to mean a person with green eyes and black hair", by Gricean implication, asserts that the word "wiggin" will somehow help you make inferences / shorten your messages.

If green-eyes and black hair have no greater than default probability to be found together, nor does any other property occur at greater than default probability along with them, then the word "wiggin" is a lie: The word claims that certain people are worth distinguishing as a group, but they're not.

In this case the word "wiggin" does not help describe reality more compactly—it is not defined by someone sending the shortest message—it has no role in the simplest explanation. Equivalently, the word "wiggin" will be of no help to you in doing any Bayesian inference. Even if you do not call the word a lie, it is surely an error.

And the way to carve reality at its joints, is to draw your boundaries around concentrations of unusually high probability density in Thingspace.

You've forgotten one important caveat in the phrase "And the way to carve reality at its joints, is to draw your boundaries around concentrations of unusually high probability density in Thingspace."
The important caveat is : 'boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief' . All the imperfections in categorisation in existing languages come from that limitation.
Other problems in categorisation, like those of Antonio, in 'Merchant of Venise', or those of the founding fathers who wrote that it is 'self evident that all men were created equal' but at the same time were slave owners, do not come from language problems in categorisation, they would have acknowledged that Shylock or the slaves were human, but from different types of cognitive compromise.
Apart from that, it's an intellectually satisfying approach, and you might, if you persevere, end up with a poor relation to an existing language. Why a poor relation ? because it would lack nuance, ambiguity, and redundance, which are the roots of poetry.
It would also lack words for the surprising but significant improbable phenomenon. Like genius, or albino. Then again, once you get around to saying you will have words for significant low hills of probability, the whole argument blows away.
Bon courage.

"The important caveat is : 'boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief' ."

I would call the above an instance of the Mind Projection Fallacy, as you seem to be assuming a probability density that is a property of the physical world, and which we are trying to ascertain. But probabilities are properties of our minds (or ideal, perfectly rational minds), not of the exterior world, and a probability distribution is simply an entity to describe our state of information; it is "the best of our knowledge and belief".

tcpkac:
The important caveat is : 'boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief' . All the imperfections in categorisation in existing languages come from that limitation.

This strikes me as a rather bold statement, but "to the best of our knowledge and belief" might be fuzzy enough to make it true. Some specific factors that distort our language (and consequently our thinking) might be:
* Probability shifts in thingspace invalidating previously useful clusterings. Natural languages need time adapt, and dictionary writers tend to be conservative.
* Cognitive biases that distort our perception of thingspace. Very on topic here, I suppose. ^_^
* Manipulation (intended and unintended). Humans treat articulations from other humans as evidence. That can go so far that authentic contrary evidence is explained away using confirmation bias.

Other problems in categorisation, [...] do not come from language problems in categorisation, [...] but from different types of cognitive compromise.

Well, lack of consistency in important matters seems to me to be a rather bad sign.

It would also lack words for the surprising but significant improbable phenomenon. Like genius, or albino. Then again, once you get around to saying you will have words for significant low hills of probability, the whole argument blows away.

I don't think so. Once the most significant hills have been named, we go on and name the next significant hills. We just choose longer names.

Will, thingspace may not need a distance metric depending on how you draw your boundaries, which are not necessarily surfaces containing volumes of constant density. For example, a class in Naive Bayes / neural network of type 2 also slices up thingspace. More about this shortly. But if you're interested in the general topic, I believe that in the field of statistical learning, for algorithms that actually do depend on distance metrics, the standard cheap trick is to "sphere" the space by making the standard deviation equal 1 in all directions. An ad-hoc technique but apparently a useful one, though it has all the flaws you would expect.

While it is true that you don't need a metric to draw a boundary, I personally need a metric to be able to envision high concentrations of probability density.

A concentration implies a region, which implies a metric space. While your sphering of the space normalises it somewhat and deals with part of the trouble, it still skips over the question of metric space. For example is 2, 2, 2 closer to 1, 1, 1 than 4, 1, 1? If that was a co-ordinate of a position in three dimensional space you would want to use the euclidean metric i.e. d = ((x2 - x1)^2 + (y2 - y1)^2+ (z2 - z1)^2)^1/2 or you that might not be appropriate and you would have to use city block distances and put them equally far away (if they were average energy usage, weight and how many copies of the gene for green eyes it had).

You said: "Since there's an "unusually high" probability for P(Z1Y2) - defined as a probability higher than the marginal probabilities would indicate by default - it follows that observing Z1 is evidence which increases the probability of Y2. And by a symmetrical argument, observing Y2 must favor Z1."

But I think what you meant was "Since there's an "unusually high" probability for P(Z1Y2) - defined as a probability higher than the marginal probabilities would indicate by default - it follows that observing Y2 is evidence which increases the probability of Z1. And by a symmetrical argument, observing Z1 must favor Y2."

Nothing you said was untrue, but the implication of what you wrote doesn't match up with the example you actually gave just above that text.

Hopefully not taking away anyone's fun here, but to reconcile Raven(x)->Black(x) but not vice versa, what this statement wants to say, letting P(R) and P(B) be the probabilities of raven and black, respectively, is P(R|B)=0 and P(B|R)=1, which gives us that

P(R|B) = 0
P(RB)/P(B) = 0
P(RB) = 0

and

P(B|R) = 1
P(BR)/P(R) = 1
P(BR) = P(R)

But of course this leads to a contradiction, so it can't really be true that Black(x)-/->Raven(x), can it? Sure, because what is really meant by implies (-/->) is not P(B|R) = 0 but P(B|R)<1. But in logic we often forget this because anything with a probability less than 1 is assigned a truth value of false.

Logic has its value, since sometimes you want to prove something is true 100% of the time, but this is generally only possible in pure mathematics. If you try to do it elsewhere you'll get exceptions (e.g. albino ravens). So leave logic to mathematicians; you should use Bayesian inference.

I agree that it makes no sense, but as I was writing the comment I figured I would take you down the wrong path of what someone might naively think and then correct it. I think that someone who was overly trained in logic and not in probability might assume that if Raven(x)-->Black(x) being true leads to P(B|R) = 1, they might reason that since the reverse implication Black(x)-->Raven(x) is false, it leads to P(R|B) = 0. But based on the comments above, maybe only an ancient Greek philosopher would be inclined to make such a mistake.

I'd hope they weren't so hopelessly 'overtrained' that they wouldn't be able to step back from their P's and parentheses and ask themselves whether they really think that a black object cannot be a raven.

Having a word [...] is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like "red" [...]).

Remember that mutual information is symmetric. If some things have the property of being red, then "red" has the property of being a property of those things. Saying "blood is red" is really saying "remember that visual experience that you get when you look at certain roses, apples, peppers, lipsticks and English buses and phone booths? The same happens with blood." If I give you the list above, can you find ("infer") more red things? Then "red" is a good word.

But do note that this is a dual sense to the one in which "human" is a good word. Most of the properties of humans are statistically necessary for being human: remove any one of them, and the thing is much less likely to be human. "Human" is a good word because these properties are positively correlated. On the other hand, most of the red things are statistically sufficient for being red: take any one of them, and the thing is much more likely to be red. "Red" is a good word because these things are negatively correlated - they are a bunch of distinct things with a shared aspect.

So, hold on, if you wrote this in 2008, why the hell did you keep writing this blog instead of publishing at least one of what were eventually numerous papers on information-theoretic clustering with mutual-information measurements? Some of those didn't even come out until 2012 or 2014 or so, so it's not like you wouldn't have had time to publish a solid revision to MI-clustering if you came up with a good algorithm.