Tag Archives: information

Last time we talked about information-entropy, which is a way of quantifying the information in a given string of symbols. Each symbol has a certain uncertainty associated with it, which may be low if the symbol can be one of two things, or high if the symbol can be one of a hundred things. Summing over all the possible outcomes, each weighted by its probability, gives the entropy, which is larger for more uncertain situations and smaller for more certain situations. So the more you stand to learn from the string, the higher its uncertainty and entropy, and the larger its information content.

But let’s get less abstract and think about examples. Say that we flip a coin and see if it lands with the head or tail side up. An evenly weighted coin will have an equal probability of being heads or tails. But perhaps an uneven coin is more likely to be heads, or maybe gnomes have swapped in a coin that’s heads on both sides. The results of the heads-only coin toss have the lowest possible entropy (zero) because there is only one possible result, and no information is gained by examining the result. The weighted coin has higher entropy, because two results are now possible. And the evenly weighted coin not only has more results possible than the heads-only coin, those results are also maximally unpredictable, so its entropy is the highest of the three. If we do a series of coin tosses, we get a string of coin toss results. A longer string will have more entropy than a shorter string, except in the case where the coin is heads-only, where the entropy stays at zero no matter how many times we toss. Now if we replace the binary coin toss with choosing a letter from the alphabet, which has 26 possibilities instead of two, we have significantly increased the information-entropy!
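To put rough numbers on the coin examples, here is a minimal sketch using the standard Shannon formula, where the entropy of one draw is the sum of -p × log2(p) over the possible outcomes. The function name `shannon_entropy` and the 90/10 weighting of the uneven coin are assumptions made for illustration; the post itself doesn’t say how lopsided the coin is.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a single draw: sum of -p * log2(p) over outcomes."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

two_headed = shannon_entropy([1.0])        # only one possible result
weighted = shannon_entropy([0.9, 0.1])     # hypothetical 90/10 coin
fair = shannon_entropy([0.5, 0.5])         # evenly weighted coin
letter = shannon_entropy([1 / 26] * 26)    # one letter, all 26 equally likely

def string_entropy(probabilities, length):
    """Entropy of a string of independent draws: it grows with the length."""
    return length * shannon_entropy(probabilities)

print(f"two-headed coin: {two_headed:.3f} bits per toss")
print(f"weighted coin:   {weighted:.3f} bits per toss")
print(f"fair coin:       {fair:.3f} bits per toss")
print(f"random letter:   {letter:.3f} bits per letter")
print(f"ten fair tosses: {string_entropy([0.5, 0.5], 10):.3f} bits")
```

The ordering is the point: zero for the heads-only coin, something in between for the weighted coin, exactly one bit for the fair coin, and several bits for a letter chosen at random.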

Of course, sometimes it is useful to impose some rules on a string of symbols, for example the rules associated with a specific language. Doing so will reduce the uncertainty, and thus the entropy and information content, of the string. This is another way of saying that a string of letters that spells out words in English has a lower entropy than a string of random letters of the same length, because in English not all the letters are equally probable, and each letter affects the probability of the letters that follow it. It’s the equivalent of weighting the coin! In fact, the trick of data compression is to reduce the number of symbols used in a string without reducing the entropy (and thus the information content) of the message. Data compression is not possible when each symbol in a message is maximally surprising, which explains the difficulty of compressing things like white noise.
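One way to see the compression point for yourself is to hand a general-purpose compressor two messages of the same length, one structured and one random. This is only a sketch: the repeated sentence is a hypothetical stand-in that is far more predictable than real English, and zlib is just a convenient compressor from the Python standard library, not anything specific to Shannon’s work.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000               # message length in bytes

# Maximally surprising message: every byte value equally likely, no patterns.
noise = bytes(rng.getrandbits(8) for _ in range(n))

# Highly structured message: a hypothetical stand-in for rule-bound text.
sentence = b"the quick brown fox jumps over the lazy dog "
structured = (sentence * (n // len(sentence) + 1))[:n]

def compression_ratio(data):
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)

noise_ratio = compression_ratio(noise)
structured_ratio = compression_ratio(structured)

print(f"random bytes:    {noise_ratio:.3f} of original size")
print(f"structured text: {structured_ratio:.3f} of original size")
```

The structured message shrinks to a tiny fraction of its original size, while the random bytes come out essentially no smaller than they went in.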

Now, what if instead of a sequence of coin tosses or a string of letters, you instead had a collection of atoms that could be in different states? Consider a box filled with a gas, where each atom of the gas can be described by its position in the box and its momentum. The entropy of any given configuration of atoms would then be a sum over all the possible states of the atoms, weighted for probability, the same way the entropy of a string was a sum over the possible symbols in the string. Entropy is still a measure of uncertainty, but in this physical example the question is how many arrangements of atoms in specific states can make a configuration that has the same measurable properties, such as pressure, temperature, and volume. For example, if the gas is evenly distributed throughout the box, we can make a wide variety of changes to the individual atom positions and velocities without changing the measurable properties of the gas. Thus the entropy is high because of the large number of atomic arrangements that could yield the same result, which means there is a high uncertainty in what any individual atom is doing. In contrast, if the gas atoms are confined to a very small region of the box, there are fewer positions available to the atoms, and thus a smaller number of indistinguishable arrangements. So the entropy is lower, because there are fewer ways to have the same number of atoms all in a corner of the box.

The technical way to describe this formulation of entropy is that each atom has a number of microstates available to it, and all the atoms together have measurable properties (pressure, volume, temperature, etc.) that define the macrostate. The entropy of any given macrostate is determined by the number of microstate configurations that could produce that macrostate (in Boltzmann’s formula, it is proportional to the logarithm of that number), which means it’s still about uncertainty. But you can also see that entropy is a form of state-counting: higher entropy macrostates can be attained in a larger number of ways than lower entropy macrostates. This means that in general, higher entropy states are more probable. If there is one way to pack all the atoms into the corner of a box, but there are a million ways to evenly distribute the atoms in the box, then the chances of finding the atoms in the corner are about one in a million. And since those atoms are constantly moving and exploring new microstates, over time they will tend to the highest entropy macrostates. This is where the Second Law of Thermodynamics comes from, which says that in any isolated system, total entropy never decreases, and tends to increase over time toward a maximum value.
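State-counting is easy to try in a toy model. The sketch below assumes something much simpler than a real gas: each of N atoms is either in the left or the right half of the box, and the macrostate is just how many atoms are on the left. The value N = 20 and the function names are hypothetical choices for illustration, and the entropy is reported as ln(W), leaving out Boltzmann’s constant.

```python
import math

N = 20  # hypothetical number of atoms in the toy box

def microstates(n_left, n_total=N):
    """Number of ways to choose which n_left of the atoms sit in the left half."""
    return math.comb(n_total, n_left)

def entropy(n_left, n_total=N):
    """Boltzmann-style entropy ln(W), in units of Boltzmann's constant."""
    return math.log(microstates(n_left, n_total))

total = 2 ** N  # every atom independently left or right

p_all_left = microstates(N) / total   # all atoms crowded into one half
p_even = microstates(N // 2) / total  # atoms split evenly

print(f"all on the left: W = {microstates(N)}, S = {entropy(N):.2f}, p = {p_all_left:.2e}")
print(f"even split:      W = {microstates(N // 2)}, S = {entropy(N // 2):.2f}, p = {p_even:.3f}")
```

With only 20 atoms, the crowded macrostate can be reached in exactly one way out of 2^20 arrangements, which is roughly the one-in-a-million odds described above, while the even split can be reached in 184,756 ways.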

The idea of entropy as state-counting came from Ludwig Boltzmann, more than fifty years before information theory was developed. Shannon called his measure information-entropy because of the resemblance to entropy as defined in collections of atoms, which is the basis of statistical mechanics. Entropy is a measure of information and uncertainty, but also a way to count the number of states, and a measure of the relative ordering of a system.

For the end of the year, something a little different: let’s talk about information. Information is one of those concepts that everyone feels familiar with and few people examine carefully, so take a step back to think about what it is. A book, an email, or even a science blog could each be said to contain information, and in fact to have the primary purpose of conveying information from one person or place to another. The physical properties of an object aren’t classically considered to be information, meaning that the words inside the book are information while the fact that the cover is red leather is a property rather than information. How do we make this distinction? Well, the properties of the book cover have to be separately measured by each observer, but the contents of the book constitute a message, a sequence of symbols or signals that have a known meaning. That meaning constitutes information!

And the question of how information is transmitted from one place to another is deeply relevant for all of human communication. This is even more true with the rise of the Internet and the struggle to understand how to compress information without losing important pieces. Data compression is a problem of information, as is data storage. But even earlier, what constitutes information was being studied as part of the World War II cryptography effort, because of the importance during wartime of sending and receiving messages. It was just after the war that Claude Shannon wrote a paper which effectively founded the research field of information theory, which focused on how to encode a message to pass between two people. Shannon’s important insight was to think of information probabilistically, for example by looking at the set of all word lengths to understand their distribution and thus, the difference between trying to encode a short word and a long word.

Initially word length might seem like a rather mechanical way to classify words, but it turns out to be deeply related to the information carried within the word itself. In English, many of the most common words are short words (see for example the word list for Basic English). But as you’ll find out when you start trying to write in Basic English, the longer words you can’t use often require a lot of short words to explain, which means that long words tend to, on average, carry more information than short words! And if we return to the problem of coded messages, longer messages can carry more information than shorter messages. Put that way, it sounds like common sense, but it’s a key insight into how information works: the more symbols you can use in your message, the greater the potential information content of that message. Shannon’s paper introduced the term ‘bit’ for a unit of information within a message (he credited the word to John Tukey), which may sound familiar as there are eight bits in a byte, just over a million (2^20, or 1,048,576) bytes in a megabyte, and so on in the computer world.

As for the uncertainty in a word or message, which relates directly to the number of symbols or bits, Shannon decided to call that information-entropy. A longer message has more potential combinations of symbols, more uncertainty, and is thus, on average, likely to contain more information than a shorter message. Information-entropy directly measures that information potential, just by counting bits. And if you recognize the word entropy, well, information-entropy is indeed related to thermodynamic entropy, which we’ll explore further in the New Year!
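As a rough illustration of how counting symbols turns into counting bits, here is a small sketch. It assumes every symbol is equally likely and independent of its neighbours, which gives an upper limit on information content rather than the entropy of real English; the function names are made up for this example.

```python
import math

def possible_messages(alphabet_size, length):
    """How many distinct strings of this length the alphabet allows."""
    return alphabet_size ** length

def max_bits(alphabet_size, length):
    """Upper limit on information content in bits: log2 of the number of messages."""
    return length * math.log2(alphabet_size)

# Eight coin tosses (a two-symbol alphabet) are exactly eight bits, or one byte.
print(possible_messages(2, 8), "possible strings,", max_bits(2, 8), "bits")

# A longer message has more possible combinations, and so more potential information.
for length in (1, 5, 10):
    combos = possible_messages(26, length)
    print(f"{length:2d} letters: {combos} combinations, up to {max_bits(26, length):.1f} bits")

# The megabyte figure quoted above: 2 to the power 20 bytes.
print(2 ** 20)
```

Each added symbol multiplies the number of possible messages, but only adds a fixed number of bits, which is why the logarithm is the natural way to measure information.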