The beauty of math; the humor of stupidity.

An Introduction to Information Theory (updated from Blogspot)

Back when I first started this blog on blogspot, one of the first things I did was write an introduction to information theory. It’s not super deep, but it was a decent overview – enough, I thought, to give people a good idea of what it’s about, and to help understand why the common citations of it are almost always misusing its terms to create bizzare conclusions, like some [“law of conservation of information”,][conservation]
[conservation]: http://en.wikipedia.org/wiki/Law_of_conservation_of_information “wikipedia on Dembski’s law of CoI”
This is a slight revision of that introduction. For the most part, I’m just going to clean up the formatting, but once I’m going through it, I’ll probably end up doing some other polishing.
****
So what is information theory?
================================
Information theory is a relatively new field of mathematics that tries to characterize what information is in a quantifiable way. It’s an area of math that was almost totally obscure until not too long ago, and one which is frequently misunderstood, even by people who mean well.
A bit of history
—————–
Modern information theory comes from two very different sources: Shannon’s information theory, and Kolmogorov/Chaitin algorithmic information theory. Fortunately, they both converge, mostly, on the same place.
### Shannon’s Information Theory
The more famous branch of IT is Claude Shannon’s information theory, as presented in his infamous paper, [Communication in the Presence of Noise][shannon]. Shannon’s interest came from his work at Bell Labs for AT&T, which wanted a way of being able to figure out how much wire they needed to lay for the telephone network.
AT&T was interested because for a telephone company, laying wire in an incredibly expensive operation. The wire itself (especially back in the ways when it was bundles of copper) is quite expensive; the process of running and burying the cables is even more expensive. So ideally, as a phone company, you want to run the cables once; you want to lay enough that you’ll never have to come back and do it again; and you want to lay the smallest amount that will do the job, since the materials are so expensive.
The problem that they wanted Shannon to help solve was, how could they determine how much wire did they actually need to lay? It’s actually an astonishingly tricky question. The less they laid, the less it cost them in the short term. But they *knew* that the number of phone lines was going to increase dramatically – so if they were cheap, and laid too little wire, when the number of phones exceeded the capacity of what they had already laid, they’d have to go back and lay more. So they wanted to be able to make a good prediction of the minimum amount of wire they could lay that would meet their needs both at the time it was laid, and in the future.
But there’s a big trick to that. First: how much information can a single wire carry? And when you bundle a bunch of wires together, how much can you pump over a single wire without it interfering with signals on it’s neighbors in the bundle?
That’s what Shannon was trying to understand. He was trying to find a mathematical model that he could use to describe information transmission, including the effects of imperfections in transmission, including the introduction of noise and interference. He wanted to be able to quantify information in a meaningful way that would allow him to ask questions like: “At what point does noise in the line start to eliminate the information I want to transmit?”.
The answer is just fascinating: a given communication channel has a capacity for carrying information; and a given message has a quantity of information in it. Adding noise to the signal *adds* information to message *until* the capacity of the channel is reached, at which point information from the noise will start to *replace* information from the message; at that point, you can say that information from the message will start to be lost.
Shannon called the information content of a signal it’s *entropy*, because he saw a similarity between his information entropy and thermodynamic entropy: in a communicating system, entropy *never* decreases: it increases until the capacity of the channel is reached, and then it stays content. You can’t reduce the amount of information in a message. (There are various rumours that in fact this choice of terminology was actually suggested by von Neumann himself.)
That naming led directly to the most common misuse of information theory. But that’s a post for another day.
### Algorithmic Information Theory
The other main branch of information theory came from a very different direction. The two pioneers were Andrey Kolmogorov, and Greg Chaitin. Kolmogorov and Chaitin didn’t know each other at the time they independently invented the same mathematical formalization (in fact, Chaitin was a teenager at the time!), a fact which led to some friction.
In the interests of honesty, I’ll say that Greg Chaitin works in the same building that I do, and I’ve met him a number of times, and been greatly impressed by him, so my description is heavily influenced by Greg.
Anyway – algorithmic information theory grew out of some of the fundamental mathematics being done at the beginning of the 20th century. There was an effort to create a single, fundamental, set of mathematical axioms and inference rules, which was both complete (every true statement was provably true), and consistent (every well-formed statement was either true or false). This effort failed, miserably; the nail in the coffin was Gödel’s incompleteness theory. Gödel presented an extremely complicated proof that showed, essentially, that no formal system could be both complete and consistent. (See my recent article about [Extreme Math][extreme] for more about this.)
[extreme]: http://scienceblogs.com/goodmath/2006/06/extreme_math_1_1_2.php “1+1=2”
Most people saw Gödel’s proof, did the equivalent of saying “Oh, shit!”, and then pretended that they hadn’t seen it. It does mean that there are fundamental limits to what you can do using mathematical formalization, but for the most part, they don’t affect you in normal day to day math. (Sort of the way that car designers didn’t change the way they build cars because of relativity. Yes, it showed that the fundamental physical model that they were using was wrong – but at the speeds that cars move, that wrongness is so small that it just doesn’t matter.)
But for people in the early stages of what would become computer science, this was a big deal. One of the early hopes was that the mathematical system would produce a “decision procedure”, a mechanical process by which a computing device could check the truth or falsehood of any mathematical statement. Gödel showed that it couldn’t be done.
But the early computer scientists – in particular, Alan Turing – embraced it. It led directly to two of the fundamental rules of computer science:
1. The Church-Turing Thesis: all mechanical computing systems are basically the same: there is a fundamental limit to what is computable, and any “acceptable” system can compute anything up to that limit – so if you can find a way of doing it on any computing device, then you can, ultimately, do it on every acceptable computing device.
2. The Halting Problem: there are some things that you cannot program a device to do. In particular, you cannot write a program for any computing device that examines another program, and tells you if that other program will eventually stop.
The halting problem turns out to say *exactly* the same thing as the Gödel incompleteness theorem. Only you can write a proof of it that a freshman college student can understand in ten minutes, instead of a proof that the average math grad student will need a semester to plow through.
Chaitin and Kolmogorov saw this as a big deal: using an algorithmic approach to how to process information, you could prove something very simply, which would require a much more complicated proof using other methods.
K-C information theory grew out of this. According to Kolmogorov and Chaitin, the fundamental amount of information contained in a string (called the string’s information entropy after Shannon), is the shortest string of program + data for a computing device that can generate that string.
(If you’re interested in Gödel’s incompleteness theorem, then Hofstadter’s
[Godel, Escher, Bach: an Eternal Golden Braid][geb] is a really fun introduction. For K-C theory, Greg Chaitin has written a bunch of really [great][chaitin1] [books][chaitin2] for mathematical laymen.
[geb]: http://www.amazon.com/gp/search/ref=br_ss_hs/002-3998406-1014458?search-alias=aps&keywords=godel%20escher%20bach “GEB amazon link”
[chaitin1]: http://www.amazon.com/gp/product/9814021725/sr=8-4/qid=1141849782/ref=pd_bbs_4/002-3998406-1014458?%5Fencoding=UTF8
[chaitin2]: http://www.amazon.com/gp/product/1852336684/sr=8-3/qid=1141849782/ref=pd_bbs_3/002-3998406-1014458?%5Fencoding=UTF8″
Information and Entropy
————————-
Now that I’ve got a bit of background and origins done, it’s time to get to some of the fun stuff.
As I said yesterday, both of the big branches of information theory have a concept of *entropy*. While the exact presentations differ, because they’re built on somewhat different theoretical foundations, the basic meaning of entropy in both branches is pretty much the same. I’m going to stick with the terminology of the Kolmogorov-Chaitin branch, but the basic idea is the same in either.
In K-C theory, we talk about the information content of *strings*. A string is an ordered sequence of characters from a fixed alphabet. It doesn’t really matter what alphabet you choose; the point of this kind of theory isn’t to come up with a specific number describing the complexity of a string, but to be able to talk about what information means in a formal algorithmic sense.
The K-C definition of the entropy of a string is the total quantity of information encoded in that string.
Here’s where it gets fun.
Another way of saying exactly the same thing is that the entropy of a string is a measure of the *randomness* of the string.
Structured strings are structured *precisely* because they contain *redundancy* – and redundancy does not add information.
Which string has more information: “XXXXYYYYY” or “4X5Y”? They have the same amount. One is a lot smaller than the other. But they contain the same amount of information.
Here’s a third way of saying exactly the same thing: the entropy of a string is the length of the smallest compressed string that can be decompressed into the original.
Here’s a better example. Go back and look at the first two paragraphs of yesterday’s post. It took 514 characters.
Here’s the same information, compressed (using gzip) and then made
readable using a unix utility called uuencode:
M’XL(“)E8$$0“VIU;FL`19$Q=JPP#$7[6≤7KIN’,`M+]≤HHLPH“)8BMOPY
M[#Y/GCDG#
MLY2EI8$9H5GLX=*R(_+ZP/,-5-1#HRJNT`77@LL,MZD,”H7LSUKDW3@$#V2
MH(KTO$Z^$%CN1Z>3L*J^?6ZW?^Y2+10;+SOO’OC”/;7T5QA%987SY02I3I$
MLKW”W,VZ-(J$E”[$;’2KYI^-_L./3BW.+WF3XE8)≤@D8X^U59DQ7IA*F+X/
MM?I!RJ*%FE%])Z+EXE+LSN*,P$YNX5/P,OCVG;IK=5_K&CK6J7%’+5M,R&J]
M95*W6O5EI%G^K)8B/XV#L=:5_`=5ELP#Y#UJ??>[[DI=J”*2D];K_F230″
$`@(““`
`
That’s only 465 characters; and if I didn’t have to encode it to stop
it from crashing your web-browser, it would only have been 319
characters. Those characters are a lot more random than the original –
so they have a higher information density. The closer we get
to the shortest possible compressed length, the higher the information
density gets.
Actually, I’m lying slightly; there’s an additional factor that needs to be considered in K-C theory.
Remember that K-C comes from the foundational mathematics of computer science, something called recursive function theory. We’re talking about strings being processed algorithmically by a program. So really, we get to talk about programs that generate strings. So it’s not just strings: you have a program and a data string. Really, you measure information content of a string as the total size of the smallest pairing of program and data that generates that string. But for most purposes, that doesn’t make that much difference: the combination of the program and the data can be encoded into a string.
In the two examples above, the program to follow is implied by the contents of the string; if we wanted to really model it precisely, we’d need to include the programs:
* For the string “XXXXYYYYY”, the program is roughly:
while there are more characters in the data:
read a character
output the character that you read
end
* For the string “4X5Y”, the program is roughly:
while there are more characters in the data:
read a number n from the data
read a character c from the data
output c n times
end
Now, specifics aside: the definitions I gave earlier are pretty much correct in both K-C and Shannon information theory: information content is proportional to randomness is proportional to minimum compressed length.
So – what happens if I take a string in very non-compressed format (like, say, an uncompressed digitized voice) and send it over a noisy phone line? Am I gaining information, or losing information?
The answer is: *gaining* information. Introducing randomness into the string is *adding* information.
“AAAABBBB” contains less information than “AAAABxBB”.
The way this is used to refute bozos who claim things like “You can’t create information” should be obvious.

Post navigation

An excellent post. I have often considered writing a brief introduction to information theory (because it’s just so gosh-darn fun!) but never got round to it. My other problem being that my understanding of the fine details is not as sound as it could be. 😉

When I was a kid I was captivated by JR Pierce’s Symbols, Signals, and Noise. It was the first hardbound book that I bought voluntarily and I actually wore it to tatters. Haven’t seen my (third) copy for a few decades, but you’ve inspired me to root around for it and see how it’s stood up to current writings and theories.

Sure, you can create as much information as you like. But my impression is that most ‘information’ is junk.
So we can create information, but creating information that is immediately useful is hard.
Your blog seems to be information of the useful variety. 🙂
Peter

plunge:
No comment. I know it might seem a bit silly, but my employer requested that I don’t explicitly name them on the blog; even if it’s easy to figure out who they are, it still helps to establish a clear boundary between me when I’m writing for me blog; and me in my role as an employee. That boundary makes it clear that what I do and say here on the blog has absolutely nothing to do with my employer.

And probably, even if evolution was closed and didn’t incorporate information on the explored new features and the changing environment so any purported “law of conservation of information” was applicable, we could still as for entropy have parts of the system gaining information and parts loosing it. The loosing parts includes creationists.

Well… IF the physical universe is assumed to be a short, digital, Turing-Computable, Theory-Of-Everything program, then the universe as a whole would have a constant amount of information. If one takes a snapshot of the universe at a particular time-step (not “time” as we know it, as there is no universal clock), then the entropy of that snapshot would be the information in the original Theory-Of-Everything program PLUS the information in the actual time-step number. So, if we wanted to zoom in further in the universe to, say, Earth, then the coordinates of OUR Earth within the universe would become the bulk of the information describing our planet. So, (by making this Theory-Of-Everything assumption), we can approach information from the normal bottom-up fashion, where tiny events are considered simpler, or from the top-down fashion, where large sections of the universe are considered simpler, because the Theory-Of-Everything is simplest of all. Somewhere in the middle would be the parts of the universe with maximum entropy, just as “big” as they are “small”… So, from the top-down view of the universe, the idea of information being conserved is NOT completely absurd, if one considers the information in the time-step to be negligible as it’s a deterministic variable.
Maybe?

For some reason, this calls to mind something that a Foundation Drawing teacher said. Essentially he likened a white peice of paper to “all possibilities” and that every mark you make on the paper reduces the possibilities.

Mark,
Some nitpicks about Shannon’s IT:
* […]as presented in his infamous paper, [Communication in the Presence of Noise][…]
It’d probably be a good idea to also cite his first paper on the subject, “The mathematical theory of communication”.
* “Shannon called the information content of a signal it’s entropy[…]”
Entropy is a property of a source (a random variable), not a signal. (And it should be “its entropy”, not “it’s entropy” 🙂
* […]at which point information from the noise will start to replace information from the message[…]”
I found this description unnecessarily imprecise. Noise carries no information at all (at least, the AWGN Shannon considered).
Adding a short, selected bibliography could also be a good idea… Like Pierce’s book (already mentioned in the comments). I seem to recall some of Shannon’s papers are available on the web, and they are accessible to the dedicated reader who knows a bit of math.
HTH.

My familiarity with K-C entropy is nil but I suggest your example is poor: 4X5Y uses an expanded alphabet. Either the alphabet needs to be parsed into escape and command characters or it seems to be an apples-to-oranges comparison. Now command sequences can be identified by position (since position is not reuasable, it shouldn’t be considered part of the alphabet.) So this is one method for program and data to share an alphabet in practice without escape sequences.
You also don’t define “random” which I assume to mean “uncorrellated”, i.e. if the autocorrelation of one sequence is less than another then it has more information but I’m not sure.

BTW, in the sentence that begins “Shannon called the information content”, you should use “its”, not “it’s”. “Its” is the possessive form, “it’s” is a contraction for “it is”.
(I once had a prof listening to me in a group discussion on the first day of class and he asked me if I was an English major or a math major, having narrowed it down to that based on the ways in which I was being nitpicky and insisting on exactness….)

I read one of the books you recommended by Gregory J Chaitin ‘The Limits of Mathematics: A Course on Information Theory and the Limits of Formal Reasoning (Discrete Mathematics and Theoretical Computer Science)’.
Fascinating material and examples. Good explanation of Godel, Turing and Chaitin arguments for disproving Hilbert’s 10th conjecture.
Although I do agree with Chaitin about randomness in mathematics and the need for experimental mathematics, doesn’t mathematical game theory offer a means of dealing with this situation?
Instead of the classic bifurcation of games into cooperative and non-cooperative games an additional category of hybrid games may be needed. For example hybrid game would have the range from 1] one part cooperative with n parts non-cooperative to 2] n parts cooperative with one part non-cooperative [as opposed to n+1 parts cooperative only or n+1 parts non-cooperative only].
In some ways this may correspond to the ‘New Kind of Science’ advocated by Wolfram with his examples of cellular automata.
Axioms may then allow for randomness yet still have a usefulness as game theory does in economics.
I wonder if Demski borrowed ‘irreducible’ from the ‘irreducible mathematic information’ concept of Chaitin or from the “irreducible complex representation” language of Lie Algebra?

Doug:
Glad you enjoyed it.
I don’t know much about game theory, so I’m not really sure about how well what you’re suggesting works. My limited understanding suggests probably not, but I’m not at all certain.
In person, Greg has explained that his idea is basically that some of the statements that he and other mathematicians building on Gödel have developer are statements that are neither provably true nor provably false; but they’re also not intrinsically contradictory or non-sensical. If we have statements like that – statements where we can’t know if they’re true or false – we should take the scientific approach of saying “Well, this one seems right in this context – let’s follow it and see where it leads”. It’s an idea for a kind of experimental pure mathematics.

apav:
I didn’t say what the alphabet was 🙂
My intention was actually that the alphabet I was using was the same for both strings – one didn’t use all of the characters available in the alphabet, which made it confusing. I probably should have been clearer about that.

Peter:
Information theory doesn’t care about the “value” of information. Information is information; what it means, if anything, is an entirely different topic.
The interesting and strange thing is that it doesn’t matter. When it comes to measuring it, producing it, etc., describing it; the meaning of it doesn’t matter. A string that contains the encoding directions for how to build an internal combusion engine; the string that describes the pattern of grains of sand on the beach, or the string that describes an utterly random string of bits – they’re all the same from the point of what information means, and how information behaves.

The midsummer holiday was long and eventful, apparently so was this thread.
“IF the physical universe is assumed to be a short, digital, Turing-Computable, Theory-Of-Everything program, then the universe as a whole would have a constant amount of information.”
Much says the universe isn’t turingcomputable AFAIK. QM randomness and prohibition against local hidden variable theories seems to say that algorithms aren’t used by nature.
Also, entropy says there is a finite but increasing amount of information within the observable universe. So says also the holographic theories, which are consistent with entropy.
“Noise carries no information at all (at least, the AWGN Shannon considered).”
Fourier spectra says white noise contains information on all frequencies. What Mark says is compatible with the signal-to-noise ratio (S/N) where increasing noise drowns out the signal.

You say that what the information means doesn’t matter?
So I could make an even smaller string, say “54” which is understood by those in the know that 5 always represents XXXXX and 4 always represents YYYY. Or an even shorter string “&” where & always reperesents XXXXXYYYY?
So I could just make up a language where every character means something arbitrary and another language where it takes whole sets of characters to “say” the same thing and they would both have the same amount of information?
But what about symbols? A few concentric circles might mean nothing to some people, but to Target employees and shoppers it means one thing, and to Mayans it means something else entirely.
BTW, I think it’s entirely obvious that I don’t know w thing about information theory.

Jason:
You’ve hit on one of the really important subtle bits of K-C theory – but it has nothing to do with “what the information means”. Meaning is a separate question; information quantity is about “how much can be coded into a string of a certain length”; it doesn’t matter whether what you’re coding is a string of random bits, or the decay times of a collection of radioactive atoms, or the complete works of shakespear. What matters is that you can look at a set of information *in the abstract*, and discuss what properties it must necessarily have.
K-C theory says that the information content of a string is the sum of the shortest pair of program and data that will generate that string. The differences you’re asking about are actually machine differences: the difference between “XXXXYYYYY” and “4X5Y” is the program that you need to generate the output string. It’s a more complicated program for “4X5Y” than it is for “XXXXYYYYY” – the program needs to understand how to interpret numbers and count. And “45” with the assumption that it’s Xs and Ys means that the symbols “X” and “Y” have to be coded into the program.

Jason:
To answer “If the meaning doesn’t matter, why call it information”?, I need to dive a bit deeper.
The point of information theory is to understand the differences between different strings, and to capture some notions of communication and complexity.
A totally random string of digits *intrinsically* takes more space to encode – no matter how clever you are – than a string of alternating 1s and 0s. What property does that string of random digits have? We call it information, because it does seem to fit our abstract notion of what information should mean.
And it works. One of hte most common practical uses of K-C theory has been proving *lower* bounds on algorithmic complexity. When it comes to algorithms, we usually talk about *upper* bounds: we know that for any input of length N, a particular algorithm can solve it in no more than f(N) steps for some function f. For example, we can say that we know that sorting a list of n numbers doesn’t need to take any more than something propertional to n*log(n) steps. But going the other way – saying that *no* algorithm can do something in *less* than f(n) steps on input n is a much harder thing to do.
K-C theory has been able to do that – by analyzing the amount of information that must be processed. So, for example, you can show that finding a value in an ordered list of n elements can’t take less than O(log n) steps, because you must process that much information to find its location in the list.

“The halting problem turns out to say exactly the same thing as the Gödel incompleteness theorem.”
Heh, I’d suspected as much, but it’s good to get confirmation from an expert. What about links between computation and “chaos”? That is, from both intuition and experience, computation shows Sensitive Dependence on Initial Conditions.
On the flip side, it seems to be that chaotic transitions could be used as the basis for “switches”… is it reasonable to say that chaotic systems can generally be constructed to perform given computations? (E.g, Nervous systems, and methods for production of same, “constructed” by evolution.)

“information content is proportional to randomness is proportional to minimum compressed length.”
For a few years now, I’ve been applying this idea to writing essays. Namely, the notion that concise writing (i.e., compressed length) requires a larger vocabulary (i.e., randomness), so, in the end, less is more (i.e., more information). A concise writer with a large vocabulary can do in one sentence what someone with a limited vocabulary can do in a paragraph.

Nah, its an ad for a company with an application and a few lines about using Shannon.
They even drag up the old left side-right side brain model that AFAIK neuroscientists have never bought into.
I advise getting facts elsewhere.

Hello Mark
great intro to this topic
What do you think about this statement:
In the future we will find that technically and on a human communication level NOISE is not only a disturbing factor but is an essential factor for the transmission and reception process
AND Noise is an integral part of any communication, which means communication without noise in impossible, nothing exists that is 100 % consistent (which is another word for saying “free of noise” or is only “yes or no” or only “black or white” and at the same time complete….
in our medical evaluation system the CoRe Inergetix system we use a random number generator as a source of information about the client
I am interested to get you on our board of experts to develop this further let me know Kiran@inergetix.com