Wednesday, March 08, 2006

Some good math: An Introduction to Information Theory, part 1

I don't want this blog to just be bad math. Math is a wonderful, beautiful, fun thing. I want to spend some of the time here trying to help people see the beauty and the fun of good math.

As a start, I'm going to try to put together a series of entries on something really interesting and fun, and which is all too often misused by crackpots: information theory. It's an area of math that was almost totally obscure until not too long ago, and one which is frequently misunderstood, even by people who mean well.

A bit of history

Modern information theory comes from two very different sources: Shannon's information theory, and Kolmogorov/Chaitin algorithmic information theory. Fortunately, they both converge, mostly, on the same place.

Shannon's Information Theory

The more famous branch of IT is Claude Shannon's information theory, as presented in his landmark paper, Communication in the Presence of Noise. Shannon is commonly regarded as the father of information theory, with good reason. While the other branch of information theory actually predates Shannon, his work is what turned IT into something that was directly useful in the real world.

Shannon's interest came from his work at Bell Labs for AT&T. For a telephone company, laying wire is an incredibly expensive proposition. They'd really like to do it once, and never have to go back and lay more. The problem was: how much wire did they actually need to lay? It's an astonishingly tricky question. The less they laid, the less it cost them in the short term. But they knew that the number of phone lines was going to increase dramatically - so if they were cheap and laid too little wire, then when the number of phones exceeded the capacity of what they had already laid, they'd have to go back and lay more. So they wanted to be able to make a good prediction of the minimum amount of wire they could lay that would meet their needs both at the time it was laid, and in the future.

But there's a big trick to that. First: how much information can a single wire carry? And when you bundle a bunch of wires together, how much can you pump over a single wire without it interfering with the signals on its neighbors in the bundle?

That's what Shannon was trying to understand. He was trying to find a mathematical model that he could use to describe information transmission, including the effects of imperfections in transmission such as the introduction of noise and interference. He wanted to be able to quantify information in a meaningful way that would allow him to ask questions like "at what point does noise in the line start to eliminate the information I want to transmit?".

The answer is just fascinating: a given communication channel has a capacity for carrying information; and a given message has a quantity of information in it. Adding noise to the signal adds information to the message until the capacity of the channel is reached, at which point information will start to be lost.
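Shannon's eventual answer for an analog channel is usually stated as the Shannon-Hartley formula: capacity grows with bandwidth and with the signal-to-noise ratio. A minimal sketch in Python (the telephone-line numbers below are illustrative assumptions, not figures from this post):

```python
import math

def channel_capacity(bandwidth_hz: float, snr: float) -> float:
    """Shannon-Hartley: maximum bits per second over an analog channel
    with the given bandwidth and (linear) signal-to-noise ratio."""
    return bandwidth_hz * math.log2(1 + snr)

# A voice-grade phone line: roughly 3 kHz of bandwidth and ~30 dB SNR
# (a linear ratio of about 1000) gives a capacity near 30 kbit/s --
# in the neighborhood of what dial-up modems eventually achieved.
print(channel_capacity(3000.0, 1000.0))
```

Pushing data faster than that capacity doesn't just degrade the signal gracefully; beyond the limit, errors become unavoidable no matter how clever the encoding.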

Shannon called the information content of a signal its entropy, because he saw a similarity between his information entropy and thermodynamic entropy: in a communicating system, entropy never decreases. It increases until the capacity of the channel is reached, and then it stays constant. You can't reduce the amount of information in a message.
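Shannon's entropy can be computed directly from symbol frequencies; here's a quick sketch of the idea in Python (using a message's own letter frequencies as the probability estimate, which is a simplifying assumption):

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Average information content, in bits per symbol, estimated from
    the message's own symbol frequencies."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("aaaaaaaa"))  # one repeated symbol: 0 bits per symbol
print(shannon_entropy("abcdabcd"))  # four equally likely symbols: 2 bits each
```

The intuition: a perfectly predictable message tells you nothing new, while a message where every symbol is a surprise carries the maximum possible information per symbol.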

That naming led directly to the most common misuse of information theory. But that's a post for another day.

Algorithmic Information Theory

The other main branch of information theory came from a very different direction. The two pioneers were Andrey Kolmogorov and Greg Chaitin. Kolmogorov and Chaitin didn't know each other at the time they independently invented the same mathematical formalization (in fact, Chaitin was a teenager at the time!), a fact which led to some friction.

In the interests of honesty, I'll say that Greg Chaitin works in the same building that I do, and I've met him a number of times, and been greatly impressed by him, so my description is heavily influenced by Greg.

Anyway - algorithmic information theory grew out of some of the fundamental mathematics being done at the beginning of the 20th century. There was an effort to create a single, fundamental set of mathematical axioms and inference rules which was both complete (every true statement was provably true) and consistent (no statement could be proved both true and false). This effort failed, miserably; the nail in the coffin was Gödel's incompleteness theorem. Gödel presented an extremely complicated proof that showed, essentially, that no sufficiently powerful formal system could be both complete and consistent.

Most people saw Gödel's proof, did the equivalent of saying "Oh, shit!", and then pretended that they hadn't seen it. It does mean that there are fundamental limits to what you can do using mathematical formalization, but for the most part, they don't affect you in normal day-to-day math. (Sort of the way that car designers didn't change the way they built cars because of relativity. Yes, it showed that the fundamental physical model they were using was wrong - but at the speeds cars move, that wrongness is so small that it just doesn't matter.)

But for people in the early stages of what would become computer science, this was a big deal. One of the early hopes was that the mathematical system would produce a "decision procedure": a mechanical process by which a computing device could check the truth or falsehood of any mathematical statement. Gödel showed that it couldn't be done.

But the early computer scientists - in particular, Alan Turing - embraced it. It led directly to two of the fundamental rules of computer science:

The Church-Turing Thesis: all mechanical computing systems are fundamentally equivalent: there is a fundamental limit to what is computable, and any "acceptable" computing system can compute anything up to that limit - so if you can find a way of doing something on any computing device, then you can, ultimately, do it on every acceptable computing device.

The Halting Problem: there are some things that you cannot program a device to do. In particular, you cannot write a program for any computing device that examines another program, and tells you if that other program will eventually stop.

The halting problem turns out to say essentially the same thing as Gödel's incompleteness theorem - except that you can write a proof of it that a college freshman can understand in ten minutes, instead of a proof that the average math grad student will need a semester to plow through.
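That freshman-friendly proof is a diagonal argument. Here's a sketch of its shape in Python; `halts` is a purely hypothetical oracle (no such function can actually exist, which is the whole point):

```python
def make_contrary(halts):
    """Given a claimed halting-oracle halts(program, input), build a
    program that the oracle must get wrong about itself."""
    def contrary(program):
        if halts(program, program):
            # The oracle claims program(program) halts -- so loop forever.
            while True:
                pass
        # The oracle claims program(program) runs forever -- so halt.
        return None
    return contrary

# Whatever verdict halts gives about contrary run on itself, contrary
# does the opposite -- so no correct halts can exist.
```

You can feed in any concrete (and therefore wrong) oracle to watch the construction at work: with an oracle that always answers False, `make_contrary`'s result halts immediately when run on itself, contradicting the oracle's claim.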

Chaitin and Kolmogorov saw this as a big deal: by taking an algorithmic approach to processing information, you could prove something very simply that would require a much more complicated proof using other methods.

K-C information theory grew out of this. According to Kolmogorov and Chaitin, the fundamental amount of information contained in a string (called the string's information entropy, after Shannon) is the length of the shortest program-plus-data for a computing device that can generate that string.
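That quantity (Kolmogorov complexity) is itself uncomputable, but any compressor gives an upper bound on it: the decompressor plus the compressed data together form a program that regenerates the string. A rough illustration using Python's zlib (the size thresholds are what you'd typically observe, not exact constants):

```python
import os
import zlib

def complexity_upper_bound(s: bytes) -> int:
    """Length of zlib's compressed form of s: an upper bound (up to an
    additive constant for the decompressor) on s's Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

regular = b"ab" * 5000          # 10,000 bytes with an obvious short description
random_ish = os.urandom(10000)  # 10,000 bytes that almost certainly have none

print(complexity_upper_bound(regular))     # tiny: a short program suffices
print(complexity_upper_bound(random_ish))  # about as long as the data itself
```

The contrast is the whole idea in miniature: a highly regular string has low information content because a short program describes it, while a random string can't be described by anything much shorter than itself.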

8 Comments:

A very much needed blog. I've tried to do some math and everyday-sense explorations on my own blog, but it will be great to hear from yours. Looking forward to some good math applications of basic fundamentals.

For those with some formal logic training, Smullyan's Gödel's Incompleteness Theorems (here) is the best volume I've come across on the subject. Also, there's an inexpensive translation of the original paper here. It's not that heavy.

"The Halting Problem: there are some things that you cannot program a device to do. In particular, you cannot write a program for any computing device that examines another program, and tells you if that other program will eventually stop."

Aeee! Please put quantifiers. I have written several programs that looked at other programs and told you if they would stop. In fact, my programs told you almost exactly how long they would take to stop. The target programs were all drawn from a restricted subset of programs of a given type.

Maybe a better way to state the issue is: you cannot write a program that will take ANY arbitrary other program as input and is guaranteed to give an answer as to whether the input program will stop. In fact, any program you write to do this will be one of those you will not be able to get an answer about.

As an aside: it was rumored that the name 'entropy' was suggested to Shannon by von Neumann because of the familiarity of scientists with the term 'entropy'. Any confusion between information-theoretic entropy and physical entropy can thus be at least partially attributed to this name choice!

Information theory was well developed in the fields of statistical mechanics and statistical thermodynamics long before Shannon. The similarity between entropy and uncertainty is not a superficial resemblance. In the appropriate context they are precise equivalents.

You can measure thermodynamic entropy in 'bits' - the only difference between measuring it in bits and more conventional units is a constant factor derived from Boltzmann's constant and the natural logarithm of two.

Brilliant work it was, but ultimately, I have to tell you, it was mixed with atrocious math - which Shannon no doubt knew, but he was "dumbing down" his talk. I think Kolmogorov said he had to "correct" Shannon. But yes, brilliant - the number of key concepts in that paper that led to whole new areas of information theory and communications is breathtaking for a single paper.