I've become more interested in information theory recently. I was reading Eric Baum's What is Thought?, in which he lays out the case that thought is basically the compression of information and then the use of those compressions to predict the future (at least that's what I'm getting out of it, and it has a lot of conceptual overlap with Jeff Hawkins' ideas). The idea is that you are receiving an enormous amount of information, and your brain is constantly trying to find statistical regularities among the torrent of data so that it can compress it into useful representations. He constantly compares thought to the application of Occam's Razor, in which you try to find the simplest explanation that fits all (or at least most) of the data. You could call these representations "theories" as well. Examples might be the names we give to things in the world, such as cats or cars. These things have a lot of variability, but they have enough features in common to allow us to compress the representation.

Causal regularities are also things we learn to compress, such as "when I drop something, it falls to the ground." We learn "folk physics" through our constant interaction with the world, compressing the regularities we see into usable guidelines. Scientific laws expressed in mathematical form, like "F=ma", are the holy grail of compressions, because they are incredibly compact and describe a huge array of phenomena in the world.

Anyway, I'm still trying to get a handle on what exactly is being represented and compressed, and so I'm also looking into information theory, which I'm not wrapping my head around very well. I started reading Charles Seife's Decoding the Universe, but I don't like his writing style, and he seems to be maddeningly vague.

John Wilkins' Basic Concepts in Science list has an entry for information theory, but the link is broken as of this writing. I haven't found many good on-line resources either. I looked at this introduction, by a computer scientist at Carnegie Mellon. But some things still confuse me. For example, right off the bat he says:

"Suppose you flip a coin one million times and write down the sequence of results. If you want to communicate this sequence to another person, how many bits will it take? If it's a fair coin, the two possible outcomes, heads and tails, occur with equal probability. Therefore each flip requires 1 bit of information to transmit. To send the entire sequence will require one million bits."
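
As a sanity check on the quoted figure, here is a minimal Python sketch of the standard Shannon entropy formula for a two-outcome source, H(p) = -p log2(p) - (1-p) log2(1-p). The function name binary_entropy and the 10%-heads coin used for comparison are just illustrative choices here, not anything taken from the introduction itself.

```python
import math

def binary_entropy(p_heads):
    """Shannon entropy, in bits per flip, of a coin that lands heads with probability p_heads."""
    if p_heads in (0.0, 1.0):
        return 0.0  # a coin that always lands the same way carries no information
    p_tails = 1.0 - p_heads
    return -(p_heads * math.log2(p_heads) + p_tails * math.log2(p_tails))

flips = 1_000_000
fair_bits = flips * binary_entropy(0.5)    # fair coin
biased_bits = flips * binary_entropy(0.1)  # assumed example: heads 10% of the time

print(f"fair coin:   {fair_bits:,.0f} bits for {flips:,} flips")
print(f"biased coin: {biased_bits:,.0f} bits for {flips:,} flips")
```

The fair coin comes out at exactly 1 bit per flip, matching the one million bits in the quote, while the lopsided coin needs a bit under half that on average. Note that this is an average over all the sequences the coin could produce, not a statement about any one particular sequence.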

But then he goes on to describe how we are able to compress a sequence of a million bits from a coin that only produces heads 1/10th of the time, by encoding the "runs" of tails rather than every bit explicitly. But why aren't you able to do that with the sequence from the fair coin? You're going to have long runs of either all heads or all tails, or seemingly ordered sequences that occur randomly (e.g., you might have a string like "01010101010101010101" that occurs randomly, but could still be compressed, right?).
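
One way to poke at this question is to try it. The sketch below is a rough experiment, not a proof: it generates pseudo-random flips, packs them eight to a byte, and hands them to Python's zlib, a general-purpose compressor standing in for the run-length scheme the introduction describes. The helper names (flip_coins, pack_bits, compressed_ratio), the seed, and the sequence length are all arbitrary choices for illustration.

```python
import random
import zlib

def flip_coins(n, p_heads, seed):
    """Return n pseudo-random flips as a list of 0s and 1s (1 = heads)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_heads else 0 for _ in range(n)]

def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, eight flips per byte."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

def compressed_ratio(bits):
    """Compressed size divided by packed size; below 1.0 means zlib saved space."""
    raw = pack_bits(bits)
    return len(zlib.compress(raw, 9)) / len(raw)

n = 200_000
fair_ratio = compressed_ratio(flip_coins(n, 0.5, seed=1))
biased_ratio = compressed_ratio(flip_coins(n, 0.1, seed=1))

print(f"fair coin:   compressed/raw = {fair_ratio:.3f}")
print(f"biased coin: compressed/raw = {biased_ratio:.3f}")
```

In a run like this the biased sequence shrinks substantially, while the fair one does not shrink at all and actually grows slightly from the compressor's own overhead. The usual explanation is a counting argument: there are 2^n equally likely fair-coin sequences of length n, and there are not enough shorter descriptions to go around, so any scheme that shortens some of them must lengthen others. Lucky runs like "01010101010101010101" do turn up, but rarely enough that the bits spent flagging "a compressible stretch starts here" cancel out the savings on average.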

Also, there seems to be something counterintuitive about the idea of random sequences containing more information. What that effectively means is that when you apply some algorithm that induces regularity (like a sorting algorithm), you're reducing the amount of information in the system. That implies that a blueprint contains less information than a sheet of paper with random squiggles, right? Because the blueprint has more regularity and is thus more compressible.
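
The sorting intuition can be checked the same way. This is a small sketch under the same caveats as any toy experiment: it uses zlib's output size as a rough stand-in for "amount of information," and the helper name pack_bits, the seed, and the sequence length are arbitrary illustrative choices.

```python
import random
import zlib

def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, eight flips per byte."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

rng = random.Random(42)
flips = [rng.randint(0, 1) for _ in range(200_000)]  # fair-coin flips

# Sorting keeps the count of heads and tails but throws away their order.
shuffled_size = len(zlib.compress(pack_bits(flips), 9))
sorted_size = len(zlib.compress(pack_bits(sorted(flips)), 9))

print(f"random order: {shuffled_size} bytes after compression")
print(f"sorted:       {sorted_size} bytes after compression")
```

The sorted version compresses to a tiny fraction of the original, because all that is left to describe is roughly "this many tails, then this many heads." So in this technical sense sorting really does discard information: the original order cannot be recovered from the sorted list. The measure only counts how many bits are needed to pin down the exact sequence, and says nothing about whether the content is meaningful or useful to anyone, which seems to be where the blueprint-versus-squiggles comparison feels off.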

Anyway, if someone knows of a good, lucid primer on the subject, I'd welcome any recommendations.