Archive for the ‘Inference’ Category

In my last post, I discussed phase transitions, and how computing the free energy for a model would let you work out the phase diagram. Today, I want to discuss in more detail some methods for computing free energies.

The most popular tool physicists use for computing free energies is “mean-field theory.” There seems to be at least one “mean-field theory” for every model in physics. When I was a graduate student, I became very unhappy with the derivations for mean-field theory, not because there were not any, but because there were too many! Every different book or paper had a different derivation, but I didn’t particularly like any of them, because none of them told you how to correct mean-field theory. That seemed strange because mean-field theory is known to only give approximate answers. It seemed to me that a proper derivation of mean-field theory would let you systematically correct the errors.

One paper really made me think hard about the problem; the famous 1977 “TAP” spin glass paper by Thouless, Anderson, and Palmer. They presented a mean-field free energy for the Sherrington-Kirkpatrick (SK) model of spin glasses by “fait accompli,” which added a weird “Onsager reaction term” to the ordinary free energy. This shocked me; maybe they were smart enough to write down free energies by fait accompli, but I needed some reliable mechanical method.

Since the Onsager reaction term had an extra power of 1/T compared to the ordinary energy term in the mean field theory, and the ordinary energy term had an extra power of 1/T compared to the entropy term, it looked to me like perhaps the TAP free energy could be derived from a high-temperature expansion. It would have to be a strange high-temperature expansion though, because it would need to be valid in the low-temperature phase!

Together with Antoine Georges, I worked out that the “high-temperature” expansion (it might better be thought of as a “weak interaction expansion”) could in fact be valid in a low-temperature phase, if one computed the free energy at fixed non-zero magnetization. This turned out to be the key idea; once we had it, it was just a matter of introducing Lagrange multipliers and doing some work to compute the details.

It turned out that ordinary mean-field theory is just the first couple terms in a Taylor expansion. Computing more terms lets you systematically correct mean field theory, and thus compute the critical temperature of the Ising model, or any other quantities of interest, to better and better precision. The picture above is a figure from the paper, representing the expansion in a diagrammatic way.

We found out, after doing our computations but before submitting the paper, that in 1982 Plefka had already derived the TAP free energy for the SK model from that Taylor expansion, but for whatever reason, he had not gone beyond the Onsager correction term or noted that this was a technique that was much more general than the SK model for spin glasses, so nobody else had followed up using this approach.

This approach has some advantages and disadvantages compared with the belief propagation approach (and related Bethe free energy) which is much more popular in the electrical engineering and computer science communities. One advantage is that the free energy in the high-temperature expansion approach is just a function of simple one-node “beliefs” (the magnetizations), so it is computationally simpler to deal with than the Bethe free energy and belief propagation. Another advantage is that you can make systematic corrections; belief propagation can also be corrected with generalized belief propagation, but the procedure is less automatic. Disadvantages include the fact that the free energy is only exact for tree-like graphs if you add up an infinite number of terms, and the theory has not yet been formulated in an nice way for “hard” (infinite energy) constraints.

Sebastian Thrun is a professor of computer science and electrical engineering at Stanford, and director of the Stanford Artificial Intelligence Laboratory. He was the leader of Stanford’s team which won the $2 million first prize in the 2005 DARPA Grand Challenge, which was a race of driver-less robotic cars across the desert, and also leads Stanford’s entry into the 2007 DARPA Urban Challenge.

One of the ingredients in the Stanford team’s win was their use of “probabilistic robotics,” which is an approach based on the recognition that all sensor readings and models of the world are inherently subject to uncertainty and noise. Thrun, together with Wolfram Burgard and Dieter Fox have written the definitive text on probabilistic robotics, which will be a standard for years to come. If you are seriously interested in robotics, you should read this book. (The introductory first chapter, which clearly explains the basic ideas of probabilistic robotics is available as a download here.)

The Laboratory of Intelligent Systems at the Swiss École Polytechnique Fédérale de Lausanne (EPFL) hosts the superb “Talking Robots” web-site, which consists of a series of podcast interviews with leading robotics researchers. I noticed that the latest interview is with Thrun, and liked it quite a bit; it is well worth downloading to your iPod or computer.

You can watch Thrun speaking about the DARPA Grand Challenge at this Google TechTalk.

“Artificial Intelligence: A Modern Approach,” by Stuart Russell (professor of computer science at UC Berkeley) and Peter Norvig (head of research at Google) is the best-known and most-used textbook about artificial intelligence, and for good reason; it’s a great book! The first edition of this book was my guide to the field when I was switching over from physics research to computer science.

I feel almost embarrassed to recommend it, because I suspect nearly everybody interested in AI already knows about it. So I’m going to tell you about a couple related resources that are maybe not as well-known.

First, there is the online code repository to the algorithms in the book, in Java, Python, and Lisp. Many of the algorithms are useful beyond AI, so you may find for example that the search or optimization algorithm that you are interested in has already been written for you. I personally have used the Python code, and it’s really model code from which you can learn good programming style.

If you’re interested in learning about statistical mechanics, graphical models, information theory, error-correcting codes, belief propagation, constraint satisfaction problems, or the connections between all those subjects, you should know about a couple of books that should be out soon, but for which you can already download extensive draft versions.

MIT professor Edward Adelson uses remarkable visual illusions to help explain the workings of the human visual system. One such illusion is shown above. Believe it or not, the square marked A is the same shade of gray on your computer screen as the square marked B.

Here’s a “proof,” provided by Adelson. Two strips of constant grayness are aligned on top of the picture. You can see that the A square is the same shade as the strips near it and the B square is the same shade as the strips near it. Perhaps you still don’t believe that the strips are of constant grayness. In that case, put some paper up next to your computer screen to block off everything except for the strips; you’ll see it’s true.

Adelson explains the illusion here. The point is that our visual system is not meant to be used as a light meter; instead it is trying to solve the much more important problem (for our survival) of determining the true shade (that is, the color of the attached “paint”) of the objects it is looking at.

You can find more interesting illusions and demos from Adelson and other members of the perceptual science group at MIT, but don’t fail to also take a look at the illusions collected by the lab of Dale Purves at Duke. I particularly recommend the cube color contrast demo, where you can see that gray can be made to look yellow or blue.

Purves, together with R. Beau Lotto, wrote the book “Why We See What We Do: An Empirical Theory of Vision,” which collects these remarkable illusions and also expounds on a theory explaining them. The theory, to summarize it very briefly, says that what humans actually see is a “reflexive manifestation of the past rather than a logical analysis of the present.” I found myself quite uncomfortable with the theory for much the same reasons as given in Alan Gilchrist‘s review.

I also would prefer a more mathematical theory than Purves and Lotto give. It seems to me that we should in general try to explain illusions in terms of a Bayesian analysis of the most probable scene given the evidence provided by the light. My collaborators Bill Freeman and Yair Weiss (both former students of Adelson’s) have long worked along these lines; see for example Yair’s excellent Ph.D. thesis from 1998, explaining motion illusions.

In fact, I would like to go beyond a mathematical explanation of illusions to an algorithmic one. I would argue that a good computer vision system should “suffer” from the same illusions as a human, even though it has neither the same evolutionary history nor the same life history. To take an example of what I have in mind, the famous Necker cube illusion presumably arises naturally from the fact that the two interpretations are both local optima, with respect to probability, so a good artificial system should use an algorithm that settles into one interpretation, but then still be able to spontaneously switch to the other.

What does it mean to say that “I believe that Hillary Clinton has a 42% chance to win the 2008 U.S. Presidential election?” It used to be that some academics (called “frequentists”) had problems with this statement, because they only wanted to talk about probabilities for experiments that, at least in principle, could be run many times, to permit a reasonable estimate of the frequency of some event. I think hardly anybody is a frequentist anymore.

One good operational definition of the above statement is that if you offer me a choice between $1.00 if Hillary Clinton wins, or $0.42 regardless of whether or not she wins, I am indifferent. If you offer me less than the $0.42, I’ll take the chance on Senator Clinton, if you offer me more, I’ll go with the sure thing.

Nowadays, we can get good estimates of the probabilities of many interesting events from prediction markets like TradeSports for sporting events, or Intrade for political and other events. In these markets, participants buy and sell contracts of the above variety, so that the market provides a consensus probability for a particular event.

I find these markets fascinating. You can find out for example, looking at the chart above, that Senator Obama’s apparent chance of winning the Democratic nomination has fallen from around 39% to around 17% over the last few weeks (presumably because of the fallout over his statements on foreign policy). Or that the New England Patriots off-season acquisitions have increased their chances of winning the next Super Bowl from about 7% to 17%.

There’s some bias towards American sports and politics, but events like the recent French Presidential election are also heavily traded.

I want to expand on what I wrote previously in “A Simple But Challenging Game: Part II, this time focusing on Rosenthal’s Centipede Game. To remind you of the rules, in that game there are two players. The players, named Mutt and Jeff, start out with $2 each, and they alternate rounds. On the first round, Mutt can defect by stealing $2 from Jeff, and the game is over. Otherwise, Mutt cooperates by not stealing, and Nature gives Mutt $1. Then Jeff can defect and steal $2 from Mutt, and the game is over, or he can cooperate and Nature gives Jeff $1. This continues until one or the other defects, or each player has $100.

As I previously wrote, in this game, the Nash equilibrium is that Mutt should immediately defect on his first turn. This result is obtained by induction. When both players have $99, it is clearly in Mutt’s interest to steal from Jeff, so that the he will end with $101, and Jeff will end with $97. But that means that when Jeff has $98 and Mutt has $99, Jeff knows what Mutt will do if he cooperates, and can see that he should steal from Mutt, so that he will end with $100 and Mutt will end with $97. But of course that means that when both players have $98, Mutt can see that he should steal from Jeff, and so on, until one reaches the conclusion that Mutt should start the game by stealing from Jeff.

Of course, this Nash equilibrium behavior doesn’t really seem very wise (not to mention ethical), and experiments show that humans will not follow it. Instead they usually will cooperate until the end or near the end of the game, and thus obtain much more money than would “Nashists” who rigorously follow the conclusions of theoretical game theory.

Game theorists often like to characterize the behavior of Nashists as “rational,” which means that they need to explain the “irrational” behavior of humans in the Rosenthal Centipede Game. See for example, this economics web-page, which gives the following “possible explanations of ‘irrational’ behavior”:

There are two types of explanation to account for the divergence. The first assumes that the subject pool contains a certain proportion of altruists who place a positive weight in their utililty function on the payoff of their opponent. Also to the extent that selfish players believe that there is some probability that other players are altruists, they have an incentive to mimic altruistic behaviour by passing.

The second explanation considers the possibility of action errors. Errors in action, or ‘noisy’ play, may result from subjects experimenting with different strategies. Or simply from subjects pressing the wrong key.

Let’s step back for a second and consider what “rational” behavior should mean. A standard definition from economics is that a rational agent will act so as to maximize his expected utility. Let’s accept this definition of “rational.”

The first thing we should note is that “utility” is not usually the same as “pay-off” in a game. As noted in the first explanation above, many people get utility from helping other people get a pay-off. But there are many other differences between pay-offs and utility. You might lose utility from performing actions that seem unethical or unjust, and gain utility from performing actions that seem virtuous or just. You might want to minimize the risk in your pay-off as well as maximize the expected pay-off. You might value pay-offs in a non-linear way, so that the difference between $101 and $100 is very small in terms of utility.

Of course, this difference between pay-off and utility is very annoying theoretically. We’d really like the pay-offs to strictly represent utilities, but unfortunately for experiments, it is only possible to hand out dollars, not some abstract “utils.”

But suppose that the pay-offs in the Rosenthal Centipede Game really did represent utils. Would the game theory result really be “rational” even in that case? Would the only remaining explanation of cooperating behavior be that the players just don’t understand the situation and are making an error?

No. Remember that to be “rational,” an agent should maximize his expected utility. But he can only do that conditioned on some belief about the nature of the person he is playing with. That belief should take the form of a probability distribution for the possible strategies of his opponent. A Nashist rigidly reasons by backward induction that his opponent must always defect at the first opportunity. He even believes this if he plays second, and his opponent cooperates on the first turn! But is this the most accurate belief possible, or the one that will serve to maximize utility? Probably not.

A much more accurate belief could be based on the understanding that even people who understand the backward induction argument can reason beyond it and see that many of their opponents are nevertheless likely to cooperate for a long time, and therefore it pays to cooperate. If you believe that your opponent is likely to cooperate, it is completely “rational” to cooperate. And if this belief that other players are likely to cooperate is backed by solid evidence such as the fact that they started the game by cooperating, then the behavior of the Nashist, based on inaccurate beliefs that cannot be updated, is in fact quite “irrational,” because it does not maximize his utility.

Sophisticated game theorists do in fact understand these points very well, but they muddy the waters by unnecessarily overloading the term “rational” with a second meaning beyond the definition above; they in essence say that “rational” beliefs are those of the Nashist. For example, take a look at this 1995 paper about the centipede game by Nobel Laureate Robert Aumann. Aumann proves that “Common Knowledge of Rationality” (by which he which he essentially means the certain knowledge that all players must always behave as Nashists) will imply backward induction. He specifically adds the following disclaimer at the end of his paper:

We have shown that common knowledge of rationality (CKR) implies backward induction. Does that mean that in perfect information games, only the inductive choices are appropriate or wise? Would we always recommend the inductive choice?

Certainly not. CKR is an ideal (this is not a value judgement; “ideal” is meant as in “ideal gas”) condition that is rarely met in practice; when it is not met, the inductive choice may be not only unreasonable and unwise, but quite simply irrational. In Rosenthal’s (1982) centipede games, for example, even minute departures from CKR may make it incumbent on rational players to “stay in” until quite late in the game (Aumann, 1992); the resulting outcome is very far from that of backward induction. What we have shown is that if there is CKR, then one gets the backward induction outcome; we do not claim that CKR obtains or “should” obtain, and we make no recommendations.

This is all well and good, but why use the horribly misleading name “Common Knowledge of Rationality” for something that would be more properly called “Universal Insistence on Idiocy?”

I hope it is obvious by now why I am skeptical of explanations of various types of human behavior that are based on assuming that all humans are always Nashists, and even more skeptical of recommendations about how we should behave that are based on those same assumptions.

[Acknowledgement: I thank my son Adam for discussions about these issues.]

“On Intelligence,” written by Jeff Hawkins with Sandra Blakeslee, is a great read, full of provocative ideas about how our brains work.

Hawkins founded Palm Computing and Handspring, but he says that his true lifelong passion has been trying to understand our brains. In 2002, he founded the Redwood Neuroscience Institute (now the Redwood Center for Theoretical Neuroscience at Berkeley).

At Redwood, he developed a theory of the brain, that he expounds in this book. The book is a popular science book; you will not find any equations or explicit algorithms. It reads very smoothly, undoubtably due to the fact that Hawkins was helped in writing the book by Sandra Blakeslee, a science writer for the New York Times.

Hawkins argues that the cortex consists of modules that are all performing the same algorithm. The purpose of that algorithm is to learn to complete and predict the spatio-temporal patterns coming into a module from “lower” modules in the brain’s hierarchy, using feedback from “higher” modules.

I find this hypothesis for how the brain works extremely attractive, and “On Intelligence” argues for the hypothesis almost too well, in that the difficulties of converting the hypothesis into a concrete and useful algorithm will slip by the reader, as if by sleight of hand. The problem, of course, is that it is easy to use words to talk about how a brain might work; the hard part is making a machine do the same thing.

After reading the book, I actually spent a few weeks trying to make the hypothesis into a concrete algorithm, without much success. But Hawkins is certainly putting his money where his mouth is: he has founded a company, called Numenta, and Numenta has released software (the “Numenta Platform for Intelligent Computing” or “NuPIC”) that implements part of his theory. NuPIC appears to be based on software originally written by Numenta co-founder and Stanford graduate student Dileep George, but transformed by a team of software developers into a professional-quality product.

However, it is very disappointing that the NUPIC software does not include feedback in its hierarchies, nor does it let you learn and infer temporal patterns (only spatial ones). To be honest, I am not so surprised, because it was these elements (that are obviously central in Hawkins’ theory) that I found it so difficult to integrate into a real algorithm.

Numenta promises that future versions of NuPIC will fill these gaps. I sincerely wish them the best of luck! I should say that I think it is wonderful that an impatient guy like Hawkins pushes the field, using a different approach than the standard academic one.

So in summary, you should definitely read Hawkins’ book, and if you’re intrigued, you should by all means check out the free Numenta software. Whether this line of research will lead to anything significant though, I’m not really sure…

Many physicists study “disordered systems,” such as materials like glasses where the molecules making up the material are arranged randomly in space, in contrast to crystals, where all the particles are arranged in beautiful repeating patterns.

The symmetries of crystals make them much easier to analyze than glasses, and new theoretical methods had to be invented before physicists could make any headway in computing the properties of disordered systems. Those methods have turned out to be closely connected to approaches, such as the “belief propagation” algorithm, that are widely used in computer science, artificial intelligence, and communications theory, with the result that physicists and computer scientists today regularly exchange new ideas and results across their disciplines.

Returning to the physics of disordered systems, physicists began working on the problem in the 1970’s by considering the problem of disordered magnets (also called “spin glasses”). My Ph.D. thesis advisor, Philip W. Anderson summarized the history as follows:

“In 1975 S.F. (now Sir Sam) Edwards and I wrote down the “replica” theory of the phenomenon I had earlier named “spin glass”, followed up in ’77 by a paper of D.J. Thouless, my student Richard Palmer, and myself. A brilliant further breakthrough by G. Toulouse and G. Parisi led to a full solution of the problem, which turned out to entail a new form of statistical mechanics of wide applicability in fields as far apart as computer science, protein folding, neural networks, and evolutionary modelling, to all of which directions my students and/or I contributed.”

In 1992, I presented five lectures on “Quenched Disorder: Understanding Glasses Using a Variational Principle and the Replica Method” at a Santa Fe Institute summer school on complex systems. The lectures were published in a book edited by Lynn Nadel and Daniel Stein, but that book is very hard to find, and I think that these lectures are still relevant, so I’m posting them here. As I say in the introduction, “I will discuss technical subjects, but I will try my best to introduce all the technical material in as gentle and comprehensible a way as possible, assuming no previous exposure to the subject of these lectures at all.”

The first lecture is an introduction to the basics of statistical mechanics. It introduces magnetic systems and particle systems, and describes how to exactly solve non-interacting magnetic systems and particle systems where the particles are connected by springs.

The second lecture introduces the idea of variational approaches. Roughly speaking, the idea of a variational approach is to construct an approximate but exactly soluble system that is as close as possible to the system you are interested in. The grandly titled “Gaussian variational method” is the variational method that tries to find the set of particles and springs that best approximates an interacting particle system. I describe in this second lecture how the Gaussian variational method can be applied to heteropolymers like proteins.

The next three lectures cover the replica method, and combine it with the variational approach. The replica method is highly intricate mathematically. I learned it at the feet of the masters during my two years at the Ecole Normale Superieure (ENS) in Paris. In particular, I was lucky to work with Jean-Philippe Bouchaud, Antoine Georges, and Marc Mezard, who taught me what I knew. I thought it unfortunate that there wasn’t a written tutorial on the replica method, so the result were these lectures. Marc told me that for years afterwards they were given to new students of the replica method at the ENS.

Nowadays, the replica method is a little less popular than it used to be, mostly because it is all about computing averages of quantities over many samples of systems that are disordered according to some probability distribution. While those averages are very useful in physics, they are somewhat less important in computer science, where you usually just want an algorithm to deal with the one disordered system in front of you, rather than an average over all the possible disordered systems.

In 2002, I gave a lecture at the Mathematical Sciences Research Institute on the work I did, together with Bill Freeman and Yair Weiss on Generalized Belief Propagation, and the correspondence between free energy approximations and message passing algorithms. The lecture is available as a streaming video, together with a pdf for the slides, here.

It’s worth mentioning that there are many other interesting research lectures available in MSRI’s video archive, and that the more recent ones are of higher production quality.

Here is our most recent and comprehensive paper on this subject, published in the July 2005 issue of IEEE Transactions on Information Theory, which gives many additional details compared to the lecture: MERL TR2004-040.

If that paper is too difficult, you should probably start with this earlier paper, which was more tutorial in nature: MERL TR2001-22.

If you’re looking for generalized belief propagation software, your best bet is this package written by Yair’s student Talya Meltzer.

P.S.: I realized I haven’t told those of you who don’t know anything about it what generalized belief propagation is. Well, one answer is to that is look at the above material! But here’s a little background text that I’ve copied from my research statement to explain why you might be interested:

Most of my current research involves the application of statistical methods to “inference” problems. Some important fields which are dominated by the issue of inference are computer vision, speech recognition, natural language processing, error-control coding and digital communications. Essentially, any time you are receiving a noisy signal, and need to infer what is really out there, you are dealing with an inference problem.

A productive way to deal with an inference problem is to formalize it as a problem of computing probabilities in a “graphical model.” Graphical models, which are referred to in various guises as “Markov random fields,” “Bayesian networks,” or “factor graphs,” provide a statistical framework to encapsulate our knowledge of a system and to infer from incomplete information.

Physicists who use the techniques of statistical mechanics to study the behavior of disordered magnetic spin systems are actually studying a mathematically equivalent problem to the inference problem studied by computer scientists or electrical engineers, but with different terminology, goals, and perspectives. My own research has focused on the surprising relationships between methods that are used in these communities, and on powerful new techniques and algorithms, such as Generalized Belief Propagation, that can be understood using those relationships.