Tishby’s talk wasn’t easy for me to follow—he assumed you already knew rate-distortion theory and the Bellman equation, and I didn’t—but it was great!

It was about the ‘action-perception loop’:

This is the feedback loop in which living organisms—like us—take actions depending on our goals and what we perceive, and perceive things depending on the actions we take and the state of the world.

How do we do this so well? Among other things, we need to balance the cost of storing information about the past against the payoff of achieving our desired goals in the future.

Tishby presented a detailed yet highly general mathematical model of this! And he ended by testing the model on experiments with cats listening to music and rats swimming to land.

It’s beautiful stuff. I want to learn it. I hope to blog about it as I understand more. But for now, let me just dive in and say some basic stuff. I’ll start with the two buzzwords I dropped on you. I hate it when people use terminology without ever explaining it.

Rate-distortion theory

Rate-distortion theory is a branch of information theory which seeks to find the minimum rate at which bits must be communicated over a noisy channel so that the signal can be approximately reconstructed at the other end without exceeding a given distortion. Shannon’s first big result in this theory, the ‘rate-distortion theorem’, gives a formula for this minimum rate. Needless to say, it still requires a lot of extra work to determine and achieve this minimum rate in practice.

The Bellman equation

The Bellman equation reduces the task of finding an optimal course of action to choosing what to do at each step. So, you’re trying to maximize the ‘total reward’—the sum of rewards at each time step—and Bellman’s equation says what to do at each time step.

If you’ve studied physics, this should remind you of how starting from the principle of least action, we can get a differential equation describing the motion of a particle: the Euler–Lagrange equation.

And in fact they’re deeply related. The relation is obscured by two little things. First, Bellman’s equation is usually formulated in a context where time passes in discrete steps, while the Euler–Lagrange equation is usually formulated in continuous time. Second, Bellman’s equation is really the discrete-time version not of the Euler–Lagrange equation but a more or less equivalent thing: the ‘Hamilton–Jacobi equation’.

Ah, another buzzword to demystify! I was scared of the Hamilton–Jacobi equation for years, until I taught a course on classical mechanics that covered it. Now I think it’s the greatest thing in the world!

Briefly: the Hamilton–Jacobi equation concerns the least possible action to get from a fixed starting point to a point in space at time If we call this least possible action the Hamilton–Jacobi equation says

where is the particle’s momentum at the endpoint of its path, and is its energy there.

If we replace derivatives by differences, and talk about maximizing total reward instead of minimizing action, we get Bellman’s equation:

Markov decision processes

Bellman’s equation can be useful whenever you’re trying to figure out an optimal course of action. An important example is a ‘Markov decision process’. To prepare you for Tishby’s talk, I should say what this is.

In a Markov process, something randomly hops around from state to state with fixed probabilities. In the simplest case there’s a finite set of states, and time proceeds in discrete steps. At each time step, the probability to hop from state to state is some fixed number

This sort of thing is called a Markov chain, or if you feel the need to be more insistent, a discrete-time Markov chain.

A Markov decision process is a generalization where an outside agent gets to change these probabilities! The agent gets to choose actions from some set If at a given time he chooses the action the probability of the system hopping from state to state is Needless to say, these probabilities have to sum to one for any fixed

That would already be interesting, but the real fun is that there’s also a reward This is a real number saying how much joy or misery the agent experiences if he does action and the system hops from to

The problem is to choose a policy—a function from states to actions—that maximizes the total expected reward over some period of time. This is precisely the kind of thing Bellman’s equation is good for!

If you’re an economist you might also want to ‘discount’ future rewards, saying that a reward time steps in the future gets multiplied by where is some discount factor. This extra tweak is easily handled, and you can see it all here:

Partially observable Markov decision processes

There’s a further generalization where the agent can’t see all the details of the system! Instead, when he takes an action and the system hops from state to state he sees something: a point in some set of observations. He makes the observation with probability

(I don’t know why this probability depends on but not Maybe it ultimately doesn’t matter much.)

Again, the goal is to choose a policy that maximizes the expected total reward. But a policy is a bit different now. The action at any time can only depend on all the observations made thus far.

Partially observable Markov decision processes are also called POMPDs. If you want to learn about them, try these:

The latter includes an introduction without any formulas to POMDPs and how to choose optimal policies. I’m not sure who would study this subject and not want to see formulas, but it’s certainly a good exercise to explain things using just words—and it makes certain things easier to understand (though not others, in a way that depends on who is trying to learn the stuff).

The action-perception loop

I already explained the action-perception loop, with the help of this picture from the University of Bielefeld’s Department of Cognitive Neuroscience:

Nafthali Tishby has a nice picture of it that’s more abstract:

We’re assuming time comes in discrete steps, just to keep things simple.

At each time

• the world has some state and
• the agent has some state

Why the letter ? This stands for memory: it can be the state of the agent’s memory, but I prefer to think of it as the state of the agent.

At each time

• the agent takes an action which affects the world’s next state, and

• the world provides a sensation to the agent, which affect’s the agent’s next state.

This is simplified but very nice. Note that there’s a symmetry interchanging the world and the agent!

We could make it fancier by having lots of agents who all interact, but there are a lot of questions already. The big question Tishby focuses on is optimizing how much the agent should remember about the past if they

• get a reward depending on the action taken and the resulting state of the world

but

• pay a price for the information stored from sensations.

Tishby formulates this optimization question as something like a partially observed Markov decision process, uses rate-distortion theory to analyze how much information needs to be stored to achieve a given reward, and uses Bellman’s equation to solve the optimization problem!

So, everything I sketched fits together somehow!

I hope what I’m saying now is roughly right: it will take me more time to get the details straight. If you’re having trouble absorbing all the information I just threw at you, don’t feel bad: so am I. But the math feels really natural and good to me. It involves a lot of my favorite ideas (like generalizations of the principle of least action, and relative entropy), and it seems ripe to be combined with network theory ideas.

Post navigation

13 Responses to Sensing and Acting Under Information Constraints

I come from this field, sort of (my very first conference paper was about Q-learning), and there’s a few wrinkles to this.

Most important is perhaps that real agents generally do not use exponential discounting of future rewards, but hyperbolic discounting.

For instance, if I offer you 100 tonight, or 101 tomorrow, you’re likely to take me up on the offer right now. But let’s say I offer you 100 in exactly a year, or 101 in a year and a day. Exponential discounting predicts that you will take the 100, whereas hyperbolic discounting – and real organisms – will go for 101 a day later.

So the Bellman equation doesn’t really hold for real-life agents. There’s ways to deal with it, but things start to get a little messy.

I do not think that such an easy argument can be used against exponential discounting.

When the offer is made by a trusted authority people might take the 101 dollars a day later (because they do not need the 100 dollars today) and in pensions many of us take the early lower pension over the higher later pension.

I gave a simple example as an illustration. You can frequently find that organisms don’t conform to exponential discounting in their decision making (technically it’s frequently not exactly hyperbolic either).

I was surprised that there was so little about vision at this workshop (going by the titles). Vision is the most information-rich sense that we have, and there are clearly informational constraints. We have 100 million photoreceptive cells per retina, but only one million fibres in an optic nerve. One third of the brain is devoted to vision. And there is evidence from change blindness (http://en.wikipedia.org/wiki/Change_blindness) that we don’t remember much about what we just saw, even 100ms later. This seems a much better testing ground for Tishby’s theory than sleeping cats and mice swimming in milk.

The book Active Vision by John M Findlay and Iain D Gilchrist, 2003, makes a strong case for vision being an active process rather than a passive one. Especially important are eye movements, which depend on what you’re trying to do, and what you’ve seen (and remembered) so far. Each saccade, you change what you see. The saccades even give you discrete time steps, more or less.

There was a fun talk about information theory in vision, and certainly a lot of people alluded to it. But the application of information theory people talked about most was to intercellular molecular communication!

Stimulus-response organization of behavior may occur, but I have yet to find a good example. Even reflexes, when studied closely, only appear to be organized according to S-R concepts. If such S-R organization exists, then it is a highly derived trait and a rather rare exception.

There’s an interesting paper on information seeking, which alludes to some of the topics mentioned here…so I can’t wait to watch the talk! How we handle information, obviously, impacts hugely how we interact with our environment + learn + seek out information.

For POMDP-type models, based on what I understand, how does it handle varying probabilities? In an actual environment – agent interaction situation, the probability of the event is not always constant. How does that affect the optimal strategy?

I imagine POMDP models either 1) need to be generalized to allow the agent to update its policies as time goes on, based on new data, or 2) need to be able to handle enough contingencies that any change in conditions is simply treated as another observation.

I hope you get what I mean. In case 1) we have actual learning with the passage of time as we run the model, while in case 2) all the learning must occur before we start running the model; the model needs to be sufficiently well-prepared to be already familiar with all situations it comes across.

Clearly approach 1) is more ambitious and more interesting. I wonder what various systems like self-driving cars actually do! Does one of Google’s cars actually learn as it drives, for example learn more about the probability that another car will cut in front of it as it drives along a particular highway you use to go do work, perhaps in a way that depends on the time of day? This could be good, but it’s probably harder to get right.

I remember an old discussion on sci.physics.research where James Dolan discussed a type of linear algebra based on Bellman’s Dynamic Programming where addition was replaced by the minimization of costs and multiplication was replaced by the addition of costs. The transition costs for traversing a graph roughly corresponded to the action of classical mechanics so that minimizing the cost of a path from point a to point b corresponded to the least action principle. From what I remember, this approach really showed the close relationship between statistical, classical, and quantum mechanics, although I also remember never quite grokking the whole picture. Does this ring a bell for anyone? Does anyone know if this approach is written up anywhere?

I spent lots of time discussing this with James, and you can read bits of it written up in my Winter 2007 quantum gravity seminar notes, especially here:

• Week 12 (Jan. 30) – Classical, quantum and statistical mechanics as "matrix mechanics". In quantum mechanics we use linear algebra over the ring of complex numbers; in classical mechanics everything is formally the same, but we instead use the rig Rmin = R ∪ {+∞} where the addition is min and the multiplication is +. As a warmup for bringing statistical mechanics into the picture – and linear algebra over yet another rig – we recall how the dynamics of particles becomes the statics of strings after Wick rotation. Blog entry.

How To Write Math Here:

You need the word 'latex' right after the first dollar sign, and it needs a space after it. Double dollar signs don't work, and other limitations apply, some described here. You can't preview comments here, but I'm happy to fix errors.