Relative Entropy in Evolutionary Dynamics

In John’s information geometry series, he mentioned some of my work in evolutionary dynamics. Today I’m going to tell you about some exciting extensions!

The replicator equation

First a little refresher. For a population of replicating types, such as individuals with different eye colors or a gene with distinct alleles, the ‘replicator equation’ expresses the main idea of natural selection: the relative rate of growth of each type should be proportional to the difference between the fitness of the type and the mean fitness in the population.

To see why this equation should be true, let $P_i$ be the population of individuals of the $i$th type, which we allow to be any nonnegative real number. We can list all these numbers and get a vector:

$$P = (P_1, \dots, P_n)$$

The Lotka–Volterra equation is a very general rule for how these numbers can change with time:

$$\frac{dP_i}{dt} = f_i(P)\, P_i$$

Each population grows at a rate proportional to itself, where the ‘constant of proportionality’, $f_i(P)$, is not necessarily constant: it can be any real-valued function of $P$. This function is called the fitness of the $i$th type. Taken all together, these functions $f_i$ are called the fitness landscape.

Let $p_i$ be the fraction of individuals who are of the $i$th type:

$$p_i = \frac{P_i}{\sum_{j=1}^n P_j}$$

These numbers $p_i$ are between 0 and 1, and they add up to 1. So, we can also think of them as probabilities: $p_i$ is the probability that a randomly chosen individual is of the $i$th type. This is how probability theory, and eventually entropy, gets into the game.

Again, we can bundle these numbers into a vector:

$$p = (p_1, \dots, p_n)$$

which we call the population distribution. It turns out that the Lotka–Volterra equation implies the replicator equation:

$$\frac{dp_i}{dt} = \big( f_i(P) - \langle f(P) \rangle \big)\, p_i$$

where

$$\langle f(P) \rangle = \sum_{i=1}^n f_i(P)\, p_i$$

is the mean fitness of all the individuals. You can see the proof in Part 9 of the information geometry series.

By the way: if each fitness $f_i(P)$ only depends on the fraction of individuals of each type, not the total numbers, we can write the replicator equation in a simpler way:

$$\frac{dp_i}{dt} = \big( f_i(p) - \langle f(p) \rangle \big)\, p_i$$

From now on, when talking about this equation, that’s what I’ll do.

Anyway, the take-home message is this: the replicator equation says the fraction of individuals of any type grows at a relative rate equal to the fitness of that type minus the mean fitness.
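To see the passage from Lotka–Volterra to the replicator equation numerically, here is a minimal sketch. It assumes a fitness landscape of the form $f_i(p) = \sum_j A_{ij} p_j$ with an illustrative payoff matrix `A`; the matrix, the initial populations, and the Runge–Kutta helper `rk4_step` are all choices made for this example, not taken from the post. It integrates the total populations, normalizes them, and compares the result with integrating the fractions directly.

```python
# Illustrative payoff matrix (an assumption made for this example).
A = [[0.0, -1.0, 2.0],
     [2.0, 0.0, -1.0],
     [-1.0, 2.0, 0.0]]

def normalize(P):
    total = sum(P)
    return [x / total for x in P]

def fitness(p):
    # f_i(p) = sum_j A[i][j] * p_j: depends only on the fractions p
    return [sum(a * x for a, x in zip(row, p)) for row in A]

def lotka_volterra_rhs(P):
    # dP_i/dt = f_i * P_i
    f = fitness(normalize(P))
    return [fi * Pi for fi, Pi in zip(f, P)]

def replicator_rhs(p):
    # dp_i/dt = (f_i - <f>) * p_i
    f = fitness(p)
    mean = sum(fi * pi for fi, pi in zip(f, p))
    return [(fi - mean) * pi for fi, pi in zip(f, p)]

def rk4_step(rhs, x, dt):
    # One classical fourth-order Runge-Kutta step.
    k1 = rhs(x)
    k2 = rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
    k3 = rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
    k4 = rhs([xi + dt * k for xi, k in zip(x, k3)])
    return [xi + dt * (a + 2 * b + 2 * c + d) / 6
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

P = [50.0, 30.0, 20.0]   # total populations
p = normalize(P)         # corresponding fractions
for _ in range(2000):
    P = rk4_step(lotka_volterra_rhs, P, 0.001)
    p = rk4_step(replicator_rhs, p, 0.001)

# The normalized Lotka-Volterra solution should track the replicator solution.
max_gap = max(abs(a - b) for a, b in zip(normalize(P), p))
print(max_gap)
```

The gap printed at the end should be tiny, limited only by the integrator's accuracy.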

Now, it has been known since the late 1970s or early 1980s, thanks to the work of Akin, Bomze, Hofbauer, Shahshahani, and others, that the replicator equation has some very interesting properties. For one thing, it often makes ‘relative entropy’ decrease. For another, it’s often an example of ‘gradient flow’. Let’s look at both of these in turn, and then talk about some new generalizations of these facts.

Relative entropy as a Lyapunov function

I mentioned that we can think of a population distribution as a probability distribution. This lets us take ideas from probability theory and even information theory and apply them to evolutionary dynamics! For example, given two population distributions $p$ and $q$, the information of $q$ relative to $p$ is

$$I(q,p) = \sum_i q_i \ln\left(\frac{q_i}{p_i}\right)$$

This measures how much information you gain if you have a hypothesis about some state of affairs given by the probability distribution $p$, and then someone tells you “no, the best hypothesis is $q$!”

It may seem weird to treat a population distribution as a hypothesis, but this turns out to be a good idea. Evolution can then be seen as a learning process: a process of improving the hypothesis.
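For readers who like to compute, here is a minimal sketch of the relative information $I(q,p) = \sum_i q_i \ln(q_i/p_i)$. The function name `relative_information` and the two example distributions are made up for illustration, and the code assumes $p_i > 0$ wherever $q_i > 0$.

```python
import math

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i), with the convention 0 ln 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p = [0.7, 0.2, 0.1]      # the current hypothesis
q = [1/3, 1/3, 1/3]      # the 'better' hypothesis

print(relative_information(q, p))   # positive
print(relative_information(p, q))   # a different number: not symmetric
print(relative_information(q, q))   # zero when the two distributions agree
```

The lack of symmetry is one reason relative information is only ‘distance-like’ rather than a true distance.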

We can make this precise by seeing how the relative information changes with the passage of time. Suppose we have two population distributions $q$ and $p(t)$. Suppose $q$ is fixed, while $p(t)$ evolves in time according to the replicator equation. Then

$$\frac{d}{dt} I(q,p(t)) = \sum_i f_i(p)\, (p_i - q_i)$$

So, the information of $q$ relative to $p(t)$ will decrease as $p(t)$ evolves according to the replicator equation if

$$\sum_i f_i(p)\, (p_i - q_i) \le 0$$

If $q$ makes this true for all $p$, we say $q$ is an evolutionarily stable state. For some reasons why, see Part 13.

What matters now is that when $q$ is an evolutionarily stable state, $I(q,p(t))$ says how much information the population has ‘left to learn’, and we’re seeing that this always decreases. Moreover, it turns out that we always have

$$I(q,p) \ge 0$$

and $I(q,p) = 0$ precisely when $p = q$.

People summarize all this by saying that relative information is a ‘Lyapunov function’. Very roughly, a Lyapunov function is something that decreases with the passage of time, and is zero only at the unique stable state. To be a bit more precise, suppose we have a differential equation like

$$\frac{d}{dt} x(t) = v(x(t))$$

where $x(t) \in \mathbb{R}^n$ and $v$ is some smooth vector field on $\mathbb{R}^n$. Then a smooth function

$$V : \mathbb{R}^n \to \mathbb{R}$$

is a Lyapunov function if

• $V(x) \ge 0$ for all $x \in \mathbb{R}^n$

• $V(x) = 0$ iff $x$ is some particular point $x_0 \in \mathbb{R}^n$

and

• $\frac{d}{dt} V(x(t)) \le 0$ for every solution of our differential equation.

In this situation, the point $x_0$ is a stable equilibrium for our differential equation: this is Lyapunov’s theorem.
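Here is a small numerical illustration of relative information acting as a Lyapunov function. It is only a sketch under stated assumptions: the payoff matrix `A` is a rock-paper-scissors game in which winning pays more than losing costs (chosen for this example), for which the uniform distribution `q` is an interior evolutionarily stable state, and the replicator equation is integrated with a plain forward Euler step.

```python
import math

# Assumed payoffs: rock-paper-scissors where a win (+2) outweighs a loss (-1).
A = [[0.0, -1.0, 2.0],
     [2.0, 0.0, -1.0],
     [-1.0, 2.0, 0.0]]
q = [1/3, 1/3, 1/3]      # interior evolutionarily stable state for this A

def replicator_rhs(p):
    f = [sum(a * x for a, x in zip(row, p)) for row in A]
    mean = sum(fi * pi for fi, pi in zip(f, p))
    return [(fi - mean) * pi for fi, pi in zip(f, p)]

def relative_information(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p = [0.8, 0.15, 0.05]
dt = 0.001
history = [relative_information(q, p)]
for step in range(1, 20001):
    # forward Euler step of the replicator equation
    p = [pi + dt * v for pi, v in zip(p, replicator_rhs(p))]
    if step % 2000 == 0:
        history.append(relative_information(q, p))

print(history)   # sampled values of I(q, p(t))
```

The sampled values should shrink steadily toward zero as the population spirals in to `q`.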

The replicator equation as a gradient flow equation

The basic idea of Lyapunov’s theorem is that when a ball likes to roll downhill and the landscape has just one bottom point, that point will be the unique stable equilibrium for the ball.

The idea of gradient flow is similar, but different. Sometimes things like to roll downhill as efficiently as possible: they move in exactly the best direction to make some quantity smaller! Under certain conditions, the replicator equation is an example of this phenomenon.

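Let’s fill in some details. For starters, suppose we have some function

$$V : \mathbb{R}^n \to \mathbb{R}$$

Think of $V$ as ‘height’. Then the gradient flow equation says how a point $x(t) \in \mathbb{R}^n$ will move if it’s always trying its very best to go downhill:

$$\frac{d}{dt} x(t) = -\nabla V(x(t))$$

Here $\nabla$ is the usual gradient in Euclidean space:

$$\nabla V = (\partial_1 V, \dots, \partial_n V)$$

where $\partial_i$ is short for the partial derivative with respect to the $i$th coordinate.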
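As a minimal sketch of Euclidean gradient flow $dx/dt = -\nabla V(x)$, the snippet below follows a point downhill on an assumed quadratic ‘height’ function. The function `V`, its hand-computed gradient `grad_V`, the starting point, and the Euler step size are all illustrative choices.

```python
def V(x):
    # an assumed 'height' function with a single lowest point at (1, -2)
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad_V(x):
    # Euclidean gradient (d/dx_0 V, d/dx_1 V), computed by hand
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]

x = [4.0, 3.0]
dt = 0.01
heights = [V(x)]
for _ in range(500):
    # forward Euler step of dx/dt = -grad V(x)
    x = [xi - dt * gi for xi, gi in zip(x, grad_V(x))]
    heights.append(V(x))

print(x, heights[0], heights[-1])
```

The height never increases along the way, and the point ends up close to the bottom at $(1, -2)$.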

The interesting thing is that under certain conditions, the replicator equation is an example of a gradient flow equation… but typically not one where $\nabla$ is the usual gradient in Euclidean space. Instead, it’s the gradient on some other space, the space of all population distributions, which has a non-Euclidean geometry! This space is a simplex:

$$\{ p \in \mathbb{R}^n : \; p_i \ge 0, \; \sum_{i=1}^n p_i = 1 \}$$

For example, it’s an equilateral triangle when $n = 3$. The equilateral triangle looks flat, but if we measure distances another way it becomes round, exactly like a portion of a sphere, and that’s the non-Euclidean geometry we need!

In fact this trick works in any dimension. The idea is to give the simplex a special Riemannian metric, the ‘Fisher information metric’. The usual metric on Euclidean space is

$$\delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}$$

This simply says that two standard basis vectors like $(0,1,0,0)$ and $(0,0,1,0)$ have dot product zero if the 1’s are in different places, and one if they’re in the same place. The Fisher information metric is a bit more complicated:

$$g_{ij} = \frac{\delta_{ij}}{p_i}$$

As before, $g_{ij}$ is a formula for the dot product of the $i$th and $j$th standard basis vectors, but now it depends on where you are in the simplex of population distributions.

We saw how this formula arises from information theory back in Part 7. I won’t repeat the calculation, but the idea is this. Fix a population distribution $p$ and consider the information of another one, say $q$, relative to this. We get $I(q,p)$. If $q = p$ this is zero:

$$I(q,p)\Big|_{q = p} = 0$$

and this point is a local minimum for the relative information. So, the first derivative of $I(q,p)$ as we change $q$ within the simplex must be zero:

$$\sum_i v_i \frac{\partial}{\partial q_i} I(q,p)\Big|_{q = p} = 0 \quad \text{whenever } \sum_i v_i = 0$$

But the second derivatives are not zero. In fact, since we’re at a local minimum, it should not be surprising that we get a positive definite matrix of second derivatives:

$$\frac{\partial^2}{\partial q_i \, \partial q_j} I(q,p)\Big|_{q = p} = g_{ij}$$

And, this is the Fisher information metric! So, the Fisher information metric is a way of taking dot products between vectors in the simplex of population distributions that’s based on the concept of relative information.
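For completeness, here is the short calculation behind that claim, written out as a sketch. It treats $I(q,p) = \sum_k q_k \ln(q_k/p_k)$ as a function of all the coordinates $q_i$ and uses $g_{ij} = \delta_{ij}/p_i$ for the Fisher information metric.

```latex
\begin{aligned}
I(q,p) &= \sum_k q_k \ln\frac{q_k}{p_k}, \\
\frac{\partial I}{\partial q_i} &= \ln\frac{q_i}{p_i} + 1
  \quad\text{which equals } 1 \text{ at } q = p, \\
\frac{\partial^2 I}{\partial q_i \, \partial q_j} &= \frac{\delta_{ij}}{q_i}
  \quad\text{which equals } \frac{\delta_{ij}}{p_i} = g_{ij} \text{ at } q = p.
\end{aligned}
```

Note that the first derivative is the same constant in every coordinate direction, so it gives zero on any vector $v$ tangent to the simplex, that is, any $v$ with $\sum_i v_i = 0$. That is the sense in which $q = p$ is a critical point.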

This is not the place to explain Riemannian geometry, but any metric gives a way to measure angles and distances, and thus a way to define the gradient of a function. After all, the gradient of a function should point at right angles to the level sets of that function, and its length should equal the slope of that function:

So, if we change our way of measuring angles and distances, we get a new concept of gradient! The $i$th component of this new gradient vector field turns out to be

$$(\nabla_g V)^i = g^{ij} \, \partial_j V$$

where $g^{ij}$ is the inverse of the matrix $g_{ij}$, and we sum over the repeated index $j$. As a sanity check, make sure you see why this is the usual Euclidean gradient when $g_{ij} = \delta_{ij}$.
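As a worked instance of this formula, take the Fisher information metric $g_{ij} = \delta_{ij}/p_i$. The matrix is diagonal, so its inverse is too, and the sum over the repeated index collapses to a single term:

```latex
g^{ij} = p_i \, \delta_{ij}
\quad\Longrightarrow\quad
(\nabla_g V)^i = \sum_j g^{ij} \, \partial_j V = p_i \, \partial_i V .
```

This is the gradient on the whole positive orthant. To stay on the simplex one also has to project onto vectors whose components sum to zero, which subtracts $p_i \sum_j p_j \, \partial_j V$ from the $i$th component.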

Now suppose the fitness landscape is the good old Euclidean gradient of some function. Then it turns out that the replicator equation is a special case of gradient flow on the space of population distributions… but where we use the Fisher information metric to define our concept of gradient!

To get a feel for this, it’s good to start with the Lotka–Volterra equation, which describes how the total number of individuals of each type changes. Suppose the fitness landscape is the Euclidean gradient of some function $V$:

$$f_i(P) = \partial_i V(P)$$

Then the Lotka–Volterra equation becomes this:

$$\frac{dP_i}{dt} = \partial_i V(P) \, P_i$$

This doesn’t look like the gradient flow equation, thanks to that annoying $P_i$ on the right-hand side! It certainly ain’t the gradient flow coming from the function $V$ and the usual Euclidean gradient. However, it is gradient flow coming from $V$ and some other metric on the space $[0,\infty)^n$.

For a proof, and the formula for this other metric, see Section 3.7 in this survey:

Again, if the fitness landscape is a Euclidean gradient, we can rewrite the replicator equation as a gradient flow equation… but again, not with respect to the Euclidean metric. This time we need to use the Fisher information metric! I sketch a proof in my paper above.
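The claim can be checked numerically at a single point. The sketch below assumes a potential $V(p) = \frac{1}{2}\sum_{ij} S_{ij} p_i p_j$ with a symmetric matrix `S` made up for this example, so that $f_i(p) = \partial_i V(p) = \sum_j S_{ij} p_j$. It builds the vector $p_i (f_i - \langle f \rangle)$, which is exactly the right-hand side of the replicator equation, and verifies that it behaves like a gradient with respect to the Fisher information metric $g_{ij} = \delta_{ij}/p_i$ on the simplex. With this sign convention the replicator equation flows uphill in $V$; replacing $V$ by $-V$ gives the downhill convention used earlier.

```python
import random

def shahshahani_gradient(p, dV):
    """Gradient of V for g_ij = delta_ij / p_i, restricted to the simplex.

    dV is the list of Euclidean partial derivatives of V at p.
    """
    mean = sum(pi * di for pi, di in zip(p, dV))
    return [pi * (di - mean) for pi, di in zip(p, dV)]

def fisher_dot(p, v, w):
    # dot product of tangent vectors v and w at the point p
    return sum(vi * wi / pi for pi, vi, wi in zip(p, v, w))

# Assumed symmetric matrix defining the potential V (illustrative values).
S = [[1.0, 0.5, 0.0],
     [0.5, 2.0, 0.3],
     [0.0, 0.3, 0.5]]
p = [0.5, 0.3, 0.2]
f = [sum(s * x for s, x in zip(row, p)) for row in S]   # f_i = dV/dp_i
grad = shahshahani_gradient(p, f)                       # replicator right-hand side

# A random tangent vector to the simplex (components sum to zero).
random.seed(0)
w = [random.uniform(-1, 1) for _ in p]
shift = sum(w) / len(w)
w = [wi - shift for wi in w]

lhs = fisher_dot(p, grad, w)                  # g(grad V, w)
rhs = sum(fi * wi for fi, wi in zip(f, w))    # directional derivative of V along w

# Rate of change of V along the replicator flow versus the variance of fitness.
rate = sum(fi * gi for fi, gi in zip(f, grad))
mean = sum(fi * pi for fi, pi in zip(f, p))
variance = sum(pi * (fi - mean) ** 2 for fi, pi in zip(f, p))

print(lhs, rhs, sum(grad), rate, variance)
```

The first two numbers agree, the third is zero (the gradient is tangent to the simplex), and the last two agree: along the replicator flow $V$ changes at a rate equal to the variance of fitness.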

In fact, both these results were first worked out by Shahshahani:

• Siavash Shahshahani, A New Mathematical Framework for the Study of Linkage and Selection, Memoirs of the AMS 17, 1979.

New directions

All this is just the beginning! The ideas I just explained are unified in information geometry, where distance-like quantities such as the relative entropy and the Fisher information metric are studied. From here it’s a short walk to a very nice version of Fisher’s fundamental theorem of natural selection, which is familiar to researchers both in evolutionary dynamics and in information geometry.

You can see some very nice versions of this story for maximum likelihood estimators and linear programming here:

Indeed, this seems to be the first paper discussing the similarities between evolutionary game theory and information geometry.

Dash Fryer (at Pomona College) and I have generalized this story in several interesting ways.

First, there are two famous ways to generalize the usual formula for entropy: Tsallis entropy and Rényi entropy, both of which involve a parameter $q$. There are Tsallis and Rényi versions of relative entropy and the Fisher information metric as well. Everything I just explained about:

• conditions under which relative entropy is a Lyapunov function for the replicator equation, and

• conditions under which the replicator equation is a special case of gradient flow

generalizes to these cases! However, these generalized entropies give modified versions of the replicator equation. When we set $q = 1$ we get back the usual story. See

My initial interest in these alternate entropies was mostly mathematical—what is so special about the corresponding geometries?—but now researchers are starting to find populations that evolve according to these kinds of modified population dynamics! For example:

There’s an interesting special case worth some attention. Lots of people fret about the relative entropy not being a distance function obeying the axioms that mathematicians like: for example, it doesn’t obey the triangle inequality. Many describe the relative entropy as a distance-like function, and this is often a valid interpretation contextually. On the other hand, the $q = 0$ relative entropy is one-half the Euclidean distance squared! In this case the modified version of the replicator equation looks like this:

$$\frac{dp_i}{dt} = f_i(p) - \frac{1}{n} \sum_{j=1}^n f_j(p)$$

This equation is called the projection dynamic.
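Here is a minimal sketch of the projection dynamic in the interior of the simplex, $dp_i/dt = f_i(p) - \frac{1}{n}\sum_j f_j(p)$, tracking one-half the squared Euclidean distance to an evolutionarily stable state. The payoff matrix `A` (rock-paper-scissors with wins worth more than losses, for which the uniform distribution is evolutionarily stable), the starting point, and the Euler step are assumptions made for illustration.

```python
# Assumed payoffs: rock-paper-scissors where a win (+2) outweighs a loss (-1).
A = [[0.0, -1.0, 2.0],
     [2.0, 0.0, -1.0],
     [-1.0, 2.0, 0.0]]
q = [1/3, 1/3, 1/3]      # interior evolutionarily stable state for this A

def projection_rhs(p):
    f = [sum(a * x for a, x in zip(row, p)) for row in A]
    average = sum(f) / len(f)      # unweighted average, unlike the replicator mean
    return [fi - average for fi in f]

def half_squared_distance(q, p):
    return 0.5 * sum((qi - pi) ** 2 for qi, pi in zip(q, p))

p = [0.5, 0.3, 0.2]
dt = 0.001
distances = [half_squared_distance(q, p)]
for step in range(1, 10001):
    # forward Euler step of the projection dynamic
    p = [pi + dt * v for pi, v in zip(p, projection_rhs(p))]
    if step % 1000 == 0:
        distances.append(half_squared_distance(q, p))

print(distances)
```

For this example the trajectory stays inside the simplex and the sampled distances shrink toward zero. Near the boundary the projection dynamic needs extra care, which this sketch ignores.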

Later, I showed that there is a reasonable definition of relative entropy for a much larger family of geometries that satisfies a similar distance minimization property.

In a different direction, Dash showed that you can change the way that selection acts by using a variety of alternative ‘incentives’, extending the story to some other well-known equations describing evolutionary dynamics. By replacing the terms $p_i f_i(p)$ in the replicator equation with a variety of other functions, called incentives, we can generate many commonly studied models of evolutionary dynamics. For instance if we exponentiate the fitness landscape (to make it always positive), we get what is commonly known as the logit dynamic. This amounts to changing the fitness landscape as follows:

$$f_i(p) \mapsto \frac{e^{\beta f_i(p)}}{\sum_j e^{\beta f_j(p)}}$$

where $\beta$ is known as an inverse temperature in statistical thermodynamics and as an intensity of selection in evolutionary dynamics. There are lots of modified versions of the replicator equation, like the best-reply and projection dynamics, more common in economic applications of evolutionary game theory, and they can all be captured in this family. (There are also other ways to simultaneously capture such families, such as Bill Sandholm’s revision protocols, which were introduced earlier in his exploration of the foundations of game dynamics.)
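To get a feel for the role of the intensity of selection, here is a small sketch of the standard logit (softmax) choice rule, $e^{\beta f_i} / \sum_j e^{\beta f_j}$. The function name `logit_choice` and the fitness values are invented for this example, and whether this is exactly the normalization used in the incentive formalism is an assumption.

```python
import math

def logit_choice(f, beta):
    """Softmax of the fitness values f with intensity of selection beta."""
    top = max(f)   # subtracting the maximum avoids overflow for large beta
    weights = [math.exp(beta * (fi - top)) for fi in f]
    total = sum(weights)
    return [w / total for w in weights]

f = [1.0, 2.0, 0.5]   # assumed fitness values of three types
for beta in (0.0, 1.0, 10.0):
    print(beta, logit_choice(f, beta))
```

At $\beta = 0$ every type gets equal weight, so selection is switched off. As $\beta$ grows, the weight concentrates on the fittest type, approaching a best-reply rule.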

Dash showed that there is a natural generalization of evolutionarily stable states to ‘incentive stable states’, and that for incentive stable states, the relative entropy is decreasing to zero when the trajectories get near the equilibrium. For the logit and projection dynamics, incentive stable states are simply evolutionarily stable states, and this happens frequently, but not always.

The third generalization is to look at different ‘time-scales’: that is, different ways of describing time! We can make up the symbol $\mathbb{T}$ for a general choice of ‘time-scale’. So far I’ve been treating time as a real number, so

$$\mathbb{T} = \mathbb{R}$$

But we can also treat time as coming in discrete evenly spaced steps, which amounts to treating time as an integer:

$$\mathbb{T} = \mathbb{Z}$$

More generally, we can make the steps have duration $h$, where $h$ is any positive real number:

$$\mathbb{T} = h\mathbb{Z}$$

There is a nice way to simultaneously describe the cases $\mathbb{T} = \mathbb{R}$ and $\mathbb{T} = h\mathbb{Z}$ using the time-scale calculus and time-scale derivatives. For the time-scale $\mathbb{T} = \mathbb{R}$ the time-scale derivative is just the ordinary derivative. For the time-scale $\mathbb{T} = h\mathbb{Z}$, the time-scale derivative is given by the difference quotient from first year calculus:

$$f^{\Delta}(t) = \frac{f(t+h) - f(t)}{h}$$

and using this as a substitute for the derivative gives difference equations like a discrete-time version of the replicator equation. There are many other choices of time-scale, such as the quantum time-scale given by $\mathbb{T} = q^{\mathbb{Z}}$, in which case the time-scale derivative is called the q-derivative, but that’s a tale for another time. In any case, the fact that the successive relative entropies are decreasing can be stated simply by saying they have negative time-scale derivative. The continuous case we started with corresponds to $\mathbb{T} = \mathbb{R}$.
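A minimal sketch of the time-scale derivative on $h\mathbb{Z}$, which is just the forward difference quotient; the helper names `delta_derivative` and `square` are made up for this example.

```python
def delta_derivative(f, t, h):
    """Time-scale derivative on T = hZ: the forward difference quotient."""
    return (f(t + h) - f(t)) / h

def square(t):
    return t * t

for h in (1.0, 0.5, 0.01):
    # for f(t) = t^2 the difference quotient is 2t + h, here 6 + h
    print(h, delta_derivative(square, 3.0, h))
```

As $h$ shrinks the values approach the ordinary derivative $2t = 6$. Setting the time-scale derivative of $p_i$ equal to $p_i (f_i - \langle f \rangle)$ gives one difference equation of the kind mentioned above: $p_i(t+h) = p_i(t) + h \, p_i(t) \big(f_i - \langle f \rangle\big)$.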

Remarkably, Dash and I were able to show that you can combine all three of these generalizations into one theorem, and even allow for multiple interacting populations! This produces some really neat population trajectories, such as the following two populations with three types, with fitness functions corresponding to the rock-paper-scissors game. On top we have the replicator equation, which goes along with the Fisher information metric; on the bottom we have the logit dynamic, which goes along with the Euclidean metric on the simplex:

From our theorem, it follows that the relative entropy (ordinary relative entropy on top, the $q = 0$ entropy on bottom) converges to zero along the population trajectories.

The final form of the theorem is loosely as follows. Pick a Riemannian geometry given by a metric (obeying some mild conditions) and an incentive for each population, as well as a time scale ($\mathbb{R}$ or $h\mathbb{Z}$) for every population. This gives an evolutionary dynamic with a natural generalization of evolutionarily stable states, and a suitable version of the relative entropy. Then, if there is an evolutionarily stable state in the interior of the simplex, the sum of the relative entropies for each population has a negative time-scale derivative, so it decreases as the trajectories converge to the stable state!

When there isn’t such a stable state, we still get some interesting population dynamics, like the following:

Yes, that’s it. Thanks, fixed! In case anyone missed it the first time, it’s an illustration of how the simplex of population distributions gets a round geometry when we use the Fisher information metric:

And if it is an exact sphere, is there a nice formula to map a probability distribution (a point in the simplex) to a point on a sphere embedded in space? If so, is there any interesting way of understanding that formula as being natural (i.e. as giving some other representation of the probability distribution)? I might guess the mapping p_i to p_i squared (since it does map to a sphere)… but I’m more used to seeing probabilities be amplitudes squared, than to squaring probabilities.

The simplex of population distributions, equipped with its Fisher information metric, has exactly the geometry of a portion of a perfectly round sphere of radius 2. To see this, we can use the formula for the Fisher information metric together with the map

$$p_i \mapsto x_i = 2\sqrt{p_i}$$

and do some calculations.

But the fact that it’s a sphere of radius 2 is not a big deal. If we stick the right constant in front of the Fisher information metric, the simplex gets to be the same shape as a portion of the unit sphere, via $x_i = \sqrt{p_i}$.

In short: we are indeed writing probabilities as squares of ‘amplitudes’ here, though any potential connection to quantum mechanics remains quite mysterious, at least to me. For one thing, these ‘amplitudes’ are real, but that’s the least of it: there is such a thing as real quantum mechanics, so real amplitudes don’t faze me. (Pun intended.) The bigger question would be why the information metric on the space of probabilities should make them look more like quantum states.

John’s answer is spot on. The radius doesn’t really matter because it just changes the velocity of the trajectories but not the trajectories themselves. There might be some issues on the boundary when you compute the Jacobian, but certainly for the interior of the simplex the mapping is exact, and the replicator equation is forward-invariant on the interior (if it starts in the interior, it stays in the interior for any finite time period).

Another implication of the mapping is that you essentially know what the geodesics of the Fisher metric are since they are great circle arcs on the sphere and can be pulled back to the simplex.
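As a quick numerical sanity check of the sphere picture, here is a minimal sketch. It assumes the map $x_i = 2\sqrt{p_i}$ and the Fisher information metric $g_{ij} = \delta_{ij}/p_i$; the function names and the sample distribution are invented for this example. It checks that the image lies on a sphere of radius 2, and that the great-circle distance between two nearby distributions matches the Fisher length of the small displacement between them.

```python
import math

def to_sphere(p):
    """Map a distribution to the sphere of radius 2 via x_i = 2 sqrt(p_i)."""
    return [2.0 * math.sqrt(pi) for pi in p]

def fisher_distance(p, q):
    """Great-circle distance on the radius-2 sphere between the images of p and q."""
    cosine = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, cosine))

p = [0.5, 0.3, 0.2]
dp = [1e-4, -3e-4, 2e-4]     # small tangent displacement, components sum to zero
q = [pi + di for pi, di in zip(p, dp)]

radius = math.sqrt(sum(x * x for x in to_sphere(p)))
# length of dp measured with the Fisher information metric at p
infinitesimal = math.sqrt(sum(di * di / pi for pi, di in zip(p, dp)))

print(radius, fisher_distance(p, q), infinitesimal)
```

The last two numbers agree to within a fraction of a percent, and the agreement improves as the displacement shrinks.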

There are versions of the Fisher information metric in quantum mechanical contexts (e.g. the Fubini-Study metric and the Bures metric), and lots of people study quantum information geometry. (I’m by no means an expert.) At the last big information geometry conference (IGAIA 2010) about half of the talks were about quantum information topics.

It is an interesting article.
I am thinking (I don’t know if it is useful) that it is not necessary to use the normalization.
If the fitness landscape obeys a simple constraint:

$$\sum_i f_i(P) \, P_i = 0$$

then the dynamics stay on the simplex. The replicator equation is simpler, and the constraint is on the fitness: if the fitness is a Taylor series, then the parameters obey a constraint, so that only n-1 fitnesses have free dynamics.
The application to alleles is interesting: if it is possible to infer some property of the population dynamics for some genetic diseases, then it may be possible to infer the number of defective alleles.

The constraint you mention is equivalent to saying the total population is constant:

$$\frac{d}{dt} \sum_i P_i = 0$$

That seems to be a rather special situation, not something we usually see in populations of organisms!

However, this constraint does hold in a situation like this: we have a population of game players with different strategies. They randomly meet in pairs and play a 2-player game. One or the other player wins, with some probability depending on both players’ strategies. Then the loser changes their strategy to that of the winner!

In the large-number limit, where random fluctuations become small, we can write down a differential equation for the time evolution of the population $P_i$ of players having the $i$th strategy. This gives a special case of the Lotka–Volterra equation. And in this special case we indeed have

$$\sum_i f_i(P) \, P_i = 0$$

since the total number of players never changes: they just change strategies!

There are lots of generalizations (e.g. to multiplayer games). As long as the total number of players never changes, your constraint holds and the dynamics of the populations stays on a simplex.

I thought that the equation for the probabilities, which is obtained from the numbers of individuals in the population, has a complex form because of the normalization needed to project onto the simplex.
If the fitness in the probability equation is the true law, which can be evaluated from experimental data, then everything seems easier.

My answer to Domenico’s question above hints at one of my obsessions: figuring out the relation between reaction networks and evolutionary game theory!

On this blog we’ve seen how reaction networks describe interacting collections of individuals of various types. This sounds related to evolutionary game theory… and indeed it is!

Say we have a reaction network. When we have small populations we describe their evolution stochastically using a ‘master equation’. In the limit of large populations, we can often ignore random fluctuations and use a ‘rate equation’ to describe the time evolution of the expected number of individuals of each type. This rate equation is always of the form

$$\frac{dP_i}{dt} = F_i(P)$$

for some functions $F_i$.

But sometimes ‘it takes one to make one’: every process that produces an individual of type $i$ as output must involve an individual of type $i$ as input. In this case, the rate equation has the special form

$$\frac{dP_i}{dt} = f_i(P) \, P_i$$

In other words, it reduces to the Lotka–Volterra equation!

Now look at what we’ve seen:

• Manoj Gopalkrishnan described a general theorem for reaction networks, giving conditions that ensure a certain ‘free energy’ function always decreases.

• Marc Harper has described a general theorem for the Lotka–Volterra equation (or its offspring, the replicator equation), giving conditions that ensure a certain ‘relative entropy’ function always decreases.

The second theorem has got to be a special case of the first, or… well, or both are a special case of some third, even better theorem!

So, what’s up? The theorem Manoj stated requires the existence of a ‘complex balanced equilibrium’. The theorem Marc stated requires the existence of an ‘evolutionarily stable state’. Is the second condition a special case of the first?

Well… I think the next article should help! For finite population models we typically assume that the population size doesn’t change, and Dash and I found a “Lyapunov Theorem” for that context. This finite population model is a Markov chain called the Moran process; Arne Traulsen and collaborators have shown that the transition probabilities of the Markov chain can be used to define a Master equation SDE which has the replicator equation as its Langevin equation.

Anyway, maybe I’m giving too much away, but Dash and I found that the stable states (local maxima of the stationary distribution) for the Markov chain are those that have an inflow-outflow balance (via sums of incoming and outgoing transition probabilities). We then show that these states are evolutionarily stable states! So I think we’re getting close to putting it all together!

See “The entropy-based Lyapunov function” on page 11. She considers chemical reaction networks obeying the law of mass action, assuming they have a complex balanced equilibrium… and the total number of individuals of all kinds is constant! Using this she gives a super-short proof that her ‘entropy-based Lyapunov function’ decreases with time.

Just for spectators who are having trouble keeping score:

Hangos’ entropy-based Lyapunov function is essentially the ‘free energy’ described in Manoj’s post:

$$V(x) = \sum_i \Big( x_i \big( \ln x_i - \ln x_i^* - 1 \big) + x_i^* \Big)$$

where $x_i$ is the number of individuals of type $i$ as a function of time, and $x_i^*$ is that number at the complex balanced equilibrium point. Hangos simply differentiates this with respect to time and uses some properties of the logarithm function to show the answer is $\le 0$.

Now we’ve seen David Anderson present a similar ‘brutally direct’ proof without the assumption that the total number of individuals of all kinds is constant.

These two papers, near the top: “Coevolutionary dynamics: From finite to infinite populations” and “Coevolutionary dynamics in large, but finite populations” by Traulsen et al. The second paper has more details, and the Fokker-Planck equation is the master equation in this context.

Dash and I have added a Lyapunov stability layer to finite populations and have shown that local extrema of the stationary distributions are (if I understand the terminology correctly) complex-balanced for sufficiently large populations (N=30 is typically large enough). These states then satisfy an evolutionary stability criterion that incorporates mutation, and all the classical results (e.g. the replicator stability that this article starts with) are recovered for small mutation rates and large populations.

The reason that the population needs to be sufficiently large is simply that using a finite population is essentially taking a partition of the simplex, and our approach requires a fine enough partition to get the local maxima of the stationary distribution to stabilize on the evolutionarily stable states. But usually the population doesn’t need to be that large.

Also, unless the population size is very small, having a fixed population size isn’t really that limiting in my experience, though admittedly it seems artificial…

Here is something that surprised me, so I just want to check that I got it right. Suppose that we have a species that plays two different “games”, say surviving predators and performing courtship rituals. In the first game it has strategies $1, \dots, n$ and the starting distribution is given by $p = (p_1, \dots, p_n)$, and in the second game it has strategies $1, \dots, m$ and distribution given by $q = (q_1, \dots, q_m)$. Furthermore, assume that the strategy each individual has in one game is independent of its strategy in the second game, and that its total utility is just the sum of the utilities for each of the two games. We now have $nm$ types in total, and the starting probability of having type $(i,j)$ is $p_i q_j$, and we can use the above formula to compute how the system evolves. Unless I missed something we get that:

1) The strategies in the two games will continue to be independent
2) The distributions on strategies in game 1 will evolve exactly as if they were only playing that game
3) Similar for game 2

Figuratively speaking, a species can take classes in martial arts and dancing (and many other skills) at the same time, and it will still improve as much in martial arts as if it was only taking martial arts classes, improve as much at dancing as if it was only taking dancing classes and so on. This really surprised me, I would have thought that being in an arms race would slow down the evolution of other traits.

Hi! I’ll let Marc answer, but I think the situation you describe is equivalent to one where we have two completely different species, one playing one game and one playing another, and then we define a new rather abstract kind of individual to be an ordered pair consisting of one individual of the first type and one of the second type. So, there’s no real interaction between what’s happening in the two games.

Maybe in real life various interactions complicate the situation? If you have to spend a certain amount of time per day playing each game, you’ll have less time for each game the more games you play, so your rate of improvement on each game will be less.

“…I think the situation you describe is equivalent to one where we have two completely different species, one playing one game and one playing another, and then we define a new rather abstract kind of individual to be an ordered pair consisting of one individual of the first type and one of the second type.”

Here’s a diagram of this for an imaginary laboratory animal experiment:

It’s a set-up that would test the existence of probability learning for nested situations: First, would lab animals use the strategy of probability learning to select between playing two games that each will detect another instance of probability learning– one being a game of foraging for food, the other a game where the reinforcement is sex. (From the abstracts I googled, sex as well as food can be used to reinforce behavior of lab animals.)

I think the lesson for a solitary player in the probability learning game is “Be conscious of missing a reinforcement, don’t ignore information about those occasions, and learn something from the information.” In other words, “Learn from your mistakes.”

But when involved in a group (like fish schooling around different sources of prey, see the link below), the lesson that overwrites this one seems to be: “Ignore information about missing a reinforcement given to a different group if others in your group are ignoring that information as well.”

I think the former describes laboratory experiments with individual players while the latter describes a Nash equilibrium among multiple probability learners playing on the same game board. Here’s the link–

There is a question about ESSs– in the same googles, I saw book sections about animals who hide the behavior of having sex with one who is not a mate. That would be hiding some of the information required to produce probability learning. For example, if the animal is foraging and therefore misses an opportunity for sex, but that information is hidden from the animal, then it would not have the information available to associate some amount of regret from foraging. (Please see the math in the above link, where there’s a model of regret.) Is behavior that hides information like this an ESS?

The fitness landscape could depend on both the $i$ and $j$ strategies, in which case I think that the evolution of each strategy wouldn’t be independent as suggested. That’s the case in the plots at the end of the article.

As John suggested, the scenario described could be transformed to a single game where the types are all the pairs $(i,j)$, with a modified fitness landscape. Then probably the solution to the combined system is the joint probability distribution built from the solutions to the two subsystems.
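This is easy to test numerically. The sketch below is an illustration under stated assumptions: the two payoff matrices `A` and `B` are made up, the fitness of a pair $(i,j)$ is taken to be the sum of the game-1 payoff of $i$ against the game-1 marginal and the game-2 payoff of $j$ against the game-2 marginal, and all three systems are integrated with the same Runge–Kutta step. It then compares the joint distribution with the product of the two separately evolved distributions.

```python
A = [[0.0, 2.0], [1.0, 0.5]]                                  # assumed payoffs, game 1
B = [[0.0, -1.0, 2.0], [2.0, 0.0, -1.0], [-1.0, 2.0, 0.0]]    # assumed payoffs, game 2
n, m = len(A), len(B)

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def replicator_rhs(x, fitness):
    mean = sum(fi * xi for fi, xi in zip(fitness, x))
    return [(fi - mean) * xi for fi, xi in zip(fitness, x)]

def single_rhs(M):
    return lambda x: replicator_rhs(x, matvec(M, x))

def joint_rhs(r):
    # r is the joint distribution over pairs (i, j), flattened row by row
    p = [sum(r[i * m + j] for j in range(m)) for i in range(n)]   # marginal, game 1
    q = [sum(r[i * m + j] for i in range(n)) for j in range(m)]   # marginal, game 2
    fp, fq = matvec(A, p), matvec(B, q)
    fitness = [fp[i] + fq[j] for i in range(n) for j in range(m)]  # utilities add
    return replicator_rhs(r, fitness)

def rk4_step(rhs, x, dt):
    # One classical fourth-order Runge-Kutta step.
    k1 = rhs(x)
    k2 = rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
    k3 = rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
    k4 = rhs([xi + dt * k for xi, k in zip(x, k3)])
    return [xi + dt * (a + 2 * b + 2 * c + d) / 6
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

p = [0.7, 0.3]
q = [0.6, 0.3, 0.1]
r = [pi * qj for pi in p for qj in q]     # independent start: r_ij = p_i q_j
rhs_p, rhs_q = single_rhs(A), single_rhs(B)
for _ in range(3000):
    p = rk4_step(rhs_p, p, 0.001)
    q = rk4_step(rhs_q, q, 0.001)
    r = rk4_step(joint_rhs, r, 0.001)

gap = max(abs(r[i * m + j] - p[i] * q[j]) for i in range(n) for j in range(m))
print(gap)
```

The gap stays at the level of integration error, consistent with the joint distribution remaining a product of the two separately evolving distributions when the utilities simply add.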

I’d be careful about overgeneralizing from this theoretical setup, since biologically speaking, a species in an arms race could still be evolving in other dimensions and there is dependence on the amount of mutation (we’ve assumed that there is none) and other factors.

Wow, a bit too much mathematical complexity in my opinion, but that’s not my criticism here. To me, ‘Evolution’ has nothing to do with ‘Population Dynamics’ and ‘fitness landscapes’, so a better title here would be:
‘Relative Entropy in Population Dynamics’

I’ve got to define the premises, the ‘priors’. I concede to Darwinism that the ‘fittest’ will become the most numerous and successful, and that elaborate mathematical descriptions of fitness landscapes can describe population dynamics, but I don’t agree that this model fits ‘Evolution’.
So here I go: Evolution represents a rise in Complexity (as in Kolmogorov algorithmic complexity) and/or algorithmic depth (as defined by Charles Bennett) and is better explained within the conceptual frame of information theory.
Species with a similar level of complexity could be viewed as a horizontal differentiation, but since ‘Complexity’ is not changed, they do not represent a case of ‘Evolution’, but a case of ‘Speciation’.
Both ‘Population Dynamics’ and ‘Speciation’ are interdependent and are probably a function of the fitness landscape among other things.
But Evolution, as a rise in Complexity, is of a different nature.
An increase in complexity during an Evolutionary event might even damage the fitness of its carriers, making them transiently maladapted or misfit, leading to small numbers in the population, in perfect agreement with the Darwinian idea that the fittest survive more.
I’ll go even further: what proof do we have that this ‘Complexity related Evolution’ is driven by the fittest?
It’s easy to understand that, in a given environment, sub-optimal individuals actually have more reasons to try new strategies and explore more ecological niches than the Alphas.
So somehow, true Evolution may have more to thank the misfits than the fittest: not exactly what Darwin said.
If it’s not the fittest, what drives Evolution? Well, some have said it’s a Maxwell’s demon. Evolution can be seen as a series of measurements of the environment, and Natural Selection acts as a filter of the useful measurements. Whatever it is, this process produces an ‘increased algorithmic complexity/depth’ but is not directly related to a successful population dynamic or speciation.
That comes only afterwards: once mutants with increased complexity appear, they have to adapt to survive and speciation has to take place rapidly, with the fittest individuals surviving as usual, but at this point, it’s not Evolution anymore, just survival.
