Information Geometry (Part 11)

Last time we saw that given a bunch of different species of self-replicating entities, the entropy of their population distribution can go either up or down as time passes. This is true even in the pathetically simple case where all the replicators have constant fitness—so they don’t interact with each other, and don’t run into any ‘limits to growth’.
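Here is a minimal sketch of that effect, assuming just two species whose populations grow as P_i(t) = P_i(0) exp(f_i t). The starting populations and fitnesses are made-up numbers, chosen so that the rarer species replicates faster.

```python
import math

def shannon_entropy(p):
    """Shannon entropy -sum p_i ln p_i in nats (0 ln 0 is taken as 0)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def fractions_at(t, initial_populations, fitnesses):
    """Population fractions at time t when P_i(t) = P_i(0) exp(f_i t)."""
    populations = [P0 * math.exp(f * t)
                   for P0, f in zip(initial_populations, fitnesses)]
    total = sum(populations)
    return [P / total for P in populations]

# Hypothetical numbers: species 1 is rare but fitter.
initial_populations = [1.0, 9.0]
fitnesses = [1.0, 0.0]

for t in [0.0, math.log(9), math.log(81)]:
    p = fractions_at(t, initial_populations, fitnesses)
    print(f"t = {t:.3f}  p = ({p[0]:.2f}, {p[1]:.2f})  "
          f"entropy = {shannon_entropy(p):.3f}")
```

The entropy climbs while the two fractions approach each other, peaks at ln 2 when they are equal, and falls again as the fitter species takes over.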

This is a bit of a bummer, since it would be nice to use entropy to explain how replicators are always extracting information from their environment, thanks to natural selection.

Luckily, a slight variant of entropy, called ‘relative entropy’, behaves better. When our replicators have an ‘evolutionarily stable state’, the relative entropy is guaranteed to always change in the same direction as time passes!

Thanks to Einstein, we’ve all heard that times and distances are relative. But how is entropy relative?

It’s easy to understand if you think of entropy as lack of information. Say I have a coin hidden under my hand. I tell you it’s heads-up. How much information did I just give you? Maybe 1 bit? That’s true if you know it’s a fair coin and I flipped it fairly before covering it up with my hand. But what if you put the coin down there yourself a minute ago, heads up, and I just put my hand over it? Then I’ve given you no information at all. The difference is the choice of ‘prior’: that is, what probability distribution you attributed to the coin before I gave you my message.
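The coin story can be put in numbers with a small sketch. It assumes the usual measure of information gained from learning an outcome, namely the base-2 logarithm of one over the prior probability you had assigned to it. The function name is just for illustration.

```python
import math

def surprisal_bits(prior_probability):
    """Bits of information gained on learning an outcome with this prior probability."""
    return math.log2(1.0 / prior_probability)

# Fair coin flipped out of sight: your prior for heads was 1/2.
print(surprisal_bits(0.5))  # 1.0

# You put the coin down heads-up yourself: your prior for heads was 1.
print(surprisal_bits(1.0))  # 0.0
```

Same message, same coin, but a different prior gives a different amount of information.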

My love affair with relative entropy began in college when my friend Bruce Smith and I read Hugh Everett’s thesis, The Relative State Formulation of Quantum Mechanics. This was the origin of what’s now often called the ‘many-worlds interpretation’ of quantum mechanics. But it also has a great introduction to relative entropy. Instead of talking about ‘many worlds’, I wish people would say that Everett explained some of the mysteries of quantum mechanics using the fact that entropy is relative.

Anyway, it’s nice to see relative entropy showing up in biology.

Relative Entropy

Inscribe an equilateral triangle in a circle. Randomly choose a line segment joining two points of this circle. What is the probability that this segment is longer than a side of the triangle?

This puzzle is called Bertrand’s paradox, because different ways of solving it give different answers. To crack the paradox, you need to realize that it’s meaningless to say you’ll “randomly” choose something until you say more about how you’re going to do it.

In other words, you can’t compute the probability of an event until you pick a recipe for computing probabilities. Such a recipe is called a probability measure.
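To see the paradox in action, here is a rough Monte Carlo sketch of three standard recipes for picking a "random" chord of the unit circle. The recipe names, trial count and seed are my own choices for illustration.

```python
import math
import random

SIDE = math.sqrt(3.0)  # side of an equilateral triangle inscribed in the unit circle

def chord_from_endpoints(rng):
    """Pick two independent uniform points on the circle and join them."""
    a = rng.uniform(0.0, 2.0 * math.pi)
    b = rng.uniform(0.0, 2.0 * math.pi)
    return 2.0 * abs(math.sin((a - b) / 2.0))

def chord_from_radial_distance(rng):
    """Pick the chord's distance from the center uniformly in [0, 1)."""
    d = rng.random()
    return 2.0 * math.sqrt(1.0 - d * d)

def chord_from_midpoint(rng):
    """Pick the chord's midpoint uniformly in the disk."""
    u = rng.random()  # squared distance from the center is uniform for such a point
    return 2.0 * math.sqrt(1.0 - u)

def estimate(sampler, trials=100_000, seed=0):
    """Estimate the probability that the sampled chord is longer than SIDE."""
    rng = random.Random(seed)
    hits = sum(sampler(rng) > SIDE for _ in range(trials))
    return hits / trials

for sampler in (chord_from_endpoints, chord_from_radial_distance, chord_from_midpoint):
    print(sampler.__name__, estimate(sampler))
```

The three estimates should land near 1/3, 1/2 and 1/4 respectively: three different probability measures on the same set of chords.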

This applies to computing entropy, too! The formula for entropy clearly involves a probability distribution, even when our set of events is finite:

$$ S = - \sum_i p_i \ln(p_i) $$

But this formula conceals a fact that becomes obvious when our set of events is infinite. Now the sum becomes an integral:

$$ S = - \int_X p(x) \ln(p(x)) \, dx $$

And now it’s clear that this formula makes no sense until we choose the measure $dx.$ On a finite set we have a god-given choice of measure, called counting measure. Integrals with respect to this are just sums. But in general we don’t have such a god-given choice. And even for finite sets, working with counting measure is a choice: we are choosing to believe that in the absence of further evidence, all options are equally likely.

Taking this fact into account, it seems like we need two things to compute entropy: a probability distribution $p(x),$ and a measure $dx.$ That’s on the right track. But an even better way to think of it is this:

$$ S = - \int_X \frac{p(x) \, dx}{dx} \ln \left( \frac{p(x) \, dx}{dx} \right) dx $$

Now we see the entropy depends on two measures: the probability measure we care about, $p(x) \, dx,$ but also the measure $dx.$ Their ratio is important, but that’s not enough: we also need one of these measures to do the integral. Above I used the measure $dx$ to do the integral, but we can also use $p(x) \, dx$ if we write

$$ S = - \int_X \ln \left( \frac{p(x) \, dx}{dx} \right) p(x) \, dx $$

Either way, we are computing the entropy of one measure relative to another. So we might as well admit it, and talk about relative entropy.

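The entropy of the measure $d\mu$ relative to the measure $d\nu$ is defined by:

$$ S(d\mu, d\nu) = - \int_X \frac{d\mu(x)}{d\nu(x)} \ln \left( \frac{d\mu(x)}{d\nu(x)} \right) d\nu(x) = - \int_X \ln \left( \frac{d\mu(x)}{d\nu(x)} \right) d\mu(x) $$

The second formula is simpler, but the first looks more like summing $-p \ln(p),$ so they’re both useful.

Since we’re taking entropy to be lack of information, we can also get rid of the minus sign and define relative information by

$$ I(d\mu, d\nu) = \int_X \frac{d\mu(x)}{d\nu(x)} \ln \left( \frac{d\mu(x)}{d\nu(x)} \right) d\nu(x) = \int_X \ln \left( \frac{d\mu(x)}{d\nu(x)} \right) d\mu(x) $$

If you thought something was randomly distributed according to the probability measure $d\nu,$ but then you discover it’s randomly distributed according to the probability measure $d\mu,$ how much information have you gained? The answer is $I(d\mu, d\nu).$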

For more on relative entropy, read Part 6 of this series. I gave some examples illustrating how it works. Those should convince you that it’s a useful concept.

Okay: now let’s switch back to a more lowbrow approach. In the case of a finite set, we can revert to thinking of our two measures as probability distributions, and write the information gain as

$$ I(q,p) = \sum_i \ln \left( \frac{q_i}{p_i} \right) q_i $$
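For a finite set the information gain is the sum over i of q_i ln(q_i / p_i), and it is easy to compute. The sketch below assumes natural logarithms (so the answer is in nats) and the usual convention that terms with q_i = 0 contribute nothing. The two example distributions are made up.

```python
import math

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i) in nats; terms with q_i = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(relative_information(biased, fair))   # thought 'fair', learned 'biased'
print(relative_information(fair, biased))   # thought 'biased', learned 'fair'
print(relative_information(fair, fair))     # 0.0: nothing learned
```

Note that the two orderings give different answers, so this is not a distance in the usual sense. Also, learning that a fair coin definitely landed heads, `relative_information([1.0, 0.0], fair)`, gives ln 2 nats, which is the 1 bit from the coin story.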

We’ll use this idea to think about how a population gains information about its environment as time goes by, thanks to natural selection. The rest of this post will be an exposition of Theorem 1 in this paper:

• Marc Harper, The replicator equation as an inference dynamic.

Harper says versions of this theorem have previously appeared in work by Ethan Akin, and independently in work by Josef Hofbauer and Karl Sigmund. He also credits others here. An idea this good is rarely noticed by just one person.

The change in relative information

So: consider $n$ different species of replicators. Let $P_i$ be the population of the $i$th species, and assume these populations change according to the replicator equation:

$$ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i $$

where each function $f_i$ depends smoothly on all the populations. And as usual, we let

$$ p_i = \frac{P_i}{\sum_j P_j} $$

be the fraction of replicators in the $i$th species.

Let’s study the relative information $I(q,p)$ where $q$ is some fixed probability distribution. We’ll see something great happens when $q$ is a stable equilibrium solution of the replicator equation. In this case, the relative information can never increase! It can only decrease or stay constant.

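We’ll think about what all this means later. First, let’s see that it’s true! Remember,

$$ I(q,p) = \sum_i \ln \left( \frac{q_i}{p_i} \right) q_i = \sum_i \Big( \ln(q_i) - \ln(p_i) \Big) q_i $$

and only $p_i$ depends on time, not $q_i,$ so

$$ \frac{d}{dt} I(q,p) = - \sum_i \frac{d}{dt} \ln(p_i) \, q_i = - \sum_i \frac{\dot{p}_i}{p_i} \, q_i $$

where $\dot{p}_i$ is the rate of change of the probability $p_i.$ We saw a nice formula for this in Part 9:

$$ \dot{p}_i = \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i $$

where

$$ f_i(P) = f_i(P_1, \dots, P_n) $$

and

$$ \langle f(P) \rangle = \sum_i f_i(P) \, p_i $$

is the mean fitness of the species. So, we get

$$ \frac{d}{dt} I(q,p) = - \sum_i \Big( f_i(P) - \langle f(P) \rangle \Big) \, q_i $$

Nice, but we can fiddle with this expression to get something more enlightening. Remember, the numbers $q_i$ sum to one. So:

$$ \frac{d}{dt} I(q,p) = \langle f(P) \rangle - \sum_i f_i(P) \, q_i = \sum_i f_i(P) \, (p_i - q_i) $$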

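where in the last step I used the definition of the mean fitness. This result looks even cuter if we treat the numbers $f_i(P)$ as the components of a vector $f(P),$ and similarly for the numbers $p_i$ and $q_i.$ Then we can use the dot product of vectors to say

$$ \frac{d}{dt} I(q,p) = f(P) \cdot (p - q) $$

So the relative information $I(q,p)$ can never increase if

$$ f(P) \cdot (p - q) \le 0 $$

for all choices of the population $P.$ This is also the condition for $q$ to be an evolutionarily stable state.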
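The upshot of this calculation, that the time derivative of I(q,p) equals f(P) · (p − q), can be checked numerically. Below is a rough sketch with three species. The fitness functions, populations and the distribution q are arbitrary made-up choices. The left side is a central finite difference, with the populations advanced over a tiny time h while holding each f_i(P) fixed.

```python
import math

def fractions(P):
    """Population fractions p_i = P_i / sum_j P_j."""
    total = sum(P)
    return [x / total for x in P]

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def fitness(P):
    """Hypothetical smooth fitness functions f_i(P) with some interaction."""
    return [1.0 - 0.01 * P[1],
            0.5 + 0.02 * P[0] - 0.01 * P[2],
            0.8 - 0.005 * sum(P)]

def step(P, h):
    """Advance dP_i/dt = f_i(P) P_i by a small time h, freezing f_i(P) over the step."""
    return [Pi * math.exp(fi * h) for Pi, fi in zip(P, fitness(P))]

P = [10.0, 20.0, 30.0]
q = [0.2, 0.3, 0.5]
h = 1e-5

# Left side: finite-difference estimate of d/dt I(q, p(t)).
lhs = (relative_information(q, fractions(step(P, h)))
       - relative_information(q, fractions(step(P, -h)))) / (2.0 * h)

# Right side: the dot product f(P) . (p - q).
p = fractions(P)
rhs = sum(fi * (pi - qi) for fi, pi, qi in zip(fitness(P), p, q))

print(lhs, rhs)
```

The two numbers should agree to many decimal places. Nothing here requires q to be special: the identity holds for any fixed q, and only the sign of the right side depends on q being well chosen.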

A population is said to be in an evolutionarily stable state if its genetic composition is restored by selection after a disturbance, provided the disturbance is not too large.

I will explain the math next time—I need to straighten out some things in my mind first. But the basic idea is compelling: an evolutionarily stable state is like a situation where our replicators ‘know all there is to know’ about the environment and each other. In any other state, the population has ‘something left to learn’—and the amount left to learn is the relative information we’ve been talking about! But as time goes on, the information still left to learn decreases!
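To watch the ‘information left to learn’ shrink, here is a small simulation sketch. It assumes a hawk-dove game where fitness depends only on the fractions, f(p) = A p, with made-up payoffs (V = 2, C = 4). For this matrix the state q = (1/2, 1/2) satisfies f(p) · (p − q) = −2 (p_1 − 1/2)² ≤ 0. The time stepping is plain forward Euler applied to the equation for the fractions, which is crude but enough for illustration.

```python
import math

A = [[-1.0, 2.0],
     [0.0, 1.0]]   # hypothetical hawk-dove payoff matrix
q = [0.5, 0.5]     # candidate evolutionarily stable state

def fitness(p):
    """f(p) = A p."""
    return [sum(A[i][j] * p[j] for j in range(len(p))) for i in range(len(p))]

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def euler_step(p, dt):
    """One Euler step of dp_i/dt = (f_i - <f>) p_i."""
    f = fitness(p)
    mean_fitness = sum(fi * pi for fi, pi in zip(f, p))
    return [pi + dt * (fi - mean_fitness) * pi for fi, pi in zip(f, p)]

def information_along_trajectory(p, dt=0.01, steps=2000):
    """Record I(q, p(t)) at every step of the simulated trajectory."""
    values = [relative_information(q, p)]
    for _ in range(steps):
        p = euler_step(p, dt)
        values.append(relative_information(q, p))
    return values

values = information_along_trajectory([0.9, 0.1])
print(values[0], values[len(values) // 2], values[-1])
```

Starting from 90% hawks, the recorded values fall steadily toward zero as the population approaches q.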

Note: in the real world, nature has never found an evolutionarily stable state… except sometimes approximately, on sufficiently short time scales, in sufficiently small regions. So we are still talking about an idealization of reality! But that’s okay, as long as we know it.

I suppose ‘relative entropy’ could be a nice name for the fact that when I put away the dishes and kitchen gadgets, my wife doesn’t know where they are, and vice versa. This is not just a pun, it’s an actual example of the concept! What looks random to one person may be orderly to another.

If I’m beginning to understand the lowbrow approach, I can visualize the two-species case as a landscape height I which is a function of the two population sizes. Stable config q is at a local minimum of I, and as p evolves toward q it always moves downhill (perhaps circuitously) on I’s landscape.

Is this right as far as it goes? If so I am curious whether, given the functions f_i, I is the *only* landscape having this spiffy property, and also (these Qs may be related come to think of it) whether p always follows the steepest path down I’s landscape.

The trajectory is not always the steepest path. It is possible to show that the replicator equation is a gradient with respect to the Fisher information metric / Shahshahani metric if and only if the fitness landscape is a Euclidean gradient. I believe that Kimura was also aware of this before the geometric approach, and phrased the idea along the lines of “in the direction of unit variance” or something similar. Nevertheless it is not necessary for the trajectories to be gradients for the relative entropy to be monotonically decreasing.

The quintessential example is when the fitness landscape is of the form f(x) = A x for a symmetric matrix A. This is a model of a diploid (two of each chromosome) gene locus for a well-mixed population, where the replicators are the different possible alleles at the locus (and A is naturally symmetric since there is no order to the two genes a diploid organism actually has). This is also the equation that Fisher and others first studied before the replicator equation appeared as a generalization (due to Taylor and Jonker in ~1978). If A is not symmetric (e.g. for cyclical rock-paper-scissors interactions), the trajectory is not a gradient, but may still converge as described by John above.

The relative entropy is not the only function to have this property — simple transformations of I (adding constants, exponentiating, etc.) will have the same property. For instance, Bomze was the first (I believe, it can be hard to track down all the original sources) to prove such results in this case and used the cross entropy between p and q rather than the relative entropy (which are equal up to a constant in q, so the derivative is the same if q is constant). However, the metric is unique with respect to this property for the replicator equation, and the relative entropy localizes to this metric.

To answer your first question, it is possible to have two interacting populations p and q (with their own landscapes f and g). In this case if there is a stable configuration (p’, q’), it is the sum of the relative entropies I(p’, p) + I(q’,q) that has the desired properties. So what you are saying is basically correct, just be careful not to abuse the term landscape for both the fitness landscape f and the relative entropy I.

I am guessing that since the trajectories are not in general gradients, it may be harder to identify “watershed” boundaries (hope I’m not hijacking yet another technical term here:-) than it is for – well, for real watersheds.

Speaking of many worlds, in reading this I feel as if I am looking into a parallel universe where many things are the same but certain things are slightly off. The model you describe is very similar to the work I’ve read in ML [machine learning] papers. The difference of focus seems to be that your write up focuses on characterizing the distributions and how they update, while the ML work, based on much the same assumptions, focuses on characterizing the class of functions learnable by such a model.

For example, on a cursory reading what you call mean fitness looks like the empirical performance in Valiant’s model. And the learning class most like evolution is correlational statistical queries – which are described as performing queries for the correlation between a known function f and an unknown g, relative to some distribution.

In machine learning, relative entropy goes by the name Kullback-Leibler divergence. And it is the basic workhorse of all inference algorithms. Most algorithms can be written in terms of minimizing the divergence between the target distribution and your approximation of it.

What the model leaves open is whether evolution can learn independently of the distribution. Feldman suggests that some things may be at least weakly learnable and wonders if there are any corollaries to boosting in nature. Boosting is an awesome technique for taking a bunch of weak learners and making a stronger one; it can also be understood as a way to do well with weak computational resources in a zero-sum game against an adversary, or nature.

Deen Abiola – Thanks for the view from the machine learning side of the aisle! I’m a mathematical physicist by training who has decided to start working across fields. I think there’s a lot to be gained by combining ideas from machine learning, evolutionary game theory and also the theory of ‘chemical reaction networks’, which computer scientists call ‘stochastic Petri nets’. But, I’m a raw amateur at most of these subjects, so I can use all the help I can get.

Translating terms between fields is always important in this game, so thanks for the tips. As for ‘Kullback-Leibler divergence’, lots of people in all fields use that term, but I think it’s a disgustingly obscure name for such an important concept, so (like a bunch of other people) I use the term ‘relative information’, or for the negative, ‘relative entropy’. Another commonly used term is ‘information gain’.

Is the replicator equation a good model of what goes on in evolution such that the results you write about are applicable to nature? Or is it something like an economic model, indicative in aggregate but not generally useful due to inability to capture nuance?

Note: in the real world, nature has never found an evolutionarily stable state…

Is the replicator equation a good model of what goes on in evolution such that the results you write about are applicable to nature?

I’m no expert on mathematical biology, but I’ve never seen any model in that subject that I’d call “a good model such that the results you write about are applicable to nature”… without a lot of caveats. In biology it’s usually hard to get enough data to tell how good the models really are. And for any model, you can easily think of important effects it neglects. So they should always be taken with an enormous grain of salt.

One important effect the replicator equation neglects is innovation: the ability for the population of a species to start at zero and rise to a nonzero value.

Of course, better models have been studied that take this crucial phenomenon into account! In my dreams, I’ll eventually get around to talking about these.

Note: in the real world, nature has never found an evolutionarily stable state… except sometimes approximately, on sufficiently short time scales, in sufficiently small regions.

Deen wrote:

Why do humans not qualify as a good approximation of this?

I guess they are if you’re willing to work on a short enough time scale, in a sufficiently small region, and are more interested in how humans stay the same than how they’re evolving. (That sounds sarcastic, but it actually is interesting to understand what makes populations stay the same.)

If you look at a map of the world, you’ll find that in Europe most adults can digest milk, while in Asia most adults are lactose intolerant:

This is the result of a mutation that happened in the Mideast about 5000 years ago, and spread in societies that raised cows.

If you’re studying phenomena like this, assuming an evolutionarily stable state would be fatal. For other purposes it might be fine.

I think it’s fair to say that nature sometimes settles into evolutionarily stable states, with the caveat that they are short lived in isolated / regular environments, at least until the next major environmental change or until the next clever mutation opens up a new dimension of competition. Note that “evolutionarily stable” really means “selectively stable” in many contexts. There is always a background of mutation (natural diversification I like to call it) that selection acts on, and it’s not too hard to imagine that there are prolonged periods in which no significant mutations occur (a la the neutral theory of evolution). During these periods the population dynamics are probably similar to what the replicator equation describes, up to stochastic noise based on the size of the population, and other random factors not captured by the model.

Also consider that some parts of the genome are very similar across organisms and have been conserved for many millions of years (such as heat-shock proteins that cells use to adapt to sudden thermodynamic changes in the environment). If we restrict our attention to a sufficiently small subspace of the adaptive landscape, it would certainly seem that there is a long term and stable solution. In other words, while there is variation in other dimensions, in this subspace things are rather boring: the solution was found long ago and no one seems to gain by deviating from it. This sounds rather like the definition of evolutionary stability to me, or as close as one can hope to reach in nature. It is easy to imagine analogous situations, even for humans.

The vanilla replicator equation is a model of natural selection (a terrible term, by the way, natural proliferation would be much better, or natural preservation as Darwin suggested), not of evolution, so it’s more “selection-in-a-box” than “realistic and general model of evolution”. Selection-in-a-box can lead to evolutionarily stable states, so we just have to be creative in identifying such microcosms in nature. In the long run of course there is no global stability as evolution has no goal or pinnacle and environments always change.

Any insight into the conditions under which change is promoted? I am currently working with a certain genetic-programming algo for machine learning, and have noticed that population fitness has a ‘flight of the swallow’ shape: rapid improvement for a while settling to a plateau (possibly a very long one), until some dramatic change is discovered, which promotes a burst of improvement that completely obsoletes the species that dominated the previous plateau.

From the “faster algorithm” (aka “continued employment”) point of view, I, of course, want to avoid getting stuck on plateaus, and find “the next new thing”. Yet, I despair; sometimes I get the feeling that the algorithm is a blind man, searching for the hole in a golf-course green (and to mangle metaphors, that hole then is the rabbit-hole that leads to a whole new golf course…)

Yes, I can give you some insights into the conditions in which changes occur. The first is when the boundary is circumvented and a new type enters the population. Now there is suddenly another dimension to explore and this can lead to new configurations that dominate others. Note that this is explicitly ruled out by selection-in-a-box models like the replicator equation (no innovation, assumption of all known types).

Another is a dramatic change in the environment, say due to a disaster or due to some permanent change in a particular environmental property (like rising temperature). For instance, sometimes bacteria in extreme conditions (like oceanic vents) develop different versions of certain proteins that would normally not operate at the higher temperatures. These proteins can turn out to be more efficient in the ancestral environment as an unintended consequence. In this case the alternative selection pressure may have led to a more efficient version of the protein by taking a different path that was unlikely in the original environment.

From a practical modeling perspective, sometimes escaping a local optimum requires tunneling through a less fit type to get to a more fit type that is multiple steps of variation away from the local optimum. If selection is too strong or variation too low then this may not happen. Some researchers have found that bacteria can have increased mutation rates when under intense selection. If there is an interaction component to the algorithm and it depends on the population distribution, it may just be a matter of having a particular variation achieve the critical threshold in the population.

What you describe seems like it is very much the case for some real populations. There’s no consistent increase in fitness (though many people misinterpret Fisher’s Fundamental theorem to say so), rather there are occasional breakthroughs that occur because of drift or happenstance.

How could anyone resist a post called “Information Geometry, Part 11”?

Anyway, this blog post by Sean Carroll is a good nontechnical summary of what I was trying to say. He makes it sound as if there always exists a stable equilibrium state, which ain’t true, but oh well. More importantly, I wish he’d mentioned Marc Harper.

Anyone who does decide to read this should ignore the link above to some horrible aggregator site called “sciencenewsx.com” which puts up a non-removable floating gadget obscuring the text that begs you to “like” the post on various social media.

I’ll change the link to the one you suggested—thanks! I have complete dictatorial control over all backtracks appearing on this blog. For example I was the one who added the words ‘Sean Carroll’; the original aggregator didn’t bother to mention the post was written by him!

While the model is interesting, it is way too simplistic to be of much use in biology. For one, it doesn’t capture evolution at all if the fitness function cannot change with time. It doesn’t capture spatial dependence as mentioned, and it doesn’t capture dependence on the non-living environment.

I guess one could take most of those things into account by making fitness a function of time, space, environment and the number of all other species, but such a model would probably be intractable. And it would still not support species divergence.

And even if one had a model capturing everything mentioned it would still be of limited use since it would be exceedingly hard to get data to actually do any reliable modeling of even simple natural populations.

There is also a problem of randomness. A single random event can often significantly alter the fitness of many populations. For example a virus acquiring an ability to transfer between species, or a bacterium mutating to become resistant to its host’s immune defenses or to antibiotics.

Maybe you misunderstand why I’m interested in models in biology. Trying to use them to make quantitative predictions about specific situations is the last thing on my mind. My goal (and it’s not just mine) is to gain insight into which processes can cause which effects.

For this, the point of a model is not to believe the model is right and then believe what it says. The point is to say “if this were all that were going on, what would happen?” Mathematical biologists have made a lot of progress in understanding things this way.

In certain carefully limited regimes, models can also give quantitative results you can trust. But to delimit such a regime, you need to know biology very well, you need to spend years studying that specific regime (say: a population of HIV viruses in a patient taking different drugs), and you need to do lots of experiments. That’s obviously not where I’m going to make a contribution. I’m going to make a contribution by taking theoretical ideas from different subjects and understanding how they fit together better than anyone has before—mainly by virtue of the fact that nobody has bothered to try before.

I expect there are several hundred papers studying the replicator equation. It captures many interesting effects that are seen in biology. People have also written about lots of models that go beyond the replicator equation. Since I’m just learning this subject, I want to start near the beginning. But I want to focus on ideas related to information geometry and stochastic Petri nets, because they’re connected to evolution in interesting ways.

The replicator equation contains within it, secretly, the ability to model spatial dependence:

It’s easy to throw explicit time-dependence into the fitness functions if you want. So as far as I can tell, some of its main conceptual limitations are:

1) the absence of innovation, meaning the ability of new types of replicators to appear starting from zero population,

2) its treatment of populations as continuous rather than discrete,

3) its treatment of the dynamics as deterministic rather than stochastic.

I know a nice way to deal with all these at once, but I need to keep some suspense up so people keep reading these blog posts, so I won’t describe it yet! (Anyone who’s been paying attention will already know, but please don’t tell.)

The fun part will be to see the role of entropy in these more general models. It’s a robust concept, so it probably won’t just shrivel up and die.

Someone above asked about use in biology so as a biologist (though molecular) I provided my 2c.

Yes, I know even simple models can be used to learn new things about the modeled phenomena though (as a resident skeptic ;) I find it a bit hard to imagine what novel things one could learn about the underlying biology by studying such a simple replicator equation.

Could be just my limited imagination but since you say you “expect there are several hundred papers studying the replicator equation,” perhaps you or someone else could point out one containing an example of such an insight.

That’s not to say that the model is not interesting from mathematical or (as someone mentioned) machine learning POV, only that I find its utility for actual biology rather low.

Since it’s late I won’t now dig up links to some papers applying the replicator equation to issues evolutionary and population biologists care about. (Even better, maybe Marc Harper or someone will do this while I’m asleep.) Right now I’ll address this:

such a simple replicator equation

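In fact this equation

$$ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i $$

is far from simple. It’s really the general first-order time-independent ordinary differential equation in many variables! To see this, make the change of variables

$$ P_i = \exp(x_i) $$

and get

$$ \frac{d x_i}{d t} = f_i(e^{x_1}, \dots, e^{x_n}) $$

If we write

$$ g_i(x_1, \dots, x_n) = f_i(e^{x_1}, \dots, e^{x_n}) $$

this becomes

$$ \frac{d x_i}{d t} = g_i(x_1, \dots, x_n) $$

or via the miracle of vector notation

$$ \frac{d x}{d t} = g(x) $$

where $g$ can be any smooth function of $n$ real variables, since $f$ was any smooth function of $n$ positive real variables.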

And since any time-independent ordinary differential equation can be written as a system of 1st-order ones, the replicator equation is secretly the fully general time-independent ordinary differential equation!
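If you don’t trust the change of variables P_i = exp(x_i), here is a quick numerical sanity check. It integrates the replicator equation in the variables P and the transformed equation dx/dt = g(x) in the variables x, then compares P with exp(x) at the end. The fitness functions are a made-up predator-prey style example, and the integrator is a hand-rolled fourth-order Runge-Kutta, so treat this as a sketch rather than production numerics.

```python
import math

def f(P):
    """Hypothetical fitness functions of two positive populations."""
    return [1.0 - 0.1 * P[1], -1.0 + 0.1 * P[0]]

def replicator_rhs(P):
    """Right side of dP_i/dt = f_i(P) P_i."""
    return [fi * Pi for fi, Pi in zip(f(P), P)]

def g(x):
    """Right side of dx_i/dt = g_i(x), where g_i(x) = f_i(exp(x_1), ..., exp(x_n))."""
    return f([math.exp(xi) for xi in x])

def rk4(rhs, y, dt, steps):
    """Integrate dy/dt = rhs(y) with the classical fourth-order Runge-Kutta method."""
    for _ in range(steps):
        k1 = rhs(y)
        k2 = rhs([yi + 0.5 * dt * k for yi, k in zip(y, k1)])
        k3 = rhs([yi + 0.5 * dt * k for yi, k in zip(y, k2)])
        k4 = rhs([yi + dt * k for yi, k in zip(y, k3)])
        y = [yi + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
    return y

P0 = [5.0, 8.0]
P_final = rk4(replicator_rhs, P0, 0.001, 2000)
x_final = rk4(g, [math.log(Pi) for Pi in P0], 0.001, 2000)

print(P_final)
print([math.exp(xi) for xi in x_final])
```

The two printed lists should agree up to the small error of the integrator, since both describe the same trajectory in different coordinates.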

So your complaint should not be that it’s too simple. On the contrary, your complaint should be that it’s so general that it contains every possible complication that’s possible in the subject of time-independent ordinary differential equations!

But that makes it all the more impressive that Marc Harper stated a simple theorem linking it to the concept of entropy.

And of course people who work on the replicator equation in biology look at particular examples, and classes of examples, inspired by biology… and they ask questions motivated by biology. They aren’t just doing general-purpose differential equation theory.

(this is a reply to John’s comment about simplicity, can’t reply directly)

The equation may be general but it’s still simple in my book, it’s very easy to describe or understand the thinking behind it (think Kolmogorov complexity).

Compare it to the Standard Model Lagrangian for example, now that is a complex equation.

But if you think about it, SM only deals with interactions and transformation of a couple of particle species, and here we are trying to model evolution, a process that is many, many orders more complex. So the equation is also very simple when compared to the full complexity of the process it attempts to model.

I think I understand what John is trying to do. In the field of ecological diversity, there are two competing approaches. The traditional one is of understanding via biological mechanism. This of course takes large amounts of knowledge, collected over years of research. The alternative is finding patterns in the underlying process, without much thought to the mechanisms. Many traditionalists do not like the latter because it looks as if the statisticians are trying to short-circuit the years of work into establishing mechanisms. Yet, it appears that these mechanism-free probability and statistics approaches such as MaxEnt are very useful and practical.

Hello, I wonder if you have considered the following. In open systems, assuming for a moment a directionality to time, change will take place and that change will always be towards a state of increased entropy. At a meta level entropy could be considered as a “force” or “requirement” or perhaps an essential corollary of change in an open system. If one accepts this then I suggest that life, and indeed the whole evolution of the universe, is driven by entropy. Returning to life for a moment. All of the steps in evolution occurred as a result of chemical reactions in which the end state had a greater entropy than the starting state. But the products, life forms, also accelerated increases in entropy, more than their individual component atoms could have. In other words I suggest life arose because it catalyzed increases in entropy.

I suggest that the reason humans eventually came into being is that, to date, they are the most efficient catalyzers of increased entropy. Consider nuclear fission, global warming, etc. Gram for gram, a human being will increase the entropy of the universe more than if all his component atoms (or quarks) were scattered throughout the cosmos. A key test of this idea would be to determine whether life forms, as they evolve, produce a greater net increase in entropy per gram than their less evolved precursors: human > bonobo > crocodile > D. melanogaster > C. difficile.

As is probably painfully obvious from the foregoing, I am not a physicist, however I cannot see the logical fault in my reasoning. I realize entropy is not a force and has the dimensions of energy over temperature but its increase, and the idea that life catalyzes that process, does seem to offer some attraction as a “simple” solution to the origin of life and as to why it will not only have evolved elsewhere in the universe but will also evolve toward technical sophistication as a way of maximizing that catalysis.

This blog was recommended to me because I am trying to get the American Journal of Physics to retract an article titled “Entropy and evolution” by Daniel Styer published in Nov 2008. It contains an equation using the Boltzmann constant and estimates of the probabilities of organisms to calculate the entropy change of the biosphere to show the second law of thermodynamics is not violated by evolution. I consider the equation to be absurd. Organisms don’t have a temperature or an entropy. What follows is a link to the article. If it doesn’t work I can email it to you:


How does this work with replicator equations that have oscillatory transients? For example, consider

and

with

and

.

Now there is an evolutionarily stable state at which all trajectories tend toward in an oscillatory manner. Because of these oscillations, the proportions of each species must pass through the proportions of the evolutionarily stable state, and therefore the relative entropy (relative to the ESS) also oscillates.

It is true that the relative entropy decreases on each cycle (i.e. if you take a Poincaré section) but it certainly doesn’t decrease monotonically. I suspect my system fails to satisfy some important criterion, but I can’t see which one.

Thanks for proposing this example! I’m not in the mood for calculations right now, but did you check to see if your proposed evolutionarily stable state really is one according to the technical definition I gave? A list of populations is an evolutionarily stable state if

for all populations , where and are the corresponding lists of probabilities:

People working on the replicator equation have (what seems to my naive eyes to be) a disturbing tendency to describe the dynamics in terms of the probabilities rather than the populations. This would be very confusing in examples where the probabilities at one time aren’t enough to determine the probabilities at a future time because the total populations also affect the answer. You seem to be claiming this is going on in your example.

But in my writeup I tried to avoid making any sloppy mistakes along these lines. So, maybe the problem is that your proposed evolutionarily stable state doesn’t really obey that inequality up there.
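A check of this kind can be sketched numerically. The snippet below is only an illustration under stated assumptions: it takes the evolutionarily stable state condition in the form "the sum over i of (p_i − q_i) f_i is at most zero for every p", which is the form that makes the relative information nonincreasing, and it uses a hypothetical two-species example (Hawk-Dove payoffs) whose fitness depends only on the probabilities. The payoff matrix `A`, the candidate state `q`, and all function names are made up for the example; none of them come from the thread.

```python
import math

# Hypothetical 2-species example (Hawk-Dove payoffs); not taken from the thread.
A = [[-1.0, 2.0],
     [0.0, 1.0]]
q = [0.5, 0.5]  # candidate evolutionarily stable state for this payoff matrix


def fitness(p):
    """Fitness of each species, here a linear function of the probabilities."""
    n = len(p)
    return [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]


def ess_gap(p, q):
    """sum_i (p_i - q_i) f_i(p); the assumed ESS condition says this is <= 0."""
    f = fitness(p)
    return sum((pi - qi) * fi for pi, qi, fi in zip(p, q, f))


def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def replicator_step(p, dt):
    """One forward-Euler step of dp_i/dt = p_i (f_i - mean fitness)."""
    f = fitness(p)
    mean_f = sum(pi * fi for pi, fi in zip(p, f))
    new_p = [pi + dt * pi * (fi - mean_f) for pi, fi in zip(p, f)]
    total = sum(new_p)  # renormalize to absorb floating-point drift
    return [x / total for x in new_p]


def run(p0, steps=2000, dt=0.01):
    """Integrate from p0 and record the relative information at every step."""
    p = list(p0)
    history = [relative_information(q, p)]
    for _ in range(steps):
        p = replicator_step(p, dt)
        history.append(relative_information(q, p))
    return p, history


# Check the inequality on a grid of probability distributions.
grid_ok = all(ess_gap([x / 100, 1 - x / 100], q) <= 1e-12 for x in range(1, 100))

# Check that the relative information never increases along one trajectory.
p_final, info = run([0.9, 0.1])
monotone = all(b <= a + 1e-12 for a, b in zip(info, info[1:]))
print(grid_ok, monotone, p_final)
```

A grid check like this cannot prove the inequality, but a single violation is enough to show that a proposed state is not evolutionarily stable in this sense.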

You’re right, it’s not an evolutionarily stable state. I had assumed that the definition of ESS coincided with the notion of an asymptotically stable state, but that only appears to be true sometimes (in particular, ESS => ASS, but ASS => ESS only for symmetric games). Interesting! I will need to think on this some more.

I think it is more common to use Lotka-Volterra or variants if one wants to track the population size. The replicator equation, by virtue of normalization, removes a degree of freedom. I won’t go into the details, but information geometers talk about the denormalization of the simplex, the positive orthant of Euclidean space (as did Shahshahani in his 1979 paper with a particular metric).

Lyapunov stability implies Nash equilibrium; strict Nash equilibrium implies asymptotic stability. It is possible to have a connected set of asymptotically stable points for the replicator dynamic yet for all the points in the set to be stationary. So the set is asymptotically stable but no particular point in it is (i.e. they are Nash equilibria but not strict Nash equilibria).

If you are interested in such matters, see “Evolutionary Game Theory” by Weibull (MIT Press). There is an example of the above on p107 (Figure 3.8) (and a similar example in Hofbauer and Sigmund’s “Evolutionary Games and Population Dynamics”).

Thanks, both of you, for clarifying some things! For some weird reason I’m only seeing the last two comments now. I’ve been distracted, but I didn’t think I was that distracted!

Marc wrote:

The replicator equation, by virtue of normalization, removes a degree of freedom.

Right, but what perturbs me is that the dynamics depends only on these normalized populations

just in the special case where the fitness functions are homogeneous functions of the population vector , i.e., when

This is true in some famous examples, but not the example Chris Taylor was looking at. In these other cases the normalized populations at time zero may not determine the normalized populations at later times, which is a bit like trying to do physics with position but not momentum.

In other words, you may not ‘want’ to track the population size, but you usually need to, to know what’s going to happen. I’m sure everyone who cares already knows this, but it makes me nervous when people seem to be ignoring this point.
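To spell out why homogeneity matters, here is a short worked derivation. It assumes the populations obey a growth law of the form dP_i/dt = f_i(P) P_i, and it reads "homogeneous" as homogeneous of degree zero; both are my assumptions about the setup rather than quotations from the post.

```latex
% Assumptions: dP_i/dt = f_i(P) P_i and p_i = P_i / \sum_j P_j.
\begin{align*}
\frac{dp_i}{dt}
  &= \frac{1}{\sum_j P_j}\,\frac{dP_i}{dt}
     - \frac{P_i}{\bigl(\sum_j P_j\bigr)^2} \sum_k \frac{dP_k}{dt} \\
  &= f_i(P)\, p_i - p_i \sum_k f_k(P)\, p_k \\
  &= \Bigl( f_i(P) - \sum_k f_k(P)\, p_k \Bigr)\, p_i .
\end{align*}
% If f_i(cP) = f_i(P) for all c > 0, then f_i(P) = f_i(p),
% so the right-hand side depends on p alone and the dynamics of p is closed.
% Otherwise the total population \sum_j P_j still enters through f_i(P).
```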

What John says is of course completely correct. In many cases the replicator equation and Lotka-Volterra equations can be transformed into one another without loss, but not for arbitrary fitness landscapes. There are drawbacks to both approaches. By normalizing we are essentially letting the population be infinite, since we assume the population components are differentiable and so vary continuously. In the case of absolute population sizes, we allow fractional numbers of individuals (again continuous variation), which is also unrealistic. Most people seem to take an “it’s not such a big deal” perspective, but when one type falls down to 0.1 of an individual and then rebounds to dominate the population, it’s fairly suspect as a realistic effect.

Just as no MIS manager got fired for buying IBM PCs (even though better computers were available), one cannot fault anyone for applying Shannon’s notion of entropy in all sorts of fields where it has no particular applicability over other notions of entropy or information content. Shannon’s notion comes into its own in matters of coding and communication as evidenced by its interpretation using the Noiseless Coding Theorem. But there is another more fundamental notion of entropy or information content that comes out of logic (partition logic that is, the dual to ordinary subset logic) and was even originally developed to measure biodiversity where it is called the Gini-Simpson index. See: http://www.mathblog.ellerman.org/2010/02/history-of-the-logical-entropy-formula/ for the history of this simple non-sexy formula in biodiversity, cryptography, economics, and logic. It is the special case of Rao’s quadratic entropy (defined in terms of a distance function) where one uses the logical distance function (the logical distance between two elements of a set is the complement of the Kronecker delta “closeness” function). The basic nature of the logical entropy formula is even more obvious in the quantum version (1-tr[rho squared]) where it provides the interpretation of the off-diagonal terms in a pure state density matrix and directly measures the information gained in a measurement.

Hi, David! Ultimately the use of different measures of entropy or biodiversity should be determined not by their ‘sexiness’ but by their mathematical properties. In biodiversity it’s common to consider the whole range of Rényi entropies

$latex H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_i p_i^\beta $

for $latex 0 \le \beta < \infty$, or even the limiting case $latex \beta = \infty$. The reason is that these have good mathematical properties. As $latex \beta$ becomes small they pay more attention to rare events, or rare species. As you probably know, we can sneak up on $latex \beta = 1$ by taking a limit, and then we get the Shannon entropy

$latex H_1(p) = - \sum_i p_i \ln p_i $

On the other hand, $latex \beta = 2$ gives the collision entropy

$latex H_2(p) = - \ln \sum_i p_i^2 $

In biodiversity, Lou Jost has convinced all right-thinking people to exponentiate the Rényi entropies and get the ‘Hill numbers’ or ‘diversities’

$latex D_\beta(p) = \exp(H_\beta(p)) = \left( \sum_i p_i^\beta \right)^{\frac{1}{1 - \beta}} $

I recall you liked this idea too.

But now let me get to my actual point: the theorem I’m discussing today seems to work only for Shannon entropy! I haven’t carefully checked it, but I did some calculations to see if the relative Rényi entropy decreases for orders other than 1, and I couldn’t get this to follow from the assumption that we’re computing the entropy of an evolutionarily stable state relative to the current state, which evolves according to the replicator equation. I’d love to be shown wrong, and someday I’ll try to settle the matter definitively, but it didn’t seem to be working.

So: while I’m always eager to learn nice properties of different Rényi entropies, or Hill numbers, there may be times when one particular entropy measure is the only one that makes a theorem work… and that will trump any sort of ‘taste’ or ‘sexiness’ considerations.
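For readers who want to experiment with these quantities, here is a minimal sketch of the Rényi entropies and Hill numbers using the standard definitions with natural logarithms. The function names and the sample distribution are illustrative choices of mine, and the order parameter is called `beta` only for the sake of the example.

```python
import math


def renyi_entropy(p, beta):
    """Rényi entropy of order beta (natural log), including the usual limits."""
    p = [x for x in p if x > 0]
    if beta == 1:
        # Limit beta -> 1: Shannon entropy.
        return -sum(x * math.log(x) for x in p)
    if math.isinf(beta):
        # Limit beta -> infinity: only the most common species matters.
        return -math.log(max(p))
    return math.log(sum(x ** beta for x in p)) / (1 - beta)


def hill_number(p, beta):
    """Exponential of the Rényi entropy: the 'effective number of species'."""
    return math.exp(renyi_entropy(p, beta))


p = [0.5, 0.25, 0.25]
for beta in (0, 0.5, 1, 2, math.inf):
    print(beta, renyi_entropy(p, beta), hill_number(p, beta))
```

At `beta = 0` the Hill number just counts the species present, and as `beta` grows the rare species matter less and less, so the diversity can only go down.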

[…] Last time we saw that if a population evolves toward an ‘evolutionarily stable state’, then the amount of information our population has ‘left to learn’ can never increase! It must always decrease or stay the same. […]

Suppose that a thermodynamical system is in a state , which is a probability distribution over its space of pure states. (You can as well take the system to be quantum and to be its density matrix.) Suppose further that the system has a Hamiltonian and temperature , so that the equilibrium state is .

Then the free energy of is given by . Splitting this up into sums over the system’s configurations gives

,

which is, up to the factor , just the relative entropy of with respect to !

Doing a web search reveals that this is quite well-known. I learned this yesterday from Paul Skrzypczyk.
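The relation between free energy and relative entropy is easy to test numerically. The sketch below sets Boltzmann’s constant to 1 and measures free energy relative to its equilibrium value, so the identity being checked is F(ρ) − F(ρ_eq) = T · D(ρ ‖ ρ_eq); that normalization is my choice and may differ from the one in the comment by an additive constant. The energy levels, temperature, and state are arbitrary made-up numbers.

```python
import math


def gibbs_state(energies, T):
    """Equilibrium distribution exp(-E/T)/Z, with Boltzmann's constant set to 1."""
    weights = [math.exp(-E / T) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights], Z


def free_energy(rho, energies, T):
    """F = <H> - T S for a probability distribution rho over the energy levels."""
    mean_energy = sum(r * E for r, E in zip(rho, energies))
    entropy = -sum(r * math.log(r) for r in rho if r > 0)
    return mean_energy - T * entropy


def relative_entropy(rho, sigma):
    """D(rho || sigma) = sum_i rho_i ln(rho_i / sigma_i)."""
    return sum(r * math.log(r / s) for r, s in zip(rho, sigma) if r > 0)


energies = [0.0, 1.0, 2.5]   # arbitrary example energy levels
T = 0.7                      # arbitrary example temperature
rho_eq, Z = gibbs_state(energies, T)
rho = [0.2, 0.5, 0.3]        # arbitrary non-equilibrium state

lhs = free_energy(rho, energies, T) - free_energy(rho_eq, energies, T)
rhs = T * relative_entropy(rho, rho_eq)
print(lhs, rhs)
```

Since relative entropy is never negative, the same identity shows that the equilibrium state minimizes free energy at fixed temperature.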

Yes, I discovered this while preparing my talk for the Mathematics of Biodiversity conference.

I had all the right ingredients in front of my face:

1) In the beginning of the talk, I explained the idea that any probability distribution is a state of thermal equilibrium for your favorite temperature and some Hamiltonian (which is unique up to a constant):

This then allows us to make the probability distribution temperature-dependent:

where the partition function serves to make the probabilities sum to one:

2) Then I explained the concept of free energy, emphasizing that thanks to the above ideas, any probability distribution gives rise to temperature-dependent concepts of entropy and free energy, whose formulas I wrote down.

3) Then I explained that according to the replicator equation, relative entropy always increases, mentioning that this was a version of the second law of thermodynamics.

But of course the usual second law of thermodynamics says that free energy always decreases for a closed system whose energy is conserved!

So I did a calculation and noticed that relative entropy is ‘just’ the negative of free energy—or if you prefer, free energy is ‘just’ the negative of relative entropy. This calculation is essentially the same as the one you just showed us.

But I didn’t talk about this. Why not? Because there was something confusing me, which is in fact still confusing me. Maybe someone here can help me out.

In this blog post, I gave the proof that given some assumptions, the relative information is always decreasing. But information of what relative to what?

In my notation it’s the information of the equilibrium probability distribution relative to the time-dependent probability distribution :

Of course entropy has a minus sign, so we can also say relative entropy is always increasing. This is the entropy of the equilibrium probability distribution relative to the time-dependent probability distribution :

But free energy seems to be the other way around: up to a factor, it’s the entropy of the time-dependent probability distribution relative to the equilibrium probability distribution !

Thanks for the informative reply, John! After rereading the main post, I realize that I should have said ‘relative information’ rather than ‘relative entropy’ in order to get the sign conventions consistent with your terminology.

So I did a calculation and noticed that relative entropy is ‘just’ the negative of free energy—or if you prefer, free energy is ‘just’ the negative of relative entropy.

Either way is fine!

I think the in your last equation should be .

Am I making a dumb mistake?

If so, then I failed to notice it while following your reasoning.

Are there examples where the relative information of the time-dependent distribution with respect to the equilibrium distribution is not monotonically decreasing?

Whoops, I’ll fix that. There’s enough confusion about the roles of and without that mistake!

Are there examples where the relative information of the time-dependent distribution with respect to the equilibrium distribution is not monotonically decreasing?

I don’t know any.

This is why I’m so confused. On the one hand, since is essentially ‘free energy’, the laws of thermodynamics say this quantity should decrease in many situations. On the other hand, I haven’t found an easy proof that it decreases starting from the replicator equation and some other reasonable assumptions… while the proof that decreases is strikingly simple. I guess I should try a bit harder!

By the way, while I’ve been very distracted from our latest entropy project, one reason I’ve been distracted is because I’m talking to Jamie Vicary. And luckily, Jamie has some ideas on category theory and Bayesian networks that are closely related to our work! I haven’t quite fit them together, but I think it will be possible.

Here’s a little puzzle that’s very relevant. The category FinSet sits in a bigger category FinStoch whose objects are finite sets and whose morphisms are ‘stochastic maps’, where a stochastic map $latex f: X \to Y$ is a function sending each element of $latex X$ to a probability distribution on $latex Y$. Is there a purely category-theoretic way to tell if a morphism in FinStoch comes from one in FinSet? You’re allowed to use the symmetric monoidal category structure on FinStoch if you want.

(You can also change the rules of the game in other ways if doing so gives you an interesting theorem! The general goal is to ‘recognize’ functions inside categories of a more stochastic nature.)
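To make the objects in this puzzle concrete, here is a small sketch that represents a stochastic map between finite sets as a matrix whose columns are probability distributions, with composition given by matrix multiplication. It is not an answer to the puzzle: the test `is_deterministic` looks directly at matrix entries, which is exactly the kind of non-categorical description the puzzle asks us to avoid. All names here are hypothetical.

```python
def from_function(f, n_in, n_out):
    """Column-stochastic matrix of a function f: {0..n_in-1} -> {0..n_out-1}."""
    return [[1.0 if f(j) == i else 0.0 for j in range(n_in)] for i in range(n_out)]


def is_stochastic(M, tol=1e-12):
    """True when every column is a probability distribution on the outputs."""
    rows, cols = len(M), len(M[0])
    for j in range(cols):
        column = [M[i][j] for i in range(rows)]
        if any(x < -tol for x in column) or abs(sum(column) - 1.0) > tol:
            return False
    return True


def compose(N, M):
    """Matrix of 'first M, then N', i.e. the ordinary matrix product N M."""
    return [[sum(N[i][k] * M[k][j] for k in range(len(M)))
             for j in range(len(M[0]))] for i in range(len(N))]


def is_deterministic(M, tol=1e-12):
    """True when every column is a point mass, so M comes from a function."""
    return is_stochastic(M, tol) and all(
        abs(x) <= tol or abs(x - 1.0) <= tol for row in M for x in row)


coin = [[0.5], [0.5]]                          # a fair coin: stochastic map 1 -> 2
swap = from_function(lambda j: 1 - j, 2, 2)    # a function 2 -> 2
collapse = from_function(lambda j: 0, 2, 1)    # the unique function 2 -> 1

print(is_deterministic(swap), is_deterministic(coin))
print(compose(swap, coin), compose(collapse, swap))
```

Composing two matrices that come from functions gives another such matrix, which is one way to see FinSet sitting inside FinStoch in this representation.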

There’s a slightly similar puzzle that has a known answer. Let FinRel be the category of finite sets and relations. There’s a purely 2-category-theoretic way to tell if a morphism in FinRel comes from one in FinSet: these are the morphisms with right adjoints. But this approach doesn’t seem to work for FinStoch.

It’s possible we should do something like this: embed FinSet in , the category where objects are finite sets and a morphism is an -shaped matrix of real numbers. Since is equivalent to the category of finite-dimensional real vector spaces there’s no purely category-theoretic way to recognize which morphisms come from functions. But, if you equip an object with a nice commutative Frobenius algebra structure, this amounts to equipping a finite-dimensional real vector space with a basis. And then, given two such vector spaces with basis, we should be able to recognize which linear maps between them send basis elements to basis elements!

Ok, so now we have two puzzles: first, figuring out whether the equilibrium state should be the prior or the posterior in talking about the decrease in relative information during the approach to equilibrium, and second abstractly characterizing FinSet as a subcategory of FinStoch.

Concerning the first, I wonder if the natural statement would be that relative information *always* decreases under a certain class of dynamics $latex T$:

$latex I(T(p), T(q)) \le I(p, q)$

Since an equilibrium state $latex q$ satisfies $latex T(q) = q$, this would automatically imply that the relative information with respect to equilibrium is non-increasing, in both ways of choosing prior and posterior state!

For example, this inequality is known to hold when $latex T$ operates linearly on the probabilities, i.e. when $latex T$ is a stochastic map. The question is, what is the most general class of transformations for which this holds? Evolution according to the replicator equation probably allows stationary states which are not evolutionarily stable, right? This would suggest that the inequality cannot hold in this generality when $latex T$ is replicator equation dynamics.
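The stochastic-map case is easy to probe numerically. The sketch below draws random probability distributions and random column-stochastic matrices and counts how often applying the matrix to both distributions increases their relative information; the expected count is zero. The helper names, the dimensions, and the use of a seeded random generator are all arbitrary choices for illustration.

```python
import math
import random


def relative_information(p, q):
    """sum_i p_i ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def random_distribution(n, rng):
    """A random probability distribution with strictly positive entries."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]


def random_stochastic_matrix(n_out, n_in, rng):
    """Columns are probability distributions, so T maps distributions to distributions."""
    cols = [random_distribution(n_out, rng) for _ in range(n_in)]
    return [[cols[j][i] for j in range(n_in)] for i in range(n_out)]


def apply(T, p):
    """Matrix-vector product T p."""
    return [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]


rng = random.Random(0)
violations = 0
for _ in range(1000):
    p = random_distribution(4, rng)
    q = random_distribution(4, rng)
    T = random_stochastic_matrix(3, 4, rng)
    before = relative_information(p, q)
    after = relative_information(apply(T, p), apply(T, q))
    if after > before + 1e-12:
        violations += 1
print(violations)
```

A random search like this is only a sanity check, not a proof, and it says nothing about nonlinear dynamics such as the replicator equation.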

Concerning the second problem, I haven’t yet been able to figure anything out, except for the details in your “purely 2-category-theoretic way to tell if a morphism in FinRel comes from one in FinSet”. The 2-morphisms are just inclusions of relations, right?
Could you say a bit more about the relation to Bayesian networks? I just wonder because I’ve recently thought a lot about these in the context of the foundations of quantum theory; this is the first outgrowth of that work. Bayesian networks don’t officially appear there yet, but they’re lurking in the background.

Concerning the first, I wonder if the natural statement would be that relative information always decreases under a certain class of dynamics $latex T$:

I like that idea. By the way, is this true when $latex T$ is a stochastic operator, i.e. a linear operator that maps probability distributions to probability distributions? It feels like it should be true.

Evolution according to the replicator equation probably allows stationary states which are not evolutionarily stable, right?

I’m pretty sure that’s right, though I don’t know this stuff well enough to instantly cough up an example.

This would suggest that the inequality cannot hold in this generality when $latex T$ is replicator equation dynamics.

This would suggest it, but not actually prove it, so we should be careful!

Concerning the second problem, I haven’t yet been able to figure anything out, except for the details in your “purely 2-category-theoretic way to tell if a morphism in FinRel comes from one in FinSet”. The 2-morphisms are just inclusions of relations, right?

Right, or what people call ‘implication’: for example, ‘x is the best friend of y’ implies ‘x is the friend of y’. So FinRel becomes a locally posetal 2-category, sometimes called a ‘2-poset’.

Could you say a bit more about the relation to Bayesian networks?

Since Brendan Fong and Jamie Vicary are working on Bayesian networks at the CQT and I’m planning to write lots of posts about them in the Network Theory series, it’s hard for me to say just a bit about them.

I like that idea. By the way, is this true when T is a stochastic operator, i.e. a linear operator that maps probability distributions to probability distributions? It feels like it should be true.

Yes, that is true, as I tried to explain in the previous comment. Finding a good reference is surprisingly hard; the quantum generalization of this inequality appears for example as eq. (7) in this review paper.

Since Brendan Fong and Jamie Vicary are working on Bayesian networks at the CQT and I’m planning to write lots of posts about them in the Network Theory series, it’s hard for me to say just a bit about them.

Okay, I’m looking forward to those posts and Brendan and Jamie’s work!

Is there a purely category-theoretic way to tell if a morphism in FinStoch comes from one in FinSet? You’re allowed to use the symmetric monoidal category structure on FinStoch if you want.

Here’s a solution which you’re probably going to be disappointed with. I assume that the monoidal structure is the “disjoint union” which restricts to the coproduct in FinSet.

For each , fix some -element set . Then there is a unique morphism in FinStoch, and likewise a unique morphism . FinSet is the monoidal subcategory generated by these morphisms and all isos.

While this is most likely not the highbrow characterization you’re looking for, it shows at least that purely “external” characterizations are possible. In particular, every monoidal automorphism (or autoequivalence) of FinStoch preserves the subcategory FinSet.

Actually I was thinking about the monoidal structure on FinStoch that restricts to the product in FinSet, since a random variable taking values in is a stochastic map

and if we have two random variables

the pair of them can be considered a random variable in its own right

By the way, I learned this idea of treating random variables as stochastic maps from Jamie Vicary. It means the slice category [1]/FinStoch is quite interesting.

But the reason I’m not entirely satisfied with your characterization of FinSet inside FinStoch is not this, and it’s not that it’s insufficiently ‘highbrow’: it’s that you’re describing morphisms that generate this subcategory, whereas I was hoping there’s some property that holds precisely for the morphisms in this category.

But I think I made some progress with Jamie and Brendan from a slightly different direction: starting from the category of finite-dimensional real Hilbert spaces, then picking out FinSet as the subcategory of special commutative dagger-Frobenius algebras with morphisms preserving the comultiplication, and then using that to also pick out FinStoch as a subcategory. This seems like a decent way to go about things, especially because ‘infinitesimal stochastic’ maps, i.e. generators of 1-parameter stochastic groups, are also present in the original big category.

Actually I was thinking about the monoidal structure on FinStoch that restricts to the product in FinSet […]

Ah, darn! I can see why this makes more sense in the context of Bayesian networks.

It means the slice category [1]/FinStoch is quite interesting.

Yes, it’s like FinProb, with functions generalized to stochastic maps. You had brought this up before!

But the reason I’m not entirely satisfied with your characterization of FinSet inside FinStoch is […] that you’re describing morphisms that generate this subcategory, whereas I was hoping there’s some property that holds precisely for the morphisms in this category.

Yes, I know! That’s what I meant by “highbrow” ;)

But I think I made some progress with Jamie and Brendan from a slightly different direction: starting from the category of finite-dimensional real Hilbert spaces, then picking out FinSet as the subcategory of special commutative dagger-Frobenius algebras with morphisms preserving the comultiplication, and then using that to also pick out FinStoch as a subcategory.

Great to see you’re making progress! I’m a bit puzzled by the term “subcategory” here; you rather get FinSet and FinStoch as a category of algebras of a PROP (or something like that) in FHilb, right?

I’m a bit puzzled by the term “subcategory” here; you rather get FinSet and FinStoch as a category of algebras of a PROP (or something like that) in FHilb, right?

Yeah, I shouldn’t have called it a “subcategory”—I was trying to show off, but I got carried away.

Among category theorists, it’s common to show off by saying X is a subset of Y even if it’s not, as long as X is equipped with a monomorphism from X to Y. This is, after all, the ‘non-evil’ variant of the notion of subset, as used in structural set theory. Being a subset is then not a property, but a structure.

The obvious functor from FinSet (or FinStoch) to finite-dimensional real Hilbert spaces is faithful, but not full. In this situation, it’s bound to be confusing to think of this functor as making FinSet a ‘subcategory’ of FinHilb, since it’s like saying the category of groups is a subcategory of Set, which sounds really stupid.

That’s very cool! One nice aspect of category theory, in contrast to real life, is that it’s pretty clear what’s evil and what’s not ;)

This is off topic, but I wonder if that means that even the concept of ‘morphism’ is evil: a morphism is an element of the set of morphisms between two objects, and elements are evil! A similar “objection” applies to the ‘objects’ in a category. Does higher category theory offer a solution to this problem?

One nice aspect of category theory, in contrast to real life, is that it’s pretty clear what’s evil and what’s not ;)

Well, a lot of people have trouble understanding the concept of ‘evil’ in category and n-category theory… and I’m afraid maybe you do, too!

Elements of sets are not evil, nor are objects of categories.

What’s ‘evil’, in the technical sense of that word, is to make a definition or state a theorem asserting some property of objects in an n-category that holds for one object but fails to hold for some equivalent object.

This is fairly precise, but click the link to read more subtleties.

For example, it’s evil to define a function

$latex f: X \to Y$

to be an inclusion when it’s 1-1 and $latex X$ is a subset of $latex Y$…

…if we are thinking of functions as objects in the arrow category of Set. The reason is that the second clause, “$latex X$ is a subset of $latex Y$”, can easily be true for some function

$latex f: X \to Y$

but not true for some function

$latex f': X' \to Y'$

that’s isomorphic in the arrow category.

To fix this, we can make the non-evil definition: a function

$latex f: X \to Y$

is an injection when it’s 1-1.

And so on…

One reason for avoiding ‘evil’ definitions and theorems is just that then all the facts we prove are guaranteed to be invariant under equivalence, which means we don’t need to check a lot of annoying fine print before applying them. But some category theorists get very annoyed that I use the term ‘evil’ in this way… because they want to do things that are evil in this sense, and don’t like being called ‘evil’.

Abstract. The equations of evolutionary change by natural selection are commonly expressed in statistical terms. Fisher’s fundamental theorem emphasizes the variance in fitness. Quantitative genetics expresses selection with covariances and regressions. Population genetic equations depend on genetic variances. How can we read those statistical expressions with respect to the meaning of natural selection? One possibility is to relate the statistical expressions to the amount of information that populations accumulate by selection. However, the connection between selection and information theory has never been compelling. Here, I show the correct relations between statistical expressions for selection and information theory expressions for selection. Those relations link selection to the fundamental concepts of entropy and information in the theories of physics, statistics, and communication. We can now read the equations of selection in terms of their natural meaning. Selection causes populations to accumulate information about the environment.

Abstract. If biology is the study of self-replicating entities, and we want to understand the role of information, it makes sense to see how information theory is connected to the ‘replicator equation’—a simple model of population dynamics for self-replicating entities. The relevant concept of information turns out to be the information of one probability distribution relative to another, also known as the Kullback–Leibler divergence. Using this we can see evolution as a learning process, and give a clean general formulation of Fisher’s fundamental theorem of natural selection.
