
May 1, 2011

Category-Theoretic Characterizations of Entropy

Posted by John Baez

Tobias Fritz, Tom Leinster and I seem to be writing a paper on category-theoretic characterizations of entropy—not only the good old Shannon entropy, but also the Rényi entropy $S_q$, which depends on a parameter $q$ and reduces to the Shannon entropy as $q \to 1$. Curiously, you can’t get the Shannon entropy by just sticking $q = 1$ in the definition of Rényi entropy: you really need to sneak up on it, taking a limit as $q \to 1$.

A while back I wrote a paper on Rényi entropy and free energy and blogged about it here. I got a bunch of very helpful feedback, which improved the paper immensely and allowed me to state the main result in this way: the Rényi entropy is a ‘$q$-deformed version’ of the ordinary entropy. The ordinary Shannon entropy can be defined as minus the derivative of free energy $F(T)$ with respect to temperature $T$. To get the Rényi entropy, we must use a generalization of the usual derivative, called a ‘$q$-derivative’:

$$S_q = - \frac{F(T/q) - F(T)}{1/q - 1}$$

To be completely honest, this should be called a ‘$1/q$-derivative’. But that’s no big deal. What’s more interesting is that now it’s obvious why taking the limit $q \to 1$ is a good idea: that’s how $q$-derivatives reduce to ordinary derivatives!
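As a quick numerical sanity check (a standalone sketch, not from the paper), one can compute the Rényi entropy directly, confirm that it agrees with the difference quotient above under the conventions of the free-energy paper (energies $E_i = -\ln p_i$, so $F(T) = -T \ln \sum_i p_i^{1/T}$ and $F(1) = 0$), and watch it sneak up on the Shannon entropy as $q \to 1$:

```python
import math

def shannon(p):
    """Shannon entropy -sum p_i ln p_i (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi(p, q):
    """Renyi entropy S_q = ln(sum p_i^q) / (1 - q), for q != 1."""
    return math.log(sum(x**q for x in p)) / (1 - q)

def free_energy(p, T):
    """F(T) = -T ln Z(1/T) with Z(beta) = sum_i p_i^beta, i.e. taking
    energies E_i = -ln p_i (a convention assumed here)."""
    return -T * math.log(sum(x**(1 / T) for x in p))

p = [0.5, 0.25, 0.25]
q = 2.0
T = 1.0

# S_q as minus the 1/q-difference quotient of F at T = 1:
via_free_energy = -(free_energy(p, T / q) - free_energy(p, T)) / (1 / q - 1)
assert abs(renyi(p, q) - via_free_energy) < 1e-12

# Sneaking up on q = 1 recovers the Shannon entropy:
assert abs(renyi(p, 1 + 1e-8) - shannon(p)) < 1e-6
```

The second assertion is exactly the point about the limit: plugging $q = 1$ into `renyi` would divide by zero, but values of $q$ near 1 approach the Shannon entropy.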

To better understand the connection between $q$-derivatives and Rényi entropy, it would help to understand what’s important—if anything—about Rényi entropy. It’s not exactly a central concept in statistical mechanics. As far as I can tell, it’s more of a curiosity. But there are a few interesting results about it here and there, and one of the more interesting can be found in Rényi’s original paper:

A. Rényi, On measures of information and entropy, in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1960, pp. 547–561. Also freely available online via Project Euclid.

This paper starts with a nice characterization of Shannon entropy due to Fadeev, and then Theorem 3 gives a characterization of Rényi entropy.

Now Tom Leinster has taken Fadeev’s characterization of Shannon entropy and stated it in beautiful category-theoretic terms. Tobias Fritz suggested a possible avenue toward doing something similar with Rényi entropy, but we’re not there yet. We were discussing this on my other blog, but when Tom uttered the word ‘2-monad’, I realized it was time to move the conversation here.

1. Sometimes we think of $[0,\infty)$ as a category with objects in $[0,\infty)$ and a unique morphism from $x$ to $y$ iff $x \ge y$.

2. Sometimes we think of $[0,\infty)$ as a category with one object, with morphisms in $[0,\infty)$ and composition given by addition.

Of course there’s no real conflict: we can combine these viewpoints by working with a monoidal category with $[0,\infty)$ as its set of objects, a unique morphism from $x$ to $y$ iff $x \ge y$, and addition as the tensor product. Indeed we sometimes want to get multiplication into the game too, and then we get a ‘rig category’. And sometimes we want to use the topology on $[0,\infty)$, too. Then we’ve got a rig category internal to $\mathrm{Top}$.

All this is bog standard¹: it’s just a fancy way of thinking about $[0,\infty)$ as a topological ordered rig.

For finite probability measure spaces, the corresponding level shift is a bit more murky in my mind, and that’s what I’d like to start by talking about. I’m hoping it works in a very analogous way! But let’s see:

1. Sometimes we work with the category $FinProb$ where finite probability measure spaces are objects and measure-preserving maps are morphisms.

2. Sometimes we work with an operad, the ‘Giry operad’, where the set of $n$-ary operations is the set of probability measures on the $n$-point set.

Of course this set is also known as the $(n-1)$-simplex, so all of a sudden instead of doing statistical mechanics we’re in the simplicial world that homotopy theorists like to inhabit! Maybe these worlds are connected more intimately than people suspect. But never mind that for now. My question is:

What’s a nice way to combine viewpoints 1. and 2. for finite probability measure spaces, which is closely analogous to something that also works for $[0,\infty)$?

We could then hope to say that the Shannon entropy is a map

$$S : FinProb \to [0,\infty)$$

that preserves a lot of the structures that $FinProb$ and $[0,\infty)$ have in common. Indeed, it might be uniquely characterized this way, based on Tom’s reinterpretation of Fadeev’s theorem.

You’ll notice an irksome apparent disanalogy between what I did with $[0,\infty)$ and what I’m doing with finite probability measure spaces: in the second case, and only in that case, do we see operads entering the game.

But that’s probably because things haven’t been cleaned up enough yet.

I think that this is one of those puzzles where finally having had the wits to state it, I should be able to solve it. Anyway, this blog entry is mainly meant as a place where Tobias, Tom and interested bystanders can join me in discussing our paper.

¹ I love it when Brits say ‘bog standard’, so I wanted to see if I could pull it off convincingly. How did I do? Apparently England is so wet that bogs are considered ‘standard’. I just like the way it sounds: curt and dismissive.

Posted at May 1, 2011 8:04 AM UTC




Re: Category-Theoretic Characterizations of Entropy

So, to start answering my own question, let me cook up a monoidal category that’s related to probability measure spaces but also to the ‘Giry operad’, with a monoidal structure somehow analogous to addition in $[0,\infty)$.

Whenever we have any monoidal category $M$ and any object $x \in M$ we get an operad, the co-concrete operad of $x$, whose set of $n$-ary operations is $hom(x, x^{\otimes n})$.

So, let’s create a category where the objects are finite sets and the morphisms are ‘stochastic maps’. More precisely, a morphism $F: S \to T$ in this category is a function $F: S \to P(T)$ where $P(T)$ is the space of probability measures on $T$. This functor $P$ is part of a monad called the ‘Giry monad’, and our ability to compose morphisms arises from this fact.

In jargonesque terms, we’re forming the full subcategory of the Kleisli category of the Giry monad, where the objects are just finite sets.

Let’s hope this category becomes monoidal with disjoint union of sets as the tensor product.

Then the set of morphisms from $1$ to $n = 1 + 1 + \cdots + 1$ is the set of probability measures on the $n$-element set. So, the co-concrete operad of the 1-element set seems to be the Giry operad!
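In very concrete terms (a sketch with made-up variable names), a morphism in this category is a row-stochastic matrix, Kleisli composition is matrix multiplication, and a morphism $1 \to n$ is exactly a probability measure on the $n$-point set:

```python
import numpy as np

def is_stochastic(m):
    """Each row is a probability distribution on the codomain."""
    return bool(np.all(m >= 0)) and np.allclose(m.sum(axis=1), 1.0)

def compose(f, g):
    """Kleisli composition of stochastic maps f: S -> P(T), g: T -> P(U):
    (f ; g)(s)_u = sum_t f(s)_t g(t)_u, i.e. matrix multiplication."""
    return f @ g

# f: 2 -> P(3) and g: 3 -> P(2), as row-stochastic matrices
f = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.2, 0.7]])
g = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
fg = compose(f, g)
assert is_stochastic(f) and is_stochastic(g) and is_stochastic(fg)

# A morphism 1 -> 3 is a single probability measure on the 3-point set:
point = np.array([[0.2, 0.3, 0.5]])
assert is_stochastic(point)
```

Composites of stochastic maps are again stochastic, which is the concrete shadow of composition in the Kleisli category.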

The appearance of the ‘co’ here is a bit curious, but otherwise I like this construction.

Re: Category-Theoretic Characterizations of Entropy

Hi John. I think what you’re doing here is part of the general story about theories and operads. In other words, what you wrote doesn’t use anything special about the particular theory we’re considering.

What you call a co-concrete operad is generally called a co-endomorphism operad. The category of finite sets and stochastic maps is the opposite of the Lawvere theory corresponding to the Giry monad.

Generally, take a finitary algebraic theory $T$. (As you know, there are several equivalent ways of defining “finitary algebraic theory”; use whichever you like.) Write $T_n$ for the set of $n$-ary operations. The Lawvere theory $L$ corresponding to $T$ satisfies $L(n, 1) = T_n$. The finitary monad on $Set$ corresponding to $T$, which I’ll also call $T$, satisfies $T(\{1, \ldots, n\}) = T_n$. The underlying operad $P_T$ in $Set$ has $T_n$ as its set of $n$-ary operations. So the endomorphism operad of $1 \in L$ is always $P_T$.

Re: Category-Theoretic Characterizations of Entropy

Ah, I see that the category of finite sets and stochastic maps, and the observation that this is the opposite of our Lawvere theory, are in Tobias’s paper A presentation of the category of stochastic matrices. I suspect they’re also in some oldish work of Giry and Lawvere.

Re: Category-Theoretic Characterizations of Entropy

Yes, I guess this relation between monads, operads and algebraic theories was part of my thinking. The main thing that puzzled me was that:

The category of finite sets and stochastic maps is the opposite of the Lawvere theory corresponding to the Giry monad.

(emphasis mine). The category of finite sets and stochastic maps seems much more interesting than its opposite, since a ‘random process’ in physics, biology, etc. etc. can be thought of as a stochastic map, or a family of such. It makes a lot of sense to assign an entropy to a stochastic map: the entropy says how much randomness the map introduces. It also makes a lot of sense to think of a probability distribution on a finite set $S$ as a special case of a stochastic map: namely, a ‘random element’ $f : 1 \to S$.

So, I was very happy with this category, but puzzled that this is the opposite of some important Lawvere theory. But now I’m thinking that this is just that usual thing about how for any Lawvere theory $Th$, the category of finitely generated free models is the opposite of $Th$. Right?

Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

So we have several ways of thinking about the same category…

Yes, it’s a wonderful category, one of those primordial ones that repays endless investigation. I suspect that once we understand it thoroughly enough, the nice characterizations of Shannon entropy and Rényi entropy will just pop out. They’ll be the only functors out of this category (or monoidal 2-category!) that preserve certain amounts of structure.

Re: Category-Theoretic Characterizations of Entropy

For some reason your comment reminded me of this classic:

Lisa Simpson: I’m going to become a vegetarian.
Homer Simpson: Does that mean you’re not going to eat any pork?
Lisa: Yes.
Homer: Bacon?
Lisa: Yes Dad.
Homer: Ham?
Lisa: Dad, all those meats come from the same animal!
Homer: Right Lisa, some wonderful, magical animal!

Re: Category-Theoretic Characterizations of Entropy

Right?

Not sure. It’s crucial that the simplices are the (finitely generated) free convex algebras, and that the finite-dimensional vector spaces are the (finitely generated) free vector spaces… because every vector space is free.

I don’t have a clear picture in my mind of the analogy between convex algebras and vector spaces (or modules), but the freeness was part of the idea I was trying to convey.

Re: Category-Theoretic Characterizations of Entropy

It seems to me as though the category of finitely generated free convex algebras (= simplices) is quite analogous to the category of finitely generated free vector spaces. We always look at the finitely-generated free things when defining a Lawvere theory, right, whether or not every algebra for the theory is free?

Re: Category-Theoretic Characterizations of Entropy

Right. However, having studied this category for a few months, I came to the conclusion that it is usually more natural and always technically more convenient to consider conical combinations instead of convex combinations. In other words, $\mathbb{R}_{\geq 0}$-semimodules are nicer than convex algebras, but no less general. Another option is to consider only those conical combinations where the total weight is $\leq 1$. In order to get the corresponding Lawvere theories, we simply have to replace the category of stochastic matrices by the category of all matrices with entries in $\mathbb{R}_{\geq 0}$, or by the category of matrices over $\mathbb{R}_{\geq 0}$ where each column sums to $\leq 1$. In any case, with this we are back to the trichotomy between probability measures, measures, and subprobability measures.
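Concretely, this trichotomy amounts to three predicates on matrices over $\mathbb{R}_{\geq 0}$ (an illustrative sketch, not from Tobias’s paper; following the convention just stated, it is the columns that carry the (sub)probability weights):

```python
import numpy as np

def is_nonneg(m):
    """Matrices over R_{>=0}: arbitrary conical combinations (measures)."""
    return bool(np.all(m >= 0))

def is_stochastic_cols(m):
    """Each column sums to 1: convex combinations (probability measures)."""
    return is_nonneg(m) and np.allclose(m.sum(axis=0), 1.0)

def is_substochastic_cols(m):
    """Each column sums to <= 1: subprobability measures."""
    return is_nonneg(m) and bool(np.all(m.sum(axis=0) <= 1.0 + 1e-12))

m = np.array([[0.5, 0.2],
              [0.5, 0.3]])
assert is_nonneg(m)
assert is_substochastic_cols(m)    # columns sum to 1.0 and 0.5
assert not is_stochastic_cols(m)   # the second column sums to only 0.5
```

Tightening the predicate step by step carves out the three Lawvere theories mentioned above.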

By the way, the stuff that is in my paper A presentation of the category of stochastic matrices is also just general abstract nonsense about theories. It does the following: given a presentation of a (non-symmetric) operad over $Set$, it gives a presentation of the associated Lawvere theory. It is so simple that it might as well be trivial to the experts. Some day I would like to see whether this is indeed the case, and if not, rewrite it in full generality.

Re: Category-Theoretic Characterizations of Entropy

By “the associated” Lawvere theory of an operad, do you mean the Lawvere theory with the same algebras as the operad? You may know that the relationship (which is an adjunction) between Lawvere theories and operads is part of a general picture of “generalized operads / multicategories”, which I said a bit about in another context here. But I don’t know to what extent one can extract an “operation on presentations” (i.e. given a presentation for one, get a presentation for the other) from the abstract nonsense.

Re: Category-Theoretic Characterizations of Entropy

Let’s see. A presentation of an operad is a parallel pair $F X \rightrightarrows F Y$ of maps of operads, where $X$ and $Y$ are signatures (objects of $Set^{\mathbb{N}}$), $F$ assigns to a signature the free operad on it, and the two maps are any maps of operads (not necessarily of the form “$F$ of something”). The operad presented is the coequalizer of this pair.

The same goes for Lawvere theories, now interpreting $F$ as the free theory on a signature.

In the adjunction between theories and operads, the left adjoint goes from operads to theories. So that’s going to turn a presentation of an operad into a presentation of a theory. (And if you think about it concretely, that’s clear too.) The other way round doesn’t seem instantly obvious from this point of view, but I’ve only given it a minute’s thought.

Re: Category-Theoretic Characterizations of Entropy

Here’s a related thought, concerning your second list:

the category $FinProb$

the Giry operad.

Again, it comes from some general stuff about theories.

There’s a one-to-one correspondence between finitary endofunctors of $Set$ and functors $FinSet \to Set$. (In one direction, the correspondence is given by restriction. In the other, it’s given by left Kan extension.) But also, for any category $C$, there’s a one-to-one correspondence between functors $C \to Set$ and discrete opfibrations over $C$. So: finitary endofunctors of $Set$ correspond to discrete opfibrations over $FinSet$.

Now, a finitary algebraic theory corresponds to a finitary monad $T$ on $Set$, and therefore gives rise to a finitary endofunctor of $Set$ (by throwing away the unit and multiplication). So we can apply the correspondence just described. The resulting category $E(T)$, discretely opfibred over $FinSet$, has:

as objects: pairs $(n, \nu)$ where $n$ is a finite set and $\nu \in T(n)$ is an $n$-ary operation

Re: Category-Theoretic Characterizations of Entropy

…it’s also the case that the first item can be derived from the second by throwing away some information.

(I say “also” because I showed in this previous comment that the category $FinProb$ can be obtained from the Giry monad by throwing away some information. Actually the second item on your second list was the Giry operad, but I’m hoping you don’t mind.)

The poset $([0, \infty), \geq)$ is obtained from the monoid $([0, \infty), +, 0)$ by taking the (unique) slice category. That is, $([0, \infty), +, 0)$ is a one-object category, and the slice category over the unique object is the poset $([0, \infty), \geq)$.

I call this “throwing away some information” for the following reason. Whenever you have a monoid $M$, you can view it as a category with single object $\star$ and form the slice category $M/\star$. Its objects are the elements of $M$, and a map $m \to n$ is an element $p \in M$ such that $m = n p$. If $M$ has cancellation then $M/\star$ is a preordered set, the order relation being divisibility. But different monoids can have the same underlying preordered set. For example, all groups have the same underlying preordered set, up to equivalence: it’s trivial. All groups of the same order have the same underlying preordered set, up to isomorphism.

I’m not sure if the following is relevant, but I feel compelled to finish the thought. Whenever you have a category $C$ and an object $c \in C$, you not only have a slice category $C/c$: you also have a forgetful functor $C/c \to C$. This is a discrete fibration. And discrete fibrations on $C$ correspond to functors $C^{op} \to Set$; in this case, it’s the representable $C(-, c)$. (What else could it possibly be?)

Applied to our monoid $([0, \infty), +, 0)$, this gives a forgetful functor
$$([0, \infty), \geq) \to ([0, \infty), +, 0).$$
It’s trivial on objects, and sends a map $x \geq y$ in the domain category to the map $x - y$ in the codomain category. It corresponds to the unique representable functor $([0, \infty), +, 0)^{op} \to Set$. Now functors $([0, \infty), +, 0)^{op} \to Set$ are sets with a right action by this monoid, and in this case it’s the right regular representation: $[0, \infty)$ acting on itself, on the right, by addition.

So all of that stuff comes out of the monoid $[0, \infty)$ by general categorical mechanisms. You can find this in Section 4 of Lawvere’s paper Taking categories seriously.

Re: Category-Theoretic Characterizations of Entropy

Though I seem to have exposed the Giry operad as a part of a monoidal category $FinStoch$ with

finite sets as objects

stochastic maps as morphisms

this is off by a ‘level shift’ from the monoidal category $M$ with $[0,\infty)$ as objects and $+$ as tensor product.

After all, the entropy of a probability measure on a finite set is a number. But numbers are objects in $M$, while probability measures on finite sets are certain morphisms in $FinStoch$.

So let’s make up a monoidal category $X$ with one object, numbers in $[0,\infty)$ as morphisms from this object to itself, and $+$ as composition of morphisms. By Eckmann–Hilton, $+$ will also be the tensor product of morphisms. This will be good.

Now, I claim that Shannon entropy can be reinterpreted as a monoidal functor

$$F: FinStoch \to X$$

How? Well, first of all I should define $F$ on morphisms (on objects it’s trivial). A morphism $f: 1 \to T$ in $FinStoch$ is just a function $f: 1 \to P(T)$, where $P(T)$ is the set of probability measures on the finite set $T$. So, it’s a probability measure on $T$. Let $F(f)$ be the Shannon entropy of this probability measure.

Next, a general morphism $f : S \to T$ in $FinStoch$ is a function $f : S \to P(T)$. So, it’s an $S$-tuple of probability measures on $T$. Let $F(f)$ be the sum of the Shannon entropies of these probability measures.
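This definition of $F$ is easy to write down in code (a sketch using natural logarithms, with stochastic maps represented as row-stochastic matrices, one row per point of the domain):

```python
import numpy as np

def shannon(p):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def F(f):
    """Proposed entropy of a stochastic map: the sum of the Shannon
    entropies of its rows (one row per point of the domain)."""
    return sum(shannon(row) for row in f)

# A single probability measure, seen as a morphism 1 -> 2:
fair_coin = np.array([[0.5, 0.5]])
assert abs(F(fair_coin) - np.log(2)) < 1e-12

# A 2-tuple of probability measures, seen as a morphism 2 -> 2:
f = np.array([[0.5, 0.5],
              [1.0, 0.0]])
assert abs(F(f) - np.log(2)) < 1e-12   # ln 2 from the first row, 0 from the second
```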

Now, I’m hoping that $F$ is a monoidal functor!

The proof is supposed to be reminiscent of Tom’s proof that Shannon entropy defines a ‘lax point’. Indeed I’m trying to develop this stuff as an alternative way of saying everything Tom said. While his formulation is elegant, it doesn’t seem to emphasize the fact that a probability distribution and its entropy are ‘things of the same type’. That’s what I’m trying to do now.

It may seem odd to check that $F$ is monoidal before I check it’s a functor, but I think preservation of tensor products is easier — at least after I tell you the tensor product in $FinStoch$. Let me do that.

Here’s the idea: given an $S$-tuple of probability distributions on $T$ and an $S'$-tuple of probability distributions on $T'$, there’s an obvious $(S+S')$-tuple of probability distributions on $T+T'$.

More precisely, suppose we have morphisms $f : S \to T$ and $f': S' \to T'$ in $FinStoch$. Then I want their tensor product to be a morphism from $S+S'$ to $T+T'$. Such a thing is a function
$$S+S' \to P(T+T')$$
or in other words, it’s a pair of functions
$$S \to P(T+T'), \qquad S' \to P(T+T')$$
But there are natural inclusions
$$P(T) \to P(T + T'), \qquad P(T') \to P(T + T')$$
so we can get a pair of the above sort from a pair of functions
$$S \to P(T), \qquad S' \to P(T')$$
and that’s what $f$ and $f'$ actually are!

Well, now I’m in a position to check that $F$ is a monoidal functor—but now it’s time for dinner, so I’ll quit here.

Re: Category-Theoretic Characterizations of Entropy

It seems to me that this $F$ is not a functor. Consider for example the object $S=\{0,1\}$. The set of probability distributions $P(S)$ contains the ‘fair coin’, which is the distribution where both outcomes occur with probability $1/2$.

Let’s look at the endomorphism $f:S\rightarrow S$ which maps both elements to the fair coin. It has the property that $f g = f$ for any other $g \in End(S)$. If $H$ were compatible with this kind of composition, then we would deduce $H(f)+H(g)=H(f)$, so that $H(g)=0$!
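This counterexample is easy to check numerically (a sketch; composing in the order “first $g$, then $f$”, which for row-stochastic matrices is the product $g f$):

```python
import numpy as np

def shannon(p):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def F(f):
    """Sum of the Shannon entropies of the rows of a stochastic matrix."""
    return sum(shannon(row) for row in f)

# f sends both elements of S = {0, 1} to the fair coin:
f = np.array([[0.5, 0.5],
              [0.5, 0.5]])

# any other endomorphism g of S, say a biased one:
g = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# f o g means "first g, then f"; as row-stochastic matrices that is g @ f:
fg = g @ f
assert np.allclose(fg, f)            # f o g = f, no matter what g is

# So functoriality F(f o g) = F(f) + F(g) would force F(g) = 0...
assert abs(F(fg) - F(f)) < 1e-12
assert F(g) > 0                      # ...but F(g) isn't 0
```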

Re: Category-Theoretic Characterizations of Entropy

I agree with Tobias’s conclusion that $F$ isn’t a functor. According to my calculations, it is monoidal and does preserve identities, but doesn’t preserve composition.

Before I give some details, may I suggest that we avoid using $T$ for finite sets? We’ve got a monad kicking around, and monads are usually called $T$. Plus, $T$ stands for temperature. (It’s one of several notational clashes. There’s $p$, which wants to stand for both a probability measure and the parameter $p$ of “$p$-norm”. And there’s $\mu$, which wants to stand for both a measure and the multiplication of a monad.)

Take stochastic maps $f: A \to P(B)$ and $g: B \to P(C)$. Their composite in $FinStoch$ is the composite
$$A \stackrel{f}{\to} P(B) \stackrel{P(g)}{\to} P^2(C) \stackrel{mult_C}{\to} P(C)$$
in $FinSet$. If I’ve done it right, this is
$$a \mapsto \Bigl(\sum_b f(a)_b g(b)_c\Bigr)_{c \in C}.$$
You can think of a map $A \to P(B)$ as an $A \times B$ matrix of nonnegative reals with the property that each row — or do I mean column? — sums to $1$. This is a stochastic matrix. From this point of view, composition is matrix multiplication.

Now, the image of this composite map under $F$ is
$$\sum_a H\Bigl(\Bigl(\sum_b f(a)_b g(b)_c\Bigr)_{c \in C}\Bigr) = \sum_a \Bigl\{ H(f(a)) + \sum_b f(a)_b H(g(b)) \Bigr\}$$
by the formula for the entropy of a convex combination (a straightforward generalization of your “magical property”). This is equal to
$$\sum_a H(f(a)) + \sum_{a, b} f(a)_b H(g(b)).$$
But for $F$ to be a functor, we’d need that to be equal to
$$\sum_a H(f(a)) + \sum_b H(g(b)),$$
and in general, it’s not.

Re: Category-Theoretic Characterizations of Entropy

Before I give some details, may I suggest that we avoid using $T$ for finite sets? We’ve got a monad kicking around, and monads are usually called $T$. Plus, $T$ stands for temperature.

I’ve never been big on calling monads $T$; it would make sense for ‘tonads’, but… anyway, you’re right, temperature is $T$ in this game.

So: the functoriality fails!

But here’s a fallback plan. Let’s take that monoidal category $X$ whose morphisms are numbers in $[0,\infty)$, and make it into a monoidal 2-category using inequalities as 2-morphisms. (A ‘monoidal 2-poset’.) Doesn’t your calculation show

$$F(f g) \le F(f) + F(g) ?$$

So that we’re getting a lax 2-functor?

Tobias’s counterexample illustrates why this is a sensible notion: the amount of randomness in a composite process should be less than or equal to the sum of the amounts of randomness introduced at each stage.

But how come your ‘lax point’ idea used the magical property to prove an equation, while here we’re just using it to prove an inequality? Maybe in some important cases this inequality is really an equation.

So, right now I’m stuck when it comes to translating Tom’s operadic characterization of Shannon entropy into a property of the map

$$F: FinStoch \to X$$

It’s a monoidal lax 2-functor (right?) but that fact alone is not enough to capture the ‘magical property’ of Shannon entropy (right?).
If so, I think we need to take into account more structure that $FinStoch$ and $X$ have in common, and say that $F$ preserves some of this extra structure. Something about ‘convex linear combinations’.

Re: Category-Theoretic Characterizations of Entropy

I haven’t thought about the inequalities yet, but isn’t $X$ merely a monoidal category rather than a monoidal 2-category? The objects of the monoidal category are the nonnegative reals, the morphisms are inequalities, and the monoidal structure is addition. (When you talk about the inequalities as being the 2-morphisms, I think you’re already viewing the monoidal category as a one-object 2-category.)

Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

I haven’t thought about the inequalities yet, but isn’t $X$ merely a monoidal category rather than a monoidal 2-category? The objects of the monoidal category are the nonnegative reals, the morphisms are inequalities, and the monoidal structure is addition.

Well, there’s a lot of level-shifting going on in this game! But in the comment where I introduced ‘$X$’, I defined it as a monoidal category with one object, elements of $[0,\infty)$ as morphisms, and addition as composition and also tensor product of morphisms. The reason for this ‘overly elevated’ description of $[0,\infty)$ is that I wanted to think of Shannon entropy as giving a functor

$$F: FinStoch \to X,$$

and it’s the morphisms, not the objects, of $FinStoch$ that get assigned entropies.

So then, when Tobias and you pointed out that $F$ is not a functor, but merely obeys

$$F(f g) \le F(f) + F(g)$$

I fought back by including 2-morphisms in $X$, namely inequalities between numbers. This makes $X$ into a monoidal 2-category. And, I believe it makes $F$ into a lax functor, which however is strictly monoidal. I’m hoping you’ll check me on that!

Now, you may think it’s absurdly highbrow to treat $[0,\infty)$ as a monoidal 2-category, but monoidal 2-categories seem like the ‘least fancy’ kind of $n$-category that allows us to treat entropy as some sort of lax $n$-functor

$$F: FinStoch \to X$$

By the way, I like the idea of shifting viewpoints using the slice construction you described here, both for $FinStoch$ and $X$. But actually, it seems that for a uniform treatment I need a coslice category instead. It seems that:

1) Starting from the category with one object, $[0,\infty)$ as morphisms, and $+$ as composition, and forming the coslice category under the only object, we get the poset $([0,\infty), \le)$.

2) Starting from the monoidal category with one object, $[0,\infty)$ as morphisms, and $+$ as both composition and tensor product, and forming the coslice category under the only object, we get the monoidal poset $([0,\infty), \le)$.

3) Starting from the monoidal category $FinStoch$ with finite sets as objects and stochastic maps as morphisms, and forming the coslice category under the terminal object, we get the monoidal category $FinProb$.

Entropy is defined on both $FinProb$ and $FinStoch$, and both are interesting to us; but it assigns numbers to objects of $FinProb$ and to morphisms of $FinStoch$. So I think there’s something worth pondering here.

Re: Category-Theoretic Characterizations of Entropy

Interesting that coslice categories seem to appear frequently!

Question: What does your $FinProb$ stand for, exactly? Previously I thought that we had taken it to have finite probability spaces as objects and measure-preserving functions as morphisms. But in your point 3) above, you also seem to have included stochastic maps. With this definition of $FinProb$, entropy is not functorial!

Using the former definition of $FinProb$, I had been thinking about coslice categories under objects of $FinProb$ here. The coslice category under $\Omega \in FinProb$ seems to be the category of random variables with sample space $\Omega$. Then we can describe joint distributions of random variables in terms of the categorical product on this coslice category! (This also proves that $FinProb$ itself does not have products: if it did, the coslice categories would have to inherit them; in that case, the product would not depend on $\Omega$; but for two random variables with given distributions, their joint distribution does depend on the sample space. qed.)

Re: Category-Theoretic Characterizations of Entropy

Tobias wrote:

Question: What does your $FinProb$ stand for, exactly? Previously I thought that we had taken it to have finite probability spaces as objects and measure-preserving functions as morphisms. But in your point 3) above, you also seem to have included stochastic maps.

Whoops! I hadn’t noticed that the coslice category of objects under $1$ in $FinStoch$ had more morphisms than $FinProb$.

With this definition of $FinProb$, entropy is not functorial!

I guess you’re right. Not in the original sense, where we had a functor from $FinProb$ to the category with numbers as objects and inequalities $\ge$ as morphisms. But how about mapping $FinProb$ to some 2-category with numbers for objects (entropies of probability distributions), numbers for morphisms (entropies of stochastic maps), addition for composition of morphisms, and inequalities as 2-morphisms. Could entropy be a lax functor?

(When I say “some category with…” it’s because I can think of more than one possibility like this. I haven’t had time to sort through the details, and I’d be glad for help.)

Re: Category-Theoretic Characterizations of Entropy

David wrote:

I thought the entropy notion attached to arrows of what you call FinStoch is what is called conditional entropy, since a map $p: X \to Y$ is a conditional distribution.

I’m a bit under the weather; maybe that explains why I can’t see why a stochastic map deserves to be called a conditional distribution. Say we have a stochastic map $f: S \to S'$. For each point in $S$, a stochastic map gives a probability distribution on $S'$. Here’s how I’m associating an entropy to that. If $S$ is a single point, $f: S \to S'$ is a probability distribution on the finite set $S'$, and I take its Shannon entropy. If $S$ is a bunch of points, I get a bunch of probability distributions, and I add their Shannon entropies.

Since my definition of the entropy for a stochastic map $f: S \to S'$ uses a sum over points of $S$, it’s pretty easy to show that this entropy gives a map

Re: Category-Theoretic Characterizations of Entropy

I’m a bit under the weather; maybe that explains why I can’t see why a stochastic map deserves to be called a conditional distribution.

But then you go on to explain how each point in $S$ gives a probability distribution on $S'$. That’s precisely the same as a conditional probability, $P(s'|s)$. For fixed $s \in S$, the sum of $P(s'|s)$ over $s'$ is one.

So then, yes you have an entropy for each s∈Ss \in S. You want a single number now, so choose to add them. I was just pointing out that the standard way of treating the entropy of ff is called conditional entropy which retains the full set of entropies for each point ss, and waits for a distribution on SS to form the expectation of the entropy.

Re: Category-Theoretic Characterizations of Entropy

And I wonder why you’re adding the entropies for each point s.

I could see the sense in taking the mean, which is working out how much uncertainty is introduced by the stochastic map f: S → S′ relative to a uniform distribution on S. There’s at least a barycentric flavour to that choice.

Doing it your way, the entropy for the stochastic map f: S → S′ which is uniform on S′, and independent of S, depends on the size of S. I can’t see the intuitive sense of that.

Intuitively speaking, it’s because a stochastic map f: n → m is an n-tuple of independent random variables taking values in the finite set m, so I’d expect the information in the whole bunch to be gotten by adding up the information of each one.

There may be other good things to do, but right now all I’m trying to do—with miserable lack of success—is capture Faddeev’s characterization of Shannon entropy in some neat category-theoretic language. None of my attempts have succeeded in capturing the magical property of Shannon entropy.

Tom has two nice ways to do it already! But neither completely satisfies me, for reasons explained here. I suspect I can take his results and restate them in a way that suits my philosophy of decategorification. But perhaps not today…

Re: Category-Theoretic Characterizations of Entropy

I hope your fever improves soon. I can see that in very concrete terms, we’re dealing with row stochastic matrices. To sum two such matrices, we just make a block diagonal matrix.

Now you want to assign a number, called entropy, to any such matrix in such a way that the entropy of the sum is equal to the sum of the entropies of the block submatrices. And a simple way to do this is just to take the entropy of a matrix to be the sum of the entropies of each of its rows.

It doesn’t strike me as a natural thing to do, but let’s see if it accords with Tom’s set up.

Re: Category-Theoretic Characterizations of Entropy

There’s been an overwhelming amount of activity on this thread, which is great, but means that I’ve only just got to this. I think the answer’s no.

To recap: we had the category FinStoch, whose objects are the finite sets and whose maps A → B are the A × B row-stochastic matrices, i.e. matrices of nonnegative reals whose rows sum to 1. Composition, if written in diagrammatic order, is matrix multiplication: the composite of

A \stackrel{f}{\to} B \stackrel{g}{\to} C

is the matrix product fg. We assigned to each map f: A → B in FinStoch the real number

F(f) = \sum_a H((f_{a b})_{b \in B})

where H is Shannon entropy. (By definition of FinStoch, the expression in brackets is a probability distribution, so we can indeed take its Shannon entropy.)
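Since F(f) is just a sum of row entropies, the additivity over block-diagonal ‘sums’ of stochastic matrices mentioned above is easy to check numerically. Here is a minimal sketch; the helper names are mine, not anyone’s established notation:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (in nats) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def F(f):
    """F(f) = sum over rows a of H((f_ab)_b), for a row-stochastic matrix f."""
    return sum(shannon_entropy(row) for row in f)

f = np.array([[0.5, 0.5],
              [0.1, 0.9]])
g = np.array([[1/3, 1/3, 1/3]])

# 'Summing' two row-stochastic matrices as a block-diagonal matrix:
total = np.zeros((3, 5))
total[:2, :2] = f
total[2:, 2:] = g

# F is additive over blocks, since it is a sum over rows
# (the padding zeros contribute nothing to the row entropies).
assert np.isclose(F(total), F(f) + F(g))
```

The same computation makes Tom’s question concrete: whether F interacts well with composition (matrix products) is a separate matter from this additivity over block sums.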

Re: Category-Theoretic Characterizations of Entropy

Entropy as a functor

Great. Very intriguing stuff.

It is possible to consider the entropy with respect to a counting measure on valuations. The valuations are generated by a process of object construction (reconstruction): diagram-building dynamics, constructive morphisms. Entropy is then a dynamical variable.
If the process depends on (is defined by) the entropy, nontrivial natural transformations of the process appear (with respect to the induced endofunctor process). That weakens associativity, and any condition represented by a finite diagram.

The entropy dynamically classifies the objects by their (defined) complexity. This results in some kind of convergence: entropic-memory convergence.
From an algorithmic point of view, the search algorithm searches for greater complexity. This algorithm has a cognitive property: the entropic memory.
Figuratively, the valuating process no longer stays where it was recently.

Re: Category-Theoretic Characterizations of Entropy

I love “seem to be writing a paper”! It encourages my fond hope that without anyone trying, a paper will simply become written.

On the most important question, I think your usage of “bog standard” is pretty much on target. It maybe doesn’t sound quite right… but perhaps that’s because it sounds funny coming from a Californian. As your link points out, a key factor in the phrase’s success is that “bog” is British slang for “toilet” — which makes “bog standard” doubly British.

Re: Category-Theoretic Characterizations of Entropy

(1) I have used the term ‘bislice category’ for the following: let C be a category, x, y ∈ C objects, and f: x → y a morphism. Now consider the category x/C/y where the objects are triples (z, i_z, p_z) with i_z: x → z and p_z: z → y such that p_z i_z = f. Morphisms are the obvious ones compatible with the given data. Does this kind of construction have a name? If not, is ‘bislice category’ appropriate? (It seems to be a full subcategory of a bislice category…)

(2) Is there a better way to introduce the relative entropy H(X/Z) than by defining it as H(X) − H(Z)? This definition seems a bit ad hoc. Also, is the term ‘relative entropy’ appropriate for this concept, which slightly differs from ordinary relative entropy?

Also I have been looking at Tom’s section on power means. It seems excellently written and I have nothing to say about the open questions mentioned, so I did not actually change anything. One thing that’s missing is an explanation of the relation to entropies — I suppose it’s motivated by the rig-theoretical characterization of Rényi entropies (John’s corollary 2)?

Re: Category-Theoretic Characterizations of Entropy

Hi Tobias. I’ll try to explain what the power means are doing there quite soon, partly in answer to your implied question and partly in answer to John’s question. But I’ve been at a conference all day and haven’t got the energy right now.

Re: Category-Theoretic Characterizations of Entropy

I’m really busy this second, but: we definitely should not use “relative entropy” to mean the difference in entropies; that term already has a standard meaning and that other standard concept will eventually become very important in what we’re talking about!

Re: Category-Theoretic Characterizations of Entropy

Re: Category-Theoretic Characterizations of Entropy

That’s usually referred to as conditional entropy. When I have read the term ‘relative entropy’ it has always been as a synonym for Kullback-Leibler divergence, which concerns a non-symmetric distance between distributions over the same space.

Re: Category-Theoretic Characterizations of Entropy

Ooops! That has cleared up a misconception: I had somehow thought of these two concepts as the same thing, although clearly they are not.

Then, of course I agree completely with John: I should definitely *not* use the term ‘relative entropy’. And relative entropy is indeed a very natural concept and it should be interesting to characterize it in categorical language!

Re: Category-Theoretic Characterizations of Entropy

We are already talking in this thread about the Kullback-Leibler distance = Kullback-Leibler divergence = relative entropy. The cross entropy equals entropy plus relative entropy, so mathematically there’s nothing new going on here.

The joint entropy, conditional entropy and mutual information are crucial tools of information theory. Shannon’s noisy channel coding theorem is all about mutual information. So let me try to explain these quantities a bit.

Joint entropy is the simplest of these: if X and Y are random variables with joint distribution X × Y, then the joint entropy H(X,Y) is defined to be the ordinary entropy of the joint distribution:

H(X,Y) = H(X \times Y)

So there is nothing mysterious going on here. If H(X) quantifies the amount of randomness in X, then H(X,Y) just quantifies the amount of randomness in X and Y together.

How much randomness can there be in X and Y together? Obviously, it should be at least as much as is contained in X alone or in Y alone:

H(X,Y) \geq H(X), \quad H(X,Y) \geq H(Y)

There are situations when these are actually equalities: for example when X = Y. On the other hand, X and Y together can contain at most as much randomness as X contains plus what Y contains:

H(X,Y) \leq H(X) + H(Y)

This holds with equality if and only if X and Y are probabilistically independent: this is the additivity of entropy.

Conditional entropy and mutual information can now be defined as derived quantities. They are precisely those which the above inequalities guarantee to be non-negative. So, conditional entropy is

H(X|Y) = H(X,Y) - H(Y)

It quantifies something like this: given that we know the value of Y, how much additional randomness do we expect to be contained in X? Mutual information is the quantity

I(X:Y) = H(X) + H(Y) - H(X,Y)

It quantifies the amount of knowledge one has about X when one knows the value of Y, and vice versa. It vanishes if and only if X and Y are probabilistically independent.
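For concreteness, here is a small numerical sketch of these definitions and inequalities, for a pair of correlated binary random variables (all names below are my own):

```python
import numpy as np

def H(p):
    """Shannon entropy (in bits) of a probability array, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

# A joint distribution of two correlated binary random variables X and Y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
pX = joint.sum(axis=1)          # marginal distribution of X
pY = joint.sum(axis=0)          # marginal distribution of Y

H_XY = H(joint)                 # joint entropy H(X,Y)
H_X_given_Y = H_XY - H(pY)      # conditional entropy H(X|Y)
I_XY = H(pX) + H(pY) - H_XY     # mutual information I(X:Y)

# The inequalities discussed above:
assert H_XY >= H(pX) and H_XY >= H(pY)   # joint entropy dominates each marginal
assert H_XY <= H(pX) + H(pY)             # subadditivity
assert H_X_given_Y >= 0 and I_XY >= 0    # the derived quantities are non-negative
```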

Since these are just simple combinations of (joint) entropies, inventing additional terms like ‘conditional entropy’ and ‘mutual information’ is redundant. It helps in getting more succinct notation and in confusing people by throwing around many different terms which refer to the same things.

Besides the two inequalities above, which are the non-negativity of conditional entropy and mutual information, there is another one which states the non-negativity of conditional mutual information, which is a quantity that unifies and extends the above concepts. I have found categorical formulations of all these inequalities here (and erroneously used the term “relative entropy” where I should have said “conditional entropy”). There are many other useful inequalities which can be derived from these, e.g. the data processing inequality.

Re: Category-Theoretic Characterizations of Entropy

Re: Category-Theoretic Characterizations of Entropy

One thing to be careful about is that the notation “H(X,Y)” is commonly used for both the cross entropy and the joint entropy. David has taken it to mean cross entropy, while in my contributions it always stands for joint entropy.

Re: Category-Theoretic Characterizations of Entropy

As David said, H(X,Y) − H(X) is called conditional entropy. The relative entropy is something else, and something very important in its own right: I explained it in part 6 of my series on information geometry. I was trying not to use much math there, but secretly here’s how I think about it.

In general, there’s no way to define the entropy of a single probability measure on a measure space X. Entropy is defined for a pair of probability measures μ, ν such that μ is absolutely continuous with respect to ν:

S(\mu, \nu) = \int_X \ln\left(\frac{d\mu}{d\nu}\right) \, d\mu

where the fraction is a Radon-Nikodym derivative. This is the entropy of μ relative to ν.

In the special case where X is a finite set, and only in that case, there exists a canonical probability measure ν on X, namely normalized counting measure. So in that case and only in that case can we talk about the entropy of a single probability measure μ, by which we really mean its entropy relative to the canonical probability measure (plus a constant depending on the cardinality of X).
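Here is a quick numerical illustration of that last remark: on an n-element set, the Shannon entropy of μ is log n minus its entropy relative to normalized counting measure, so the “constant depending on the cardinality” is log n. (The helper names are mine.)

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (in nats), ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def relative_entropy(p, q):
    """S(p, q) = sum_i p_i log(p_i / q_i), assuming q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
n = len(p)
uniform = np.full(n, 1 / n)   # normalized counting measure on an n-element set

# entropy relative to normalized counting measure, up to the constant log n,
# recovers the 'absolute' Shannon entropy:
assert np.isclose(shannon_entropy(p), np.log(n) - relative_entropy(p, uniform))
```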

The ‘relativity’ of relative entropy has an important philosophical significance, which I explained much more carefully over on part 6. Namely: there is no good absolute notion of the information contained in a piece of data! It only makes sense to ask how much information some data contains relative to the prior beliefs of the person who obtains that data!

Since we’ve mainly been talking about probability measures on finite sets in this discussion, we are not yet really grappling with the ‘relativity’ of entropy. That’s okay for a while, but someday someone will go further and think about relative entropy as a functor. And maybe it’ll be us. After all, Rényi’s own characterization of Rényi entropy is very clearly a characterization of relative Rényi entropy. He doesn’t provide any really interesting characterization of ‘absolute’ Rényi entropy.

Re: Category-Theoretic Characterizations of Entropy

I’ve been meaning to take the time to try to absorb what you guys are doing here, but I’ve been busy at a conference and you’re making it hard to catch up. But now you’ve raised a couple details I’ve thought about lately (and discussed a little with Tom in email), so I have a spot to grab hold.

The first (trivial) point is that of course S(μ,ν) makes perfect sense even if ν isn’t a probability measure. As you point out, if X is a finite set and you normalize ν you get an additive fudge factor, but letting ν be non-normalized counting measure you get Shannon entropy exactly. Similarly you get differential entropy on ℝ, or even ℝⁿ if ν is Lebesgue measure, which of course can’t be normalized. So if you want to think about the right version of your story for relative entropy, it seems unnatural to me to restrict attention to the case when ν is a probability measure.

The more interesting point is specifically to do with this sentence:

In the special case where X is a finite set, and only in that case, there exists a canonical probability measure ν on X, namely normalized counting measure.

Namely, in what relevant precise sense is counting measure—or normalized counting measure—canonical? (I can think of many canonical descriptions of normalized counting measure; my point is that it’s not clear to me what bearing any of them has on this circle of ideas.)

I guess to summarize, I’m trying to understand what mathematical justifications there are for not doing everything on a more-or-less arbitrary measure space (X, ν) in the first place.

Re: Category-Theoretic Characterizations of Entropy

So if you want to think about the right version of your story for relative entropy, it seems unnatural to me to restrict attention to the case when ν is a probability measure.

[…]

I guess to summarize, I’m trying to understand what mathematical justifications there are for not doing everything on a more-or-less arbitrary measure space (X, ν) in the first place.

I guess I’m approaching this whole project not as a topic in pure mathematics but from the viewpoint of probability theory—and its twin offspring: statistics, and statistical mechanics.

From a mathematical viewpoint, S(μ,ν) certainly makes sense whenever μ and ν are nonnegative measures, μ is absolutely continuous with respect to ν, and the integral converges. And it’s even a concept worth thinking about!

But I’m mostly thinking about probability theory, which offers a nice conceptual interpretation of what’s going on. A probability measure describes your beliefs about some situation, and relative entropy S(μ,ν) is the ‘information gain’ as you update your prior belief ν to a new one, μ. This is the reason I’m interested in this stuff.

In this context it’s hard to know what it means to, say, multiply a probability measure by 2. Do you believe everything is twice as likely to happen as before?

However, even in probability theory people sometimes consider ‘improper priors’, meaning nonnegative measures with infinite total mass, like Lebesgue measure on the real line. If someone claims a particle is ‘randomly located on the real line’, we might try to describe that using the measure dx, which can’t be normalized to a probability measure. This causes certain problems (divergent integrals, and difficulty in deciding on a correct normalization), but it’s sometimes a course worth pursuing.

By ‘canonical’ I try to always mean ‘functorial with respect to isomorphisms’. I could explain what that means, but maybe you’d be happier if I say what it amounts to in this special case. It amounts to saying that normalized counting measure on a finite set S is invariant under every bijection f: S → S.

On the other hand, for a measurable space X with infinitely many points, there’s typically¹ no probability measure on X that’s invariant under all measurable maps f: X → X with measurable inverses.

This gives the theory of probability measures on finite sets a distinctive feel. The canonical nature of entropy in this special case is an example of what I mean; when we leave this realm we really need relative entropy.

¹ “Typically”? Well, as I try to prove this claim I seem to need some condition that ensures there are lots of measurable subsets of X. All the counterexamples seem absurdly pathological. The claim is true if, for example, X is a Polish space.

Re: Category-Theoretic Characterizations of Entropy

Mark wrote:

I’ve been meaning to take the time to try to absorb what you guys are doing here, but I’ve been busy at a conference and you’re making it hard to catch up.

We certainly are. Sorry. Personally, I’m not even on top of my own contributions, let alone anyone else’s. I guess we’ll stop generating new ideas after a while, and move into a more organizational phase.

One thing on my mental to-do list is to try out the idea you explained to me by email: that the key ingredient when defining entropy is an operation for turning (suitable) measures into functions. This operation is barely visible when we’re only looking at finite sets. Nevertheless, if the idea’s right then it should provide a guide for generalizing everything we do from finite sets to more general spaces.

Re: Category-Theoretic Characterizations of Entropy

Tom wrote:

One thing on my mental to-do list is to try out the idea you explained to me by email: that the key ingredient when defining entropy is an operation for turning (suitable) measures into functions. This operation is barely visible when we’re only looking at finite sets.

This remark, and what you’ve written about pairing measures with functions, is very close to things Tobias and I have been saying. So I think we’re all on similar wavelengths. But it might be worthwhile laying out the slightly different versions:

Tobias is studying relative Rényi entropy as a functor out of the coslice category 2/FinStoch. An object in here is a finite set with two probability distributions on it, say ν and μ.

(We’d better make sure that ν ≪ μ if we’re going to define any sort of relative entropy H(ν, μ). So, as is often the case, it seems that nowhere vanishing probability distributions on finite sets are more convenient than arbitrary probability distributions on finite sets.)

I keep running around emphasizing that for finite sets there’s a canonical probability measure, which allows us to turn nonnegative functions into probability measures, and lets us define the entropy of any other probability measure. And I keep warning everyone that this fact is somewhat deceptive, because it goes away when we get to infinite measurable spaces. And I keep arguing that this is a mathematical hint that fundamentally, relative entropy is what really matters. The real reason I believe this is that I’m a Bayesian, so I believe the ‘prior’ μ must never be taken for granted.

So: I’m very happy to study the category of finite sets equipped with nowhere vanishing probability distributions, because it’s so beautiful. But when we go to infinite sets, we really want to consider a different category, to get pairs of measures (or pairs consisting of a measure and a function) into the game. And it may be good to warm up already when working with finite sets.

(I’m perfectly happy to write a paper where we do a lot of things nicely for finite sets, and then let other people generalize the ideas to other measurable spaces. I’m trying to develop a surgical strike approach to math where I do something quickly and then move on. So, I don’t mind collaborating on just one paper even if the really good results show up later in some series of papers.)

Re: Category-Theoretic Characterizations of Entropy

John wrote:

By ‘canonical’ I try to always mean ‘functorial with respect to isomorphisms’. I could explain what that means, but maybe you’d be happier if I say what it amounts to in this special case. It amounts to saying that normalized counting measure on a finite set S is invariant under every bijection f: S → S.

I’m happy enough with ‘functorial with respect to isomorphisms’ if it also comes with a hint as to which categories to think about. (Fortunately, more than hints have been given in this discussion.) And on reflection, I’m much happier with that than with your spelling out the special case.

The reason for that is that once you point out that normalized counting measure on a finite set is the unique probability measure invariant under all bijections, I think “Yes, but up to a scalar multiple, non-normalized counting measure is the unique measure invariant under all bijections, even on countable sets. And non-normalized counting measure also has the property, which normalized counting measure lacks, of playing nicely with disjoint unions.” (With or without normalization, it plays nicely with Cartesian products.)

On the other hand, I can’t think of any category structure to put on countable measure spaces that makes the first statement say anything about functoriality, or for that matter makes the second statement say anything about sums, or even makes the product of measure spaces into a product in the category. (An even stronger statement is true: I don’t know any very nice category of measure—as opposed to probability—spaces at all. This is another point I’ve discussed briefly with Tom.) So telling me you want functoriality with respect to isomorphisms makes a much more compelling case than telling me you want invariance under bijections.

Re: Category-Theoretic Characterizations of Entropy

While we’re on the topic, I’ll share something I wrote to Mark a month or so ago. It’s a quick-and-dirty characterization of counting measure and its scalar multiples. Note that normalized counting measure—I mean, counting measure normalized to be a probability measure—is not a scalar multiple of counting measure, as the scale factor you’d need depends on the cardinality of the set.

Let M be the category whose objects are finite measure spaces (with all subsets measurable) and whose maps are the injections i: X → Y such that i induces a measure-isomorphism between X and iX.

Let I be the category of finite sets and injections.

There is a forgetful functor M → I. Little theorem: its sections are precisely the functors

F_t: I \to M \quad (t \in [0, \infty))

where F_t(X) equips a set X with counting measure scaled by a factor of t. (Proof: think about the one-element set.)

Re: Category-Theoretic Characterizations of Entropy

I guess I’m misusing the word ‘canonical’ as defined on the nLab: there it’s used to mean ‘natural with respect to isomorphisms’. But there should also be some name for ‘functorial with respect to isomorphisms’, because there are lots of things like that. So I guess for now I’ll use ‘canonical’.

For example, every countable set has a ‘canonical’ measure on it, namely counting measure. It’s ‘canonical’ in some hand-wavy sense of not requiring any arbitrary choices for its definition. This construction doesn’t extend to a functor F: C → D from the category of countable sets and functions to the category of measure spaces and measure-preserving maps. But it does extend to a functor F: C₀ → D₀, where the subscript indicates the ‘core’ of a category: that is, the subcategory with the same objects but only the isomorphisms as morphisms. So, the construction is ‘canonical’ in my sense of ‘being functorial under isomorphisms’. Note that any multiple of counting measure is also canonical.

Here’s a similar example. Let R be the category of Riemannian manifolds and isometries. There’s a standard way to make any Riemannian manifold into a measure space by giving it a kind of ‘Lebesgue measure’. (For oriented Riemannian manifolds people often talk about the ‘volume form’, but we don’t need an orientation to get a measure.) This construction doesn’t extend to a functor F: R → D from R to the category of measure spaces and measure-preserving maps. But it does extend to a functor F: R₀ → D₀.

By the way, I feel like apologizing for my slightly grumpy reaction to your suggestion that we extend a lot of the ideas being discussed here from probability measure spaces to measure spaces. At the time I couldn’t think of a really good reason for doing that, except that generalizing is always nice. But you’ll see that a key idea in my new post is to take any probability measure on a finite set and get from it (canonically!) a 1-parameter family of measures on that set. These can be normalized to give probability measures, and physicists always do that before computing their entropy… but I need to not normalize them to get certain things to work. I’m not looking at the entropy of non-probability measures (yet), but such measures are definitely sneaking into the story.

Re: Category-Theoretic Characterizations of Entropy

I guess I’m misusing the word ‘canonical’ as defined on the nLab: there it’s used to mean ‘natural with respect to isomorphisms’.

Another good reason not to use “canonical” in that sense. (-: If we said “core-natural” instead for “natural with respect to isomorphisms”, then we could analogously say “core-functorial” for “functorial with respect to isomorphisms”.

Re: Category-Theoretic Characterizations of Entropy

It would be nice if there were terms with a bit more of a warm fuzzy glow than ‘core-natural’ and ‘core-functorial’. I might suggest ‘canonical’ and ‘god-given’, but after the debacle concerning ‘evil’ I think I won’t keep pushing my religion on people.

Seriously: there’s actually an advantage of using terminology that’s clunky and unappealing here, which is that it’s less likely to be used in the vague hand-wavy way that ‘canonical’ and ‘natural’ already are. So, it’s more likely that if someone uses it, they really know what they mean.

Re: Category-Theoretic Characterizations of Entropy

Your two paragraphs seem to me to contradict each other. Are “core-natural” and “core-functorial” insufficiently clunky and unappealing? Or do you want something that is both clunky and unappealing and has a warm fuzzy glow? (-:

Re: Category-Theoretic Characterizations of Entropy

Mike wrote:

Your two paragraphs seem to me to contradict each other.

Yes, they do. On the one hand, there are advantages to the warm fuzzy glow of a term like ‘canonical’, because constructions with this property really are good, and it’s nice to have the term indicate that. On the other hand, terms with warm fuzzy glows tend to get used vaguely, so there’s an advantage to a horrendously clunky and unappealing term like ‘core-functorial’: nobody would ever say that unless they know what it means and mean exactly that.

(By the way, I do know that ‘warm fuzzy glow’ is a mixed metaphor. I suppose you could stick a lightbulb in a wool sock…)

Re: Category-Theoretic Characterizations of Entropy

Just to be clear, I never meant to suggest considering the entropy of an arbitrary measure, which I also couldn’t see a really good reason to do. I was just suggesting considering the relative entropy of a probability measure with respect to an arbitrary measure, which is a thoroughly classical thing.

Re: Category-Theoretic Characterizations of Entropy

Yes, thanks for reminding us about that stuff! Somehow I forgot to take this philosophy into account during the present project.

I have been thinking about how to regard relative entropy (i.e. the Kullback-Leibler divergence) and Rényi’s q-deformed versions of it as functors. It seems to fit into the picture rather nicely. Arguably more nicely than with the bare entropies themselves!

Let’s consider the category FinStoch this time. Objects are natural numbers and morphisms are stochastic matrices of the appropriate sizes. We can also call these Markov morphisms. So, in contrast to before, we can now allow probabilistic maps!

The protagonist enters: the coslice category 2/FinStoch. An object of this category is just a finite set together with two probability distributions on it. A morphism of this category is a Markov morphism which preserves these distributions. As a coslice category, 2/FinStoch is a discrete opfibration over FinStoch.

I conjecture that relative Rényi entropy I_q(P|Q) is a functor I_q: 2/FinStoch → ℝ̄₊, where again ℝ̄₊ = ℝ₊ ∪ {∞} is regarded as a poset.

This claim boils down to asserting that I_q is non-increasing under Markov morphisms. I gather that this is known, but haven’t yet been able to find a reference for it. (A special case is contained in Rényi’s paper.) The following proof attempt is not yet complete:

(d) splitting one outcome with probability p into two new outcomes which have probability λp and (1 − λ)p, respectively.

Hence, it is enough to prove the assertion for each case separately.

(2) It is clear that I_q does not change under generators of type (a) or (b). For (d), this follows from a simple calculation.

(3) For (c), the corresponding inequality is less trivial to prove. I have verified it in the cases q = 1/2, 1 and 2, and found that if it holds for some q, then it also holds for q′ = 1 − q. So it remains to prove it in full generality. Also, I have disregarded degenerate cases like zero probabilities.
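A numerical spot-check of the conjectured monotonicity is easy to set up. This is a sketch with my own helper names, using the standard formula for Rényi divergence of order q ≠ 1; passing random trials is of course evidence, not a proof:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence of order alpha (alpha != 1) between strictly
    positive probability vectors p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

rng = np.random.default_rng(0)
for _ in range(100):
    p = rng.dirichlet(np.ones(4))          # random distribution P on 4 points
    q = rng.dirichlet(np.ones(4))          # random distribution Q on 4 points
    T = rng.dirichlet(np.ones(3), size=4)  # random 4x3 row-stochastic matrix
    for alpha in (0.5, 2.0):
        # monotonicity under Markov morphisms: the divergence can only shrink
        assert renyi_divergence(p @ T, q @ T, alpha) \
               <= renyi_divergence(p, q, alpha) + 1e-9
```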

In fact, this monotonicity property seems very reminiscent of David’s remark on the Fisher-Rao metric. What is the precise relation between the Fisher-Rao metric and the Kullback-Leibler divergence? Is the former some sort of infinitesimal version of the latter?

(This might be explained in John’s information geometry posts, but I have to admit to not having followed them all.)

Re: Category-Theoretic Characterizations of Entropy

What is the precise relation between the Fisher-Rao metric and the Kullback-Leibler divergence? Is the former some sort of infinitesimal version of the latter?

Yes, but this is true of all members of a one-parameter family of divergences, linked to a corresponding family of affine connections. Information geometry studies all this:

The mathematician Cencov (Chentsov) proved in the 1960s and 1970s that on the space of probability distributions on a sample space containing at least three points,

There exists a unique intrinsic metric. It is the Fisher information metric.

There exists a unique one-parameter family of affine connections. It is the family of α-affine connections later popularized by Amari.

Both of these uniqueness results hold, of course, up to multiplication by a constant.

Amari and Nagaoka’s study in the 1980s brought all these results together, with the introduction of the concept of dual-affine connections, and the interplay among metric, affine connection and divergence. In particular,

Given a Riemannian metric g and a family of dual affine connections Γ_α, there exists a unique set of dual divergences D_α defined by them.

Given the family of dual divergences D_α, the metric and affine connections can be uniquely determined by second order and third order differentiations.

Something I always meant to work out was which, if either, of the Rényi or Tsallis divergences forms Amari’s family. Favouring looking up over hard work, equation (3.78) in Amari’s book settles it in Tsallis’s favour (reparameterising the α and taking the p and q as normalised).

I’m a total novice at this, but the conclusion I tentatively draw is that relative entropy is extremely closely related to the entropy of a composite, for finite sets at least. Would you agree?

The minus sign on the left-hand side of the first equation is a bit puzzling. Maybe it’s a hint that −Hα(P‖Q)-H_\alpha(P \Vert Q) is a more basic quantity than Hα(P‖Q)H_\alpha(P \Vert Q), even though it’s negative. Anyway, let’s try the same thing for diversity, a.k.a. Rényi extropy. This is related to Tsallis entropy by

Re: Category-Theoretic Characterizations of Entropy

Divergence from P to Q = cross entropy of P and Q − entropy of P.

Cross entropy is minimized when P=QP = Q. If the true distribution is PP, encoding outcomes with a code based on PP is most efficient, or one is least surprised with a world distributed according to PP, if one believes it to be distributed according to PP.

Re: Category-Theoretic Characterizations of Entropy

What is the precise relation between the Fisher-Rao metric and the Kullback-Leibler divergence? Is the former some sort of infinitesimal version of the latter?

As you feared, precisely this question was the topic of Part 7 of my information geometry posts.

First of all, you should say ‘relative entropy’, not ‘Pullback-Liebling divergence’ or something like that. Why? Because the latter term is quite uninformative and esoteric-sounding, while the former makes it absolutely clear what’s going on: entropy only makes sense relative to some prior assumptions about what’s likely to happen.

Even someone who knows nothing about the subject will find the phrase “relative entropy” vaguely interesting (relativity is cool, so is entropy)… but not “Kullback-Leibler divergence”. It’s too important a concept to be burdened with a name like that.

Okay, now that that’s straightened out:

Briefly, the Fisher-Rao metric is the Hessian of the relative entropy.

More precisely, if M is the manifold consisting of all probability distributions on some measure space, and S(x,y) is the entropy of x \in M relative to y \in M, then the Fisher–Rao metric is the Hessian of S in its first argument along the diagonal:

g_{i j}(y) = \frac{\partial^2}{\partial x^i \partial x^j} S(x, y) \Big|_{x = y}.
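As a numerical sanity check of this Hessian relation, here's a minimal sketch using the Bernoulli family on a two-point sample space (the helper names are mine, not from the thread):

```python
import math

def kl(p, q):
    """Relative entropy S(p, q) of Bernoulli(p) relative to Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def hessian_on_diagonal(q, h=1e-4):
    """Second derivative of kl(p, q) in its first argument at p = q, by central differences."""
    return (kl(q + h, q) - 2 * kl(q, q) + kl(q - h, q)) / h**2

q = 0.3
fisher = 1 / (q * (1 - q))  # Fisher information of the Bernoulli family at q
assert abs(hessian_on_diagonal(q) - fisher) < 1e-2
```

The numerical Hessian of the relative entropy matches the Fisher information 1/(q(1−q)) of the Bernoulli family, as the Hessian relation predicts.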

Here p(q)\mathbf{p}^{(q)} is the escort distribution (though for this argument it doesn’t matter what that is), and M1−q(w,x)M_{1 - q}(\mathbf{w}, \mathbf{x}) denotes the power mean of order 1−q1 - q of the numbers x1,…,xnx_1, \ldots, x_n, weighted by w1,…,wnw_1, \ldots, w_n. But diversities are always ≥1\geq 1, and a mean of numbers ≥1\geq 1 is ≥1\geq 1. So this is greater than or equal to Dq(p)⋅1D_q(\mathbf{p}) \cdot 1, as required.

Re: Category-Theoretic Characterizations of Entropy

Tobias had a technical question here. I think he’s trying to show that the Rényi entropy IqI_q obeys the inequality [..]

Thanks for getting back to this. I’m glad to see that someone comments on my attempts ;) Apparently I haven’t explained myself well, since my incomplete proof is about something else…

I was talking about the relative Rényi entropy Iq(P|Q)I_q(P|Q) which takes two distributions as arguments, and I would like to show the following inequality: from P=(p1,…,pn)P=(p_1,\ldots,p_n) we construct

P' = (p_1 + p_2, p_3, \ldots, p_n)

and use the same formula to define Q′Q' in terms of QQ. I wonder whether the relative Rényi entropy satisfies

I_q(P'|Q') \leq I_q(P|Q) \quad ?

I can prove this for particular values of q, but so far not in general. If it's true, and if one can take care of zero probabilities and that kind of thing, then that would turn relative Rényi entropy into a nice functor
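The conjectured inequality can at least be probed numerically. Here's a quick sketch (helper names mine), using the usual formula I_q(P|R) = \frac{1}{q-1}\log\sum_i p_i^q r_i^{1-q} and merging the first two outcomes:

```python
import math
import random

def renyi_div(p, r, q):
    """Relative Rényi entropy I_q(P|R) = log(sum_i p_i^q r_i^(1-q)) / (q - 1)."""
    if q == 1.0:  # limiting case: ordinary relative entropy
        return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r))
    return math.log(sum(pi**q * ri**(1 - q) for pi, ri in zip(p, r))) / (q - 1)

def merge_first_two(p):
    """P' = (p_1 + p_2, p_3, ..., p_n)."""
    return [p[0] + p[1]] + list(p[2:])

random.seed(0)
for _ in range(200):
    p = [random.random() + 0.01 for _ in range(4)]
    r = [random.random() + 0.01 for _ in range(4)]
    sp, sr = sum(p), sum(r)
    p = [x / sp for x in p]
    r = [x / sr for x in r]
    for q in (0.5, 1.0, 2.0, 5.0):
        assert renyi_div(merge_first_two(p), merge_first_two(r), q) <= renyi_div(p, r, q) + 1e-10
```

No counterexample turns up for these values of q ≥ 0, consistent with the inequality holding in that range.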

Re: Category-Theoretic Characterizations of Entropy

Yes, I have also noticed that it's often hard to keep track of things. So probably I should provide more detail and background information in my comments…

Anyway, yes, I am precisely talking about this definition of relative entropy.

Then I might as well write down the inequality that I am trying to prove. I will now also write rir_i for the entries of the second distribution in order to avoid a clash of notation with ‘qq’ as the parameter of relative Rényi entropy.

Re: Category-Theoretic Characterizations of Entropy

I’ll use the power mean notation that I mentioned here.
Probably the most important theorem about power means is that MtM_t is (non-strictly) increasing in t∈[−∞,∞]t \in [-\infty, \infty]. (A special case of this that most people know is M0≤M1M_0 \leq M_1, the arithmetic-geometric mean inequality.)

If q≥1q \geq 1 then raising to the power of q−1q - 1 is an increasing function and −1≤q−1-1 \leq q - 1, so the inequality holds. If q≤0q \leq 0 then raising to the power of q−1q - 1 is a decreasing function and −1≥q−1-1 \geq q - 1, so again the inequality holds. If 0≤q≤10 \leq q \leq 1 then raising to the power of q−1q - 1 is a decreasing function and −1≤q−1-1 \leq q - 1, so the opposite inequality holds.

(I’m kind of ignoring the cases q=1,±∞q = 1, \pm \infty, but by continuity that all works out.)
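The monotonicity theorem for power means is easy to check numerically. A small sketch (helper names mine), with the weighted power mean M_t(w, x) = (\sum_i w_i x_i^t)^{1/t} and its t = 0 limit, the weighted geometric mean:

```python
import math

def power_mean(t, w, x):
    """Weighted power mean M_t of positive numbers x with weights w summing to 1."""
    if t == 0:  # limiting case: weighted geometric mean
        return math.exp(sum(wi * math.log(xi) for wi, xi in zip(w, x)))
    return sum(wi * xi**t for wi, xi in zip(w, x)) ** (1 / t)

w = [0.2, 0.5, 0.3]
x = [1.0, 4.0, 9.0]
means = [power_mean(t, w, x) for t in (-3, -1, 0, 1, 2, 5)]
assert all(a <= b + 1e-12 for a, b in zip(means, means[1:]))  # M_t is increasing in t
# the special case M_0 <= M_1 is the arithmetic-geometric mean inequality:
assert power_mean(0, w, x) <= power_mean(1, w, x)
```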

Re: Category-Theoretic Characterizations of Entropy

Cool! Since the q−1q-1 factor in the definition of relative Rényi entropy reverses the inequalities again when q<1q\lt 1, we get that the conjectured inequality holds for q≥0q\geq 0.
I will write this down tomorrow as a new section in our notes.

but I was rather hoping that you’d made a slip in the reduction, multiplying through by (q−1)(q - 1) on each side and forgetting to reverse the inequality when q<1q \lt 1. If I understand correctly, you did make some such slip, so that by my argument,

I_q(P'|Q') \leq I_q(P|Q)

for all q∈[0,∞]q \in [0, \infty]. And presumably the opposite inequality holds for all q∈[−∞,0]q \in [-\infty, 0]. Have I got that right?

Re: Category-Theoretic Characterizations of Entropy

In case you want to avoid operads, here’s an operad-less version of the operadic characterization of Shannon entropy.

The price you pay is having to work with monoidal 2-categories. I’ll skip some details. First I’ll state the result as quickly as possible. Then I’ll explain it a bit, including what it has to do with the operadic version.

Notation: I will regard a probability distribution on a finite set EE as a family p=(pe)e∈Ep = (p_e)_{e \in E} of nonnegative real numbers summing to 1.

Let MΔM_\Delta be the symmetric monoidal category in which:

an object is a finite set

a map A→BA \to B is a map f:A→Bf: A \to B of finite sets together with a probability distribution on each fibre f−1(b)f^{-1}(b) (b∈Bb \in B). I will write such a map as a pair (f,p)(f, p), where p(b)=(p(b)a)a∈f−1(b)p(b) = (p(b)_a)_{a \in f^{-1}(b)} is the probability distribution on f−1(b)f^{-1}(b).

you should be able to guess what composition is

tensor product is disjoint union.

Symmetric monoidal functors M_\Delta \to Cat are the same thing as algebras in Cat for the operad \Delta = (\Delta_n)_{n \in \mathbb{N}}; but I'm not allowed to mention operads, so I'll just describe the two such symmetric monoidal functors that I need. (By "monoidal" I mean weak/pseudo/strong monoidal.)

The first one is constant at the terminal category 11.

The second one, R:MΔ→CatR: M_\Delta \to Cat, is defined as follows.

On objects, R(A)R(A) is the monoid (ℝA,+,0)(\mathbb{R}^A, +, 0) as a one-object category.

Now, CatCat is actually a monoidal 2-category, and we can regard MΔM_\Delta as a monoidal 2-category with only identity 2-morphisms. We may therefore speak of monoidal lax transformations between monoidal functors MΔ→CatM_\Delta \to Cat. Note the order of the words “monoidal” and “lax”: the monoidal structure is preserved up to iso, but the 2-category structure is only preserved laxly. (I’m dropping the word “symmetric” just to make things less cluttered, but really it’s there.)

In particular, we can consider monoidal lax transformations from the constant functor 11 to RR.

Note: Not quite. Faddeev's theorem has a continuity condition, so I think I need to be working with (2-)categories internal to, or enriched in, Top. I haven't yet figured out the correct thing to say, but I'm confident it can be done.

Now I’ll explain how this relates to the operadic characterization, which will also indicate why it’s true.

For any algebraic theory TT, there’s a free finite product category LTL_T containing a model of TT, namely, the Lawvere theory of TT. Its objects are the finite sets (or natural numbers), and a map A→BA \to B consists of an operation of arity |A||A| for each b∈Bb \in B.

Similarly, for any symmetric operad PP, there’s a free symmetric monoidal category MPM_P containing an algebra for PP. Its objects are the finite sets (or natural numbers), and a map A→BA \to B consists of a map f:A→Bf: A \to B of finite sets together with an operation of arity |f−1(b)||f^{-1}(b)| for each b∈Bb \in B. I’d like to call this the symmetric monoidal theory of PP.

In this case we start with the symmetric operad Δ\Delta whose nn-ary component Δn\Delta_n is the (n−1)(n - 1)-simplex. The resulting category MΔM_\Delta is the one described above. Its universal property is that for any symmetric monoidal category EE, the symmetric monoidal functors MΔ→EM_\Delta \to E are the Δ\Delta-algebras in EE.

We take E=CatE = Cat. The two algebras we’re interested in are the terminal one, and the monoid (ℝ,+,0)(\mathbb{R}, +, 0) with the obvious action by Δ\Delta. (It’s a convex space.) The monoidal functor MΔ→CatM_\Delta \to Cat corresponding to the algebra ℝ\mathbb{R} is the RR described above.

Lax maps between Δ\Delta-algebras in CatCat then correspond to monoidal lax transformations between monoidal functors MΔ→CatM_\Delta \to Cat. So the theorem just stated is equivalent to my original statement: the lax maps 1→(ℝ,+,0)1 \to (\mathbb{R}, +, 0) are the scalar multiples of Shannon entropy.

One more thing. We’ve encountered at least four categories so far:

convex algebras/spaces

finite sets and stochastic maps

finite probability spaces

MΔM_\Delta

I claim that all four come about, almost, in an automatic way from the Giry monad. The first is its category of algebras. The second is its Lawvere theory. The fourth is constructed in two steps: first you take the underlying symmetric operad of the Giry monad, and then you take the symmetric monoidal theory of that operad.

But what about the third, finite probability spaces? Well, it's almost M_\Delta/1, the slice of M_\Delta over its generating object 1. The "almost" has to do with zero probabilities. I'll explain the near-equivalence FinProb \approx M_\Delta/1 by pretending that zero probabilities don't exist.

So: an object of MΔ/1M_\Delta/1 is a finite set AA with a map (f,p):A→1(f, p): A \to 1. Here ff is the unique map of sets A→1A \to 1, and pp provides a probability distribution on each fibre. In other words, an object of MΔ/1M_\Delta/1 is a finite probability space (A,p)(A, p).

A map (A,p)→(B,r)(A, p) \to (B, r) in MΔ/1M_\Delta/1 is a pair (h,u)(h, u) where h:A→Bh: A \to B is a map of sets and uu provides a probability distribution on each fibre h−1(b)h^{-1}(b). And for it to be a map in the slice category, a certain triangle has to commute, and that turns out to be equivalent to the equation
p_a = r_{h(a)} \, u(h(a))_a
for all a∈Aa \in A. Dividing gives u(h(a))au(h(a))_a in terms of pp and rr. With a bit more calculation, you see that uu is uniquely determined by pp and rr, in such a way that the pair (h,u)(h, u) amounts to just a measure-preserving map hh. Thus, it’s the same as a map in FinProbFinProb — as long as you ignore those zero probabilities.

I see that there have been lots of comments about slice categories so far, and I haven’t read them all, so maybe someone’s already reached this conclusion.

It might be possible to take advantage of the forgetful functor
FinProb \approx M_\Delta/1 \to M_\Delta
in order to characterize entropy in terms of transformations between functors FinProb→CatFinProb \to Cat. A snag is that FinProbFinProb and MΔ/1M_\Delta/1 aren’t monoidal categories. But perhaps there’s a good way forward.

Re: Category-Theoretic Characterizations of Entropy

This looks like great stuff, Tom! It will take me a while to digest it. I just thought I should respond to this:

In case you want to avoid operads…

I’m not trying to avoid operads, though my thrashing around may look like that. What I’m trying to do is take your characterizations of Shannon entropy and re-express them in a way that looks something like this:

Theorem: The only maps

F: A \to X

with certain properties are the multiples of Shannon entropy.

Here AA should be some gadget related to probability measures or stochastic maps or some other nice thing that could have an entropy. XX should be a gadget of the same mathematical type, but related to numbers: namely, things that are entropies.

I don’t really care what type of gadget AA and XX are: categories, monoidal 2-categories, algebras for some operad or monad or 2-monad, possibly internal to Top, etc. FF should be a map between gadgets of this type, possibly lax.

I want a theorem like this because I strongly feel that entropy, like cardinality, is a map that takes an object and distills it into a number saying how ‘big’ it is (how much information it has). Certain things we can do with these objects match up to certain things we can do with these numbers, and that’s why entropy is useful.

I think your ‘lax point’ characterization of Shannon entropy is secretly doing what I want here… but only secretly. Your lax points look like maps

G: 1 \to X

where 11 is something very dull and XX looks like ‘numbers’. Somehow the probability measures are hidden in the laxness and the fact that 11 and XX are being treated as PP-algebras.

The idea that a lax map of PP-algebras F:1→XF : 1 \to X looks like a PP-algebra internal to XX is not the part I’m having trouble with. Jim and I used this idea to try to formalize the ‘microcosm principle’ in Definition 57 of HDA2. We said something roughly like this:

Definition. Let OO be an operad, let AA be an (n+1)(n+1)-categorical OO-algebra, and let 11 be the terminal (n+1)(n+1)-categorical OO-algebra. Then we define an nn-categorical OO-algebra
object in AA to be a morphism of OO-opetopic sets a:τ→Aa: \tau \to A. If n=0n = 0, we call this simply an OO-algebra object in AA.

But somehow I’m being very slow about taking this lax map formulation and translating it into one that I think people will find more intuitive. No matter how fancy the machinery, I think people will appreciate clearly seeing a map sending things that have entropy to their entropies.

Re: Category-Theoretic Characterizations of Entropy

I too am struggling to convert Tom’s account into what I know. f:A→Bf: A \to B in MΔM_{\Delta} is a stochastic function (aka conditional probability distribution) in reverse. So I’m happier with (MΔ)op(M_{\Delta})^{op}, but that shouldn’t be too big an obstacle. I think I just need to see an answer to the following:

What needs to be specified to give the monoidal lax transformation, α\alpha, corresponding to Shannon entropy?

Presumably, for each finite set AA, it will assign a functor, α(A)\alpha(A), from the terminal category to the category R(A)R(A). Each only has a single object. Doesn’t the arrow in the terminal category have to be sent to the constant zero function from AA to ℝ\mathbb{R}? What else?

Re: Category-Theoretic Characterizations of Entropy

David wrote:

What needs to be specified to give the monoidal lax transformation, α\alpha,
corresponding to Shannon entropy?

I'll let Tom answer that… I understand his operadic approach better than his new approach. But I won't explain the operadic approach, either: I'll just say what crucial property of Shannon entropy seems to be captured by the 'laxness' in that approach. This property has the so-called magical property in Faddeev's theorem as a special case, so it indeed seems crucial if we're trying to uniquely characterize Shannon entropy in some nice way.

Suppose PP is a probability distribution on the set {1,…,n}\{1, \dots, n\} and PiP_i are probability distributions on some finite sets SiS_i, where 1≤i≤n1 \le i \le n. Then there’s a way to form a probability distribution on the set S1+⋯+SnS_1 + \cdots + S_n. I’ll call this probability distribution P∘(P1,…,Pn)P \circ (P_1, \dots, P_n). Here’s how it works.

Say you want to pick a random element of S1+⋯+SnS_1 + \cdots + S_n. First you choose a number from 11 to nn using the probability distribution PP. Then, if you picked the number ii, you choose an element of SiS_i using the probability distribution PiP_i.

The crucial property is then

S(P \circ (P_1, \dots, P_n)) = S(P) + P(S(P_1), \dots, S(P_n)).

The only mysterious thing here should be the expression P(S(P_1), \dots, S(P_n)). Here I'm using the fact that a probability measure P on the set \{1, \dots, n\} lets us form a weighted average of n numbers: I'm using P(S(P_1), \dots, S(P_n)) to mean the weighted average \sum_i p_i S(P_i) of the numbers S(P_i).
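A quick numerical check of this crucial property, namely S(P \circ (P_1, \dots, P_n)) = S(P) + \sum_i p_i S(P_i), as a short sketch (helper names mine):

```python
import math

def shannon(p):
    """Shannon entropy S(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def compose(p, blocks):
    """Operadic composite P o (P_1, ..., P_n): a distribution on S_1 + ... + S_n."""
    return [pi * x for pi, block in zip(p, blocks) for x in block]

p = [0.5, 0.3, 0.2]
blocks = [[0.5, 0.5], [1.0], [0.1, 0.9]]
lhs = shannon(compose(p, blocks))
rhs = shannon(p) + sum(pi * shannon(b) for pi, b in zip(p, blocks))
assert abs(lhs - rhs) < 1e-12  # the grouping law holds for Shannon entropy
```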

Tom saw that this crucial property would fall out naturally from having a certain lax map of operad algebras!

On the one hand we can use PP to glom together nn probability measure spaces. On the other, we can use it to glom together nn numbers. But we don’t have

Re: Category-Theoretic Characterizations of Entropy

Right. But you’ll notice in this explanation that you only apply entropy SS to distributions, arrows into 11, or out of 11 if you prefer the opposite category. That’s why I wanted to see what that α(A)\alpha(A) does, to know how it fits with Tom’s induced functor ℝA→ℝB\mathbb{R}^A \to \mathbb{R}^B, as I rather suspect that he isn’t according an entropy to an arrow f:A→Bf: A \to B, for non-singleton BB.

Re: Category-Theoretic Characterizations of Entropy

f:A→Bf: A \to B in MΔM_\Delta is a stochastic function (aka conditional probability distribution) in reverse.

Really? You’re saying that a map from AA to BB in MΔM_\Delta is a stochastic map from BB to AA, i.e. a function from BB to the set P(A)P(A) of probability distributions on AA? I don’t think it is.

The definition is that a map A→BA \to B in MΔM_\Delta is a pair (f,p)(f, p) where f:A→Bf: A \to B is a map of sets and p=(p(b))b∈Bp = (p(b))_{b \in B} is a family consisting of a probability distribution p(b)p(b) on f−1(b)f^{-1}(b) for each b∈Bb \in B.

Example: suppose that AA and BB each have cardinality 2. Then there are two maps A→BA \to B in MΔM_\Delta (both isomorphisms), but there are infinitely many stochastic maps from BB to AA.

What needs to be specified to give the monoidal lax transformation, α\alpha, corresponding to Shannon entropy?

Right. We've got two functors 1, R: M_\Delta \to Cat. Recall that R assigns to a finite set A the additive monoid \mathbb{R}^A, seen as a one-object category.

We’re interested in monoidal lax transformations α:1→R\alpha: 1 \to R. The “monoidal” part is essentially a property rather than structure, so let’s forget about it for now and just think about lax transformations α:1→R\alpha: 1 \to R.

A strict transformation α:1→R\alpha: 1 \to R consists of

for each finite set A \in M_\Delta, a functor \alpha_A: 1 \to R(A)

such that for each map (f, p): A \to B in M_\Delta, the square formed by \alpha_A, \alpha_B, R(f, p) and the identity functor on 1 commutes.

But as you noticed, all of this is trivial: since R(A)R(A) has only one object, the functors αA\alpha_A are uniquely determined, and the commutativity of the square is automatic.

A lax transformation is the same, except that that square of functors isn’t required to commute: there is instead a natural transformation inside it. Thus, for each map (f,p):A→B(f, p): A \to B in MΔM_\Delta, we have a natural transformation
\begin{matrix}
1 &\stackrel{\alpha_A}{\to} &R(A)&\\
\downarrow&\Downarrow \alpha_{(f, p)} &\downarrow&R(f, p) \\
1 &\stackrel{\alpha_B}{\to} &R(B).&
\end{matrix}
A natural transformation of this type is simply an element of the monoid R(B)=ℝBR(B) = \mathbb{R}^B.
So, what needs to be specified is: for each map (f,p):A→B(f, p): A \to B in MΔM_\Delta, an element α(f,p)∈ℝB\alpha_{(f, p)} \in \mathbb{R}^B. In particular, taking BB to be a one-element set, this means specifying a real number for each probability distribution on a finite set.

Of course, the natural transformations α(f,p)\alpha_{(f, p)} are subject to some coherence axioms, and we also have to remember that α\alpha was supposed to be monoidal. The upshot is that α(f,p)∈ℝB\alpha_{(f, p)} \in \mathbb{R}^B must have bb-component
C \cdot H(p(b))
where HH is Shannon entropy and CC is a real constant independent of everything.

Re: Category-Theoretic Characterizations of Entropy

Really?

Oh, you’re right. You could say they’re special stochastic functions, where in the row stochastic matrix every column has only one nonzero entry (or interchange ‘row’ and ‘column’ looking at things your way). Grouping them according to fibres, the stochastic matrix would be a series of distributions, stepping down, indexed by b∈Bb \in B, where your AA is the disjoint union of SiS_i mentioned here.

I hadn’t really got the idea of lax points before, so thanks. Do they crop up interestingly in a range of places?

So, what needs to be specified is: for each map (f, p): A \to B in M_\Delta, an element \alpha_{(f, p)} \in \mathbb{R}^B.

Great, so one doesn’t even have an entropy per se for one of your special stochastic functions. It awaits a distribution on BB to be able to give a number, the conditional entropy.

Re: Category-Theoretic Characterizations of Entropy

the free finite product category containing an algebra for the Giry monad

with

the free symmetric monoidal category containing an algebra for the Giry operad.

(By the “Giry monad” I mean, probably improperly, the monad on SetSet whose algebras are convex spaces. By the “Giry operad” I mean its underlying symmetric operad.)

The first category is the Lawvere theory of the Giry monad, which is the category FinStochFinStoch. The second is what I called MΔM_\Delta. By the universal property, there’s a canonical functor MΔ→FinStochM_\Delta \to FinStoch. This has nothing to do with the particular monad we’re studying: it’s a completely general construction.

This functor MΔ→FinStochM_\Delta \to FinStoch is bijective on objects (which in both cases are natural numbers). I guess it’s faithful, which would justify your description of maps in MΔM_\Delta as “special stochastic functions”.

Let PP be an operad of sets, and let YY be a PP-algebra in CatCat. A lax point, or internal PP-algebra, is a lax map 1→Y1 \to Y of PP-algebras in CatCat.

Example Let PP be the terminal operad (non-symmetric, say). Then PP-algebras in a monoidal category are monoids; in particular, PP-algebras in CatCat are (strict) monoidal categories. An internal PP-algebra in a monoidal category YY is a lax monoidal functor 1→Y1 \to Y, which is the same thing as a monoid in YY. Since PP-algebras are the same thing as monoids, another way to phrase this is “an internal monoid in YY is an internal monoid in YY”. So it’s good terminology, in this case at least.

Example Let MM be a monoid. Let PP be the operad whose unary operations form the monoid MM, and with no non-unary operations. Then PP-algebras in a monoidal category are objects equipped with a left MM-action. In particular, PP-algebras in CatCat are “MM-categories”. An internal PP-algebra in an MM-category YY is an object y∈Yy \in Y equipped with a map

m \cdot y \to y

for each m∈Mm \in M, satisfying the evident action-like axioms. It’s reasonable to call this an “internal MM-object in YY”, so again the terminology is sensible.

I’d add the caveat, though, that the category we’re concerned with here is a one-object category (the additive monoid ℝ\mathbb{R}). So if we want to say something like “Shannon entropy and its multiples are the only internal convex algebras in ℝ\mathbb{R}”, we need to do so with the awareness that there’s a level shift.

Re: Category-Theoretic Characterizations of Entropy

Thanks. I guess I’d like to know whether there’s something deep about Shannon entropy being an internal algebra, or whether it’s a fluke. Could there be something systematic about it? Hence my query below as to what might happen with a slight shift of monad to subprobabilities or all measures.

Is there also a Rényi-Tsallis monad, where points of ℝ\mathbb{R} might be one of the two entropies?

Re: Category-Theoretic Characterizations of Entropy

Nice! I think I understand the ideas.

One question about your category MΔM_\Delta: are the morphisms automatically surjective maps? For a probability distribution to exist on a fiber, that fiber in particular has to be nonempty, right? So each fiber has to be nonempty, meaning that the function is surjective.

On the issue of zero probabilities: maybe one can deal with this by identifying two measure-preserving maps whenever they coincide almost surely? This defines a quotient category of FinProb in which two probability spaces are isomorphic whenever there exists a measure-preserving bijection between their supports.

Re: Category-Theoretic Characterizations of Entropy

Thanks. Yes, the morphisms in MΔM_\Delta are automatically surjective, because there’s no probability distribution on the empty set.

I’m not worried about the fact that MΔ/1M_\Delta/1 and FinProbFinProb are only nearly isomorphic (because of zero probabilities). It might be that FinProbFinProb isn’t quite the right category to be considering. The way I view the present evidence is that MΔ/1M_\Delta/1 has more of a right to be considered categorically natural than FinProbFinProb does, because the former arises canonically from the finitary Giry monad and the latter, as far as I know, doesn’t.

Re: Category-Theoretic Characterizations of Entropy

A little while ago, John, I sent you an email preemptively apologizing that I won't be able to put as much energy into this as I'd like to. It might look like that apology was misplaced, given that I've written quite a lot here, but actually what I've done has been pretty hurried. If it weren't for other things going on, I'd be taking a more reflective approach, trying to see the connections between the many ideas so far rather than just flinging out more and more of them. I'm hoping someone else will do the connecting up: maybe you, maybe Tobias, maybe someone else, maybe a future me.

Anyway, I agree with you that it would be nice to describe entropy as the unique map

(things that have entropy) →\to numbers

with certain properties. Of course Rényi and Faddeev have already done this. We're trying to make entropy look categorically inevitable, though, and it may be that the best categorical description is not of this form.

Re lax points or internal algebras, I should probably have mentioned a while ago that this has very strong echoes of the microcosm principle; I had realized this, but should have said so.

Re: Category-Theoretic Characterizations of Entropy

What kind of changes to this set-up might be expected to continue to yield something interesting? E.g., what are the lax points of ℝ\mathbb{R} for the subprobability monad? I see there’s a notion of entropy for subprobability measures

Re: Category-Theoretic Characterizations of Entropy

Thanks for the new stuff, Tobias! I’ll check it out! In my opinion, all three of us should feel free to systematize notation between the various parts of this ‘paper’, and otherwise improve it.

Speaking of notational matters: I’m a bit nervous about using Δn\Delta_n for the (n−1)(n-1)-simplex, since Δn\Delta^n is often used for the nn-simplex. But maybe that’s okay because they’re both fundamental in different ways.

I’m also slightly nervous about using Δ\Delta for the monad sending any set to the set of finitely supported probability measures on that set, since Δ\Delta is often used for the category of simplices (not with convex maps, but with maps sending vertices to vertices in an order-preserving way).
I’m a bit more nervous about calling this monad Δ\Delta the ‘Giry monad’, since people seem to use that phrase for the monad sending any Polish space XX to the Polish space of subprobability measures on XX. At least that’s what they do here and here.

So, for now, I tend to use P(X) for the space of probability measures on a space X, and call it P(n) when X = \{1, \dots, n\}. This P is a monad on Polish spaces… but I'll be quite happy if we restrict attention to probability measures on finite sets in this paper, in which case we could also use finitely supported probability measures if that's technically more convenient.

Wow, what a boring and annoying comment! And worst of all, it seems like I’m picking on Tom throughout. Anyway, I thought I’d mention these small things. I had some other more interesting things to say, but I’ll do that in another comment.

Re: Category-Theoretic Characterizations of Entropy

My attitude to these things is just to write macros. And as we’re some way off actually having a paper, it doesn’t bother me too much.

Anyway, I probably agree with you about most things.

On Δn\Delta_n for the (n−1)(n - 1)-simplex, well, the reason I chose this notation is of course that Δn\Delta^n is the nn-simplex. But yeah, then we’re more or less compelled to use Δ\Delta for the corresponding operad or monad or theory, which sucks. I think P\mathbf{P} is a good alternative.

I like to use PP for a generic operad, though I don’t mind breaking that habit. I know you’re not into TT for monad, John, but it really is standard (from Theory or Triple, I guess) and I don’t see a great reason to change it. Confusion with temperature might not be a real risk.

As for what the “Giry monad” is, I’m quite likely using it in a nonstandard way. Giry herself constructs the monad assigning to a measurable or Polish space the space of probability measures on it (I mean, she constructs two different monads). What I’ve been calling the Giry monad, I think Tobias calls the finitary Giry monad. I prefer descriptive names: perhaps the “convex monad”?

Mostly I’m not really bothered yet because we seem to be a long way off having a paper: I mean, we don’t even know what the main result is yet, do we? But that’s not to say that it’s not worth settling on conventions early.

Re: Category-Theoretic Characterizations of Entropy

I’m not too worried about this notational stuff either… it’s just good to talk about it a little, to see how strongly we feel about various things.

More importantly: why are you writing about power means on our nLab pages, Tom? I'm not complaining—it's great, but I feel you're way ahead of me and I'm dimly trying to comprehend what you're up to. But now I finally have a guess: you're going to show what happens if we replace the usual 'weighted arithmetic mean' by a power mean in this formula crucial to the characterization of Shannon entropy:

S(P \circ (P_1, \dots, P_n)) = S(P) + \sum_i p_i S(P_i).

But I guess I have another question besides my general question "am I reading your mind correctly?" Namely: does the Rényi entropy with parameter q really obey a modified version of the above equation, where P acts via a power mean rather than the arithmetic mean?

Re: Category-Theoretic Characterizations of Entropy

Unfortunately the above formula does not hold… Here comes a counterexample.

An interesting special case of Rényi entropy is the min-entropy H_\infty. Given an n-element probability distribution P = (p_1, \ldots, p_n), its min-entropy is

H_\infty(P) = \log \frac{1}{\max_i p_i}

(Side note: the corresponding convex algebra structure on ℝ+\mathbb{R}_+ in this case is the maximum operation, i.e. tropical addition!)

So let's use H_q for q = \infty and consider the distributions P_1 = (1/2, 1/2), P_2 = (1) and P = (1/2, 1/2). Then P \circ (P_1, P_2) = (1/4, 1/4, 1/2), so the left-hand side becomes \log 2. However, both terms on the right-hand side are also \log 2! So in this case, the conjectured equation does not hold.

Due to continuity in qq, this also provides a counterexample for any other large enough value of qq.
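The counterexample is easy to reproduce numerically; a minimal sketch (function name mine), which checks the left-hand side and the H_\infty(P) term but leaves the unspecified mean term aside:

```python
import math

def min_entropy(p):
    """H_infinity(P) = log(1 / max_i p_i)."""
    return math.log(1 / max(p))

P1, P2, P = [0.5, 0.5], [1.0], [0.5, 0.5]
composite = [P[0] * x for x in P1] + [P[1] * x for x in P2]
assert composite == [0.25, 0.25, 0.5]                     # P o (P1, P2)
assert abs(min_entropy(composite) - math.log(2)) < 1e-12  # left-hand side: log 2
assert abs(min_entropy(P) - math.log(2)) < 1e-12          # already log 2 on its own
# so adding a second log 2 term makes the right-hand side 2 log 2, not log 2
```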

Re: Category-Theoretic Characterizations of Entropy

To add to what Tobias wrote, here’s the general formula for the Rényi entropies of an operadic composite.

Notation: H_q is the Rényi entropy of order q \in [0, \infty]. Its exponential is D_q = \exp(H_q), the “Rényi extropy” of order q. Ecologists call D_q the Hill number of order q; the D stands for “diversity”.

We’re going to take the operadic composite P \circ (P_1, \ldots, P_n) where P = (p_1, \ldots, p_n). Since I’m more used to working with the extropies than the entropies, I’ll give the formula for those first:

Note that in all cases, the right-hand side is a function of P = (p_1, \ldots, p_n) and H_q(P_1), \ldots, H_q(P_n): you don’t need to know P_1, \ldots, P_n themselves. But only in the case q = 1 does the right-hand side involve a term H_q(P). (See comment below.)
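The displayed formula hasn’t survived in this copy of the comment, but the q ≠ 1 case is easy to test numerically. The composite rule in the sketch below, D_q(P ∘ (P_1, …, P_n)) = (∑_i p_i^q D_q(P_i)^{1−q})^{1/(1−q)}, is my reconstruction from the definition D_q = (∑_i p_i^q)^{1/(1−q)}, not a quotation:

```python
def hill(p, q):
    """Hill number / Rényi extropy: D_q(p) = (sum_i p_i^q)^(1/(1-q)), for q != 1."""
    return sum(x**q for x in p) ** (1 / (1 - q))

def compose(p, blocks):
    """Operadic composite P o (P_1, ..., P_n): weight each block P_i by p_i."""
    return [w * x for w, block in zip(p, blocks) for x in block]

P, P1, P2 = [0.3, 0.7], [0.5, 0.5], [0.2, 0.3, 0.5]
q = 2.0

direct = hill(compose(P, [P1, P2]), q)

# Reconstructed composite rule: depends only on P and the D_q(P_i).
via_block_extropies = sum(
    w**q * hill(b, q)**(1 - q) for w, b in zip(P, [P1, P2])
) ** (1 / (1 - q))

assert abs(direct - via_block_extropies) < 1e-12
```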

The definition of Tsallis entropy is T_q = \frac{1}{q - 1}\bigl(1 - \sum_i p_i^q\bigr). The relation with the Rényi entropy H_q is given by

T_q = \frac{1}{q - 1} \bigl(1 - e^{(1 - q)H_q}\bigr).
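A quick numerical check of this relation for q ≠ 1 (with the convention H_q = \frac{1}{1 - q} \log \sum_i p_i^q, the exponent works out to (1 − q)H_q):

```python
import math

def renyi(p, q):
    """Rényi entropy of order q (q != 1)."""
    return math.log(sum(x**q for x in p)) / (1 - q)

def tsallis(p, q):
    """Tsallis entropy of order q (q != 1)."""
    return (1 - sum(x**q for x in p)) / (q - 1)

p, q = [0.1, 0.2, 0.3, 0.4], 2.5
lhs = tsallis(p, q)
rhs = (1 - math.exp((1 - q) * renyi(p, q))) / (q - 1)
assert abs(lhs - rhs) < 1e-12   # the two expressions agree
```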

It seems to be commonly the case that some of the nice properties of the Shannon entropy are retained by the Rényi entropy and some by the Tsallis entropy, but that once you leave q = 1 behind you can’t have both at the same time.

Re: Category-Theoretic Characterizations of Entropy

I meant that there are pleasant features of the Shannon entropy found in Rényi entropy but not in Tsallis entropy, and vice versa.

For example, the Rényi entropy of two independent systems is the sum of their individual Rényi entropies. One can formulate a Tsallis entropy version of independence but it’s messier – there’s an adjustment term depending on q and the product of the entropies.

Similarly, Tsallis entropy comes into its own when you’re looking to express the entropy of a joint system by decomposing it, while the Rényi entropy formulation is much messier ((48) of this).

Re: Category-Theoretic Characterizations of Entropy

I can’t understand why Beck says this. Rényi entropy, Rényi extropy and Tsallis entropy are all related to each other by invertible, order-preserving transformations. In that sense they’re all “effectively” the same. But why does he believe that Rényi extropy and Tsallis entropy are particularly similar? Rényi extropy is,

\Bigl( \sum p_i^q \Bigr)^{1/(1 - q)}

while Tsallis entropy is

\frac{1}{q - 1} \Bigl( 1 - \sum p_i^q \Bigr).

I wouldn’t call them any more similar to each other than to Rényi entropy,

\frac{1}{1 - q} \log \sum p_i^q.

Re: Category-Theoretic Characterizations of Entropy

I want to explain something about the relationship between Rényi entropy H_q, Rényi extropy D_q, and “Tsallis” entropy T_q, and why I’ve come to regard Rényi extropy as the superior form. (I put “Tsallis” in quotes because the entropies that often bear his name enjoyed a lively existence in information theory and theoretical ecology for over 20 years before Tsallis discovered them.) These three quantities are, respectively,

H_q = \frac{1}{1 - q} \log \sum p_i^q, \qquad D_q = \Bigl( \sum p_i^q \Bigr)^{1/(1 - q)}, \qquad T_q = \frac{1}{q - 1} \Bigl( 1 - \sum p_i^q \Bigr).

As I mentioned before, each is related to the others by an invertible, order-preserving transformation, and to that extent they’re interchangeable.

But the unique selling point of Rényi extropy is that it’s an effective number:

D_q(1/n, \ldots, 1/n) = n.

Since the uniform distribution maximizes D_q, another way to say this is that \max_{\mathbf{p}} D_q(\mathbf{p}) = n, where the maximum is over probability distributions \mathbf{p} on \{1, \ldots, n\}.
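The effective number property is trivial to confirm numerically; here’s a sketch (the q = 1 case is taken to be the exponential of Shannon entropy, matching the definition D_q = exp(H_q) above):

```python
import math

def hill(p, q):
    """Hill number / Rényi extropy D_q; the q = 1 case is exp(Shannon entropy)."""
    if q == 1:
        return math.exp(-sum(x * math.log(x) for x in p if x > 0))
    return sum(x**q for x in p) ** (1 / (1 - q))

# D_q of the uniform distribution on n points is n, for every order q.
for n in (2, 5, 100):
    uniform = [1 / n] * n
    for q in (0.5, 1, 2, 10):
        assert abs(hill(uniform, q) - n) < 1e-9
```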

Any kind of entropy can be transformed, in a more or less unique way, into an effective number. For example, the effective number form of both Rényi entropy and Tsallis entropy is Rényi extropy. This is because all three are functions of \sum p_i^q and Rényi extropy is an effective number.

In ecology, D_q(\mathbf{p}) is called the Hill number of order q, and is used as a measure of diversity. That’s why I use the letter D, and that’s why I prefer the term diversity (of order q) to “Rényi extropy” or “Hill number”. The idea is that we have an ecological community made up of n species in proportions p_1, \ldots, p_n. Diversity is maximized when the species are evenly balanced. I wrote a long explanation a couple of years ago.

The effective number property makes DqD_q behave very intuitively. If the diversity of an ecosystem is 2144, that means it’s as diverse as a system of 2144 species in equal proportions. There are “effectively” 2144 species.

Lou Jost has pointed out with compelling examples that if your entropy is not an effective number, the numbers it produces will be treacherous.

To elaborate on his example here, suppose there’s been an oil spill.
Ecologists claim it’s caused devastation, but the oil company announces that biodiversity in the marine ecosystem has dropped by just 2%. Pressed for details, the oil company scientists say they’re using a standard measure of diversity, the Tsallis entropy of order 2:

T_2(\mathbf{p}) = 1 - \sum p_i^2

(known to ecologists as the Simpson or Gini-Simpson index). They reveal that the pre-spill diversity was 0.99, and now it’s 0.97, which is indeed a drop of 2%.

But the ecologists are right. If T_2(\mathbf{p}) = 0.99 then D_2(\mathbf{p}) = 100, and if T_2(\mathbf{p}) = 0.97 then D_2(\mathbf{p}) = 33. So before the spill, the marine community was as diverse as a community of 100 species in equal proportions, and after the spill, it’s as diverse as a community of 33 species in equal proportions. If the pre- and post-spill communities really did consist of species in equal proportions, then the spill would have wiped out two thirds of them. Any sensible measure — that is, any effective number — would drop by two thirds. But T_2, not being an effective number, drops by a completely misleading 2%.
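The arithmetic here is a one-liner, using D_2 = 1/∑p_i² = 1/(1 − T_2):

```python
def d2_from_t2(t2):
    """Convert the Gini-Simpson index T_2 to the effective number D_2 = 1/(1 - T_2)."""
    return 1 / (1 - t2)

before, after = d2_from_t2(0.99), d2_from_t2(0.97)   # ~100 and ~33.3 effective species
drop = 1 - after / before                            # ~0.667: a two-thirds drop, not 2%
```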

Going back to the pure mathematics of it, one reason I favour effective numbers is that they seem to be the right thing in my ongoing efforts to study “size”. For example, I observed above that the cardinality of a finite set is the maximum value of D_q of a probability distribution on it, precisely because D_q is an effective number. There’s also a version of D_q for distributions on metric spaces, and its maximum value is closely related to the magnitude of the space. Again, that only works because D_q is an effective number.

I guess the unique selling point of Tsallis entropy is that it can be expressed as a sum

T_q(\mathbf{p}) = \sum_i p_i \sigma_q(p_i), \qquad \sigma_q(p) = \frac{1 - p^{q - 1}}{q - 1};

this is the interpretation of entropy as expected surprise; \sigma_q assigns to a probability a measure of surprise at the event occurring. But I don’t really know. Is that what users of Tsallis entropy would point to as its fundamental property?

Re: Category-Theoretic Characterizations of Entropy

Is that what users of Tsallis entropy would point to as its fundamental property?

We could do with a Tsallis entropy fan to present their case. It does seem that with Tsallis you can modify equations satisfied by Shannon entropy in the simplest way, e.g., back here, perhaps by using deformed logarithms.

I think also that the fact that the Tsallis divergence (relative entropy) fits into Amari’s family of \alpha-divergences must be telling us something.

Something I’d certainly hope to see explained if category theory is to throw light on entropy, etc., is why differential geometry seems to emerge so naturally. It hasn’t always been presented in the most straightforward of ways, but time and again when I worked with a machine learning group those who understood various algorithms as performing projection along geodesics came up with the best ideas. There’s even the EM-algorithm which involves alternating projections along the geodesics of a pair of dual connections (whose divergences are Kullback-Leibler, or whatever we decided to call it, and inverse Kullback-Leibler).

Amari explains this in one of his clearest papers, Information geometry in optimization, machine learning and statistical inference. For a basic video introduction to information geometry try Dasgupta’s lecture. You have to see once in your life the idea of projecting from a prior along a geodesic to the space of distributions matching in some respect the empirically observed distribution. This is where the corresponding exponential family intersects that space. Alternatively, you project from the empirical distribution to the exponential family.

Re: Category-Theoretic Characterizations of Entropy

I was wrong, and the truth is more interesting than I’d believed.

Back here, I wrote down the formula for the Rényi entropy of an operadic composite. Actually, things look nicer if you use the exponential of Rényi entropy, the “Rényi extropy” or Hill number D_q. We had a composite

P \circ (P_1, \ldots, P_n)

in the operad for convex algebras (whose space of n-ary operations is \Delta_n = \Delta^{n - 1}), and we were writing P = (p_1, \ldots, p_n). I correctly gave the formula

Re: Category-Theoretic Characterizations of Entropy

Thanks a million, Tom! Sitting in the talks today I got some good ideas, and your calculations seem to be related. I’ll try to say something more precise soon, but right now I’ll just say that these ‘escort distributions’ (I’d never heard that term) are the equilibrium states at different temperatures of a physical system with a fixed Hamiltonian (which physicists think about all the time, and are crucial to my understanding of Rényi entropy). So, I’m very happy to see them showing up: this is starting to look like what I always wanted! I think we’ll soon have a very nice paper. I’m not sure what it’ll say, but (curiously) I think the really hard part is almost done.

Re: Category-Theoretic Characterizations of Entropy

‘escort distributions’ (I’d never heard that term) are the equilibrium states at different temperatures of a physical system with a fixed Hamiltonian.

I think it is a good idea to consider the entropy evolution, in this interpretation, as a renormalization flow of Hamiltonians: morphisms in a category of Hamiltonians (homomorphisms of the bilinear forms).

Because the entropy is a “more flexible measure of dimensionality” (rational in the constructive case), it is possible to consider this flow as a chain of homotopies connecting Hamiltonians even with different equilibrium dimensions.

Re: Category-Theoretic Characterizations of Entropy

Sitting in a talk today, I learned about epitopes. Of course, I thought of you and Jim.

For a minute I had no idea what you were talking about… probably because I was reading that page about epitopes!

(Tom’s making a joke about ‘opetopes’.)

By the way: what is this conference you’re at?

I hope you’re feeling well again.

Thanks – I’m feeling great. First of all, it was just a 36-hour bug. Second of all, your new calculations on the quasi-operadic nature of Rényi extropy fill me with hope. Third of all, they resonate in an exciting way with the ideas I had this morning. And fourth of all, I’ve been spending lots of time listening to talks by Susan Holmes, whose work combines fascinating math with lots and lots of attention to real-world biological data. Ant networks, pyrosequencing, metagenomics, the microbiome, treespace, the statistical analysis of heterogeneous datasets… my mind is reeling in a most pleasant way from a bombardment of exciting new ideas!

I hope that sometime you and Mike Shulman and anyone else interested in operads will visit Azimuth and tell me what you think about the phylogenetic operad. It’s very similar to some classic operads showing up in homotopy theory, but I’ve never seen it before.

It was pretty good. I learned a lot about infectious diseases. (Fun fact: keeping cows drastically cuts your risk of catching malaria.) Christina Cobbold gave a great talk on two aspects of diversity. The first part was on the diversity of variants of trypanosome, the parasite that causes sleeping sickness. Its nasty trick is to keep changing its surface, so that as soon as your immune system generates the white blood cells to fight it, it mutates into something slightly different. The new white cells are now useless, so your poor immune system has to start the process all over again. If you get infected, it’s a bit like catching many diseases in succession: you get ill, then recover, then get ill again, then recover… all because of the diversity of the variants.

The second part was about our joint work on measuring diversity in general. If all those epidemiologists and ecologists and evolutionary biologists in the audience were watching carefully, they’d have seen the formula for Rényi extropy.

Re: Category-Theoretic Characterizations of Entropy

I thought a little bit about this a year or two ago and wonder if you’ve considered the following point of view.

Consider the category of finite probability spaces and surjections. We have a covariant functor to vector spaces which assigns to a space its functions, and to a map the corresponding ‘integration along the fiber’. Consider this instead as a functor to affine spaces and try to deform the constant term (fixing the linear part). One easily checks that Shannon entropy is the unique deformation, i.e. that the assignment

is a functor iff H(f) is a multiple of the Shannon entropy. To be clear about what kind of object this is, H(f) is a function on the target of f. I think that to be perfectly honest we require some continuity here.

To see why this is so, let’s just write out functoriality in a simple case

which is exactly the usual formula “total entropy = expected entropy of a subsystem + entropy from selecting a subsystem”.
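Although the display above is missing here, the formula in words is the familiar Shannon chain rule, which a few lines of Python can confirm (the operadic composite notation follows earlier comments in the thread):

```python
import math

def shannon(p):
    """Shannon entropy (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def compose(p, blocks):
    """Operadic composite P o (P_1, ..., P_n): weight each block P_i by p_i."""
    return [w * x for w, block in zip(p, blocks) for x in block]

P, P1, P2 = [0.4, 0.6], [0.5, 0.5], [0.1, 0.9]

total = shannon(compose(P, [P1, P2]))
# total entropy = entropy from selecting a subsystem
#               + expected entropy of a subsystem
decomposed = shannon(P) + sum(w * shannon(b) for w, b in zip(P, [P1, P2]))
assert abs(total - decomposed) < 1e-12
```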

Is this at all related to what you guys are doing?

[Josh: I fixed the math. To enter Latex at this blog, choose a text filter such as “itex to MathML with parbreaks” from the drop-down menu on the comment form. Most Latex works as usual, including AMS embellishments. Displays go between double dollars. Tom]

Re: Category-Theoretic Characterizations of Entropy

Could you expand on this some more? In particular, do you really mean to start from the category of finite probability spaces with surjections? That category is equivalent to the category of finite sets with surjections, and defining entropy here doesn’t make much sense. Maybe you meant to take the category of finite probability spaces with measure-preserving functions? If so, then we are using this category a lot. We call it FinProb.

Then, can you give a more precise definition of the functor? What is ‘integration along the fiber’? And how do you prove that Shannon entropy is the unique (say, continuous) deformation of this functor?

The “total entropy = expected entropy of a subsystem + entropy from selecting a subsystem” formula is indeed something we are concerned with. Tom has found an operadic formulation of it.

Re: Category-Theoretic Characterizations of Entropy

Tobias,

Sure, I suppose I ought to be using your category FinProb. In spelling out the assumptions I need in the following, my approach lost a lot of its charm, but here’s what I meant:

My first functor, which I’ll be deforming, takes a finite probability space to the vector space of functions on it, and literally averages them over the fibers of a map. For a map to a point, this is “expectation”, and for a general map it is some kind of conditional expectation.

For a map f: X → Y, I am going to define H(f), an element of Functions(Y). I would like to do this in such a way that the following assignment is a functor:

Re: Category-Theoretic Characterizations of Entropy

Josh: it’s tough to delete your old post without rendering all comments on that post invisible, and while I could manage it in half an hour, Tobias’ post beginning “Could you expand on that some more?” would then become quite mysterious, along with your remark “In spelling out the assumptions I need in the following, my approach lost a lot of its charm.” So, let’s leave history as it is.

Your idea is interesting! What made you think about deforming the pushforward operation on functions?

(By the way: are you summing over each fiber, or averaging over it? The usual pushforward operation on functions sums over the fiber, and that’s functorial. Does averaging over the fiber give a functor?)

Re: Category-Theoretic Characterizations of Entropy

Re: Category-Theoretic Characterizations of Entropy

So I’ve now verified that this makes sense. What’s nice about it is that it’s a bit less abstract than Tom’s operadic formulation. There are a few subtleties which I would like to discuss. First of all, working with FinProb as the category of all finite probability spaces and measure-preserving functions is once again problematic: in degenerate cases with zero probabilities, conditional probabilities don’t exist, but these are required for defining the ‘integration over the fibers’.

Secondly, as you already noticed, the symmetry, i.e. the invariance under isomorphism, is not automatic. Without artificially imposing non-negativity, it does not follow from your other requirements. The following reasoning shows why.

Let us fix any non-trivial object Z \in FinProb. Then given any functor

F : FinProb \rightarrow Aff,

we can modify it as follows. We choose any fixed g \in Trans(F(Z)). (Here, Trans(F(Z)) is the vector space of translations on F(Z). It acts on F(Z) by addition.) Then we can define a new functor

F' : FinProb \rightarrow Aff

as follows: F' equals F on objects. On morphisms f : X \rightarrow Y, we define F'(f) by

F'(f) = F(f) + \delta_{X,Z} g - \delta_{Y,Z} g,

where \delta_{X,Z} is 1 if X is equal to Z (equal, not just isomorphic), and 0 otherwise. You can check easily that this turns F' into a functor.

In fact, what I have done here is to construct F' from F using a gauge transformation. So, what might be correct is the following:

“Any deformation of the ‘integration along the fiber’ functor FinProb \rightarrow Aff is a multiple of Shannon entropy up to a gauge transformation.”

Thirdly, how can we express categorically the requirement that the new functor FinProb \rightarrow Aff is supposed to be a ‘deformation’ of the original functor FinProb \rightarrow Vect? One possibility is to go from affine space to projective space: given an affine space, consider its set of lines. Two lines are called equivalent if they are parallel. The equivalence classes of lines then form the projective space associated to the affine space. Morphisms between affine spaces descend to morphisms of the associated projective spaces, since they map lines to lines. So we have a functor Aff \rightarrow Proj. ‘Deformation’ now means that the composed functors FinProb \rightarrow Proj have to coincide.

Re: Category-Theoretic Characterizations of Entropy

Re: Category-Theoretic Characterizations of Entropy

Tobias,

Thanks for your interest in this. The way I was thinking of things was really as actual vector spaces with affine maps between them. In this case there is a nice functor to the ordinary category of vector spaces just by taking the derivative (linear part). This is really the sense in which I’m deforming, which as you’ve suggested is not really the category of abstract affine spaces.