September 5, 2007

Category Theory in Machine Learning

Posted by David Corfield

I was asked recently by someone for my opinion on the possibility that category theory might prove useful in machine learning. First of all, I wouldn’t want to give the impression that there are signs of any imminent breakthrough. But let me open a thread by suggesting one or two places where categories may come into play.

For other areas of computer science the task would be easier. Category theory features prominently in theoretical computer science as described in books such as Barr and Wells’ Category Theory for Computing Science. Then there’s Johnson and Rosebrugh’s work on databases.

As for machine learning itself, perhaps one of the most promising channels is through probability theory. One advantage of working with the Bayesian approach to machine learning is that it brings with it what I take to be more beautiful mathematics. Take a look at this paper on statistical learning theory. It belongs to the side of the cultural divide where category theory doesn’t flourish. If, on the other hand, we encounter mathematics of the culture I prefer, it is not unlikely that category theory will find its uses.

In a couple of posts (I and II) I discussed a construction of probability theory in terms of a monad. It struck me there that the natural inclination of the Bayesian to think about distributions over distributions fits this construction well. For example, Dirichlet processes, currently a hot topic, are distributions over distributions.
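To make "distributions over distributions" concrete, here is a minimal sketch of a finite analogue of the probability monad. It is only an illustration under simplifying assumptions: distributions are finite dictionaries, and the names `unit`, `join`, `fair`, `biased` and `prior` are invented for the example. The monad's multiplication averages a distribution over distributions down to an ordinary distribution.

```python
from collections import defaultdict
from fractions import Fraction

def unit(x):
    """Point distribution concentrated on x (the monad's unit)."""
    return {x: Fraction(1)}

def join(dist_of_dists):
    """Flatten a distribution over distributions into one distribution.

    Inner distributions are given as tuples of (outcome, probability)
    pairs so that they are hashable and can be used as dictionary keys.
    """
    flat = defaultdict(Fraction)
    for inner, weight in dist_of_dists.items():
        for outcome, prob in inner:
            flat[outcome] += weight * prob
    return dict(flat)

# Two hypothetical coins and a prior expressing uncertainty about which one we hold.
fair = (("heads", Fraction(1, 2)), ("tails", Fraction(1, 2)))
biased = (("heads", Fraction(9, 10)), ("tails", Fraction(1, 10)))
prior = {fair: Fraction(3, 4), biased: Fraction(1, 4)}

# The Bayesian's predictive distribution for a single toss.
marginal = join(prior)
```

Here `marginal` assigns probability 3/5 to heads and 2/5 to tails, the prior-weighted average of the two coins.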

Graphical models, which include directed graphs, are another hot topic. If we remember that a category is a kind of directed graph, perhaps something can be done here. Graphical models can also be blended with probabilities. I once tried to think of Bayesian networks, a result of this blend, as forming a symmetric monoidal category.

Another dimension to spaces of probability distributions is that they can be studied by differential geometry in a field known as information geometry. In this list there are some references to the use of information geometry in machine learning. As a distant prospect, perhaps category theoretic aspects to differential geometry could come to play a role.

Statistical field theories, which encompass quantum mechanics and quantum field theory in their Euclidean formulation, are technically similar to a nonparametric Bayesian approach, and since we talk right here about category theory and such field theories, perhaps something interesting could happen.

Another speculative thought was to tie the kernels appearing in machine learning to John’s Tale of Groupoidification. Perhaps this might be done to encode invariances more intelligently. Currently, RBF kernels get used a lot, even though they don’t encode your background knowledge well. For example, two images varying just in one pixel are close in the space of images, so if one is classified as a ‘3’, there is a high probability that the other is too. But shift an image two pixels to the right and the images are far apart in the space of images, so the kernel is agnostic about what the label of one image means for the other. One needs to encode this invariance in the kernel.
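The pixel example can be sketched numerically. The following is a toy illustration, assuming one-dimensional eight-pixel "images" and an arbitrary bandwidth `gamma`; the helper names are invented for the example.

```python
import math

def rbf(x, y, gamma=0.5):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def bar_image(start, width=8, bar=3):
    """A 1-D 'image': a bright bar of `bar` pixels beginning at index `start`."""
    return [1.0 if start <= i < start + bar else 0.0 for i in range(width)]

original = bar_image(1)

# Same image with a single stray pixel switched on.
one_pixel_changed = list(original)
one_pixel_changed[7] = 1.0

# Same bar shifted two pixels to the right.
shifted = bar_image(3)

k_noise = rbf(original, one_pixel_changed)  # squared distance 1
k_shift = rbf(original, shifted)            # squared distance 4
```

The kernel rates the one-pixel variant as much more similar to the original than the shifted copy, even though a person would call the shifted copy the same picture. Nothing in the kernel knows about translation.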

Two people who very much believe that the kernels used in machine learning are not the right ones for the tasks we need to perform in the world are LeCun and Bengio. The problem is with the shallowness of the architecture, they say here. Instead they advocate neural nets with deep architectures. These have the ‘catastrophic’ behaviour that small changes in the weights may lead to very different performance.

A neural net architecture is a mapping from a space of weights to a certain space of functions, but this is not a 1-1 mapping. Many weight settings may correspond to the same function. Singularity theory can be used to study some of the properties, as Watanabe does. That’s the kind of mathematics where category theory should show up.
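A small sketch shows why the weights-to-functions map fails to be 1-1. It assumes a toy one-hidden-layer tanh network on scalar inputs, and the particular weights are arbitrary. Permuting the hidden units, or flipping the sign of every weight (tanh is odd), changes the weight vector but not the function computed.

```python
import math

def net(x, hidden, output):
    """One-hidden-layer tanh network on a scalar input.

    hidden: list of (weight, bias) pairs, one per hidden unit.
    output: list of output weights, one per hidden unit.
    """
    return sum(v * math.tanh(w * x + b) for (w, b), v in zip(hidden, output))

hidden = [(0.5, -1.0), (2.0, 0.3)]
output = [1.5, -0.7]

# Symmetry 1: swap the two hidden units (and their output weights).
hidden_perm = hidden[::-1]
output_perm = output[::-1]

# Symmetry 2: negate incoming and outgoing weights of every unit.
hidden_flip = [(-w, -b) for w, b in hidden]
output_flip = [-v for v in output]

inputs = [-2.0, -0.5, 0.0, 0.7, 3.0]
same_after_perm = all(
    math.isclose(net(x, hidden, output), net(x, hidden_perm, output_perm), abs_tol=1e-12)
    for x in inputs
)
same_after_flip = all(
    math.isclose(net(x, hidden, output), net(x, hidden_flip, output_flip), abs_tol=1e-12)
    for x in inputs
)
```

Both flags come out true, so three distinct points of weight space land on one function.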

There is also some work directly applying category theory to neural networks, such as here, but I haven’t followed it.

All in all, you can see that there are no sure-fire bets. I very much doubt we’re as far advanced as a mathematical physicist wondering about categories in their field in 1993. If anyone else has some reasons to be optimistic, do let us know.

Posted at September 5, 2007 1:32 PM UTC

32 Comments & 0 Trackbacks

Re: Category Theory in Machine Learning

Hi David.
This might be a somewhat silly question, but have people done much work on using the technology of type theory with machine learning? That’s generally where a lot of the uses of categories come into CS, at least the kind of CS I seem to read about.

Also, I had never heard of information geometry before. The idea of the metric being related to the “amount of information” between two configurations is fun. I’m surprised I haven’t seen it come up more often in quantum mechanics.

Re: Category Theory in Machine Learning

Not at all a silly question. As far as I’m aware there’s no work done here, but it would have the advantage of being able to tap into an enormous literature. People do try to get machines to parse sentences grammatically. Perhaps that would be a place to start.

As for information geometry, there’s even a quantum version, as discussed at this Fields Institute workshop. One of the people with whom John dined yesterday evening, Ray Streater, has contributed to the field.

Re: Category Theory in Machine Learning

On a somewhat related note:

As far as I am aware, Nils Baas (usually mentioned here in relation to his work on classifying spaces of 2-vector bundles) is seriously thinking about applying $n$-category theory – in the context of his framework of hyperstructures – to biology.

We think that the present framework is quite useful for analysing the nature of emergence - in particular the dependence on observational mechanisms. In order to put this into a more mathematical framework category theory is very useful. We let the systems be represented as objects in a category and the interactions as morphisms. A complex system with interactions is then represented by a diagram. Since the morphisms are represented by arrows, this may be viewed as a process oriented representation.

Re: Category Theory in Machine Learning

I like the idea that all forms of generalisation are adjunctions. It makes some kind of sense to me - the idea being that an adjunction gives what is in some sense the ‘best’ way to represent an object in one category using an object in another category. Much of machine learning could loosely be thought of in this way.

Re: Category Theory in Machine Learning

Dan Piponi wrote:

the idea being that an adjunction gives what is in some sense the ‘best’ way to represent an object in one category using an object in another category.

We can make this very precise by saying that the search for a left or right adjoint to a functor $F: C \to D$ is the search for the ‘best approximation’, from either above or below, to the (possibly nonexistent!) inverse of $F$.

For example, if $C$ and $D$ are partially ordered sets, $F$ has a right adjoint if we can find for each $d \in D$ a greatest possible $c \in C$ with $F(c) \le d$. This is what I mean by a ‘best approximation from below’ to the possibly nonexistent object $F^{-1}(d)$. We then define the right adjoint of $F$ to be the $G : D \to C$ that picks out this best possible $c$ for each $d \in D$.
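Here is a small sketch of the poset case, with an example chosen purely for illustration: take both posets to be the integers and let $F$ be multiplication by 3, which has no inverse. Its right adjoint $G$ picks the greatest $c$ with $3c \le d$, which is floor division. The defining property is that $F(c) \le d$ holds exactly when $c \le G(d)$.

```python
def F(c):
    """Order-preserving map on the integers with no inverse: multiply by 3."""
    return 3 * c

def G(d):
    """Greatest integer c with F(c) <= d, i.e. floor(d / 3)."""
    return d // 3

def is_right_adjoint(F, G, cs, ds):
    """Check F(c) <= d  <=>  c <= G(d) on finite samples of both posets."""
    return all((F(c) <= d) == (c <= G(d)) for c in cs for d in ds)

sample = range(-20, 21)
adjunction_holds = is_right_adjoint(F, G, sample, sample)

# 'Best approximation from below' to the nonexistent F^{-1}(7):
best_below_7 = G(7)  # 2, since 3*2 <= 7 < 3*3
```

Python's `//` rounds towards negative infinity, so the check also passes for negative `d`.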

The whole business is a bit subtler for categories that aren’t posets, but it’s still quite standard stuff — except for the realization that an adjoint functor is ‘the best possible reply to an impossible demand’.

I’m not sure I believe analogy is best modeled in terms of adjoint functors. I’ve always thought of analogy as being like ‘proportion’ — that is, equations of the form

αναλογια (analogia) originally meant ‘proportionality’, in the mathematical sense, and it was indeed sometimes translated to Latin as proportio. From there analogy was understood as identity of relation between any two ordered pairs, whether of mathematical nature or not.

This is all about finding one side of a commutative square given the other three. I don’t see what this has to do with adjoint functors, at least not in any precise way.

Nonetheless, there’s some relation to adjoint functors, since you could still say every analogy is the best possible reply to an impossible demand: a demand that would be possible if the necessary inverses existed.

For example:

poetry:music::the Iliad:???

The demand here is: if poetry were music, the Iliad would be what?

If we had inverses at our disposal, we could write this as a proportion:

$$\frac{\text{poetry}}{\text{music}} = \frac{\text{the Iliad}}{x}$$

Then, with luck, we could solve for $x$:

$$x = \text{the Iliad} \cdot \frac{\text{music}}{\text{poetry}}$$

However, the necessary inverses don’t really exist, so we need to use analogy to guess an answer to this impossible problem.

Mathematical puzzle: what do proportions like A:B::C:X, or commutative squares, have to do with adjoint functors? Presumably a good answer would take everything I just said and get it to fit together better!

Re: Category Theory in Machine Learning

I have seen this statement before, but I am not sure I entirely follow what it is supposed to mean in detail. It sounds very interesting, though.

Can you give me a simple example, which illustrates this idea that an analogy is a span?

I really like the general idea that it is possible to understand informal everyday concepts like “analogy” in a precise manner. This is similar to how the informal everyday concepts “stuff”, “structure”, “property” are nicely formalized by the forgetful functor game. In that case, I find a couple of examples very helpful to understand how it works.

Same probably here. So: does anyone have a nice example for understanding the statement that “an analogy is a span”?

‘The child is father to the man’. Hopkins, Gerard Manley. 1918. Poems.

Hopkins is commenting on Wordsworth’s original sentiment, namely a poetic statement of an otherwise commonplace observation: what you are, and feel, and think, learn, and believe as a child creates a path you will take into adulthood.

Superficially, this is an adjoint of the conventional chronology “The Man is father of the child” but does a kind of recategorification in replacing time spans with paths, and the implication that many potential paths exist. I’m not saying that these poets predicted Feynman diagrams or holonomy, but I did want to indicate that metaphor is a higher level of n-category than naive models suggest.

Re: Category Theory in Machine Learning

Interesting stuff. I’ve been pondering categorical ways of thinking about analogies for a while. First of all I think that the example JB gives is a slightly degenerate example of an analogy: analogies of this form don’t have much structure to them. Anyway, let me solve the analogy problem.

poetry:music::the Iliad:???

The Iliad relates to poetry in a very specific way: it is a prototypical, famous example of a poem. We can think of this as defining a map “take prototypical example of”, and apply it to music. Thus instead of writing

Where PrtEx is shorthand for the “take prototypical example” function.

The reason I say that this is a slightly degenerate example of an analogy is that it has almost no structure to it, and is based on knowing what “the Iliad” is. Most useful analogies (i.e. ones that people will actually use when they are trying to explain something) have a lot more structure. For example, suppose you are trying to explain the concept of choosing a password to encrypt a file to your grandmother. She doesn’t understand what encryption is, so you make an analogy with a concept she does understand: the concept of locking a valuable item in a safe. In order to make this analogy to her, you have to map each concept in the “encryption” scenario to an object in the “safe” scenario. Thus:

encryption <—–> safe

password <—–> key for safe

plaintext <—–> valuable item

to encrypt <—–> to lock the safe

to decrypt <—–> to unlock the safe

At first glance it looks like an analogy is simply a bijection of sets. In fact this is not the case: to start with these are very structured sets, and most of the work that the analogy does is done by the way the concepts on the “known” side of the analogy link into other concepts. For example, the concept of locking a safe “interfaces” with the concept of other people wanting to steal things. Your granny will know that she should not tell anyone what the password is because she already knows that if she had a safe then she would be careful who she gave the key to. She transports her existing understanding across the analogy.

In summary, an analogy is a way of mapping structure from some “known” situation to an “unknown” situation; to put it another way, an analogy is a “homomorphism of concepts”. This is a lot like what functors do in category theory, which was what originally motivated me to think about formalizing analogy with category theory.
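The safe/encryption analogy can be sketched as a structure-preserving map between two small collections of facts. This is only a toy model: the relation names ("unlocks", "protects", "must be kept from") and the extra pair attackers/thieves are assumptions introduced for illustration, not part of the table above.

```python
# Facts as (subject, relation, object) triples.
safe_facts = {
    ("key for safe", "unlocks", "safe"),
    ("safe", "protects", "valuable item"),
    ("key for safe", "must be kept from", "thieves"),
}
encryption_facts = {
    ("password", "unlocks", "encryption"),
    ("encryption", "protects", "plaintext"),
}

# The analogy: each unfamiliar concept is sent to a familiar one.
analogy = {
    "encryption": "safe",
    "password": "key for safe",
    "plaintext": "valuable item",
    "attackers": "thieves",  # hypothetical extra pair
}

def translate(fact, mapping):
    """Rename the subject and object of a fact, keeping the relation."""
    subject, relation, obj = fact
    return (mapping[subject], relation, mapping[obj])

def preserves_structure(source_facts, mapping, target_facts):
    """A 'homomorphism of concepts': every source fact maps to a target fact."""
    return all(translate(f, mapping) in target_facts for f in source_facts)

def transport_back(target_facts, mapping):
    """Pull familiar facts back along the analogy to the unfamiliar side."""
    inverse = {known: unknown for unknown, known in mapping.items()}
    return {
        translate(f, inverse)
        for f in target_facts
        if f[0] in inverse and f[2] in inverse
    }

is_homomorphism = preserves_structure(encryption_facts, analogy, safe_facts)
inferred = transport_back(safe_facts, analogy) - encryption_facts
```

The one inferred fact is that the password must be kept from attackers, which is exactly the piece of understanding the grandmother carries across from safes.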

My personal theory is that the above idea can be used in AI as follows: a fully fledged concept can be built up as a sequence of partial analogies to some basic set of concepts. In fact Lakoff and Johnson have written an interesting book called Metaphors We Live By which outlines how a few important basic concepts are analogically related to all sorts of other concepts.

Re: Category Theory in Machine Learning

The analogy fits to the degree it matches closely with the primary categorizing descriptions/words.

Joseph Campbell describes the archetypal “Hero’s Journey”. The Iliad describes the journey of a king and return to his kingdom, so it belongs to a Hero’s Journey. Another example is “The Lord of the Rings”, particularly the third book, “The Return of the King”. I think the epoch of music that matches ancient Greece is that of Classical Music offerings, often using Latin.

I think the kind of music which matches the Iliad archetype is a mirror canon, an inversion of the music found earlier in the piece, which can be found in GEB by Hofstadter in his discussion of analogy and self-reference. Bach prepared “The Musical Offering” for King Frederick II, which contains fugues and canons. They could be imagined as corresponding to the adventures of Ulysses. So my answer is that
x = The Musical Offering composed by Bach.

This type of analogical reasoning is pretty hard to capture by computer program. I favor Lakoff’s theory of embodied mathematics because it incorporates a sense of primal origin, archetypal thinking, as does Memory Evolutive Systems, which features Category Theory, that others have mentioned.

Re: Category Theory in Machine Learning

Dan’s remark reminds me of one perspective I have on adjunctions: that they are related to approximations. The simplest example illustrating this is to take the embedding of the integers in the reals but view this as a functor from one category to another: both sets are partially ordered sets and a poset is viewed as a category by having an arrow x—>y whenever y is greater than or equal to x. The inclusion functor in this case has a left and right adjoint, and these are the ceiling and floor functions. Now think of the ceiling and floor functions as approximating a real number by an integer.
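The two adjunctions can be checked directly on a sample. In the sketch below, exact rationals stand in for the reals (an assumption made to avoid floating-point noise) and `inclusion` is the embedding of the integers. Ceiling is left adjoint to the inclusion, and floor is right adjoint.

```python
import math
from fractions import Fraction

def inclusion(n):
    """The embedding of the integers into the 'reals' (here: rationals)."""
    return Fraction(n)

reals = [Fraction(k, 4) for k in range(-12, 13)]  # -3, -2.75, ..., 3
integers = range(-4, 5)

# Left adjoint: ceil(x) <= n  <=>  x <= inclusion(n)
ceiling_is_left_adjoint = all(
    (math.ceil(x) <= n) == (x <= inclusion(n)) for x in reals for n in integers
)

# Right adjoint: inclusion(n) <= x  <=>  n <= floor(x)
floor_is_right_adjoint = all(
    (inclusion(n) <= x) == (n <= math.floor(x)) for x in reals for n in integers
)
```

Both checks succeed: ceiling is the best integer approximation from above, floor the best from below.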

Re: Category Theory in Machine Learning

I had some ideas about this about the time of my book with Cordier:
Shape Theory: Categorical Methods of Approximation,
in 1989.
There has been a little written about it by myself and others but the problem of incorporating Kernel type constructions is still a hard one as far as I can see.

It may help to look at some of the work recorded in the Dagstuhl seminars on representation of spaces:
http://drops.dagstuhl.de/portals/index.php?semnr=04351
Some of this is fairly standard TCS but the use for instance by formal concept analysis of lattice theoretic stuff is serious artificial intelligence.

Though it’s not machine learning, evolutionary computation, and particularly coevolutionary computation, has several places where category theory could contribute quite a lot. I’ve formulated most of my own ideas about coevolution in terms of preorders, many of which could conceivably and usefully be replaced by categories. Besides that, there are places where I’ve seen what amount to functors and sheaves. I’m sorry about the long comment, but I wanted to give some setup and intuition.

A typical formulation of a coevolutionary algorithm starts with an interactive domain. You can view one of these as a function $p: S \times T \rightarrow R$. When an element of $S$ (candidate solution) interacts with an element of $T$ (test), it receives a result from the preorder $R$. The order on $R$ allows results to be compared; higher values are better for the candidate solution. Payoff matrices as used in game theory can be written in this way, though there is no need for $R$ to contain numbers. $p$ might encode the outcome of a game of chess, and $p(me, you)$ might be $lose$, where $lose \lt draw \lt win$ (it might be useful to encode why I lost, and make $R$ a full-blown category). Since I brought up chess, I should say that games make up a good fraction of example applications, so I’m going to lapse into using words like “strategy” and “play” when to be more general I should be saying “solution” and “interact.”

One of the most important observations here is that in most games of interest, no single element of $S$ actually suffices as a “solution.” Rather, some mixture (distribution – think Nash equilibrium strategy), bag, or other structure built from elements of $S$ usually works better (I should say that I’m ignoring $T$ for now for simplicity’s sake…but there is a lot to say about it). In rock-paper-scissors, for instance, none of the “primitive” or “atomic” strategies rock, paper, or scissors, is a very good one to play; picking one of them at random with equal probability at each interaction is usually a better strategy. Well, that is just a probability distribution over $S = \{rock, paper, scissors\}$ with all choices equiprobable. The collection of all probability distributions over $S$ is what we’re really searching in this case.
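A short sketch makes the rock-paper-scissors claim checkable. It assumes a numeric encoding of results (win = 1, draw = 0, lose = -1), which is needed to take expectations but is not forced by the general setup, and it scores a mixture by its worst expected result over all opposing moves.

```python
from fractions import Fraction

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
MOVES = list(BEATS)

def payoff(s, t):
    """Result for s against t: 1 for a win, 0 for a draw, -1 for a loss."""
    if s == t:
        return 0
    return 1 if BEATS[s] == t else -1

def expected_payoff(mixture, t):
    """Expected result of a mixed strategy (dict move -> probability) against t."""
    return sum(prob * payoff(s, t) for s, prob in mixture.items())

def worst_case(mixture):
    """Guaranteed expected result, whatever the opponent plays."""
    return min(expected_payoff(mixture, t) for t in MOVES)

uniform = {m: Fraction(1, 3) for m in MOVES}
pure_rock = {"rock": Fraction(1)}

uniform_guarantee = worst_case(uniform)  # 0
rock_guarantee = worst_case(pure_rock)   # -1, since paper beats rock
```

The uniform mixture guarantees an expected draw, while every pure strategy can be beaten outright, so the object worth searching for lives among the distributions over $S$, not in $S$ itself.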

So while most algorithms move through elements of $S$ (and $T$), what we really seek is an element of some structure built from $S$. The structures, like mixtures, are almost always free structures built from $S$. They’re often called configurations, so we might as well write $CS$ for configurations built from $S$ and recognize $C$ as a functor (it’s not completely clear to me whether $C$ should be covariant or contravariant, but in examples it seems the most natural thing to do is treat it as contravariant; there are usually restrictions or “degeneracies” which make more sense than inclusions/embeddings/…). Calling it “free” suggests an adjoint, and indeed there’s often a forgetful functor lying around too.

The state of an algorithm at time $t$ is usually a pair of subsets $S_t \subset S$ and $T_t \subset T$ (in fact, “populations” in coevolutionary algorithms are usually multisets/bags or distributions, so think of $S_t$ and $T_t$ as supports for simplicity). As noted above, we draw potential solutions from some structure of configurations $CS$. However, what we actually have on hand when the algorithm is running isn’t all of $S$, but $S_t$. Thus, the structures actually available at time $t$ are those in $CS_t$, so that if we stopped the algorithm and asked it for the best thing it had found, it’d have to give us its best guess from $CS_t$. It’s as if we know about rock and paper but haven’t discovered scissors yet. From the local perspective afforded by the state $\{rock, paper\}$, paper would seem like the best strategy to play, even though from our global perspective (which the algorithm does not have) we know this is not the case. This local/global distinction smells like a sheaf to me, though I’ve never been able to formulate this as a sheaf in a useful way. But the important question is: are there generic statements which can be made about whether a particular algorithm, when progressing from $S_t$ to $S_{t+1}$ to … is getting “closer to” or creating a better and better approximation to the best solution in $CS$ (where its guesses are drawn from $CS_t$, $CS_{t+1}$, …)?
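The local/global gap in the rock and paper example can be sketched in a few lines. Two simplifications are assumed here: only pure strategies are compared (so the code looks at $S_t$, not the full $CS_t$), and "best" means the highest worst-case result against the tests known so far.

```python
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def result(s, t):
    """Outcome for candidate s against test t: 1 win, 0 draw, -1 loss."""
    if s == t:
        return 0
    return 1 if BEATS[s] == t else -1

def best_guess(candidates, tests):
    """Candidates whose worst result against the known tests is highest."""
    worst = {s: min(result(s, t) for t in tests) for s in candidates}
    top = max(worst.values())
    return sorted(s for s in candidates if worst[s] == top)

# State at time t: scissors has not been discovered yet.
local_view = best_guess({"rock", "paper"}, {"rock", "paper"})

# Global perspective: all of S and T.
global_view = best_guess(set(BEATS), set(BEATS))
```

Locally paper stands alone as the best guess. Globally all three pure strategies tie, each losing to something, so paper's apparent superiority was an artefact of the restricted state.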

One more word, about $T$. Currying $p$ leads to a function $\lambda_S: T \rightarrow R^S$ via which we recognize that elements of $T$ name functions $S \rightarrow R$ (which is interesting to think about: you, as a chess player, act as a function from the set of other chess players to the preorder $\{lose \lt draw \lt win\}$!). Each $t \in T$ induces a preorder on $S$ therefore by pullback of the order on $R$ through $t$-as-function. Any subset $T' \subset T$ can be thought of as inducing an order on $S$ (in a variety of ways…) by “integrating” these individual orders from the $t \in T'$. Is there some natural way to turn that information into an order (or some other sense of direction) on $CS_t$? Can that then be used to identify potential solutions from $CS_t$ which have some hope of being good approximations for the best we could find in $CS$?
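As a sketch of the currying step and the induced preorders, again using rock-paper-scissors: each test becomes a function $S \rightarrow R$, and the order on $R$ is pulled back along it. For "integrating" over a subset $T'$, the code picks just one of the possible ways, namely requiring the comparison to hold against every test in $T'$ (Pareto dominance). That choice, like the names `curry` and `leq_from_tests`, is an assumption for illustration.

```python
RANK = {"lose": 0, "draw": 1, "win": 2}  # the preorder lose < draw < win
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
T = ["rock", "paper", "scissors"]

def p(s, t):
    """Outcome for candidate s when it interacts with test t."""
    if s == t:
        return "draw"
    return "win" if BEATS[s] == t else "lose"

def curry(p):
    """Turn p: S x T -> R into a map sending each t to a function S -> R."""
    return lambda t: (lambda s: p(s, t))

def leq_from_tests(tests):
    """Preorder on S: a <= b iff b does at least as well as a on every test."""
    as_functions = [curry(p)(t) for t in tests]
    return lambda a, b: all(RANK[f(a)] <= RANK[f(b)] for f in as_functions)

leq_rock = leq_from_tests(["rock"])  # order induced by the single test 'rock'
leq_all = leq_from_tests(T)          # order induced by all of T
```

Against the single test rock the candidates are totally ordered (scissors, then rock, then paper). Against all of $T$ no two distinct candidates are comparable, which is one way to see why the order on $S$ alone cannot single out a solution.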

What Anthony Bucci writes above is incomplete but brilliant. It is what I was stumbling about trying to say when I was taking Category Theory at the University of Massachusetts, 1973-1977, and writing papers on molecular cybernetics (studying actual microbiology under Bruce Levin) and doing extensive research on learning theory and Artificial Intelligence under the guidance of Oliver Selfridge (“father of Machine Perception”), William Kilmer, Ed Riseman, and others, and studying how Chess Masters and International Masters learn Chess (with Danny Kopec, who then became Chess advisor to the University of Edinburgh’s AI group) plus making regular trips to Cambridge MA to speak with Minsky and others at the AI Lab.

Well, I wouldn’t accept “brilliant,” but incomplete is certainly true.
Lately I’ve been reading up on the Chu construction, as there’s an obvious connection between functions of form $p: S \times T \rightarrow R$ and objects in $\mathrm{Chu}(Set, R)$. The morphisms in $\mathrm{Chu}(Set, R)$ are close to, but not quite, the “useful” monotonic maps between the candidate sets. The Chu category should allow sensible comparison between different views $(S', T')$ and $(S'', T'')$ of “the same” function $p: S \times T \rightarrow R$. It also suggests that the tensor of $S$ and $T$, and not their Cartesian product, is the appropriate way of representing the space of possible interactions when the entities involved become more complicated than just elements of a set. Since the tensor of two state spaces seems to be the right way to represent joint systems in many other contexts, it’d be nice if that proved true here as well.

Re: Re: Re: Your Boat, Gently Down the Stream

Chu

(How’s that, John?)

Interesting: as far as I can recall, this is the first time the Chu construction has ever come up at the Café. (Edit: no, it seems I’ve blabbed about them a little before.)

For those who didn’t know, the Chu construction is a marvelous construction which contains many categorical dualities under one big self-dual umbrella (read: $*$-autonomous category). Specifically, I have in mind the category of Chu spaces, which embodies a large number of concrete dualities in which the 2-element set appears as a Janusian or ambimorphic object, but the construction is rather more general, as Anthony indicates.

Let’s start with Chu spaces. A Chu space is a pair of sets $(A, B)$ equipped with a function $\langle -, - \rangle: A \times B \to 2$, which we think of as a ‘pairing’. (Sometimes, again as indicated by Anthony, one thinks of the elements of $A$ as strategies adopted by a Player and the elements of $B$ as ‘co-strategies’ adopted by an Opponent, and the pairing pits a strategy against a co-strategy and spits out an outcome, either 0 or 1, which says who wins.) A morphism from $(A, B)$ to $(X, Y)$ is a pair of functions $f: A \to X$, $g: Y \to B$, which are adjoint to one another:

$$\langle f(a), y \rangle = \langle a, g(y) \rangle \quad \text{for all } a \in A,\ y \in Y.$$
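The adjointness condition on a Chu morphism, $\langle f(a), y \rangle = \langle a, g(y) \rangle$ for all $a \in A$ and $y \in Y$, can be checked mechanically on small examples. The sketch below assumes finite sets and represents a map $X \to 2$ by the subset it picks out, so the pairing is membership. The helper names are invented for the example. A function $f$ between sets, paired with its preimage map $g$, gives a Chu morphism.

```python
from itertools import combinations

def powerset(xs):
    """All subsets of xs, as frozensets (standing in for maps xs -> 2)."""
    xs = list(xs)
    return [frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

def chu_of_set(xs):
    """The Chu space (X, hom(X, 2)) with pairing <x, U> = 1 if x in U else 0."""
    return (list(xs), powerset(xs), lambda x, subset: int(x in subset))

def is_chu_morphism(f, g, src, dst):
    """Check <f(a), y> == <a, g(y)> for all a in A and y in Y."""
    A, _B, pair_ab = src
    _X, Y, pair_xy = dst
    return all(pair_xy(f[a], y) == pair_ab(a, g[y]) for a in A for y in Y)

chu_A = chu_of_set(["a1", "a2", "a3"])
chu_X = chu_of_set(["x1", "x2"])

# Forward map f: A -> X and backward map g: hom(X, 2) -> hom(A, 2) by preimage.
f = {"a1": "x1", "a2": "x1", "a3": "x2"}
g = {U: frozenset(a for a in f if f[a] in U) for U in chu_X[1]}

# A backward map that ignores f, for contrast.
bad_g = {U: frozenset() for U in chu_X[1]}

preimage_is_adjoint = is_chu_morphism(f, g, chu_A, chu_X)
constant_is_adjoint = is_chu_morphism(f, bad_g, chu_A, chu_X)
```

The preimage map passes the check and the constant map fails it, reflecting that once the second components separate points, $f$ determines $g$.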

And, it is a very nice fact that many familiar concrete dualities embed (as full subcategories) in Chu spaces, in a duality-preserving manner. For example, there is a concrete duality between sets and complete atomic Boolean algebras: the dual of a set $X$ is a complete atomic Boolean algebra $Y = hom(X, 2)$, and the dual of a complete atomic Boolean algebra $Y$ is a set $X = hom(Y, 2)$ [the set of Boolean algebra maps preserving arbitrary infs and sups], and there is in this way a pairing $X \times Y \to 2$ between the underlying sets. Hence we obtain (full) embeddings

$$Set \to Chu: X \mapsto (X, hom(X, 2))$$

$$CABool \to Chu: Y \mapsto (Y, hom(Y, 2))$$

and the self-duality transposition of $Chu$ restricts to the concrete duality between $Set$ and $CABool$. Other familiar dualities, such as between Boolean algebras and Stone spaces, between posets and completely distributive lattices, between finite linear orders and finite intervals, etc., embed fully in $Chu$ in this duality-compatible way.

Moreover this gives a sensible framework in which to speak of, for example, a structure-preserving map from a Stone space to a continuous lattice: just treat them all as Chu spaces! The underlying set of a Chu space structure $(A, B, \langle -, - \rangle)$ would be the first component $A$, and the underlying set of the dual structure would be $B$.

And, there’s more: the category of Chu spaces carries a nice closed symmetric monoidal category structure, in fact a star-autonomous category structure. I won’t go into all the details, but as one would imagine, the underlying set of the Chu function space $(X, Y)^{(A, B)}$ is the set of Chu space maps $(A, B) \to (X, Y)$. Actually, with a little knowledge of star-autonomous categories, one can pretty much figure out how all this structure works, using the fact that the Chu space $(2, 1)$ (with the obvious pairing) is the dualizing object which turns the abstract self-duality of $Chu$ into a concrete one.

More generally, starting with a closed symmetric monoidal category $V$ with pullbacks, together with a chosen object $D$ (in place of 2) to play the role of a “dualizing object”, one can mimic this construction to produce a $*$-autonomous category $Chu(V, D)$. (See Barr’s paper below.)

The Chu construction has been something of a cottage industry in the past fifteen years or so. There is an embarrassment of riches to choose from, but the article by Vaughan Pratt gives a nice reader-friendly introduction with applications to game semantics of linear logic. Another early article by Barr (who was Chu’s adviser) with lots of categorical details and constructions (again with a view to linear logic) can be found here. Finally, as long as people are talking about Hopf algebras over at the Secret Blogging Seminar, I might as well mention a 1995 paper by Rick Blute on the connection between Hopf algebra representations and linear logic.

Re: Category Theory in Machine Learning

Well well,

I had this idea in August 2008 and have read several books on how n-category theory and its relatives could be transferred into a machine learning model, and/or help a machine learning approach explain several cognitive processes that machines cannot yet perform. n-category theory is very applicable to layer-based generative models. However, the problem is efficiency.

Re: Category Theory in Machine Learning

I think it’s more that there are so many different “educated guesses” about what the next good stepping stone in artificial reasoning/machine learning is that there are few “truly compelling” theories of learning utilising category theory yet.

(There’s also the issue that in many practical machine learning projects, one ideally wants models that are only as general as absolutely necessary to solve the problem. This makes learning from relatively restricted training data practically tractable, and the reduced model complexity increases the likelihood that a “sharp” model instance will be learned.)

Re: Category Theory in Machine Learning

there are so many different “educated guesses” about what the next good stepping stone in artificial reasoning/machine learning

My uneducated guess is that the current main stumbling block is in going from the informal to the formal: not only fuzziness but vagueness must be resolved.
Perhaps there is something to find in machine learning about feature extraction, synthesizing the hypothesis space from scratch (well… the raw data actually).
There is no lack of math crunching ready to fire once “the problem is well defined”; it’s the murky part which is still missing, not the “high maths”.