"A good stock of examples, as large as possible, is indispensable for a thorough understanding of any concept, and when I want to learn something new, I make it my first job to build one." – Paul Halmos

Note: this is a repost of a Facebook status I wrote off the cuff about a year ago, lightly edited. As such it has a different style from my other posts, but I still wanted to put it somewhere where it’d be easier to find and share than Facebook.

Gradient descent, in its simplest where you just subtract the gradient of your loss function, is not dimensionally consistent: if the parameters you’re optimizing over have units of length, and the loss function is dimensionless, then the derivatives you’re subtracting have units of inverse length.

This observation can be used to reinvent the learning rate, which, for dimensional consistency, must have units of length squared. It also suggests that the learning rate ought to be set to something like for some kind of characteristic length scale , which loosely speaking is the length at which the curvature of starts to matter.

It might also make sense to give different parameters different units, which suggests furthermore that one might want a different learning rate for each parameter, or at least that one might want to partition the parameters into different subsets and choose different learning rates for each.

Going much further, from an abstract coordinate-free point of view the extra information you need to compute the gradient of a smooth function is a choice of (pseudo-)Riemannian metric on parameter space, which if you like is a gigantic hyperparameter you can try to optimize. Concretely this amounts to a version of preconditioned gradient descent where you allow yourself to multiply the gradient (in the coordinate-dependent sense) by a symmetric (invertible, ideally positive definite) matrix which is allowed to depend on the parameters. In the first paragraph this matrix was a constant scalar multiple of identity and in the third paragraph this matrix was constant diagonal.

This is an extremely general form of gradient descent, general enough to be equivariant under arbitrary smooth change of coordinates: that is, if you do this form of gradient descent and then apply a diffeomorphism to parameter space, you are still doing this form of gradient descent, with a different metric. For example, if you pick the preconditioning matrix to be the inverse Hessian (in the usual sense, assuming it’s invertible), you recover Newton’s method. This corresponds to choosing the metric at each point to be given by the Hessian (in the usual sense), which is the choice that makes the Hessian (in the coordinate-free sense) equal to the identity. This is a precise version of “the length at which the curvature of starts to matter” and in principle ameliorates the problem where gradient descent performs poorly in narrow valleys (regions where the Hessian (in the usual sense) is poorly conditioned), at least up to cubic and higher order effects.

In general it’s expensive to compute the inverse Hessian, so a more practical thing to do is to use a matrix which approximates it in some sense. And now we’re well on the way towards quasi-Newton methods.

As a warm-up to the subject of this blog post, consider the problem of how to classify matrices up to change of basis in both the source () and the target (). In other words, the problem is to describe the equivalence classes of the equivalence relation on matrices given by

.

It turns out that the equivalence class of is completely determined by its rank . To prove this we construct some bases by induction. For starters, let be a vector such that ; this is always possible unless . Next, let be a vector such that is linearly independent of ; this is always possible unless .

Continuing in this way, we construct vectors such that the vectors are linearly independent, hence a basis of the column space of . Next, we complete the and to bases of in whatever manner we like. With respect to these bases, takes a very simple form: we have if and otherwise . Hence, in these bases, is a block matrix where the top left block is an identity matrix and the other blocks are zero.

Explicitly, this means we can write as a product

where has the block form above, the columns of are the basis of we found by completing , and the columns of are the basis of we found by completing . This decomposition can be computed by row and column reduction on , where the row operations we perform give and the column operations we perform give .

Conceptually, the question we’ve asked is: what does a linear transformation between vector spaces “look like,” when we don’t restrict ourselves to picking a particular basis of or ? The answer, stated in a basis-independent form, is the following. First, we can factor as a composite

where is the image of . Next, we can find direct sum decompositions and such that is the projection of onto its first factor and is the inclusion of the first factor into . Hence every linear transformation “looks like” a composite

of a projection onto a direct summand and an inclusion of a direct summand. So the only basis-independent information contained in is the dimension of the image , or equivalently the rank of . (It’s worth considering the analogous question for functions between sets, whose answer is a bit more complicated.)

The actual problem this blog post is about is more interesting: it is to classify matrices up to orthogonal change of basis in both the source and the target. In other words, we now want to understand the equivalence classes of the equivalence relation given by

.

Conceptually, we’re now asking: what does a linear transformation between finite-dimensional Hilbert spaces “look like”?

Let be a commutative ring. A popular thing to do on this blog is to think about the Morita 2-category of algebras, bimodules, and bimodule homomorphisms over , but it might be unclear exactly what we’re doing when we do this. What are we studying when we study the Morita 2-category?

The answer is that we can think of the Morita 2-category as a 2-category of module categories over the symmetric monoidal category of -modules, equipped with the usual tensor product over . By the Eilenberg-Watts theorem, the Morita 2-category is equivalently the 2-category whose

objects are the categories , where is a -algebra,

morphisms are cocontinuous -linear functors , and

2-morphisms are natural transformations.

An equivalent way to describe the morphisms is that they are “-linear” in that they respect the natural action of on given by

.

This action comes from taking the adjoint of the enrichment of over , which gives a tensoring of over . Since the two are related by an adjunction in this way, a functor respects one iff it respects the other.

So Morita theory can be thought of as a categorified version of module theory, where we study modules over instead of over . In the simplest cases, we can think of Morita theory as a categorified version of linear algebra, and in this post we’ll flesh out this analogy further.

I was staring at a bonfire on a beach the other day and realized that I didn’t understand anything about fire and how it works. (For example: what determines its color?) So I looked up some stuff, and here’s what I learned.

In the previous post we described a fairly straightforward argument, using generating functions and the saddle-point bound, for giving an upper bound

on the partition function . In this post I’d like to record an elementary argument, making no use of generating functions, giving a lower bound of the form for some , which might help explain intuitively why this exponential-of-a-square-root rate of growth makes sense.

The starting point is to think of a partition of as a Young diagram of size , or equivalently (in French coordinates) as a lattice path from somewhere on the y-axis to somewhere on the x-axis, which only steps down or to the right, such that the area under the path is . Heuristically, if the path takes a total of steps then there are about such paths, and if the area under the path is then the length of the path should be about , so this already goes a long way towards explaining the exponential-of-a-square-root behavior.

Let denote the partition function, which describes the number of ways to write as a sum of positive integers, ignoring order. In 1918 Hardy and Ramanujan proved that is given asymptotically by

.

This is a major plot point in the new Ramanujan movie, where Ramanujan conjectures this result and MacMahon challenges him by agreeing to compute and comparing it to what this approximation gives. In this post I’d like to describe how one might go about conjecturing this result up to a multiplicative constant; proving it is much harder.