Bill Thurston passed away yesterday at the far-too-young age of 65. My thoughts and prayers (for what they’re worth) are with his family.

I never read any of Dr. Thurston’s mathematical work in great detail (geometric topology never really being my thing), but from the popular surveys I read, he was of course brilliant. I can’t really do justice to his mathematical accomplishments, but I’ll quickly say for non-mathematical readers that he was probably best known for his geometrization conjecture, which helped pave the way for the proof last decade of the Poincaré conjecture.

What I did read were some of his writings on the philosophy of math; not the boring questions of how “real” mathematical entities are, but thoughts on how we do mathematics. I am incredibly grateful that I had the opportunity to let Dr. Thurston know how much those writings meant to me a couple of years ago. It’s nice to think that I contributed, even infinitesimally, to the net happiness in his life.

People who know me well know that I don’t have any particular objection to targeted advertising. Like most things, it can be used for good purposes or bad purposes, but I think on balance, it’s probably better than untargeted advertising. Either way, it’s more efficient for many advertisers, which is good, because that means they can invest more capital into developing and improving products.

I’m not sure whether YouTube targets the ads that sometimes play before videos, either based on the user or the video. But if so: YouTube, I can’t imagine that there are that many people who want to see something about “The Purina Cat Chow Real Stories Project” when they’re looking for a Nick Cave song.

I’m not sure how one returns to blogging after a two-year hiatus. Probably it should be with something more substantial than this post, which is about regularization in statistics and how I came to understand it. (This account comes from Vapnik’s Nature of Statistical Learning Theory, which I should have finished by now but I keep getting distracted by “classes” and “research” and also “television” and “movies.”)

One of the most basic questions I had when initially learning statistics was: Why do we need to regularize least-squares regression? The answer, I was told, was to prevent overfitting. But this wasn’t totally satisfactory; in particular, I never learned why overfitting is a phenomenon that should be expected to occur, or why ridge regression solves it.

The answer, it turns out, is right there on the Wikipedia page. Speaking very generally: we have some data, say , and a class of functions from which we want to draw our model. The goal is to minimize the distance between and — if we can measure this distance, we have a functional , which we seek to minimize.

Now we want this functional to have the property that if the data of the response variable changes a tiny bit, then the at which our functional is minimized only changes a tiny bit. (Because we expect the data we measure to have some level of noise.) But this is not the case in general — actually, it’s not even the case when our functional is simply , i.e., when we’re doing least-squares linear regression. However, if we instead consider the related functional , then this does have the desired property — our problem is well-posed. And in general, we can regularize many implicit functionals in many paradigms in the same way.

I also learned — though not from Vapnik — that you can derive ridge regression by putting a normal prior on the parameters in your model and then doing standard Bayesian things. But this doesn’t really explain why overfitting is a problem in the way the above account does — at least I don’t think so.

Question: Is there a natural correspondence between well-posed problems and “good” prior distributions on the parameters? (Meta-question: Is there a way to make that question more precise?)

I’ve been on spring break for the past week, so this is even older news than it would have been, but Alex Chilton died earlier this month. If you don’t know who Alex Chilton is, he started off as the singer of the Box Tops, who scored a late-’60s hit with “The Letter.” But in certain circles he’s better known as the lead singer and songwriter for the defining power pop band, Big Star. Big Star’s music was a formative influnce on me as a person, if not as a mathematician, so I’m breaking from my usual routine of crackpot mathematics and expository articles to pay tribute.

Most of what I could say has already been said better, so I’ll link to some other tributes shortly. People have talked about how Big Star were a major influence on all sorts of bands in the ’80s and ’90s up to the present, from R.E.M. to the Replacements to Elliott Smith. People have talked about the timeless nature of Big Star’s songs, about how “Thirteen” is maybe the ultimate expression of what it’s like to be that age. But one thing that’s perhaps been lost in the chatter is the fact that so much of Chilton’s work with Big Star is just plain wonderful music. Witness “Nighttime” and “Blue Moon,” back-to-back tracks from Big Star’s final album, Third (also known as Sister Lover) — to my ears, two of the most achingly beautful pop songs ever recorded.

The best tribute I’ve read — maybe the definitive one — is Paul Westerberg’s in the New York Times. (Westerberg wrote and performed, with the Replacements, the song that gave this post its title.) But the best possible tribute, in my mind, would be if this blog post turned one person on to the music of the late, great Alex Chilton. It might change your life. More likely it won’t. But either way, it’s great damned music, and I hope — and expect — that it’ll live on far after I’m gone.

I found this little gem in an old post on Terry Tao’s blog. There’s not really enough content in it to merit an entire Extremal Toolbox post, but it’s too cool not to point out.

Theorem. Graphs of order and girth at least 5 have edges.

Proof. Suppose that has edges. Define the function to be the “adjacency characteristic function.” Now, by Cauchy-Schwarz:

.

Clearly for fixed, is unbounded as . But the first expression is times the number of (possibly degenerate) 4-cycles in the graph; it’s easy to check that there are degenerate 4-cycles, so as n is unbounded our graph must contain a 4-cycle. QED

Now, it’s possible to get better bounds by cleverly doing “surgery” on the graph and just using pigeonhole. (See here for details.) But it’s tricky and far less beautiful than the argument with Cauchy-Schwarz, which really demonstrates one way in which C-S can be thought of as “strengthening” pigeonhole.

Proof. Consider a tree on n vertices with root . Assign to each edge an element of the vector space , obtained by setting 1s in the the coordinates corresponding to v and w and 0s elsewhere. I claim that these vectors are linearly independent; for suppose otherwise, and letthe vectors corresponding to sum to 0. There is a natural “distance” function on E w/r/t ; let have maximal distance in S, and suppose WLOG that t is farther than s from . Then the coordinate corresponding to t is nonzero for exactly one element of S, and the sum over all elements of S must therefore be nonzero. This is a contradiction. So in particular these |E| = n-1 elements are distinct and nonzero, which means (by Pigeonhole) that there are at most of them.

Since it’s not yet on Thomas’ blog, and I need to use the data, I thought I’d make it publicly available. All credit goes to Thomas Sauvaget for the sequence-to-HTML code, and to Polymath for the sequence itself. (Probably someone specific discovered it, but I don’t know who.) Tables below the fold.