Well, well, well. Look who’s come crawling back

I’m not sure how one returns to blogging after a two-year hiatus. Probably it should be with something more substantial than this post, which is about regularization in statistics and how I came to understand it. (This account comes from Vapnik’s The Nature of Statistical Learning Theory, which I should have finished by now, but I keep getting distracted by “classes” and “research” and also “television” and “movies.”)

One of the most basic questions I had when initially learning statistics was: Why do we need to regularize least-squares regression? The answer, I was told, was to prevent overfitting. But this wasn’t totally satisfactory; in particular, I never learned why overfitting is a phenomenon that should be expected to occur, or why ridge regression solves it.

The answer, it turns out, is right there on the Wikipedia page. Speaking very generally: we have some data, say $y$, and a class of functions $\mathcal{F}$ from which we want to draw our model. The goal is to minimize the distance between $f \in \mathcal{F}$ and $y$; if we can measure this distance, we have a functional $R(f)$, which we seek to minimize.

Now we want this functional to have the property that if the data of the response variable $y$ changes a tiny bit, then the $f$ at which our functional is minimized only changes a tiny bit. (Because we expect the data we measure to have some level of noise.) But this is not the case in general. Actually, it’s not even the case when our functional is simply $\|X\beta - y\|^2$ (with $X$ the design matrix and $\beta$ the coefficient vector), i.e., when we’re doing least-squares linear regression: if the columns of $X$ are nearly collinear, a small change in $y$ can move the minimizing $\beta$ a long way. However, if we instead consider the related functional $\|X\beta - y\|^2 + \lambda\|\beta\|^2$ for some $\lambda > 0$, then this does have the desired property: our problem is well-posed. And in general, we can regularize many implicit functionals in many paradigms in the same way.
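To make the instability concrete, here is a minimal sketch in Python with NumPy. Everything in it is my own illustrative assumption rather than anything from Vapnik: the tiny design matrix `X` with two nearly collinear columns, the perturbation `dy`, the penalty weight `lam = 0.1`, and the helper names `least_squares` and `ridge`. It fits both functionals before and after nudging the response and compares how far the minimizer moves.

```python
import numpy as np

def least_squares(X, y):
    """Return the coefficients minimizing ||X b - y||^2."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ridge(X, y, lam):
    """Return the coefficients minimizing ||X b - y||^2 + lam * ||b||^2,
    obtained by solving the regularized normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical design matrix: the second column is almost a copy of the first.
X = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
y = np.array([2.0, 2.0, 2.0])

# A tiny perturbation of the response, standing in for measurement noise.
dy = np.array([0.0, 1e-3, -1e-3])

lam = 0.1  # illustrative penalty weight, not a recommended value

# How far does the minimizer move when y changes by dy?
ols_shift = np.linalg.norm(least_squares(X, y + dy) - least_squares(X, y))
ridge_shift = np.linalg.norm(ridge(X, y + dy, lam) - ridge(X, y, lam))

print("size of perturbation in y:  ", np.linalg.norm(dy))
print("shift in least-squares coef:", ols_shift)
print("shift in ridge coef:        ", ridge_shift)
```

With these particular numbers, a perturbation of size about 0.001 in the response moves the least-squares coefficients by more than 1 in norm, while the ridge coefficients barely move. The price is that the ridge solution is biased toward zero, which is the usual trade-off.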

I also learned — though not from Vapnik — that you can derive ridge regression by putting a normal prior on the parameters in your model and then doing standard Bayesian things. But this doesn’t really explain why overfitting is a problem in the way the above account does — at least I don’t think so.
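For the record, the Bayesian derivation can be sketched as follows. The specific assumptions are mine, chosen as the simplest version I know: Gaussian noise with variance $\sigma^2$, an independent zero-mean Gaussian prior with variance $\tau^2$ on each coefficient, and the posterior mode (MAP estimate) as the “standard Bayesian thing.” Here $X$ is the design matrix, $y$ the response, and $\beta$ the coefficient vector.

$$
\begin{aligned}
y \mid \beta &\sim \mathcal{N}(X\beta,\ \sigma^2 I), \qquad \beta \sim \mathcal{N}(0,\ \tau^2 I), \\
-\log p(\beta \mid y) &= \frac{1}{2\sigma^2}\,\|y - X\beta\|^2 + \frac{1}{2\tau^2}\,\|\beta\|^2 + \text{const}, \\
\hat{\beta}_{\text{MAP}} &= \arg\min_{\beta}\ \|y - X\beta\|^2 + \lambda\,\|\beta\|^2, \qquad \lambda = \frac{\sigma^2}{\tau^2}.
\end{aligned}
$$

So under these assumptions the ridge penalty weight is the ratio of noise variance to prior variance: a tighter prior (smaller $\tau^2$) means heavier regularization.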

Question: Is there a natural correspondence between well-posed problems and “good” prior distributions on the parameters? (Meta-question: Is there a way to make that question more precise?)
