Note that I'm not yet trying to find an optimal $\lambda$, I'm simply estimating $\hat w^{ridge}$ for a given $\lambda$. However, something seems off. When I enter $\lambda = 0$ I get the expected OLS result. I checked this by applying lm.ridge(lambda = 0) on the same dataset and it gives me the same coefficients. However, when I input any other penalty, like $\lambda=2$ or $\lambda=5$ my coefficients and the coefficients given by lm.ridge disagree wildly. I tried looking at the implementation of lm.ridge but I couldn't work out what it does (and therefore what it does differently).

Could anyone explain why there is a difference between my results and the results from lm.ridge? Am I doing something wrong in my code? I've tried playing around with scale() but couldn't find an answer there.

Okay, so based on responses below, I've gotten a somewhat clearer understanding of the problem. I've also closely re-read the section about RR in The Elements of Statistical Learning by Hastie, Tibshirani and Friedman, where I discovered that the intercept is often estimated simply as the mean of the response. It seems that many sources on RR online are overly vague. I actually suspect many writers have never implemented RR themselves and might not have realized this, as many of them leave out three important facts:

The intercept is not penalized in the normal case; the formula above applies only to the other coefficients.

RR is not equivariant under scaling, i.e. different scales give different results even for the same data, so the predictors should be standardized.

When the data are standardized (predictors centered and scaled, response centered), the intercept is estimated separately, simply as the mean of the response.
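The non-equivariance point is easy to check numerically. Below is a small self-contained sketch (data and the `fit` helper are made up for illustration): with OLS, multiplying a column by 100 just divides its coefficient by 100, but with a ridge penalty that relationship breaks.

```r
set.seed(4)
X <- matrix(rnorm(40 * 2), 40, 2)
y <- rnorm(40)
lambda <- 2

# ridge coefficients for a given design matrix (no intercept, for simplicity)
fit <- function(X) solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

b1 <- fit(X)

X2 <- X
X2[, 1] <- X2[, 1] * 100   # rescale the first predictor
b2 <- fit(X2)

# With OLS, b2[1] * 100 would equal b1[1] exactly; with ridge it does not,
# because the penalty treats the rescaled column differently.
c(b1[1], b2[1] * 100)
```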

I still don't get results equivalent to lm.ridge though, but it might just be a question of translating the formula back into the original scales. However, I can't seem to work out how to do this. I thought it would just entail multiplying by the standard deviation of the response and adding the mean, as usual for standard scores, but either my function is still wrong or rescaling is more complex than I realize.

@Dougal I was thinking it might be something like that but I tried standardizing by using scale() in my function, but then I didn't even get the same coefficients for $\lambda=0$.
– Johan Falkenjack Oct 28 '15 at 15:35

3 Answers

First, very simply, I don't think your call to solve looks right. This is what I would expect:

solve(t(X) %*% X + lambda.diag, t(X) %*% y)

Your code seems to be explicitly computing a matrix inverse and then multiplying. This is mathematically correct, but computationally wasteful and numerically less stable. It is always better to solve the system of equations directly. I've gotten in the habit of reading equations like $y = X^{-1}z$ as "solve the system of equations $Xy = z$ for $y$."
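To make the point concrete, here is a small sketch (with made-up data) showing that both forms give the same answer, so switching to the direct solve costs nothing:

```r
set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
y <- rnorm(100)
lambda <- 2

A <- t(X) %*% X + lambda * diag(ncol(X))
b <- t(X) %*% y

beta_solve <- solve(A, b)       # solve the system A beta = b directly (preferred)
beta_inv   <- solve(A) %*% b    # explicit inverse, then multiply (avoid)

all.equal(beta_solve, beta_inv) # same coefficients, but solve(A, b) is more stable
```

The difference matters most when `A` is ill-conditioned, e.g. with highly correlated predictors and small $\lambda$.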

On a more mathematical note, you should not have to include an intercept term when fitting a ridge regression.

It is very important, when applying penalized methods, to standardize your data (as you point out with your comment about scale()). It's also important to realize that penalties are generally not applied to the intercept term, as penalizing it would cause the model to violate the attractive property that the average prediction equals the average response (on the training data).

Together, these facts (centered data, no intercept penalty) imply that the intercept parameter estimate in a ridge regression is known a priori: it is zero.
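You can verify this numerically. In this sketch (made-up data), I deliberately keep an intercept column but leave it unpenalized; with centered predictors and a centered response, its estimate comes out as zero:

```r
set.seed(2)
X <- matrix(rnorm(50 * 2), 50, 2)
y <- rnorm(50)
lambda <- 2

Xc <- scale(X, center = TRUE, scale = FALSE)  # center the predictors
yc <- y - mean(y)                             # center the response

Xi <- cbind(1, Xc)                  # explicit intercept column
P  <- lambda * diag(c(0, 1, 1))     # zero penalty on the intercept entry

beta <- solve(t(Xi) %*% Xi + P, t(Xi) %*% yc)
beta[1]   # the intercept estimate: zero up to floating-point noise
```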

The coefficient vector from ridge regression is the solution to the penalized optimization problem

$$\hat\beta = \underset{\beta}{\operatorname{argmin}} \; \sum_{i=1}^n \Big(y_i - \sum_{j=0}^q \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^q \beta_j^2$$

Since the intercept $\beta_0$ is not penalized, setting the derivative of this objective with respect to $\beta_0$ to zero gives

$$\sum_{i=1}^n \Big(y_i - \sum_{j=0}^q \beta_j x_{ij}\Big) x_{0i} = 0$$

But $x_{0i}$ are the entries in the model matrix corresponding to the intercept, so $x_{0i} = 1$ always. So we get

$$\sum_{i=1}^n y_i - \sum_{j=0}^q \beta_j \sum_{i=1}^n x_{ij} = 0$$

The first term, the sum over $y$, is zero because $y$ is centered (a good check of understanding is to work out what happens if you don't center $y$). In the second term, each predictor is centered, so the sum over $i$ is zero for every predictor $j$ except the intercept. For the intercept, the inner sum over $i$ comes out to $n$ (it's $1 + 1 + 1 + \cdots$). So this whole thing reduces to

$$n \beta_0 = 0$$

i.e. $\hat\beta_0 = 0$.

So, you do not need to bind an intercept column onto your model matrix. Your function should either expect standardized data (and, if you plan on making it public, it should check that this is so) or standardize the data itself. Once this is done, the intercept is known to be zero. I'll leave it as an exercise to work out what the intercept should be when you translate the coefficients back to the un-normalized scale.

I still don't get results equivalent to lm.ridge though, but it might just be a question of translating the formula back into the original scales. However, I can't seem to work out how to do this. I thought it would just entail multiplying by the standard deviation of the response and adding the mean, as usual for standard scores, but either my function is still wrong or rescaling is more complex than I realize.

It's a bit more complicated, but not too bad if you are careful. Here's a place where I answered a very similar question:
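In case it helps, here is one way the back-transformation can work, as a sketch with made-up data (this uses sd(), i.e. an $n-1$ divisor, so it will not reproduce lm.ridge exactly, but the predictions on both scales agree): fit on standardized predictors and a centered response, then divide each slope by its predictor's standard deviation and recover the intercept from the means.

```r
set.seed(3)
X <- matrix(rnorm(60 * 2), 60, 2)
X[, 2] <- X[, 2] * 10   # give the columns different scales
y <- rnorm(60)
lambda <- 2

mx <- colMeans(X)
sx <- apply(X, 2, sd)
Xs <- scale(X)          # standardized predictors
yc <- y - mean(y)       # centered response

# ridge fit on the standardized scale (intercept known to be zero)
bs <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)

b  <- bs / sx                 # slopes on the original scale
b0 <- mean(y) - sum(b * mx)   # intercept on the original scale

# sanity check: predictions agree on both scales
all.equal(drop(mean(y) + Xs %*% bs), drop(b0 + X %*% b))
```

Note that only the response's mean enters here, not its standard deviation, because $y$ was centered but not scaled; if you also scale $y$, the slopes get an extra factor of sd(y).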

Okay, so if I understand you correctly I need to do something like this: fitRidge <- function(X, Y, lambda) { X <- scale(X); penalties <- lambda * diag(ncol(X)); wz <- c(mean(Y), solve(t(X) %*% X + penalties, t(X) %*% Y)); return(wz) }, assuming the intercept is just the mean of Y, as Hastie says. I'm guessing this would give me a RR for standardized values; however, I can't seem to figure out how to translate the formula back into something on the original scale. I thought it would just entail multiplying with the sd of Y and adding the mean of Y, but apparently not.
– Johan Falkenjack Oct 28 '15 at 21:29

Not a complete answer here, but some relevant notes for those who want to implement ridge regression themselves and compare their results to lm.ridge.

lm.ridge uses SVD to estimate the coefficients, which is pretty straightforward to implement. A relevant explanation of how to do that is in the upvoted comment on this question: Relevant answer

It also standardizes the predictors in a different way from the scale() function: it divides by n instead of n − 1. That can explain small deviations in the estimates.
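Putting those two observations together, here is a sketch of the SVD-based fit as I read it from the lm.ridge source (the function name and variable names are mine, and this is my reconstruction, not the actual MASS code): center, scale with an n divisor, solve via SVD, then map back to the original scale.

```r
# Sketch of an SVD-based ridge fit in the style of lm.ridge (my reconstruction)
ridgeSVD <- function(X, y, lambda) {
  n  <- nrow(X)
  Xm <- colMeans(X)
  ym <- mean(y)
  Xc <- sweep(X, 2, Xm)                 # center predictors
  Xscale <- sqrt(colSums(Xc^2) / n)     # note: divisor n, not n - 1 as in scale()
  Xs <- sweep(Xc, 2, Xscale, "/")       # scaled predictors
  s  <- svd(Xs)
  rhs <- t(s$u) %*% (y - ym)            # project centered response onto left SVs
  k  <- s$v %*% (s$d * rhs / (s$d^2 + lambda))  # ridge solution, scaled data
  coef <- drop(k) / Xscale              # slopes back on the original scale
  c(intercept = ym - sum(coef * Xm), coef)
}

set.seed(5)
X <- matrix(rnorm(50 * 3), 50, 3)
y <- rnorm(50)
ridgeSVD(X, y, 0)   # with lambda = 0 this should reproduce the OLS fit of lm(y ~ X)
```

The lambda = 0 check against lm() is a useful smoke test because at zero penalty the scaling convention no longer matters.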

There is also something up with the intercept (lm.ridge deals with it in a different way than I do), as pointed out, but I couldn't step through everything inside lm.ridge to work out exactly what it does with the intercept term.