I will use ridge regression to estimate $\beta$. Ridge regression involves a tuning parameter, the penalty intensity $\lambda$. If I were given a grid of candidate $\lambda$ values, I would use cross validation to select the optimal $\lambda$. However, the grid is not given, so I need to design it first. For that I need to choose, among other things, a maximum value $\lambda_{max}$.

Question: How do I sensibly choose $\lambda_{max}$ in ridge regression?

There needs to be a balance between

- a $\lambda_{max}$ that is "too large", leading to wasteful computations when evaluating the performance of (possibly many) models that are penalized too harshly; and
- a $\lambda_{max}$ that is "too small", leading to a forgone opportunity to penalize more intensely and obtain better performance.

(Note that the answer is simple in the case of LASSO; there you take $\lambda_{max}$ such that all coefficients are set exactly to zero for any $\lambda \geq \lambda_{max}$.)
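For reference, under the common $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ scaling used by glmnet and scikit-learn, that LASSO $\lambda_{max}$ equals $\max_j |x_j^\top y|/n$ by the KKT conditions at $\beta=0$. A minimal numpy sketch (the random data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# For the lasso objective (1/(2n))*||y - X b||^2 + lam*||b||_1,
# the KKT condition at b = 0 is |x_j' y| / n <= lam for every j,
# so all coefficients are exactly zero once lam >= max_j |x_j' y| / n.
lam_max = np.max(np.abs(X.T @ y)) / n
print(lam_max)
```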

It took me several re-reads to figure out exactly what you were asking there. Can one not actually take the limiting value (since all the coefficients will be set to zero -- you can figure out the fit easily enough)? Of course you can't then use exponentially-distanced points, but one might (for example) use points uniform in the inverse of $\lambda$, or one might use a convenient quantile function to place the points.
– Glen_b♦ Aug 3 '16 at 13:48


@Glen, thank you. I reformulated the question; hopefully it is clearer now. Actually, I would probably take the limiting value if only I knew it. This is what the question is about. Do you have an idea what the limiting value is? I thought it is $+\infty$...
– Richard Hardy Aug 3 '16 at 14:08


Once, I tried reading the glmnet source code to answer this question. It did not go well.
– Matthew Drury Aug 3 '16 at 14:11


The effect of $\lambda$ in the ridge estimator is that it shrinks the singular values $s_i$ of $X$ via terms like $s_i^2/(s_i^2+\lambda)$. This suggests that selecting $\lambda$ much larger than $s_1^2$ will shrink everything very strongly. I suspect that $\lambda=\|X\|^2=\sum s_i^2$ will be too big for all practical purposes. I usually normalize my lambdas by the squared norm of $X$ and have a grid that goes from $0$ to $1$.
– amoeba Aug 3 '16 at 14:29

1 Answer

The effect of $\lambda$ in the ridge regression estimator is that it "inflates" the singular values $s_i$ of $X$ via terms like $(s^2_i+\lambda)/s_i$. Specifically, if the SVD of the design matrix is $X=USV^\top$, then $$\hat\beta_\mathrm{ridge} = V \frac{S}{S^2+\lambda I} U^\top y.$$
This is explained multiple times on our website, see e.g. @whuber's detailed exposition here: The proof of shrinking coefficients using ridge regression through "spectral decomposition".
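As a quick sanity check, the SVD form can be verified against the direct ridge solution $(X^\top X+\lambda I)^{-1}X^\top y$; a minimal numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = 2.0

# Direct ridge solution: (X'X + lam*I)^{-1} X'y
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Equivalent SVD form: beta = V diag(s_i / (s_i^2 + lam)) U'y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# The two agree to machine precision.
print(np.max(np.abs(beta_direct - beta_svd)))
```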

This suggests that selecting $\lambda$ much larger than $s_\mathrm{max}^2$ will shrink everything very strongly. I suspect that $$\lambda=\|X\|_F^2=\sum s_i^2$$ will be too big for all practical purposes.

I usually normalize my lambdas by the squared Frobenius norm of $X$ and have a cross-validation grid of the normalized penalty that goes from (nearly) $0$ to $1$ on a log scale.
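A minimal sketch of such a normalized grid (the number of grid points and the lower endpoint $10^{-6}$ are illustrative assumptions, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))

# Normalize by the squared Frobenius norm, ||X||_F^2 = sum_i s_i^2, and lay
# out a log-spaced grid of normalized penalties kappa in (0, 1]; the actual
# ridge penalties are then lam = kappa * ||X||_F^2.
fro2 = np.linalg.norm(X, "fro") ** 2
kappas = np.logspace(-6, 0, 25)   # 1e-6, ..., 1 (25 points, log-spaced)
lambdas = kappas * fro2
print(lambdas[0], lambdas[-1])
```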

Having said that, no value of $\lambda$ can be seen as truly "maximum", in contrast to the lasso case. Imagine that the predictors are exactly orthogonal to the response, i.e. that the true $\beta=0$. Then any finite $\lambda$ at any finite sample size $n$ will yield $\hat \beta \ne 0$, and hence the fit could benefit from even stronger shrinkage.

Sorry, I abandoned the topic for the time being and I did not have enough time to think deeper about it. I am giving it an upvote now, but I would like to postpone accepting the answer until I have time to convince myself it gives what I really need (I have some reservations, but currently I do not have time to explore them in detail). I hope this is fine with you.
– Richard Hardy Nov 2 '16 at 10:10

What's wrong with compactifying lambda to the $[0,1]$ range, as I specified in my other question? In the end, what kind of grid you place on this range is what matters. I have seen three different grids on the internet: linear, log, and sqrt. But I think it should be related to the geometry of the problem at hand; otherwise it is very ad hoc.
– Cowboy Trader Feb 8 '17 at 11:20

@CagdasOzgenc Here I suggested a principled way to choose a maximal $\lambda$. If you use $\kappa$ instead of $\lambda$, then the maximum value of $\lambda$ will be a function of the minimal value of $\kappa$ that you end up using. I don't see any principled way to choose such a minimal value.
– amoeba Feb 8 '17 at 11:24


@CagdasOzgenc Zero does not make sense; it corresponds to infinite lambda. I am talking about your minimal non-zero kappa, i.e. the step size. As I said, I don't see how you can choose it. Any choice IMHO will be more ad hoc than what I am suggesting here.
– amoeba Feb 8 '17 at 11:51