What hyper-parameters are, and what to do with them; an illustration with ridge regression

This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for
free here. This is taken from Chapter 7, which deals
with statistical models. In the text below, I explain what hyper-parameters are, and as an example
I run a ridge regression using the {glmnet} package. The book is still being written, so
comments are more than welcome!

Hyper-parameters

Hyper-parameters are parameters of the model that cannot be directly learned from the data.
A linear regression does not have any hyper-parameters, but a random forest, for instance, has several.
You might have heard of ridge regression, lasso and elastic net. These are
extensions of linear models that avoid over-fitting by penalizing large coefficients. These
extensions have hyper-parameters that the practitioner has to tune. There
are several ways to tune them, for example a grid search, a random
search over the grid, or more elaborate methods. To introduce hyper-parameters, let’s get
to know ridge regression, also called Tikhonov regularization.
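To make the difference between a grid search and a random search concrete, here is a minimal sketch in base R. The hyper-parameter names (`alpha`, `lambda`) and the candidate values are assumptions chosen for illustration, not recommendations:

```r
# Full grid: every combination of the candidate values
grid <- expand.grid(
  alpha  = seq(0, 1, by = 0.25),      # 5 illustrative values
  lambda = 10^seq(-2, 2, by = 1)      # 5 illustrative values
)
nrow(grid)  # 5 * 5 = 25 candidate combinations

# Random search: evaluate only a random subset of the grid
set.seed(1234)
random_candidates <- grid[sample(nrow(grid), size = 10), ]
random_candidates
```

A grid search would fit and score the model once per row of `grid`; a random search would do the same for the rows of `random_candidates` only, trading coverage for speed.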

Ridge regression

Ridge regression is used when the data you are working with has a lot of explanatory variables,
or when there is a risk that a simple linear regression might overfit to the training data, because,
for example, your explanatory variables are collinear.
If you are training a linear model and then you notice that it generalizes very badly to new,
unseen data, it is very likely that the linear model you trained overfits the data.
In this case, ridge regression might prove useful. The way ridge regression works might seem
counter-intuitive; it boils down to fitting a worse model to the training data, but in return,
this worse model will generalize better to new data.

The closed form solution of the ordinary least squares estimator is defined as:

\[
\widehat{\beta} = (X'X)^{-1}X'Y
\]

where \(X\) is the design matrix (the matrix made up of the explanatory variables) and \(Y\) is the
dependent variable. For ridge regression, this closed form solution changes a little bit:

\[
\widehat{\beta} = (X'X + \lambda I_p)^{-1}X'Y
\]

where \(\lambda \geq 0\) is a hyper-parameter and \(I_p\) is the identity matrix of dimension \(p\)
(\(p\) is the number of explanatory variables).
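The two closed forms can be translated almost literally into base R. The sketch below uses the built-in `mtcars` data rather than the data used later in this post, and the value of `lambda` is arbitrary; both are assumptions made only to keep the example self-contained:

```r
# Illustrative data: mtcars ships with R
X <- model.matrix(mpg ~ wt + hp + disp, data = mtcars)  # includes an intercept column
Y <- mtcars$mpg

# OLS: (X'X)^{-1} X'Y
beta_ols <- solve(t(X) %*% X, t(X) %*% Y)

# Ridge: (X'X + lambda * I_p)^{-1} X'Y, with an arbitrary lambda
lambda <- 10
p <- ncol(X)
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)

# Side-by-side comparison of the two coefficient vectors
cbind(ols = drop(beta_ols), ridge = drop(beta_ridge))

# Sanity check of the OLS formula against lm()
all.equal(drop(beta_ols), coef(lm(mpg ~ wt + hp + disp, data = mtcars)))
```

This literal translation also shrinks the intercept and does not standardize the variables. `glmnet()` standardizes the predictors by default, leaves the intercept unpenalized and scales its objective differently, so its coefficients for the same `lambda` will not match these.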
The ridge estimator above is the closed form solution to the following optimisation program: minimise

\[
\sum_{i=1}^n \left(y_i - \sum_{j=1}^p x_{ij}\beta_j\right)^2
\]

such that:

\[
\sum_{j=1}^p(\beta_j)^2 \leq c
\]

for some strictly positive \(c\), with larger values of \(\lambda\) corresponding to smaller values of \(c\).
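The link between this constrained program and the closed form can be sketched through the equivalent penalized form, in which \(\lambda\) plays the role of the multiplier on the constraint. This is the standard textbook argument, written in matrix notation:

\[
\min_{\beta} \; (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta
\]

Setting the gradient with respect to \(\beta\) to zero gives:

\[
-2X'(Y - X\beta) + 2\lambda\beta = 0
\;\Longleftrightarrow\;
(X'X + \lambda I_p)\beta = X'Y
\;\Longleftrightarrow\;
\widehat{\beta} = (X'X + \lambda I_p)^{-1}X'Y
\]

With \(\lambda = 0\) the penalty vanishes and the expression reduces to the ordinary least squares estimator.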

The glmnet() function from the {glmnet} package can be used for ridge regression, by setting
the alpha argument to 0 (setting it to 1 would do LASSO, and setting it to a number between
0 and 1 would do elasticnet). But in order to compare linear regression and ridge regression,
let me first divide the data into a training set and a testing set. I will be using the Housing
data from the {Ecdat} package:
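A minimal sketch of such a split, followed by a comparison between a ridge regression with no penalty and an ordinary linear regression, could look like the code below. The 90/10 proportion, the seed and the object names are assumptions, not necessarily the ones used in the book:

```r
library(glmnet)
data("Housing", package = "Ecdat")

# Assumed 90/10 split with an arbitrary seed
set.seed(12345)
n <- nrow(Housing)
train_index <- sample(n, size = round(0.9 * n))

# glmnet() takes a numeric matrix rather than a formula;
# model.matrix() expands factors into dummies, and [, -1] drops the intercept column
x <- model.matrix(price ~ ., data = Housing)[, -1]
y <- Housing$price

train_x <- x[train_index, ];  train_y <- y[train_index]
test_x  <- x[-train_index, ]; test_y  <- y[-train_index]

# Ridge (alpha = 0) with a zero penalty versus least squares on the same matrix
model_lm_ridge <- glmnet(x = train_x, y = train_y, alpha = 0, lambda = 0)
model_lm <- lm.fit(x = cbind("(Intercept)" = 1, train_x), y = train_y)

round(cbind(glmnet = drop(as.matrix(coef(model_lm_ridge))),
            lm     = model_lm$coefficients), 3)
```

Because `glmnet()` fits the model by coordinate descent rather than through the closed form, the two columns should agree closely but not necessarily to the last decimal.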

As you can see, the coefficients are the same. Let’s compute the RMSE for the unpenalized linear
regression:

preds_lm

The RMSE for the linear unpenalized regression is equal to 2077.4197343.
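For reference, the RMSE (root mean squared error) is the square root of the average squared difference between observed and predicted values. A minimal helper, with hypothetical names and a toy check, might be:

```r
# Root mean squared error between observed values and predictions
rmse <- function(truth, pred) {
  sqrt(mean((truth - pred)^2))
}

# Toy check: the errors are 0, 0 and 3, so the RMSE is sqrt(9 / 3) = sqrt(3)
rmse(truth = c(10, 20, 30), pred = c(10, 20, 33))
```

The RMSE is expressed in the same unit as the dependent variable, which makes it easy to interpret.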

Let’s now run a ridge regression, with lambda equal to 100, and see if the RMSE is smaller:

model_ridge

and let’s compute the RMSE again:

preds
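The two steps above, fitting the ridge regression and computing its RMSE on the test set, can be sketched as follows. The split, the seed and the helper functions `fit_ridge()` and `test_rmse()` are assumptions introduced for this example, so the resulting numbers need not reproduce the ones quoted in this post:

```r
library(glmnet)
data("Housing", package = "Ecdat")

rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

set.seed(12345)  # assumed seed
train_index <- sample(nrow(Housing), size = round(0.9 * nrow(Housing)))
x <- model.matrix(price ~ ., data = Housing)[, -1]
y <- Housing$price

# Hypothetical helper: ridge regression (alpha = 0) on the training rows
fit_ridge <- function(lambda) {
  glmnet(x = x[train_index, ], y = y[train_index], alpha = 0, lambda = lambda)
}

# Hypothetical helper: RMSE of a fitted model on the held-out rows
test_rmse <- function(model) {
  preds <- predict(model, newx = x[-train_index, ])
  rmse(y[-train_index], as.vector(preds))
}

model_lm_ridge <- fit_ridge(0)    # no penalty
model_ridge    <- fit_ridge(100)  # lambda = 100, as in the text

c(unpenalized = test_rmse(model_lm_ridge), ridge = test_rmse(model_ridge))
```

Which of the two models comes out ahead depends on the particular split, which is one reason to rely on cross-validation rather than on a single test set.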

The RMSE for the linear penalized regression is equal to 2072.6117757, which is smaller than before.
But which value of lambda gives the smallest RMSE? To find out, one must run the model over a grid of
lambda values and pick the model with the lowest RMSE. This procedure is available in the cv.glmnet()
function, which uses cross-validation to pick the best value for lambda:

best_model

## [1] 66.07936
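A call to `cv.glmnet()` along these lines could be sketched as below. The split, the seed and the custom grid are assumptions, and because the folds are drawn at random the value returned will not necessarily equal the one shown above:

```r
library(glmnet)
data("Housing", package = "Ecdat")

set.seed(12345)  # assumed seed; it drives both the split and the CV folds
train_index <- sample(nrow(Housing), size = round(0.9 * nrow(Housing)))
x <- model.matrix(price ~ ., data = Housing)[, -1]
y <- Housing$price

# 10-fold cross-validation (the default) over glmnet's own lambda sequence
best_model <- cv.glmnet(x = x[train_index, ], y = y[train_index], alpha = 0)
best_model$lambda.min  # lambda with the lowest cross-validated error

# A custom grid can be supplied through the lambda argument (values are arbitrary)
lambda_grid <- 10^seq(4, 0, length.out = 50)
best_model_grid <- cv.glmnet(x = x[train_index, ], y = y[train_index],
                             alpha = 0, lambda = lambda_grid)
best_model_grid$lambda.min
```

For a continuous response, `cv.glmnet()` scores each lambda by mean squared error by default, so minimising it is equivalent to minimising the cross-validated RMSE.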

According to cv.glmnet() the best value for lambda is 66.0793576. In the
next section, we will implement cross validation ourselves, in order to find the hyper-parameters
of a random forest.