Regularization for Simplicity: L₂ Regularization

Consider the following generalization curve, which shows the loss
for both the training set and validation set against the number of
training iterations.

Figure 1. Loss on training set and validation set.

Figure 1 shows a model in which training loss gradually decreases but validation loss eventually rises. In other words, this generalization curve shows that the model is overfitting to the data in the training set. Channeling our inner Ockham, perhaps we could prevent overfitting by penalizing complex models, a principle called regularization.

In other words, instead of simply aiming to minimize loss (empirical risk minimization):

$$\text{minimize}(\text{Loss}(\text{Data} \mid \text{Model}))$$

we'll now minimize loss + complexity, which is called structural
risk minimization:

$$\text{minimize}(\text{Loss}(\text{Data} \mid \text{Model}) + \text{complexity}(\text{Model}))$$
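To make the objective concrete, here is a minimal sketch in Python. It assumes a linear model with mean squared error as the loss term and the sum of squared weights as the complexity measure (the L₂ penalty named in the title); the names `structural_risk`, `l2_penalty`, and `lam` are illustrative, not from any particular library.

```python
import numpy as np

def l2_penalty(weights):
    """complexity(Model): here, the sum of squared weights (an L2 penalty)."""
    return np.sum(weights ** 2)

def structural_risk(weights, X, y, lam):
    """Loss(Data|Model) + lam * complexity(Model) for a linear model."""
    predictions = X @ weights                     # linear model predictions
    loss = np.mean((predictions - y) ** 2)        # mean squared error (the loss term)
    return loss + lam * l2_penalty(weights)       # the regularized objective
```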

Our training optimization algorithm is now a function of
two terms: the loss term, which measures how well the
model fits the data, and the regularization term,
which measures model complexity.
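The two terms show up separately in the gradient as well: one component pulls the weights toward fitting the data, and the other shrinks them toward zero. The sketch below reuses the linear-model assumptions above; `gradient_step` and `learning_rate` are illustrative names.

```python
def gradient_step(weights, X, y, lam, learning_rate=0.01):
    """One gradient-descent step on loss + lam * complexity."""
    n = len(y)
    grad_loss = (2.0 / n) * X.T @ (X @ weights - y)   # gradient of the MSE loss term
    grad_reg = 2.0 * lam * weights                    # gradient of lam * sum(w^2)
    return weights - learning_rate * (grad_loss + grad_reg)
```

Larger values of `lam` weight the complexity term more heavily, trading some training fit for smaller weights and, ideally, better generalization.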