
Regularization (mathematics)

Figure: The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting λ, the weight of the regularization term.


One particular use of regularization is in the field of classification. Empirical learning of classifiers (learning from a finite data set) is always an underdetermined problem, because it attempts to infer a function of any x given only examples x_1, x_2, ..., x_n.

A regularization term (or regularizer) R(f) is added to a loss function:

min_f ∑_{i=1}^{n} V(f(x_i), y_i) + λ R(f)

where V is an underlying loss function that describes the cost of predicting f(x) when the label is y, such as the square loss or hinge loss, and λ is a parameter which controls the importance of the regularization term. R(f) is typically chosen to impose a penalty on the complexity of f. Concrete notions of complexity used include restrictions on smoothness and bounds on the vector space norm.[1]
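
As a concrete illustration, the regularized empirical risk for the square loss with an L2 penalty can be computed directly. The sketch below is minimal and illustrative; the function name, the toy data and the value of λ are arbitrary choices, not part of any particular library.

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """Square loss plus an L2 penalty R(w) = ||w||^2, weighted by lam."""
    residuals = X @ w - y                # f(x_i) - y_i for each sample
    data_term = np.mean(residuals ** 2)  # empirical square loss V
    penalty = lam * np.dot(w, w)         # lambda * R(w)
    return data_term + penalty

# toy data: 5 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = np.zeros(3)
print(regularized_risk(w, X, y, lam=0.1))
```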

A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution, as depicted in the figure. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

Regularization can be used to learn simpler models, induce models to be sparse, introduce group structure into the learning problem, and more.

Regularization can be motivated as a technique to improve the generalizability of a learned model.

The goal of this learning problem is to find a function that fits or predicts the outcome (label) while minimizing the expected error over all possible inputs and labels. The expected error of a function f_n is:

I[f_n] = ∫_{X×Y} V(f_n(x), y) ρ(x, y) dx dy

where X and Y are the input and label spaces and ρ(x, y) is the (unknown) joint distribution over inputs and labels.

Typically in learning problems, only a subset of input data and labels is available, measured with some noise. Therefore, the expected error is unmeasurable, and the best available surrogate is the empirical error over the n available samples:

I_S[f_n] = (1/n) ∑_{i=1}^{n} V(f_n(x_i), y_i)

Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error. If measurements (e.g. of x_i) were made with noise, this model may suffer from overfitting and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.
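
This effect can be reproduced with a small numerical experiment. The sketch below is illustrative only (the polynomial degree, noise level and λ are arbitrary choices): it fits a flexible polynomial model to noisy samples with and without an L2 penalty and compares the error on freshly drawn points.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(x, degree=9):
    """Polynomial feature map, giving a flexible function space."""
    return np.vander(x, degree + 1, increasing=True)

def fit(X, y, lam):
    """Least squares with an L2 penalty (lam = 0 gives the unregularized fit)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# noisy samples from an underlying smooth function
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for lam in (0.0, 1e-3):
    w = fit(features(x_train), y_train, lam)
    test_err = np.mean((features(x_test) @ w - y_test) ** 2)
    print(f"lambda={lam}: test error {test_err:.3f}")
```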

When learning a linear function f, characterized by an unknown vector w such that f(x) = w⋅x, adding the squared L2-norm of w as the regularization term corresponds to Tikhonov regularization. This is one of the most common forms of regularization; it is also known as ridge regression, and is expressed as:

min_w (1/n) ∑_{i=1}^{n} (y_i − w⋅x_i)² + λ‖w‖₂²

The learning problem with the least squares loss function and Tikhonov regularization can be solved analytically. Written in matrix form, with X the matrix of inputs and Y the vector of labels, the optimal w is the one for which the gradient of the loss function with respect to w is 0:

∇_w [(1/n)‖Xw − Y‖² + λ‖w‖₂²] = (2/n) Xᵀ(Xw − Y) + 2λw = 0

which gives the closed-form solution

w = (XᵀX + λnI)⁻¹ XᵀY

By construction of the optimization problem, other values of w would give larger values for the loss function. This can be verified by examining the second derivative ∇_ww = (2/n)XᵀX + 2λI, which is positive definite for λ > 0.
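
A minimal numerical check of the closed-form solution above (the helper name and the toy data are illustrative): the returned w should make the gradient of the regularized loss vanish.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Minimizer of (1/n)||Xw - Y||^2 + lam * ||w||^2, from the zero-gradient condition."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=50)
lam = 0.5

w = ridge_closed_form(X, Y, lam)
# gradient of the regularized loss at w should be (numerically) zero
grad = (2 / X.shape[0]) * X.T @ (X @ w - Y) + 2 * lam * w
print(w, np.linalg.norm(grad))
```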

Early stopping can be viewed as regularization in time. Intuitively, a training procedure like gradient descent will tend to learn more and more complex functions as the number of iterations increases. By regularizing on time, the complexity of the model can be controlled, improving generalization.

In practice, early stopping is implemented by training on a training set and measuring accuracy on a statistically independent validation set. The model is trained until performance on the validation set no longer improves. The model is then tested on a testing set.

The exact solution to the unregularized least squares learning problem minimizes the empirical error, but may fail to generalize and minimize the expected error. By limiting T, the number of gradient descent iterations, the problem is regularized in time, which may improve its generalization.

The procedure above is equivalent to restricting the number of gradient descent iterations used to minimize the empirical risk

I_S[w] = (1/n)‖Xw − Y‖²

with the update rule w_{t+1} = w_t − γ ∇_w I_S[w_t], where γ is the step size.
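
A minimal sketch of early stopping for this least squares risk, assuming a plain gradient descent loop with a held-out validation set (the step size, the iteration cap T and the toy data are illustrative choices):

```python
import numpy as np

def gradient_descent_early_stop(X, Y, X_val, Y_val, step=0.01, T=1000):
    """Gradient descent on the empirical risk (1/n)||Xw - Y||^2,
    stopped after at most T iterations or when validation error stops improving."""
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_val = w.copy(), np.inf
    for t in range(T):
        grad = (2 / n) * X.T @ (X @ w - Y)   # gradient of the empirical risk
        w = w - step * grad
        val_err = np.mean((X_val @ w - Y_val) ** 2)
        if val_err < best_val:
            best_w, best_val = w.copy(), val_err
        else:
            break                             # validation error no longer improves
    return best_w

rng = np.random.default_rng(0)
X, X_val = rng.normal(size=(80, 5)), rng.normal(size=(20, 5))
w_true = rng.normal(size=5)
Y = X @ w_true + 0.5 * rng.normal(size=80)
Y_val = X_val @ w_true + 0.5 * rng.normal(size=20)
print(gradient_descent_early_stop(X, Y, X_val, Y_val))
```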

Figure: A comparison between the L1 ball and the L2 ball in two dimensions gives an intuition on how L1 regularization achieves sparsity.

Enforcing a sparsity constraint on w{\displaystyle w} can lead to simpler and more interpretable models. This is useful in many real-life applications such as computational biology. An example is developing a simple predictive test for a disease in order to minimize the cost of performing medical tests while maximizing predictive power.

A sensible sparsity constraint is the L0 norm ‖w‖₀, defined as the number of non-zero elements in w. Solving an L0-regularized learning problem, however, has been demonstrated to be NP-hard.[2]

Consider a problem min_{w∈H} F(w) + R(w), where F is convex, continuous, and differentiable with Lipschitz continuous gradient (such as the least squares loss function), and R is convex, continuous, and proper. The proximal method for solving such a problem proceeds as follows. First define the proximal operator

prox_R(v) = argmin_{w∈H} { R(w) + (1/2)‖w − v‖² }

and then iterate

w_{k+1} = prox_{γR}(w_k − γ ∇F(w_k))

where γ is a step size.
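
For example, with F the least squares loss and R(w) = λ‖w‖₁, the proximal operator is coordinate-wise soft thresholding, and the proximal method reduces to the ISTA iteration. The sketch below is illustrative; the step size is set from the Lipschitz constant of ∇F, and the data are arbitrary.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinate-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, Y, lam, step, iters=500):
    """Proximal gradient (ISTA) for (1/n)||Xw - Y||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2 / n) * X.T @ (X @ w - Y)               # gradient of the smooth part F
        w = soft_threshold(w - step * grad, step * lam)  # proximal step on R
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]      # sparse ground truth
Y = X @ w_true + 0.1 * rng.normal(size=100)
# step size 1/L, where L = 2*||X||_2^2 / n is the Lipschitz constant of grad F
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / X.shape[0])
print(np.round(ista(X, Y, lam=0.1, step=step), 2))
```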

The algorithm described for group sparsity without overlaps can be applied to the case where groups do overlap, in certain situations. This will likely result in some groups with all zero elements, and other groups with some non-zero and some zero elements.

If it is desired to preserve the group structure, a new regularizer can be defined:

R(w) = inf { ∑_{g=1}^{G} ‖w_g‖₂ : w = ∑_{g=1}^{G} w̄_g }

For each w_g, w̄_g is defined as the vector whose restriction to the group g equals w_g and whose other entries are all zero, and the infimum is taken over all such decompositions. The regularizer finds the optimal decomposition of w into parts; it can be viewed as duplicating all elements that exist in multiple groups. Learning problems with this regularizer can also be solved with the proximal method, with one complication: the proximal operator cannot be computed in closed form, but it can be approximated iteratively, inducing an inner iteration within the proximal method iteration.
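
For comparison, in the non-overlapping case the proximal operator of ∑_g ‖w_g‖₂ can be computed group by group in closed form (block soft thresholding). A minimal sketch, assuming groups are given as lists of coordinate indices (the vector and groups below are arbitrary):

```python
import numpy as np

def group_soft_threshold(v, groups, t):
    """Proximal operator of t * sum_g ||w_g||_2 for non-overlapping groups.
    Each group is either zeroed out entirely or shrunk towards zero."""
    w = np.zeros_like(v)
    for idx in groups:
        block = v[idx]
        norm = np.linalg.norm(block)
        if norm > t:
            w[idx] = (1 - t / norm) * block   # shrink the whole group
        # else: the group stays exactly zero
    return w

v = np.array([3.0, 4.0, 0.1, -0.1, 1.0, 2.0])
groups = [[0, 1], [2, 3], [4, 5]]
print(group_soft_threshold(v, groups, t=1.0))
```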

When labels are more expensive to gather than input examples, semi-supervised learning can be useful. Regularizers have been designed to guide learning algorithms to learn models that respect the structure of unsupervised training samples. If a symmetric weight matrix W is given, a regularizer can be defined:

R(f) = (1/2) ∑_{i,j} W_ij (f(x_i) − f(x_j))²

If W_ij encodes the result of some distance metric between points x_i and x_j, it is desirable that f(x_i) ≈ f(x_j) whenever W_ij is large. This regularizer captures this intuition, and is equivalent to:

R(f) = f̄ᵀ L f̄, where L = D − W is the Laplacian matrix of the graph induced by W, D is the diagonal degree matrix with D_ii = ∑_j W_ij, and f̄ is the vector (f(x_1), ..., f(x_m)).

The optimization problem

min_{f∈R^m} R(f),   m = u + l

can be solved analytically if the constraint f(x_i) = y_i is applied for all supervised samples. The labeled part of the vector f is therefore fixed by the constraints. Writing f = (f_l, f_u) and partitioning L into blocks accordingly, the unlabeled part of f is solved for by setting the gradient of R with respect to f_u to zero:

f_u = −(L_uu)⁻¹ L_ul f_l
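
A small sketch of this computation, assuming the labeled points are listed first in the weight matrix W (the graph and labels below are arbitrary toy data):

```python
import numpy as np

def propagate_labels(W, y_labeled):
    """Solve for the unlabeled part of f with f(x_i) = y_i fixed on labeled points.
    Minimizing f^T L f with L = D - W gives f_u = -L_uu^{-1} L_ul f_l."""
    l = len(y_labeled)                      # labeled points come first
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    L_uu = L[l:, l:]
    L_ul = L[l:, :l]
    return np.linalg.solve(L_uu, -L_ul @ y_labeled)

# toy graph: 2 labeled points and 3 unlabeled points
W = np.array([
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)
y_labeled = np.array([1.0, -1.0])
print(propagate_labels(W, y_labeled))
```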

In the case of multitask learning, T problems are considered simultaneously, each related in some way. The goal is to learn T functions with predictive power, ideally borrowing strength from the relatedness of the tasks. This is equivalent to learning a matrix W of size T × D whose rows are the weight vectors w_1, ..., w_T of the individual tasks.

Mean-constrained regularization uses a penalty of the form

R(W) = ∑_{t=1}^{T} ‖w_t − (1/T) ∑_{s=1}^{T} w_s‖₂²

This regularizer constrains the functions learned for each task to be similar to the overall average of the functions across all tasks. This is useful for expressing prior information that each task is expected to share similarities with every other task. An example is predicting blood iron levels measured at different times of the day, where each task represents a different person.
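
As a sketch, with the task weight vectors stored as the rows of W, this penalty can be evaluated directly (the function name and the matrix below are illustrative):

```python
import numpy as np

def mean_constrained_penalty(W):
    """Sum over tasks of ||w_t - mean_s w_s||^2: penalizes tasks
    whose weight vectors deviate from the average across tasks."""
    mean_w = W.mean(axis=0)        # average function across tasks
    return np.sum((W - mean_w) ** 2)

W = np.array([[1.0, 2.0],
              [1.2, 1.8],
              [0.8, 2.2]])
print(mean_constrained_penalty(W))
```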

A clustered variant of the mean-constrained regularizer enforces similarity only between tasks within the same cluster, for example with a penalty of the form

R(W) = ∑_{c=1}^{C} ∑_{t∈I(c)} ‖w_t − w̄_c‖₂²

where I(c) is the set of tasks in cluster c and w̄_c is the average of the task weight vectors in that cluster. This can capture more complex prior information. This technique has been used to predict Netflix recommendations; a cluster would correspond to a group of people who share similar preferences in movies.
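
A corresponding sketch for the clustered case, assuming the within-cluster penalty given above and a known assignment of tasks to clusters (both are illustrative assumptions, as are the data):

```python
import numpy as np

def clustered_penalty(W, clusters):
    """Sum over clusters of the squared deviation of each task's weights
    from that cluster's mean, enforcing similarity only within clusters."""
    total = 0.0
    for idx in clusters:
        block = W[idx]                 # weight vectors of the tasks in this cluster
        total += np.sum((block - block.mean(axis=0)) ** 2)
    return total

W = np.array([[1.0, 2.0],
              [1.1, 2.1],
              [5.0, -1.0],
              [4.9, -0.8]])
clusters = [[0, 1], [2, 3]]            # two clusters of two tasks each
print(clustered_penalty(W, clusters))
```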