A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.

L₁ regularization is often preferred because it produces sparse models and thus performs feature selection within the learning algorithm, but since the L₁ norm is not differentiable, it may require changes to learning algorithms, in particular gradient-based learners.[1][2]

Regularization can be used to fine-tune model complexity using an augmented error function with cross-validation. The data sets used in complex models can produce a levelling-off of validation as complexity of the models increases. Training data sets errors decrease while the validation data set error remains constant. Regularization introduces a second factor which weights the penalty against more complex models with an increasing variance in the data errors. This gives an increasing penalty as model complexity increases.[3]

Examples of applications of different methods of regularization to the linear model are:

In inverse problem theory, an optimization problem is usually solved to generate a model that provides a good match to observed data. In this context, a regularization term is used to preserve prior information about the model and prevent over-fitting and convergence to a model that matches the data but does not predict well. Ensemble-based regularization is based on utilizing an ensemble (i.e., a set) of realizations from the prior probability distribution function (pdf) to construct a regularization term .[10] This regularization is flexible as it is based on representing the prior pdf using a set of realizations, instead of using, say a mean and covariance matrix for Gaussian distribution.