Contents

Modeling problem

To get an approach to statistics without the untenable assumption
that the observed data have been generated by a 'true' distribution,
a yardstick of the performance of a model \(P\) as a distribution is
taken as the probability or density \(P(x)\) it assigns to the data.
Equivalently, it can be taken as \(\log 1/P(x)\ ,\) which has the
interpretation of a code length as the number of bits when \(x\) is
encoded as a binary string. Of two models \(P\) and \(Q\ ,\)
according to the Maximum Likelihood principle, the former is
better if \(P(x) > Q(x)\) or \(\log 1/P(x)< \log 1/Q(x)\ .\) Hence, no
models are 'true' or 'false'. They simply have a performance of
varying degree of goodness, which, moreover, can be assessed.

This suggests the Minimum Description Length, MDL,
principle for model selection, of which the original
form states that the best model is the one which permits the
shortest encoding of the data and the model itself. It is the
addition of the code length for the model which separates this
principle from the familiar maximum likelihood principle above and makes
it global in the sense that any two models, regardless of their
complexity, can be compared.

There is a purely coding theoretic justification for the MDL
principle if one bears in mind that a short coding of data can be
achieved only when advantage is taken of all the regular features,
which amounts to a model of the data, or, as Maxwell put it, 'the go
of the data'.

Curve fitting example

To illustrate the MDL principle without undue technical
details, consider the basic curve fitting problem. The data
\((y^n,x^n) = (y_1,x_1), \ldots, (y_n,x_n)\) consist of \(n\) points
described by pairs of coordinates on a 2-dimensional plane, and the
objective is to describe a smooth curve \(\{(\bar y_i,x_i)\}\) to
represent the general shape of the cloud of points. The fundamental
difficulty, which pervades all ordinary statistical treatments of
related problems, is to capture the idea of 'optimal smoothness'.
Intuitively it is a curve which is close to all the points but not
too close if it has to jump abruptly from a point to the next.
Consider a parametric curve, such as a polynomial \(\bar y_i =
\theta_0 + \theta_1 x_i + \ldots + \theta_k x_i^k\ ,\) and measure the
closeness to the points by the quadratic distance \(D(y^n|x^n,{
\theta}) = \sum_i(y_i - \bar y_i)^2\ ,\) where \({ \theta}\) denotes
the \(k+1\) coefficients. To conform with the idea of model as a
distribution consider the density function defined by the quadratic
distance function, which is Gaussian of unit variance
\[
f(y^n|x^n;{ \theta}) = (2 \pi )^{-n/2} e^{-D(y^n|x^n,{
\theta})/2}.
\]
Notice that the negative logarithm of this density function, which
may be regarded as an idealized code length, is equivalent to the
quadratic distance measure, up to a term which does not depend on the data,
and therefore can be dropped from all equations below.

The MDL principle calls for the model, defined by the
parameter values, including their number in some sorting defining the structure,
for which the code length of the data and the model
\[
L(y^n|x^n,{ \theta}) = \log 1/f(y^n|x^n;{ \theta}) + L({
\theta}) \ \ (*)
\]
is minimized. We can distinguish between two situations of interest: In the first situation, \(\theta = (k, \theta_0, \ldots, \theta_{k-1})\ :\) each model is fully described by its number of scalar component parameters, and the value of these parameters. This is the case if, for example, we are interested in finding the degree of the polynomial that best explains the data. Then
\(L(\theta)\) decomposes as \(L(\theta) = L(k) + L(\theta_0, \ldots, \theta_{k-1})\ .\) In the second case, corresponding to the so-called 'subset selection problem', we consider a much larger class of models: we do not restrict the \(k\) nonzero parameters to coincide with the first \(k\) coefficients; rather, they can be any subset of the first \(n\) parameters. Then \(\theta = (k, i_1, \ldots, i_k, \theta_{i_1}, \ldots, \theta_{i_k})\) includes a description of which parameters are nonzero, and we get \(L(\theta) = L(k) + L(i_1, \ldots, i_k) + L(\theta_{i_1}, \ldots, \theta_{i_k})\ .\) Here \(L(i_1, \ldots, i_k) = \log n!/k! (n-k)! \) is the codelength needed to describe the 'structure', i.e. the indices \(i_1, \ldots, i_k\) of the nonzero coefficients.

For each \(k\) the first term in (*) is minimized by the least
squares estimate \({ \hat \theta}\ ,\) and since the components are
real numbers they must be quantized in order to keep the second term
finite. The optimal worst case quantization is of the order of
\(1/\sqrt n\) for each parameter.
The difference between \(
\log 1/f(y^n|x^n;{ \hat\theta})
\) and \(
\log 1/f(y^n|x^n;{ \ddot\theta})
\ ,\) where \(\ddot\theta\) is the discretized point closest to \(\hat\theta\ ,\) is then bounded by a constant not dependent of \(n\ .\) Assuming the parameters range over a finite interval with length \(C\ ,\) the number of discretized values is \( C \cdot \sqrt{n}\ .\) Therefore, the codelength per parameter can be taken to be approximately \( (1/2) (\log n + \log C)\ .\) For large enough \(n\ ,\) the constant terms (including \(L(k)\)) may be ignored, and then the (approximately) optimal
number of parameters can be found by the minimization
\[
\min_k\{ \log 1/f(y^n|x^n;{ \hat \theta}) + (k/2) \log n \ [\ + \log (n!/(k!(n-k)!)) \ ]\},
\]
where the structure term between square brackets is included only in the subset-selection problem.
This provides a relatively simple and unprejudiced solution to a
deep problem for which only ad hoc solutions are provided in
ordinary statistics.

The second and the third terms may be regarded as the amount of learnable information in the data, because clearly the learnable information
is precisely the best model, and it takes about the indicated number of bits
to describe it. The minimum of the first term is the amount of noise, which has no further information that can be extracted with
the models considered. The sum of the two is called stochastic complexity of the data, given the class of Gaussian models
considered.

The same criterion to find the number of parameters, but without the structure term, called BIC for Bayesian Information Criterion, has also been derived
by different means without the coding theoretic interpretation; the
word 'information' is neither the one above nor Shannon
information.
The similarity of the BIC criterion and the (asymptotic approximation to) the original MDL criterion has led many to believe that MDL and Bayesian model selection are equivalent. This is not the case however: there are various modeling problems, more sophisticated than the one discussed here, in which the two approaches lead to quite different formulas, with quite different results. This holds in particular for the more recent, extended formulation of the principle, which has become known as "modern MDL".

Modern MDL principle

Since the code length for the data and the optimal model actually is
not the shortest for the data alone, the modern version of the MDL principle is changed to finding the 'shortest prefix code
length \(\log 1/P(x^n|\mathcal{M})\) for the data', where \(P(x^n|\mathcal{M})\)
is a universal model for a class \(\mathcal{M}\) of models. Broadly speaking, a universal model \(P\)
relative to a class \(\mathcal{M}\) of models (distributions) is a distribution \(P\)
such that, no matter what data \(x_1, \ldots, x_n \) are observed, coding the data using the code with lengths \( - \log P(x_1, \ldots, x_n)\) does not require substantially more bits than coding the data with the distribution \( Q \in \mathcal{M}\) that turns out to be the best for coding the data with hindsight; this is just the maximum likelihood \(Q\ ,\) minimizing, over all \(Q \in \mathcal{M}\ ,\) the codelength \( - \log Q(x_1, \ldots, x_n)\ ,\) or, equivalently, maximizing \(Q(x_1, \ldots, x_n)\ .\)

By basing it on universal models, the definition of stochastic complexity is sharpened, yet it still
breaks up into the same two constituents, the amounts of noise and learnable information.
The learnable information relative to a \(k\)-parameter model \(\mathcal{M}\) now becomes asymptotically equal to \((k/2) \log n\) plus an additional term which tends to a constant with increasing \(n\ .\) This constant depends on the geometric structure of \(\mathcal{M}\) and can be related to the volume of \(\mathcal{M}\) when embedded in some abstract space. A large 'volume' means that \(\mathcal{M}\) contains many distinguishable distributions.

The MDL Principle originated with the publication of Rissanen (1978). Rissanen was inspired by the algorithmic complexity theory of Solomonoff, Kolmogorov and Chaitin. Since 1978, there has been a lot of theoretical and practical research surrounding the idea. One of the most remarkable insights is the following: universal models can be constructed in at least four different ways: based on two-part codes, on Bayes-style model averages, on predictive-sequential procedures or using the so-called normalized maximum likelihood (NML). Superficially, the resulting universal codes are completely different, but one can show that in fact, they lead to almost identical code lengths. Therefore, in practice, all these different types of universal codes may be used.

The 'raw' \((k/2) \log n\) formula for the learnable information, however, does not always produce satisfying results; to obtain these, one often needs more precise nonasymptotic formulas. For Bayes-style, predictive-sequential and NML approaches, these formulas do not involve any quantization. Therefore, in practice they are often easier to use than the original two-part MDL.