some strings can be effectively compressed but take a lot of effort to do so, or

different strings can be effectively compressed to various degrees, singly, jointly, or conditionally on one another, and approximating the Kolmogorov complexity with real-world compressors still gives acceptable, or even very good, results.

The incompressibility of random objects
yields a simple but powerful proof technique.
The incompressibility method is a general-purpose
tool and should be compared with
the pigeon-hole principle or the
probabilistic method. Whereas the older methods
generally show the existence of an object with
the required properties, the incompressibility
argument shows that almost all objects have the
required property. This follows immediately
from the fact that the argument is typically used on a
Kolmogorov random object. Since such objects are
effectively indistinguishable, the proof
holds for all such objects. Each class
of objects has an abundance of objects that are Kolmogorov random
relative to the class.

The incompressibility method has been successfully applied
to solve open problems and simplify existing proofs.
The method rests on a simple fact:
a Kolmogorov random string cannot be compressed.
Generally,
a proof proceeds by showing that a certain property
has to hold for some `typical' instance of a problem. Since
`typical' instances are
difficult to define and often impossible to construct,
a classical proof usually involves
all instances of a certain class.

By intention and definition,
an individual Kolmogorov random object is a `typical'
instance. These are the incompressible objects.
Although individual objects cannot be proved to be
incompressible in any given finite axiom system,
a simple counting argument shows that almost
all objects are incompressible.
In a typical proof using the incompressibility method,
one first chooses a Kolmogorov random object from the
class under discussion.
This object is
incompressible.
Then one proves that the desired property holds for this
object.
The argument invariably says that if the property
does not hold, then the object
can be compressed. This yields the required contradiction.

Because we are dealing with only one fixed object,
the resulting proofs tend to be
simple and natural. They are natural in that
they supply rigorous analogues
for our intuitive reasoning.
In many cases
a proof using the incompressibility method implies
an average-case result since almost all strings are
incompressible.

The method is always a matter
of exploiting regularity in an object or algorithm, imposed by the property under investigation
and quantified in an assumption to be contradicted,
in order to compress the description of the object or algorithm below its minimal length.
The incompressibility method is the oldest and most frequently used application of algorithmic complexity.

A simple example from number theory

In the nineteenth century, Chebyshev showed that the number
of primes less than \(n\) grows asymptotically like \(n/\log n\ .\)
Using the incompressibility method we cannot (yet) prove this
statement precisely, but we can come remarkably close with
a minimal amount of effort.
We first prove, following G.J. Chaitin, that for infinitely many \(n\ ,\)
the number of primes less than or equal to \(n\) is at
least \(\log n/ \log \log n\ .\) The proof method is as follows.
For each \(n\ ,\) we construct a description
from which \(n\) can be effectively retrieved.
This description will involve the primes less than \(n\ .\)
For some \(n\) this description must be long, which
will give the desired result.

Assume that
\(p_1 , p_2 , \ldots , p_m\) is the list of all the primes
less than \(n\ .\) Then,
\[
n = p_1^{{e}_1} p_2^{{e}_2} \cdots p_m^{{e}_m}
\]
can be reconstructed from the vector of the exponents.
Each exponent is at most \(\log n\) and can be represented
by \(\log \log n\) bits. The description of \(n\) (given \(\log n\))
can be given in \(m \log \log n\) bits.

It can be shown that each
\(n\) that is random (given \(\log n\)) cannot be described in fewer
than \(\log n\) bits. Hence \(m \log \log n \geq \log n\ ,\) that is,
the number \(m\) of primes less than \(n\) is at least
\(\log n/\log \log n\ ,\) which is the desired result.
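As a minimal sketch of the encoding behind this argument (plain Python; the helper names are my own), the exponent vector over the primes below \(n\) does recover \(n\ ,\) and the count of primes satisfies the stated bound:

```python
from math import log2, prod

def primes_upto(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return [p for p in range(2, n + 1) if sieve[p]]

def exponent_vector(n, primes):
    """Exponents e_i with n = prod(p_i ** e_i), by trial division."""
    exps = []
    for p in primes:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        exps.append(e)
    return exps

n = 10080                       # 2^5 * 3^2 * 5 * 7
ps = primes_upto(n)
exps = exponent_vector(n, ps)
assert prod(p ** e for p, e in zip(ps, exps)) == n   # n is recoverable
m = len(ps)                     # number of primes <= n
# for random n the argument gives m * loglog n >= log n, i.e.:
assert m >= log2(n) / log2(log2(n))
```

Of course, for actual (non-random) \(n\) the exponent vector is a wasteful description; the inequality only becomes informative when \(n\) is incompressible.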
Can we do better? This is slightly more
complicated. The original idea is due to P. Berman, and improved by J.T. Tromp.
Let \(l(x)\) denote
the length of the binary representation of \(x\ .\)
We shall show that for infinitely many \(n\ ,\) the number of distinct primes
less than \(n\) is at least \(n/\log^2 n\ .\)

Firstly, we can describe any given integer \(n\) by \(E(m)\) concatenated with \(n/p_m\ ,\) where
\(E(m)\) is a prefix-free encoding
of \(m\ ,\) and \(n/p_m\) denotes the literal binary string representing the integer \(n/p_m\ .\) Here \(p_m\)
is the largest prime dividing \(n\ .\)
For random \(n\ ,\) the length of this description,
\(l(E(m)) + \log n - \log p_m\ ,\) must exceed \(\log n\ .\)
Therefore, \(\log p_m < l(E(m))\ .\) It is known
that the length of the prefix-free code
\(l(E(m))\) for the integer \(m\) is \(< \log m + 2 \log \log m\ .\)
Hence, \(p_m < m \log^2 m\ .\) Setting \(n_m := m \log^2 m\ ,\)
the first \(m\) primes \(p_1, \ldots, p_m\) all lie below \(n_m\ ,\)
so the number of primes less than \(n_m\) is at least
\(m = n_m/\log^2 m > n_m/\log^2 n_m\ .\)
Since there are infinitely many primes (by the previous result),
\(m\) takes arbitrarily large values, and the bound holds
for the infinite sequence of values \(n_1, n_2, \ldots\)
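The bound \(p_m < m \log^2 m\) on the \(m\)-th prime can be checked numerically; a sketch (the sieve helper is my own, and the inequality is checked here only for \(m \geq 3\ ,\) with base-2 logarithms):

```python
from math import log2

def primes_upto(n):
    """Sieve of Eratosthenes: all primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return [p for p in range(2, n + 1) if sieve[p]]

primes = primes_upto(100000)
# p_m < m * log^2 m for the m-th prime (1-indexed), checked for m >= 3
for m, p_m in enumerate(primes, start=1):
    if m >= 3:
        assert p_m < m * log2(m) ** 2
```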

Example: Gödel's incompleteness result

Gödel proved that in any consistent, sufficiently powerful theory there are true but unprovable statements, and he constructed such a statement. Here we use the incompressibility argument to show in a very simple manner that there are, in fact, infinitely many such undecidable statements.

A formal system (consisting of definitions,
axioms, rules of inference)
is consistent
if no statement that can be expressed in the system
can be proved to be both true and false in the system.
A formal system is sound if only true statements can be
proved to be true in the system. (Hence, a sound formal
system is consistent.) The idea below goes back to Ya. Barzdins and was
popularized by G.J. Chaitin.

Let \(x\) be a finite binary
string. We write `\(x\) is random' if
the shortest binary description of \(x\)
has length
at least that of the literal description of \(x\)
(in other words, the Kolmogorov complexity of \(x\) is not less than its length).
A simple counting argument shows
that there are random \(x\)'s of each length, or that most strings are random in the sense that their Kolmogorov complexity is at least their length minus 2.
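The counting argument can be illustrated empirically with a real compressor (a sketch using Python's zlib; a general-purpose compressor only upper-bounds the true Kolmogorov complexity):

```python
import random
import zlib

random.seed(0)

# A "typical" (pseudo-random) string does not compress: deflate can only
# store it verbatim, plus a few bytes of framing overhead.
data = random.randbytes(10000)
assert len(zlib.compress(data, 9)) >= len(data)

# A highly regular string, by contrast, compresses drastically.
regular = b"0" * 10000
assert len(zlib.compress(regular, 9)) < 100
```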

Fix any sound
formal system \(F\) in which we can express statements
like `\(x\) is random.'
We claim that for all but finitely
many random strings \(x\ ,\) the sentence
`\(x\) is random' is not provable in \(F\ .\)
Assume the contrary.
Suppose
\(F\) can be described in \(f\) bits---assume, for example,
that this is the
number of bits used in the exhaustive description of \(F\)
in the first chapter of the textbook `Foundations of \(F\)'.
Then given \(F\ ,\) we can
exhaustively search for a proof that some string of length \(n \gg f \) is random, and print the first such string found. This is an \(x\)
satisfying the `\(x\) is random' sentence.
This procedure to print \(x\) of length \(n\) uses only \( \log n + f\)
bits of data, which is much less than \(n\ .\)
But \(x\) is random by the proof, which is true since \(F\) is sound, and hence the shortest way to effectively describe \(x\) is by at least \(n\) bits.
Hence, we have a contradiction.

This shows that although most strings are random, it is impossible to
effectively prove them random. In a way, this explains why the incompressibility method
is so successful. We can argue about a `typical' individual element,
which is difficult or impossible by other methods.

Average-case complexity of Algorithms

The incompressibility method has been very successful in
analyzing the notoriously difficult average-case complexity of
algorithms. For example, the average-case time complexity
of an algorithm is its average running time
over all inputs of binary length \(n\ ,\) for each \(n\ .\) Classically, this involves analyzing
the running time on every binary string of length \(n\ ,\) for each \(n\ .\)
With the incompressibility method, we need only analyze the running time on
a single incompressible (that is, Kolmogorov random) string
of length \(n\ ,\) because such a string is so typical that its
running time is about equal to that of most other strings,
and hence the average time.

The method simply analyzes the algorithm
with respect to this single string, by showing that the string is
compressible if the algorithm does not satisfy a certain running time.
This method has been used to analyze the average case running time
for many well-known sorting algorithms,
including Heapsort and Shellsort. A successful example is an \(\Omega (p n^{1+1/p})\)
lower bound on the average-case running time (uniform distribution)
for sorting \(n\) items,
of \(p\)-pass Shellsort. This is the first nontrivial
general lower bound for average-case Shellsort in 40 years.

Applications of Compressibility

Traditional wisdom has it that the better a theory compresses
the learning data concerning some phenomenon under investigation,
the better we learn, generalize, and the better the theory predicts
unknown data. This belief is vindicated in practice but, before
the advent of Kolmogorov complexity, had not been rigorously proved in
a general setting. The material on applications of compressibility is covered in Li and Vitanyi (2008), Chapter 5. Ray Solomonoff invented the notion of universal
prediction using the Kolmogorov complexity based universal distribution,
see the section on Algorithmic Probability in Algorithmic Information Theory. The discrete version of the universal probability, denoted as \(m\ ,\) has miraculous applications:

For every algorithm whatsoever, the average-case computation complexity resource (e.g. running time, storage) is the same order of magnitude as the worst-case computation complexity resource (running time, storage, respectively), the average taken with inputs distributed according to the universal distribution \(m\ .\)

In PAC-learning, using \(m\) as the distribution from which to draw labeled examples in the learning phase, considering families of discrete concept classes with computable distributions, the learning power increases: Some concept classes that were NP-hard to learn now become polynomially learnable. Similarly, PAC learning of continuous concept classes from families over computable measures is improved by using \(M\ ,\) the continuous version of \(m\ ,\) in the learning phase. Some continuous concept classes that are non-computable in the general setting become computable now. That is, in both cases we draw in the learning phase labeled examples according to the universal distribution, and we PAC-learn according to the real computable distribution. This model is akin to using a teacher in the learning phase who gives the simple examples first.

A formally related quantity is the probability that \(U\) halts when provided with fair coin flips on the input tape (i.e., the probability that a random computer program will eventually halt). This halting probability, \(\Omega = \sum_x m(x)\ ,\) also known as Chaitin's constant or "the number of wisdom," has numerous remarkable mathematical properties; for instance, it can be used to quantify Gödel's incompleteness theorem, and it is a natural example of a Martin-Löf random sequence.

The continuous version of \(m\ ,\) denoted above by \(M\ ,\) leads to excellent predictions and decisions in general stochastic environments. A theory of learning by experiments in a reactive environment has been developed by combining universal-distribution predictions with sequential decision theory; this results in an optimal reinforcement learning agent embedded in an arbitrary unknown environment (Hutter 2005), and a formal definition and test of intelligence.

Figure 1: Kolmogorov's lecture introducing the Structure function at the meeting of the Bernoulli society in Tallinn, Estonia, 1973

Universal prediction is related to optimal effective compression.
The latter is almost always a best strategy in hypothesis identification
(the minimum description length (MDL) principle). While most strings
are incompressible, they represent data in which there is no meaningful
law or regularity to learn; it is precisely the compressible strings
that represent data from which we can learn meaningful laws.
As perhaps the last mathematical innovation of an extraordinary
scientific career,
Kolmogorov in 1973
proposed to found statistical theory on finite combinatorial
principles independent of probabilistic assumptions.
Technically, the new statistics is expressed in terms of Kolmogorov complexity. The relation between the individual data and its explanation (model) is expressed
by Kolmogorov's Structure function.
This entails a non-probabilistic approach to statistics
and model selection. Let data be finite binary strings and models be
finite sets of binary strings. Consider model classes
consisting of models of given maximal (Kolmogorov) complexity.
The Structure function of the given data expresses the
relation between the complexity level constraint on a model class
and the least log-cardinality of a model in the class containing the data.
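In symbols, with data \(x\) a finite binary string and models \(S\) finite sets of binary strings containing \(x\ ,\) the structure function just described can be written (in a standard formulation) as
\[
h_x(\alpha) = \min_{S} \{ \log |S| : x \in S,\; K(S) \le \alpha \},
\]
so \(h_x\) maps the complexity constraint \(\alpha\) on the model class to the least log-cardinality of a model in that class containing the data.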
Recently it has been shown by Vereshchagin and Vitanyi that the structure function determines all stochastic properties
of the data: for every constrained model class it determines the individual
best-fitting model in the class irrespective of whether the `true' model
is in the model class considered or not. This approach led to a new foundation for the celebrated Minimum Description Length (MDL) principle for Statistical Modeling or
Statistical Inference, Rissanen (2007).
Essentially, for given data, the analysis by Vitanyi and Li tells us which models obtained by the MDL principle are the right, best-fitting ones, while the analysis using the structure function tells us how to obtain them.
In this setting, this happens with certainty,
rather than with high probability as in the classical case.
Related ideas have been applied by Vereshchagin and Vitanyi
to rate-distortion theory and the lossy compression
of individual data.

Cognitive psychology has a long tradition of applying formal models of simplicity and complexity. The work of E. L. J. Leeuwenberg even predates the advent of Kolmogorov complexity. Not surprisingly, this field has a large and significant literature on applications of Kolmogorov complexity, for example the circles around N. Chater and P.A. van der Helm.

Applications of Compressibility Requiring a Great Effort

Some strings can be compressed but take a great amount of effort, in time or space, to do so. Applications of compressibility requiring a great effort are covered in Li and Vitanyi (2008), Chapter 7. Such applications concern resource-bounded Kolmogorov complexity.

Applications of Differences in Compressibility by Real Compressors

Using the compressibility of nonrandom strings, a new
theory of information distance and normalized
information distance has been introduced, initially using
Kolmogorov complexity and applied through approximation by
real-life compressors like "gzip", "bzip2", and "PPMZ". Such applications are covered in Li and Vitanyi (2008), Chapter 8.
In classical physics,
we know how to measure the physical distance between a pair of objects.
In the information age, it is important to
measure the "information distance" between two objects:
two documents, two letters, two emails, two music scores, two languages,
two programs, two pictures, two systems, or two genomes.
Such a measurement should not be application dependent or arbitrary.
The universal similarity metric is probably
the greatest practical success of Algorithmic Information Theory. A reasonable definition for
the similarity between two objects is how difficult it is to
transform them into each other. Formally, one can define the
similarity between strings \(x\) and \(y\) as the length of the shortest
program that computes \(x\) from \(y\) and vice versa.
This was proven to be equal to
\[
\max \{K(x|y),K(y|x)\}
\]
up to logarithmic additive terms which can be ignored.
These distances are absolute, but if we want to express similarity,
then we are more interested in relative ones.
For example, if two strings of length 1,000,000 differ by 1000 bits,
then we are inclined to think that those strings are relatively
more similar than two strings of 1000 bits that have that distance.
Hence we need to normalize to obtain a universal similarity
metric. To apply it in practice, e.g. for building evolutionary trees of species,
we approximate the Kolmogorov complexity by real-world compressors. This bold step was first taken for the slightly different sum distance \(K(x|y)+K(y|x)\ ,\) originally derived with a thermodynamic interpretation in mind.
Many applications have followed, including chain-letter phylogeny, plagiarism detection, protein sequence/structure classification, and phylogenetic reconstruction. In particular, researchers from the data-mining community noticed that this methodology is in fact a parameter-free, feature-free data-mining tool. They experimentally tested a closely related metric on a large variety of sequence benchmarks: comparing the compression method with 51 major methods found in 7 major data-mining conferences over the past decade, they established clear superiority of the compression method for clustering heterogeneous data and for anomaly detection, and competitiveness in clustering domain data. It has been shown that the normalized information distance (NID),
\[
NID(x,y) = \frac{ \max\{K{(x|y)},K{(y|x)}\} }{ \max \{K(x),K(y)\}},
\]
where \(K(x|y)\) is algorithmic information of \(x\) conditioned on \(y\ ,\)
is a similarity metric.
The function \(NID(x,y)\) has been shown to satisfy the basic requirements
for a metric distance
measure. The universality conjecture was recently proved: the NID,
as well as the normalized sum distance proposed
earlier, is universal in the sense that it
minimizes (is at least as small as) every computable
distance measure satisfying a natural density requirement.
While this metric is not computable, it has an abundance of applications.
Approximating \(K\) by a real-world compressor \(C\ ,\) with \(C(x)\) the
binary length of the file \(x\) compressed with \(C\) (for example
"gzip", "bzip2", or "PPMZ"), we can rewrite the NID to obtain the
easy-to-apply normalized compression distance (NCD)
\[
NCD(x,y) = \frac{C(xy) - \min \{C(x),C(y)\}}{\max \{C(x),C(y)\}}.
\]
NCD is actually a family of distances parametrized with the compressor \(C\ .\)
The better \(C\) is, the closer the NCD approaches the NID, and the
better the results are. The normalized compression
distance has been used to fully automatically reconstruct
language and phylogenetic trees as above. It can also be used for new applications of general clustering
and classification of natural data in arbitrary domains, for clustering of heterogeneous data, and for anomaly detection across domains.
The NID and NCD have been further applied to
authorship attribution, stemmatology, music classification, internet knowledge discovery, to analyze network traffic and cluster computer worms and viruses,
software metrics and obfuscation, web page authorship,
topic and domain identification, hurricane risk assessment,
ortholog detection, and clustering fetal heart rate tracings.
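As a minimal sketch, the NCD formula above can be computed with a standard-library compressor (here Python's bz2 stands in for \(C\); any real compressor only upper-bounds \(K\), so the distances are approximate):

```python
import bz2
import random

def C(data: bytes) -> int:
    """Length in bytes of data compressed with the stand-in compressor."""
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance NCD(x,y)."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

random.seed(0)
a = b"the quick brown fox jumps over the lazy dog " * 200
b = b"lorem ipsum dolor sit amet consectetur adipiscing elit " * 200
r = random.randbytes(len(a))  # incompressible "noise" object

# identical objects land near distance 0; unrelated random data near 1
assert ncd(a, a) < ncd(a, b) < ncd(a, r)
```

The better the compressor, the closer this NCD approaches the NID; in practice one chooses a compressor suited to the data (text, DNA, MIDI, and so on).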
Objects can be given literally, like the literal
four-letter genome of a mouse,
or the literal text of War and Peace by Tolstoy. For
simplicity we take it that all meaning of the object
is represented by the literal object itself. Objects can also be
given by name, like "the four-letter genome of a mouse,"
or "the text of `War and Peace' by Tolstoy." There are
also objects that cannot be given literally, but only by name,
and that acquire their meaning from their contexts in background common
knowledge in humankind, like "home" or "red." Using code-word lengths
obtained from the page-hit counts returned by Google from the web,
we obtain a semantic distance by using the NCD formula and viewing
Google as a compressor; this distance is
useful for data mining, text comprehension, classification, and translation.
The associated NCD, now called the
normalized Google distance (NGD) can be rewritten as:
\[
NGD(x,y)= \frac{ \max \{\log f(x), \log f(y)\} - \log f(x,y) }{
\log N - \min\{\log f(x), \log f(y) \}},
\]
where \(f(x)\) denotes the number of pages containing \(x\ ,\) and \(f(x,y)\)
denotes the number of pages containing both \(x\) and \(y\ ,\) as reported by Google. The number \(N\) can be set to the number of pages indexed by Google.
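The NGD is simple arithmetic once the counts are in hand; a sketch with hypothetical page counts (the numbers below are invented for illustration, not real Google counts):

```python
from math import isclose, log2

def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance from page counts fx, fy, fxy and index size n."""
    lfx, lfy = log2(fx), log2(fy)
    return (max(lfx, lfy) - log2(fxy)) / (log2(n) - min(lfx, lfy))

# Hypothetical counts: terms that always co-occur are at distance 0 ...
assert ngd(1000, 1000, 1000, 10 ** 6) == 0.0
# ... while statistically independent terms (fxy = fx * fy / n) land near 1.
assert isclose(ngd(1000, 1000, 1, 10 ** 6), 1.0)
```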
CompLearn, a publicly available open-source software tool for both the NCD and the NGD, written by Rudi Cilibrasi, can be downloaded from http://www.complearn.org. An on-line demo of the NGD at http://clo.complearn.org/ takes a list of 4-25 words (or compound words containing spaces), uses Google page counts to compute the NGD for each pair of words, and then uses a quartet method to find the best-fitting tree.

Stimulated by this work, a competitive approach based on
compression has been developed
for Pearson-Neyman hypothesis testing (null hypothesis versus alternative
hypothesis), for testing the randomness of strings produced by random number
generators,
and for lossy compression and denoising via compression.