More Formally

What is the optimal way of predicting the future, given
the past? What is the best way to act so as to maximize one's
future expected reward? What is the best way of searching
for the solution to a novel problem, making optimal use of
solutions to earlier problems?

Most previous work
on these old and fundamental questions has focused on
very limited settings, such as Markovian environments
where the optimal next action, given past inputs, depends
on the current input only [27].

We will concentrate on a much weaker and therefore much more
general assumption, namely, that the environment's responses are
sampled from a computable probability distribution.
If even this weak assumption were not true then we could not even formally
specify the environment, let alone write
reasonable scientific papers about it.

Let us first introduce some notation.
$B^*$ denotes the set of finite
sequences over the binary alphabet $B = \{0,1\}$, $B^{\infty}$ the set of
infinite sequences over $B$, $\lambda$ the empty string,
$B^{\sharp} = B^* \cup B^{\infty}$. $x, y, z, z^1, z^2$ stand for strings in $B^{\sharp}$.
If $x \in B^*$ then $xy$ is the concatenation of $x$ and $y$ (e.g.,
if $x = 10000$ and $y = 1111$ then
$xy = 100001111$). For $x \in B^*$,
$l(x)$ denotes the number of bits in $x$, where $l(x) = \infty$ for
$x \in B^{\infty}$;
$l(\lambda) = 0$. $x_n$ is the prefix of $x$
consisting of the first $n$ bits, if $l(x) \geq n$, and $x$ otherwise
($x_0 := \lambda$). $\log$ denotes the logarithm with basis 2;
$f, g$ denote functions mapping integers to integers. We write
$f(n) = O(g(n))$ if
there exist positive constants $c, n_0$ such that
$f(n) \leq c \, g(n)$ for
all $n > n_0$. For simplicity let us consider universal Turing Machines
[64]
(TMs) with input alphabet $B$ and trinary output alphabet including the
symbols ``0'', ``1'', and `` '' (blank). For efficiency reasons,
the TMs should have several work tapes
to avoid potential quadratic slowdowns associated with 1-tape TMs.
The remainder of this paper assumes a fixed universal reference TM.
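The string notation above can be made concrete with a few toy helpers (a sketch for illustration only; the names `l` and `prefix` mirror the paper's $l(x)$ and $x_n$ but are not from the source):

```python
# Toy helpers mirroring the paper's string notation over the binary
# alphabet B = {0, 1}; finite strings are Python strs of "0"/"1".

def l(x):
    """l(x): the number of bits in the finite string x; l(lambda) = 0."""
    return len(x)

def prefix(x, n):
    """x_n: the first n bits of x if l(x) >= n, and x itself otherwise."""
    return x[:n] if l(x) >= n else x

x, y = "10000", "1111"
assert x + y == "100001111"       # concatenation xy from the example
assert l(x) == 5 and l("") == 0   # l(x); the empty string is lambda
assert prefix(x, 3) == "100"      # x_3
assert prefix(x, 9) == x          # n exceeds l(x), so x_n = x
```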

Now suppose bitstring $x$ represents the data observed so far.
What is its most likely continuation
$y \in B^{\sharp}$? Bayes' theorem yields

$$P(xy \mid x) = \frac{P(x \mid xy) \, P(xy)}{P(x)} \propto P(xy) \qquad (1)$$

where $P(z^2 \mid z^1)$
is the probability of $z^2$, given knowledge
of $z^1$, and $P(x) = \int_{z \in B^{\sharp}} P(xz) \, dz$
is just a normalizing factor. So the most likely continuation
$y$ is determined by $P(xy)$, the prior probability of $xy$.
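Equation (1) can be illustrated with a toy computable prior (an assumption for this sketch, not from the paper): a uniform prior over all bitstrings of length 3, standing in for some computable $P$. Restricting the prior to continuations of the observed $x$ and renormalizing gives the posterior:

```python
from fractions import Fraction

# Toy computable prior: uniform over the 8 bitstrings of length 3.
strings = [format(i, "03b") for i in range(8)]
P = {s: Fraction(1, 8) for s in strings}

def posterior(x):
    """P(xy | x) proportional to P(xy): keep the strings that continue x,
    then divide by the normalizer P(x), the total prior mass on them."""
    cont = {s: p for s, p in P.items() if s.startswith(x)}
    z = sum(cont.values())  # normalizing factor P(x)
    return {s: p / z for s, p in cont.items()}

post = posterior("10")
# The continuations of "10" are "100" and "101"; under the uniform
# prior each gets posterior probability 1/2.
print(post)
```

With a non-uniform prior the same code would favor whichever continuations the prior deems more probable, which is exactly why the choice of prior in the next paragraph matters.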
But which prior measure is plausible?
Occam's razor suggests that the ``simplest'' $x$ should
be more probable.
But which exactly is the ``correct'' definition of simplicity?
Sections 3 and 4
will measure the simplicity of a description by its length.
Section 5
will measure the simplicity of a description by the time
required to compute the described object.