Optimal Rational Decision Makers

So far we have talked about passive prediction, given the
observations. Note, however, that agents interacting with an environment
can also use predictions of the future to compute action sequences
that maximize expected future reward. Hutter's recent AIXI model
[22]
(author's SNF grant 61847)
does exactly this, by combining Solomonoff's
$\xi$-based universal prediction scheme with an expectimax
computation.
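The expectimax idea can be sketched in a few lines: at each depth, take the expectation over percept/reward pairs under the model, and the maximum over actions. The toy `model` interface and finite percept set below are illustrative assumptions, not part of the original formulation:

```python
def expectimax(history, model, actions, percepts, horizon):
    """Finite-horizon expectimax sketch: best value is the max over
    actions of the model-expected (reward + future value)."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        expected = 0.0
        for (x, r) in percepts:
            p = model(history, a, x, r)  # model's prob. of percept x, reward r
            if p > 0:
                expected += p * (r + expectimax(history + [(a, x, r)],
                                                model, actions, percepts,
                                                horizon - 1))
        best = max(best, expected)
    return best
```

True AIXI replaces the toy `model` with the universal mixture and an unbounded program search, which is what makes it incomputable in practice.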

In cycle $k$, action $y_k$ results in perception
$x_k$ and reward $r_k$, where all quantities may depend on
the complete history. The perception $x_k$ and reward $r_k$ are sampled
from the (reactive) environmental probability distribution $\mu$.
Sequential decision theory shows
how to maximize the total expected reward, called value,
if $\mu$ is known. Reinforcement learning [27] is used if $\mu$ is
unknown. AIXI defines a mixture distribution $\xi$
as a weighted sum of distributions
$\nu \in \mathcal{M}$, where $\mathcal{M}$ is any class
of distributions that includes the true environment $\mu$.
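Concretely, the mixture is $\xi(\cdot) = \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(\cdot)$. A minimal numerical sketch, assuming a toy class $\mathcal{M}$ of two Bernoulli environments (class, weights, and helper names are illustrative, not from the original):

```python
def bernoulli(theta):
    """Environment assigning i.i.d. Bernoulli(theta) probability to a bit string."""
    def prob(seq):
        p = 1.0
        for bit in seq:
            p *= theta if bit == 1 else 1.0 - theta
        return p
    return prob

def mixture(weights, envs, seq):
    """xi(seq) = sum over nu in the class of w_nu * nu(seq)."""
    return sum(w * env(seq) for w, env in zip(weights, envs))

# A two-element class assumed to contain the true environment.
envs = [bernoulli(0.5), bernoulli(0.9)]
weights = [0.5, 0.5]
xi_of_one = mixture(weights, envs, [1])  # 0.5*0.5 + 0.5*0.9 = 0.7
```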

It can be shown that the conditional probability of
environmental inputs to an AIXI agent, given the agent's earlier
inputs and actions,
converges with increasing length of interaction to the true, unknown
probability $\mu$ [22],
as long as the latter is recursively computable, analogously
to the passive prediction case.
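This convergence can be illustrated numerically: as observations accumulate, the mixture's conditional prediction approaches the true environment's next-symbol probability, because the true environment's posterior weight comes to dominate. The two-element Bernoulli class below is an assumed toy setting, not from the original:

```python
import random

def bernoulli(theta):
    """i.i.d. Bernoulli(theta) probability of a bit string."""
    def prob(seq):
        p = 1.0
        for bit in seq:
            p *= theta if bit == 1 else 1.0 - theta
        return p
    return prob

def predictive(weights, envs, seq, nxt):
    """Mixture conditional xi(nxt | seq) = xi(seq + nxt) / xi(seq)."""
    num = sum(w * e(seq + [nxt]) for w, e in zip(weights, envs))
    den = sum(w * e(seq) for w, e in zip(weights, envs))
    return num / den

random.seed(0)
true_theta = 0.8
envs = [bernoulli(0.2), bernoulli(0.8)]   # class contains the true environment
weights = [0.5, 0.5]
seq = [1 if random.random() < true_theta else 0 for _ in range(200)]
# After 200 observations the mixture prediction is very close to mu's 0.8.
p_next = predictive(weights, envs, seq, 1)
```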

Recent work [24] also demonstrated
AIXI's optimality in the following sense. The
Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing
in the sense that its average value converges asymptotically, for
all $\mu \in \mathcal{M}$,
to the optimal value achieved by the (infeasible)
Bayes-optimal policy $p^\mu$, which knows $\mu$ in advance.
The necessary condition that $\mathcal{M}$ admits self-optimizing policies
is also sufficient. No other structural assumptions are made on $\mathcal{M}$.
Furthermore, $p^\xi$ is Pareto-optimal in the sense that there is no other policy
yielding higher or equal value in all environments $\nu \in \mathcal{M}$
and a strictly higher value in at least one [24].
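In symbols (notation assumed here, loosely following the value notation of sequential decision theory, with $V^p_{1m}$ the expected reward a policy $p$ accumulates over cycles $1$ through $m$ in environment $\mu$), self-optimization states:

```latex
\[
  \lim_{m \to \infty} \frac{1}{m}\, V^{p^\xi}_{1m}
  \;=\;
  \lim_{m \to \infty} \frac{1}{m}\, V^{p^\mu}_{1m}
  \qquad \text{for all } \mu \in \mathcal{M}.
\]
```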

We can modify the AIXI model such that its predictions are based on the
$\epsilon$-approximable Speed Prior $S$ instead of the incomputable $\xi$.
Thus we obtain
the so-called AIS model. Using Hutter's approach [22]
we can now show that the conditional
probability of environmental inputs to an AIS agent, given the earlier
inputs and actions, converges to the true but unknown probability,
as long as the latter is dominated by $S$, such as the $\mu$ above.
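Dominance here carries its standard meaning (stated for completeness): $S$ dominates an environment $\mu$ if there exists a constant $c_\mu > 0$ such that

```latex
\[
  S(x) \;\ge\; c_\mu\, \mu(x)
  \qquad \text{for all finite histories } x,
\]
```

which is the multiplicative-domination condition that drives Solomonoff-style convergence arguments.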