We have presented in this chapter some simple ways of balancing exploration and
exploitation. The ε-greedy methods choose randomly a small fraction of the
time, the softmax methods grade their action probabilities according to the current
action-value estimates, and the pursuit methods keep taking steps toward the
current greedy action. Are these simple methods really the best we can do in terms
of practically useful algorithms? So far, the answer appears to be "yes."
Despite their simplicity, in our opinion the methods presented in this chapter can
fairly be considered the state of the art. There are more sophisticated methods,
but their complexity and assumptions make them impractical for the full
reinforcement learning problem that is our real focus. Starting in Chapter 5 we
present learning methods for solving the full reinforcement learning problem that
use in part the simple methods explored in this chapter.

Although the simple methods explored in this chapter may be the best we can
do at present, they are far from a fully satisfactory solution to the problem of
balancing exploration and exploitation. We conclude this chapter with a brief look at
some of the current ideas that, while not yet practically useful, may point the
way toward better solutions.

One promising idea is to use estimates of the uncertainty of the
action-value estimates to direct and encourage exploration. For example,
suppose there are two actions estimated to have values slightly less than
that of the greedy action, but that differ greatly in their degree of
uncertainty. One estimate is nearly certain; perhaps that action has been
tried many times and many rewards have been observed. The uncertainty for
this action's estimated value is so low that its true value is very unlikely
to be higher than the value of the greedy action. The other action is known
less well, and the estimate of its value is very uncertain. The true
value of this action could easily be better than that of the greedy
action. Obviously, it makes more sense to explore the second action than
the first.
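To make this comparison concrete, here is a small numerical sketch. All of the numbers (a greedy value of 10.0, two actions estimated at 9.8 from 400 and 4 samples, unit reward standard deviation) are invented for illustration, and the interval half-width uses the usual normal approximation:

```python
import math

# Hypothetical numbers: both actions are estimated at 9.8, slightly below
# the greedy action's value of 10.0, but one estimate rests on 400 samples
# and the other on only 4. Assume unit reward standard deviation.
GREEDY_VALUE = 10.0

def half_width(n, z=1.96, std=1.0):
    """Approximate 95% confidence half-width for a mean of n samples."""
    return z * std / math.sqrt(n)

upper_well_tried = 9.8 + half_width(400)    # tried many times
upper_little_tried = 9.8 + half_width(4)    # tried only a few times

# The well-tried action's interval lies entirely below the greedy value,
# so its true value is very unlikely to be higher; the little-tried
# action's interval reaches well above the greedy value.
print(upper_well_tried < GREEDY_VALUE)      # prints True
print(upper_little_tried > GREEDY_VALUE)    # prints True
```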

This line of thought leads to interval estimation methods. These
methods estimate for each action a confidence interval of the action's
value. That is, rather than learning that the action's value is approximately
10, they learn that it is between 9 and 11 with, say, 95% confidence. The
action selected is then the action whose confidence interval has the highest
upper limit. This encourages exploration of actions that are uncertain and have a chance of ultimately being the best action. In some cases one can
obtain guarantees that the optimal action has been found with confidence equal
to the confidence factor (e.g., the 95%). Unfortunately, interval estimation
methods are problematic in practice because of the complexity of the
statistical methods used to estimate the confidence intervals. Moreover, the
underlying statistical assumptions required by these methods are often not
satisfied. Nevertheless, the idea of using confidence intervals, or some other
measure of uncertainty, to encourage exploration of particular actions is sound
and appealing.
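The selection rule itself can be sketched in a few lines, under the simplifying assumptions that rewards are roughly normal with a known standard deviation (here 1.0) and that a normal-theory interval is adequate; real interval estimation methods use more careful statistics:

```python
import math

def interval_estimation_action(estimates, counts, z=1.96, std=1.0):
    """Pick the action whose approximate 95% confidence interval has the
    highest upper limit. Actions never tried are treated as maximally
    uncertain and are therefore tried first."""
    best_action, best_upper = None, -math.inf
    for a, (q, n) in enumerate(zip(estimates, counts)):
        upper = math.inf if n == 0 else q + z * std / math.sqrt(n)
        if upper > best_upper:
            best_action, best_upper = a, upper
    return best_action

# The uncertain action (4 samples) is selected over the greedy action
# even though its value estimate is lower.
print(interval_estimation_action([10.0, 9.8, 9.8], [400, 400, 4]))  # prints 2
```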

There is also a well-known algorithm for computing the Bayes optimal
way to balance exploration and exploitation. This method is
computationally intractable when done exactly, but there may be efficient
ways to approximate it. In this method we assume that we know the
distribution of problem instances, that is, the probability of each
possible set of true action values. Given any action selection, we can
then compute the probability of each possible immediate reward and the
resultant posterior probability distribution over action values. This
evolving distribution becomes the information state of the
problem. Given a horizon, say 1000 plays, one can consider all possible
actions, all possible resulting rewards, all possible next actions, all
next rewards, and so on for all 1000 plays. Given the assumptions, the
rewards and probabilities of each possible chain of events can be
determined, and one need only pick the best. But the tree of
possibilities grows extremely rapidly; even if there are only two actions
and two rewards, the tree will have 2^2000
leaves. This approach effectively turns the bandit problem
into an instance of the full reinforcement learning problem. In the end,
we may be able to use reinforcement learning methods to approximate
this optimal solution. But that is a topic for current research and beyond
the scope of this introductory book.
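As a concrete (and much simplified) illustration of the information state, consider a two-armed bandit with 0/1 rewards and an independent Beta(1, 1) prior on each arm's success probability; these modeling choices are assumptions made for the sketch, not part of the text. The Beta parameters summarize the entire reward history, and Bayes-optimal planning would search the tree of all information states reachable within the horizon:

```python
def update(state, action, reward):
    """Posterior update of the information state after a 0/1 reward.
    state = (alphas, betas), one Beta(alpha, beta) posterior per arm."""
    alphas, betas = list(state[0]), list(state[1])
    if reward == 1:
        alphas[action] += 1
    else:
        betas[action] += 1
    return (tuple(alphas), tuple(betas))

def posterior_mean(state, action):
    """Expected success probability of an arm under its Beta posterior."""
    a, b = state[0][action], state[1][action]
    return a / (a + b)

state = ((1, 1), (1, 1))         # uniform Beta(1, 1) priors on both arms
state = update(state, 0, 1)      # arm 0 pays off
state = update(state, 0, 0)      # arm 0 fails
state = update(state, 1, 1)      # arm 1 pays off
print(posterior_mean(state, 0))  # prints 0.5
```

Even in this tiny setting, exact planning is hopeless at long horizons: the number of distinct event sequences grows by a factor of four with every play.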

The classical solution to balancing exploration and exploitation in n-armed
bandit problems is to compute special functions called Gittins indices.
These provide an optimal solution to a certain kind of bandit problem more general
than that considered here but that assumes the prior distribution of possible
problems is known. Unfortunately, neither the theory nor the computational
tractability of this method appear to generalize to the full reinforcement
learning problem that we consider in the rest of the book.