The algorithm

Let the world state at time t be x_t, and assume that the
learning system then chooses action a_t. The immediate result is
that a reward r_t is received by the learner and the world
undergoes a transition to the next state, x_{t+1}. The objective
of the learner is to choose actions maximizing discounted
cumulative rewards over time. More precisely, let γ be a
specified discount factor in [0, 1). The total discounted
return (or simply return) received by the learner starting
at time t is given by:

r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{n≥0} γⁿ r_{t+n}    (1)

The discount factor γ is a number in the range [0, 1) and is
used to weight near-term reinforcement more heavily than
distant future reinforcement. The closer γ is to 1, the
greater the weight given to future reinforcements.
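To make the discounted return concrete, here is a small sketch; the reward sequence and the value of γ are invented purely for illustration:

```python
def discounted_return(rewards, gamma):
    """Total discounted return: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    total = 0.0
    for n, r in enumerate(rewards):
        total += (gamma ** n) * r
    return total

# With gamma = 0.5, the rewards [1, 1, 1] yield 1 + 0.5 + 0.25 = 1.75.
print(discounted_return([1, 1, 1], 0.5))
```

Note how a small γ makes rewards more than a few steps away almost invisible, which is exactly why the robot below needs a high discount factor to value extending its arm.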

The objective is to find a policy π, or rule for selecting
actions, so that the expected value of the return is maximized.
It is sufficient to restrict attention to policies that select
actions based only on the current state (called stationary
policies): a_t = π(x_t). For any such policy π and for any
state x we define:

V^π(x) = E{ r_t + γ r_{t+1} + γ² r_{t+2} + ... | x_t = x, a_{t+n} = π(x_{t+n}) for all n ≥ 0 }    (2)

V^π(x) is the expected total discounted return received when
starting in state x and following policy π thereafter. If π is an
optimal policy, we also use the notation V* for V^π. Many
dynamic programming-based reinforcement learning methods involve
trying to estimate the state values V*(x) or V^π(x) for a
fixed policy π.

Q-learning is a simple incremental algorithm developed from the
theory of dynamic programming [Ross, 1983] for delayed
reinforcement learning. In Q-learning, policies and the value
function are represented by a two-dimensional lookup table indexed
by state-action pairs. Formally, for each state x and action a,
let:

V*(x) = max_a [ R(x, a) + γ Σ_y P_xy(a) V*(y) ]    (3)

Q*(x, a) = R(x, a) + γ Σ_y P_xy(a) V*(y)    (4)

where R(x, a) = E{ r_t | x_t = x, a_t = a } is the expected
immediate reward for taking action a in state x, and P_xy(a) is
the probability of reaching state y as a result of taking action
a in state x. It follows that

V*(x) = max_a Q*(x, a)    (5)

Intuitively, Equation 4 says that the state-action value,
Q*(x, a), is the expected total discounted return resulting from
taking action a in state x and continuing with the optimal policy
thereafter. More generally, the Q function can be defined with
respect to an arbitrary policy π as

Q^π(x, a) = R(x, a) + γ Σ_y P_xy(a) V^π(y)    (6)

and Q* is just Q^π for an optimal policy π.
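As a worked illustration of Equations 4 and 5, the following sketch computes Q* and V* by value iteration on a tiny hypothetical two-state, two-action model. The rewards and transition probabilities are invented for the example; the robot applet itself learns without ever knowing such a model:

```python
# Hypothetical model: R[x][a] is the expected immediate reward R(x, a),
# and P[x][a][y] is the transition probability P_xy(a) of Equation 4.
R = [[1.0, 0.0], [0.0, 2.0]]
P = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.8, 0.2], [0.2, 0.8]]]
gamma = 0.9

V = [0.0, 0.0]
for _ in range(200):  # value iteration until (approximate) convergence
    # Equation 4: Q(x, a) = R(x, a) + gamma * sum_y P_xy(a) * V(y)
    Q = [[R[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(2))
          for a in range(2)] for x in range(2)]
    # Equation 5: V(x) = max_a Q(x, a)
    V = [max(Q[x]) for x in range(2)]
```

This model-based computation is only possible when R and P are known; Q-learning, described next, estimates Q* directly from experience.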

The Q-learning algorithm works by maintaining an estimate of the
Q* function, which we denote by Q̂, and adjusting Q̂
values (often just called Q-values) based on
actions taken and rewards received. This is done using Sutton's
prediction difference, or TD error [Sutton, 1988]: the difference
between r_t + γ V̂(x_{t+1}) (the immediate reward received plus
the discounted value of the next state) and Q̂(x_t, a_t) (the
Q-value of the current state-action pair):

e_t = (r_t + γ V̂(x_{t+1})) - Q̂(x_t, a_t)    (7)

where r_t is the immediate reward, x_{t+1} is the next state
resulting from taking action a_t in state x_t, and
V̂(x) = max_a Q̂(x, a). Then the values of Q̂ are adjusted
according to

Q̂(x_t, a_t) ← Q̂(x_t, a_t) + α e_t    (8)

where α ∈ (0, 1] is a learning rate parameter. Note
that the current estimate of the Q* function implicitly defines
a greedy policy by π̂(x) = argmax_a Q̂(x, a). That is, the
greedy policy is to select actions with the largest estimated
Q-value.
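Equations 7 and 8 translate directly into an update on the lookup table. A minimal sketch, assuming states and actions are encoded as small integers indexing a nested list:

```python
def q_update(Q, x, a, r, x_next, alpha, gamma):
    """One application of Equations 7 and 8 to the lookup table Q."""
    v_next = max(Q[x_next])             # V(x_{t+1}) = max_a Q(x_{t+1}, a)
    e = (r + gamma * v_next) - Q[x][a]  # TD error, Equation 7
    Q[x][a] += alpha * e                # update rule, Equation 8

# Example: two states, two actions, all Q-values initially zero.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, x=0, a=1, r=1.0, x_next=1, alpha=0.5, gamma=0.9)
# Q[0][1] has moved partway from 0 toward r + gamma * max(Q[1]).
```

Only the entry for the visited state-action pair changes; all other Q-values are left untouched, which is what makes the algorithm incremental.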

It is important to note that the Q-learning method does not
specify what actions the agent should take at each state as it
updates its estimates. In fact, the agent may take whatever
actions it pleases. This means that Q-learning allows arbitrary
experimentation while at the same time preserving the current best
estimate of states' values. This is possible because Q-learning
constructs a value function on the state-action space, instead of
the state space. It constructs a value function on the state space
only indirectly. Furthermore, since this function is updated
according to the ostensibly optimal choice of action at the
following state, it does not matter what action is actually
followed at that state. For this reason, the estimated returns in
Q-learning are not contaminated by "experimental" actions
[Watkins,1989], so Q-learning is not experimentation-sensitive.

To find the optimal Q function eventually, however, the agent must
try out each action in every state many times. It has been shown
[Watkins, 1989; Dayan, 1992] that if Equation 8 is
repeatedly applied to all state-action pairs in any order in which
each state-action pair's Q-value is updated infinitely often, then
Q̂ will converge to Q* and V̂ will converge
to V* with probability 1, as long as α is reduced to 0 at
a suitable rate. This is why the greedy policy
π̂(x) = argmax_a Q̂(x, a) is only used part of the time, in
order to be able to explore the state-action space completely. At
each iteration t, we choose either a random action or the greedy
action argmax_a Q̂(x_t, a). We take the first possibility (the
random action) with probability ε. At the beginning of learning,
ε must be large (near 1). At the end of learning, when
Q̂ ≈ Q*, we set ε = 0 to always use the greedy policy. This
is, as always, the compromise between exploration and
exploitation.
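The exploration scheme just described is the classic ε-greedy rule. A minimal sketch, again assuming integer-encoded states and actions; how ε is decayed over time is left open, since the text only requires that it start near 1 and reach 0 once learning is done:

```python
import random

def epsilon_greedy(Q, x, epsilon):
    """With probability epsilon take a random action; otherwise be greedy."""
    n_actions = len(Q[x])
    if random.random() < epsilon:
        return random.randrange(n_actions)           # explore: random action
    return max(range(n_actions), key=lambda a: Q[x][a])  # exploit: argmax_a Q(x, a)

Q = [[0.2, 0.7], [0.1, 0.0]]
a = epsilon_greedy(Q, x=0, epsilon=0.0)  # epsilon = 0: always the greedy action
```

With ε = 1 every action is random (pure exploration); with ε = 0 the agent always follows the greedy policy (pure exploitation).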

In short, we have:

α: the learning rate.

γ: the discount factor. The closer γ is to 1, the greater the
weight given to future reinforcements.

You can see on the upper-left part of the screen the
average speed of the robot. That is:

average speed = d / t    (9)

where d is the distance covered by the robot and t the elapsed
time since the last reset. When you press the [reset speed
counter] button, the program sets d = 0 and t = 0.

A small value of γ will not allow the robot to extend
its arm forward to the right, because during all this time the
robot gets no reward. You must set a high value for γ,
otherwise the robot doesn't see that, far in the future, it is
useful to have its arm extended.

ε should be set to a high value (0.8) at the beginning
of the learning process. ε should be set to zero when the
optimal control policy has been found. To see if an optimal policy
has been found, press the [reset speed counter] button and wait a
little until the [speed counter] stabilizes. You now have a
correct evaluation of the speed of the robot. Continue the
learning process until you can't increase this value anymore.

You can compete with the algorithm! See for yourself that
Q-learning is the best! Try to beat it! How to play? Press the
[stop] button. Press the [reset speed counter] button. Move the
arm and the hand of the robot using the 4 arrow keys: [UP/DOWN]
moves the arm (yellow line) and [LEFT/RIGHT] moves the hand (red
line). Move the robot until the [speed counter] stabilizes.
Now for the result: do you really go faster than Q-learning?