QLearning

Sep 9, 2006

Dear soul-mates,

I would like to comment on some issues with the Q-learning algorithm in the book.

(1)
First of all, I think there is a mistake in the pseudocode: in the update of
Q[a,s], r' should be used instead of r. Otherwise the rewards of the terminal
states will never be taken into account.
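A minimal sketch of the corrected update (the function and variable names here are mine, not the book's; gamma = 1 as in the gridworld example, and the decaying learning rate follows point (6) below):

```python
def q_update(Q, Nsa, s, a, r_prime, s_prime, actions, gamma=1.0, rl=60):
    """One Q-learning TD update, using the reward r' observed in the NEW
    state s' (point 1), so terminal rewards propagate back into Q.
    Q and Nsa are dicts keyed by (state, action); Nsa starts empty,
    i.e. all counts are 0 (point 3)."""
    Nsa[(s, a)] = Nsa.get((s, a), 0) + 1
    alpha = rl / (rl - 1 + Nsa[(s, a)])               # decaying learning rate
    old = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_prime, ap), 0.0) for ap in actions)
    # Use r' (reward of s'), not r (reward of s):
    Q[(s, a)] = old + alpha * (r_prime + gamma * best_next - old)
    return Q[(s, a)]
```

On the first visit alpha is 60/(59+1) = 1, so a transition into a terminal state writes its reward r' straight into Q[a,s].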

(2)
If my point (1) is correct, then the static variable r is not needed in the
code.

(3)
Nsa should be initialized to 0 for all state-action pairs.

(4)
The maximizing a' should be chosen at random when more than one action attains
the maximum (ties should be broken randomly between them).
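One way to sketch this tie-breaking argmax (the helper name is mine):

```python
import random

def argmax_random_tie(values):
    """Return the index of a maximal entry, breaking ties uniformly at
    random (point 4). Without this, the agent systematically prefers
    whichever action happens to come first among equal Q-values."""
    best = max(values)
    candidates = [i for i, v in enumerate(values) if v == best]
    return random.choice(candidates)
```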

(5)
f(q,n) must return a value even when q is null, i.e. when the agent has no idea
of the value of Q(a', s').
In the book it works because f(q,n) returns 2 for the first 5 tries, regardless
of the value of Q(a', s') (explained on page 774 of the International Edition).
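A sketch of such an exploration function, with the book's values rp = 2 and en = 5; treating a missing q as 0.0 after the exploration phase is one possible choice (the point is only that f must handle q == null):

```python
def f(q, n, rp=2.0, en=5):
    """Exploration function: return the optimistic estimate rp while an
    action has been tried fewer than en times; afterwards return the
    learned q-value, defaulting to 0.0 when no estimate exists yet."""
    if n < en:
        return rp
    return q if q is not None else 0.0
```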

(6)
I have been playing with the Q-learning algorithm (after the modification
described in (1)) and the simple MDP world example. I have checked it with two
different parameter sets. The one described in the book:
- reward of non-terminal states: -0.02 [r]
- applies the value 2 [rp] to actions tried fewer than 5 [en] times
- the learning rate is 60 / (60 - 1 + iteration), i.e. the parameter rl is 60 here
- the number of trials is 2000 [tn]
and the values:
rl = 5    // learning rate
en = 100  // exploration number
rp = 2    // value of unknown states
tn = 300  // number of trials
r = -0.05 // reward of non-terminal states
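The two parameter sets above can be written out as plain configuration, with the decaying learning rate made explicit (names are the post's abbreviations; the function name is mine):

```python
def learning_rate(rl, iteration):
    """Decaying learning rate rl / (rl - 1 + iteration); with rl = 60
    this is 60 / (59 + iteration): it starts at 1.0 on the first visit
    and decreases toward 0."""
    return rl / (rl - 1 + iteration)

BOOK_PARAMS = dict(r=-0.02, rp=2, en=5, rl=60, tn=2000)   # set from the book
ALT_PARAMS = dict(r=-0.05, rp=2, en=100, rl=5, tn=300)    # second set tried
```

With rl = 5 the rate decays much faster (5/(4+n)), while en = 100 forces far more exploration per state-action pair before the learned values are trusted.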

I have computed the q-values of the four actions in state (3,3) at each
iteration. With the first set (see attached graph1.png for the values of a
representative experiment), the final values are wrong in most of the
experiments.
With the second set, the values are correct in all experiments I have performed
so far (see attached graph2.png for an example).