Reinforcement Learning Course Performance-based Rewards (Musings)

David Silver’s ‘teaching’ home page includes links to the slides
(which are particularly helpful for Lecture 7, where there’s no video available - just sound).

These notes are mainly for my own consumption…

Musings : Eligibility Traces

Several things that bubbled to the surface during the Reinforcement Learning course:

Eligibility traces, which are a way of keeping track of how rewards in \(TD(\lambda)\) should be attributed
to the states/actions that led to them, seem very reminiscent of activation potentials
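
A minimal sketch of the mechanism as I understand it (tabular, accumulating traces; the `env` interface and the random policy are just assumptions for illustration, not from the course):

```python
import numpy as np

def td_lambda_episode(env, V, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of tabular TD(lambda) with accumulating eligibility traces.

    The env interface (reset/step/action_space) is assumed for illustration;
    states are integer indices into the value table V.
    """
    E = np.zeros_like(V)                       # one eligibility trace per state
    s = env.reset()
    done = False
    while not done:
        a = env.action_space.sample()          # random policy, purely for illustration
        s_next, r, done = env.step(a)
        delta = r + gamma * V[s_next] * (not done) - V[s]   # TD error ("surprise")
        E[s] += 1.0                            # mark the current state as eligible
        V += alpha * delta * E                 # credit all recently-visited states at once
        E *= gamma * lam                       # traces decay exponentially
        s = s_next
    return V
```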

Aiming for minimum fit error is the same as optimising for least surprise, i.e. making
predictions that are as un-surprising as possible
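
One way to make that concrete (under a Gaussian noise assumption, which is mine, not the course's): the negative log-likelihood of an observation \(y\) given a prediction \(\hat{y}\) is

\[
-\log p(y \mid \hat{y}) \;=\; \frac{(y - \hat{y})^2}{2\sigma^2} \;+\; \tfrac{1}{2}\log(2\pi\sigma^2)
\]

so, for fixed \(\sigma\), minimising the summed squared error is exactly minimising the total surprisal \(-\sum_i \log p(y_i \mid \hat{y}_i)\).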

Distributing the ‘surprise’ (the TD error, positive or negative) across exponentially decaying eligibility
traces on the nodes of a neural network seems like a very natural reinterpretation of the \(TD(\lambda)\) framework
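
With function approximation the same idea keeps one decaying trace per parameter rather than per state; a rough sketch, assuming a linear value function \( v(s) = w \cdot \phi(s) \) for brevity (the neural-network version would just use the gradient of \(v\) in place of \(\phi\)):

```python
import numpy as np

def semi_gradient_td_lambda_step(w, z, phi_s, phi_s_next, r, done,
                                 alpha=0.01, gamma=0.99, lam=0.9):
    """One semi-gradient TD(lambda) update for a linear value function v(s) = w . phi(s).

    z is the per-weight eligibility trace; the TD error ("surprise") is spread
    over all weights in proportion to their exponentially decaying traces.
    """
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next
    delta = r + gamma * v_next - v_s      # the surprise
    z = gamma * lam * z + phi_s           # decay traces, then add grad of v(s) w.r.t. w
    w = w + alpha * delta * z             # distribute the surprise along the traces
    return w, z
```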

But current neural networks are dense, with multi-dimensional embedding spaces that entangle knowledge
far more thoroughly than the sparse representations (apparently) found in the brain

So there appears to be a definite disconnect between the rationales
behind the two approaches, one that requires more justification

Musings : Exploration vs Exploitation

Interesting that \( (\mu + n\sigma) \) is an effective action-chooser, since that is what I was doing in the 1990s,
but only using it as a placeholder heuristic until I found something ‘more correct’

But \([0,1]\)-bounded i.i.d. variables having a decent confidence bound (Hoeffding’s inequality) was a new thing for me
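
For the record, Hoeffding for i.i.d. \(X_i \in [0,1]\) with sample mean \(\bar{X}_n\) gives

\[
\mathbb{P}\big(\mathbb{E}[X] > \bar{X}_n + \epsilon\big) \;\le\; e^{-2 n \epsilon^2}
\]

and setting the right-hand side to a failure probability that shrinks with time, e.g. \(p = t^{-4}\), then solving for \(\epsilon\) gives the UCB1 bonus \(\sqrt{2\log t \,/\, N_t(a)}\), i.e. an upper-confidence cousin of the \( (\mu + n\sigma) \) heuristic above.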

Also liked how Thompson sampling (from the 1930s) implicitly creates samples according to the right distribution,
merely by drawing samples from the underlying factors - very elegant (a toy sketch is below)

and it makes me think of the heuristic in Genetic Algorithms of ‘just sampling’
as potentially solving a (very complex) probability problem without actually doing any explicit computation
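
The Thompson sampling sketch I had in mind (a toy Beta-Bernoulli bandit of my own, not from the course): each arm keeps a posterior over its success probability, and simply sampling one value per posterior makes each arm get chosen exactly as often as it is currently believed to be the best, with no explicit probability calculation anywhere.

```python
import numpy as np

def thompson_bernoulli(true_probs, n_rounds=10_000, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors per arm."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    alpha = np.ones(n_arms)                    # successes + 1
    beta = np.ones(n_arms)                     # failures + 1
    total_reward = 0.0
    for _ in range(n_rounds):
        theta = rng.beta(alpha, beta)          # one sample per arm from its posterior
        arm = int(np.argmax(theta))            # play the arm whose sample looks best
        reward = float(rng.random() < true_probs[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        total_reward += reward
    return total_reward, alpha, beta

# total, a, b = thompson_bernoulli([0.2, 0.5, 0.7])
```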

Musings : Games and State-of-the-Art

The tree optimisations seem ‘obvious’ in retrospect - but clearly each was a major ‘aha!’ when it
was first proposed. Very interesting.
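
Alpha-beta pruning is the one I keep coming back to as an example; a minimal sketch (the `children` and `evaluate` callables are assumptions, standing in for whatever game is being searched):

```python
def alphabeta(state, depth, alpha, beta, maximising, children, evaluate):
    """Minimax with alpha-beta pruning: skip subtrees that cannot change the result."""
    kids = children(state)
    if depth == 0 or not kids:
        return evaluate(state)
    if maximising:
        best = float('-inf')
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False, children, evaluate))
            alpha = max(alpha, best)
            if alpha >= beta:        # the minimiser would never allow this branch
                break
        return best
    else:
        best = float('inf')
        for child in kids:
            best = min(best, alphabeta(child, depth - 1, alpha, beta, True, children, evaluate))
            beta = min(beta, best)
            if beta <= alpha:        # the maximiser would never allow this branch
                break
        return best
```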

Almost all of the State-of-the-Art methods use binary features, linear approximations to \( v_*() \), search and self-play.

Perhaps deeper models will work instead of the linear ones - but it’s interesting that binary (~sparse?)
features are basically powerful enough (when there’s some tree-search for trickier strategy planning)
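
A hedged sketch of what ‘binary features plus a linear \( v(s) \)’ amounts to (the feature indexing is invented for illustration): the value is just the sum of the weights of whichever features are switched on, which is cheap enough to evaluate millions of times inside a tree search.

```python
import numpy as np

# Hypothetical sparse binary feature space for a board position
N_FEATURES = 1_000_000
w = np.zeros(N_FEATURES)

def value(active_feature_ids):
    """v(s) = w . phi(s), where phi(s) is binary and sparse:
    only the indices of the features that are 'on' are passed in."""
    return w[active_feature_ids].sum()

def td_update(active_feature_ids, target, alpha=0.01):
    """Nudge only the active features' weights towards the target value."""
    delta = target - value(active_feature_ids)
    w[active_feature_ids] += alpha * delta
```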

Now we have new real-world learning with which to fine-tune the world model

To apply this to a gimbal, it seems like one could present ‘target trajectories’ to
the controller in turn, letting it learn a common world-model,
with different reward-models for each goal. And then let it self-play…
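
Very loosely, the structure I have in mind looks like this (entirely my own sketch, with duck-typed `world_model` / `reward_models` collaborators standing in for whatever learners are actually used):

```python
class GimbalAgent:
    """Sketch: one shared world-model (dynamics), many goal-specific reward-models."""

    def __init__(self, world_model, reward_models):
        self.world_model = world_model          # learnt from every trajectory seen
        self.reward_models = reward_models      # one per target trajectory / goal

    def train_on(self, trajectory, goal):
        # Every trajectory improves the shared dynamics model...
        self.world_model.update(trajectory)
        # ...but only the reward-model for the current goal is updated.
        self.reward_models[goal].update(trajectory)

    def plan(self, state, goal):
        # Plan / self-play against the shared model, scored by the goal's reward-model.
        return self.world_model.search(state, self.reward_models[goal])
```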