17 Further work

This thesis may look like some kind of definitive empirical comparison of these methods
but it is in fact far from the last word.
The real contribution of this thesis is to invent these methods in the first place,
show their motivation, their expected strengths and weaknesses,
place them in context relative to each other,
and show how the initial empirical results support what the theory would predict.
There are a number of untested methods
(though their expected performance has been discussed in the text):

Untested methods that we expect to be useful

The pure Minimize the Worst Unhappiness when actions are shared
(§13).

W-learning with full space with a declining Boltzmann distribution for W
instead of random W winners (§10).

Untested methods that we do not expect to be interesting

Other static measures of W, such as W = importance
(§5.4).
We expect them to have the same problem as static W=Q,
of not being able to take opportunities
(see §9.1).

Any scaled static or dynamic measures of W
(§8.1.2,
§15.4.5).
These won't provide any advantage
because we do not want to fix agents' inequalities
relative to each other
(see §8.1.2).

The Collective methods (§12).
[Whitehead et al., 1993] found that
Maximize Collective Happiness performed not much different from W=Q
(while both were better than monolithic Q-learning).

The product method of Maximize Collective Happiness
(§15.5.2)
- once the correction is made that rewards (and hence Q-values)
are all made > 0.

The artificial world I used increased in complexity throughout the thesis,
but as was revealed in §9.1,
still proved too simple to properly separate the two approaches of
Maximize the Best Happiness and
Minimize the Worst Unhappiness.
To properly contrast the two approaches,
they need to be tested in an environment
in which the rewards of opportunism (and the costs of not being opportunistic) are greater.

When testing the Collective methods,
the world should also be one
where some agents have non-recoverable states.
This world probably doesn't have enough such states.
If not obeyed at some point,
an agent can almost always recover its position within a small number of steps.

As just discussed,
the definitive empirical test probably needs more than
just a global fitness function as a linear combination of rewards.
It should also include an explicit penalty
for solutions that ignore subgoals.

In scaling up Reinforcement Learning to complex problems (such as robot control),
ways of breaking up the problem and its statespace are clearly needed.
The problem is normally phrased (e.g. see
[Kaelbling et al., 1996,
§6.3])
as that of task decomposition
- breaking up the problem into subtasks that must be achieved in serial or in parallel.

It would be misleading to see this thesis as providing a solution
to the general problem of decomposing problems into sub-tasks.
Rather, its contribution is, given a collection of agents,
find automatic ways of resolving their competition
so that they can share the same creature in a sensible manner.
Once put together, the agents in W-learning automatically find each other's weak spots,
and discover the areas on which they can and cannot compromise.

Then the problem becomes one of designing a suitable collection of agents.
This is not at all trivial,
but it is hoped that it will be easier than designing the details of their action selection.
We do have some rules of thumb to help in designing these collections
- for example, to increase the influence of an agent in the creature's mind,
slowly increase the differences between its rewards
(recall §8.1.3).

This system has the property of being able to "absorb" change
- increasing an agent's strength doesn't guarantee that it will do better in every single state.
In some states, a change of leader might reawaken a dormant agent,
who suddenly finds that the agent that used to fight its battles for it is no longer winning.
In other words, an agent increasing its strength will be unlikely to suddenly
ruin all the other agents. They will give up their ground slowly,
giving up less important states
but holding on to the end to their crucial states.
This is all done automatically.

The promise of RL is that designing a reward function
will be easier than designing the behavior itself.
Similarly, in addressing the Action Selection problem,
it is hoped that designing well-balanced collections
(and letting the competition resolution be automatic)
will be easier than designing the detailed action selection itself.

In general, we cannot expect to apply these techniques
to all multi-goal problems.
If goals are so different that they effectively can't be interrupted
or interleaved,
the ideas in this thesis may not be of much help.

Note however that the W-learning algorithm does not force goal-interleaving.
It only discovers whether goal-interleaving is possible.
Competition can actually result in serial behavior in practice.
Imagine agents that have almost no areas of agreement,
where one agent's best action is always the other's worst action, and vice versa.
As soon as one agent gets started (gets to take a few actions down its goal path)
it will be too strong compared to the others that are not started,
since it will have taken us into states where they can get no reward,
yet at the same time they would cause it a considerable loss if they take over.
Their loss by not getting their goal started cannot compete
with its loss if it loses the lead.
So it will fight them off and run until completion.
Then another agent gets a chance, and runs to termination.
And so on.
In practice, goals are satisfied serially in this case.
This is simply an extreme version of the persistence-enforcing
positive-feedback mechanism.

This thesis has been concerned with the Action Selection problem
- the problem of choice between multiple conflicting goals at run-time.
I now turn all this on its head by applying it
to the problems that Reinforcement Learning has been primarily interested in
- problems having one single goal.

Nothing in our model forces the competing agents to be actually trying to solve different goals.
To solve a single-goal problem,
we put together a large number of agents with different reward functions
but all roughly trying to solve the same thing
- each perhaps taking radically different approaches,
with statespaces and actionspaces that share nothing in common -
and we let them struggle via W-learning to solve it.
If there are multiple unexpressed behaviors, and lots of overlap,
this may be a small price to pay if there is a robust solution.

This has some resemblance to Minsky's "accumulation" of multiple different ways of solving a problem
in his Society of Mind [Minsky, 1986].
Some may be more efficient than others, but we keep them all for robustness.
Or more subtly, it might be hard to rank them in a single order.
Some may be more efficient some times (for some x)
but less efficient other times (for other x).

Having a noisy collection of competing agents to solve the problem may seem a dangerous strategy,
but remember that if the creature is looking as if it is going to make some serious error,
individual agents will come in with large W-values and take it all over for a while.
We positively want collections that single agents can take over and dominate
(such as Minimize the Worst Unhappiness)
if they feel strongly enough.

Note that we could try something like this in Hierarchical Q-learning too
(but would pay the price of a large (x,i) statespace).

In this thesis, I introduced Reinforcement Learning first,
and then showed how these numerical values could be propagated further to solve the action selection problem.
But I could just as easily have introduced the action selection methods
and left open where the numerical values came from.
They could also be evolved randomly, designed explicitly,
or come from some other learning method.

In particular, any system that builds up "fitness" values for actions F(x,a) can be used
as the basis of W-learning type schemes.
If a symbolic AI system can answer either the question before the fact:
"You want to take action .
If I force you to take action , how bad will that be?"
or the question after the fact:
"OK I ignored you and did something.
How bad was it?"
then a symbolic AI W-learning system can be built.

As should have been obvious throughout,
and as is illustrated by Figures 5.3
and 6.1,
W-learning almost demands a parallel implementation.
Agents can work out their suggested action
and update their Q and W values
independently, there being no serial links between them at all.

Most parallel multi-agent systems postulate
low-bandwidth communications between agents anyway.
With basic (not Negotiated) W-learning there is perfect parallelism,
where there is no communication between agents at all.
What communication there is takes place between agents and the switch.
Similar to schemes such as Maes' Spreading Activation Networks
[Maes, 1989,
§7.2],
no problem-specific agent language is used.
Only numbers are communicated.

Consider again Watkins' motivation [Watkins, 1989]
of finding learning methods that might plausibly occur in nature,
in particular methods that impose limited computational demands per timestep.
Once the Q-values have been learnt,
Q-learning provides a fast control policy
- ignore rewards, don't learn, and simply act on the best Q-value.
A Q-learning-controlled creature is a pure stimulus-response machine.
Parallel W-learning would retain all the speed of stimulus-response
in a more sophisticated architecture.

The whole motivation for the behavior-based AI project (see [Dennett, 1978])
is to understand simple complete creatures
and gradually move on to more complex complete creatures
(as opposed to the traditional AI approach
of trying to understand sub-parts of already-complex creatures
in the hope of combining these parts into an understanding of a whole complex creature).

The behavior-based AI approach involves a certain discipline in not moving on to more complex models
until simpler ones have been exhausted and understood.
There is a great temptation in the House Robot problem to start equipping the creature
with internal working memory, context, world models, map-building ability, a concept of time, plans, explicit goals,
representations, symbols, reasoning, and so forth.
One ends up with a hand-designed ad-hoc system from which very few if any general principles can be learnt.

I take the broad approach of Wilson [Wilson, 1990]
in pushing the limits of simple animat models,
and resisting as long as possible the temptation to introduce more complex models:

For example, Todd et al. [Todd et al., 1994] show that
even the possibilities of sensor-less(!) memory-less creatures
have not yet been fully explored.
Their creatures find (by evolution) the optimum p(a) ,
the probability of generating an action (independent of the state of the world).

Certainly it is clear from this thesis that the possibilities of stimulus-response
have not remotely been exhausted.
In particular, societies of stimulus-response agents.
Instead of building more complex components,
I'm heading in the other direction,
of more complex societies of relatively simple components.
Actually I call these relatively simple components because they are stimulus-response,
but in fact making hierarchies or societies out of components as complex as entire learning behaviors
has only been tried relatively recently.

Where all this is leading is towards ecosystem-like minds,
with structure in their collections, multiple competitions,
and dynamic creation and destruction of agents.
But I have been building carefully from the bottom up,
instead of jumping direct to such complex systems.

This work was partly inspired by Edelman's vision of a "rainforest" mind,
with life and death, growth and decay, in a dynamic ecosystem inside the head [Edelman, 1989, Edelman, 1992].
In his model, called Neural Darwinism,
competition is between structures called neuronal groups.

The idea is appealing, but it is difficult to tell whether neuronal groups are meant to be action-producing agents
or just pattern classifiers.
Edelman presents no explicit algorithm to show
how the model is implemented.
In fact, presenting arguments only about symbolic AI,
he draws the familiar conclusion that no
computer algorithm[1]
could implement his ideas.
No mention is made of self-modifying, reinforcement-learning agents embedded in the world,
to which his criticisms do not apply.

Here, with W-learning, we have the basis
for a full living and dying ecosystem inside the creature,
where what comes into existence, struggles, and perhaps dies is not mere connections (as Edelman may in fact mean)
but entire autonomous goals.
In standard W-learning, there is one competition between the agents,
this is resolved, and the creature is then fixed.
In a dynamic collection,
agents would continually adjust their W-values as new agents came into existence,
old agents died, and the nature of their competition changed.

In the earlier version of this document [Humphrys, 1996a, §17.5.1]
I briefly discussed the possibilities for constructing
dynamically changing collections.
The basic problem with a dynamic collection though,
is, if there is an automatic generator of random new agents,
that it can just generate a single new, extremely strong agent
(large differences between its rewards)
which immediately takes over the whole creature,
turning a carefully-built creature into a zombie overnight.

In fact, there would be an "arms race" between such strong agents
over the lifetime of the creature,
as the agent-generator generated an even stronger agent
which took over in its turn, and so on.
The creature would change from being one type of zombie to another
over its life.
Our dilemma is caused by the fact that we want the mind to be
driven by strong individuals in each state
(we don't want woolly collective decision making).
But then one strong individual can take over.

There are a number of possible ways of making collections hard to invade
without going back to majority-rule,
but at this point we must ask how useful this line of thinking really is
from the point of view of engineering.

When thinking about animal minds, it is attractive
to think of new agents (ideas,
memes)
coming into existence during the life history,
and trying to claim some space in the already-crowded mind.
But from the point of view of engineering, concepts like "individual" are only useful myths.
If a collection is invaded and destroyed,
we simply note that the new collection didn't work,
reset it to what it used to be
and try something else.
Resisting invasion may be important in biology to preserve the integrity
of the individual,
but in engineering this is just another way of building things.
Only if we sent our robot out into the world, with the ability to communicate,
imagine, and form new goals unsupervised over long periods of time,
might questions of maintaining individual integrity start to arise.
Even then, an artificial system can restore a past backup
in a way that a natural system cannot.

So for the moment, dynamic collections are not so interesting
since in their basic form they are just another way
of finding combinations of agents in the same collection.
More interesting for now is to discuss whether the collection itself
can start to have any structure. This is the subject of the final chapter.

[1]
This may all be a matter of terminology.
Edelman's co-worker Olaf Sporns
says [Sporns, 1995] that they do not like to use the word algorithm
for a self-modifying algorithm embedded in the real world.
A computer scientist would say that it still is an algorithm,
even if it has effectively become part of the world.