Abstract:

In continuous learning settings, stable stochastic policies are often necessary to ensure that agents continuously adapt to dynamic environments. The choice of decentralised learning system and employed policy plays an important role in the optimisation task. For example, a policy that exhibits fluctuations may introduce non-linear effects that other agents in the environment cannot cope with, and may even amplify. In dynamic and unpredictable multiagent environments, these oscillations may introduce instabilities. In this paper, we take inspiration from the limbic system to introduce an extension to the weighted policy learner, in which agents evaluate rewards as either positive or negative feedback, depending on how they deviate from average expected rewards. Agents have positive and negative biases, where a bias either magnifies or depresses a positive or negative feedback signal. To contain the non-linear effects of biased rewards, we incorporate a decaying memory of past positive and negative feedback signals, which provides a smoother gradient update on the probability simplex by spreading the effect of each feedback signal over time. Splitting the feedback signal also gives more leverage on the Win or Learn Fast (WoLF) principle. The cognitive policy learner is evaluated using a small queueing network and compared with the fair action learner and the weighted policy learner. Emphasis is placed on analysing the dynamics of the learning algorithms with respect to the stability of the queueing network and the overall queueing performance.
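
As a rough illustration of the mechanism the abstract describes, the following Python sketch combines the three ingredients: feedback split into positive and negative signals relative to an average expected reward, biases that magnify or depress each signal, and decaying memories that smooth the gradient update on the probability simplex with a WoLF-style weighting. All names and parameters (beta_pos, beta_neg, decay, step) are assumptions chosen for illustration, as is the use of a single global running average reward; the paper's exact update rule may differ.

```python
import numpy as np


class CognitivePolicyLearnerSketch:
    """Illustrative sketch only: parameter names and the precise update
    rule are assumptions, not the paper's definitive algorithm."""

    def __init__(self, n_actions, beta_pos=1.2, beta_neg=0.8,
                 decay=0.9, step=0.05):
        self.pi = np.full(n_actions, 1.0 / n_actions)  # policy on the simplex
        self.avg_reward = 0.0                 # running average expected reward (assumed global)
        self.t = 0
        self.trace_pos = np.zeros(n_actions)  # decaying memory of positive feedback
        self.trace_neg = np.zeros(n_actions)  # decaying memory of negative feedback
        self.beta_pos, self.beta_neg = beta_pos, beta_neg
        self.decay, self.step = decay, step

    def update(self, action, reward):
        # Feedback is the deviation of the reward from the average expected reward.
        feedback = reward - self.avg_reward
        self.t += 1
        self.avg_reward += (reward - self.avg_reward) / self.t

        # Decay the memories, then apply the bias: here positive feedback is
        # magnified (beta_pos > 1) and negative feedback is depressed (beta_neg < 1).
        self.trace_pos *= self.decay
        self.trace_neg *= self.decay
        if feedback >= 0:
            self.trace_pos[action] += self.beta_pos * feedback
        else:
            self.trace_neg[action] += self.beta_neg * feedback

        # The net smoothed signal drives the gradient. WoLF-style weighting as in
        # the weighted policy learner: positive net feedback is scaled by (1 - pi),
        # negative net feedback by pi, so losing actions adjust faster.
        net = self.trace_pos + self.trace_neg
        wolf = np.where(net >= 0, 1.0 - self.pi, self.pi)
        self.pi += self.step * wolf * net

        # Project back onto the probability simplex.
        self.pi = np.clip(self.pi, 1e-6, None)
        self.pi /= self.pi.sum()

    def act(self, rng):
        return rng.choice(len(self.pi), p=self.pi)


# Toy two-action usage: the second action pays more on average, so
# probability mass should drift toward it over time.
rng = np.random.default_rng(0)
learner = CognitivePolicyLearnerSketch(n_actions=2)
for _ in range(2000):
    a = learner.act(rng)
    r = rng.normal(loc=(0.2, 0.5)[a])
    learner.update(a, r)
print(learner.pi)
```

The decaying traces are what contain the non-linear effects of the biases: a single biased feedback signal no longer produces a sharp jump in the policy, because its influence is spread across subsequent updates and gradually fades.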