Final Remarks

Previous work on IS (for example, [34,35])
implicitly focused on goal-oriented exploration -- in principle IS can
learn to change
its exploration strategy if this turns out to accelerate external
reward in the long run. This chapter's IS implementation, however,
involves an additional pure exploration component besides
the goal-oriented one. Part of the learner receives internal reward
for pointing out something another part did not know but thought
it knew. The surprised part suffers to the extent the surprising part
benefits -- the sum of all internal rewards remains zero. The learner
is ``interested'' in ``creative'' computations leading to unexpected results, while
simultaneously trying to make formerly surprising things predictable
and boring. It does not care much for irregular noise rich with
Shannon information [38]. Instead it prefers easily learnable
algorithmic
regularities, taking into account the costs of gaining information
in an RL framework.
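The zero-sum bookkeeping of internal rewards can be sketched in a few lines. This is a minimal illustration only, not the chapter's actual implementation: the function name, the stake size, and the representation of bets as a dictionary are all assumptions made for the sketch.

```python
def settle_surprise(bets, outcome):
    """Zero-sum internal reward (illustrative sketch, not the chapter's code).

    Each module bets on the outcome of a computation. Modules whose
    prediction was wrong pay the modules that predicted correctly, so
    the surprised part suffers exactly to the extent the surprising
    part benefits: internal rewards always sum to zero.
    """
    winners = [m for m, predicted in bets.items() if predicted == outcome]
    losers = [m for m, predicted in bets.items() if predicted != outcome]
    if not winners or not losers:          # no disagreement: nobody pays
        return {m: 0.0 for m in bets}
    stake = 1.0                            # assumed unit stake per loser
    reward = {m: -stake for m in losers}
    reward.update({m: stake * len(losers) / len(winners) for m in winners})
    return reward

# Module A expected the computation to return 0, module B expected 1.
r = settle_surprise({"A": 0, "B": 1}, outcome=1)
assert abs(sum(r.values())) < 1e-12       # internal rewards sum to zero
```

If both modules predict the same outcome there is no surprise and no internal reward changes hands, which is the sketch's analogue of a formerly surprising computation having become predictable and boring.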

From each module's perspective an instruction subsequence is ``novel''
as long as its outcome surprises the other. Since the surprised module
will eventually figure out what is going on, there will be incessant
pressure to create new novelties. What is the use of such an inquisitive
system's computations? Curiosity's long-term justifiability depends on
whether knowledge growth will eventually support goal-oriented behavior.
When will this be the case? The question is reminiscent of G. H. Hardy's
toast on ``pure'' mathematics -- the kind that ``would never be of
any use to anyone'' ([3], p. 185). History teaches us,
however, that it is hard to decide which math will be useless forever.
For instance, old results from ``pure'' number theory are used in today's
encryption technology.

In general, however, it will always be possible to design
environments where ``curiosity kills the cat'' [40], or at
least has negative influence on external performance. For instance, as
exemplified by simulation 2 of Experiment 2a, curiosity may occasionally
prevent discovery of external reward sources. This is reminiscent
of the situation in supervised learning, where additional
``regularizer'' terms are often added to the standard error function
defining network performance on the training data. They can greatly help
to remove redundant free parameters and improve generalization on
unseen data (e.g., [8]), but in general this cannot be guaranteed.

This chapter's approach draws inspiration from several
sources. For instance, it is based on two co-evolving modules.
Co-evolution of competing strategies, however,
is nothing new; see, for example, [7,19] for interesting
cases.
Also, the idea of improving a learner by letting it play
against itself is ancient. See, for example, [20,41].
Even the idea of unsupervised learning through co-evolution of predictors
and modules trying to escape the predictions is nothing new -- it has
been used extensively in our previous work on unsupervised sensory coding
with neural networks
[25,36,33,32,37].
Finally, co-evolutionary methods translating mismatches between reality
and expectations into reward for ``curious,'' exploring agents are not new
either -- see our previous work on ``pure'' RL-based
exploration [24,23,40].
So, what is new?

Novel is the idea that both adaptive
modules equally influence the probability
of each executed instruction/computation.
This (1) allows for a straightforward way of
making both modules equally powerful
(by copying the currently superior one onto the other),
and (2) prevents either module from enforcing
computations that will make the other lose no matter what it tries.
For instance, details of white noise on a screen are inherently
unpredictable, but neither opponent can exploit this to generate
surprises if the other does not ``agree'' to the corresponding
experiment. And it will agree only as long
as it suspects that there is a regularity in the white noise that the
other does not yet know.
The precondition of a surprise is that the surprised module has expressed
its confidence in a different outcome of the surprising computation sequence
by participating in the collective decision process.
Intuitively, my adaptive explorer continually wants to discover new,
``creative'' uses of its innate sensorium and computational potential.
It wants to focus on those novel things that seem easy to learn, given
current knowledge. It wants to ignore (1) previously learned, predictable
things, (2) inherently unpredictable ones (such as details of white noise
on a screen), and (3) things that are unexpected but not expected to be
easily learned (such as the contents of an advanced math textbook beyond
the explorer's current level).
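The collective decision process can be sketched as follows. The product combination rule below is an illustrative assumption, one plausible way to give both modules equal influence; the chapter's actual rule may differ, and the function name is hypothetical.

```python
import random

def collective_choice(probs_a, probs_b, rng=random):
    """Collective instruction selection (illustrative sketch).

    Both modules equally influence the probability of each executed
    instruction: here the collective distribution is the renormalized
    product of the two modules' distributions.  An instruction gets
    nonzero probability only if *both* modules assign it nonzero
    probability, so neither module can force an experiment the other
    does not 'agree' to.
    """
    joint = [pa * pb for pa, pb in zip(probs_a, probs_b)]
    total = sum(joint)
    if total == 0:                 # no instruction both modules accept
        return None
    weights = [p / total for p in joint]
    return rng.choices(range(len(weights)), weights=weights)[0]

# Module A rules out instruction 2; module B only accepts instruction 1.
# The only instruction both agree on is 1.
chosen = collective_choice([0.5, 0.5, 0.0], [0.0, 1.0, 0.0])
assert chosen == 1
```

Under such a product rule, a module that expresses zero confidence in an instruction vetoes it outright; participating with nonzero probability is exactly what exposes the module to being surprised.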

Another novel aspect is the general setting. Instead
of being limited to Markovian domains and simple reactive strategies
such as approaches in [23,40],
this chapter's setup allows for quite arbitrary domains and
computations. This is made possible by the recent IS paradigm
[34,35]. There is no essential limit
(besides computability) to the nature of the regularities that may be
exploited to generate surprises. Neither is there an essential limit to
the nature of the learning processes that can make formerly surprising
regularities predictable and boring. There may be RL schemes even more
general than IS, but this is beyond the scope of this chapter.

Note that this chapter's notion of
``simple regularities'' differs from, e.g., that of
Kolmogorov complexity theory
[11,2,39,14,26].
There an object is called simple relative to current knowledge
if the size of the shortest algorithm computing it from that knowledge is
small. The algorithm's computation time is ignored, as are constant
factors reflecting Kolmogorov complexity's machine independence.
The current chapter, however, takes both into account.
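To make the contrast concrete: plain Kolmogorov complexity measures only program size, whereas a well-known time-aware variant, Levin's $Kt$ measure, also charges (logarithmically) for runtime. The formulas below are standard definitions given for illustration; they are not the chapter's own cost measure, which additionally accounts for the costs of information gathering in an RL setting.

\[
K(x) \;=\; \min_p \,\{\, |p| \;:\; U(p) = x \,\}, \qquad
Kt(x) \;=\; \min_p \,\{\, |p| + \log t(p) \;:\; U(p) = x \,\},
\]

where $U$ is a universal machine, $|p|$ is the length of program $p$, and $t(p)$ is its running time.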

As the explorer's knowledge about its environment and computational abilities
expands, it keeps balancing on the thin, dynamically changing line between
the subjectively random and the subjectively trivial. Unlike Nake and
other authors he cites [17], I do not suggest a predefined
optimal ratio between known and unknown information. Instead, the two
cooperating/competing modules dynamically, implicitly determine this
ratio as they keep trying to surprise each other.

Recent papers attempt to explain ``beauty''
with the help of complexity theory concepts
[27,29]. They argue that
something ``beautiful'' need not be ``interesting''.
They predict that the ``most beautiful'' object
from a set of objects satisfying certain specifications
is the one that can be most easily computed from the
subjective observer's input coding scheme.
Interestingness in the current chapter's sense, however,
also takes into account whether the computational result is
expected or not. Something that is both ``beautiful'' and
already known may be quite boring -- ``beauty'' needs to be
unexpected to awaken interest.

Future work.
The programming language used in the experiments is designed to allow
for fairly arbitrary computations/explorations and learning processes. To
make progress towards analyzing ``inquisitive'' explorers, however, one
will probably have to study alternative systems with less computational
power and less general RL paradigms but more accessible dynamics. On the
other hand, it will also be interesting to study a curious learner's
performance in the case of more difficult tasks and more powerful primitive
instructions with more bias towards solving the task. Note that LIs can be
almost anything: neural net algorithms, Bayesian analysis algorithms, etc.

Furthermore, although IS is a rather general RL
paradigm, it may be possible to develop more general ones. In that
case I would like to combine them with the two-module idea.
Promising candidates may be RL schemes based on economy
and market models, such as classifier systems and their variants
[9,44,45,42,43], the related
``Prototypical Self-referential Associating Learning Mechanisms'' (PSALMs)
[21], the Neural Bucket Brigade
[22], Hayek Machines [1,12],
and Collective Intelligences (COINs) [46].

The basic ideas of the present chapter will probably remain unchanged,
however: competing agents will agree on algorithmic experiments and bet
on their outcomes, the winners profiting from outwitting others.