Information-driven self-organization: the dynamical system approach to autonomous robot behavior

Abstract

In recent years, information theory has come into the focus of researchers interested in the sensorimotor dynamics of both robots and living beings. One root of these approaches is the idea that living beings are information processing systems and that the optimization of these processes should be an evolutionary advantage. Apart from these more fundamental questions, there has recently been much interest in the question of how a robot can be equipped with an internal drive for innovation or curiosity that may serve as a drive for open-ended, self-determined development. The success of these approaches depends essentially on the choice of a convenient measure of information. This article studies in some detail the use of the predictive information (PI), also called excess entropy or effective measure complexity, of the sensorimotor process. The PI of a process quantifies the total information of past experience that can be used for predicting future events. However, the application of information-theoretic measures in robotics is mostly restricted to the case of a finite, discrete state-action space. This article aims at applying the PI in the dynamical systems approach to robot control. We study linear systems as a first step and derive exact results for the PI together with explicit learning rules for the parameters of the controller. Interestingly, these learning rules are of Hebbian nature and local in the sense that the synaptic update is given by the product of activities available directly at the pertinent synaptic ports. The general findings are exemplified by a number of case studies. In particular, in a two-dimensional system designed to mimic embodied systems with latent oscillatory locomotion patterns, it is shown that maximizing the PI means recognizing and amplifying the latent modes of the robotic system.
This and many other examples show that the learning rules derived from the maximum PI principle are a versatile tool for the self-organization of behavior in complex robotic systems.

Acknowledgments

Part of this work was completed during a stay of Nihat Ay and Ralf Der at the CSIRO in Sydney, Australia. Hospitality and financial support are gratefully acknowledged. Nihat Ay also acknowledges support by the Santa Fe Institute at the early stage of the paper. Mikhail Prokopenko thanks the Max Planck Institute of Mathematics in the Sciences in Leipzig, Germany, for support and hospitality at the Institute. The authors thank the anonymous reviewer for many important comments that helped to improve the paper substantially.

Appendix

Here, we derive some results used in the text.

PI over several time steps

In order to find the PI over τ time steps, we need the conditional entropy \(H\left( s_{t+\tau}|s_{t}\right) \) of \(s_{t+\tau}\) given \(s_t\), which is well known; see, for example, DelSol (2004). We rederive it here by elementary means from our previous results, starting with Eq. 12, to obtain

which is easily transformed into that of the text using \(\gamma =1-\frac{\lambda}{\varepsilon}. \)

Generalized gradient for obtaining a self-consistent update rule

In this part of the appendix we will investigate the mathematical background of the consistent update rule (44) of the controller matrix C found in “Consistency.” We will show that the consistent update rule is also a gradient ascent algorithm, where the gradient is taken with respect to some non-standard metric on the differentiable manifold of n × n matrices, denoted by M(n). We will further characterize this metric as the pull-back of the standard metric under the map that links the value of the controller matrix C to the dynamical matrix R:

$$f: M(n) \rightarrow M(n); \quad C \mapsto R:=VC + T.$$

(65)

As in “Consistency,” we will consider only systems in which V is a non-singular square matrix.

Furthermore, we will introduce a general class of metrics on matrix spaces that contains the standard metric, our pull-back metric, as well as the right-invariant metric on the space of invertible matrices used, for example, by Amari (compare Amari 1998). These results can be used to modify gradients of matrix functions in various ways without changing the stationary points of the learning algorithms. We provide an explicit formula for the gradient with respect to a metric from this class. We hope that this might be useful for modifying learning algorithms on matrix spaces.

In this section we assume some familiarity with basic differential geometric concepts (as can be found in any introductory book on differential geometry, such as Spivak 1999, Willmore 1959, Kühnel 2006, or Kobayashi and Nomizu 1963).

As stated above we are considering the differentiable manifold M(n) of all n × n matrices. The only chart we want to use here is the most obvious choice (in order to be consistent with the usual notation of differential geometry we write upper indices for the matrix entries here):

In the following, summation will always be carried out over pairs of indices consisting of one upper and one lower index; in all other cases the summation sign will be written explicitly. A metric is a positive-definite, symmetric (differentiable) bilinear form

The metric gives rise to a gradient of a function h, denoted by \(\hbox{grad}_{g} \left[h\right] (p) \in T_{p} M(n).\) The gradient points in the direction of the steepest ascent of the function h at this point, and its length is equal to \(\left|D_p h\left[\hat{e}\right]\right|,\) where \(\hat{e}\) is the unit vector pointing in this direction. So the definition of the gradient involves metric structures on both spaces:

on \({\mathbb{R}}\) (which is canonically given; even a change of metric does not influence the direction of the gradient since two metrics at a certain point \({p \in \mathbb{R}}\) differ by a constant multiple only)

on M to specify the unit sphere in the tangent space \(T_p M\) over which the maximization is carried out.

An equivalent definition requires the gradient \(\hbox{grad}_{g}\left[h\right](p)\) to be the unique vector \(v \in T_{p} M\) such that:

where \(g_{p}^{(i,j),(k,l)}\) denotes the inverse of the \(n^2 \times n^2\) matrix \(\left( g_{p; (i,j),(k,l)} \right)_{(i,j),(k,l)}. \) Since M(n) is a linear space, it is most natural to identify the tangent space at a given point \(p \in M(n)\) with M(n) itself. The canonical scalar product is then given by

$$\left\langle X ,Y \right\rangle _{p} := Tr X^{T} Y.$$

It implies the standard notion of a gradient in \({\mathbb{R}^{(n^2)}: }\)
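As an aside, the canonical scalar product and the resulting standard gradient can be checked numerically. The following sketch (Python with NumPy; the linear test function h and all matrices are our own illustrative choices, not taken from the paper) verifies that the standard gradient of \(h(X) = Tr(A^T X)\), namely A itself, satisfies the defining directional-derivative relation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))

# h(X) = Tr(A^T X) is linear; under the canonical scalar product
# <X, Y> = Tr(X^T Y), its gradient is simply A.
def h(X):
    return np.trace(A.T @ X)

# Check the defining relation <grad h, Y> = D h[Y] for a random
# direction Y, using a finite-difference approximation of D h[Y].
Y = rng.standard_normal((n, n))
eps = 1e-6
num_deriv = (h(X + eps * Y) - h(X)) / eps
assert np.isclose(np.trace(A.T @ Y), num_deriv, atol=1e-4)
```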

Consider the problem of consistency in “Consistency” again. In order to find the optimal parameter for the policy matrix C we would like to implement some learning algorithm of the form

$$C_{n+1} = C_{n} + \Updelta C_{n}.$$

By changing C the transformation matrix R is changed indirectly so we have \(R_n := f(C_n)\) (where f has been defined in equation 65). For consistency, \(\Updelta C_{n}\) has to be chosen such that the following two conditions hold:

1.

\(\hbox{\rm\small 1}\!\!1- R_{n} R_{n}^{T} \) is invertible for every n

2.

the matrices Rn and \(R_{n}^{T}\) commute for every n, i.e., Rn is normal.
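Both conditions can be checked directly for a given R. A minimal numerical sketch (Python/NumPy; the helper names and the example scaled rotation matrix are our own illustrative choices):

```python
import numpy as np

def is_normal(R, tol=1e-10):
    # R is normal iff it commutes with its transpose: R R^T = R^T R.
    return np.linalg.norm(R @ R.T - R.T @ R) < tol

def one_minus_RRT_invertible(R):
    # Condition 1: 1 - R R^T must be invertible (full rank).
    M = np.eye(R.shape[0]) - R @ R.T
    return np.linalg.matrix_rank(M) == R.shape[0]

# A rotation matrix scaled below 1 is normal and keeps 1 - R R^T invertible.
theta = 0.3
R = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
assert is_normal(R)
assert one_minus_RRT_invertible(R)
```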

The first point is easily fulfilled, since the set of invertible matrices is open in M(n). To see this, let R be an invertible matrix, and let \(\Updelta R\) be a matrix with \(\left\|\Updelta R\right\| < \left\|R^{-1}\right\|^{-1}.\) Then an inversion in terms of the Neumann series shows that \(R+\Updelta R\) is also invertible, and:

Hence, a sufficiently small learning rate ensures the validity of point one.
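The Neumann-series argument can be illustrated numerically (Python/NumPy; the example matrices and scaling factors are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
R = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # an invertible matrix

# If ||dR|| < 1 / ||R^{-1}|| (spectral norms), then R + dR is invertible.
bound = 1.0 / np.linalg.norm(np.linalg.inv(R), 2)
dR = rng.standard_normal((n, n))
dR *= 0.5 * bound / np.linalg.norm(dR, 2)          # scale strictly below the bound
assert np.linalg.norm(dR, 2) < bound
# R + dR is indeed invertible (full rank).
assert np.linalg.matrix_rank(R + dR) == n
```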

The second point is more subtle. The set of normal matrices is the algebraic set \(\left\{A \in M(n) \left| A A^{T} -A^{T} A = 0\right.\right\}.\) Considering both the MI term and the penalty term (compare Eqs. 37 and 25), the objective function to be maximized is

The resulting gradient step in R space (Eq. 67) preserves normality: \(R_{n+1}\) is normal whenever \(R_n\) is normal. However, a naive update of C using the usual gradient might very well destroy the normality of \(R_{n+1}. \) In order to overcome this problem we make use of the freedom to use another metric for the calculation of the gradient (compare Amari 1998). We summarize some well-known facts about the pull-back of a metric. Let \(h: M \rightarrow N\) be a differentiable map between manifolds and let g be a metric on N; then the pull-back of g under h is defined by

If h is a diffeomorphism, i.e., it is invertible with differentiable inverse, then the pull-back has the following properties:

\(h^{\ast}g\) is a metric on M (i.e., it is a positive definite, symmetric two-form on M)

Let \({\phi: N \rightarrow \mathbb{R} }\) be a differentiable function. According to the definition of composition and the chain rule, the following two diagrams commute:

Using the definition of the gradient Eq. 66, the following formula is valid for any \(v \in T_p M: \)

Using the chain rule \(D_p\left(\phi \circ h\right)\left[v\right]=D_{h(p)}(\phi)\left[D_p\left(h\right)\left[v\right]\right], \) the left-hand side of Eq. 68 can also be written in the following way:

In our case, consider the map f defined in Eq. 65. Its differential is simply:

$$D_{p} f: M(n)\rightarrow M(n); \quad X\, \mapsto\, V X$$

(70)

Since f is affine, \(D_{p} f\) even maps finite changes of C to the corresponding finite changes of R. The idea is to start with a matrix C0 such that \(R_0 = f(C_0)\) is normal and to update every Cn such that the change \(\Updelta C_n := C_{n+1} - C_{n}\) indirectly causes the desired change \(\Updelta R_n = \hbox{grad}_{\left\langle \cdot, \cdot\right\rangle} \left[K\right] (R_n)\) given by Eq. 67. This can be achieved by using the pull-back metric:
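Because f is affine with differential \(X \mapsto VX\), the pull-back update amounts in practice to \(\Updelta C = V^{-1} \Updelta R\). A minimal numerical sketch (Python/NumPy; V, T, C and the step ΔR are arbitrary illustrative choices) verifies that this C-space step realizes the desired R-space step exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
V = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # assumed non-singular
T = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

def f(C):
    return V @ C + T  # Eq. 65: R = VC + T

# A desired change dR in R space (e.g. a gradient step on the objective)
# is realized exactly by the C-space update dC = V^{-1} dR.
dR = 0.01 * rng.standard_normal((n, n))
dC = np.linalg.solve(V, dR)
assert np.allclose(f(C + dC), f(C) + dR)
```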

This is exactly the consistent update rule derived in “Consistency.” Since the pull-back metric is the only metric that makes f an isometry, it is the natural choice to transfer metric properties from R space to C space. The pull-back metric lies in a certain class of metrics that we would like to present now. Note that the two-form

$$g^{\prime}(X,Y)_{p} = Tr \left( G(p) X^{T} H(p) Y \right)$$

(72)

is a scalar product if for each \(p \in M(n)\) the matrices G(p) and H(p) are strictly positive (bilinearity is trivial; for symmetry use the transposition invariance and the cyclic invariance of the trace; for positivity and non-degeneracy write G(p) and H(p) as squares of real symmetric matrices and note that \(Tr X^T X\) is zero if and only if X = 0).

A similar calculation as carried out for the pull-back metric before yields the following expression for the gradient:

Obviously, the standard metric and our pull-back metric are members of this class; they are flat, since the metric coefficients have no point dependence in the standard chart. Another example is the right-invariant metric on the set of invertible matrices, GL(n), considered, for example, by Amari (1998):

Equations 72 and 73 are useful for modifying the canonical gradient. As a consequence, a multiplication of the gradient by (possibly point-dependent) positive matrices from the left and from the right does not change the nature of the problem. Mathematically it is equivalent to a change of metric on the underlying space M(n). This modification of the standard gradient can be done with several aims in mind, for example:

1.

to simplify the standard gradient;

2.

to eliminate unfeasible quantities that appear in the standard gradient;

3.

to maintain some given constraints (such as normality of R in our case);

4.

to make use of a further mathematical structure underlying the given problem, such as symmetries or invariance properties.
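As a numerical illustration of the metric class in Eq. 72, the following sketch (Python/NumPy; all matrices are arbitrary illustrative choices) verifies that multiplying the standard gradient by \(H^{-1}\) from the left and \(G^{-1}\) from the right yields the gradient with respect to \(g^{\prime}\); in particular, stationary points (vanishing standard gradient) are unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3

def spd(n):
    # A strictly positive (symmetric positive-definite) matrix.
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

G, H = spd(n), spd(n)
W = rng.standard_normal((n, n))  # W plays the role of the standard gradient at p

# Metric from Eq. 72: g'(X, Y) = Tr(G X^T H Y).
def gprime(X, Y):
    return np.trace(G @ X.T @ H @ Y)

# Gradient w.r.t. g': multiply the standard gradient by H^{-1} from the
# left and G^{-1} from the right. If W = 0, this is still 0.
grad = np.linalg.solve(H, W) @ np.linalg.inv(G)

# Defining property: g'(grad, Y) equals the directional derivative Tr(W^T Y).
Y = rng.standard_normal((n, n))
assert np.isclose(gprime(grad, Y), np.trace(W.T @ Y))
```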