'''Oja's learning rule''', or simply '''Oja's rule''', named after the Finnish computer scientist [[Erkki Oja]], is a model of how neurons in the brain or in [[artificial neural networks]] change connection strength, or learn, over time. It is a modification of the standard Hebb's rule (see [[Hebbian learning]]) that, through multiplicative normalization, solves all stability problems and generates an algorithm for [[principal components analysis]]. This is a computational form of an effect which is believed to happen in biological neurons.

A key problem in artificial [[neural networks]] is how neurons learn. The central hypothesis is that learning is based on changing the connections, or synaptic weights, between neurons by specific learning rules. In [[unsupervised learning]], the changes in the weights depend only on the inputs and the output of the neuron. A popular assumption is the ''Hebbian learning rule'', according to which the change in a given synaptic weight is proportional to both the pre-synaptic input and the output activity of the post-synaptic neuron.

The '''Oja learning rule''' (Oja, 1982) is a mathematical formalization of this Hebbian learning rule, such that over time the neuron actually learns to compute a principal component of its input stream.

==Theory==

Oja's rule requires a number of simplifications to derive, but in its final form it is demonstrably stable, unlike Hebb's rule. It is a single-neuron special case of the [[Generalized Hebbian Algorithm]]. However, Oja's rule can also be generalized in other ways to varying degrees of stability and success.

===Formula===

Oja's rule defines the change in presynaptic weights {{math|'''w'''}} given the output response <math>y</math> of a neuron to its inputs {{math|'''x'''}} to be

:<math>\,\Delta \mathbf{w} ~ = ~ \mathbf{w}_{n+1} - \mathbf{w}_n ~ = ~ \eta\, y_n (\mathbf{x}_n - y_n \mathbf{w}_n),</math>

where {{math|''η''}} is the ''learning rate'' which can also change with time. Note that the bold symbols are [[Euclidean vector|vectors]] and {{math|''n''}} defines a discrete time iteration. The rule can also be made for continuous iterations as

:<math>\,\frac{d\mathbf{w}}{dt} ~ = ~ \eta\, y(t)\left(\mathbf{x}(t) - y(t)\,\mathbf{w}(t)\right),</math>

where {{math|''y''(''t'')}} and {{math|'''x'''(''t'')}} are the output and input at continuous time {{math|''t''}}.
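
As a concrete illustration, the discrete rule is a one-line update. The following minimal NumPy sketch (the weight vector, input, and learning rate are illustrative values, not taken from the original paper) applies a single Oja step to a linear neuron:

```python
import numpy as np

def oja_step(w, x, eta):
    """One discrete Oja update: w <- w + eta * y * (x - y * w)."""
    y = np.dot(w, x)              # linear neuron output y = w . x
    return w + eta * y * (x - y * w)

w = np.array([0.6, 0.8])          # current weight vector (unit norm)
x = np.array([1.0, 0.5])          # one input sample
w_new = oja_step(w, x, eta=0.1)   # -> approximately [0.64, 0.77]
```

The Hebbian term {{math|''η y'''''x'''}} grows the weights along the input, while the {{math|−''η y''<sup>2</sup>'''w'''}} term shrinks them; it is this second term that keeps the weight norm bounded.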

===Derivation===

The simplest learning rule known is Hebb's rule, which states in conceptual terms that neurons that fire together, wire together. In component form as a difference equation, it is written

:<math>\,\Delta \mathbf{w} ~ = ~ \eta\, y(\mathbf{x}_n)\mathbf{x}_n,</math>

or with implicit {{math|''n''}}-dependence,

:<math>\,w_i(n+1) ~ = ~ w_i + \eta\, y(\mathbf{x}) x_i.</math>

Hebb's rule has synaptic weights approaching infinity with a positive learning rate. We can stop this by normalizing the weights so that each weight's magnitude is restricted between 0, corresponding to no weight, and 1, corresponding to being the only input neuron with any weight. Mathematically, this takes the form

:<math>\,w_i(n+1) ~ = ~ \frac{w_i + \eta\, y(\mathbf{x}) x_i}{\left(\sum_{j=1}^m \left[w_j + \eta\, y(\mathbf{x}) x_j\right]^p\right)^{1/p}}.</math>

Note that in Oja's original paper, {{math|''p'' {{=}} 2}}, corresponding to quadrature (root sum of squares), which is the familiar Cartesian normalization rule. However, any type of normalization, even linear, will give the same result without loss of generality.

Our next step is to expand this into a [[Taylor series]] for a small learning rate {{math|''η''}}, giving

:<math>\,w_i(n+1) ~ = ~ \frac{w_i}{\left(\sum_j w_j^p\right)^{1/p}} ~ + ~ \eta\, y \left(\frac{x_i}{\left(\sum_j w_j^p\right)^{1/p}} - \frac{w_i \sum_j x_j w_j^{p-1}}{\left(\sum_j w_j^p\right)^{1 + 1/p}}\right) ~ + ~ O(\eta^2).</math>

For small {{math|''η''}}, our [[Big-O notation|higher-order terms]] {{math|''O''(''η''<sup>2</sup>)}} go to zero. We again make the specification of a linear neuron, that is, the output of the neuron is equal to the sum of the product of each input and its synaptic weight, or

:<math>\,y(\mathbf{x}) ~ = ~ \sum_{j=1}^m x_j w_j </math>.

We also specify that our weights normalize to {{math|1}}, which will be a necessary condition for stability, so

:<math>\,\left(\sum_{j=1}^m w_j^p\right)^{1/p} ~ = ~ 1,</math>

which, when substituted into our expansion, gives Oja's rule, or

:<math>\,w_i(n+1) ~ = ~ w_i + \eta\, y(x_i - w_i y).</math>
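
The derivation can be sanity-checked numerically: starting from a unit-norm weight vector, one normalized Hebbian step (with {{math|''p'' {{=}} 2}}, as in Oja's original paper) and one Oja step should agree up to {{math|''O''(''η''<sup>2</sup>)}} terms. A minimal sketch (NumPy; the random vectors are illustrative):

```python
import numpy as np

def hebb_normalized(w, x, eta, p=2):
    """Hebbian step followed by p-norm renormalization of the weights."""
    w_new = w + eta * np.dot(w, x) * x
    return w_new / np.sum(np.abs(w_new) ** p) ** (1.0 / p)

def oja_step(w, x, eta):
    """Oja's rule: the O(eta) expansion of the normalized Hebbian step."""
    y = np.dot(w, x)
    return w + eta * y * (x - y * w)

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
w /= np.linalg.norm(w)            # start from unit-norm weights
x = rng.standard_normal(5)

eta = 1e-4
gap = np.linalg.norm(hebb_normalized(w, x, eta) - oja_step(w, x, eta))
# gap is O(eta^2), orders of magnitude below the O(eta) step itself
```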

Oja's rule can also be modified to perform [[independent component analysis]] (ICA), a technique related to PCA but potentially much more powerful: instead of finding uncorrelated components as in PCA, statistically independent components are found, if they exist in the original data. Quite small changes suffice: the linear output factor in the Hebbian term is replaced by a suitable nonlinearity, such as {{math|''y''<sup>3</sup>}}, and the forgetting term is changed accordingly. The ensuing learning rule can be shown to give one of the independent hidden factors under suitable assumptions (Hyvärinen and Oja, 1998). The main requirement is that prior to entering this algorithm, the input vectors have to be zero mean and ''whitened'' so that their covariance matrix <math>\mathbf C</math> is equal to the identity matrix. This can be achieved with a simple linear transformation, or by a variant of the Oja rule (see also Hyvärinen et al., 2001).

===Stability and PCA===

In analyzing the convergence of a single neuron evolving by Oja's rule, one extracts the first ''principal component'', or feature, of a data set. Furthermore, with extensions using the [[Generalized Hebbian Algorithm]], one can create a multi-Oja neural network that can extract as many features as desired, allowing for [[principal components analysis]].

A principal component {{math|''a''<sub>''j''</sub>}} is extracted from a dataset {{math|'''x'''}} through some associated vector {{math|'''q'''<sub>''j''</sub>}}, or {{math|''a''<sub>j</sub> {{=}} '''q'''<sub>''j''</sub>⋅'''x'''}}, and we can restore our original dataset by taking

:<math>\mathbf{x} ~ = ~ \sum_j a_j \mathbf{q}_j</math>.
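
This restoration identity can be verified directly with an orthonormal eigenbasis: projecting a data point onto the eigenvectors of its sample covariance matrix and summing the components back reproduces the point exactly. A short NumPy sketch (the random dataset is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))       # toy zero-mean dataset

C = X.T @ X / len(X)                    # sample covariance matrix
_, Q = np.linalg.eigh(C)                # columns q_j: orthonormal eigenvectors

x = X[0]
a = Q.T @ x                             # components a_j = q_j . x
x_restored = Q @ a                      # x = sum_j a_j q_j
```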

The convergence behavior can be understood by averaging Oja's rule over the input distribution. With a linear neuron {{math|''y'' {{=}} '''w'''<sup>T</sup>'''x'''}}, setting the average weight change to zero at a point of convergence gives

:<math>\,\mathbf{C}\mathbf{w} ~ = ~ \left(\mathbf{w}^\mathsf{T}\mathbf{C}\mathbf{w}\right)\mathbf{w},</math>

where <math>\mathbf{C} = \langle \mathbf{x}\mathbf{x}^\mathsf{T} \rangle</math> is the covariance matrix of the zero-mean inputs. Since the quadratic form <math>\mathbf{w}^\mathsf{T}\mathbf{C}\mathbf{w}</math> is a scalar, this is the eigenvalue-eigenvector equation for <math>\mathbf{C}</math>: if the weights converge, the weight vector becomes an eigenvector of the input covariance matrix, and the output of the neuron becomes the corresponding principal component.

In the case of a single neuron trained by Oja's rule, we find the weight vector converges to {{math|'''q'''<sub>1</sub>}}, or the first principal component, as time or number of iterations approaches infinity. We can also define, given a set of input vectors {{math|''X''<sub>''i''</sub>}}, that its correlation matrix {{math|''R''<sub>''ij''</sub> {{=}} ''X''<sub>''i''</sub>''X''<sub>''j''</sub>}} has an associated [[eigenvector]] given by {{math|'''q'''<sub>''j''</sub>}} with [[eigenvalue]] {{math|''λ''<sub>''j''</sub>}}. The [[variance]] of outputs of our Oja neuron {{math|σ<sup>2</sup>(''n'') {{=}} ⟨y<sup>2</sup>(''n'')⟩}} then converges with time iterations to the principal eigenvalue, or

:<math>\lim_{n\rightarrow\infty} \sigma^2(n) ~ = ~ \lambda_1</math>.
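
These convergence claims can be checked in simulation: training a single linear neuron with Oja's rule on correlated, zero-mean data should drive the weight norm to 1, align the weights with the leading eigenvector of the sample correlation matrix, and bring the output variance to the leading eigenvalue. A sketch (NumPy; the data distribution, step size, and sample count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
X = rng.standard_normal((20000, 2)) @ A.T   # zero-mean data, covariance ~ A A^T

w = np.array([1.0, 0.0])
eta = 0.005                                  # small constant learning rate
for x in X:
    y = w @ x
    w = w + eta * y * (x - y * w)            # Oja's rule

C = X.T @ X / len(X)                         # sample correlation matrix
eigvals, eigvecs = np.linalg.eigh(C)
q1, lam1 = eigvecs[:, -1], eigvals[-1]       # leading eigenpair

alignment = abs(w @ q1) / np.linalg.norm(w)  # |cos| of angle between w and q1
var_y = np.mean((X @ w) ** 2)                # output variance sigma^2
```

After the pass over the data, `alignment` should be close to 1 and `var_y` close to `lam1`, up to fluctuations caused by the constant learning rate.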

These results are derived using [[Lyapunov function]] analysis, and they show that Oja's neuron necessarily converges on strictly the first principal component if certain conditions are met in our original learning rule. Most importantly, our learning rate {{math|''η''}} is allowed to vary with time, but only such that its sum is ''divergent'' but its power sum is ''convergent'', that is

:<math>\sum_{n=1}^\infty \eta(n) ~ = ~ \infty, ~~~ \sum_{n=1}^\infty \eta(n)^p ~ < ~ \infty, ~~~ p > 1.</math>

Our output [[activation function]] {{math|''y''('''x'''(''n''))}} is also allowed to be nonlinear and nonstatic, but it must be continuously differentiable in both {{math|'''x'''}} and {{math|'''w'''}} and have derivatives bounded in time.<ref name="Haykin98">{{cite book |last=Haykin |first=Simon |authorlink=Simon Haykin |title=Neural Networks: A Comprehensive Foundation |edition=2 |year=1998 |publisher=Prentice Hall |location= |isbn=0-13-273350-1 }}</ref>
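
For instance, the commonly used schedule {{math|''η''(''n'') {{=}} ''c''/''n''}} satisfies both conditions (this particular choice is an illustration, not mandated by the convergence theorem):

```latex
\eta(n) = \frac{c}{n}:\qquad
\sum_{n=1}^{\infty} \frac{c}{n} = \infty
\quad\text{(harmonic series, divergent)},\qquad
\sum_{n=1}^{\infty} \left(\frac{c}{n}\right)^{2} = \frac{c^{2}\pi^{2}}{6} < \infty .
```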

==Applications==

Oja's rule was originally described in Oja's 1982 paper,<ref name="Oja82"/> but the principle of self-organization to which it is applied is first attributed to [[Alan Turing]] in 1952.<ref name="Haykin98"/> PCA also had a long history of use before Oja's rule formalized its application to network computation in 1989. The model can thus be applied to any problem of [[self-organizing map]]ping, in particular those in which feature extraction is of primary interest. Therefore, Oja's rule has an important place in image and speech processing. It is also useful because it extends easily to higher dimensions of processing, and can thus integrate multiple outputs quickly. A canonical example is its use in [[binocular vision]].<ref name="Intrator07">{{cite web |url=http://www.cs.tau.ac.il/~nin/Courses/NC06/Hebb_PCA.ppt |title=Unsupervised Learning |accessdate=2007-11-22 |last=Intrator |first=Nathan |year=2007 |work=Neural Computation lectures |publisher=[[Tel-Aviv University]]}}</ref>

===Biology and Oja's subspace rule===

There is clear evidence for both [[long-term potentiation]] and [[long-term depression]] in biological neural networks, along with a normalization effect in both input weights and neuron outputs. However, while there is no direct experimental evidence yet of Oja's rule active in a biological neural network, a [[biophysics|biophysical]] derivation of a generalization of the rule is possible. Such a derivation requires retrograde signalling from the postsynaptic neuron, which is biologically plausible (see [[neural backpropagation]]), and takes the form of

:<math>\,\Delta w_{ij} ~ \propto ~ \langle x_i y_j \rangle - \epsilon \left\langle \left(c_\mathrm{pre} * \sum_k w_{ik} y_k\right)\left(c_\mathrm{post} * y_j\right)\right\rangle,</math>

where as before {{math|''w''<sub>''ij''</sub>}} is the synaptic weight between the {{math|''i''}}th input and {{math|''j''}}th output neurons, {{math|''x''}} is the input, {{math|''y''}} is the postsynaptic output, and we define {{math|''ε''}} to be a constant analogous to the learning rate, and {{math|''c''<sub>pre</sub>}} and {{math|''c''<sub>post</sub>}} are presynaptic and postsynaptic functions that model the weakening of signals over time. Note that the angle brackets denote the average and the {{unicode|∗}} operator is a [[convolution]]. By taking the pre- and post-synaptic functions into frequency space and combining integration terms with the convolution, we find that this gives an arbitrary-dimensional generalization of Oja's rule known as '''Oja's Subspace''',<ref>{{cite journal |last=Oja |first=Erkki |authorlink=Erkki Oja |year=1989 |title=Neural Networks, Principal Components, and Subspaces |journal=[[International Journal of Neural Systems]] (IJNS) |volume=1 |issue=1 |pages=61–68 |doi=10.1142/S0129065789000475 |url=http://www.worldscinetarchives.com/cgi-bin/details.cgi?id=pii:S0129065789000475&type=html |accessdate=2007-11-22}}</ref> namely

:<math>\,\Delta \mathbf{W} ~ = ~ \eta \left(\mathbf{y}\mathbf{x}^\mathsf{T} - \mathbf{y}\mathbf{y}^\mathsf{T}\mathbf{W}\right),</math>

where {{math|'''W'''}} is the full weight matrix, with one row per output neuron, and {{math|'''y''' {{=}} '''Wx'''}} is the vector of outputs.
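
A minimal numerical sketch of such a multi-neuron update (NumPy; written in the matrix form Δ'''W''' = ''η''('''yx'''<sup>T</sup> − '''yy'''<sup>T</sup>'''W''') with '''y''' = '''Wx''', which is the standard statement of Oja's subspace rule; the data and step size are illustrative). Two output neurons learn an orthonormal basis of the two-dimensional principal subspace, without separating the individual components:

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([3.0, 2.0, 0.5, 0.3])        # per-coordinate std devs
X = rng.standard_normal((20000, 4)) * scales   # variance concentrated in dims 0, 1

W = 0.1 * rng.standard_normal((2, 4))          # two output neurons
eta = 0.002
for x in X:
    y = W @ x
    W += eta * (np.outer(y, x) - np.outer(y, y) @ W)   # subspace rule

# Projector onto the learned row space of W; it should approximate the
# projector onto the top-2 principal directions (here the first two axes).
P = W.T @ np.linalg.inv(W @ W.T) @ W
```

Unlike the [[Generalized Hebbian Algorithm]], which recovers the individual principal components in order, the subspace rule only converges to the principal ''subspace'': any rotation of the learned basis within that subspace is an equally valid fixed point.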
