
Abstract:

A low-order model (LOM) of biological neural networks and its
mathematical equivalents, including the clustering interpreting
probabilistic associative memory (CIPAM), are disclosed. They are
artificial neural networks (ANNs) organized as networks of processing
units (PUs), each PU comprising artificial neuronal encoders, synapses,
spiking/nonspiking neurons, and a scheme for maximal generalization. If
the weights in the artificial synapses in a PU have been learned (and
then fixed) or can be adjusted by the unsupervised accumulation rule and
the unsupervised covariance rule (or the supervised covariance rule), the
PU is called an unsupervised (or supervised) PU. The disclosed ANNs, with
these Hebbian-type learning rules, can easily learn large numbers of
large input vectors with temporally/spatially hierarchical causes and
recognize such causes with maximal generalization despite corruption,
distortion and occlusion. An ANN with a network of unsupervised PUs
(called a clusterer) and offshoot supervised PUs (called an interpreter)
is an architecture suitable for many applications.

Claims:

1. An artificial neural network, comprising at least one processing unit,
each processing unit comprising (a) at least one artificial neuronal
encoder for encoding a vector that is input to said encoder into a
neuronal code that is output from said encoder and has an orthogonality
property; (b) a plurality of artificial synapses each for storing an
entry of a code deviation accumulation vector and for evaluating a first
product of said entry and a component of a code deviation vector that is
the deviation of a neuronal code from an average neuronal code; (c) a
plurality of artificial synapses each for storing an entry of a code
covariance matrix and for evaluating a second product of said entry and a
component of a code deviation vector that is the deviation of a neuronal
code from an average neuronal code; (d) an artificial nonspiking neuron
for evaluating a first sum of products each of a first product and a
masking factor; and (e) at least one artificial spiking neuron for
evaluating at least one second sum of products each of a second product
and a masking factor, for using said first sum and said at least one
second sum to evaluate a representation of a first subjective probability
distribution of a label of a vector that is input to said processing
unit, and for generating at least one pseudorandom number in accordance
with said first subjective probability distribution, wherein masking
factors are diagonal entries of a masking matrix; said products each of a
first product and a masking factor are entries of the product of a code
deviation accumulation vector, a masking matrix and a code deviation
vector; and said products each of a second product and a masking factor
are entries of the product of a code covariance matrix, a masking matrix
and a code deviation vector.

2. The artificial neural network of claim 1, wherein a plurality of code
covariance matrices are submatrices of a general code covariance matrix,
a plurality of code deviation accumulation vectors are subvectors of a
general code deviation accumulation vector, a plurality of masking
matrices are submatrices of a general masking matrix, and a plurality of
neuronal codes are subvectors of a general neuronal code.

3. The artificial neural network of claim 1, further comprising at least
one feedback connection with time delay means.

4. The artificial neural network of claim 1, wherein at least one
processing unit further comprises unsupervised accumulation learning
means for adjusting at least one code deviation accumulation vector in
response to a vector that is input to said processing unit by the
unsupervised accumulation rule.

5. The artificial neural network of claim 4, wherein at least one
processing unit further comprises unsupervised learning means for using
at least one pseudorandom number generated in accordance with said first
subjective probability distribution and at least one code deviation
vector to adjust at least one code covariance matrix in response to a
vector that is input to said processing unit by the unsupervised
covariance rule.

6. The artificial neural network of claim 1, wherein at least one
processing unit further comprises (a) means for evaluating a third sum of
products each of a first product and a learning masking factor; (b) means
for evaluating at least one fourth sum of products each of a second
product and a learning masking factor and using said third sum and said
fourth sum to evaluate a representation of a second subjective
probability distribution; and (c) unsupervised learning means for using
at least one pseudorandom number generated in accordance with said second
subjective probability distribution and at least one code deviation
vector to adjust at least one code covariance matrix in response to a
vector that is input to said processing unit by the unsupervised
covariance rule, wherein said learning masking factors are diagonal
entries of a learning masking matrix; said products each of a first
product and a learning masking factor are entries of the product of a
code deviation accumulation vector, a learning masking matrix and a code
deviation vector; and said products each of a second product and a
learning masking factor are entries of the product of a code covariance
matrix, a learning masking matrix and a code deviation vector.

7. The artificial neural network of claim 1, wherein at least one
processing unit further comprises supervised learning means for adjusting
at least one code covariance matrix in response to a vector that is input
to said processing unit and a label of said vector that is provided from
outside said processing unit by the supervised covariance rule.

8. The artificial neural network of claim 1, wherein a plurality of
processing units are unsupervised processing units, which form a network
called a clusterer, and at least one processing unit is a supervised
processing unit that receives a vector output from an unsupervised
processing unit in said clusterer, the set of at least one supervised
processing unit being called an interpreter.

9. The artificial neural network of claim 8, wherein at least one of said
plurality of unsupervised processing units further comprises (a) memory
means for storing at least one learning masking matrix; (b) means for
evaluating a third product of a code deviation accumulation vector, a
learning masking matrix and a code deviation vector that is the deviation
of a neuronal code from an average neuronal code; (c) means for
evaluating a fourth product of a code covariance matrix, a learning
masking matrix and a code deviation vector that is the deviation of a
neuronal code from an average neuronal code; (d) evaluation means for
using at least one third product and at least one fourth product to
evaluate a representation of a second subjective probability distribution
of a label of a vector that is input to said processing unit; and (e)
unsupervised learning means for using at least one pseudorandom number
generated in accordance with said second subjective probability
distribution and at least one code deviation vector to adjust at least
one code covariance matrix in response to a vector that is input to said
processing unit by the unsupervised covariance rule.

10. A system for processing data, said system comprising at least one
processing unit, each processing unit comprising (a) encoding means for
encoding a vector that is input to said encoding means into a neuronal code that
has an orthogonality property; (b) memory means for storing at least one
code deviation accumulation vector, at least one code covariance matrix
and at least one masking matrix; (c) means for evaluating a first product
of a code deviation accumulation vector, a masking matrix and a code
deviation vector that is the deviation of a neuronal code from an average
neuronal code; (d) means for evaluating a second product of a code
covariance matrix, a masking matrix and a code deviation vector that is
the deviation of a neuronal code from an average neuronal code; and (e)
evaluation means for using at least one first product and at least one
second product to evaluate a representation of a first subjective
probability distribution of a label of a vector that is input to said
processing unit.

11. The system of claim 10, further comprising pseudorandom number
generation means for generating at least one pseudorandom number in
accordance with said first subjective probability distribution.

12. The system of claim 11, further comprising means for feeding back at
least one pseudorandom number generated by said pseudorandom number
generation means to a processing unit after a time delay.

13. The system of claim 11, wherein said processing unit further
comprises unsupervised accumulation learning means for adjusting at least
one code deviation accumulation vector by the unsupervised accumulation
rule; and unsupervised learning means for adjusting at least one code
covariance matrix by the unsupervised covariance rule.

14. The system of claim 10, wherein at least one processing unit further
comprises (a) memory means for storing at least one learning masking
matrix; (b) means for evaluating a third product of a code deviation
accumulation vector, a learning masking matrix and a code deviation
vector that is the deviation of a neuronal code from an average neuronal
code; (c) means for evaluating a fourth product of a code covariance
matrix, a learning masking matrix and a code deviation vector that is the
deviation of a neuronal code from an average neuronal code; (d)
evaluation means for using at least one third product and at least one
fourth product to evaluate a representation of a second subjective
probability distribution of a label of a vector that is input to said
processing unit; and (e) unsupervised learning means for using at least
one pseudorandom number generated in accordance with said second
subjective probability distribution and at least one code deviation
vector to adjust at least one code covariance matrix in response to a
vector that is input to said processing unit by the unsupervised
covariance rule.

15. The system of claim 10, wherein said processing unit further
comprises unsupervised accumulation learning means for adjusting at least
one code deviation accumulation vector by the unsupervised accumulation
rule; and supervised learning means for adjusting at least one code
covariance matrix by the supervised covariance rule.

16. The system of claim 10, wherein a plurality of processing units are
unsupervised processing units, which form a network called a clusterer,
and at least one processing unit is a supervised processing unit that
receives a vector output from an unsupervised processing unit in said
clusterer, the set of at least one supervised processing unit being
called an interpreter.

17. The system of claim 16, wherein at least one of said plurality of
unsupervised processing units further comprises (a) memory means for
storing at least one learning masking matrix; (b) means for evaluating a
third product of a code deviation accumulation vector, a learning masking
matrix and a code deviation vector that is the deviation of a neuronal
code from an average neuronal code; (c) means for evaluating a fourth
product of a code covariance matrix, a learning masking matrix and a code
deviation vector that is the deviation of a neuronal code from an average
neuronal code; (d) evaluation means for using at least one third product
and at least one fourth product to evaluate a representation of a second
subjective probability distribution of a label of a vector that is input
to said processing unit; and (e) unsupervised learning means for using at
least one pseudorandom number generated in accordance with said second
subjective probability distribution and at least one code deviation
vector to adjust at least one code covariance matrix in response to a
vector that is input to said processing unit by the unsupervised
covariance rule.

18. A method for processing data, said method comprising: (a) encoding a
subvector of a first vector into a neuronal code with an orthogonality
property; (b) evaluating a code deviation vector that is the deviation of
a neuronal code from an average neuronal code; (c) evaluating a first
product of an entry of a code deviation accumulation vector, a component
of a code deviation vector and a masking factor; (d) evaluating a second
product of an entry of a code covariance matrix, a component of a code
deviation vector and a masking factor; (e) evaluating a first sum of
first products; (f) evaluating at least one second sum of second
products; and (g) using said first sum and said at least one second sum
to evaluate a representation of a subjective probability distribution of
a component of a label of said first vector.

19. The method of claim 18, further comprising using at least one code
deviation vector to adjust a code deviation accumulation vector by the
unsupervised accumulation rule.

20. The method of claim 18, further comprising generating a pseudorandom
number in accordance with a subjective probability distribution.

21. The method of claim 20, further comprising including a pseudorandom
number generated in accordance with a subjective probability distribution
as a component in said first vector after a time delay.

22. The method of claim 20, further comprising using at least one code
deviation vector and at least one pseudorandom number generated in
accordance with a subjective probability distribution to adjust a code
covariance matrix by the unsupervised covariance rule.

23. The method of claim 18, further comprising (a) evaluating a third
product of an entry of a code deviation accumulation vector, a component
of a code deviation vector and a learning masking factor; (b) evaluating
a fourth product of an entry of a code covariance matrix, a component of
a code deviation vector and a learning masking factor; (c) evaluating a
third sum of third products; (d) evaluating at least one fourth sum of
fourth products; (e) using said third sum and said at least one fourth
sum to evaluate a representation of a second subjective probability
distribution; (f) generating at least one pseudorandom number in
accordance with said second subjective probability distribution; and (g)
using at least one pseudorandom number generated in accordance with said
second subjective probability distribution and at least one code
deviation vector to adjust at least one code covariance matrix in
response to said first vector by the unsupervised covariance rule.

24. The method of claim 18, further comprising using at least one code
deviation vector and an assigned label of said first vector, whose
components are not generated by the method of claim 18, to adjust a code
covariance matrix by the supervised covariance rule.

25. The method of claim 22, further comprising using at least one code
deviation vector and an assigned label of said first vector, whose
components are not pseudorandom numbers generated by the method of claim
22, to adjust a code covariance matrix by the supervised covariance rule.

26. The method of claim 23, further comprising using at least one code
deviation vector and a given label of said first vector, whose components
are not pseudorandom numbers generated by the method of claim 23, to
adjust a code covariance matrix by the supervised covariance rule.

[0003] A good introduction to the prior art in ANNs (artificial neural
networks) and their applications can be found in Simon Haykin, Neural
Networks and Learning Machines, Third Edition, Pearson Education, New
Jersey, 2009; Christopher M. Bishop, Pattern Recognition and Machine
Learning, Springer Science, New York, 2006.

[0004] An ANN, which is a functional model of biological neural networks,
was recently reported in James Ting-Ho Lo, Functional Model of Biological
Neural Networks, Cognitive Neurodynamics, Vol. 4, Issue 4, pp. 295-313,
November 2010, where the ANN is called the temporal hierarchical
probabilistic associative memory (THPAM), and in James Ting-Ho Lo, A
Cortex-Like Learning Machine for Temporal and Hierarchical Pattern
Recognition, U.S. patent application Ser. No. 12/471,341, filed May 22,
2009; Publication No. US-2009-0290800-A1, Publication Date Nov. 26, 2009,
where the ANN is called the probabilistic associative memory (PAM). The
ANN is hereinafter referred to as the THPAM. The goal to achieve in the
construction of the THPAM was to develop an ANN that performs
Hebbian-type unsupervised and supervised learning without
differentiation, optimization or iteration; retrieves easily; and
recognizes corrupted, distorted and occluded temporal and spatial
information. In the process of achieving this goal, mathematical necessity
took precedence over biological plausibility. This mathematical approach
focused first on the minimum mathematical structures and operations
required for an effective learning machine with the mentioned properties.

[0005] The THPAM turned out to be a functional model of biological neural
networks with many unique features that well-known models such as the
recurrent multilayer perceptron, associative memories, spiking neural
networks, and cortical circuit models do not have. However, the THPAM has
been found to have shortcomings, the most serious of which is the
inability of its unsupervised correlation rule to prevent clusters from
overgrowing under certain circumstances. These shortcomings motivated
further research to improve the THPAM. At the same time, the unique
features of the THPAM indicated that it might contain clues for
understanding the structures and operations of biological neural
networks. To achieve this understanding and eliminate the mentioned
shortcomings, the components of the THPAM were examined from the
biological point of view with the purpose of constructing a biologically
plausible model of biological neural networks. More specifically, the
components of the THPAM were identified with those of biological neural
networks and reconstructed, if necessary, into biologically plausible
models of the same.

[0006] This effort resulted in a low-order model (LOM) of biological
neural networks and an improved functional model called the Clustering
Interpreting Probabilistic Associative Memory (CIPAM). They were
respectively reported in the articles, James Ting-Ho Lo, A Low-Order
Model of Biological Neural Networks, Neural Computation, Vol. 23, No. 10,
pp. 2626-2682, October 2011; and James Ting-Ho Lo, A Cortex-Like Learning
Machine for Temporal Hierarchical Pattern Clustering, Detection, and
Recognition, Neurocomputing, Vol. 78, pp. 89-103, 2012, which are both
incorporated into the present invention disclosure by reference. Note
that "dendritic and axonal encoders", "dendritic and axonal trees" and
"dendritic and axonal expansions" in them are collectively called
"neuronal encoders", "neuronal trees" and "neuronal codes" respectively
in the present invention disclosure, and that "C-neuron", "D-neuron" and
"expansion covariance matrix" are called "nonspiking neuron", "spiking
neuron" and "code covariance matrix" respectively in the present
invention disclosure.

[0007] It was subsequently discovered that the LOM and the CIPAM are
equivalent in the sense that their corresponding components can
mathematically be transformed into each other. In fact, generalizing the
mathematical transformation that transforms the LOM and the CIPAM into
each other, we can transform the LOM and the CIPAM into infinitely many
equivalent models.

[0008] The LOM, the CIPAM and their equivalent models are each a network
of models of the biological neuronal node or encoder (which is a
biological dendritic or axonal node or encoder), synapse,
spiking/nonspiking neuron, means for learning, feedback connection,
maximal generalization scheme, etc. For simplicity,
these component models are sometimes referred to without the word
"model". For example, the model neuronal node, model neuronal encoder,
model neuronal tree, model synapse, model spiking/nonspiking neuron, etc.
will be referred to as the neuronal node/encoder/tree, synapse,
spiking/nonspiking neuron, etc. respectively. The LOM, the CIPAM and all
their equivalent models can be used as artificial neural networks (ANNs).
To emphasize that their components are artificial components in these
artificial neural networks, they are referred to as the artificial
neuronal node/encoder/tree, artificial synapse, artificial
spiking/nonspiking neuron, etc. respectively.

[0009] If there is a possibility of confusion, the real components in the
brain are referred to with the adjective "biological", for example, the
biological neuronal node, biological neuronal encoder, biological
neuronal tree, biological spiking/nonspiking neuron, and biological
synapse, etc. The components of equivalent models (or equivalent ANNs)
that are obtained by transforming a component model of the LOM are
given the same name as said component of the LOM. In other words, all
the model components (of the equivalent ANNs) that are equivalent to one
another are given the same component name.

[0010] All models that are equivalent to the LOM including the LOM and the
CIPAM use the "unsupervised covariance rule" instead of the "unsupervised
correlation rule" used in the THPAM and can prevent the clusters of
patterns or causes formed in synapses from overgrowing. Moreover, all
models that are equivalent to the LOM including the LOM and the CIPAM use
the "supervised covariance rule" instead of the "supervised correlation
rule" used in the THPAM. These are two of the main improvements in the
LOM, the CIPAM and other equivalent models over the THPAM.

[0011] From the application viewpoint, as an ANN, the LOM (or a
mathematical equivalent thereof) has the following advantages:

[0012]
1. No label of the learning data from outside the ANN is needed for the
UPUs (unsupervised processing units) in the LOM to learn.

[0013] 2. The unsupervised learning by a processing unit clusters data
without selecting a fixed number of prototypes, cycling through the
data, using prototypes as cluster labels, or minimizing a non-convex
criterion.

[0014] 3. Both the unsupervised and supervised covariance
rules are of the Hebbian type, involving no differentiation,
backpropagation, optimization, iteration, or cycling through the data.
They learn virtually with "photographic memories", and are suited for
online adaptive learning. Large numbers of large temporal and spatial
data such as photographs, radiographs, videos, speech/language,
text/knowledge, etc. are learned easily. The "decision boundaries" are
not determined by exemplary patterns from each and every pattern and
"confuser" class, but only by those from the pattern classes. In many
applications, such as target and face recognition, there are a great many
pattern and "confuser" classes and usually no, or not enough, exemplary
patterns for some "confuser" classes.
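The covariance rules themselves are specified later in the disclosure. As a loudly hedged sketch of what a Hebbian-type, gradient-free update of this general kind can look like (the function names, learning rate and forgetting factor are illustrative assumptions of this sketch, not the disclosure's exact rules):

```python
# ASSUMED generic Hebbian-type updates: no differentiation,
# backpropagation, optimization, iteration, or cycling through the data.
# These are sketches consistent with the narrative, not the disclosure's
# exact accumulation/covariance rules.

def accumulate(accum, dev, forget=1.0):
    # accumulation-style sketch: decay old entries by a forgetting
    # factor, then add the new code deviation
    return [forget * v + d for v, d in zip(accum, dev)]

def covariance_update(cov, label_dev, code_dev, rate=1.0, forget=1.0):
    # covariance-style Hebbian sketch: each weight change is a product
    # of a (pre-synaptic) code deviation and a (post-synaptic) label
    # deviation, applied online, one example at a time
    return [[forget * cov[i][j] + rate * label_dev[i] * code_dev[j]
             for j in range(len(code_dev))]
            for i in range(len(label_dev))]
```

Because each example updates the weights immediately and independently, such rules support the online, "photographic-memory" style of learning described above.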

[0015] 4. Only a small number of algorithmic
steps are needed for retrieving or estimating labels. Detection and
recognition of multiple/hierarchical temporal/spatial causes are easily
performed. The ANN is suitable for massive parallelization at the bit
level in a VLSI implementation.

[0017] 6. The ANN generalizes not only by a single holistic
similarity criterion for the entire input exogenous feature vector, which
noise, erasure, distortion and occlusion can easily defeat, but by a
large number of similarity criteria for feature subvectors input to a
large number of UPUs (processing units) in different layers. These
criteria contribute individually and collectively to generalization for
single and multiple causes. Example 1: smiling, putting on a hat, growing
or shaving a beard, or wearing a wig can upset a single similarity
criterion used for recognizing a face in a mug-shot photograph. However,
a face can be recognized by each of a large number of feature subvectors
of the face; if one of them is recognized to belong to a certain face,
the face is recognized. Example 2: a typical kitchen contains a
refrigerator, a counter top, sinks, faucets, stoves, fruits and
vegetables on a table, etc. The kitchen is still a kitchen if a couple of
items, say the stoves and the table with fruits and vegetables, are removed.

[0018] 7. Masking matrices in a PU (processing unit) eliminate the
effects of corrupted, distorted and occluded components of the feature
subvector input to the PU, and thereby enable the maximal generalization
capability of the PU, and in turn that of the ANN.

[0019] 8. The ANN is no longer a black box with "fully connected"
layers, much criticized by opponents of such neural networks as
multilayer perceptrons (MLPs) or recurrent MLPs.
In a PU of the ANN, synaptic weights are covariances between neuronal
codes and labels of the vector input to the PU. Each PU has a receptive
field in the exogenous feature vector input to the ANN and recognizes the
pattern(s) or cause(s) appearing within the receptive field. Such
properties can be used to help select the architecture (i.e., layers,
PUs, connections, feedback structures, etc.) of the ANN for the
application.

[0020] 9. The ANN (or a mathematical equivalent thereof) may
have some capability of recognizing rotated, translated and scaled
patterns. Moreover, easy learning and retrieving by an ANN allow it to
learn translated, rotated and scaled versions of an input image with
ease.

[0021] 10. The hierarchical architecture of the clusterer stores
models of the hierarchical temporal and spatial worlds (e.g., letters,
words and sentences).

[0022] 11. Ambiguity and uncertainty are
represented and resolved with subjective probabilities and membership
degrees in the sense of fuzzy logic.

[0023] 12. Noise and interference in inputs self-destruct like random
walks, with residues eliminated gradually by forgetting factors in the
synapses, leaving the essential information that has been learned by
repetition and emphasis.

[0024] 13. The architecture of the ANN can be adjusted without
discarding knowledge already learned by the ANN. This allows enlargement
of the feature subvectors, an increase in the number of layers, and even
additional feedback connections.

[0025] For simplicity and clarity, the present invention disclosure
mainly describes the LOM and also shows how the LOM is transformed into
the CIPAM and other ANNs that are mathematically equivalent to it by the
use of affine functions and their inverses.

3 SUMMARY

[0026] In this Section, we first describe briefly the LOM (Low-Order Model
of biological neural networks) and its equivalents including the CIPAM
(Clustering Interpreting Probabilistic Associative Memory). All these
equivalents and the LOM can be transformed into one another by affine
functions and their inverses. The components of the LOM and their
corresponding components of those equivalents are models of the
corresponding components of biological neural networks. Each model
component of the LOM and its corresponding model components of those
equivalents can be transformed into one another by affine functions and
their inverses and are therefore given the same name. Such components
include the model neuronal node, model neuronal encoder, model neuronal
tree, model synapse, model spiking neuron, model nonspiking neuron, etc.

[0027] The LOM and its equivalents form a new paradigm of artificial
neural networks and can be used for the wide range of applications that
artificial neural networks are intended for. Said applications include
clustering data, detecting objects or patterns, recognizing or
classifying patterns or objects. In applications, the LOM and its
equivalents are artificial neural networks, and their model components,
which are mentioned above, are the artificial neuronal (axonal or
dendritic) node, artificial neuronal (axonal or dendritic) encoder,
artificial neuronal (axonal or dendritic) tree, artificial synapse,
artificial spiking neuron, artificial nonspiking neuron, etc. In the
present invention disclosure, the words, "model" and "artificial" in
front of a component are used interchangeably, depending on whether the
emphasis is placed on modeling of a biological component or on
application of the model.

[0028] For simplicity, the LOM is described in the present invention
disclosure with the understanding that the description is valid for the
equivalents of the LOM after transformation by proper affine functions
and their inverses. Notice also that in this invention disclosure,
although a dendrite or axon is a part of a neuron, and a dendro-dendritic
synapse is a part of a dendrite (thus a part of a neuron), they are
treated, for simplicity, as if they were separate entities, and the word
"neuron" refers essentially to the soma of a neuron in this
specification. Similarly, artificial dendrites, artificial axons, and
artificial synapses are treated, for simplicity, as if they were separate
from artificial nonspiking/spiking neurons, and the term "artificial
nonspiking/spiking neuron" refers essentially to the model of a soma of a
neuron in the present invention disclosure.

[0029] In this Summary, references to subsection and subsubsection numbers
in the Section entitled "DESCRIPTION OF PREFERRED EMBODIMENTS" are made.

[0030] As an artificial neural network (ANN), the LOM (or any of its
equivalents) is a discrete-time multilayer network of processing units
(PUs) with or without feedback connections. A PU includes some or all of
the following components:

[0031] 1. artificial neuronal encoders, each for encoding a vector that
is input to the encoder into a neuronal code that is output from the
encoder and has an orthogonality property;

[0032] 2.
means for evaluating a code deviation vector that is the deviation of a
neuronal code from an average neuronal code;

[0033] 3. artificial
synapses each for storing an entry of a code deviation accumulation
vector and for evaluating a first product of said entry and a component
of a code deviation vector that is the deviation of a neuronal code from
an average neuronal code;

[0034] 4. artificial synapses each for storing
an entry of a code covariance matrix and for evaluating a second product
of said entry and a component of a code deviation vector that is the
deviation of a neuronal code from an average neuronal code;

[0035] 5.
memory means for storing masking factors, which are the diagonal entries
of masking matrices;

[0036] 6. an artificial nonspiking neuron for
evaluating a first sum of products each of a first product and a masking
factor; and

[0037] 7. R artificial spiking neurons, each for evaluating a second sum
of products, each of a second product and a masking factor; for using
said first sum and said second sum to evaluate a representation of a
subjective probability distribution of a component of a label of a
vector that is input to said processing unit; and for generating a
pseudorandom number in accordance with said subjective probability
distribution. Here R is the number of binary digits in the label of
the vector input to the PU. The R pseudorandom numbers generated by said
artificial spiking neurons are binary digits (i.e., 0 and 1) and form a
point estimate of the label of the vector input to said PU; such labels
and their point estimates are R-dimensional binary vectors. Over time,
the point estimates generated by a PU form R spike trains output from the
PU. The effect of the masking factors, which are nonnegative integers,
can also be achieved either by having numerous artificial neuronal
encoders with input vectors of different dimensionalities (i.e., input
vectors with different numbers of components) or by including duplicates
of first products and duplicates of second products in the first and
second sums respectively.

[0038] A vector input to a PU first goes through artificial neuronal
encoders (Subsection 5.1), which are networks of artificial neuronal
nodes and form the upper part of the artificial neuronal trees.
Artificial neuronal nodes in the LOM are each a hyperbolic polynomial
with two variables, which acts approximately like an XOR (exclusive-OR)
logic gate with an accuracy depending on how close the two inputs to the
node are to binary digits (Subsubsection 5.1.1). By combining such
hyperbolic polynomials, which are commutative and associative binary
operations, an artificial neuronal node may have more than two input
variables.
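For illustration, the artificial neuronal node of [0038] can be sketched in a few lines of Python; the n-input extension by pairwise composition is an assumption based on the stated commutativity and associativity of the XOR polynomial:

```python
from functools import reduce

def phi(v, u):
    """Artificial neuronal node: the hyperbolic (XOR) polynomial -2vu + v + u."""
    return -2.0 * v * u + v + u

def phi_n(*inputs):
    """A node with more than two input variables, built by composing phi
    pairwise; well defined because phi is commutative and associative."""
    return reduce(phi, inputs)

# On binary inputs phi acts exactly like an XOR logic gate:
truth_table = {(a, b): phi(a, b) for a in (0, 1) for b in (0, 1)}
```

On inputs that are only approximately 0 or 1, phi degrades gracefully rather than switching abruptly, which is the robustness property noted for FIG. 2.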

[0039] An artificial neuronal encoder, which is a network of neuronal
nodes or a mathematical composition of many hyperbolic polynomials, can
be looked upon as a function that encodes its input vector into a
neuronal code with an orthogonality property (Subsubsection 5.1.3). Code
deviation vectors, each the deviation of a neuronal code from an average
neuronal code over a time window, are computed.
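A minimal sketch of such an encoder, assuming the recursive doubling construction suggested by FIG. 4 (the code of an m-dimensional vector has 2^m components; the base case [0] for the empty combination is an assumption). Using the constant 1/2 in place of the time-window average below is only to exhibit the orthogonality property on binary inputs:

```python
def phi(v, u):
    # XOR polynomial node
    return -2.0 * v * u + v + u

def encode(v):
    """Neuronal code of an m-dimensional vector v (2**m components).
    Appending a component XORs it against every existing code component."""
    code = [0.0]
    for x in v:
        code += [phi(x, c) for c in code]
    return code

def deviation(code):
    # The constant 1/2 stands in for the time-window average here.
    return [c - 0.5 for c in code]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

For distinct binary inputs u and w, dot(deviation(encode(u)), deviation(encode(w))) is 0, while for u = w it is 2^m / 4, which is one way to read the orthogonality property.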

[0040] The strength (or weight) of an artificial synapse is either an
entry of a code covariance matrix or an entry (or component) of a code
deviation accumulation vector. An artificial synapse storing an entry of
a code deviation accumulation vector evaluates the product of said entry
and a component of a code deviation vector to yield a first product. This
product is then multiplied by a masking factor, which is the diagonal
entry of the masking matrix corresponding to the artificial synapse. An
artificial synapse storing an entry of a code covariance matrix evaluates
the product of said entry and a component of a code deviation vector to
yield a second product. This product is likewise multiplied by a masking
factor, which is the diagonal entry of the masking matrix corresponding to
the artificial synapse.

[0041] In the PU, there are an artificial nonspiking neuron, also called a
model nonspiking neuron, and R artificial spiking neurons, also called
model spiking neurons. An artificial nonspiking neuron evaluates a first
sum of products each of a first product and a masking factor
(Subsubsection 5.5.4). An artificial spiking neuron evaluates a second
sum of products each of a second product and a masking factor, uses said
first sum and said second sum to evaluate a representation of a
subjective probability distribution of a component of a label of a vector
that is input to said processing unit, and generates a pseudorandom
number in accordance with said subjective probability distribution
(Subsubsection 5.5.5). The R pseudorandom numbers generated by the R
artificial spiking neurons are binary digits (i.e., 0 and 1) and form a
point estimate of the label of the vector input to said PU, such labels
and their point estimates being R-dimensional binary vectors. Over time,
the point estimates generated by a PU form R spike trains output from the
PU. The effect of the masking factors, which are nonnegative integers,
can also be achieved either by having numerous artificial neuronal
encoders with input vectors of different dimensionalities (i.e., input
vectors with different numbers of components) or by including duplicates
of first products and duplicates of second products in the first and
second sum respectively.
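The retrieval pass of [0041] can be sketched as follows. The specific probability representation p_k = 1/2 + d_k/(2c) is an assumption for illustration (the disclosure only requires that the first and second sums determine the representation), as is the flat NumPy layout of one encoder's code:

```python
import numpy as np

def pu_retrieve(C, D, M, dev, rng):
    """One retrieval pass of a processing unit (sketch).

    C   : (n,)   code deviation accumulation vector
    D   : (R, n) code covariance matrix
    M   : (n,)   masking factors (diagonal of the masking matrix)
    dev : (n,)   code deviation vector (neuronal code minus its average)
    """
    first = C * dev                        # first products
    second = D * dev                       # second products, row k per label bit
    c = float(np.sum(M * first))           # nonspiking neuron: first sum
    d = (M * second).sum(axis=1)           # spiking neurons: R second sums
    # Representation of the subjective probability that each label bit is 1;
    # the formula below is an assumed choice, clipped to [0, 1]:
    p = np.clip(0.5 + d / (2.0 * c), 0.0, 1.0) if c > 0 else np.full(len(d), 0.5)
    spikes = (rng.random(len(d)) < p).astype(int)  # point estimate of the label
    return p, spikes
```

Over time, the successive spike vectors form the R spike trains output from the PU.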

[0042] There are three learning rules: the unsupervised covariance rule,
supervised covariance rule and unsupervised accumulation rule. The
unsupervised covariance rule is used for synapses with post-synaptic
model spiking neurons whose outputs are teaching signals (or desired
outputs from the post-synaptic model spiking neurons) (Subsubsection
5.2.1). The supervised covariance rule is used for synapses receiving
teaching signals (or desired outputs from the post-synaptic model spiking
neurons) or a given label from outside the LOM (Subsubsection 5.2.2). The
unsupervised accumulation rule is used for accumulating deviations of the
neuronal codes from their averages over time for synapses with a
post-synaptic model nonspiking neuron (Subsubsection 5.2.3). A forgetting
factor and a normalizing constant are used to keep the entries in
expansion covariance matrices and expansion accumulation matrices
bounded. A PU usually learns by the unsupervised accumulation rule and
one of the other two learning rules. In some applications, a processing
unit can perform both supervised and unsupervised covariance learning,
depending on whether a teaching signal from outside the LOM is available
or not.
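The two rule families of [0042] can be sketched under these assumptions: lam is the forgetting factor, Lam the normalizing constant, dev a code deviation vector, and r is either the PU's own output spikes (unsupervised covariance) or a label provided from outside the LOM (supervised covariance); the default parameter values are placeholders:

```python
import numpy as np

def accumulation_rule(C, dev, lam=0.9, Lam=1.0):
    """Unsupervised accumulation rule: accumulate code deviations.
    C <- lam * C + (Lam / 2) * dev'; the forgetting factor keeps C bounded."""
    return lam * C + (Lam / 2.0) * dev

def covariance_rule(D, r, dev, lam=0.9, Lam=1.0):
    """Covariance rule: D <- lam * D + Lam * outer(r - 1/2, dev).
    r is the PU's own spike output for the unsupervised version, or a
    teaching signal from outside the LOM for the supervised version."""
    return lam * D + Lam * np.outer(r - 0.5, dev)
```

The same accumulation step appears in both the fifth and sixth embodiments; only the source of r distinguishes unsupervised from supervised covariance learning.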

[0043] By the unsupervised covariance rule, the output vector from the R
artificial spiking neurons (i.e., model spiking neurons) in the PU is
assigned as the label (or teaching signals) to be learned jointly with
the vector input to the PU. If the input vector or a variation thereof
has not been learned before, the output vector from the R artificial
spiking neurons is a purely random label. If the input vector or a
variation of it has been learned before, the output vector, which is a
point estimate of the label or labels of the input vector based on the
subjective probability distribution, is learned jointly with the input
vector. Supervised covariance learning is performed when teaching signals
(i.e., labels of input vector to the PU) from outside the LOM are
provided (Subsection 5.2.2). In the third type of learning, namely
unsupervised accumulation learning, no label is needed.

[0044] Maximal generalization capability of the PU is achieved mainly with
masking matrices, each automatically finding the largest subvector of the
output vector of a neuronal encoder that matches a subvector of an output
vector of the same neuronal encoder that has been stored in the code
covariance matrix (or the code deviation accumulation vector) and setting
the rest of the components of the former output vector equal to zero. A
masking matrix can be viewed as idealization and organization of neuronal
encoders with overlapped and nested input vectors (Section 5.4).
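The effect described in [0044] can be illustrated with a brute-force sketch on raw binary vectors; the actual masking matrices act on neuronal codes and use graded weights 2^{-ηj}, so the 0/1 mask and exhaustive search here are simplifications for illustration only:

```python
from itertools import combinations

def maximal_mask(query, stored):
    """Return 0/1 masking factors that keep the largest set of components of
    `query` matching a subvector of some stored vector, zeroing the rest."""
    n = len(query)
    for kept in range(n, 0, -1):                  # try the largest subsets first
        for idx in combinations(range(n), kept):
            if any(all(query[i] == s[i] for i in idx) for s in stored):
                return [1 if i in idx else 0 for i in range(n)]
    return [0] * n
```

With stored = [(1, 1, 0)] and query = (1, 0, 0), components 0 and 2 still match, so the mask is [1, 0, 1]: recognition generalizes on the matching subvector despite the corrupted middle component.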

[0045] If a PU is required both to distinguish learned input vectors that
differ only slightly from one another and to recognize unlearned input
vectors that differ more substantially from learned input vectors, the
masking matrices used for retrieving and those used for unsupervised
learning should be different. To distinguish these two types of masking
matrix, the former and the latter are respectively called the masking
matrices and the learning masking matrices, and similarly, their diagonal
entries are respectively called masking factors and learning masking
factors.

[0046] The biological explanation of learning masking matrices (or
learning masking factors) is unclear. However, to use them for
unsupervised learning, a processing unit should further include:

[0047] 1. memory means for storing learning masking factors, which are
the diagonal entries of learning masking matrices;

[0048] 2. means for evaluating a third sum of products each of a first
product and a learning masking factor;

[0049] 3. means for evaluating R
fourth sums, each of which is a sum of products each of a second product
and a learning masking factor, and using said third sum and said fourth
sum to evaluate a representation of a second subjective probability
distribution; and

[0050] 4. unsupervised learning means for using R
pseudorandom numbers generated in accordance with said second subjective
probability distribution and at least one code deviation vector to adjust
at least one code covariance matrix in response to a vector that is input
to said processing unit by the unsupervised covariance rule, wherein said
learning masking factors are diagonal entries of a learning masking
matrix; said products each of a first product and a learning masking
factor are entries of the product of a code deviation accumulation
vector, a learning masking matrix and a code deviation vector; and said
products each of a second product and a learning masking factor are
entries of the product of a code covariance matrix, a learning masking
matrix and a code deviation vector.

[0051] A PU that has completed learning or continues to learn by the
unsupervised covariance learning rule is called an unsupervised
processing unit (UPU). A PU that has completed learning or continues to
learn by the supervised covariance learning rule is called a supervised
processing unit (SPU). The LOM may have UPUs in the lower layers and
SPUs in the higher layers. Alternatively, the LOM may have a network of
UPUs only and have SPUs that branch out from the UPUs in said network. As
unsupervised learning requires no label from outside the LOM, it can be
performed at any time on data without a given or handcrafted label. What
the network of UPUs actually does is clustering of spatial and temporal
patterns in the data. When supervised learning occurs, a cluster is
assigned the given or handcrafted label. Therefore, the network of UPUs
is called a clusterer, and the set of SPUs branching out from the
clusterer is called an interpreter.
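The clusterer-interpreter split can be wired up as a sketch; the PU interface here (a `process` method per unit, a simple chain for the clusterer, and the `StubPU` placeholder class) is an assumption for illustration, not the disclosed hierarchy:

```python
class StubPU:
    """Placeholder processing unit: records whether it learned with a label."""
    def __init__(self, name):
        self.name = name
        self.supervised = False
    def process(self, x, label=None):
        self.supervised = label is not None
        return x  # a real PU would return its spike-train point estimate

class ClustererInterpreter:
    """Clusterer: chained unsupervised PUs clustering patterns in the data.
    Interpreter: supervised PUs branching out from the clusterer that attach
    given or handcrafted labels to the clusters."""
    def __init__(self, upus, spus):
        self.upus, self.spus = upus, spus
    def run(self, x, label=None):
        for upu in self.upus:          # unsupervised path: no label needed
            x = upu.process(x)
        # supervised offshoots: a label is used only when one is available
        return [spu.process(x, label) for spu in self.spus]
```

The point of the split is that the clusterer can keep learning on unlabeled data at any time, while the interpreter is updated only when labels arrive.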

[0052] It is stressed that the LOM is described in the present invention
disclosure with the understanding that the description is valid for the
equivalents of the LOM.

[0053] Object 1 of the present invention is to provide an artificial
neural network and a method for identifying or approximating a known or
unknown function or dynamical system.

[0054] Object 2 of the present invention is to provide an artificial
neural network and a method for nonparametric nonlinear classification or
regression for spatial or temporal data.

[0055] Object 3 of the present invention is to provide an artificial
neural network and a method for nonparametric nonlinear classification or
regression for spatially or temporally hierarchical data.

[0056] Object 4 of the present invention is to provide an artificial
neural network and a method for recognizing spatial or temporal patterns.

[0057] Object 5 of the present invention is to provide an artificial
neural network and a method for recognizing spatially or temporally
hierarchical patterns.

[0058] Object 6 of the present invention is to provide an artificial
neural network and a method for understanding images or videos.

[0059] Object 7 of the present invention is to provide an artificial
neural network and a method for Objects 1-6 where said artificial neural
network and said method are trained on data with erasure, smear, noise,
occlusion, distortion, and/or alteration.

[0060] Object 8 of the present invention is to provide an artificial
neural network and a method for generating representations of probability
distributions of labels of the vector or its subvectors input to the
artificial neural network or the method.

[0061] Object 9 of the present invention is to provide an artificial
neural network and a method for data fusion, data mining, decision
making, predicting a financial time series, or searching the internet.

[0062] Object 10 of the present invention is to provide an artificial
neural network and a method for unsupervised learning of data or
clustering of data.

[0063] Object 11 of the present invention is to provide an artificial
neural network and a method for supervised learning of data.

[0064] Object 12 of the present invention is to provide an artificial
neural network and a method to perform both supervised and unsupervised
learning of data.

[0065] Object 13a of the present invention is to provide an artificial
neural network whose architecture can be adjusted without discarding
learned knowledge.

[0066] Object 13b of the present invention is to provide a method wherein
the dimensionalities of unsupervised or supervised covariance matrices
can be adjusted without discarding learned knowledge.

[0067] Object 14a of the present invention is to provide an artificial
neural network with a hierarchical architecture for recognizing
hierarchical patterns at different levels.

[0068] Object 14b of the present invention is to provide a method that
employs a hierarchical structure for recognizing hierarchical patterns at
different levels.

[0069] Object 15a of the present invention is to provide an artificial
neural network with feedback connections for processing sequences of
vectors such as those obtained from examining single images at
consecutive time points, multiple images of an object taken from
different angles, consecutive frames in a video or movie, and handwritten
or printed letters in a word, words in a sentence, and sentences in a
paragraph.

[0070] Object 15b of the present invention is to provide a method that
employs feedback structures for processing sequences of vectors such as
those obtained from examining single images at consecutive time points,
multiple images of an object taken from different angles, consecutive
frames in a video or movie, and handwritten or printed letters in a word,
words in a sentence, and sentences in a paragraph.

[0071] Object 16 of the present invention is to provide an artificial
neural network and a method for recognizing rotated, translated and/or
scaled patterns.

[0072] The foregoing objects, as well as other objects of the present
invention that will become apparent from the discussion below, are
achieved by the present invention with the following preferred
embodiments.

[0073] The first embodiment of the present invention disclosed herein is
an artificial neural network that includes at least one processing unit,
said processing unit comprising at least one artificial neuronal encoder
for encoding a vector $v_t(\psi)$ that is input to said encoder into a
neuronal code $\check{v}_t(\psi)$ that is output from the encoder and has
an orthogonality property; a plurality of artificial synapses each for
storing an entry $C_j(\psi)$ of a code deviation accumulation vector

$C(\psi) = \frac{\Lambda}{2} \sum_{s=1}^{t} \lambda^{t-s} \left( \check{v}_s(\psi) - \bar{\check{v}}_s(\psi) \right)'$

and for evaluating a first product
$C_j(\psi)\,(\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi))$ of said
entry $C_j(\psi)$ and a component
$\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi)$ of a code
deviation vector $\check{v}_\tau(\psi) - \bar{\check{v}}_\tau(\psi)$ that is
the deviation of a neuronal code $\check{v}_\tau(\psi)$ from an average
neuronal code $\bar{\check{v}}_\tau(\psi)$; a plurality of artificial
synapses each for storing an entry $D_{kj}(\psi)$ of a code covariance
matrix
$D(\psi) = \Lambda \sum_{s=1}^{t} \lambda^{t-s} (r_s - 1/2) \left( \check{v}_s(\psi) - \bar{\check{v}}_s(\psi) \right)'$
and for evaluating a second product
$D_{kj}(\psi)\,(\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi))$ of
said entry $D_{kj}(\psi)$ and a component
$\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi)$ of a code
deviation vector that is the deviation of a neuronal code
$\check{v}_\tau(\psi)$ from an average neuronal code
$\bar{\check{v}}_\tau(\psi)$; and evaluation means for using a plurality of
first products and a plurality of second products to evaluate a
representation (e.g., $y_\tau$) of a subjective probability distribution
$p_\tau$ of a label $r_\tau$ of a vector $v_\tau$ that is input to said
processing unit.

[0074] The second embodiment of the present invention disclosed herein is
the first embodiment, wherein said processing unit further comprises
memory means for storing a plurality of masking factors $M_{jj}(\psi)$,
$j = 1, \ldots, 2^{\dim v_\tau(\psi)}$, $\psi = 1, \ldots, \Psi$, and said
evaluation means comprises an artificial nonspiking neuron for evaluating
a first sum
$c_\tau = \sum_{\psi=1}^{\Psi} \sum_{j=1}^{2^{\dim v_\tau(\psi)}} c_{\tau j}(\psi)$
of products
$c_{\tau j}(\psi) = C_j(\psi)\,M_{jj}(\psi)\,(\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi))$,
each of a first product and a masking factor $M_{jj}(\psi)$; and at least
one artificial spiking neuron for evaluating at least one second sum
$d_{\tau k} = \sum_{\psi=1}^{\Psi} \sum_{j=1}^{2^{\dim v_\tau(\psi)}} d_{\tau kj}(\psi)$,
$k = 1, \ldots, R$, of products
$d_{\tau kj}(\psi) = D_{kj}(\psi)\,M_{jj}(\psi)\,(\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi))$,
each of a second product and a masking factor $M_{jj}(\psi)$, and for
using said first sum and said at least one second sum to evaluate a
representation of a subjective probability distribution $p_\tau$ of a
label $r_\tau$ of a vector that is input to said processing unit. Here the
masking factors are diagonal entries of a masking matrix; said products
each of a first product and a masking factor are entries of the product
of a code deviation accumulation vector, a masking matrix and a code
deviation vector; and said products each of a second product and a
masking factor are entries of the product of a code covariance matrix, a
masking matrix and a code deviation vector.

[0075] Masking factors $M_{jj}(\psi)$, $j = 1, \ldots, 2^{\dim v_\tau(\psi)}$,
$\psi = 1, \ldots, \Psi$, are nonnegative integers. The first sum
$\sum_{\psi=1}^{\Psi} \sum_{j=1}^{2^{\dim v_\tau(\psi)}} c_{\tau j}(\psi)$
of products $c_{\tau j}(\psi)$ is equal to
$\sum_{\psi=1}^{\Psi} \sum_{j=1}^{2^{\dim v_\tau(\psi)}} \sum_{i=1}^{M_{jj}(\psi)} C_j(\psi)\,(\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi))$,
which is a sum of products
$C_j(\psi)\,(\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi))$. The
second sum
$\sum_{\psi=1}^{\Psi} \sum_{j=1}^{2^{\dim v_\tau(\psi)}} d_{\tau kj}(\psi)$
of products $d_{\tau kj}(\psi)$ is equal to
$\sum_{\psi=1}^{\Psi} \sum_{j=1}^{2^{\dim v_\tau(\psi)}} \sum_{i=1}^{M_{jj}(\psi)} D_{kj}(\psi)\,(\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi))$,
which is a sum of products
$D_{kj}(\psi)\,(\check{v}_{\tau j}(\psi) - \bar{\check{v}}_{\tau j}(\psi))$. In
short, multiplying a first product by a masking factor can be achieved by
adding duplicates of the first product, and multiplying a second product
by a masking factor can be achieved by adding duplicates of the second
product. Therefore the second embodiment is a special case of the first
embodiment.
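The equivalence argued in [0075] is elementary integer arithmetic and can be checked directly:

```python
def masked_sum(products, masking_factors):
    """Sum of products, each weighted by its (nonnegative integer) masking factor."""
    return sum(m * p for p, m in zip(products, masking_factors))

def duplicated_sum(products, masking_factors):
    """The same sum obtained by adding M_jj duplicates of each product instead
    of multiplying by M_jj."""
    total = 0.0
    for p, m in zip(products, masking_factors):
        total += sum(p for _ in range(m))  # m duplicates; m == 0 drops the term
    return total
```

A masking factor of zero simply removes a product from the sum, which is how masking suppresses corrupted components.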

[0076] Note that the product of a masking matrix $M(\psi)$ and the code
deviation vector $\check{v}_\tau(\psi) - \bar{\check{v}}_\tau(\psi)$,

where $\mathrm{diag}_I(i_1^-, i_2^-, \ldots, i_j^-)\,(\check{v}_\tau(\psi) - \bar{\check{v}}_\tau(\psi))$
can be looked upon as the code deviation vector of the neuronal code
$\mathrm{diag}_I(i_1^-, i_2^-, \ldots, i_j^-)\,\check{v}_\tau(\psi)$, which can
in turn be looked upon as what a neuronal encoder encodes the vector
$\mathrm{diag}_I(i_1^-, i_2^-, \ldots, i_j^-)\,v_\tau(\psi)$ into, and
$2^{-\eta j}$ denotes a weight that is preselected to differentiate
between different levels $j$ of maskings. Therefore, the first embodiment
with artificial neuronal encoders that produce such neuronal codes
$\mathrm{diag}_I(i_1^-, i_2^-, \ldots, i_j^-)\,\check{v}_\tau(\psi)$ is
mathematically the same as the second embodiment and has the same
generalization capability as the second embodiment.

[0077] The third embodiment of the present invention disclosed herein is
the second embodiment, wherein said at least one artificial spiking
neuron is further for generating a pseudorandom vector v {p.sub.τ}
(or v {y.sub.τ}) in accordance with said subjective probability
distribution p.sub.τ.

[0078] The fourth embodiment of the present invention disclosed herein is
the second embodiment, wherein the artificial neural network further
comprises at least one feedback connection with time delay means. The
artificial neural network in this embodiment is a recurrent network.

[0079] The fifth embodiment of the present invention disclosed herein is
the second embodiment, wherein each processing unit further comprises
unsupervised accumulation learning means for adjusting the code deviation
accumulation vectors $C(\psi)$, $\psi = 1, \ldots, \Psi$, in response to a
vector $v_t$ that is input to said processing unit by the unsupervised
accumulation rule

$C(\psi) \leftarrow \lambda C(\psi) + \frac{\Lambda}{2} \left( \check{v}_t(\psi) - \bar{\check{v}}_t(\psi) \right)';$

and unsupervised learning means for using a pseudorandom vector
$v\{p_\tau\}$ generated in accordance with said subjective probability
distribution $p_\tau$ and code deviation vectors
$\check{v}_t(\psi) - \bar{\check{v}}_t(\psi)$ to adjust the code covariance
matrices $D(\psi)$, $\psi = 1, \ldots, \Psi$, in response to a vector that
is input to said processing unit by the unsupervised covariance rule

$D(\psi) \leftarrow \lambda D(\psi) + \Lambda \left( v\{p_\tau\} - 1/2 \right) \left( \check{v}_t(\psi) - \bar{\check{v}}_t(\psi) \right)'.$

A processing unit in the fifth embodiment is called an unsupervised
processing unit (UPU). So is a processing unit in the second embodiment,
wherein the code deviation accumulation vectors and code covariance
matrices in said processing unit have been determined as those in the
fifth embodiment and held fixed.

[0080] The sixth embodiment of the present invention is the second
embodiment, wherein each processing unit further comprises supervised
accumulation learning means for adjusting the code deviation accumulation
vectors $C(\psi)$, $\psi = 1, \ldots, \Psi$, in response to a vector
$v_t$ that is input to said processing unit by the unsupervised
accumulation rule

$C(\psi) \leftarrow \lambda C(\psi) + \frac{\Lambda}{2} \left( \check{v}_t(\psi) - \bar{\check{v}}_t(\psi) \right)';$

and supervised learning means for using a label $r_t$ of $v_t$ provided
from outside said processing unit and code deviation vectors
$\check{v}_t(\psi) - \bar{\check{v}}_t(\psi)$ to adjust the code covariance
matrices $D(\psi)$, $\psi = 1, \ldots, \Psi$, in response to a vector
$v_t$ that is input to said processing unit by the supervised covariance
rule

$D(\psi) \leftarrow \lambda D(\psi) + \Lambda \left( r_t - 1/2 \right) \left( \check{v}_t(\psi) - \bar{\check{v}}_t(\psi) \right)'.$

A processing unit in the sixth embodiment is called a supervised
processing unit (SPU). So is a processing unit in the second embodiment,
wherein the code deviation accumulation vectors and code covariance
matrices in said processing unit have been determined as those in the
sixth embodiment and held fixed.

[0081] The seventh embodiment of the present invention is the second
embodiment, wherein each processing unit (PU) further includes: memory
means for storing a plurality of learning masking factors; means for
evaluating a third sum of products each of a first product and a learning
masking factor; means for evaluating R fourth sums, each of which is a
sum of products each of a second product and a learning masking factor,
and using said third sum and said fourth sum to evaluate a representation
of a second subjective probability distribution; and unsupervised
learning means for using R pseudorandom numbers generated in accordance
with said second subjective probability distribution and at least one
code deviation vector to adjust at least one code covariance matrix in
response to a vector that is input to said processing unit by the
unsupervised covariance rule. With one set of masking factors for
retrieving and a second set of masking factors for unsupervised learning,
a PU can distinguish learned input vectors with small differences and
generalize on parts of input vectors that differ more substantially from
learned input vectors.

[0082] A further embodiment of the present invention is the second
embodiment, wherein a plurality of processing units are unsupervised
processing units, which form a network called a clusterer, and at least
one processing unit is a supervised processing unit that receives vectors
from an unsupervised processing unit in said clusterer, the set of at
least one supervised processing unit being called an interpreter.

[0083] The present invention can also be embodied as systems, in which
artificial neuronal encoders, synapses, and nonspiking/spiking neurons are
respectively embodied as encoding means, memory means, and evaluation
means, while learning means in an ANN remain learning means. The
following embodiment is an example.

[0084] The eighth embodiment of the present invention is a system for
processing data, said system comprising at least one processing unit, each
processing unit comprising encoding means for encoding a vector that is
input to said encoder into a neuronal code that has an orthogonality
property; memory means for storing at least one code deviation
accumulation vector, at least one code covariance matrix and at least one
masking matrix; means for evaluating a first product of a code deviation
accumulation vector, a masking matrix and a code deviation vector that is
the deviation of a neuronal code from an average neuronal code; means for
evaluating a second product of a code covariance matrix, a masking matrix
and a code deviation vector that is the deviation of a neuronal code from
an average neuronal code; and evaluation means for using at least one
first product and at least one second product to evaluate a
representation of a first subjective probability distribution of a label
of a vector that is input to said processing unit.

[0085] The present invention can also be embodied as methods, in which the
functions of artificial neuronal encoder, synapses, nonspiking/spiking
neurons and learning means in an ANN are respectively embodied as steps.
The following embodiment is an example.

[0086] The ninth embodiment of the present invention is a method for
processing data that includes the steps of encoding a subvector of a
first vector into a neuronal code with an orthogonality property;
evaluating a code deviation vector that is the deviation of a neuronal
code from an average neuronal code; evaluating a first product of an
entry of a code deviation accumulation vector, a component of a code
deviation vector and a masking factor; evaluating a second product of an
entry of a code covariance matrix, a component of a code deviation vector
and a masking factor; evaluating a first sum of first products;
evaluating at least one second sum of second products; and using said
first sum and said at least one second sum to evaluate a representation
of a subjective probability distribution of a component of a label of
said first vector.

[0088] Preferred embodiments of the present invention will now be further
described in the following paragraphs of the specification and may be
better understood when read in conjunction with the attached drawings, in
which:

[0089]FIG. 1 illustrates an artificial neuronal node 7, represented by a
solid dot, in the LOM. The artificial neuronal node has two input
variables, v and u, and one output φ (v, u)=-2vu+v+u. If v and u are
binary digits, φ (v, u) acts like the logic gate, XOR (exclusive-or).

[0090]FIG. 2 is a three-dimensional graph of the output φ (v, u) of
an artificial neuronal node over the unit square [0, 1]2. Note that
the domain of the function φ (v, u) contains the unit square as a
subset. The saddle shape of φ (v, u) shows that the artificial
neuronal node is robust: if the strengths of spikes change in travelling
through dendrites/axons, or if the spikes are corrupted by biological
noise in the dendrites/axons, the outputs of the artificial neuronal
node suffer only a graceful degradation. The hyperbolic polynomial
-2vu+v+u is an idealized approximation of the XOR logic gate and
henceforth called the XOR polynomial.

[0091]FIG. 3 illustrates an artificial neuronal encoder with four inputs
vti, i=1, . . . , 4; and 16 outputs listed on the right side of the
artificial neuronal encoder. It is an upper part of an artificial
neuronal tree, performing neuronal encoding. The solid dots represent
artificial neuronal nodes as shown in FIG. 1. Because of the
commutativity and associativity of the XOR polynomial, there are many
possible branching structures with the same inputs and outputs.

[0092]FIG. 4 illustrates an artificial neuronal encoder 6 that generates
the $2^m$-dimensional neuronal code $\check{v}$ of an $m$-dimensional
vector $v$ input to the artificial neuronal encoder in a recursive
manner.

[0093]FIG. 5a illustrates a deviation evaluator 12 that evaluates the
code deviation $\check{v}_{tj} - \bar{\check{v}}_{tj}$ of an artificial
neuronal code $\check{v}_{tj}$ from its average $\bar{\check{v}}_{tj}$,
where the average is

$\bar{\check{v}}_{tj} = \frac{1}{q_v} \sum_{\tau = t - q_v + 1}^{t} \check{v}_{\tau j}$

over a time window $[t - q_v + 1, t]$ with the window width $q_v$; an
unsupervised learning scheme that adjusts an entry $D_{ij}$ of the code
covariance matrix $D$ using the unsupervised covariance rule 14a; a
unit-time delay 8; an artificial synapse 18d that stores the entry
$D_{ij}$ of the code covariance matrix $D$ after a unit-time delay 8 and
evaluates the product $D_{ij}(\check{v}_{tj} - \bar{\check{v}}_{tj})$; and
artificial spiking neuron i 10 that uses such products from artificial
synapses and a graded signal $c_t$ from an artificial nonspiking neuron
to generate an output $u_{ti}$. Note that the unsupervised learning rule
14a inputs the code deviation $\check{v}_{tj} - \bar{\check{v}}_{tj}$ and
the output $u_{ti}$ from artificial spiking neuron i 10 as the ith
component $r_{ti}$ of the label $r_t$ of $v_t$.

[0094]FIG. 6a illustrates a deviation evaluator 12 that evaluates the
code deviation $\check{v}_{tj} - \bar{\check{v}}_{tj}$ of an artificial
neuronal code $\check{v}_{tj}$ from its average $\bar{\check{v}}_{tj}$,
where the average is

$\bar{\check{v}}_{tj} = \frac{1}{q_v} \sum_{\tau = t - q_v + 1}^{t} \check{v}_{\tau j}$

over a time window $[t - q_v + 1, t]$ with the window width $q_v$; a
supervised learning scheme that adjusts an entry $D_{ij}$ of the code
covariance matrix $D$ using the supervised covariance rule 14b; a
unit-time delay 8; an artificial synapse 18d that stores the entry
$D_{ij}$ of the code covariance matrix $D$ after a unit-time delay and
evaluates the product $D_{ij}(\check{v}_{tj} - \bar{\check{v}}_{tj})$; and
artificial spiking neuron i 10 that uses such products from artificial
synapses and a graded signal $c_t$ from an artificial nonspiking neuron
to generate an output $u_{ti}$. Note that the supervised learning rule
14b inputs the code deviation $\check{v}_{tj} - \bar{\check{v}}_{tj}$ and
the component $w_{ti}$ of a label $w_t$ of the measurements in the
receptive field of model spiking neuron i, where $w_t$ is provided from
outside the LOM (e.g., a handcrafted label $w_t$) as the ith component
$r_{ti}$ of the label $r_t$ for the supervised covariance rule.

[0095] FIG. 7a illustrates a deviation evaluator 12 that evaluates the
code deviation $\check{v}_{tj} - \bar{\check{v}}_{tj}$ of an artificial
neuronal code $\check{v}_{tj}$ from its average $\bar{\check{v}}_{tj}$,
where the average is

$\bar{\check{v}}_{tj} = \frac{1}{q_v} \sum_{\tau = t - q_v + 1}^{t} \check{v}_{\tau j}$

over a time window $[t - q_v + 1, t]$ with the window width $q_v$; an
unsupervised accumulation learning scheme that adjusts an entry $C_j$ of
the code deviation accumulation vector $C$ using the unsupervised
accumulation rule 20; a unit-time delay 8; an artificial synapse 18c that
stores the entry $C_j$ of the code deviation accumulation vector $C$
after a unit-time delay and evaluates the product
$C_j(\check{v}_{tj} - \bar{\check{v}}_{tj})$; and artificial nonspiking
neuron i 16 that sums such products from artificial synapses into an
output $c_t$. Note that the unsupervised accumulation rule 20 inputs only
the component $\check{v}_{tj} - \bar{\check{v}}_{tj}$ of the code
deviation vector $\check{v}_t - \bar{\check{v}}_t$.

[0096] FIG. 8 shows an artificial synapse 18d that stores an entry
$D_{ij}$ of the code covariance matrix $D$ and performs the
multiplication of its input $\check{v}_{tj} - \bar{\check{v}}_{tj}$ and
its weight $D_{ij}$. FIG. 8 also shows an artificial synapse 18c that
stores an entry $C_j$ of the code deviation accumulation vector $C$ and
performs the multiplication of its input
$\check{v}_{tj} - \bar{\check{v}}_{tj}$ and its weight $C_j$.

[0097] FIG. 9 shows the data for supervised learning in EXAMPLE 2a,
unsupervised learning in EXAMPLE 2b, and unsupervised learning in EXAMPLE
2c at the vertices of a cube. The digits 17 in the squares at the
vertices are labels for supervised learning. The question marks 15
indicate that the labels are unknown and unsupervised learning is
necessary.

[0098]FIG. 10 shows an artificial synapse 18d that stores an entry
$D_{ij}$ of the code covariance matrix $D$ and performs the
multiplication of its input $\check{v}_{tj} - \bar{\check{v}}_{tj}$ and
its weight $D_{ij}$. Denoting the jth diagonal entry of the masking
matrix $M$ by $M_{jj}$, the product
$D_{ij}(\check{v}_{tj} - \bar{\check{v}}_{tj})$ obtained by the artificial
synapse is multiplied by the masking factor $M_{jj}$ 22, resulting in
$d_{\tau ij} = M_{jj} D_{ij}(\check{v}_{tj} - \bar{\check{v}}_{tj})$, which
is the (i, j)th entry of the matrix $[d_{\tau ij}]$. FIG. 10 also shows
an artificial synapse 18c that stores an entry $C_j$ of the code
deviation accumulation vector $C$ and performs the multiplication of its
input $\check{v}_{tj} - \bar{\check{v}}_{tj}$ and its weight $C_j$. The
product $C_j(\check{v}_{tj} - \bar{\check{v}}_{tj})$ obtained by the
artificial synapse 18c is multiplied by the masking factor $M_{jj}$ 22,
resulting in
$c_{\tau j} = M_{jj} C_j(\check{v}_{tj} - \bar{\check{v}}_{tj})$, which is
the jth entry of the row vector $[c_{\tau j}]$.

[0099] FIG. 11 illustrates an artificial neuronal encoder, wherein
artificial neuronal nodes 7 are represented by large dots, together with
an artificial nonspiking neuron 16, an artificial spiking neuron 10, the
artificial synapses denoted by 18c and 18d, and the masking matrix M
denoted by the dashed line 15, which are used in EXAMPLES 2a, 2b and 2c.
The dashed line
15 represents a masking factor following an artificial synapse on each
feedforward connection. In unsupervised accumulation learning, the
weights in artificial synapses 18c are adjusted by the unsupervised
accumulation rule 20 as shown in FIG. 7b. For unsupervised covariance
learning, the teaching signal r.sub.τ is the output vector
u.sub.τ=v {p.sub.τ} of the spiking neuron, and the unsupervised
covariance rule (14a in FIG. 5b) is used to adjust the weights in
artificial synapses denoted by 18d. For supervised covariance learning,
the teaching signal r.sub.τ is a label w.sub.τ of input vector
v.sub.τ, that is provided from outside the LOM, and the supervised
covariance rule (14b in FIG. 6b) is used to adjust the weights in
artificial synapses denoted by 18d.

[0100] FIG. 12 illustrates the masking matrix M 21 for an artificial
neuronal encoder that encodes an m-dimensional vector vt into its
2m-dimensional neuronal code {hacek over (v)}t. M is a
2m×2m diagonal matrix diag(M11 . . .
M2m2m), whose diagonal entries are called masking
factors. M{hacek over (v)}t eliminates (or reduces) the effect of
the corrupted, distorted or occluded components of vt to enable
generalization on other components. The greater J is, the more
components of vt can have their effect eliminated (or reduced). The factor
2-ηj is used to de-emphasize the effect of subvectors of vt
in accordance with the number j of components of vt included in
(i1-, i2-, . . . , ij-). The constant J is
preselected with consideration of dim vt and the application.

[0101] FIG. 5b is the same as FIG. 5a except for the addition of the masking
factor Mjj 22 that is multiplied to the output Dij ({hacek over
(v)}tj-{hacek over (v)}tj) from the artificial synapse 18d to
yield the product dtij=MjjDij ({hacek over
(v)}tj-{hacek over (v)}tj). Artificial spiking neuron i 10 then
uses such products and a graded signal ct from an artificial
nonspiking neuron to generate an output uti. Note that the
unsupervised learning rule 14a inputs the artificial code deviation
{hacek over (v)}tj-{hacek over (v)}tj and the output uti
from the artificial spiking neuron i 10 as the ith component rti of
the label rt of vt.

[0102] FIG. 6b is the same as FIG. 6a except for the addition of the masking
factor Mjj 22 that is multiplied to the output Dij ({hacek over
(v)}tj-{hacek over (v)}tj) from the artificial synapse 18d to
yield the product dtij=MjjDij ({hacek over
(v)}tj-{hacek over (v)}tj). Artificial spiking neuron i 10 then
uses such products and a graded signal ct from an artificial
nonspiking neuron to generate an output uti. Note that the
supervised learning rule 14b inputs the code deviation {hacek over
(v)}tj-{hacek over (v)}tj and the component wti of a label
wt of the measurements in the receptive field of model spiking
neuron i, where wt is provided from outside the LOM (e.g., a
handcrafted label wt), as the ith component rti of the label
rt for the supervised covariance rule.

[0103] FIG. 7b is the same as FIG. 7a except for the addition of the masking
factor Mjj 22 that is multiplied to the output Cj ({hacek over
(v)}tj-{hacek over (v)}tj) from the artificial synapse 18c to
yield the product ctj=MjjCj ({hacek over
(v)}tj-{hacek over (v)}tj). Artificial nonspiking neuron 16
then sums such products as an output ct. Note that the unsupervised
accumulation rule inputs only the component {hacek over
(v)}tj-{hacek over (v)}tj of the code deviation vector {hacek
over (v)}t-{hacek over (v)}t.

[0104] FIG. 13 shows that the product M{hacek over (v)}.sub.τ in
EXAMPLE 3 consists of four terms, {hacek over (v)}.sub.τ, (diagI
(1-)) {hacek over (v)}.sub.τ, (diagI (2-)) {hacek over
(v)}.sub.τ, and (diagI (3-)) {hacek over (v)}.sub.τ, where
(diagI (j-)) {hacek over (v)}.sub.τ has the components in {hacek
over (v)}.sub.τ that involve v.sub.τj eliminated. These four
terms can be viewed as the outputs of four artificial neuronal encoders
shown in (a) 26, (b) 28, (c) 30 and (d) 32 in response to four overlapped
and nested input vectors, v.sub.τ=[v.sub.τ1 v.sub.τ2
v.sub.τ3]', [v.sub.τ2 v.sub.τ3]', [v.sub.τ1
v.sub.τ3]', and [v.sub.τ1 v.sub.τ2]', respectively. The
product M{hacek over (v)}.sub.τ is the sum of the outputs in (a),
(b), (c) and (d). Although there are two parts for (diagI (3-))
{hacek over (v)}.sub.τ in (d), they can be replaced with a single
neuronal encoder, namely the upper neuronal (e.g., dendritic) part in (d)
with φ (v.sub.τ2, v.sub.τ2)=0 added to replace φ
(v.sub.τ2, v.sub.τ2)=0 in the lower neuronal (e.g., dendritic)
part.

[0105] FIG. 14 illustrates three neuronal encoders 36 in a PU (processing
unit). Let vt (ψ), ψ=1, 2, 3, be subvectors of the vector
vt 34 input to the PU, where the components of vt (ψ) are
randomly selected from those of vt such that the union of all the
components of vt (ψ), ψ=1, 2, 3, is the set of components of
vt. The locations of the components of vt (1), vt (2) and
vt (3) are indicated by x, y and z in the circle 34, respectively.
The subvectors vt (ψ), ψ=1, 2, 3, are encoded by model
neuronal encoders 36 into neuronal codes {hacek over (v)}t (ψ),
ψ=1, 2, 3, which form the general neuronal code {hacek over
(v)}t=[{hacek over (v)}t' (1) {hacek over (v)}t' (2) {hacek over (v)}t' (3)]' of the vector
vt input to the PU. Note that this {hacek over (v)}t here is
different from the neuronal code of the vector vt input to the PU.

[0106] FIG. 15 illustrates two artificial neuronal encoders that input
subvectors, v.sub.τ (1) and v.sub.τ (2), of the vector
v.sub.τ. v.sub.τ (1) and v.sub.τ (2) are overlapped. With the
masking factors indicated by the dashed line denoted by M, the two
artificial neuronal encoders are used to encode one input vector
v.sub.τ into the general neuronal code {hacek over (v)}.sub.τ to
reduce the number of synapses required (from 16 to 12) and to
increase the generalization capability.

[0107] FIG. 16 illustrates the group 40 of artificial neuronal encoders
(NEs) 36 in a PU (processing unit). The vector input to the PU is
vt. Let vt (ψ), ψ=1, . . . , Ψ, be subvectors of
the vector vt 34 input to the PU, where the components of vt
(ψ) are randomly selected from vt such that the union of all the
components of vt (ψ), ψ=1, . . . , Ψ, is the set of
components of vt. The subvectors vt (ψ), ψ=1, . . . ,
Ψ, are encoded by artificial neuronal encoders 36 into neuronal codes
{hacek over (v)}t (ψ), ψ=1, . . . , Ψ, which form the
general neuronal code {hacek over (v)}t=[{hacek over (v)}t' (1)
. . . {hacek over (v)}t' (Ψ)]' of the vector input to the PU.

[0108] FIG. 17 illustrates the general masking matrix M in a PU, wherein M
(ψ) is the masking matrix for neuronal encoder ψ for ψ=1, . .
. , Ψ. Note that η can be preselected for a given magnitude of
dim vt (ψ) and the application. M{hacek over (v)}t, where
{hacek over (v)}t is the general neuronal code, reduces or
eliminates the effect of the corrupted, distorted, occluded, noised, or
erased components of the vector vt input to the PU. Note that the
vector [ΣjMjjDij ({hacek over (v)}.sub.τj-{hacek
over (v)}.sub.τj)], where Σj is taken over j=1, . . . ,
Σ.sub.ψ=1.sup.Ψ 2dim v(ψ), is equal to the vector
DM ({hacek over (v)}.sub.τ-{hacek over (v)}.sub.τ), because M is
a diagonal matrix.

[0109] FIG. 18 provides the formulas 42 of the general neuronal code
{hacek over (v)}t, the average general neuronal code {hacek over
(v)}t, the general code deviation accumulation vectors C, and the
general code covariance matrix D, whose D (ψ), ψ=1, . . . ,
Ψ, are learned by the unsupervised covariance rule in an unsupervised
PU (UPU) or by the supervised covariance rule in a supervised PU (SPU).

[0110] FIG. 19 illustrates an artificial nonspiking neuron 16 in a PU,
which evaluates the sum c.sub.τ of its inputs, c.sub.τj, j=1, . .
. , Σ.sub.ψ=1.sup.Ψ 2dim v.sup.τ.sup.(ψ),
which are computed using the general neuronal code deviation {hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ, the general masking matrix M
and the general code deviation accumulation vector C.

[0112] FIG. 21 illustrates a PU (processing unit) 56 without a learning
mechanism. A general code deviation accumulation vector C and a general
code covariance matrix D are stored in synapses 48. If D was learned
before in accordance with the unsupervised covariance rule, the PU 56 is
called an unsupervised processing unit (UPU). If D was learned before in
accordance with the supervised covariance rule, the PU 56 is called a
supervised processing unit (SPU). At time or numbering τ, the PU
receives v.sub.τ. The neuronal encoders 40 encode it into the general
neuronal code {hacek over (v)}.sub.τ. The synapses 48 compute the
entries of the matrix [Dkj ({hacek over (v)}tj-{hacek over
(v)}tj)] and the vector [Cj ({hacek over (v)}tj-{hacek
over (v)}tj)], which entries are multiplied to the masking factors
Mjj 46 to yield the matrix [d.sub.τkj]=[MjjDkj ({hacek
over (v)}tj-{hacek over (v)}tj)] and the vector
[c.sub.τj]=[MjjCj ({hacek over (v)}tj-{hacek over
(v)}tj)]. The masking factor Mjj is the jth diagonal entry of
the general masking matrix M 46.

[0113] There are R artificial spiking neurons 50 and one artificial
nonspiking neuron 16 in the PU. The artificial nonspiking neuron sums up
c.sub.τj to form c.sub.τ. For k=1, . . . , R, artificial spiking
neuron k sums up d.sub.τkj to form d.sub.τk and divides
d.sub.τk by c.sub.τ to get y.sub.τk. The vector p.sub.τ
with components p.sub.τk=(y.sub.τk+1)/2 represents a subjective
probability distribution of the label of v.sub.τ input to the PU. For
k=1, . . . , R, artificial spiking neuron k generates a pseudorandom
number u.sub.τk=v {p.sub.τk} in accordance with the subjective
probability p.sub.τk. The pseudo-random vector v {p.sub.τ} is a
point estimate of the label of v.sub.τ according to the subjective
probability distribution p.sub.τ and is the output vector of the PU
56 at time or numbering τ. Over time, the R components of
u.sub.τ=v {p.sub.τ} form R spike trains.
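The output stage just described can be sketched in a few lines of Python. The names and the stand-in pseudorandom generator (Python's Mersenne Twister) are our own choices; only the formulas y.sub.τk=d.sub.τk/c.sub.τ, p.sub.τk=(y.sub.τk+1)/2, and the Bernoulli sampling come from the text.

```python
# A sketch (our names) of the PU output stage described above: spiking
# neuron k forms y_k = d_k / c, maps it to the subjective probability
# p_k = (y_k + 1) / 2, and emits a pseudorandom bit with probability p_k.
import random

def pu_output(d, c, rng=None):
    """d: the R sums d_k from the spiking neurons; c: the graded sum from
    the nonspiking neuron. Returns (p, u)."""
    rng = rng or random.Random(0)
    p = [(dk / c + 1.0) / 2.0 for dk in d]
    u = [1 if rng.random() < pk else 0 for pk in p]  # point estimate of the label
    return p, u

p, u = pu_output(d=[0.5, -0.5, 0.0], c=1.0)
print(p)   # [0.75, 0.25, 0.5]
print(u)   # one sampled binary label estimate; over time the u's form R spike trains
```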

[0114] FIG. 22 illustrates a UPU with an unsupervised learning mechanism
60 to learn the code covariance matrix D and an unsupervised learning
mechanism 58 to learn the code deviation accumulation vector C. The
former 60 learns in accordance with the unsupervised covariance rule,

DλD+Λ(u.sub.τ-u.sub.τ)({hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ)'

where u.sub.τ=v {p.sub.τ} is the output of the PU and is
actually used by the PU as a label r.sub.τ of v.sub.τ, and the
latter 58 learns in accordance with the unsupervised accumulation rule

C  λ C + Λ 2 ( v τ -
v τ ) ' ##EQU00008##

The adjusted C and D are delayed for one unit of time or one numbering 8
before being stored in the artificial synapses 48.

[0115] FIG. 23 illustrates an SPU with a supervised learning mechanism 62
to learn the code covariance matrix D and an unsupervised learning
mechanism 58 to learn the code deviation accumulation vector C. The
former 62 learns in accordance with the supervised covariance rule,

DλD+Λ(r.sub.τ-1/2)({hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ)'

where r.sub.τ denotes the label of the vector v.sub.τ input to the SPU 64,
and the latter 58 learns in accordance with the unsupervised accumulation
rule

C  λ C + Λ 2 ( v τ -
v τ ) ' ##EQU00009##

The adjusted C and D are delayed for one unit of time or one numbering 8
before being stored in the artificial synapses 48.

[0116] FIG. 24 illustrates a UPU with an unsupervised learning mechanism
60 to learn the code covariance matrix D that uses learning masking
factors Mjj# and an unsupervised learning mechanism 58 to learn
the code deviation accumulation vector C. Mjj# are the diagonal
entries of the learning masking matrices, M# (ψ), ψ=1, . . .
, Ψ, which have a smaller J in (32) than the masking matrices M
(ψ) in the same UPU do.

[0117] The UPU (with learning masking matrices M# (ψ), ψ=1, .
. . , Ψ), generates an estimate v {p.sub.τ#} of the label of
the vector v.sub.τ input to the UPU using its masking matrices M
(ψ), ψ=1, . . . , Ψ, as before. However, when it comes to
learning, the learning masking matrices M# (ψ), ψ=1, . . . ,
Ψ, are used instead to generate an estimate v {p.sub.τ#} of
the label of vt, and the general code covariance matrix D is
adjusted by

D(ψ)λD(ψ)+Λ(v{p.sub.τ#}-v{p.sub..tau-
.#})({hacek over (v)}.sub.τ-{hacek over (v)}.sub.τ)'

for ψ=1, . . . , Ψ. The generation of v {p.sub.τ#} is
described in the Subsection on "Learning Masking Factors for Unsupervised
Learning of Smaller Differences" and summarized in 62 in FIG. 24. The code
deviation accumulation vector C is learned in accordance with the
unsupervised accumulation rule

C  λ C + Λ 2 ( v τ -
v τ ) ' ##EQU00010##

The adjusted C and D are delayed for one unit of time or one numbering 8
before being stored in the artificial synapses 48.

[0118] FIG. 25 illustrates layer l and layer l+2 of an example LOM with a
feedback connection from layer l to layer l and another from layer l+2 to
layer l. The former contains 1 unit-time delay device 8, and the latter contains 5
unit-time delay devices 8. The box under layer l of PUs does not
represent a model of a biological entity. It shows only how both the
feedforwarded and feedbacked spike trains are assembled into the vector
input to layer l. Note that only the feedbacks from layer l and layer l+2
are shown here. There can be feedbacks from other layers. Note also that
for notational simplicity, we feed back all outputs from layer l and layer
l+2, but this is not necessary.

[0119] FIG. 26 illustrates an example LOM with three layers of PUs. There
are three types of feedback connection: same-layer feedbacks, one-layer
feedbacks, and two-layer feedbacks. The delay durations on the feedback
connections are not specified in the delay boxes 76. The PUs 74 can be
UPUs or SPUs. The second delay box on a feedback connection represents an
additional delay.

[0120] FIG. 27 illustrates the entire clusterer with three layers of UPUs
78 in an example LOM.

[0121] FIG. 28 illustrates an example LOM consisting of the clusterer from
FIG. 27 and an interpreter with three SPUs 80. The feedback connections
and delays on them in the clusterer in FIG. 27 are not shown for clarity
in FIG. 28.

5 DESCRIPTION OF PREFERRED EMBODIMENTS

[0122] In the terminology of artificial neural networks, machine learning,
and pattern recognition, a feature vector is a transformation of a
measurement vector, whose components are measurements or sensor outputs.
As a special case, the transformation is the identity transformation, and
the feature vector is the measurement vector. Example measurement vectors
are digital photographs, frames of a video, segments of speech and
handwritten characters/words/numbers. In the present disclosure, a
feature vector is usually referred to simply as a vector, unless it must
be emphasized that the vector is a transformation of a measurement
vector.

[0123] The present disclosure discloses a low-order model (LOM)
of biological neural networks, the Clustering Interpreting Probabilistic
Associative Memory (CIPAM), and other ANNs (artificial neural networks)
that are equivalent to the LOM and the CIPAM in the sense that they can
be obtained by transforming the LOM or the CIPAM with an affine function
and its inverse. The LOM and its equivalent ANNs including the CIPAM are
systems that receive and process feature vectors or sequences of feature
vectors. Such feature vectors input to the LOM, the CIPAM or another ANN
equivalent to them are called exogenous feature vectors. The LOM and ANNs
(including the CIPAM) equivalent to the LOM can be viewed as a new ANN
paradigm and a new type of learning machine.

[0124] The LOM or an ANN equivalent to it is a network of processing units
(PUs) with or without feedback connections. If an LOM or an ANN
equivalent to it (e.g., CIPAM) is a multilayer network, a vector input to
a layer of the network is a feature vector, because it is a
transformation of exogenous feature vectors input to the LOM or the ANN
and is in turn a transformation of the measurement vectors. A feature
vector input to layer l comprises a vector output from layer l-1 and
vectors output and feedbacked from the same or other layers. For example,
if there is a feedback connection to layer 1, then an exogenous feature
vector is not an entire feature vector input to layer 1, but only a
subvector of said entire feature vector.

[0125] The LOM, the CIPAM and their equivalent models are each a network
of models of the biological neuronal node/encoder, synapse,
spiking/nonspiking neuron, means for learning, feedback connection,
maximal generalization scheme, etc. For simplicity,
these component models are often referred to without the word "model".
For example, the model neuronal node, model neuronal encoder, model
neuronal tree, model spiking/nonspiking neuron, model synapse, etc. will
be referred to as the neuronal node, neuronal encoder, neuronal tree,
spiking/nonspiking neuron, synapse, etc. respectively. The LOM, the CIPAM
and all their equivalent models can be used as artificial neural networks
(ANNs). To emphasize that their components are artificial components in
these artificial neural networks, they are referred to as the artificial
neuronal node, artificial neuronal encoder, artificial neuronal tree,
artificial spiking/nonspiking neuron, artificial synapse, etc.
respectively.

[0126] We will first describe the LOM and then show how the LOM is
transformed into the CIPAM and the general ANN equivalent to the LOM by
the use of the affine functions.

[0128] It was discovered in the 1980s and 1990s that biological dendrites
are capable of performing information processing tasks. However, neuronal
trees are missing in well-known artificial neural networks, which thus
overlook a large percentage of the neural circuit (Simon Haykin, Neural Networks and
Learning Machines, Third Edition, Pearson Education, New Jersey, 2009 and
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer
Science, New York, 2006).

[0129] In the low-order model (LOM) disclosed in the present
disclosure, a model neuronal (dendritic/axonal) encoder is a network of
model neuronal nodes, each of which is a low-order polynomial with two
variables that acts like an XOR (exclusive-OR) logic gate when the two
input variables are (bipolar) binary digits.

[0130] In this Subsection, model neuronal nodes and model neuronal
encoders are described. A model neuronal encoder is a function that
encodes the vector input to the encoder into the vector output, called
the neuronal code, from the encoder. The output vectors have an
orthogonality property proven in the Appendix of James Ting-Ho Lo, A
Low-Order Model of Biological Neural Networks, Neural Computation, Vol.
23, No. 10, pp. 2626-2682, 2011.

5.1.1 Binary Operation of a Model Neuronal Node

[0131] A model (or artificial) neuronal node is shown in FIG. 1. The input
variables, v and u, to a neuronal node are usually spikes in spike trains
modeled as Bernoulli processes. The output of the model neuronal node is

φ(v,u)=-2vu+v+u (1)

which is a hyperbolic polynomial depicted over the unit square in FIG. 2.
The unit square is only a part of the domain of the function φ. If v
and u are binary digits, φ (v, u) is the XOR function. If not, φ
(v, u) approximates the XOR function nicely. As shown in FIG. 2, the
closer v and u are to binary values, the more φ acts like XOR. The
farther they are from binary values, the less φ acts like XOR. For
example, φ (0.9, 0.9)=0.18, φ (0.9, 0.1)=0.82, φ (0.9,
0.75)=0.3, and φ (0.75, 0.1)=0.7. φ acts more like XOR at (0.9,
0.9) and (0.9, 0.1) than at (0.9, 0.75) and (0.75, 0.1). Note that there
are other polynomials (e.g., the elliptic polynomial
2v2+2u2-2vu-v-u) that act like the XOR function at binary
inputs, but φ involves the least number of arithmetic operations and
approximates XOR in the most reasonable manner as shown in FIG. 2. φ
(v, u) is henceforth called the XOR polynomial.
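For illustration, the XOR polynomial (1) can be exercised directly. The following Python sketch uses our own function name phi and repeats the numerical examples given above.

```python
# A minimal sketch of the XOR polynomial (1); the name "phi" is ours.
def phi(v, u):
    """phi(v, u) = -2vu + v + u: exact XOR on binary inputs,
    an approximation of XOR elsewhere on the unit square."""
    return -2.0 * v * u + v + u

# Exact XOR on binary inputs:
for v in (0, 1):
    for u in (0, 1):
        assert phi(v, u) == (v != u)

# Near-binary inputs, as in the examples above:
print(phi(0.9, 0.9))   # 0.18, close to XOR(1, 1) = 0
print(phi(0.75, 0.1))  # 0.7, close to XOR(1, 0) = 1
```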

5.1.2 Composition of Operations of Model Neuronal Nodes

[0132] The algebraic binary operation φ (v, u)=-2vu+v+u is commutative
and also associative:

[0134] If the model neuronal encoder has m inputs forming an input set
S={v1, v2, . . . , vm}, then the input set has 2m
subsets. On each of these subsets, say {vk1, vk2, . . . ,
vki}, an output of the neuronal encoder is defined to be
φi (vk1, vk2, . . . , vki). For
example, if the input set is {v1, v2, v3}, then the
subsets are Φ, {v1}, {v2}, {v2, v1}, {v3},
{v3, v1}, {v3, v2}, {v3, v2, v1},
where Φ is the empty set. The outputs of the neuronal encoder are
φ0 (Φ), φ1 (v1), φ1 (v2),
φ2 (v2, v1), φ1 (v3), φ2
(v3, v1), φ2 (v3, v2), φ3 (v3,
v2, v1), where φ0 (Φ) is defined to be 0.

[0135] Similarly, if the input set is {v1, v2, v3,
v4}, the model neuronal encoder has 16 outputs φi
(vk1, . . . , vki), where {vk1, . . . ,
vki} are subsets of {v1, v2, v3, v4}. FIG.
3 shows this model neuronal encoder with four inputs and 16 outputs,
where the four inputs vti, i=1, 2, 3, 4, at time t are each close to
a binary digit. φi (vk1, . . . , vki) can be
evaluated by binary φ2 or other operations φk in more
than one way if i>2. Therefore, the structure of a model neuronal
encoder for more than 2 inputs is not unique.

Hence, a function φj with repeated variables can be identified
with a function φj-2i with different variables for some i>0.
Using {v1, v2, . . . , vm} as the input set and φ (v,
u) to compose functions, we can obtain only 2m different functions
for input variables with binary values.

[0137] Note that model axonal encoders, model dendritic encoders, and
model dendritic encoders with a dendrite or dendritic node replaced with
an axon or axonal node respectively are collectively called neuronal
encoders. An orthogonality property of a model neuronal encoder's output
vector is discussed in the next Subsubsection and proven in the Appendix.

5.1.3 An Orthogonality Property of Neuronal Codes

[0138] To describe an orthogonality property of the outputs of a model
neuronal encoder with input variables {v1, v2, . . . ,
vm}, we organize its 2m outputs into a vector, called the
neuronal code, as follows: Let u denote a scalar and v=[v1 v2 .
. . vk]' a k-dimensional vector. Define a k-dimensional vector φ
(u, v) of polynomials by

φ(u,v)=[φ(u,v1)φ(u,v2) . . . φ(u,vk)]

The 2m different functions that can be defined by compositions of
the binary operation φ on the input set {v1, v2, . . . ,
vm} are generated and organized into a 2m-dimensional column
vector {hacek over (v)} by recursively generating row vectors {hacek over
(v)} (1, . . . , k), for k=1, 2, . . . , m, as follows:

Denoting the k-th component of {hacek over (v)} by {hacek over
(v)}k, the vector {hacek over (v)}=[{hacek over (v)}1 {hacek
over (v)}2 . . . {hacek over (v)}2m]' is called the
neuronal code of v. Setting the first component of {hacek over (v)} (1)
equal to zero above (instead of one) yields two properties: First,
because φ (v, v)=0 and φ (v, 0)=v, two equal binary signals
meeting at a neuronal node produce the first component of {hacek over
(v)}, and this first component will not change other components through
neuronal nodes down the stream. Second, this makes
{hacek over (0)}=0. Here, 0 denotes a zero vector. The above recursive way
to generate the neuronal codes is also shown in FIG. 4.
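The recursive generation above can be sketched in Python as follows. This is our reading of the recursion (the names are ours): start from the code [0 v1]' and, for each additional input vk, append φ(vk, ·) applied componentwise to the code built so far; since φ(vk, 0)=vk, the singleton outputs appear automatically. The check at the end numerically verifies, for binary inputs, the orthogonality property proven in Lo (2011).

```python
# A sketch (our own naming) of the recursive neuronal-code generation
# described above: start from [0, v1]' and, for each further input vk,
# append phi(vk, .) applied componentwise to the code built so far.
from itertools import product

def phi(v, u):
    return -2.0 * v * u + v + u

def neuronal_code(v):
    """Encode an m-vector v into its 2^m-dimensional neuronal code."""
    code = [0.0]                      # first component set to zero
    for vk in v:
        # phi(vk, 0) = vk, so the singleton output appears automatically
        code += [phi(vk, c) for c in code]
    return code

# Numerical check of the orthogonality property: after subtracting 1/2
# componentwise, the codes of two distinct binary m-vectors are
# orthogonal, and each code has squared norm 2^(m-2).
m = 3
codes = {v: neuronal_code(v) for v in product((0, 1), repeat=m)}
for v, u in product(codes, repeat=2):
    ip = sum((a - 0.5) * (b - 0.5) for a, b in zip(codes[v], codes[u]))
    assert abs(ip - (2 ** (m - 2) if v == u else 0.0)) < 1e-9
```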

[0139] It is proven in the APPENDIX of James Ting-Ho Lo, "A Low-Order
Model of Biological Neural Networks," Neural Computation, vol. 23, no.
10, pp. 2626-2682, 2011, that given two m-dimensional binary vectors, v
and u, their neuronal codes, {hacek over (v)} and {hacek over (u)},
satisfy

Hence, ({hacek over (v)}-1/2I)' ({hacek over (v)}-1/2I)=2.sup.4-2 and
({hacek over (v)}-1/2I)' ({hacek over (u)}-1/2I)=0, as predicted by (8)
and (7).

5.2 Hebbian-Type Learning Rules

[0142] Three learning rules, called unsupervised covariance rule,
supervised covariance rule and unsupervised accumulation rule, are
described in this Subsection. The first two are essentially Terrence J.
Sejnowski's covariance rule. However, the unsupervised covariance rule
and supervised covariance rule herein proposed do not build up the
covariance between the outputs of the presynaptic and postsynaptic
neurons as Sejnowski's covariance rule does. The unsupervised covariance
rule builds up, in synapses, the covariance between the outputs of the
presynaptic neuronal encoder and the postsynaptic neurons. The supervised
covariance rule builds up the covariance between the outputs of the
presynaptic neuronal encoder and the labels provided from outside the LOM
that act as teaching signals for the supervised covariance learning. Like
Sejnowski's covariance rule, the unsupervised and supervised covariance
rules here, especially the former, can be looked upon as variants of what
is commonly known as the Hebb rule. The unsupervised accumulation rule
simply accumulates code deviation vectors, which are the deviations of
the neuronal codes from their averages over a time window with a certain
width.

5.2.1 Unsupervised Covariance Rule

[0143] There are two types of model neuron, namely model spiking neuron
(also called D-neuron) and model nonspiking neuron (also called
C-neuron), in the LOM. A model spiking neuron generates binary digits,
and a model nonspiking neuron outputs graded signals that are transmitted
to its neighboring model spiking neurons. Computations performed in these
two types of neuron are described in Subsubsections 5.5.4 and 5.5.5.

[0144] Each of the 2m outputs, {hacek over (v)}t1, {hacek over
(v)}t2, . . . , {hacek over (v)}t2m, from a neuronal
encoder at time (or numbering) t passes through a synapse to reach each
of a number of, say R, postsynaptic model spiking neurons and a
postsynaptic model nonspiking neuron. {hacek over (v)}t=[{hacek over
(v)}t1 . . . {hacek over (v)}t2m]' is called a neuronal
code, and {hacek over (v)}t=[{hacek over (v)}t1 . . . {hacek
over (v)}t2m]' denotes an average of {hacek over (v)}t
over a time window [t-qv+1, t] with width qv, which is preset
in consideration of the maximum size of the clusters formed using the
unsupervised covariance rule (to be described) for the application.
{hacek over (v)}t-{hacek over (v)}t is called a code deviation
vector. It is the deviation of a neuronal code {hacek over (v)}t
from an average neuronal code {hacek over (v)}t. FIG. 5a and FIG. 5b
show an output of the neuronal encoder going through a model synapse
represented by {circle around (×)} to reach a model spiking neuron, model spiking neuron i,
whose output at time t is uti. The model spiking neuron usually
receives signals from the synapses connected to other model neuronal
encoders.

[0145]FIG. 5a and FIG. 5b also show how the unsupervised covariance rule
updates the weight (or strength) Dij of the model synapse:

DijλDij+Λ(uti-uti)({hacek over
(v)}tj-{hacek over (v)}tj) (11)

where Λ is a proportionality constant, λ is a forgetting factor
that is a positive number less than one, and {hacek over (v)}tj and
uti denote, respectively, the averages of {hacek over (v)}tj
and uti over time windows with preset widths. These widths may be
different.

[0146] The outputs uti, i=1, . . . , R, of the R model spiking
neurons can be assembled into a vector, ut=[ut1 ut2 . . .
utR]', and the strengths Dij into an R×2m matrix D
whose (i, j)th entry is Dij. This matrix D is called a code
covariance matrix. Using these notations, the unsupervised covariance
rule can be expressed as follows:

DλD+Λ(ut-ut)({hacek over (v)}t-{hacek
over (v)}t)' (12)

If the vector pairs, (vs, us), s=1, . . . , t, have been
learned by the 2m R synapses, their code covariance matrix D is

[0147] This learning rule (11) makes the LOM more fault-tolerant and
efficient than THPAM:

[0148] 1. If the model spiking neuron outputting
uti or the neuronal node outputting {hacek over (v)}tj is out
of order, causing uti or {hacek over (v)}tj to equal 0 or 1 or any
constant for too long, then uti-uti=0 or {hacek over
(v)}tj-{hacek over (v)}tj=0, whence Dij←λDij and Dij shrinks to zero,
eliminating the effect of the faulty model spiking neuron or neuronal
node.

[0149] 2. If {hacek over (v)}tj assumes the value, 1 (or 0),
significantly more often than 0 (or 1), then {hacek over (v)}tj is
closer to 1 (or 0), {hacek over (v)}tj-{hacek over (v)}tj is
smaller for {hacek over (v)}tj=1 (or 0) than for {hacek over
(v)}tj=0 (or 1), and D learns {hacek over (v)}tj with less
intensity. The same happens if uti assumes 1 (or 0) significantly
more often than 0 (or 1). This automatically balances out the number of
additions (to store 1's) to and subtractions (to store 0's) from D to
avoid memory saturation at a synapse.

[0150] 3. If {hacek over
(v)}tj is replaced with an inhibitory {hacek over (v)}th taking
on 1 or 0, then {hacek over (v)}tj-{hacek over (v)}tj=-(-{hacek
over (v)}tj--{hacek over (v)}tj), where -{hacek over
(v)}tj is excitatory, and the orthogonality property among {hacek
over (v)}t remains valid. This will be discussed more in Subsection
5.3.

[0151] These advantages of the unsupervised covariance rule are valid also
for the supervised covariance rule and the unsupervised accumulation rule
to be described below. In the LOM used as an artificial neural network,
the formulas for the unsupervised covariance rule, (11), (12) and (13)
can be replaced with

for less computation and more accurate estimation of the label ut
once it is generated and learned by the PU, where I=[1 1 . . . 1]' with R
components.

[0152] How a vocabulary is built by the unsupervised covariance rule is
discussed at the end of Subsubsection 5.5.7.
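One step of the unsupervised covariance rule (12) can be sketched as follows. The function and variable names are ours; u_bar and vcode_bar stand for the time-window averages described above, and the second call illustrates the fault-tolerance behavior noted in item 1: with u held at its average, the deviation term vanishes and D decays under λ<1.

```python
# A sketch (our names) of one step of the unsupervised covariance rule
# (12): D <- lam*D + Lam*(u - u_bar)(vcode - vcode_bar)', where u is the
# R-vector of spiking-neuron outputs and vcode the 2^m-dim neuronal code.
def covariance_step(D, u, u_bar, vcode, vcode_bar, lam=0.9, Lam=1.0):
    return [
        [lam * Dij + Lam * (u[i] - u_bar[i]) * (vcode[j] - vcode_bar[j])
         for j, Dij in enumerate(row)]
        for i, row in enumerate(D)
    ]

# One step with R = 1, 2^m = 4, D initially zero, lam = Lam = 1:
D = [[0.0] * 4]
D = covariance_step(D, u=[1.0], u_bar=[0.5],
                    vcode=[0.0, 1.0, 1.0, 0.0],
                    vcode_bar=[0.5] * 4, lam=1.0, Lam=1.0)
print(D)   # [[-0.25, 0.25, 0.25, -0.25]]

# Fault tolerance: a constant output gives u - u_bar = 0, so D only
# decays geometrically under lam < 1.
D = covariance_step(D, u=[0.5], u_bar=[0.5],
                    vcode=[0.0, 1.0, 1.0, 0.0],
                    vcode_bar=[0.5] * 4, lam=0.5, Lam=1.0)
print(D)   # [[-0.125, 0.125, 0.125, -0.125]]
```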

5.2.2 Supervised Covariance Rule

[0153] The set of sensor elements (e.g., the pixels in the CCD of a
camera) whose measurements affect the inputs to a model spiking neuron is
called the receptive field of the model spiking neuron. The vectors
vt that are input to all the neuronal encoders that affect such
inputs to a model spiking neuron have the same receptive field.

[0154] If a label wt of the cause (e.g., object or pattern) that
appears in the receptive field of a spiking or nonspiking neuron is
provided from outside the LOM, a supervised learning can be performed.
For example, if the image or sub-image appearing in the receptive field
of the neuron can be identified by a human, s/he can handcraft a label
wt of the image or subimage.

[0155]FIG. 6a and FIG. 6b show how the component {hacek over (v)}tj
and the component wti of the label wt of model spiking neuron i
are used to update the weight Dij in a model synapse {circle around
(×)} of postsynaptic model spiking neuron i for j=1, . . . ,
2m.

[0156] The supervised covariance rule that updates the strength Dij
using {hacek over (v)}tj and wti is the following:

DijλDij+Λ(wti-1/2)({hacek over
(v)}tj-{hacek over (v)}tj) (17)

for j=1, . . . , 2m and i=1, . . . , R, where Λ and λ
are a proportionality constant and a forgetting factor, and {hacek over
(v)}tj denotes the average of {hacek over (v)}tj over a time
window.

[0157] The synaptic weights (or strengths) Dij form an
R×2m matrix D whose (i, j)th entry is Dij. This matrix D
is again called a code covariance matrix. Using these notations, the
covariance learning rule can be expressed as follows:

DλD+Λ(wt-1/2I)({hacek over (v)}t-{hacek
over (v)}t)' (18)

If the pairs, (vs, ws), s=1, . . . , t, have been learned by
the R (2m) synapses, their code covariance matrix D is

[0158] Note that the update formulas, (17) and (18), and the equation (19)
are the same as the update formulas, (14) and (15), and the equation
(16), respectively; except that ut is respectively replaced with the
label wt of vt provided from outside the LOM. In some
applications, the term 1/2I in (17) and (18), and (19) can be replaced
with wti, wt and ws, respectively. In this case, the
length of time interval on which the averages, wti, wt and
ws, are taken may be different from that for {hacek over
(v)}tj, {hacek over (v)}t and {hacek over (v)}s.
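The update (17)/(18) can be sketched in a few lines of NumPy. This is an illustrative reading of the rule, not code from the disclosure; the function and variable names (supervised_covariance_update, v_code, v_code_avg, lam, Lam) are assumptions.

```python
import numpy as np

def supervised_covariance_update(D, w, v_code, v_code_avg, lam=1.0, Lam=1.0):
    """Sketch of the supervised covariance rule (17)/(18):
    D <- lam*D + Lam*(w - 1/2)(v_code - v_code_avg)'.

    D          : (R, K) code covariance matrix
    w          : (R,)  label vector provided from outside the LOM
    v_code     : (K,)  neuronal code of the current input
    v_code_avg : (K,)  average neuronal code over a time window
    """
    return lam * D + Lam * np.outer(w - 0.5, v_code - v_code_avg)
```

Starting from D = 0 and iterating this update with lam = Lam = 1 accumulates exactly the sum of outer products described by equation (19).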

EXAMPLE 2a

[0159] Consider a unit cube shown in FIG. 9. The vectors, vt, t=1, 2,
. . . , 8, to be input to neuronal encoders in EXAMPLES 2a, 2b and 2c,
are shown at the vertices. The signals from a teaching model spiking
neuron corresponding to vt, t=1, 2, 3, 7, 8, are available for
supervised learning. They are binary digits wt, t=1, 2, 3, 7, 8,
respectively, enclosed in the square boxes. The supervised training data
consists of the pairs, (vt, wt), t=1, 2, 3, 7, 8. The question
marks in the square boxes indicate no teaching signal is available for
supervised learning.

[0160] The pairs, ({hacek over (v)}t', wt), t=1, 2, 3, 7, 8, are
listed as rows in the following table:

[0161] Assume λ=Λ=1, {hacek over (v)}t=I/2, and
wt=I/2 in (18). The code covariance matrix D is the following:

D=[-3/4 1/4 3/4 3/4 1/4 1/4 -1/4 3/4] (20)

5.2.3 Unsupervised Accumulation Rule

[0162] The 2m synapses on the connections from the output terminals
of a neuronal encoder to a model nonspiking neuron are updated by the
unsupervised accumulation rule:

C←λC+(Λ/2)({hacek over (v)}t-{hacek over (v)}t)' (21)

FIGS. 7a and 7b show a model nonspiking neuron and one of its synapses
and how the component {hacek over (v)}tj-{hacek over (v)}tj of
the code deviation {hacek over (v)}t-{hacek over (v)}t is used
to adjust Cj. Note that the graded output ct from the model
nonspiking neuron is not feedbacked for updating the synaptic strength
Cj. If the deviations {hacek over (v)}s-{hacek over (v)}s,
s=1, . . . , t, have been accumulated in the 2m synapses, the
strengths (or weights) in them are the row vector,

C=(Λ/2)Σ.sub.s=1.sup.t λ.sup.t-s({hacek over (v)}s-{hacek
over (v)}s)' (22)

This row vector C is called a code deviation accumulation vector.
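The accumulation rule (21) and its closed form (22) can be sketched as follows; this is a minimal NumPy reading, with illustrative names (accumulation_update, accumulate_sequence) not taken from the disclosure.

```python
import numpy as np

def accumulation_update(C, v_code, v_code_avg, lam=1.0, Lam=1.0):
    """Sketch of the unsupervised accumulation rule (21):
    C <- lam*C + (Lam/2)*(v_code - v_code_avg)'."""
    return lam * C + (Lam / 2.0) * (v_code - v_code_avg)

def accumulate_sequence(codes, avgs, lam=1.0, Lam=1.0):
    """Iterating (21) for s = 1..t reproduces (22):
    C = (Lam/2) * sum_s lam**(t-s) * (codes[s] - avgs[s])."""
    C = np.zeros_like(codes[0])
    for code, avg in zip(codes, avgs):
        C = accumulation_update(C, code, avg, lam, Lam)
    return C
```

Note that, as the text stresses, no label or teaching signal enters this update; only the code deviations are accumulated.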

EXAMPLE 2b

[0163] This is a continuation of EXAMPLE 2a. For the training data,
xt, t=1, 2, 3, 7, 8, from EXAMPLE 2a, which is shown in FIG. 9, the
code deviation accumulation vector C is the following:

C=[-5/4 -1/4 1/4 1/4 -1/4 -1/4 -3/4 1/4] (23)

Note that the teaching signals, wt, t=1, 2, 3, 7, 8, are not needed
in obtaining C by the unsupervised accumulation rule (21).

[0164] The inputs to the neuronal encoder are v.sub.τ1, v.sub.τ2,
v.sub.τ3. The masking matrices M in the figure will be used in
EXAMPLE 2c in Section 5.4. The outputs of the neuronal encoder are 0,
v.sub.τ1, v.sub.τ2, φ (v.sub.τ2, v.sub.τ1), v.sub.τ3,
φ (v.sub.τ3, v.sub.τ1), φ (v.sub.τ3, v.sub.τ2),
φ (v.sub.τ3, v.sub.τ2, v.sub.τ1). For the synapses
preceding the model spiking neuron to perform supervised covariance
learning, the selection lever represented by the thick line segment with
a circle at its end is placed in the top position to receive a teaching
signal r.sub.τ provided from outside. For these synapses to perform
unsupervised covariance learning, the lever is placed in the bottom
position to receive a spike or nonspike v {p.sub.τ} output from the
model spiking neuron. For these synapses to perform no learning, the
lever is placed in the middle position to receive the signal 1/2. The
selection lever of a biological synapse is usually permanently set at one
position for one type of learning. It is not clear if a biological
synapse can perform supervised, unsupervised, and no learning
alternately.

[0165] In FIG. 11, the model nonspiking neuron and model spiking
neuron share the same neuronal encoder and its outputs. This is not
necessary in modeling biological neural networks. As long as the model
nonspiking neuron and model spiking neuron jointly provide enough
information to produce a good estimate of the subjective probability
distribution, they may have different neuronal encoders.

5.3 Subjective Probabilities Learned

[0166] The purpose of this Subsection is to show that an artificial
neuronal encoder and the Hebbian-type learning rules disclosed in the
last two Subsections working together are able to learn subjective
probability distributions of the labels of the vectors input to the
artificial neuronal encoder.

[0167] Once a vector v.sub.τ is received by a neuronal encoder,
v.sub.τ is encoded by the neuronal encoder into {hacek over
(v)}.sub.τ, which is made available to synapses for learning as well
as retrieving of a point estimate of the label (ut or wt) of
the input v.sub.τ, whether the label has been learned by the
unsupervised covariance rule or the supervised covariance rule. This
label of v.sub.τ, which may be ut or wt as defined in the
preceding Subsection, is denoted by r.sub.τ. Recall that learned
information is stored in the code covariance matrices, D and C. Upon the
arrival of {hacek over (v)}.sub.τ, the following products,
d.sub.τ° and c.sub.τ°, are computed by the synapses
preceding the R model spiking neurons and 1 model nonspiking neuron as
shown in FIG. 8:

d.sub.τ°=D({hacek over (v)}.sub.τ-{hacek over
(v)}.sub.τ) (24)

c.sub.τ°=C({hacek over (v)}.sub.τ-{hacek over
(v)}.sub.τ) (25)

where d.sub.τ° is an R-dimensional vector and
c.sub.τ° is a scalar. Recall that {hacek over
(v)}t=[{hacek over (v)}t1 . . . {hacek over (v)}t2m]'
is called a neuronal code, and {hacek over (v)}t=[{hacek over
(v)}t1 . . . {hacek over (v)}t2m]' denotes an average of
{hacek over (v)}t over a time window [t-qv+1, t] with width
qv, which is preset in consideration of the maximum size of
the clusters formed using the unsupervised covariance rule to be
described for the application. {hacek over (v)}t-{hacek over
(v)}t is called a code deviation vector. It is the deviation of a
neuronal code {hacek over (v)}t from an average neuronal code {hacek
over (v)}t.
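The retrieval products (24) and (25) can be sketched directly; this is an illustrative NumPy reading (the name retrieve and the array layout are assumptions, not from the disclosure).

```python
import numpy as np

def retrieve(D, C, v_code, v_code_avg):
    """Sketch of the retrieval formulas (24)-(25).

    D : (R, K) code covariance matrix
    C : (K,)   code deviation accumulation vector
    Returns the R-dimensional vector d (24) and the scalar c (25).
    """
    dev = v_code - v_code_avg   # the code deviation vector
    d = D @ dev                 # (24): computed by synapses before the R spiking neurons
    c = C @ dev                 # (25): computed by synapses before the nonspiking neuron
    return d, c
```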

[0168] To gain some intuitive understanding of the meanings of
d.sub.τ° and c.sub.τ°, let us assume that
λ=Λ=1, and that the averages, wt and {hacek over
(v)}.sub.τ, are all equal to I/2. Here, I=[1 1 . . . 1]', which we
note is not the identity matrix I. Let rs be an R-dimensional binary
vector. Noting that the time average of a biological spike train is
usually close to 1/2 and that the forgetting factor λ is believed
to be very close to 1. The only problem with the assumption is that
{hacek over (v)}s1=0 and hence {hacek over (v)}s1=0≠1/2.
If the dimensionality of the vector {hacek over (v)}s is large, the
effect of missing this one term of 1/2 is negligible. Nevertheless, it
would be interesting to see whether and how this one term exists in
biological neural networks.

This is the relative frequency that rsj=1 has been learned for
vs=v.sub.τ for s=1, . . . , t. Let
a.sub.τ°=[a.sub.τ1° a.sub.τ2° . . .
a.sub.τR°]'. Then a.sub.τ°/c.sub.τ° is a
relative frequency distribution of r.sub.τ given v.sub.τ.

[0172] v.sub.τ may be shared by more than one cause (or pattern) and
may contain corruption, distortion, occlusion or noise caused directly or
indirectly by the sensor measurements such as image pixels and sound
recordings. In this case, the label r.sub.τ of v.sub.τ should be
a random variable, which can be described or represented only by a
subjective probability distribution (or a relative frequency
distribution). On the other hand, v.sub.τ may contain parts from more
than one cause. In this case, the label r.sub.τ of v.sub.τ should
be a fuzzy variable, which can be described or represented only by its
membership function as defined in the fuzzy set theory. Fortunately, both
the probabilities and membership degrees range between zero and one. The
former can be learned mainly as relative frequencies over time and the
latter mainly as relative proportions in each v.sub.τ as represented
by relative frequencies. a.sub.τj°/c.sub.τ°
evaluated above is a relative frequency representing a probability or a
membership degree. The three learning rules in the preceding Subsection
facilitate learning both the subjective probabilities and membership
degrees and sometimes a combination thereof. For simplicity, the
membership degree and membership function will also be referred to as
subjective probability and subjective probability distribution in the
present invention disclosure. The fact that the subjective probability
distribution (or membership function) of the label r.sub.τ can be
retrieved from the synapses is striking, but mathematically and naturally
necessary.

[0173] If any number of components in {hacek over (v)}.sub.τ change
their signs and the corresponding components in {hacek over (v)}s
and I change their signs, then the orthogonality property, (7) and (8),
still holds. If {hacek over (v)}tq is inhibitory, the qth component
of {hacek over (v)}s-{hacek over (v)}s and {hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ change their signs. If
additionally {hacek over (v)}s={hacek over (v)}.sub.τ=-I/2, then
a.sub.τj°/c.sub.τ° above is still the relative
frequency that rsj=1 has been learned for vs=v.sub.τ. In
general, a.sub.τj°/c.sub.τ° above is this relative
frequency regardless of how many of the neuronal encoder's outputs {hacek
over (v)}t are inhibitory.

[0174] The above subjective-probability and membership-degree
interpretations of a.sub.τ°/c.sub.τ° are valid only
if all the vectors vt, t=1, 2, . . . , input to the neuronal encoder
are binary vectors or nearly binary vectors. Biological spiking neurons
emit spike trains, where a spike and a nonspike can be looked upon as
nearly binary digits. Therefore, the output vectors of a biological
neuronal encoder, whose components are spikes and nonspikes, are nearly
orthogonal.
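The excerpt does not reproduce the encoder construction itself, so the following is only a hypothetical sketch of one encoding with the kind of orthogonality property the text relies on: encode an m-bit input into the parities (XORs) of all 2^m subsets of its bits. Centered codes (code − I/2) of distinct binary inputs are then exactly orthogonal, while the inner product of a centered code with itself is 2^(m-2); the empty-subset component is the constant 0, as in the first encoder output listed in paragraph [0164]. All names here are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def parity_code(v):
    """Hypothetical neuronal encoder: one parity per subset of the m bits."""
    m = len(v)
    code = []
    for k in range(m + 1):
        for S in combinations(range(m), k):
            code.append(sum(v[i] for i in S) % 2)  # XOR over subset S
    return np.array(code, dtype=float)             # 2**m components

def centered_dot(u, v):
    """Inner product of centered codes, illustrating the orthogonality property."""
    cu, cv = parity_code(u) - 0.5, parity_code(v) - 0.5
    return float(cu @ cv)
```

Under this sketch, for m = 3 the centered self-inner-product is 2^(3-2) = 2 and distinct inputs give 0, which is the behavior the retrieval formulas exploit.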

EXAMPLE 1b

[0175] This example is a continuation of EXAMPLE 1a. With the v and u from
EXAMPLE 1a, let a supervised training data set consist of 8 copies of u
with label 1 and 2 copies of u with label 0; and 3 copies of v with label
1 and 27 copies of v with label 0. By (19) and (22), this supervised
training data set is learned with λ=Λ=1 (in (19) and (22))
to form the code covariance matrix D and the code deviation accumulation
vector C:

D=1/2(8-2)({hacek over (u)}-1/2I)'+1/2(3-27)({hacek over (v)}-1/2I)'

C=1/2(8+2)({hacek over (u)}-1/2I)'+1/2(3+27)({hacek over (v)}-1/2I)'

By (7) and (8), D ({hacek over (u)}-1/2I)=3(4), C ({hacek over
(u)}-1/2I)=5(4), D ({hacek over (v)}-1/2I)=-12(4), C ({hacek over
(v)}-1/2I)=15(4). It follows that (D ({hacek over (u)}-1/2I)+C ({hacek
over (u)}-1/2I))/(2C ({hacek over (u)}-1/2I))= 8/10 is the relative
frequency that u has been learned with label 1; and 1- 8/10= 2/10 is the
relative frequency that u has been learned with label 0. Similarly, (D
({hacek over (v)}-1/2I)+C ({hacek over (v)}-1/2I))/(2C ({hacek over
(v)}-1/2I))= 3/30 is the relative frequency that v has been learned with
label 1; and 1- 3/30= 27/30 is the relative frequency that v has been
learned with label 0.
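The arithmetic of EXAMPLE 1b reduces to plain label counts: with n1 copies learned with label 1 and n0 with label 0, the D and C coefficients are (n1-n0)/2 and (n1+n0)/2, and (D+C)/(2C) recovers the relative frequency of label 1. A minimal sketch (the function name is illustrative):

```python
def label_one_frequency(n_label1, n_label0):
    """Relative frequency of label 1 recovered from D- and C-style counts."""
    d = (n_label1 - n_label0) / 2.0   # coefficient entering D
    c = (n_label1 + n_label0) / 2.0   # coefficient entering C
    return (d + c) / (2.0 * c)

# For u: 8 copies with label 1, 2 with label 0 -> 8/10.
# For v: 3 copies with label 1, 27 with label 0 -> 3/30.
```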

5.4 Masking Factors for Maximal Generalization

[0176] Let a vector v.sub.τ that deviates, due to corruption,
distortion or occlusion, from each of the vectors vs that have been
learned by the synapses on an artificial neuronal encoder be presented to
the neuronal encoder. The neuronal encoder and its synapses are said to
have a maximal generalization capability in their retrieval of
information, if they are able to automatically find the largest subvector
of v.sub.τ that matches at least one subvector among the vectors vs
stored in the synapses and enable post-synaptic neurons to generate a
subjective probability distribution of the label of the largest
subvector. This maximal capability is achieved by the use of a masking
matrix described in this Subsection. A biological interpretation of such
a matrix is given at the end of this Subsection.

where 2j is used to compensate for the factor 2-j in
2m-2-j in the important property (30), and 2-ηj is a weight
selected to differentiate between different levels j of maskings. Note
that the masking matrix M is a diagonal matrix, whose diagonal entries
Mjj are called masking factors.

[0181] The formulas (24) and (25) together with M constitute the following
decovariance rules:

d.sub.τ=DM({hacek over (v)}.sub.τ-{hacek over (v)}.sub.τ)
(33)

c.sub.τ=CM({hacek over (v)}.sub.τ-{hacek over (v)}.sub.τ)
(34)

which are equivalent to

d.sub.τkj=DkjMjj({hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj)=MjjDkj({hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj)

c.sub.τj=CjMjj({hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj)=MjjCj({hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj)

[0183] Continuing in this manner, it is seen that DkM ({hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ) and CM ({hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ) always use the greatest number
of uncorrupted, undistorted or unoccluded components of v.sub.τ in
estimating d.sub.τk, c.sub.τ, and a.sub.τk. The vector
a.sub.τ/c.sub.τ is now an estimate of the conditional
subjective probability distribution of r.sub.τ given v.sub.τ,
using the greatest number of uncorrupted, undistorted or unoccluded
components of v.sub.τ.
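The decovariance rules (33)-(34) amount to inserting a diagonal masking matrix between the code deviation vector and the stored weights. A minimal NumPy sketch, with illustrative names and example masking factors (not values from the disclosure):

```python
import numpy as np

def masked_retrieve(D, C, M_diag, v_code, v_code_avg):
    """Sketch of the decovariance rules (33)-(34).

    M_diag : (K,) diagonal entries (masking factors) of the diagonal matrix M.
    Returns d = D M (v_code - avg) and c = C M (v_code - avg).
    """
    dev = M_diag * (v_code - v_code_avg)  # M applied componentwise, M diagonal
    return D @ dev, C @ dev               # (33) and (34)
```

Because M is diagonal, each masking factor Mjj simply scales the output of synapse j, which is the componentwise form given in equations (50)-(53) below.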

[0184] If some terms in (32) are missing, the generalization effect of M
degrades only gracefully. The example weight 2-5j in (32) is used to
illustrate M generalizing maximally in EXAMPLE 2c. The weight is believed
to be 2-j in biological neural networks for a reason to be
discussed. J in (32) is believed to vary from brain region to brain
region. The range of J can be found by biological experiments.

EXAMPLE 2c

[0185] Let the code covariance matrices, D and C, be those obtained in
EXAMPLE 2a and EXAMPLE 2b. Using (29), we construct the masking matrix by
the formula (32) for J=1,

[0186] Recall that with the masking matrix M, we use
d.sub.τj=DjM ({hacek over (v)}.sub.τ-{hacek over
(v)}.sub.τ) and c.sub.τ=CM ({hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ) in general, where Dj
denotes the jth row of D. If c.sub.τ≠0, the subjective
probability p.sub.τj=(d.sub.τj/c.sub.τ+1)/2. The masking
matrix M for this example is shown in FIG. 11.

[0187] Assume that {hacek over (v)}1 is presented to the synapses
containing the code covariance matrices through M. By matrix
multiplication,

DM({hacek over (v)}1-{hacek over
(v)}1)=-1+2-4(-1/2)=-1.0312

CM({hacek over (v)}1-{hacek over (v)}1)=1+2-4(5/2)=1.1563

The subjective probability that the label of v1 is 1 is (DM ({hacek over
(v)}1-{hacek over (v)}1)/(CM ({hacek over (v)}1-{hacek
over (v)}1))+1)/2=0.054, and the subjective probability that the
label of v1 is 0 is 0.946. Note that v1 with a label of 0 has
been learned. The subjective probability that the label of v1 is 0
should be 1. The use of M causes a very small amount of error to the
subjective probability, which can be controlled by changing the weight,
2-5.

[0188] The neuronal codes of the three vertices, v4, v5 and
v6, of the cube in FIG. 9, which are not included in the supervised
learning data, are listed as follows:

[0189] Simple matrix-vector multiplication yields D ({hacek over
(v)}t-{hacek over (v)}t)=0 and C ({hacek over (v)}t-{hacek
over (v)}t)=0 for t=4, 5, 6. Hence no information is provided on
vt by D ({hacek over (v)}t-{hacek over (v)}t) and C
({hacek over (v)}t-{hacek over (v)}t) for t=4, 5, 6. This shows
that if vt has not been learned, then generalization is necessary to
get information on it from the code covariance matrices.

[0190] Assume that {hacek over (v)}4 is presented to the synapses
containing the code covariance matrices. By matrix multiplication,

DM({hacek over (v)}4-{hacek over
(v)}4)=0+2-10(9+2+6-3-2+1-1)=2-10(12)

CM({hacek over (v)}4-{hacek over
(v)}4)=0+2-10(15-2+2-1+2-1-3)=2-10(12)

The subjective probability that the label of v4 is 1 is (DM ({hacek over
(v)}4-{hacek over (v)}4)/(CM ({hacek over (v)}4-{hacek
over (v)}4))+1)/2=1. From FIG. 9, we see that all the three vertices
neighboring v4 have been learned and have a label of 1. It is a good
generalization that a label of 1 is assigned to v4.

[0191] Assume that {hacek over (v)}6 is presented to the code
covariance matrices. By matrix multiplication,

DM({hacek over (v)}6-{hacek over
(v)}6)=0+2-7(9+2-6+3+2-1-1)=2-7(8) (36)

CM({hacek over (v)}6-{hacek over
(v)}6)=0+2-7(15-2-2+1-2+1-3)=2-7(8) (37)

The subjective probability that the label of v6 is 1 is (DM ({hacek over
(v)}6-{hacek over (v)}6)/(CM ({hacek over (v)}6-{hacek
over (v)}6))+1)/2=1. From FIG. 9, we see that only two vertices
neighboring v6 have been learned, and they both have a label of 1.
It is a good generalization that a label of 1 is assigned to v6.

[0192] Assume that {hacek over (v)}5 is presented to the synapses
containing the code covariance matrices. By matrix multiplication,

DM({hacek over (v)}5-{hacek over
(v)}5)=0+2-7(9-2-6-3+2+1-1)=2-7(0)

CM({hacek over (v)}5-{hacek over
(v)}5)=0+2-7(15+2-2-1-2-1-3)=2-7(8)

The subjective probability that the label of v5 is 1 is (DM ({hacek over
(v)}5-{hacek over (v)}5)/(CM ({hacek over (v)}5-{hacek
over (v)}5))+1)/2=1/2. From FIG. 9, we see that only two vertices
neighboring v5 have been learned; one of them has a label of 1,
and the other has a label of 0. No generalization is possible. A label of
1 is assigned to v5 with a subjective probability of 1/2, and a
label of 0 is assigned to v5 with equal subjective probability.
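The three retrievals of EXAMPLE 2c can be re-checked as plain arithmetic. The bracketed sums below are the masked synapse outputs read from the example, and 2**-10 and 2**-7 are the masking-factor weights at the levels reached; subjective_prob is an illustrative name.

```python
def subjective_prob(d, c):
    """p = (d/c + 1)/2, as used throughout EXAMPLE 2c."""
    return (d / c + 1.0) / 2.0

# v4: three learned neighbors, all with label 1.
d4 = 0 + 2**-10 * (9 + 2 + 6 - 3 - 2 + 1 - 1)   # = 2**-10 * 12
c4 = 0 + 2**-10 * (15 - 2 + 2 - 1 + 2 - 1 - 3)  # = 2**-10 * 12

# v6: two learned neighbors, both with label 1.
d6 = 0 + 2**-7 * (9 + 2 - 6 + 3 + 2 - 1 - 1)    # = 2**-7 * 8
c6 = 0 + 2**-7 * (15 - 2 - 2 + 1 - 2 + 1 - 3)   # = 2**-7 * 8

# v5: two learned neighbors with opposite labels.
d5 = 0 + 2**-7 * (9 - 2 - 6 - 3 + 2 + 1 - 1)    # = 0
c5 = 0 + 2**-7 * (15 + 2 - 2 - 1 - 2 - 1 - 3)   # = 2**-7 * 8
```

The resulting probabilities are 1 for v4, 1 for v6, and 1/2 for v5, matching the text.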

[0193] The above example shows that the weight 2-5j in (32) is more
than adequate to differentiate between different levels j of maskings for
a neuronal encoder with only 3 inputs. The greater the number m of inputs
to a neuronal encoder, the less two adjacent levels, j and j+1, of
maskings need to be differentiated. For example, if the number m of
components in the input vector is 12, any 11 of the 12 components should
be almost as good as the 12 in determining the label of the input vector.
A reduction by 50% of emphasis on the subvector is usually adequate.

[0194] Therefore, if m is 12 or larger, the weight 2-ηj in (32)
can be set equal to 2-j so that the reduction of emphasis by 50%
from level j to level j+1 is adequate. In this case, the masking matrix M is a
mathematical idealization and organization of a large number of neuronal
trees with nested and overlapped inputs. The following example
illustrates this biological interpretation of the masking matrix M with
the weight being 2-j in (32).

EXAMPLE 3

[0195] Using (29), we construct the masking matrix by the formula (32) for
J=1,

M{hacek over (v)}.sub.τ={hacek over
(v)}.sub.τ+(diag I(1-)){hacek over
(v)}.sub.τ+(diag I(2-)){hacek over
(v)}.sub.τ+(diag I(3-)){hacek over (v)}.sub.τ

(diagI (k-)) {hacek over (v)}.sub.τ eliminates all terms in
{hacek over (v)}.sub.τ that contain v.sub.τk and can be viewed as
the output vector of a model neuronal encoder for k=1, 2, 3. These model
neuronal encoders are shown in FIG. 13. M{hacek over (v)}.sub.τ can
be viewed as the sum of the output vectors from four model neuronal
encoders with nested and overlapped input vectors,
v.sub.τ=[v.sub.τ1 v.sub.τ2 v.sub.τ3]', [v.sub.τ2
v.sub.τ3]', [v.sub.τ1 v.sub.τ3]', and [v.sub.τ1
v.sub.τ2]'.

[0196] The difference between FIGS. 5a-7a and FIGS. 5b-7b is that the
former do not contain the masking matrix, but the latter do. The above
biological interpretation of the masking matrix explains said difference.
In other words, while biologically there is no masking factor being
multiplied to a synapse output as shown in FIG. 5a-7a, a large number of
biological neuronal encoders with nested and overlapped inputs are
mathematically idealized and organized to have the effect of the masking
factor as shown in FIGS. 5b-7b.

5.5 Processing Units (PUs)

[0197] The LOM (low-order model) disclosed in the present invention
disclosure is a network of processing units. A processing unit comprises
model neuronal encoders, model synapses, masking matrices, a model
nonspiking neuron and a number of model spiking neurons. Let us denote
the vector input to the PU at time or numbering t, the number of model
neuronal encoders in the PU and the number of model spiking neurons in
the PU by vt, R and Ψ, respectively.

5.5.1 Multiple Model Neuronal Encoders and General Neuronal Codes

[0198] Recall that if vt=[vt1 vt2 . . . vtN], then the
neuronal code {hacek over (v)}t of vt is 2^N-dimensional.
If there were only one neuronal encoder in a PU, there would be two
difficulties. First, 2^N grows exponentially as N increases. Second,
a large number of terms in the masking matrix M in (32) are required for
masking even a small proportion J/N of the components of vt if N is
large. These difficulties motivated the use of Ψ model neuronal
encoders.

[0199] Let vt (ψ), ψ=1, . . . , Ψ, be subvectors of
vt, where the components of vt (ψ) are randomly selected
from those of vt such that the union of all the components of
vt (ψ), ψ=1, . . . , Ψ, is the set of components of
vt. The subvectors vt (ψ), ψ=1, . . . , Ψ, are
encoded by the Ψ model neuronal encoders into neuronal codes {hacek
over (v)}t (ψ), ψ=1, . . . , Ψ. Their averages over a
time window [t-qv+1, t] of width qv are denoted by {hacek over
(v)}t (ψ), ψ=1, . . . , Ψ. We assemble these vectors
into

Notice that {hacek over (v)}t here is not the neuronal code of
vt defined in (5), but consists of the neuronal codes of the Ψ
subvectors. {hacek over (v)}t and {hacek over (v)}t above are
called a general neuronal code of vt and an average general neuronal
code respectively. This dual use of the symbol {hacek over (v)}t is
not expected to cause confusion. Note that the vectors vt (ψ),
ψ=1, . . . , Ψ, may have common components and different
dimensionalities, dim vt (ψ), ψ=1, . . . , Ψ, but every
component of vt is included in at least one vt (ψ). Equally
importantly, the components of vt (ψ) are selected at random
from those of the vector vt input to the PU. The difference {hacek
over (v)}t-{hacek over (v)}t between the general neuronal code
of vt and the average general neuronal code is called the general
code deviation vector.

Here D and C are respectively called general code covariance matrix and
general code deviation accumulation vector, and M is called the general
masking matrix for the Ψ neuronal encoders. Note that D, C and M
above are not those defined for 2dim vt-dimensional neuronal
codes of the vector vt input to the PU. The dual use of the symbols
here is not expected to cause confusion either. The general masking
matrix M is a diagonal matrix, whose diagonal entries are called masking
factors.

[0201] D, C and M above are R×Σ.sub.ψ=1.sup.Ψ
2dim vt.sup.(ψ), 1×Σ.sub.ψ=1.sup.Ψ
2dim vt.sup.(ψ), and Σ.sub.ψ=1.sup.Ψ
2dim vt.sup.(ψ)×Σ.sub.ψ=1.sup.Ψ
2dim vt.sup.(ψ) matrices, respectively. Note that by
choosing dim vt (ψ) smaller than dim vt, 2dim
vt.sup.(ψ) is much smaller than 2dim vt, and the
dimensionalities of the general code covariance matrix D, general code
deviation accumulation vector C, and the general masking matrix M are
much smaller than those obtained from using 2dim vt-dimensional
neuronal codes of the vector vt input to the PU. Therefore, the two
aforementioned difficulties with a single neuronal encoder with a
2dim vt-dimensional output vector are alleviated by the use of
multiple neuronal encoders in a PU. A third advantage of multiple
neuronal encoders in a PU is the enhancement of generalization
capability: If the neuronal code from one neuronal encoder fails to
retrieve any useful information from a general code covariance matrix and
a code deviation accumulation matrix, other neuronal codes from other
neuronal encoders may still retrieve enough information for detecting and
recognizing the vector vt input to the PU.
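The subvector scheme of paragraph [0199] can be sketched as follows: components are drawn at random for each encoder, and any component of vt not covered is patched into some subvector so that the union condition holds. The selection procedure and names are illustrative assumptions, not the disclosure's method.

```python
import random

def choose_subvector_indices(n_components, n_encoders, size, seed=0):
    """Pick index sets for the Psi subvectors v_t(psi).

    Components are sampled at random per encoder; every component of v_t
    must appear in at least one subvector, so uncovered ones are patched in.
    """
    rng = random.Random(seed)
    groups = [rng.sample(range(n_components), size) for _ in range(n_encoders)]
    covered = set().union(*groups)
    for i in range(n_components):
        if i not in covered:
            groups[rng.randrange(n_encoders)].append(i)
    return groups
```

Each index group then feeds one neuronal encoder, whose 2^(dim vt(ψ))-dimensional code is far smaller than the single 2^(dim vt)-dimensional code.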

[0202] An example model spiking neuron with its masking matrix M,
synapses, and two neuronal encoders for unsupervised covariance learning
in a PU is shown in FIG. 15. The two neuronal encoders input subvectors,
v.sub.τ (1) and v.sub.τ (2), of the vector v.sub.τ input to
the PU.

[0203] We stress that in describing a PU, the symbols, vt, {hacek
over (v)}t, D, C, and M, denote the vector input to the PU, the
general neuronal code, the general code covariance matrix, the general
code deviation accumulation vector, and the general masking matrix
respectively, unless indicated otherwise.

5.5.3 Processing Neuronal Codes

[0204] By the information retrieval formulas, (33) and (34), applied to
each of the Ψ neuronal encoders, its synapses, the masking matrices M
(ψ), and the model spiking and nonspiking neurons, upon the arrival
of {hacek over (v)}.sub.τ the following products, d.sub.τ (ψ)
and c.sub.τ (ψ), are obtained for
ψ=1, . . . , Ψ:

d.sub.τ(ψ)=D(ψ)M(ψ)({hacek over
(v)}.sub.τ(ψ)-{hacek over (v)}.sub.τ(ψ)) (44)

c.sub.τ(ψ)=C(ψ)M(ψ)({hacek over
(v)}.sub.τ(ψ)-{hacek over (v)}.sub.τ(ψ)) (45)

which are a vector with R components and a scalar, respectively. Redefine
the symbols d.sub.τ, c.sub.τ and a.sub.τ by

Let us reset the subscripts of the Σ.sub.ψ=1.sup.Ψ
2dim {hacek over (v)}.sup.τ.sup.(ψ) components of {hacek over
(v)}.sub.τ in (39) from 1 to Σ.sub.ψ=1.sup.Ψ 2dim
{hacek over (v)}.sup.τ.sup.(ψ). We do the same to those of the
diagonal entries of the general masking matrix M in (43), those of the
general code deviation accumulation (row) vector C in (42), and those of
each row of the general code covariance matrix D in (41). Then the two
equations, (46) and (47), are respectively equivalent to

d.sub.τkj=DkjMjj({hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj) (50)

c.sub.τj=CjMjj({hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj) (51)

for k=1, . . . , R and j=1, . . . , Σ.sub.ψ=1.sup.Ψ
2dim {hacek over (v)}.sup.τ.sub.(ψ), where Mjj,
d.sub.τkj, Cj, and {hacek over (v)}.sub.τj are the jth
diagonal entry of the general masking matrix M, the (k, j)th entry of the
general code covariance matrix D, the jth entry of the general code
deviation accumulation vector C, and the jth entry of the general
neuronal code {hacek over (v)}.sub.τ at time or numbering τ,
respectively. Using the matrix notation [d.sub.τkj], which denotes a
matrix with the (k, j)th entry being d.sub.τkj and the vector
notation [c.sub.τj], which denotes a row vector with the jth entry
being c.sub.τj, the above equations (50) and (51) can be expressed as
follows:

[d.sub.τkj]=[MjjDkj({hacek over (v)}.sub.τj-{hacek
over (v)}.sub.τj)] (52)

[c.sub.τj]=[MjjCj({hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj)] (53)

where Dkj ({hacek over (v)}.sub.τj-{hacek over (v)}.sub.τj)
and Cj ({hacek over (v)}.sub.τj-{hacek over (v)}.sub.τj) can
be looked upon as the outputs of the synapses with weights Dkj and
Cj respectively after performing the multiplications involved in
response to the input {hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj; and Mjj are called masking factors. The masking
factor Mjj is multiplied to the output Dkj ({hacek over
(v)}.sub.τj-{hacek over (v)}.sub.τj) of synapse j preceding
spiking neuron k and to the output Cj ({hacek over
(v)}.sub.τj-{hacek over (v)}.sub.τj) of synapse j preceding
nonspiking neuron.

[0205] Denote (c.sub.τ (ψ)+d.sub.τj (ψ))/2 by
a.sub.τj (ψ). For each ψ=1, . . . , Ψ, the ratio
a.sub.τj (ψ)/c.sub.τ (ψ)=(d.sub.τj
(ψ)/c.sub.τ (ψ)+1)/2 is an estimate of the subjective
conditional probability that the jth component r.sub.τj (ψ) of
the label r.sub.τ (ψ) of vt (ψ) is equal to 1 given
vt (ψ). If vt (ψ), ψ=1, . . . , Ψ, share the
same label r.sub.τ, that is the label of vt (i.e., r.sub.τ
(ψ)=r.sub.τ for ψ=1, . . . , Ψ); then the best estimate
of the subjective conditional probability that the jth component
r.sub.τj of label r.sub.τ of vt is equal to 1 given vt
(ψ), ψ=1, . . . , Ψ, is

which is a representation of a subjective probability distribution
denoted by p.sub.τ. How pseudo-random binary numbers are generated in
accordance with p.sub.τ is described in Subsubsection 5.5.5.

[0206] Note that {hacek over (v)}.sub.τ, {hacek over (v)}.sub.τ,
D, C and M, in describing a PU (and hence in this subsubsection) are the
general neuronal code, the average general neuronal code, the general
code covariance matrix, the general code deviation accumulation vector,
and the general masking matrix defined in (39), (40), (41), (42), and
(43), respectively. Note also that
a.sub.τ, c.sub.τ and d.sub.τ in describing a PU (and hence in
this subsubsection) are computed using the general versions of {hacek
over (v)}.sub.τ, {hacek over (v)}.sub.τ, D, C and M, unless
indicated otherwise. In describing a PU, the symbol vt remains to
denote the vector input to the PU.

which is an estimate of the total number of times v.sub.τ or its
variants have been encoded and stored in C with effects of the forgetting
factor, normalizing constant, and input corruption, distortion and
occlusion included. c is a graded signal transmitted to the neighboring R
model spiking neurons in the same PU (processing unit).

which is an estimate of the total number of times v.sub.τ and its
variants have been encoded and stored in D with the kth component
r.sub.τk of r.sub.τ being +1 minus the total number of times
v.sub.τ and its variants have been encoded and stored in D with the
kth component r.sub.τk being -1. Included in this estimate are the
effects of the forgetting factor, normalizing constant, and input
corruption, distortion and occlusion.

[0210] Therefore, (c.sub.τ+d.sub.τk)/2 is an estimate of the total
number of times v.sub.τ and its variants have been encoded and stored
in C with the kth component r.sub.τk of r.sub.τ being 1.
Consequently, (d.sub.τk/c.sub.τ+1)/2 is an estimate p.sub.τk
of the subjective probability that r.sub.τk is equal to 1 given
vt. Model spiking neuron k then uses a pseudo-random number
generator to generate a spike (i.e., 1) with probability p.sub.τk or
no spike (i.e., 0) with probability 1-p.sub.τk. This 1 or 0 is the
output u.sub.τk=v {p.sub.τk} of model spiking neuron k at time or
numbering τ. u.sub.τk is thus a point estimate of the k-th
component r.sub.τk of the label r.sub.τ of v.sub.τ.

[0211] Note that the vector

p.sub.τ=[p.sub.τ1 p.sub.τ2 . . . p.sub.τR]'

is a representation of a subjective probability distribution of the label
r.sub.τ. Note also that the outputs of the R model spiking neurons in
response to v.sub.τ form a binary vector u.sub.τ=v
{p.sub.τk}, which is a point estimate of the label r.sub.τ of
v.sub.τ.
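The spiking output described in paragraphs [0210]-[0211] can be sketched as follows; the function name and return layout are illustrative assumptions.

```python
import random

def spiking_output(d, c, rng=None):
    """Sketch of the R model spiking neurons' outputs.

    p[k] = (d[k]/c + 1)/2 estimates the subjective probability that the
    kth label component is 1; neuron k emits a spike (1) with that
    probability, giving the point estimate u of the label r.
    """
    rng = rng or random.Random(0)
    p = [(dk / c + 1.0) / 2.0 for dk in d]
    u = [1 if rng.random() < pk else 0 for pk in p]
    return u, p
```

When d[k] equals c the probability is 1 and neuron k always spikes; when d[k] equals -c the probability is 0 and it never spikes.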

[0213] Note that the PU has Σ.sub.ψ=1.sup.Ψ 2dim
vt.sup.(ψ) synapses for each of the R spiking neurons and has
the same number of synapses for the nonspiking neuron. In other
embodiments, however, the R spiking neurons and the nonspiking neuron may
have different numbers of synapses.

[0214] A flow chart of a PU without a learning mechanism is shown in FIG.
21. At time or numbering τ, the PU receives a (feature) vector
v.sub.τ. The neuronal encoders in the PU encode v.sub.τ into the
general neuronal code {hacek over (v)}.sub.τ by (39). The synapses
and the masking matrices compute and output c.sub.τj=CjMjj
({hacek over (v)}.sub.τj-{hacek over (v)}.sub.τj) and
d.sub.τkj=DkjMjj ({hacek over (v)}.sub.τj-{hacek over
(v)}.sub.τj) for all k and j, or equivalently the vector
[c.sub.τj] and matrix [d.sub.τkj], where C, D and M are those in
(42), (41) and (43), respectively. The model nonspiking neuron sums up
c.sub.τj over all j to form

[0215] If the general code covariance matrix D in the PU in FIG. 21 has
been learned in unsupervised learning by the unsupervised covariance rule
and held fixed, the PU is called an unsupervised PU (UPU). If D in the PU
in FIG. 21 has been learned in supervised learning by the supervised
covariance rule and held fixed, the PU is called a supervised PU (SPU).

5.5.6 Unsupervised and Supervised Learning

[0216] The general code deviation accumulation vector C, whose components
are stored in the artificial synapses with a feedforward connection to
the artificial nonspiking neuron, is adjusted by the following
unsupervised accumulation rule:

C←λC+(Λ/2)({hacek over (v)}.sub.τ-{hacek over
(v)}.sub.τ)' (56)

where the general neuronal code {hacek over (v)}.sub.τ and its
average over time {hacek over (v)}.sub.τ are defined in (39), and
(40). Note that (56) is equivalent to:

C(ψ)←λC(ψ)+Λ({hacek over (v)}.sub.τ(ψ)-{hacek over (v)}.sub.τ(ψ))'

for ψ=1, . . . , Ψ.
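Rule (56) is a componentwise decay-and-accumulate step. As a minimal sketch (not the disclosure's implementation), the snippet below uses plain Python lists; `lam` (λ), `Lam` (Λ), and the two-component code standing in for the general neuronal code and its time average are illustrative placeholders:

```python
# Sketch of the unsupervised accumulation rule (56): a componentwise
# decay-and-accumulate step.  lam, Lam, and the two-component code standing
# in for the general neuronal code are illustrative, not from the disclosure.
def accumulate(C, v_code, v_avg, lam=1.0, Lam=1.0):
    """C <- lam*C + Lam*(v_code - v_avg)', componentwise."""
    return [lam * c + Lam * (v - a) for c, v, a in zip(C, v_code, v_avg)]

C = accumulate([0.0, 0.0], [1, 0], [0.5, 0.5])   # one update step
```

With λ=Λ=1 the rule simply accumulates code deviations over time, which is how C is built up in the examples of Section 5.5.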

[0217] In unsupervised learning by the PU, the general code covariance
matrix D, whose entries are stored in the artificial synapses with a
feedforward connection to the R artificial spiking neurons, is adjusted
by the following unsupervised covariance rule:

DλD+Λ(v{p.sub.τ}-v{p.sub.τ})({hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ)' (57)

where the general neuronal code {hacek over (v)}.sub.τ and its
average over time {hacek over (v)}.sub.τ are defined in (39), and
(40), and v {p.sub.τ} is the output vector u.sub.τ of the PU.
Note that (57) is equivalent to:

D(ψ)λD(ψ)+Λ(u.sub.τ-u.sub.τ)({hacek
over (v)}.sub.τ(ψ)-{hacek over (v)}.sub.τ(ψ))'

for ψ=1, . . . , Ψ, where u.sub.τ=v {p.sub.τ}.

[0218] In supervised learning by the PU, the general code covariance
matrix D, whose entries are stored in the artificial synapses with a
feedforward connection to the R artificial spiking neurons, is adjusted
by the following supervised covariance rule:

DλD+Λ(r.sub.τ-1/2)({hacek over
(v)}.sub.τ-{hacek over (v)}.sub.τ)' (58)

where the general neuronal code {hacek over (v)}.sub.τ and its
average over time {hacek over (v)}.sub.τ are defined in (39), and
(40), and r.sub.τ is the label of v.sub.τ (or the measurements in
the receptive field of the PU) that is provided from outside of the PU.
The provided label r.sub.τ is an R-dimensional binary vector. Note
that (58) is equivalent to:

D(ψ)λD(ψ)+Λ(r.sub.τ-1/2)({hacek over
(v)}.sub.τ(ψ)-{hacek over (v)}.sub.τ(ψ))'

for ψ=1, . . . , Ψ.
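Rules (57) and (58) share one rank-one outer-product form and differ only in the left factor: the centered output u.sub.τ for the unsupervised rule, or r.sub.τ-1/2 for the supervised rule. A minimal sketch (not the disclosure's implementation) with illustrative λ, Λ and a two-dimensional code:

```python
# Shared sketch of the covariance updates (57)/(58): a decayed rank-one
# outer-product correction.  lam, Lam, and the vectors are illustrative
# placeholders, not values from the disclosure.
def covariance_update(D, left, v_code, v_avg, lam=1.0, Lam=1.0):
    """D <- lam*D + Lam * left * (v_code - v_avg)'.

    left = u_tau minus its time average for the unsupervised rule (57),
    left = r_tau - 1/2 (componentwise) for the supervised rule (58).
    """
    return [[lam * D[k][j] + Lam * left[k] * (v_code[j] - v_avg[j])
             for j in range(len(v_code))]
            for k in range(len(left))]

D = covariance_update([[0.0, 0.0], [0.0, 0.0]],
                      [0.5, -0.5],       # e.g. r_tau - 1/2 with r_tau = [1, 0]
                      [1, 0], [0.5, 0.5])
```

The same function serves both rules because only the left factor changes, which is why the synapse hardware for UPUs and SPUs can be identical.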

[0219] The adjusted C and D are delayed for one unit of time or one
numbering before being stored in the model synapses.

[0220]FIG. 22 illustrates a PU with an unsupervised learning mechanism to
learn the general code covariance matrix D by the unsupervised covariance
rule and an unsupervised learning mechanism to learn the general code
deviation accumulation vector C by the unsupervised accumulation rule.
The adjusted C and D are delayed for one unit of time or one numbering
before being stored in the model synapses. The PU in FIG. 22 is called an
unsupervised PU (UPU).

[0221] FIG. 23 illustrates a PU with a supervised learning mechanism to
learn the general code covariance matrix D by the supervised covariance
rule and an unsupervised learning mechanism to learn the general code
deviation accumulation vector C by the unsupervised accumulation rule.
The adjusted C and D are delayed for one unit of time or one numbering
before being stored in the model synapses. The PU in FIG. 23 is called a
supervised PU (SPU).

[0222] A network of at least one PU is called a low-order model (LOM) of
biological neural networks. The vector vt input to a PU in a layer
contains not only feedforwarded components from outputs of model spiking
neurons in the lower layers but also feedbacked components from outputs
of model spiking neurons in the same or higher layers. Feedbacked
components are delayed for at least one unit of time to ensure stability.
Feedforwarded components may come from more than one layer preceding the
layer to which the PU belongs.

5.5.7 Creating a Vocabulary by Unsupervised Covariance Learning

[0223] Pseudo-random binary digit generation performed by the R model
spiking neurons in a PU (processing unit) is indispensable in making the
unsupervised covariance rule work for the PU. Let us now see how a
"vocabulary" is created by the unsupervised covariance rule for the PU: If a
feature subvector v.sub.τ or a slightly different version of it has
not been learned by the PU, and CM ({hacek over (v)}.sub.τ-{hacek
over (v)}.sub.τ)=0, then d.sub.τ/c.sub.τ is set equal to 0
and p.sub.τ=(1/2)I, where I=[1 1 . . . 1]'. The R model spiking
neurons use this subjective probability vector to generate a purely
random label r.sub.τ. Once this r.sub.τ and the output vector
{hacek over (v)}.sub.τ have been learned and stored in C and D, if
v.sub.τ is input to the PU for a second time, then
u.sub.τ=r.sub.τ with probability 1, and one more copy of the pair
(u.sub.τ, r.sub.τ) is included in C and D.

[0224] If an input vector v.sub.τ or a slightly different version of
it has been learned by a PU with different labels for different numbers
of times, then d.sub.τk/c.sub.τ≠0 and
p.sub.τ≠(1/2)I. Since v.sub.τ may contain different parts
from different causes and may have been assigned different labels in different
rounds of unsupervised learning, p.sub.τ may not be a binary vector.
For example, assume that two labels, r.sub.τ1 and
r.sub.τ2, of the same input vector v.sub.τ have been learned
with relative frequencies 0.7 and 0.3, respectively. Then in response to
v.sub.τ, each component of u.sub.τ that is output from the PU is
equal to the corresponding component of r.sub.τ1 with probability
0.7 and to the corresponding component of r.sub.τ2 with probability
0.3. Since these two labels may have common components, the point
estimate of the label resembles r.sub.τ1 with a probability greater
than 70% and resembles r.sub.τ2 with a probability greater than 30%.
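The componentwise sampling just described can be sketched as independent Bernoulli draws; the probability vector below is illustrative, not from the disclosure:

```python
import random

# Sketch of how the R model spiking neurons turn a subjective probability
# vector p_tau into a binary point estimate u_tau: component k is an
# independent Bernoulli draw with probability p_tau_k.  The values below
# are illustrative, not from the disclosure.
_rng = random.Random(0)  # fixed seed for reproducibility

def sample_label(p, rng=_rng):
    return [1 if rng.random() < pk else 0 for pk in p]

p = [0.7, 0.7, 0.3]   # mixes two learned labels with frequencies 0.7 / 0.3
u = sample_label(p)   # a point estimate of the label
```

Components where p_tau_k is 0 or 1 (labels agree) are reproduced with certainty, which is why the point estimate resembles the more frequent label with probability greater than its relative frequency.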

5.6 Controlling Cluster Sizes

[0225] The set of vectors that, as input vectors to a PU, will retrieve
(or generate) the same label from the PU is called a cluster. The length
of the time interval over which the time average of the
PU's output vector ut=v {pt} is taken can be selected to control the
cluster sizes.

EXAMPLE 4

[0226] The Gray codes of two adjacent integers differ by one component.
For example, the Gray codes of the integers, 0 to 15, are, respectively,
0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110,
1010, 1011, 1001, 1000. Let vt, t=1, 2, . . . , 16, be these 16
codes in the same order. For example, v1=[0 0 0 0]', v2=[0 0 0
1]' and so on. Repeating the 16 codes, an infinite sequence of vectors
vt, t=1, 2, . . . , is available for learning. Let us use a PU with
10 model spiking neurons and 1 model nonspiking neuron to learn vt,
t=1, 2, . . . , 16, in the given order without a supervisor. Assume that
Ψ=J=1 in (39) and (32) and that λ=Λ=1 in (12) and (21).
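The listing above is the standard reflected binary Gray code, which can be generated as n XOR (n >> 1). This short sketch (not part of the disclosure) verifies the property the example relies on, namely that adjacent codes differ in exactly one bit:

```python
# The reflected binary Gray code of n is n XOR (n >> 1); this sketch checks
# that the 16 four-bit codes match the listing above and that consecutive
# codes differ in exactly one bit.
def gray(n, bits=4):
    return format(n ^ (n >> 1), "0{}b".format(bits))

codes = [gray(n) for n in range(16)]
diffs = [sum(a != b for a, b in zip(codes[i], codes[i + 1]))
         for i in range(15)]
```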

[0227] Because J=1 in M, an unsupervised learning rule generalizes, if
necessary, on three binary digits of each vt. Since two consecutive
codes differ by only one bit, the unsupervised correlation rule in THPAM
(James Ting-Ho Lo, Functional Model of Biological Neural Networks,
Cognitive Neurodynamics, Vol. 4, Issue 4, pp. 295-313, November 2010)
learns all the 16 codes (i.e., vt) with one single label and thus
puts them in one single cluster. Note that components 0 in vt are
replaced with -1 before the unsupervised correlation rule in THPAM can be
applied.

[0228] Let us now use the unsupervised covariance rule to learn the same
sequence of codes vt. To initiate the learning, we set v
{pt}=I/2 for t=1, 0, -1, -2, . . . , -∞ and set C=0 and D=0 at
t=0. For simplification, we assume {hacek over (v)}t=I/2 for all t.
Note that the time averages v {pt} and {hacek over (v)}t may be
taken over different time lengths.

EXAMPLE 4a

[0229] Assume that the time average v {p} is taken over 100,000 time
units, namely v {pt}=Σ.sub.τ=t-100,000+1.sup.t v
{p.sub.τ}/100,000. Under this assumption, v {pt} is virtually
I/2 for t=1, 2, . . . , 16. For notational simplicity, we take v
{pt} to be I/2 for t=1, 2, . . . , 16 in the following.

Retrieval at t=1: c1=0, d1=0, p1=I/2, and v {p1} is
purely random (i.e., v {p1i}=1 with probability 1/2 for i=1, . . . ,
10).

Learning at t=1: C=1/2 ({hacek over (v)}1-I/2)' and D=(v
{p1}-I/2) ({hacek over (v)}1-I/2)'. Note that v {p1} is a
binary vector.

Retrieval at t=2: c2=1, d2=2(v {p1}-I/2),
p2=v {p1}, and v {p2}=v {p1}, because v {p1} is
a binary vector. v2 is assigned the same label v {p1} as
v1, and v1 and v2 are put in the same cluster.

Learning at t=2: C=1/2 Σ.sub.t=1.sup.2 ({hacek over (v)}t-I/2)' and
D=Σ.sub.t=1.sup.2 (v {pt}-I/2) ({hacek over (v)}t-I/2)'.

[0230] Retrieval at t=3: c3=1, d3=2(v {p2}-I/2), p3=v
{p2}, and v {p3}=v {p2}. v3 is assigned the same label v
{p1} as v2, and v1, v2 and v3 are put in the
same cluster.

Learning at t=3: C=1/2 Σ.sub.t=1.sup.3 ({hacek over
(v)}t-I/2)' and D=Σ.sub.t=1.sup.3 (v {pt}-I/2)({hacek
over (v)}t-I/2)'.

Retrieval at t=4: c4=1, d4=2(v
{p3}-I/2), p4=v {p3}, and v {p4}=v {p3}. v4
is assigned the same label v {p3} as v3, and v1, . . . ,
v4 are put in the same cluster.

[0231] Continuing in this manner, all the 16 codes vt are assigned
the same label v {p1} and thus put in the same cluster.

EXAMPLE 4b

[0232] Assume that the time average v {pt} is taken over 1 time unit,
namely v {pt}=v {pt}.

[0233] Continuing in this manner, the 16 codes vt are each assigned a
purely random label. Since there are 10 model spiking neurons and thus
each label has 10 entries, the 16 codes are almost certainly assigned
different labels. Note that D remains 0 and thus no knowledge
is learned.

EXAMPLE 4c

[0234] Assume that the time average v {pt} is taken over n time
units, namely v {pt}=Σi=0n-1 v {pt-i}/n.

[0236] Note that limn→∞ pt=v {pt-1}. This is
consistent with Example 4a. Two observations can be made from these
formulas:

[0237] 1. Given a fixed t: the greater n is, the closer
pt is to v {pt-1}, the closer v {pt} is to v
{pt-1}, the more likely vt is assigned the same label v
{pt-1} as vt-1, and the more likely vt and vt-1 are
put in the same cluster.

[0238] 2. Given a fixed n: for t≦n, the
deviation of (Σ.sub.τ=1.sup.t-1 v {p.sub.τ}+(n-t+1)I/2)/n
from 1/2I increases as t increases, and hence the deviation of pt
from v {pt-1} increases. Consequently, the chance that vt is
assigned the same label v {pt-1} as vt-1 decreases as t
increases, and the chance that vt and vt-1 are put in the same
cluster decreases.

[0242] The two observations are valid here. However,
pt=(Σ.sub.τ=2.sup.t-1 p.sub.τ+v
{pt-1}-(Σ.sub.τ=1.sup.t-1 v
{p.sub.τ}+(n-t+1)I/2)/n+1/2I)/(t-1) shows that the change from
pt-1 to pt for learning identical codes is much smaller than
that for learning the sequence of Gray codes in EXAMPLE 4c. This means
that for some n, the identical codes can be put in the same cluster, and
yet the 16 Gray codes are put in more than one cluster. In fact, it is
acceptable by the LOM that the identical codes are assigned different
labels as long as the labels differ by a couple of bits, because these
labels are inputs to other PUs, which have generalization capability.

[0243] The Gray codes in the given learning order are an extreme example
that does not exist in the real world. The analysis in the above examples
shows that the sizes of the clusters of the Gray codes in the given
learning order can be controlled by selecting n appropriately. In the
real world, repeated learning is a frequently encountered situation.
Therefore, we should select n large enough to guarantee that the chance
of putting the same input vector in repeated learning into two different
clusters is negligible. Only when this requirement is fulfilled do we
select n as small as possible to prevent a cluster from overgrowing. Of
course, the maximum size of a cluster depends on the application of the
LOM. (In fact, having more clusters does little harm to the processing of
the LOM. It is like having multiple words for similar things.) The
formulas for determining pt in the above examples can be generalized
easily for any dimensionality of the input vector and any number of
presentations for similar input vectors (i.e., those with a small number
of different bits) or identical input vectors. Generalizations of the
formulas in Example 5 are especially useful for determining the smallest
n required for not assigning different labels to identical input vectors.

[0244] In supervised learning by an SPU (supervised processing unit), the
label is not generated by the SPU, but is provided from outside the SPU
(or the LOM). If different labels are provided for input vectors with
small differences, the SPU learns the different labels and can
distinguish said input vectors with small differences thereafter. In
unsupervised learning by an UPU, if the UPU is required to be able to
distinguish learned input vectors with small differences and to remain
able to recognize unlearned input vectors that are more different from
learned input vectors, other masking matrices that are stricter (or less
tolerant) than the masking matrices are used. These other masking
matrices have a smaller J in (32) than the masking matrices do and are
called the learning masking
matrices. The learning masking matrices, denoted by M# (ψ),
ψ=1, . . . , Ψ, form a general masking matrix called the general
learning masking matrix

M#=diag[M#(1)M#(2) . . . M#(Ψ)] (59)

It is a diagonal matrix. The diagonal entries are numbered consecutively
from 1 to dim {hacek over (v)}.sub.τ=Σ.sub.ψ=1.sup.Ψ
2.sup.dim v.sub.τ.sup.(ψ), where {hacek over (v)}.sub.τ is
the general neuronal code, whose entries are also numbered consecutively
from 1 to dim {hacek over (v)}.sub.τ. The diagonal entries
Mjj#, j=1, . . . , dim {hacek over (v)}.sub.τ, of M#
are called learning masking factors.

[0245] An UPU (unsupervised processing unit) with learning masking
matrices M# (ψ), ψ=1, . . . , Ψ, still generates the
binary vector v {p.sub.τ} that is output from the UPU in response to
the vector v.sub.τ input to the UPU using its masking matrices M
(ψ), ψ=1, . . . , Ψ, as before. However, when the UPU learns,
the learning masking matrices M# (ψ), ψ=1, . . . , Ψ,
are used instead to generate an estimate v {p.sub.τ#} of the
label of vt, and the general code covariance matrix D is adjusted by

v {p.sub.τk#} is a pseudorandom number generated in accordance
with the subjective probability distribution p.sub.τk#, and then
v {p.sub.τ#}=[v {p.sub.τ1#} . . . v
{p.sub.τR#}]', where R is the number of model spiking neurons in
the UPU. A typical UPU with learning masking factors Mjj#, j=1,
. . . , dim {hacek over (v)}.sub.τ is illustrated in FIG. 24.

5.8 Spike Trains for Each Exogenous Feature Vector

[0247] Recall that a binary vector ut output from a PU (processing
unit), is obtained by a pseudo-random number generator using the
subjective probability distribution pt of the label rt of a
vector vt input to the PU. Components of such binary vectors ut
with uncertainty form vectors input to other or the same PUs through
feedforward or feedback connections. Upon receiving a vector with
uncertainty, a PU uses masking matrices to suppress or "filter out" some
components so that the remaining components are consistent with those
stored in the code covariance matrices. (Masking matrices are described
in Section 5.4.)

[0248] However, there is a chance for the pseudo-random number generator
to generate a binary vector ut that is such an outlier for the
subjective probability distribution pt that it causes undesirable
effects on the learning and retrieving of PUs receiving components of ut
in spite of the masking matrices. To minimize such undesirable effects and to
represent the subjective probabilities involved in the PUs, the LOM
usually completes a certain number of rounds of retrieving and learning
for each exogenous vector vtex input to the LOM so that many
pseudorandom versions of ut are generated and learned by each PU for
the same vtex.

[0249] To have ζ rounds of retrieving and learning for each exogenous
vector vtex, the exogenous vector must be held constant for
ζ units of time. In other words, the exogenous vector vtex
is presented to the LOM with a different time scale. More specifically,
vtex changes at t=iζ+1, i=0, 1, 2, . . . . Consequently, a
PU generates a sequence of binary vectors denoted by ut,
t=iζ+j, j=1, 2, . . . , ζ, for each exogenous feature vector
vtex, which remains constant for t=iζ+j, j=1, 2, . . . ,
ζ. More specifically, once a new exogenous vector
viζ+1ex is presented,
viζ+jex=viζ+1ex for j=2, . . . , ζ.
The sequence ut, t=iζ+j, j=1, 2, . . . , ζ, output from
the PU consists of R spike trains, each having ζ spikes during the
period of time.

[0250] Alternatively, if each exogenous vector vtex is held
constant for 1 unit of time, then the PU generates a sequence of ζ
binary vectors within the 1 unit of time; namely the PU generates one
binary vector every 1/ζ unit of time.
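The time scaling described above reduces to a simple index computation: with each exogenous vector held for ζ units and changing at t=iζ+1, time t belongs to exogenous vector i=(t-1)//ζ. A small sketch (the value ζ=4 and the helper name are illustrative):

```python
# Sketch of the time scaling in Section 5.8: each exogenous vector v_t^ex is
# held constant for zeta units of time, changing at t = i*zeta + 1, so time t
# belongs to exogenous vector i = (t - 1) // zeta.  zeta = 4 is illustrative.
def exogenous_index(t, zeta=4):
    return (t - 1) // zeta

# t = 1..4 -> vector 0; t = 5..8 -> vector 1, and so on
indices = [exogenous_index(t) for t in range(1, 10)]
```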

5.9 Multilayer Networks of Processing Units with Feedbacks

[0251] The low-order model (LOM) of biological neural networks described
in the present invention disclosure is a network of processing units
(PUs) with or without time-delayed feedbacks. An LOM that is a multilayer
network of PUs is described in this Subsection.

[0252] An external vector input to the LOM is called an exogenous feature
vector, and a vector input to a layer of PUs is called a feature vector.
A feature vector input to a layer usually contains not only feedforwarded
outputs from the PUs in preceding layers but also feedbacked outputs from
the PUs in the same or higher layers with time delays. A feature vector
may contain components from an exogenous feature vector. For simplicity,
we assume that the exogenous feature vector is only input to layer 1 and
is thus a subvector of a feature vector input to layer 1. All these
feature vectors and output vectors over time usually form spike trains.

[0253] A subvector of a feature vector that is input to a PU is called a
feature subvector. Trace the feedforward connections backward from
neurons of a PU to a subvector of a feature vector input to a layer or
the exogenous feature vector. This subvector is called the receptive
field of the PU in the feature vector input to the layer or the exogenous
feature vector. All the measurements that affect the receptive field in
the exogenous feature vector are also called the receptive field of the
PU in the measurements. For example, let the measurements taken by the
pixels in a digital camera be input to a model of a retina, and let the
output vector from the model of the retina be used as the exogenous
feature vector of the LOM. Then the measurements or the pixels with those
measurements that affect the receptive field of a PU in the exogenous
feature vector are called the receptive field of the PU in the
measurements.

[0254] The collection of neurons in layer l-i, i=1, 2, . . . , that have a
direct feedforward connection (without going through another neuron) to a
neuron in a PU in layer l and the unit-time delay devices that hold a
feedback that is directly input (without going through another unit-time
delay device) to the same PU are called the immediate receptive field of
the PU.

[0255] The feature vector input to layer l at time or numbering τ is
denoted by v.sub.τl-1, and the output from layer l at τ is
denoted by v {p.sub.τl}. The feature vector v.sub.τl-1
consists of components of the feedforwarded vector v
{p.sub.τl-1} and feedbacked vector v {p.sub.τ-z(k)l+k}
feedbacked from the same layer l and higher layers l+k and after z (k)
time units of delay for k=0, 1, . . . , where z (k) is a function of k.

[0256]FIG. 25 shows layer l and layer l+2 of PUs of the LOM. In FIG. 25,
same-layer feedback connections with one unit-time delay device from
layer l to itself and 2-layer feedback connections with 5 unit-time delay
devices from layer l+2 to layer l are shown. The box under layer l of PUs
does not model a biological entity, but illustrates that the feature
vector input to layer l comprises feedforward vector v
{p.sub.τl-1}, the same layer feedback v {p.sub.τ-1l},
and the 2-layer feedback v {p.sub.τ-5l+2}. Note that the
symbols, v {p.sub.τl-1}, v {p.sub.τ-1l} and v
{p.sub.τ-5l+2}, include all the components output from layer
l-1, layer l, and layer l+2 for simplicity in showing the feedback
structure. For an application, we do not have to include all of them, but
may select some of these components as inputs to layer l.

[0257] Once an exogenous feature vector is received by the LOM, the PUs
perform functions of retrieving and/or learning from layer to layer
starting with layer 1, the lowest layer. After the PUs in the highest
layer, layer L, complete performing their functions, the LOM is said to
have completed one round of retrievings and/or learnings. Each exogenous
feature vector is held constant for a certain number ζ of time
units, during which the LOM completes ζ rounds of retrievings and/or
learnings.

[0258] We note that retrieving and learning by a PU are performed locally,
meaning that only the feature subvector input to the PU and its label are
involved in the processing by the PU. Causes in patterns, temporal or
spatial, usually form a hierarchy. Examples: (a). Phonemes, words,
phrases, sentences, and paragraphs in speech. (b). Musical notes,
intervals, melodic phrases, and songs in music. (c). Bananas, apples,
peaches, salt shaker, pepper shaker, fruit basket, condiment tray, table,
refrigerator, water sink, and kitchen in a house. Note that although
Example (c) is a spatial hierarchy, when one looks around in the kitchen,
the images scanned and received by the person's retina form a temporal
hierarchy.

[0259] The higher a layer in the LOM is, the higher in the hierarchy the
causes the PUs in the layer handle, and the more time it takes for the
causes to form and be detected and recognized by the PUs. Therefore, the
number of unit-time delay devices on a feedback connection is a monotone
increasing function z (k) of k, which is defined above. This requirement
is consistent with the workings in a biological neural network in the
cortex. Note that it takes time (i) for biological PUs to process feature
subvector, (ii) for spikes to travel along feedforward neural fibers from
a layer to the next layer, and (iii) for spikes to travel along feedback
neural fibers from a layer to the same or a lower-numbered layer. Note
also that the subscripts of the input vector v.sub.τl-1 and output
vector v {p.sub.τl} of all layers l are the same, indicating that the
same exogenous feature vector v.sub.τex is processed or
propagated in all layers. The common subscript τ does not represent
the time that the signals in the biological network reach or are processed by
its layers. However, a feedback v.sub.τ-z(k)l+k from layer l+k
to layer l for inclusion in v.sub.τl-1 must have a delay z (k)
that reflects the sum of the times taken for (i), (ii) and (iii) from the
input terminals of layer l back to the same input terminals.

[0260] For notational simplicity, the superscript l-1 in vtl-1
and dependencies on l-1 or l in other symbols are sometimes suppressed in
the following when no confusion is expected.

of vtl is called a feature subvector of vtl. nl
is called a feature subvector index (FSI), and vtl (n) is said
to be a feature subvector on the FSI nl or have the FSI nl.
Each UPU is associated with a fixed FSI nl and denoted by UPU
(nl). Using these notations, the sequence of subvectors of
vtl-1, t=1, 2, . . . , that is input to UPU(nl) is
vtl-1 (nl), t=1, 2, . . . . The FSI nl of a UPU
usually has subvectors, nl (ψ), ψ=1, . . . , Ψ, on which
subvectors vtl-1 (nl (ψ)) of vtl-1 (nl)
are separately processed by the neuronal encoders in UPU(nl) at
first. The subvectors, nl (ψ), ψ=1, . . . , Ψ, are not
necessarily disjoint, but are all inclusive in the sense that every
component of nl is included in at least one of the subvectors
nl (ψ). Moreover, the components of nl (ψ) are usually
randomly selected from those of nl.

[0263] The components of a feature vector v.sub.τl-1 input to
layer l at time (or numbering) τ comprise components of binary
vectors generated by UPUs in layer l-1 and those generated at previous
times by UPUs in the same layer l or UPUs in higher layers with layer
numberings l+k for some positive integers k. The time delays may be of
different durations.

[0264] For illustration, an example is given in the following:

EXAMPLE 4

[0265] Let us set the number z (k) of unit-time delay devices equal to
4(k+1) for k=0, . . . , 7; and set the number ζ time units that each
exogenous feature vector is held constant equal to 16.

[0266] For k=1 and z (1)=8, the first 8 feedbacks used by layer l in
processing an exogenous feature vector vtex are output from
layer l+1 in response to vt-1ex, which provides temporally and
spatially associated information from the preceding exogenous feature
vector vt-1ex.

[0267] For k=5 and z (5)=24, the first 8 feedbacks used by layer l in
processing an exogenous feature vector vtex are output from
layer l+5 in response to vt-2ex; and the next 8 feedbacks are
output from layer l+5 in response to vt-1ex, which provides
temporally and spatially associated information from the preceding
exogenous feature vectors, vt-2ex and vt-1ex.

[0268] For k=8 and z (8)=36, the first 4 feedbacks used by layer l in
processing an exogenous feature vector vtex are output from
layer l+8 in response to vt-3ex; and the next 12 feedbacks are
output from layer l+8 in response to vt-2ex, which provides
temporally and spatially associated information from the preceding
exogenous feature vectors, vt-3ex and vt-2ex.
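The three cases above follow from simple delay arithmetic. The sketch below checks them, using z(k)=4(k+1) and ζ=16 as stated in this example; the helper name is ours, not the disclosure's:

```python
# Sketch checking the delay arithmetic of this example: z(k) = 4*(k+1)
# unit-time delays and zeta = 16 time units per exogenous vector, as stated
# above.  A feedback arriving at time t was emitted at t - z(k), so its lag,
# measured in exogenous vectors, is the difference of presentation indices.
ZETA = 16

def z(k):
    return 4 * (k + 1)

def lag_in_exogenous_vectors(t, k, zeta=ZETA):
    return (t - 1) // zeta - (t - z(k) - 1) // zeta

# within one presentation window (t = 161..176 here):
lags_k1 = [lag_in_exogenous_vectors(160 + j, 1) for j in range(1, 17)]
lags_k5 = [lag_in_exogenous_vectors(160 + j, 5) for j in range(1, 17)]
# lags_k1: the first 8 steps use the preceding exogenous vector (lag 1)
# lags_k5: the first 8 steps lag 2 vectors, the next 8 lag 1, matching [0267]
```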

[0269] Note that the greater k is, the larger the number of unit-delay
devices, and the further back the feedbacked information is in processing
the current exogenous feature vector vtex. Note also that the further back
the feedbacked information, the less spatially but more temporally
associative information is used in processing vtex. Moreover,
given the same numbers of unit-delay devices on each feedback connection,
if an exogenous feature vector is presented to the LOM for a larger
number of time units, then more recent information and less further back
information is used in processing vtex. This means more
spatially associated information but less temporally associated
information is brought back by the feedback connections and utilized by
the LOM.

[0270] An example LOM with three layers of PUs and feedbacks is shown in
its entirety in FIG. 26. There are three types of feedback connection:
same-layer feedbacks, one-layer feedbacks and two-layer feedbacks. The
numbers of unit-time delay devices on the feedback connections are not
specified for simplicity. The second delay box on a feedback connection
represents an additional delay.

5.10 The Clusterer and Interpreter

[0271] An architecture of the LOM that makes for an effective artificial
neural network comprises a clusterer and an interpreter, which are
described in the present Subsection. The clusterer in the LOM is a
network of unsupervised processing units (UPUs) with or without
time-delayed feedbacks. If the clusterer has no feedback connection, it is
an unsupervised ANN (artificial neural network) that clusters spatial
data. If the clusterer has feedback connections, it is an unsupervised
ANN that clusters spatial and/or temporal data.

[0272] The vectors output from an UPU in the clusterer in the LOM are
point estimates v {p.sub.τ} of the labels of clusters of vectors
v.sub.τ input to the UPU. These labels form a vocabulary for the UPU.
To interpret such a label or its point estimate v {p.sub.τ} generated
by an UPU into a word, a few words, a sentence or a few sentences using the
language of the human user, say English, an SPU is used.

[0273] Once an exogenous feature vector is received by a clusterer in the
LOM, the UPUs perform functions of retrieving and/or learning from layer
to layer starting with layer 1 (i.e., the lowest layer). After the UPUs
in the highest layer, layer L, complete performing their functions, the
clusterer in the LOM is said to have completed one round of retrievings
and/or learnings (or memory adjustments). For each exogenous feature
vector, the clusterer in the LOM will continue to complete a certain
number of rounds of retrievings and/or learnings.

[0275] FIG. 28 shows an LOM with a clusterer and an interpreter. The
clusterer in the LOM in FIG. 27 is again shown in FIG. 28. The
connections and delay devices in the clusterer are not shown for clarity
in FIG. 28. The three UPUs in the lowest layer are not connected to an
SPU, but each of the three UPUs in the second and third layers is.
UPU(12) and UPU(22) in the second layer have feedforward
connections to SPU(13) and SPU(23) respectively, and
UPU(13) in the third layer has feedforward connections to
SPU(14).

[0276] The labels, r.sub.τ (13), r.sub.τ (23) and
r.sub.τ (14), which are used for supervised learning by the
synapses in SPU(13), SPU(23) and SPU(14) respectively are
provided by the human trainer of the artificial neural network. For
spatial pattern recognition, r.sub.τ (nl) is obtained by tracing
the feedforward connections from SPU(nl) all the way down to input
terminals that receive the exogenous feature vector and if necessary
further down to the sensor elements such as the pixels in the CCD of a
camera. These input terminals or sensory elements are called the
receptive field of SPU(nl) and UPU(nl-1) in the measurements.
If the human trainer sees a distinct cause such as an apple or John Doe's
face in the receptive field, he/she assigns the bipolar binary code of a
word or a phrase such as "apple" or "John Doe's face" to the bipolar
binary label r.sub.τ (nl). Since supervised covariance learning
is very easy and the output vector of SPU(nl), which has the same
number of bits as r.sub.τ (nl), will not be used as inputs to
another PU, the total number of bits in r.sub.τ (nl) can be made
as large as needed to hold the longest code for the receptive field.
Shorter codes can be made longer by including zeros at the beginning or
ending of the shorter codes.

[0277] The receptive field of an SPU branching out from a higher layer is
larger than that of an SPU branching out from a lower layer. The cause in
a larger receptive field usually requires a longer bipolar binary code to
represent. For example, there are only 26 letters, more than 10,000
commonly used words, and millions of commonly used sentences; which need
codes of 5 bits, 14 bits, and 30 bits, respectively. To avoid using a
look-up table to translate codes into English words or sentences, we can
simply type the letters, the words or the sentences to assign their ASCII
codes to r.sub.τ (nl).

[0278] To avoid an SPU using different codes for the same pattern or
cause, we may use SPU(nl) to retrieve its v {p.sub.τl
(nl)}. If it is not recognizable by the human trainer, if it does
not agree with the human trainer, or if the subjective probability
distribution p.sub.τl (nl) does not contain enough
information; we assign a new r.sub.τ (nl) to SPU(nl). A
measure of information contained in p.sub.τl (nl) is

If p.sub.τkl (nl)=0 or 1 for k=1, 2, . . . , R, then ξ
(p.sub.τl (nl))=1 and ξ (p.sub.τl (nl)) is
maximized, meaning there is no uncertainty in p.sub.τ (nl). If
p.sub.τk (nl)=1/2 for k=1, 2, . . . , R, then ξ (p.sub.τ
(nl))=0 and ξ (p.sub.τ (nl)) is minimized, meaning there
is no information in p.sub.τ (nl).
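The disclosure's formula for the measure ξ is not reproduced in the text above. Purely as an illustration, the sketch below uses a hypothetical stand-in (the scaled mean absolute deviation of p from 1/2) that has the two endpoint behaviors just described; it is not the patent's definition:

```python
# Hypothetical stand-in for the information measure xi of [0278] -- the
# disclosure's own formula is not reproduced in the text, so this surrogate
# (scaled mean absolute deviation of p from 1/2) only mimics the stated
# endpoints: 1 when every p_k is 0 or 1, 0 when every p_k is 1/2.
def xi(p):
    return sum(2 * abs(pk - 0.5) for pk in p) / len(p)
```

Here `xi([0, 1, 1, 0])` evaluates to 1.0 (no uncertainty) and `xi([0.5, 0.5, 0.5, 0.5])` to 0.0 (no information), matching the two limiting cases described above.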

[0279] Recall that the clusterer of the LOM learns independently of the
learning (or even existence) of SPUs in the LOM's interpreter. Whenever
new information is acquired for creating a new handcrafted label for the
output of an UPU with or without an SPU, a new SPU can be added and learn
the new label. Therefore, there can be multiple SPUs for an UPU,
providing multiple interpretations of the output from the UPUs. For
example, if the head of a man shows up in the receptive field of an UPU,
there may be 2 SPUs that output labels, "head of a man with long hair"
and "head of Albert Einstein" in response to the vector output from the
UPU. If it is later found that the photograph containing the head of
Einstein was taken in 1945, a third SPU can be added that outputs the
handcrafted label "An image of Albert Einstein in 1945" in response to
the same vector output from the UPU.

5.11 Preprocessing Exogenous Feature Vectors for the LOM

[0280] If the feature vectors αtex, t=1, 2, . . . , to be
processed by an artificial neural network have components that are real
numbers a and b, or values near a and b, where a and b are not unipolar
binary digits (i.e., 0 and 1), a preprocessor is used to convert the
feature vectors into unipolar binary digits or values near them before
processing by the LOM.

[0281] Given a variable α whose value is either a real number a or a
real number b, it can be transformed into a variable v whose value is
either 0 or 1, respectively, by the following function:

v=f(α)=(α-a)/(b-a) ##EQU00045##

If α is near a or near b, then v is near f (a)=0 or near f (b)=1,
respectively. Given a vector α=[α1 α2 . . .
αm]' whose components are real numbers, a and b, the vector
α can be transformed into a binary vector v by the function

v=f(α)=[f(α1)f(α2) . . . f(αm)]'

If the component of α is near a or near b, then the corresponding
components of v are near f (a) or near f (b), respectively. Notice that
the symbol f denotes both a real-valued function of a real-valued
variable α and a vector-valued function of a vector variable
α with real components.

[0282] If the exogenous feature vectors αtex, t=1, 2, . .
. , to be processed by an artificial neural network have components that
are real numbers a and b, or values near a and b, the vectors
αtex, t=1, 2, . . . , can be converted into binary or
nearly binary vectors vtex, t=1, 2, . . . , by vtex=f
(αtex), t=1, 2, . . . . The low-order model (LOM)
described hereinabove can then be applied.

5.12 The CIPAM--an ANN Mathematical Equivalent to the LOM

[0283] The LOM is a model of biological neural networks, wherein most
biological neurons communicate with spike trains. Spike trains are
usually modeled as sequences of 1's and 0's. The feature vectors,
neuronal codes, neuron inputs, and neuron outputs in the LOM are all
(unipolar) binary vectors, whose components are 1's and 0's. On the other
hand, the corresponding quantities in the functional model of biological
neural networks, the THPAM, are bipolar binary vectors, whose components
are -1's and 1's. The development of the LOM motivated a re-examination
of the THPAM to improve the THPAM and eliminate its shortcomings.

[0284] This re-examination resulted in still another functional model of
biological neural networks, namely the Clustering Interpreting
Probabilistic Associative Memory (CIPAM), reported in James Ting-Ho Lo, A
Cortex-Like Learning Machine for Temporal Hierarchical Pattern
Clustering, Detection, and Recognition, Neurocomputing, Vol. 78, pp.
89-103, 2012; also available online with DOI:
10.1016/j.neucom.2011.04.046, which is incorporated into the present
invention disclosure by reference. As in the THPAM, the feature vectors,
neuronal codes, neuron inputs, and neuron outputs in the CIPAM are all
bipolar binary vectors, whose components are -1's and 1's. Among
differences between the CIPAM and the THPAM, a main difference is that
the former uses the unsupervised and supervised covariance learning rule
and the latter uses the unsupervised and supervised correlation learning
rule. This main difference necessitates some other differences including
that in their retrieving methods; namely the former uses the decovariance
rule and the latter the decorrelation rule. For applications, the LOM, THPAM
and CIPAM are all artificial neural networks (ANNs).

[0285] The high degree of similarity between the LOM and the CIPAM
motivated a study of their relation. This study led to the discovery that
the LOM and the CIPAM together with their corresponding components can be
mathematically transformed into each other by the affine transformation
and its inverse:

v=f(x)=1/2(1-x) (65)

x=f-1(v)=1-2v (66)

where f-1 denotes the inverse function of f. Notice that f (1)=0 and
f (-1)=1, and that f-1 (0)=1 and f-1 (1)=-1. Note f (0)=1/2.
Using the CIPAM to process vectors with components from {-1, 0, 1} is
equivalent to using the LOM to process vectors with components from {1,
1/2, 0}. However, for simplicity, processing vectors whose components are
from these ternary sets is not described in detail in the present
invention disclosure. Those skilled in the art are not expected to have
difficulty with extending the invention for processing such vectors.

[0286] A bipolar binary digit x, whose value is 1 or -1, and a unipolar
binary digit v, whose value is 0 or 1, can be transformed into each
other, respectively by (65) and (66). Each component of the LOM and its
corresponding component of the CIPAM can be transformed into each other
by (65) and (66). From this viewpoint, the CIPAM and the LOM are
mathematically equivalent artificial neural networks. The CIPAM has
advantages similar to those of the LOM over the THPAM. As an artificial
neural network, the CIPAM is computationally less expensive than the LOM.
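The transformation pair (65)-(66) can be sketched and checked as follows; the function names are illustrative.

```python
# Sketch of the affine equivalence (65)-(66) between LOM (unipolar {0,1})
# digits and CIPAM (bipolar {1,-1}) digits.

def lom_from_cipam(x):          # (65): v = f(x) = (1 - x)/2
    return (1 - x) / 2

def cipam_from_lom(v):          # (66): x = f^{-1}(v) = 1 - 2v
    return 1 - 2 * v

# f maps 1 -> 0 and -1 -> 1, and the two maps are mutual inverses.
for x in (1, -1):
    assert cipam_from_lom(lom_from_cipam(x)) == x
print("bipolar 1 <-> unipolar", lom_from_cipam(1), "; bipolar -1 <-> unipolar", lom_from_cipam(-1))
```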

[0287] In the operation of the CIPAM, the exogenous feature vectors and
the feature vectors input to each layer of its neurons or PUs (processing
units) are bipolar binary vectors whose components assume values from the
bipolar binary set {-1, 1}. As an equivalent of the LOM, the CIPAM does
not have the shortcomings of the PAM (i.e., THPAM) mentioned in
Subsection 2 (e.g., the inability of the PAM's unsupervised correlation
learning mechanism to prevent clusters from overgrowing under certain
circumstances). On the other hand, because the CIPAM processes bipolar
binary digits, {-1, 1}, it is computationally less expensive than the
LOM, which processes (unipolar) binary digits, {0, 1}. The CIPAM is
briefly described through transforming the LOM into the CIPAM below. A
detailed description is given in the paper by James Ting-Ho Lo, A
Cortex-Like Learning Machine for Temporal Hierarchical Pattern
Clustering, Detection, and Recognition, Neurocomputing, DOI:
10.1016/j.neucom.2011.04.046, which is incorporated herein by reference.

[0288] By (65) and (66), given a bipolar binary variable x, whose value is
1 or -1, we can transform it into a (unipolar) binary variable v by v=f
(x)=1/2 (1-x) and the other way around by x=f-1 (v)=1-2v. Let v=f
(x) and u=f (y). Recall that the neuronal node performs φ (v,
u)=-2vu+v+u. By simple substitution,

φ(f(x),f(y))=1/2(1-xy)

f-1(φ(f(x),f(y)))=xy

Therefore, this function xy is a representation of the neuronal node
operation, where the inputs of the neuronal node are represented by x and
y and its output is represented by xy. Note that if x and y are bipolar
binary variables, xy is the NXOR (not-exclusive-or, i.e., equality) function.
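The substitution above can be checked numerically; a small sketch (names illustrative) verifying φ(f(x), f(y)) = (1 − xy)/2, its inverse image xy, and the equality (NXOR) behavior on all bipolar pairs:

```python
# Verify the identity phi(f(x), f(y)) = (1 - x*y)/2, so that the LOM node
# phi(v, u) = -2vu + v + u corresponds to the single product x*y in the
# CIPAM; for bipolar x, y the product x*y equals 1 exactly when x == y.

def phi(v, u):
    return -2 * v * u + v + u

f = lambda x: (1 - x) / 2        # (65)
finv = lambda v: 1 - 2 * v       # (66)

for x in (1, -1):
    for y in (1, -1):
        assert phi(f(x), f(y)) == (1 - x * y) / 2
        assert finv(phi(f(x), f(y))) == x * y
        assert (x * y == 1) == (x == y)   # NXOR: 1 exactly when x == y
print("node-operation equivalence holds on all bipolar pairs")
```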

[0289] Using this representation of the neuronal node operation, the
equations (2)-(5) that generate the neuronal code (or encoding) {hacek
over (v)} of an m-dimensional vector v are respectively replaced by the
following equations for generating the representation {hacek over (x)} of
{hacek over (v)}:

This representation of the neuronal code is actually the orthogonal
expansion of x discussed in the paper by James Ting-Ho Lo, Functional
Model of Biological Neural Networks, Cognitive Neurodynamics, Vol. 4, No.
4, pp. 295-313, November 2010 and the patent application by James Ting-Ho
Lo, A Cortex-Like Learning Machine for Temporal and Hierarchical Pattern
Recognition, U.S. patent application Ser. No. 12/471,341, filed May 22,
2009; Publication No. US-2009-0290800-A1, Publication Date: Nov. 26,
2009. In the same paper and same patent application, it is proven that
given two m-dimensional vectors, x and y, their orthogonal expansions,
{hacek over (x)} and {hacek over (y)}, satisfy

The relation between the above formula (71) and the formula (6) can be seen
by observing that for m-dimensional vectors, v, u, x, y, and their
neuronal codes, {hacek over (v)}, {hacek over (u)}, and orthogonal
expansions, {hacek over (x)}, {hacek over (y)}, respectively; we have
{hacek over (v)}-1/2I=-1/2{hacek over (x)} and {hacek over
(u)}-1/2I=-1/2{hacek over (y)}, and

where I=[1 1 . . . 1]', which we note is not the identity matrix. From
the formula (71), it follows that if x and y are bipolar binary variables,

{hacek over (x)}'{hacek over (y)}=0, if x≠y; {hacek over
(x)}'{hacek over (y)}=2.sup.m, if x=y (73) ##EQU00050##
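The disclosure leaves the encoding equations to the incorporated references; the sketch below assumes one standard construction of the orthogonal expansion, the Kronecker product of the factors [1, x_i], which suffices to exhibit property (73).

```python
# Assumed construction (illustrative): expand an m-vector x into the
# 2**m-vector of all subset products of its components, i.e. the Kronecker
# product of the factors [1, x_i].  The inner product of two expansions is
# then prod_i (1 + x_i*y_i), which for bipolar x, y is 0 if x != y and
# 2**m if x == y -- property (73).

from itertools import product

def expand(x):
    out = [1]
    for xi in x:
        out = [o * t for o in out for t in (1, xi)]
    return out

m = 3
vecs = list(product((1, -1), repeat=m))
for x in vecs:
    for y in vecs:
        ip = sum(a * b for a, b in zip(expand(x), expand(y)))
        assert ip == (2 ** m if x == y else 0)   # property (73)
print("orthogonality property (73) verified for all bipolar 3-vectors")
```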

[0291] Based on the transformations, (65) and (66), and the above
corresponding properties between the variables x and y and the variables
v and u, the LOM (low-order model of biological neural networks)
described in preceding Subsections can be transformed into the CIPAM,
which is an equivalent of LOM. In this equivalent of LOM, components of
feature vectors, "spikes and nonspikes", and components of labels are
bipolar binary digits or in their vicinities. The variables whose values
are such digits and numbers will be denoted by xtj, yti and
zti instead of vtj, uti and wti in the following
lists of formulas for the equivalent, CIPAM. Here xtj=f-1
(vtj), yti=f-1 (uti), and zti=f-1
(wti).

[0292] For unsupervised covariance learning by the CIPAM, the formulas,
(11), (12) and (13), are respectively transformed into

where xt is a vector input to a neuronal encoder; each of the
2.sup.m outputs, {hacek over (x)}t1, {hacek over (x)}t2, . . .
, {hacek over (x)}.sub.t,2.sup.m, from the neuronal encoder at time (or
numbering) t, passes through a synapse to reach each of a number, say R,
postsynaptic model spiking neurons and a postsynaptic model nonspiking
neuron; and the output of model spiking neuron i is denoted by yti.
yt is the vector of R outputs yti of the R model spiking
neurons. Furthermore, the formulas, (14), (15) and (16), are respectively
transformed into

where zt is a label of the pattern (or cause) inside the receptive
field of the R model spiking neurons, and is provided from outside the
CIPAM. Here zs, s=1, 2, . . . t, are bipolar binary vectors or
nearly bipolar binary vectors.

[0294] For unsupervised accumulation learning by the CIPAM, (21) and (22)
are respectively transformed into

[0298] The formula, (43), for the general masking matrix M for a PU in the
CIPAM remains the same. The general expansion covariance matrix D learned
by the unsupervised covariance rule by an UPU in the CIPAM is defined by
(41), where the code covariance matrix D (O) is defined by (76) or (79),
depending on whether ys-ys or ys is used. The general
expansion covariance matrix D learned by the supervised covariance rule
by an SPU in the CIPAM is defined by (41), where the code covariance
matrix D (ψ) is defined by (82). The general expansion covariance
matrix C for the CIPAM is defined by (42), where the code covariance
matrix C (ψ) is defined by (84).

[0299] For retrieving information from synapses by the CIPAM, the
formulas, (44) and (45), are transformed into

d.sub.τ(ψ)=D(ψ)M(ψ)({hacek over
(x)}.sub.τ(ψ)-{hacek over (x)}.sub.τ(ψ)) (91)

c.sub.τ(ψ)=C(ψ)M(ψ)({hacek over
(x)}.sub.τ(ψ)-{hacek over (x)}.sub.τ(ψ)) (92)

for ψ=1, . . . , Ψ.

[0300] The computations performed by a nonspiking neuron and by spiking
neuron k in a PU in the CIPAM are respectively the addition of components
of c.sub.τ (ψ), ψ=1, . . . , Ψ, and the addition of kth
rows of d.sub.τ (ψ), ψ=1, . . . , Ψ to get c.sub.τ
and d.sub.τk. The resulting sums, c.sub.τ and
d.sub.τk, can be expressed by transforming (54) and (55) respectively
into

c.sub.τ=CM({hacek over (x)}.sub.τ-{hacek over (x)}.sub.τ)
(93)

d.sub.τk=DkM({hacek over (x)}.sub.τ-{hacek over
(x)}.sub.τ) (94)

(c.sub.τ+d.sub.τk)/2 is an estimate of 2.sup.m times the total
number of times x.sub.τ and its variants have been encoded and stored
in C with the kth component r.sub.τk of r.sub.τ being 1.
c.sub.τ is an estimate of 2.sup.m times the total number of times
x.sub.τ and its variants have been encoded and stored in C.
Consequently, (d.sub.τk/c.sub.τ+1)/2 is a subjective probability
p.sub.τk that r.sub.τk is equal to 1, given x.sub.τ input to the
PU. Model spiking neuron k then uses a pseudo-random number generator to
generate a number (or "spike") 1 with probability p.sub.τk and a
number -1 (representing "no spike") with probability 1-p.sub.τk. This
pseudo-random number denoted by x {p.sub.τk} is the output
y.sub.τk of model spiking neuron k at time or numbering τ.
y.sub.τk=x {p.sub.τk} is thus a point estimate of the k-th
component r.sub.τk of the label r.sub.τ of x.sub.τ.
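The retrieval-to-spike step of paragraph [0300] can be sketched as follows; the function name and the sample values of c.sub.τ and d.sub.τk are illustrative.

```python
# Sketch of the model spiking neuron's output step: given the summed
# retrievals c_tau and d_tau_k, form the subjective probability
# p_tau_k = (d_tau_k/c_tau + 1)/2 and emit a pseudo-random "spike" 1 with
# that probability, else -1 ("no spike").

import random

def spiking_neuron_output(d_tau_k, c_tau, rng=random):
    p = (d_tau_k / c_tau + 1) / 2          # subjective probability that r_tau_k = 1
    assert 0.0 <= p <= 1.0
    return 1 if rng.random() < p else -1   # bipolar point estimate of r_tau_k

rng = random.Random(0)
outs = [spiking_neuron_output(6.0, 10.0, rng) for _ in range(10000)]
print("empirical spike rate:", outs.count(1) / len(outs))   # near p = 0.8
```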

[0301] Note that the vector

p.sub.τ=[p.sub.τ1 p.sub.τ2 . . . p.sub.τR]'

is a representation of a subjective probability distribution of the label
r.sub.τ. Note also that the outputs of the R model spiking neurons in
response to x.sub.τ form a bipolar binary vector y.sub.τ=x
{p.sub.τ}, which is a point estimate of the label r.sub.τ of the
vector x.sub.τ input to the PU, whether r.sub.τ is y.sub.τ in
unsupervised learning or z.sub.τ in supervised learning.

[0302] The CIPAM has the following computational advantage over the LOM:
while the (model) neuronal node in the LOM evaluates -2vu+v+u, the (model)
neuronal node in the CIPAM evaluates xy. The former involves two
multiplications and two additions, and the latter a single multiplication.
In a neuronal encoder with an m-dimensional input vector, there are
2.sup.m neuronal nodes. Therefore, the CIPAM is computationally much less
expensive than the LOM.

[0303] The learning masking matrices, denoted by M# (ψ), ψ=1,
. . . , Ψ, form a general masking matrix called the general learning
masking matrix

M#=diag[M#(1)M#(2) . . . M#(Ψ)] (95)

which is a diagonal matrix. The diagonal entries are numbered
consecutively from 1 to dim {hacek over
(x)}.sub.τ=Σ.sub.ψ=1.sup.Ψ 2.sup.dim
x.sub.τ.sup.(ψ), where {hacek over (x)}.sub.τ is a general
neuronal code, whose entries are also numbered consecutively from 1 to
dim {hacek over (x)}.sub.τ. The diagonal entries Mjj#, j=1,
. . . , dim {hacek over (x)}.sub.τ, of M# are called learning
masking factors.

[0304] An UPU (unsupervised processing unit) with learning masking
matrices M# (ψ), ψ=1, . . . , Ψ, generates an estimate x
{p.sub.τ} of the label of the vector x.sub.τ input to the UPU
using its masking matrices M (ψ), ψ=1, . . . , Ψ, as before.
However, when it comes to learning, the learning masking matrices M#
(ψ), ψ=1, . . . , Ψ, are used instead to generate an estimate
x {p.sub.τ#} of the label of x.sub.τ, and the general code
covariance matrix D is adjusted by

x {p.sub.τk#} is a pseudorandom number generated in accordance
with the subjective probability p.sub.τk#, and then
x {p.sub.τ#}=[x {p.sub.τ1#} . . . x
{p.sub.τR#}]', where R is the number of model spiking neurons in
the UPU.

5.13 Preprocessing Exogenous Feature Vectors for the CIPAM

[0305] Given a variable α whose value is either a real number a or a
real number b, it can be transformed into a variable x whose value is
either -1 or 1, respectively, by the following function:

x=f(α)=(2α-a-b)/(b-a) ##EQU00056##

If α is near a or near b, then x is near f (a)=-1 or near f (b)=1,
respectively. Given a vector α=[α1 α2 . . .
αm]' whose components are real numbers, a and b, the vector
α can be transformed into a bipolar binary vector x by the function

x=f(α)=[f(α1)f(α2) . . . f(αm)]'

If a component of α is near a or near b, then the corresponding
component of x is near f (a) or near f (b), respectively. Notice that the
symbol f denotes both a real-valued function of a real-valued variable
α and a vector-valued function of a vector variable α with
real components.

[0306] If the exogenous feature vectors αtex, t=1, 2, . .
. , to be processed by an artificial neural network have components that
are real numbers a and b, or values near a and b, the vectors
αtex, t=1, 2, . . . , can be converted into bipolar binary or
nearly bipolar binary vectors xtex, t=1, 2, . . . , by xtex=f
(αtex), t=1, 2, . . . . The CIPAM described hereinabove
can then be applied.
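The Subsection 5.13 map can be sketched as follows; the function name and sample endpoints are illustrative.

```python
# Sketch of the Subsection 5.13 preprocessor: map feature components valued
# at or near real numbers a, b onto bipolar values at or near -1, 1 via
# x = f(alpha) = (2*alpha - a - b)/(b - a), so f(a)=-1 and f(b)=1.

def to_bipolar(alpha, a, b):
    """Componentwise map f(alpha) = (2*alpha - a - b)/(b - a)."""
    return [(2 * x - a - b) / (b - a) for x in alpha]

print(to_bipolar([0.0, 10.0, 9.5], a=0.0, b=10.0))   # prints [-1.0, 1.0, 0.9]
```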

5.14 A General ANN Mathematically Equivalent to LOM

[0307] The mathematical equivalence of the LOM and CIPAM motivated search
for ANNs (artificial neural networks) that are mathematically equivalent
to the LOM and CIPAM. Generalizations of the affine transformations, (65)
and (66), are the following:

v=f(x)=(x-a)/(b-a) (101)

x=f-1(v)=a+(b-a)v (102)

where f-1 denotes the inverse function of f. Notice that f (a)=0 and
f (b)=1, and that f-1 (0)=a and f-1 (1)=b. Note that the affine
transformations, (101) and (102) with a=1 and b=-1, are (65) and (66).
Using (101) and (102), the LOM can be transformed into a general ANN for
a and b the way the LOM is transformed into the CIPAM for a=1 and b=-1.
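Assuming the generalized affine maps v = (x − a)/(b − a) and x = a + (b − a)v, which are consistent with the stated properties f(a)=0, f(b)=1, f-1(0)=a, f-1(1)=b, the reduction to (65)-(66) at a=1, b=-1 can be checked:

```python
# Sketch of the generalized transforms (101)-(102), under the assumption
# stated in the lead-in; with a=1, b=-1 they reduce to the LOM-CIPAM pair
# (65)-(66).

def f(x, a, b):
    return (x - a) / (b - a)        # assumed (101): f(a)=0, f(b)=1

def finv(v, a, b):
    return a + (b - a) * v          # assumed (102): f^{-1}(0)=a, f^{-1}(1)=b

a, b = 1, -1
for x in (1, -1):
    assert f(x, a, b) == (1 - x) / 2        # matches (65)
for v in (0, 1):
    assert finv(v, a, b) == 1 - 2 * v       # matches (66)
assert f(a, a, b) == 0 and f(b, a, b) == 1
print("general affine maps reduce to the LOM-CIPAM pair for a=1, b=-1")
```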

[0309] The function φ (α, β) is a general representation of
the neuronal node operation φ (v, u), where the inputs of the
neuronal node are represented by α and β, and its output is
represented by φ (α, β).

[0310] The algebraic binary operation φ (α, β) is a
commutative and associative binary operation on α and β:

φ(α,β)=φ(β,α)

φ(γ,φ(α,β))=φ(φ(γ,α),β)

Hence, we can define a symmetric function φk by applying the
binary operation repeatedly as follows:

where {hacek over (α)}k and {hacek over (v)}k denote the
k-th components of {hacek over (α)} and {hacek over (v)}
respectively. The vector on the right side is denoted by f-1 ({hacek
over (v)}) for simplicity. The vector {hacek over (α)} is called
the neuronal code of α.

[0318] 2. If ({hacek over (α)}-μI)' ({hacek over
(β)}-μI)≠0, then ({hacek over (α)}-μI)' ({hacek
over (β)}-μI)=24(α-μI)'(β-μI)-2, where the
vector I on the left side of the equality sign is 2m-dimensional,
but I on the right side of the equality sign is m-dimensional.

[0319] Based on the transformations, (101) and (102), and the above
corresponding properties between the variables α and β and the
variables v and u, the LOM (low-order model of biological neural
networks) can be transformed into a general representation of the LOM,
which is a general ANN mathematically equivalent to the LOM. In this
general ANN, components of feature vectors, "spikes and nonspikes", and
components of labels are numbers, a and b, or numbers in their
vicinities. The variables whose values are such numbers will be denoted
by αtj, βti and γti instead of vtj,
uti and wti in the following lists of formulas for the general
representation. Here αtj=f-1 (vtj),
βti=f-1 (uti), and γti=f-1
(wti).

[0320] For unsupervised covariance learning by the general model, the
formulas, (11), (12) and (13), are respectively transformed to

where αt is a vector input to the neuronal encoder; the
2.sup.m outputs, {hacek over (α)}t1, {hacek over
(α)}t2, . . . , {hacek over (α)}.sub.t,2.sup.m, from the
neuronal encoder at time (or numbering) t, pass through a synapse to
reach each of a number, say R, postsynaptic model spiking neurons and a
postsynaptic model nonspiking neuron, and the output of model spiking
neuron i is denoted by βti. βt is the vector of R
outputs βti of the R model spiking neurons. Furthermore, the
formulas, (14), (15) and (16), are respectively transformed into

where γt, t=1, 2, . . . , is a sequence of labels of the
causes (e.g., patterns) inside the receptive field of the R model spiking
neurons in the PU, and is provided from outside the general model. The
components of γt assume values from the set {a, b}.

[0321] For unsupervised accumulation learning by the general model, (21)
and (22) are respectively transformed into

[0325] The formula, (43), for the general masking matrix M for a PU in the
general model remains the same. The general expansion covariance matrix D
learned by the unsupervised covariance rule by an UPU in the general model
is defined by (41), where the code covariance matrix D (ψ) is defined
by (112) or (115), depending on whether ys-ys or ys is used. The
general expansion covariance matrix D learned by the supervised
covariance rule by an SPU in the general model is defined by (41), where
the code covariance matrix D (ψ) is defined by (118). The general
expansion covariance matrix C for the general model is defined by (42),
where the code covariance matrix C (ψ) is defined by (120).

[0326] For retrieving information from synapses by the general model, the
formulas, (44) and (45), are transformed into

d.sub.τ(ψ)=D(ψ)M(ψ)({hacek over
(α)}.sub.τ(ψ)-{hacek over (α)}.sub.τ(ψ))
(127)

c.sub.τ(ψ)=C(ψ)M(ψ)({hacek over
(α)}.sub.τ(ψ)-{hacek over (α)}.sub.τ(ψ))
(128)

for ψ=1, . . . , Ψ.

[0327] The formulas for the computations performed by a nonspiking neuron
and a spiking neuron in the general model are obtained by transforming
(54) and (55) respectively into

c.sub.τ=CM({hacek over (α)}.sub.τ-{hacek over
(α)}.sub.τ) (129)

d.sub.τk=DkM({hacek over (α)}.sub.τ-{hacek over
(α)}.sub.τ) (130)

(c.sub.τ+d.sub.τk)/2 is an estimate of 2.sup.m times the total
number of times α.sub.τ and its variants have been encoded and
stored in C with the kth component r.sub.τk of r.sub.τ being a.
c.sub.τ is an estimate of 2.sup.m times the total number of times
α.sub.τ and its variants have been encoded and stored in C.
Consequently, (d.sub.τk/c.sub.τ+1)/2 is the subjective
probability p.sub.τk that r.sub.τk is equal to a, given
α.sub.τ input to the PU. Model spiking neuron k then uses a
pseudo-random number
generator to generate a number (or "spike") a with probability
p.sub.τk and a number b (or no "spike") with probability
1-p.sub.τk. This pseudo-random number denoted by α
{p.sub.τk} is the output β.sub.τk of model spiking neuron k
at time or numbering τ. β.sub.τk=α {p.sub.τk} is
thus a point estimate of the k-th component r.sub.τk of the label
r.sub.τ of α.sub.τ input to the PU.

[0328] Note that the vector

p.sub.τ=[p.sub.τ1 p.sub.τ2 . . . p.sub.τR]'

is a representation of a subjective probability distribution of the label
r.sub.τ. Note also that the outputs of the R model spiking neurons in
response to α.sub.τ form a vector
β.sub.τ=α {p.sub.τ}, which is a point estimate of the
label r.sub.τ of α.sub.τ.

6 CONCLUSION, RAMIFICATION, AND SCOPE OF INVENTION

[0329] Many embodiments of the present invention are disclosed, which can
achieve the objects listed in the "Summary" of the present invention
disclosure. While our descriptions hereinabove contain many
specificities, these should not be construed as limitations on the scope
of the invention, but rather as an exemplification of preferred
embodiments. In addition to these embodiments, those skilled in the art
will recognize that other embodiments are possible within the teachings
of the present invention. Accordingly, the scope of the present invention
should be limited only by the appended claims and their appropriately
construed legal equivalents.