A 3D object gives rise to an infinite variety of 2D images or
views, because of the infinite number of possible poses relative to
the viewer, and because of arbitrarily different illumination
conditions. Is it possible to synthesize a module that can recognize
an object from any viewpoint, after it learns its 3D structure from a
small set of perspective views? We show that a recently proposed
network scheme for the approximation of multivariate functions
provides a key part of the solution to the problem. The results are
especially interesting as an application of a new technique for
learning from examples. They also have implications for computer
vision and possibly for understanding the process of object
recognition in natural vision.
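
The approximation scheme referred to above can be illustrated with a small radial-basis-function network. The following is a minimal sketch, not the published implementation: it assumes Gaussian basis functions centered on the stored training views, a toy orthographic projection, and a scalar output that should be 1 for views of the target object and 0 for views of a distractor. All names (`project`, `fit_rbf`, `rbf_output`, `TARGET`, `DISTRACTOR`) and the value of `sigma` are hypothetical.

```python
import math

def project(points, theta):
    """Orthographic view of 3D points rotated by theta (radians) about the vertical axis."""
    view = []
    for x, y, z in points:
        view += [x * math.cos(theta) + z * math.sin(theta), y]
    return view

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def gaussian(r2, sigma):
    return math.exp(-r2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_rbf(centers, targets, sigma):
    """One basis unit per stored view; weights interpolate the targets exactly."""
    G = [[gaussian(sq_dist(ci, cj), sigma) for cj in centers] for ci in centers]
    return solve(G, targets)

def rbf_output(view, centers, weights, sigma):
    return sum(w * gaussian(sq_dist(view, c), sigma)
               for w, c in zip(weights, centers))

# Hypothetical toy objects: three labeled 3D feature points each.
TARGET = [(1.0, 0.0, 0.5), (-0.5, 1.0, -1.0), (0.0, -1.0, 1.0)]
DISTRACTOR = [(0.5, 0.5, -1.0), (-1.0, -0.5, 0.0), (1.0, 1.0, 1.0)]

angles = [math.radians(a) for a in (-30, 0, 30)]
centers = ([project(TARGET, a) for a in angles]
           + [project(DISTRACTOR, a) for a in angles])
targets = [1.0] * 3 + [0.0] * 3
weights = fit_rbf(centers, targets, sigma=1.0)

# Response of the trained module to a view it has not seen (15 deg).
novel_response = rbf_output(project(TARGET, math.radians(15)),
                            centers, weights, sigma=1.0)
```

With one basis unit per training view, the network reproduces its targets on the stored views by construction; how well `novel_response` generalizes depends on the spacing of the views and on `sigma`, both of which are arbitrary choices here.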

In human vision, the processes and the representations involved in
identifying specific individuals are frequently assumed to be
different from those used for basic-level classification, because
classification is largely viewpoint-invariant, but identification is
not. This assumption was tested in psychophysical experiments, in
which objective similarity between stimuli (and, consequently, the
level of their distinction) varied in a controlled fashion. Subjects
were trained to discriminate between two classes of computer generated
3D objects, one resembling monkeys, and the other dogs. Both classes
were defined by the same set of 56 parameters, which encoded sizes,
shapes, and placement of the limbs, the ears, the snout, etc.
Interpolation between parameter vectors of the class prototypes
yielded shapes that changed smoothly between monkey and dog.
Within-class variation was induced in each trial by randomly
perturbing all the parameters. After the subjects reached 90% correct
performance on a fixed canonical view of each object, discrimination
performance was tested for novel views that differed by up to 60
deg from the training view. In experiment 1 (in which the
distribution of parameters in each class was unimodal) and in
experiment 2 (bimodal classes), the stimuli differed only
parametrically and consisted of the same geons (parts), yet were
recognized virtually independently of viewpoint in the low-similarity
condition. In experiment 3, the prototypes differed in their
arrangement of geons, yet the subjects' performance depended
significantly on viewpoint in the high-similarity condition. In all
three experiments, higher inter-stimulus similarity was associated
with an increase in the mean error rate and, for misorientation of up
to 45 deg, with an increase in the degree of viewpoint dependence. These results suggest that a
geon-level difference between stimuli is neither strictly necessary
nor always sufficient for viewpoint-invariant performance. Thus, basic
and subordinate-level processes in visual recognition may be more
closely related than previously thought.

How does the human visual system represent and recognize novel
three-dimensional objects? Variation in response time over different
views of objects, obtained in subordinate-level recognition tasks,
hints that objects may be represented by collections of specific
views, rather than by viewpoint-independent models. We report results
of four experiments that provide further evidence in support of the
viewpoint-specific representation hypothesis. In the first experiment
we tested the recognition of objects seen repeatedly from the same set
of viewpoints. Although the response times in this experiment became
uniform with practice, the differences in error rate for the different
views remained stable. In the second experiment, this result was
replicated in the presence of a variety of depth cues in the test
views, including binocular stereo. In the third experiment,
recognition under monocular and stereoscopic conditions was compared
over four testing sessions. In those two experiments, we found that
the addition of stereo depth reduced the mean error rate, but did not
affect the general pattern of performance over different views or
its development with practice. Finally, the fourth experiment probed
the ability of subjects to generalize recognition to unfamiliar views
of objects previously seen at a limited range of attitudes, under both
mono and stereo. The same increase in the error rate with
misorientation relative to the training attitude was obtained in the
two conditions. Taken together, these results support the notion that
3D objects are represented by multiple specific views, possibly
augmented by partial viewer-centered three-dimensional information, if
it is available through stereopsis.

Performance of human subjects in a wide variety of early visual
processing tasks improves with practice. HyperBF networks (Poggio and
Girosi, Science, 247:978-982, 1990) constitute a mathematically
well-founded framework for understanding such improvement in
performance, or perceptual learning, in the class of tasks known as
visual hyperacuity. The present article concentrates on two issues
raised by the recent psychophysical and computational findings
reported in Poggio et al., Science 256:1018-1021 (1992). First, we
develop a biologically plausible extension of the HyperBF model that
takes into account basic features of the functional architecture of
early vision. Second, we explore various learning modes that can
coexist within the HyperBF framework and focus on two unsupervised
learning rules which may be involved in hyperacuity learning. Finally,
we report results of psychophysical experiments that are consistent
with the hypothesis that activity-dependent presynaptic amplification
may be involved in perceptual learning in hyperacuity.

We explore representation of 3D objects in which several distinct 2D
views are stored for each object. We demonstrate the ability of a
two-layer network of thresholded summation units to support such
representations. Using unsupervised Hebbian relaxation, the network
learned to recognize ten objects from different viewpoints. The
training process led to the emergence of compact representations of
the specific input views. When tested on novel views of the same
objects, the network exhibited a substantial generalization
capability. In simulated psychophysical experiments, the network's
behavior was qualitatively similar to that of human subjects.

Human performance in the recognition of 3D objects, as measured by
response times and error rates, frequently depends on the
orientation of the object with respect to the observer. We
investigated the dependence of response time (RT) and error rate
(ER) on stimulus orientation for a class of random wire-like
objects. First, we found no evidence for universally valid canonical
views: the best view according to one subject's data was often
hardly recognized by other subjects. Second, a subject-by-subject
analysis showed that the RT/ER scores were not linearly dependent on
the shortest angular distance in 3D to the best view, as predicted
by the mental rotation theories of recognition. Rather, the
performance was significantly correlated with an image-plane
feature-by-feature deformation distance between the presented view and the
best (shortest-RT and lowest-ER) view. Our results suggest that
measurement of image-plane similarity to a few
(subject-specific) feature patterns is a better model than mental
rotation for the mechanism used by the human visual system to
recognize objects across changes in their 3D orientation.

The appearance of a three-dimensional object (that is, the pattern
formed by its projection onto the retina of an eye or onto the imaging
plane of a camera) depends on the point of view of the observer. The
collective human awareness of this dependence is attested to by the
widespread use of expressions that involve the metaphor of point
of view, in languages as different as English, Russian, and
Hebrew. Nevertheless, as far as recognition is concerned, the matters
of viewpoint seem to be of secondary importance: the human visual
system exhibits an impressive ability to recognize a familiar object
viewed from an unfamiliar perspective. This phenomenon has been termed
shape constancy, by analogy with other perceptual constancies.

Computational understanding of shape constancy can be gained both by
attempting to build artificial vision systems for object recognition,
and by modeling human performance in this task. Maintaining a constant
interpretation of the three-dimensional world in the face of changing
viewing conditions has long been a major goal of computer vision. The
first part of this chapter classifies and reviews several approaches
to 3D object recognition developed within this field. In the second
part of the chapter, we list the central characteristics of shape
constancy in human vision, and compare the virtues and the
shortcomings of the computational approaches, this time considered as
models of human performance. We conclude with a general discussion of
the phenomenon of shape constancy within the framework of the
computational study of perception.

It is proposed to conceive of representation as an emergent phenomenon
that is supervenient on patterns of activity of coarsely tuned and
highly redundant feature detectors. The computational underpinnings of
the outlined concept of representation are (1) the properties of
collections of overlapping graded receptive
fields, as in the biological perceptual systems that exhibit hyperacuity-level performance, and (2) the
sufficiency of a set of proximal distances between stimulus
representations for the recovery of the corresponding distal contrasts
between stimuli, as in multidimensional scaling. The present
preliminary study appears to indicate that this concept of
representation is computationally viable, and is compatible with
psychological and neurobiological data.

We consider the representational capabilities of systems of receptive
fields found in early mammalian vision, under the assumption that the
successive stages of processing remap the retinal representation space
in a manner that makes objectively similar stimuli (such as different
views of the same 3D object) closer to each other, and dissimilar
stimuli farther apart. We present theoretical analysis and
computational experiments that compare the similarity between stimuli as they are represented at
the successive levels of the processing hierarchy, from the retina to
the nonlinear cortical units. Our results indicate that the
representations at the higher levels of the hierarchy are indeed more
useful for the classification of natural objects such as human faces.

Idealized models of receptive fields (RFs) can be used as building
blocks for the creation of powerful distributed computation systems.
The present report concentrates on investigating the utility of
collections of RFs in representing 3D objects under changing viewing
conditions. The main requirement in this task is that the pattern of
activity of RFs vary as little as possible when the object and the
camera move relative to each other. I propose a method for
representing objects by RF activities, based on the observation that,
in the case of rotation around a fixed axis, differences of activities
of RFs that are properly situated with respect to that axis remain
invariant. Results of computational experiments suggest that a
representation scheme based on this algorithm for the choice of stable
pairs of RFs would perform consistently better than a scheme involving
random sets of RFs. The proposed scheme may be useful under object or
camera rotation, both for ideal Lambertian objects, and for real-world
objects such as human faces.

The maximization of diversity of neuronal response properties has been
recently suggested as an organizing principle for the formation of
such prominent features of the functional architecture of the brain as
the cortical columns and the associated patchy projection patterns. We
report a computational study of two aspects of this hypothesis. First,
we show that maximal diversity is attained when the ratio of dendritic
and axonal arbor sizes is equal to one, as has been found in many
cortical areas and across species. Second, we show that maximization
of diversity leads to better performance in two case studies: in
systems of receptive fields implementing steerable/shiftable filters,
and in matching spatially distributed signals, a problem that arises
in visual tasks such as stereopsis, motion
processing, and recognition.

An image of a face depends not only on its shape, but also on the
viewing position, illumination conditions, and facial expression.
Any face recognition system must overcome the changes in face
appearance induced by these factors. In this paper we address two
questions: how well humans can indeed generalize the recognition of
faces to novel images, and at which computational level this
generalization is performed. To answer these questions we studied
the performance of subjects in a face discrimination task, and we
compared it for upright and inverted faces. For upright faces, we
found remarkably good generalization to novel conditions (i.e., new
illumination and viewpoint). For inverted faces, the generalization
to novel views was significantly worse, although the performance on
the training images was similar in both cases.

Our results indicate that at least some of the processes that
support generalization across viewpoint and illumination are neither
universal (because subjects did not generalize as easily for
inverted faces as for upright ones), nor strictly object-specific
(because in upright faces nearly perfect generalization was possible
from a single view, by itself insufficient for building a complete
object-specific model). We propose that generalization in face
recognition occurs at an intermediate level that is applicable to a
class of objects, and that at this level upright and inverted faces
initially constitute distinct object classes.

According to the paradigmatic reconstructionist approach to vision, a
visual system must first reconstruct the world internally, then
extract from the resulting representation whatever features are
necessary for the task at hand. Recent developments in computational
vision and visual neuroscience show that many of the features needed
for tasks ranging from spatial discrimination to object recognition
can be extracted from the image directly, much as in Gibson's
hypothesis of direct perception. In the emerging synthesis between
Gibson's position and that of Marr, representation, and not
necessarily reconstruction, plays a central role. This new synthesis
seems to constitute a reasonable compromise between the extreme
version of the purposive vision credo, which, paraphrasing Brooks, is
vision without representation, and the reigning paradigm of
reconstruction without purpose.

Nonmetric multidimensional scaling (MDS) is a family of
algorithms that allow one to derive a quantitative representation of
data from a set of qualitative measurements which must satisfy certain
simple constraints. As a tool for vision, MDS combines the advantages
of both qualitative and classical approaches, by relying, on the one
hand, on an ordinal-scale input representation, and by supporting, on
the other hand, the extraction of metric information. The present
paper illustrates an application of MDS to the recovery of depth from
the rank order of binocular disparity differences for a set of points.
Our results indicate that multidimensional scaling constitutes a
promising approach to the integration of biological and computational
insights into the problem of depth perception.

How does the brain represent visual objects? In simple perceptual
generalization tasks, the human visual system performs as if it
represents the stimuli in a low-dimensional metric psychological
space. In theories of 3D shape recognition, the role of feature-space
representations (as opposed to structural or pictorial descriptions)
has been for a long time a major point of contention. If shapes are
indeed represented as points in a feature space, patterns of perceived
similarity among different objects must reflect the structure of this
space. The feature space hypothesis can then be tested by presenting
subjects with complex parameterized 3D shapes, and by relating the
similarities among subjective representations, as revealed in the
response data by multidimensional scaling, to the objective
parameterization of the stimuli. The results of four such tests,
accompanied by computational simulations, support the notion that
discrimination among 3D objects may rely on a low-dimensional feature
space representation, and suggest that this space may be spanned by
explicitly encoded class prototypes.

The computational building blocks of biological
information processing systems are highly interconnected networks
of simple units with graded overlapping receptive fields, arranged
in maps. In view of this basic constraint, it is proposed that the
present stage in the study of cognition should concentrate on
gaining understanding of the cognitive system at the level of the
distributed computational mechanism. The model of script
understanding introduced in the target book ["Subsymbolic Natural Language
Processing", Risto Miikkulainen, Cambridge, MA: MIT Press, 1993] appears
promising, both because it treats seriously the question of architecture of
the language processor, and because its architectural features
resemble those used in modeling other cognitive modalities such as
vision.

We describe a computational model of face recognition, which
generalizes from single views of faces, by taking advantage of prior
experience with other faces, seen under a wider range of viewing
conditions. The model represents face images by vectors of
activities of graded overlapping receptive fields
(RFs). It relies on high spatial frequency
information to estimate the viewing conditions, which are then used
to normalize (via a transformation specific for faces), and
identify, the low spatial frequency representation of the input. The
class-specific transformation approach allows the model to replicate
a series of
psychophysical findings on face recognition,
and constitutes an advance over current
face recognition methods, which are incapable of generalization from
a single example.

Using a small number of prototypical reference objects to span the
internal shape representation space has been suggested as a general
approach to the problem of object representation in vision.
We have investigated the ability of human
subjects to form the low-dimensional metric shape representation
space predicted by this approach. In each of a series of
experiments, which involved pairwise similarity judgment, and
delayed match to sample, subjects were confronted with several
classes of computer-rendered 3D animal-like shapes, arranged in a
complex pattern in a common high-dimensional parameter space. We
combined response time and error rate data into a measure of view
similarity, and submitted the resulting proximity matrix to
nonmetric multidimensional scaling (MDS). In the two-dimensional MDS
solution, views of the same shape were invariably clustered
together, and, in each experiment, the relative geometrical
arrangement of the view clusters of the different objects reflected
the true low-dimensional structure in parameter space (star,
triangle, square, line) that defined the relationships between the
stimulus classes. These findings are now used to guide the
development of a detailed computational theory of shape vision based
on similarity to prototypes.

A representational scheme under which the ranking between
represented dissimilarities is isomorphic to the ranking between the
corresponding shape dissimilarities can support perfect shape
classification, because it preserves the clustering of shapes
according to the natural kinds prevailing in the external world. We
discuss the computational requirements of rank-preserving
representation, and examine its plausibility within a
prototype-based framework of shape vision.
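
The claim that rank preservation suffices for classification can be illustrated with a toy nearest-neighbor classifier. This is a minimal sketch under stated assumptions: shapes are stand-in 2D points, `euclid` plays the role of the distal dissimilarity, and `warped` is a hypothetical internal dissimilarity that distorts magnitudes but is strictly increasing in the distal one, so all rankings are preserved. The function names and the data are illustrative only (`math.dist` requires Python 3.8 or later).

```python
import math

def euclid(a, b):
    """Stand-in for the distal (objective) dissimilarity between shapes."""
    return math.dist(a, b)

def warped(a, b):
    """Hypothetical represented dissimilarity: a strictly increasing
    distortion of the distal one, so only the rank order is preserved."""
    return math.log1p(math.dist(a, b) ** 3)

def nearest_label(query, examples, dissimilarity):
    """Label of the stored example least dissimilar to the query."""
    return min(examples, key=lambda ex: dissimilarity(query, ex[0]))[1]

examples = [((0.0, 0.0), "cat"), ((1.0, 0.0), "cat"),
            ((5.0, 5.0), "dog"), ((6.0, 5.0), "dog")]
queries = [(0.2, 0.1), (4.0, 4.0), (2.0, 1.0), (5.5, 6.0)]

labels_distal = [nearest_label(q, examples, euclid) for q in queries]
labels_internal = [nearest_label(q, examples, warped) for q in queries]
```

Because `warped` never reorders any pair of dissimilarities, `labels_internal` matches `labels_distal` on these queries even though the represented magnitudes are far from veridical.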

A theory of representation is incomplete if it states
"representations are X" where X can be symbols, cell assemblies,
functional states, or the flock of birds from Theaetetus,
without explaining the nature of the link between the universe of
X's and the world. Amit's thesis, equating representations with
reverberations in Hebbian cell assemblies, will only be considered a
solution to the problem of representation when it is complemented by
a theory of how a reverberation in the brain can be a representation
of anything.

We consider training classifiers for multiple tasks as a method for
improving generalization and obtaining a better low-dimensional representation.
To that end, we introduce a hybrid training methodology for MLP networks;
the utility of the hidden-unit representation is assessed by embedding it into a
2D space using multidimensional scaling. The proposed methodology is tested
on a highly nonlinear image classification task.

Many of the lower-level areas in the mammalian visual system are
organized retinotopically, that is, as maps which preserve to a
certain degree the topography of the retina. A unit that is a part
of such a retinotopic map normally responds selectively to
stimulation in a well-delimited part of the visual field, referred
to as its receptive field (RF). Receptive fields are probably
the most prominent and ubiquitous computational mechanism employed
by biological information processing systems. This paper surveys
some of the possible computational reasons behind the ubiquity of
RFs, by discussing examples of RF-based solutions to problems in
vision, from spatial acuity, through sensory coding, to object
recognition.

A representational scheme under which the ranking between
represented similarities is isomorphic to the ranking between the
corresponding shape similarities can support perfectly correct shape
classification, because it preserves the clustering of shapes
according to the natural kinds prevailing in the external world.
This note discusses the computational requirements of representation
that preserves similarity ranks, and points out the
straightforwardness of its connectionist implementation.

Does the human brain represent objects for recognition by storing a
series of two-dimensional snapshots, or are the object models, in some
sense, three-dimensional analogs of the objects they represent? One
way to address this question is to explore the ability of the human
visual system to generalize recognition from familiar to novel views
of three-dimensional objects. Three recently proposed theories of
object recognition, namely viewpoint normalization or alignment of 3D
models (Ullman, 1989), linear combination of 2D views (Ullman and Basri,
1991), and nonlinear view interpolation (Poggio and Edelman, 1990),
predict different patterns of generalization to novel views. We have
exploited the conflicting predictions to test the three theories
directly, in a psychophysical experiment involving computer-generated
wire-like objects. Our results suggest that the human
visual system is better described as recognizing these objects by nonlinear
2D view interpolation than by alignment or other methods that rely on
object-centered 3D models.

We present a unified approach to visual representation, addressing
both the needs of superordinate and basic-level categorization and
of identification of specific instances of familiar categories.
According to the proposed theory, a shape is represented by its
similarity to a number of reference shapes, measured in a
high-dimensional space of elementary features. This amounts to
embedding the stimulus in a low-dimensional proximal shape space.
That space turns out to support representation of distal shape
similarities which is veridical in the sense of Shepard's (1968)
notion of second-order isomorphism (i.e., correspondence between
distal and proximal similarities among shapes, rather than between
distal shapes and their proximal representations). Furthermore, a
general expression for similarity between two stimuli, based on
comparisons to reference shapes, can be used to derive models of
perceived similarity ranging from continuous, symmetric, and
hierarchical, as in the multidimensional scaling models
(R. N. Shepard, 1980), to discrete and non-hierarchical, as in the
general contrast models (A. Tversky, 1977; R. N. Shepard and P. Arabie, 1979).
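
The idea of representing a shape by its similarities to reference shapes can be sketched as follows. This is an illustration under assumptions, not the model described above: the Gaussian `similarity` function, the six-dimensional feature vectors, and the three `references` are all hypothetical. The point is only that a stimulus measured in a higher-dimensional feature space ends up as a short vector with one coordinate per reference shape.

```python
import math

def similarity(a, b, sigma=1.0):
    """Assumed similarity: a Gaussian of the feature-space distance."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2))

def embed(stimulus, references, sigma=1.0):
    """Proximal representation: one similarity value per reference shape."""
    return [similarity(stimulus, r, sigma) for r in references]

# Hypothetical reference shapes in a 6-dimensional feature space.
references = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
]

# A stimulus that lies close to the first reference shape.
stimulus = [0.9, 0.1, 0.0, 0.0, 0.1, 0.0]
proximal = embed(stimulus, references)  # a 3-dimensional description
```

Here the dimensionality of the proximal space equals the number of reference shapes, regardless of how many elementary features are measured.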

We describe a method for automatic word sense disambiguation using a
text corpus and a machine-readable dictionary (MRD). The method is
based on word similarity and context similarity measures. Words are
considered similar if they appear in similar contexts; contexts are
similar if they contain similar words. The circularity of this
definition is resolved by an iterative, converging process, in which
the system learns from the corpus a set of typical usages for each
of the senses of the polysemous word listed in the MRD. A new
instance of a polysemous word is assigned the sense associated with
the typical usage most similar to its context. Experiments show
that this method can learn even from very sparse training data,
achieving over 92% correct disambiguation performance.

Intelligent systems are faced with the problem
of securing a principled (ideally, veridical) relationship between
the world and its internal representation. I propose a unified
approach to visual representation, addressing both the needs of
superordinate and basic-level categorization and of identification
of specific instances of familiar categories. According to the
proposed theory, a shape is represented by its similarity to a
number of reference shapes, measured in a high-dimensional space of
elementary features. This amounts to embedding the stimulus in a
low-dimensional proximal shape space. That space turns out to
support representation of distal shape similarities which is
veridical in the sense of Shepard's (1968) notion of second-order
isomorphism (i.e., correspondence between distal and proximal
similarities among shapes, rather than between distal shapes and
their proximal representations). Furthermore, a general expression
for similarity between two stimuli, based on comparisons to
reference shapes, can be used to derive models of perceived
similarity ranging from continuous, symmetric, and hierarchical, as
in the multidimensional scaling models (Shepard, 1980), to discrete
and non-hierarchical, as in the general contrast models
(Tversky, 1977; Shepard and Arabie, 1979).

We report results from perceptual judgment, delayed matching to
sample, and long-term memory recall experiments, which indicate that
the human visual system can support metrically veridical
representations of similarities among 3D objects. In all the
experiments, animal-like computer-rendered stimuli formed regular
planar configurations in a common 70-dimensional parameter space.
These configurations were fully recovered by multidimensional
scaling from proximity tables derived from the subject data. This is
possible if shapes are encoded by their similarities to a number of
reference (prototypical) shapes (as in the computational model that
accompanies the psychophysical data), but not if the system stores
merely the distinctive features of the objects, or their structural
descriptions (which were the same for all the stimuli).
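
Recovering a planar configuration from a proximity table can be sketched with multidimensional scaling. The example below is an assumption-laden illustration: it uses the classical (Torgerson) metric variant of MDS rather than the nonmetric procedure, exact distances rather than subject data, and a plain power iteration to find the leading eigenvectors. The function `classical_mds` and the point set `true_points` are hypothetical.

```python
import math

def classical_mds(D, dims=2, iters=500):
    """Embed n items in `dims` dimensions from a symmetric n x n distance table."""
    n = len(D)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration from a fixed start vector (deterministic).
        v = [math.sin(i + 1 + k) for i in range(n)]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm_v2 = sum(x * x for x in v)
        lam = sum(v[i] * Bv[i] for i in range(n)) / norm_v2  # Rayleigh quotient
        scale = math.sqrt(max(lam, 0.0) / norm_v2)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j] / norm_v2
    return coords

# Hypothetical planar configuration standing in for the stimulus layout.
true_points = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0), (2.0, 3.0)]
D = [[math.dist(p, q) for q in true_points] for p in true_points]
recovered = classical_mds(D)
```

The recovered coordinates agree with the original layout only up to rotation and reflection, so the meaningful comparison is between inter-point distances, not between raw coordinates.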

To explore the nature of the representation space of 3D objects, we
studied human performance in forced-choice classification of objects
composed of four geon-like parts, emanating from a common center. The
two class prototypes were distinguished by qualitative contrasts
(bulging vs. waist-like limbs). Subjects were trained to discriminate
between the two prototypes (shown briefly, from a number of
viewpoints, in stereo) in a 1-interval forced-choice task, until they
reached a 90% correct-response performance level. In the first
experiment, 11 subjects were tested on shapes obtained by varying the
prototypical parameters both orthogonally (Ortho) and in
parallel (Para) to the line connecting the prototypes in the
parameter space. For the eight subjects who performed above chance,
the error rate increased with the Ortho parameter-space
displacement between the stimulus and the corresponding prototype (the
effect of the Para displacement was marginal). Clearly, the
parameter-space location of the stimuli mattered more than the
qualitative contrasts (which were always present). To find out
whether both prototypes or just the nearest neighbor of the test shape
influenced the decision, in the second experiment we tested 18 new
subjects on a fixed set of shapes, while the test-stage distance
between the two classes assumed one of three values (Far, Intermediate,
and Near). For the 13 subjects who performed
above chance, the error rate (on physically identical stimuli) in the
Near condition was higher than in the other two conditions. The
results of the two experiments contradict the prediction of theories
that postulate exclusive reliance on qualitative contrasts, and
support the notion of a metric representation space, with the
subjects' performance determined by distances to more than one
reference point or prototype.

Learning to recognize visual objects from examples requires the
ability to find meaningful patterns in spaces of very high
dimensionality. We present a method for dimensionality reduction
which effectively biases the learning system by combining multiple
constraints via an extensive use of class labels. The use of
multiple class labels steers the resulting low-dimensional
representation to become invariant to those directions of variation
in the input space that are irrelevant to classification; this is
done merely by making class labels independent of these directions.
We also show that prior knowledge of the proper dimensionality of
the target representation can be imposed by training a
multiple-layer bottleneck network. A series of computational
experiments involving parameterized fractal images and real human
faces indicates that the low-dimensional representation extracted by
our method leads to improved generalization in the learned tasks,
and is likely to preserve the topology of the original space.

Psychophysical findings accumulated over the past several decades
indicate that perceptual tasks such as similarity judgment tend to be
performed on a low-dimensional representation of the sensory data. Low
dimensionality is especially important for learning, as the number of
examples required for attaining a given level of performance grows
exponentially with the dimensionality of the underlying representation
space. In this chapter, we argue that, whereas many perceptual
problems are tractable precisely because their intrinsic
dimensionality is low, the raw dimensionality of the sensory data is
normally high, and must be reduced by a nontrivial computational
process, which, in itself, may involve learning. Following a survey of
computational techniques for dimensionality reduction, we show that it
is possible to learn a low-dimensional representation that captures
the intrinsic low-dimensional nature of certain classes of visual
objects, thereby facilitating further learning of tasks involving
those objects.

Nearest-neighbor correlation-based similarity computation in the space
of outputs of complex-type receptive fields can support robust
recognition of 3D objects. Our experiments with four collections of
objects resulted in mean recognition rates between 84% (for
subordinate-level discrimination among 15 quadruped animal shapes) and
94% (for basic-level recognition of 20 everyday objects), over a
40 deg x 40 deg range of viewpoints, centered on a
stored canonical view and related to it by rotations in depth
(comparable figures were obtained for image-plane translations). This
result has interesting implications for the design of a front end to
an artificial object recognition system, and for the understanding of
the faculty of object recognition in primate vision.
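
A minimal sketch of nearest-neighbor recognition by correlation is given below. It is not the implementation used in the experiments: the random vectors stand in for the outputs of complex-type receptive fields, and the labels and the names `correlation` and `recognize` are hypothetical.

```python
import numpy as np

def correlation(a, b):
    """Pearson correlation between two 1D feature vectors."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def recognize(probe, stored_views, labels):
    """Return the label of the stored view most correlated with the probe."""
    scores = [correlation(probe, v) for v in stored_views]
    return labels[int(np.argmax(scores))]

# Hypothetical data: one stored view per object, 50 "receptive field" outputs each.
rng = np.random.default_rng(0)
stored = rng.normal(size=(3, 50))
labels = ["cow", "chair", "cup"]

# A probe that is a slightly perturbed version of the second stored view.
probe = stored[1] + 0.1 * rng.normal(size=50)
result = recognize(probe, stored, labels)
```

In this toy setting the probe is assigned to the object whose stored view it was derived from; real views related by rotation in depth would of course differ from the stored view in a less trivial way.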

The positional specificity of short-term visual memory for a variety
of 3D shapes was investigated in a series of same-different
discrimination experiments, using computer-rendered stimuli
displayed either at the same or at different locations in the visual
field. For animal-like shapes, we found complete translation
invariance, regardless of the inter-stimulus similarity, and
irrespective of direction and size of the displacement
(experiments 1 and 2). Invariance to translation was obtained also
with animal-like stimuli that had been ``scrambled'' by randomizing
the relative locations of their parts (experiment 3). The
invariance broke down when the stimuli were made to differ in their
composition, but not in the shapes of the corresponding parts
(experiments 4 and 5). We interpret this pattern of findings in the
context of several current theories of recognition, focusing in
particular on the issue of the representation of object structure.

Visual categorization, or making sense of novel shapes and shape
classes, is a computationally challenging and behaviorally important
task, which is not widely addressed in computer vision or visual
psychophysics (where the stress is rather on the generalization of
recognition across changes of viewpoint). This paper examines the
categorization abilities of four current approaches to object
representation: structural descriptions, geometric models,
multidimensional feature spaces, and similarities to reference
shapes. It is proposed that a scheme combining features of all four
approaches is a promising candidate for a comprehensive and
computationally feasible theory of categorization.

One of the difficulties of object recognition
stems from the need to overcome the variability in object
appearance caused by factors such as illumination and pose. The
influence of these factors can be countered by learning to
interpolate between stored views of the target object, taken under
representative combinations of viewing conditions. Difficulties of
another kind arise in daily life situations that require
categorization, rather than recognition, of objects. We show that,
although categorization cannot rely on interpolation between
stored examples, knowledge of several representative members, or
prototypes, of each of the categories of interest can still
provide the necessary computational substrate for the
categorization of new instances. The resulting representational
scheme based on similarities to prototypes is computationally
viable, and is readily mapped onto the mechanisms of biological
vision revealed by recent psychophysical and physiological
studies.

Proc. Edinburgh
Workshop on Similarity and Categorization, 75-81, November 1997.

Visual objects can be represented by their similarities to a small
number of reference shapes or prototypes. This method yields
low-dimensional (and therefore computationally tractable)
representations, which support both the recognition of familiar
shapes and the categorization of novel ones. In this note, we show
how such representations can be used in a variety of tasks involving
novel objects: viewpoint-invariant recognition, recovery of a
canonical view, estimation of pose, and prediction of an arbitrary
view. The unifying principle in all these cases is the
representation of the view space of the novel object as an
interpolation of the view spaces of the reference shapes.
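
The idea of representing a shape by its similarities to reference shapes can be sketched as follows. The Gaussian similarity function, the width `sigma`, and the random vectors standing in for shape measurements are all assumptions made for illustration; the note does not commit to them here.

```python
import numpy as np

def prototype_similarities(x, prototypes, sigma=1.0):
    """Map a high-dimensional measurement x to a short vector of
    Gaussian similarities, one per prototype (assumed similarity measure)."""
    d2 = np.sum((prototypes - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
prototypes = rng.normal(size=(5, 200))  # 5 reference shapes, 200 raw dimensions
novel = rng.normal(size=200)            # a novel shape in the same raw space

# The novel shape is now described by 5 numbers instead of 200.
rep = prototype_similarities(novel, prototypes, sigma=20.0)
```

The dimensionality of the resulting representation equals the number of prototypes, regardless of the dimensionality of the raw measurements.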

Theories of object representation can be classified as structural,
holistic or hybrid, depending on their approach to the mereology and
compositionality of shapes. We tested the predictions of some of the
current theories in three experiments, by quantifying the effects of
various priming cues on response times to 3D objects. In
experiment 1, there were two possible locations for the stimulus
components: left-right and top-bottom. The prime could be identical
to the stimulus, identical in location but with different parts,
identical in the complement of differently located parts, or
altogether different. Both location and part identity effects were
significant. In experiment 2, we added a part-neutral (empty frame)
prime condition; the effect of location, but not of part, remained
significant. In experiment 3, which included an additional
location-neutral prime condition, only the location effect, again,
was significant. These findings are not entirely compatible either
with the structural description theories of representation (which
predict priming by ``disembodied'' parts or geons) or with the
holistic theories (which do not predict priming by ``shapeless''
location on its own). They may be interpreted in terms of a hybrid
theory, according to which conjunctions of shape and location are
explicitly represented, and therefore amenable to priming.

The ability to deal with object structure --- to determine what is
where in a given object, rather than merely to categorize or identify
it --- has been hitherto considered the prerogative of ``structural
description'' approaches, which represent shapes as categorical
compositions of generic parts taken from a small alphabet. In this
note, we propose a simple extension to a theoretically motivated and
extensively tested appearance-based model of recognition and
categorization, which should make it capable of representing object
structure. We describe a pilot implementation of the extended model,
survey independent evidence supporting its modus operandi, and
outline a research program focused on achieving a range of object
processing capabilities, including reasoning about structure, within a
unified appearance-based framework.

Reports of columnar organization of macaque inferotemporal cortex
(Tanaka, 1992; Tanaka, 1993) indicate that ensembles of cells
responding to particular objects may be both sufficiently extensive
and properly localized to allow their detection and discrimination by
means of functional magnetic resonance imaging (fMRI). A recently
developed theory of object representation by ensembles of coarsely
tuned units (Edelman and Duvdevani-Bar, 1997; Edelman, 1998) and its
implementation as a computer model of recognition and categorization
(Cutzu and Edelman, 1998; Edelman and Duvdevani-Bar, 1997) provide a
computational framework in which such findings can be interpreted in a
straightforward fashion. Taken together, these developments in the
study of object representation and recognition suggest that direct
visualization of the internal representations may be easier than
previously thought. In this paper, we show how fMRI techniques can
be used to investigate the internal representation of objects in
human visual cortex. Our initial results reveal that the activation
of most voxels in object-related areas remains unaffected by a
coarse scrambling of the natural images used as stimuli, and that a
map of the representation space of object categories in individual
subjects can be derived from the distributed pattern of voxel
activation in those areas.

We describe a unified framework for the understanding of structure
representation in primate vision. A model derived from this framework is
shown to be effectively systematic in that it has the ability to interpret
and associate together objects that are related through a rearrangement of
common ``middle-scale'' parts, represented as image fragments. The model
addresses the same concerns as previous work on compositional
representation through the use of what+where receptive fields and
attentional gain modulation. It does not require prior exposure to the
individual parts, and avoids the need for abstract symbolic binding.

The paper outlines a computational approach to face representation
and recognition, inspired by two major features of biological
perceptual systems: graded-profile overlapping receptive fields, and
object-specific responses in the higher visual areas. This approach,
according to which a face is ultimately represented by its
similarities to a number of reference faces, led to the development
of a comprehensive theory of object representation in biological
vision, and to its subsequent psychophysical exploration and
computational modeling.

To find out how the representations of structured visual objects
depend on the co-occurrence statistics of their constituents, we
exposed subjects to a set of composite images with tight control
exerted over (1) the conditional probabilities of the constituent
fragments, and (2) the value of Barlow's criterion of ``suspicious
coincidence'' (the ratio of joint probability to the product of
marginals). We then compared the part verification response times for
various probe/target combinations before and after the exposure. For
composite probes, the speedup was much larger for targets that
contained pairs of fragments perfectly predictive of each other,
compared to those that did not. This effect was modulated by the
significance of their co-occurrence as estimated by Barlow's
criterion. For lone-fragment probes, the speedup in all conditions was
generally lower than for composites. These results shed light on the
brain's strategies for unsupervised acquisition of structural
information in vision.
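
The two statistics controlled in the study can be computed as in the sketch below, which assumes that each composite image is given as a set of fragment labels. The toy image set and the function names are hypothetical and are not the stimuli or code used in the experiment.

```python
def probability(images, *fragments):
    """Fraction of images that contain all of the given fragments."""
    hits = sum(all(f in img for f in fragments) for img in images)
    return hits / len(images)

def conditional(images, a, b):
    """Estimate P(b | a)."""
    return probability(images, a, b) / probability(images, a)

def suspicious_coincidence(images, a, b):
    """Barlow's ratio: P(a, b) / (P(a) * P(b))."""
    return probability(images, a, b) / (
        probability(images, a) * probability(images, b)
    )

# Toy image set: A and B always occur together; C occurs with D only half the time.
images = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"C", "E"}]

cond_ab = conditional(images, "A", "B")               # A perfectly predicts B
cond_cd = conditional(images, "C", "D")               # C predicts D half the time
ratio_ab = suspicious_coincidence(images, "A", "B")
ratio_cd = suspicious_coincidence(images, "C", "D")
```

In this toy set the two pairs differ in conditional probability (1.0 versus 0.5) while sharing the same ratio (2.0), which shows that the two quantities can be manipulated separately.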

Understanding the perception of all but the most impoverished and
artificial scenes presents a different (and likely far greater)
kind of challenge than understanding face recognition, reading, or
identification (or even categorization) of standalone objects. This
article surveys central issues in the interpretation of structured
objects and scenes (starting with basics, such as the meaning of
seeing), and outlines a theoretical approach to this formidable task,
motivated by some recent developments in neuroscience and
neurophilosophy.

The problem of representing the spatial structure of images, which
arises in visual object processing, is commonly described using
terminology borrowed from propositional theories of cognition,
notably, the concept of compositionality. The classical propositional
stance mandates representations composed of symbols, which stand for
atomic or composite entities and enter into arbitrarily nested
relationships. We argue that the main desiderata of a representational
system --- productivity and systematicity --- can (indeed, for a
number of reasons, should) be achieved without recourse to the
classical, proposition-like compositionality. We show how this can be
done, by describing a systematic and productive model of the
representation of visual structure, which relies on static rather than
dynamic binding and uses coarsely coded rather than atomic shape
primitives.

The principle of complementary distributions (Harris, 1954; Harris, 1991),
according to which morphemes that occur in identical contexts belong,
in some sense, to the same category, has been advanced as a means for
extracting syntactic structures from corpus data. We extend this
principle by applying it recursively, and by using mutual information
for estimating category coherence. The resulting model learns, in an
unsupervised fashion, highly structured, distributed representations
of syntactic knowledge from corpora. It also exhibits promising
behavior in tasks usually thought to require representations anchored
in a grammar, such as systematicity.

To learn a visual code in an unsupervised manner, one may attempt to
capture those features of the stimulus set that would contribute
significantly to a statistically efficient representation (as
dictated, e.g., by the Minimum Description Length principle).
Paradoxically, all the candidate features in this approach need to be
known before statistics over them can be computed. This paradox may
be circumvented by confining the repertoire of candidate features to
actual scene fragments, which resemble the ``what+where'' receptive
fields found in the ventral visual stream in primates. We describe a
single-layer network that learns such fragments from unsegmented raw
images of structured objects. The learning method combines fast
imprinting in the feedforward stream with lateral interactions to
achieve single-epoch unsupervised acquisition of spatially localized
features that can support systematic treatment of structured objects.

We describe a pattern acquisition algorithm that learns, in an
unsupervised fashion, a streamlined representation of linguistic
structures from a plain natural-language corpus. This paper addresses
the issues of learning structured knowledge from a large-scale natural
language data set, and of generalization to unseen text. The
implemented algorithm represents sentences as paths on a graph whose
vertices are words (or parts of words). Significant patterns,
determined by recursive context-sensitive statistical inference, form
new vertices. Linguistic constructions are represented by trees
composed of significant patterns and their associated equivalence
classes. An input module allows the algorithm to be subjected to a
standard test of English as a Second Language (ESL) proficiency. The
results are encouraging: the model attains a level of performance
considered to be ``intermediate'' for 9th-grade students, despite
having been trained on a corpus (CHILDES) containing transcribed
speech of parents directed to small children.
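
The graph representation described above can be sketched as follows. This covers only the data structure (sentences as paths over word vertices), not the pattern-significance test or the formation of equivalence classes; the begin/end markers, the whitespace tokenization, and the name `build_graph` are assumptions made for illustration.

```python
from collections import defaultdict

def build_graph(sentences):
    """Represent each sentence as a path over word vertices and count
    how many sentence paths traverse each directed edge."""
    edge_counts = defaultdict(int)
    paths = []
    for sentence in sentences:
        # Hypothetical begin/end markers delimit every path.
        path = ["<begin>"] + sentence.lower().split() + ["<end>"]
        paths.append(path)
        for u, v in zip(path, path[1:]):
            edge_counts[(u, v)] += 1
    return edge_counts, paths

corpus = ["the cat sat", "the dog sat", "the cat ran"]
edge_counts, paths = build_graph(corpus)
```

Edges shared by many paths, such as the one from "the" to "cat" here, are the raw material from which candidate patterns would be drawn.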

We compare our model of unsupervised learning of linguistic
structures, ADIOS (Solan et al., NIPS'03), to some recent work in
computational linguistics and in grammar theory. Our approach
resembles the Construction Grammar in its general philosophy (e.g., in
its reliance on structural generalizations rather than on syntax
projected by the lexicon, as in the current generative theories), and
the Tree Adjoining Grammar in its computational characteristics (e.g.,
in its apparent affinity with Mildly Context Sensitive Languages).
The representations learned by our algorithm are truly emergent from
the (unannotated) corpus data, whereas those found in published works
on cognitive and construction grammars and on TAGs are hand-tailored.
Thus, our results complement and extend both the computational and the
more linguistically oriented research into language acquisition. We
conclude by suggesting how empirical and formal study of language can
be best integrated.

Computer vision systems are, on most counts, poor performers, when
compared to their biological counterparts. The reason for this may
be that computer vision is handicapped by an unreasonable assumption
regarding what it means to see, which became prevalent as the
notions of intrinsic images and of representation by reconstruction
took over the field in the late 1970s. Learning from biological
vision may help us to overcome this handicap.

We examined the role of fitness, commonly assumed without proof to be
conferred by the mastery of language, in shaping the dynamics of
language evolution. To that end, we introduced island migration (a
concept borrowed from population genetics) into the shared lexicon
model of communication (Hurford, 1989; Nowak, 1999). The effect of
fitness in language coherence was compared to a control condition of
neutral drift. We found that in the neutral condition (no
coherence-dependent fitness) even a small migration rate -- less than
1% -- suffices for one language to become dominant, albeit after a
long time. In comparison, when fitness-based selection is introduced,
the subpopulations stabilize quite rapidly to form several distinct
languages. Our findings support the notion that language confers
increased fitness. The possibility that a shared language evolved as
a result of neutral drift appears less likely, unless migration rates
over evolutionary times were extremely small.

We address the problem, fundamental to linguistics, bioinformatics and
certain other disciplines, of using corpora of raw symbolic sequential data
to infer underlying rules that govern their production. Given a corpus of
strings (such as text, transcribed speech, chromosome or protein sequence
data, sheet music, etc.), our unsupervised algorithm recursively distills
from it hierarchically structured patterns. The ADIOS (Automatic
DIstillation of Structure) algorithm relies on a statistical method for
pattern extraction and on structured generalization, two processes that
have been implicated in language acquisition. It has been evaluated on
artificial context-free grammars with thousands of rules, on natural
languages as diverse as English and Chinese, and on protein data
correlating sequence with function. This is the first time an unsupervised
algorithm has been shown capable of learning complex syntax, generating
grammatical novel sentences, and proving useful in other fields that call
for structure discovery from raw data, such as bioinformatics.

One of the greatest challenges facing the cognitive sciences is to
explain what it means to know a language, and how the knowledge of
language is acquired. The dominant approach to this challenge within
linguistics has been to seek an efficient characterization of the
wealth of documented structural properties of language in terms of a
compact generative grammar: ideally, the minimal necessary set of
innate, universal, exception-less, highly abstract rules that
jointly generate all and only the observed phenomena and are common
to all human languages. We review developmental, behavioral, and
computational evidence that seems to favor an alternative view of
language, according to which linguistic structures are generated by
a large, open set of constructions of varying degrees of abstraction
and complexity, which embody both form and meaning and are acquired
through socially situated experience in a given language community,
by probabilistic learning algorithms that resemble those at work in
other cognitive modalities.

Reverse-engineering the brain involves adopting and testing a
hierarchy of working hypotheses regarding the computational problems
that it solves, the representations and algorithms that it employs,
and the manner in which these are implemented. Because problem-level
assumptions set the course for the entire research program, it is
particularly important to be open to the possibility that we have
them wrong, but tacit algorithm- and implementation-level hypotheses
can also benefit from occasional scrutiny. The present paper focuses
on the extent to which our computational understanding of how the
brain works is shaped by three such rarely discussed assumptions,
which span the levels of Marr's hierarchy: (i) that animal behavior
amounts to a series of stimulus/response bouts, (ii) that learning
can be adequately modeled as being driven by the optimization of a
fixed objective function, and (iii) that massively parallel,
uniformly connected layered or recurrent network architectures
suffice to support learning and behavior. In comparison, a more
realistic approach acknowledges that animal behavior in the wild is
characterized by dynamically branching serial order and is often
agentic rather than reactive. Arguably, such behavior calls for
open-ended learning of world structure and may require a neural
architecture that includes precisely wired circuits reflecting the
serial and branching structure of behavioral tasks.