Speech recognition remains a daunting challenge for computer programmers, partly because the continuous speech stream is highly under-determined. Take coarticulation, for example: the acoustic frequencies corresponding to a given speech sound (phoneme) are strongly influenced by the sounds both preceding and following it. This is sometimes interpreted to mean that there is no invariant set of purely auditory characteristics defining any given phoneme. It is therefore difficult to recover the words that a person is saying, since each part of a word is influenced by the sounds surrounding it (and so on, ad infinitum).

One class of speech perception theories (the “motor theories”) proposes that this computational problem is circumvented by the brain in a very interesting way: the incoming speech stream is actually simulated by the motor system, and thus the perceiver can, through some reverse translation process, recover what the intended articulatory components of the speech stream were.

There’s at least some interesting evidence to support these theories, as reviewed by Westermann and Miranda in their recent Brain and Language paper, including the fact that deaf or tracheotomized infants don’t show normal babbling. This indicates a tight coupling between speech gestures and speech perception.

Westermann & Miranda report their efforts to simulate the motor theory of speech perception in a computational neural network model of development, which consists of two processing layers: a motor layer and a perception layer.

The model can operate in two modes: listening, or listening+babbling. The perception layer consists of units that receive input from the first two peak frequencies of an incoming sound. Each unit's connection weights to those frequencies are centered at a random point in the potential input space, and its activation is calculated according to a Gaussian function of the input's distance from that center. Similarly, each unit in the motor layer is tuned to a particular random combination of parameters in a realistic speech synthesizer, with its activation falling off according to a Gaussian function of distance. Critically, these layers are bidirectionally connected to one another.
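To make the Gaussian tuning concrete, here is a minimal sketch of a perception layer of this kind. It is not the authors' code: the number of units, the tuning width, the normalization of the two peak frequencies to the range [0, 1], and all function names are assumptions made for illustration.

```python
import math
import random

def gaussian_activation(inputs, center, width=0.2):
    """Activation is 1.0 at the unit's center and falls off as a
    Gaussian of the Euclidean distance from it (width is an assumed value)."""
    sq_dist = sum((x - c) ** 2 for x, c in zip(inputs, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def make_layer(n_units, n_dims, rng):
    """Scatter unit centers at random positions in a normalized [0, 1] space."""
    return [[rng.random() for _ in range(n_dims)] for _ in range(n_units)]

rng = random.Random(0)

# Hypothetical perception layer: 50 units tuned over two input dimensions,
# standing in for the first two peak frequencies of a sound.
perception_layer = make_layer(n_units=50, n_dims=2, rng=rng)

# A made-up incoming sound, already normalized to [0, 1] on each dimension.
sound = [0.3, 0.7]
activations = [gaussian_activation(sound, c) for c in perception_layer]

# Units whose centers lie near the sound respond strongly; distant ones barely respond.
best = max(range(len(activations)), key=lambda i: activations[i])
print("most active unit:", best, "center:", perception_layer[best])
```

A motor layer can be sketched the same way, with the synthesizer parameters taking the place of the two frequencies.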

It is easy to see why babbling would be important for development in a model like this: the network is effectively training itself on the correspondence between motor parameters and the resulting sounds. In other words, it produces a sound (thus providing itself with motor input) and then hears that sound (thus providing itself with auditory input). Westermann & Miranda utilize a Hebbian algorithm (essentially, “fire together, wire together”) to allow for these representations to become associated with one another, and thus for the network to learn.
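The babbling loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published model: `toy_synthesizer` is a made-up stand-in for the realistic speech synthesizer, the layer sizes and learning rate are arbitrary, and the update is the plainest possible Hebbian rule (weight change proportional to the product of the two activations).

```python
import math
import random

def gaussian(x, center, width=0.2):
    """Gaussian tuning: activation falls off with distance from the unit's center."""
    sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq / (2.0 * width ** 2))

def toy_synthesizer(motor):
    """Hypothetical stand-in for a speech synthesizer: maps two motor
    parameters in [0, 1] to two 'peak frequencies' in [0, 1]."""
    return [motor[0], 0.5 * (motor[0] + motor[1])]

rng = random.Random(1)
n_motor, n_percept = 20, 20
motor_centers = [[rng.random(), rng.random()] for _ in range(n_motor)]
percept_centers = [[rng.random(), rng.random()] for _ in range(n_percept)]

# weights[i][j] links motor unit i and perception unit j (used in both directions).
weights = [[0.0] * n_percept for _ in range(n_motor)]

def babble_step(weights, rate=0.05):
    """Produce a random sound, hear it, and strengthen co-active links."""
    motor = [rng.random(), rng.random()]        # random articulation
    sound = toy_synthesizer(motor)              # the sound it produces
    m_act = [gaussian(motor, c) for c in motor_centers]
    p_act = [gaussian(sound, c) for c in percept_centers]
    for i in range(n_motor):
        for j in range(n_percept):
            # Hebbian update: "fire together, wire together".
            weights[i][j] += rate * m_act[i] * p_act[j]

def heard_to_motor(sound):
    """Listening only: drive the motor layer from perceptual activity."""
    p_act = [gaussian(sound, c) for c in percept_centers]
    return [sum(w * p for w, p in zip(row, p_act)) for row in weights]

for _ in range(500):
    babble_step(weights)

motor_drive = heard_to_motor([0.5, 0.5])
```

After babbling, `heard_to_motor` shows how a sound that is merely heard can activate motor units through the learned links. Note that a bare Hebbian rule like this one lets weights grow without bound; practical models typically add some form of normalization or decay.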

The authors demonstrated how “preferred response” regions develop within the network, such that more linear relationships between motor and auditory changes are reflected in the migration of units towards those areas of the layer parameter space which have more consistent perceptual-motor mappings. There are two potential reasons for the profile of these preferred response regions: first, Westermann & Miranda appear to use a linear activation function (in contrast to the sigmoidal functions used in other network formalisms); second, Westermann & Miranda have elected not to include a hidden layer between the perceptual and motor layers, limiting the computational power of the network to linear relationships.

The authors also demonstrate how exposure to a language environment such as French or German can skew the kinds of perceptual sensitivities which self-organize in the network, by causing migration of the perceptual units towards those frequency distributions most present in the ambient language environment.
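One simple way to picture this migration is a winner-take-all update, in which the unit closest to each heard sound moves a small step towards it. The rule below is a swapped-in simplification chosen for illustration (this post does not specify the paper's exact update rule), and the "ambient language" is made-up data clustered around two hypothetical vowel prototypes.

```python
import random

def nearest_unit(centers, x):
    """Index of the unit whose center is closest to the input x."""
    return min(
        range(len(centers)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], x)),
    )

def expose(centers, sounds, rate=0.1):
    """For each heard sound, move the closest unit a fraction `rate` of the way towards it."""
    for x in sounds:
        i = nearest_unit(centers, x)
        centers[i] = [c + rate * (a - c) for c, a in zip(centers[i], x)]
    return centers

rng = random.Random(2)

# Perceptual units start at random positions in a normalized 2-D frequency space.
centers = [[rng.random(), rng.random()] for _ in range(30)]

# Hypothetical ambient language: sounds clustered around two made-up prototypes.
prototypes = [[0.2, 0.8], [0.7, 0.3]]
sounds = [
    [p + rng.gauss(0.0, 0.03) for p in rng.choice(prototypes)]
    for _ in range(2000)
]

expose(centers, sounds)
```

With a different set of prototypes (a different "language"), the same starting units would end up concentrated in different regions of the space.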

The end result of this architecture and learning algorithm is a set of “mirrored” perceptual/motor units which may respond regardless of whether speech is self-produced or merely heard (or presumably, merely seen).

The larger point:

Of course there are numerous shortcomings to the model (many of which Westermann & Miranda admit), but the success of the model in the face of these shortcomings illustrates how simple the assumptions required for mirror neurons to emerge in a computational architecture can be.

According to this view, mirror neurons are essentially a “convergence zone” for sensory and motor input. The apparent location of mirror neurons in the human (as extrapolated from their location in monkeys) seems to support this idea: they tend to be located in premotor and planning-related regions of the cortex, areas which require a tight relationship between sensory and motor information. This also hints towards one explanation for the most fascinating characteristic of mirror neurons: they appear to be “goal” (or at least “object”) directed, in that mirror neurons in monkeys will not fire to the mere observation of mimed behaviors when they are not plausibly goal- or object-directed. One might speculate that this apparent goal-sensitivity is related to the benefits that “convergence zones” for sensori-motor input have in object-directed action.