When we hear spoken words we sense that they are made of auditory sounds. The motor theory of speech perception argues that behind the sounds we hear are the intended movements of the vocal tract that pronounces them.

The motor theory of speech perception is the hypothesis that people perceive spoken words by identifying the vocal tract gestures with which they are pronounced rather than by identifying the sound patterns that speech generates.[1][2][3][4][5] It originally claimed that speech perception is done through a specialized module that is innate and human-specific. Though the idea of a module has been qualified in more recent versions of the theory,[5] the idea remains that the role of the speech motor system is not only to produce speech articulations but also to detect them.

The hypothesis has gained more interest outside the field of speech perception than inside. This has increased particularly since the discovery of mirror neurons that link the production and perception of motor movements, including those made by the vocal tract.[5] An alternative interpretation of research linking speech perception to speech production, however, is that it links to speech imitation rather than speech perception.[6]

The hypothesis has its origins in research using pattern playback to create reading machines for the blind that would substitute sounds for orthographic letters.[7] This led to a close examination of how spoken sounds correspond to their acoustic spectrograms as sequences of auditory sounds. The research found that successive consonants and vowels overlap in time with one another (a phenomenon known as coarticulation).[8][9][10] This suggested that speech is not heard like an acoustic "alphabet" or "cipher," but as a "code" of overlapping speech gestures.

Initially, the theory was associationist: infants mimic the speech they hear, and this leads to behavioristic associations between articulation and its sensory consequences. Later, this overt mimicry would be short-circuited and become speech perception.[9] This aspect of the theory was dropped, however, with the discovery that prelinguistic infants could already detect most of the phonetic contrasts used to separate different speech sounds.[1]

The behavioristic approach was replaced by a cognitivist one in which there was a speech module.[1] The module detected speech in terms of hidden distal objects rather than at the proximal or immediate level of their input. The evidence for this was the research finding that speech processing is special, as shown by phenomena such as duplex perception.[11]

If speech is identified in terms of how it is physically made, then nonauditory information should be incorporated into speech percepts even if it is still subjectively heard as "sounds". This is, in fact, the case.

The McGurk effect shows that seeing the production of a spoken syllable that differs from an auditory one synchronized with it affects the perception of the auditory one. In other words, if someone hears "ba" but sees a video of someone pronouncing "ga", what they hear is different; some people believe they hear "da".

People find it easier to hear speech in noise if they can see the speaker.[16]

People can hear syllables better when their production can be felt haptically.[17]

Using a speech synthesizer, speech sounds can be varied in place of articulation along a continuum from /bɑ/ to /dɑ/ to /ɡɑ/, or in voice onset time on a continuum from /dɑ/ to /tɑ/ (for example). When listeners are asked to discriminate between two different sounds, they perceive sounds as belonging to discrete categories, even though the sounds vary continuously. In other words, 10 sounds (with the sound on one extreme being /dɑ/ and the sound on the other extreme being /tɑ/, and the ones in the middle varying on a scale) may all be acoustically different from one another, but the listener will hear all of them as either /dɑ/ or /tɑ/. Likewise, the English consonant /d/ may vary in its acoustic details across different phonetic contexts (the /d/ in /du/ does not technically sound the same as the one in /di/, for example), but all /d/'s as perceived by a listener fall within one category (voiced alveolar stop), and that is because "linguistic representations are abstract, canonical, phonetic segments or the gestures that underlie these segments."[18] This suggests that humans identify speech using categorical perception, and thus that a specialized module, such as that proposed by the motor theory of speech perception, may be on the right track.[19]
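
The pattern described above can be illustrated with a toy model. The sketch below is not drawn from any of the cited studies: it assumes a logistic identification function over a voice onset time continuum, and the boundary (30 ms), the slope, the 6 ms step size, and all function names are hypothetical values chosen only for illustration. It shows how ten acoustically distinct stimuli can receive just two labels, and how a pair straddling the assumed boundary is predicted to be far easier to tell apart than an equally spaced pair within one category.

```python
import math

# Hypothetical parameters, chosen only for illustration (not from any study).
BOUNDARY_MS = 30.0  # assumed /d/-/t/ boundary on the voice onset time axis
SLOPE = 0.5         # assumed steepness of the identification function


def prob_t(vot_ms: float) -> float:
    """Probability that a toy listener labels a stimulus as /t/."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (vot_ms - BOUNDARY_MS)))


def label(vot_ms: float) -> str:
    """Discrete category label assigned by the toy listener."""
    return "t" if prob_t(vot_ms) >= 0.5 else "d"


def predicted_discrimination(vot_a: float, vot_b: float) -> float:
    """Label-based discriminability: difference in identification probability."""
    return abs(prob_t(vot_a) - prob_t(vot_b))


# Ten stimuli in equal 6 ms acoustic steps, from 3 ms to 57 ms.
continuum = [3.0 + 6.0 * i for i in range(10)]
labels = [label(v) for v in continuum]

# Same 6 ms acoustic distance, within a category and across the boundary.
within = predicted_discrimination(3.0, 9.0)
across = predicted_discrimination(27.0, 33.0)

if __name__ == "__main__":
    print(labels)
    print(f"within-category: {within:.5f}, across-boundary: {across:.3f}")
```

In this sketch every stimulus falls into one of two labels even though no two stimuli are acoustically identical, which mirrors the listener behaviour reported for the /dɑ/ to /tɑ/ continuum. The model says nothing about whether the underlying categories are gestural or auditory.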

If people can hear the gestures in speech, then the imitation of speech should be very fast, as in speech shadowing, in which people repeat words that they hear over headphones.[20] People can repeat heard syllables more quickly than they would be able to produce them normally.[21]

Evidence exists that perception and production are generally coupled in the motor system. This is supported by the existence of mirror neurons that are activated both by seeing (or hearing) an action and by carrying that action out.[29] Another source of evidence is the support for common coding theory, which links the representations used for perception and action.[30]

The motor theory of speech perception has not had much success. As three of its advocates have noted, "it has few proponents within the field of speech perception, and many authors cite it primarily to offer critical commentary".[5] (p. 361) Several critiques of it exist.[15][31]

Speech perception is affected by nonproduction sources of information, such as context. Individual words are hard to understand in isolation but easy when heard in sentence context. It therefore seems that speech perception uses multiple sources that are integrated together in an optimal way.[15]

The motor theory of speech perception would predict that speech motor abilities in infants predict their speech perception abilities, but in actuality it is the other way around.[32] It would also predict that defects in speech production would impair speech perception, but they do not.[33]

The evidence provided for the motor theory of speech perception is limited to tasks such as syllable discrimination that use speech units rather than full spoken words or spoken sentences. As a result, "speech perception is sometimes interpreted as referring to the perception of speech at the sublexical level. However, the ultimate goal of these studies is presumably to understand the neural processes supporting the ability to process speech sounds under ecologically valid conditions, that is, situations in which successful speech sound processing ultimately leads to contact with the mental lexicon and auditory comprehension."[34] This, however, creates the problem of "a tenuous connection to their implicit target of investigation, speech recognition".[34]

The motor theory of speech perception faces the problem that the research linking speech perception to speech production is also consistent with the brain processing speech in order to imitate spoken words. The brain must have a means to do this if language is to exist, since a child's vocabulary expansion requires a means to learn novel spoken words, as does an adult's picking up of new names.[6] Imitation has to be initiated for all vocalizations, since a word's novelty cannot be known until after it is heard, by which time the information needed to identify its articulation gestures and motor goals has gone. As a result, vocal imitation needs to be initiated by default into short-term memory for every heard spoken vocalization.[6] If speech perception uses multiple sources of information, this default imitation processing would provide, as a secondary use, an extra source for word perception. Since imitation will be most needed for vocalizations that are not proper words, this could explain why sublexical tasks that do not use proper words link so strongly to the processing of motor gestures.[6]