PRINTED FROM MIT PRESS SCHOLARSHIP ONLINE (www.mitpress.universitypressscholarship.com). (c) Copyright The MIT Press, 2018. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a monograph in MITSO for personal use (for details see www.mitpress.universitypressscholarship.com/page/privacy-policy). Subscriber: null; date: 21 January 2019

Dynamic Facial Speech: What, How, and Who?

Chapter:

6 Dynamic Facial Speech: What, How, and Who?

Source:

Dynamic Faces

Author(s):

Harold Hill

Publisher:

The MIT Press

DOI:10.7551/mitpress/9780262014533.003.0007

Abstract and Keywords

This chapter examines the extent to which dynamic facial speech expresses what is being said, how it is being said, and by whom, by investigating a number of studies that focus on lip-reading and facial speech perception. It also discusses spatiotemporal aspects of the interplay of auditory and facial speech signals that help in answering questions about what is being said and by whom. The importance of lip-reading studies for theories of dynamic face processing is also discussed. The chapter further focuses on temporal properties, global spatiotemporal patterns of movement, encoding, and brain mechanisms that help dynamic facial speech to provide answers about what, how, and who.

A great deal of dynamic facial movement is associated with speech. Much of this movement is a direct consequence of the mechanics of speech production, reflecting the need to continuously shape the configuration of the vocal tract to produce audible speech. This chapter looks at the evidence that this movement tells us about not only what is being said, but also about how it is being said and by whom. A key theme is what information is provided by dynamic movement over and above that available from a photograph. Evidence from studies of lip-reading is reviewed, followed by work on how these movements are modulated by differences in the manner of speech and on what automatic and natural exaggeration of these differences tells us about encoding. Supramodal cues to identity conveyed by both the voice and the moving face are then considered. The aim is to explore the extent to which dynamic facial speech allows us to answer the questions who, what, and how, and to consider the implications of the evidence for theories of dynamic face processing.

Key Questions about the Perception of Dynamic Facial Speech

We can tell a lot about a person from a photograph of their face, even though a photograph is an artificial and inherently limited stimulus, especially with respect to dynamic properties. The first question is what, if anything, does seeing a dynamic face add to seeing a photograph? At this level “dynamic” is simply being used to indicate that the stimulus is moving or time varying, as opposed to static. In this sense, auditory speech is inherently dynamic and so audiovisual speech would seem a particularly promising area in which to look for dynamic advantages.

If movement is important, we should also ask how it is encoded. Is it, like film or video, simply encoded as an ordered and regularly spaced series of static frames? Or are the processes involved more similar to the analysis of optic flow, with movement encoded as vector fields indicating directions and magnitudes for any change from one frame to the next? Both such encodings, or a combination (Giese & Poggio, 2003), would be viewer-centered, a function of the observer’s viewpoint and other viewing conditions as much as of the facial speech itself.

In contrast, an object-centered level of encoding might have advantages in terms of efficiency through encoding only properties of facial speech itself. It is the relative motions of the articulators that shape speech, especially since their absolute motions are largely determined by whole head and body motions (Munhall & Vatikiotis-Bateson, 1998). Of particular relevance to this question are the rigid movements of the whole head that often accompany natural speech. These include nods, shakes, and tilts of the whole head, as well as various translations relative to the viewer. All of these rigid movements would greatly affect any viewer-centered representation of the nonrigid facial movements, including those of the lips, that are most closely associated with facial speech.

In auditory speech recognition, a failure to find auditory invariants corresponding to the hypothesized phonetic units of speech led to the motor theory of speech perception (Liberman & Mattingly, 1985). This argues that speech is encoded in terms of the articulatory “gestures” underlying speech production rather than auditory features. One appeal of this theory in the current context is that the same articulatory gestures could also be used to encode facial speech, naturally providing a supramodal representation for audiovisual integration. Gestures also provide a natural link between perception and production. At this level, the “dynamic” of dynamic facial speech includes the forces and masses that are the underlying causes of visible movement, rather than being limited to the kinematics of movement. From the point of view of the perceptual processes involved, such a theory would need to specify how we recover dynamic properties from the kinematic properties of the movements available at the retina which, like the recovery of depth from a two-dimensional image, is an apparently underspecified problem. There is evidence from the perception of biological motion that human observers are able to do this (Runeson & Frykholm, 1983).

Another critical question with regard to encoding movement concerns the temporal scale involved. Many successful methods of automatic speech recognition treat even auditory speech as piecewise static, and use short time frames (≈ 20 ms). Each frame is characterized by a set of parameters that are fixed for that frame and that are subsequently fed into a hidden Markov model. However, within the auditory domain, there is strong evidence that piecewise static spectral correlates associated with the traditional distinctive features of phonemes are not the sole basis of human perception (Rosenblum & Saldaña, 1998). Transitions between consonants and vowels vary greatly according to which particular combinations are involved. While the variation this introduces might be expected to be a problem for the identification of segments based on temporally localized information, on the contrary, experiments with cross-splicing and vowel nucleus deletion show that the pattern of transitions is a useful, and sometimes sufficient, source of information. In addition, subsequent vowels can influence the perception of preceding consonants, and this influence also operates cross-modally as well as auditorily (Green & Gerdeman, 1995). The structure and redundancy of language means that recognition is determined at the level of words or phrases as well as segments. Word frequency, number of near neighbors, and number of distinctive segments can all contribute. This is also true of facial speech where, for example, polysyllabic words tend to be more distinctive and easier to speech-read than monosyllabic words (Iverson, Bernstein, & Auer, 1998). While ba, ma, and pa can be confused visibly, in the context of a multisyllabic word like balloon, the ambiguity is readily resolved lexically because malloon and palloon are not words. Thus there may be important advantages to encoding both seen and heard speech at longer time scales that cover syllables, whole words, sentences, and even entire utterances.
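The piecewise static treatment described above can be illustrated with a minimal Python sketch. It cuts a sampled signal into consecutive 20 ms frames and reduces each frame to one fixed parameter, here mean energy. The function name, the 16 kHz sampling rate, and the use of energy as the only parameter are assumptions made for illustration; practical recognizers use richer spectral parameters, and the hidden Markov model stage is not shown.

```python
def frame_energies(samples, sample_rate_hz=16000, frame_ms=20):
    """Split a signal into non-overlapping frames; return mean energy per frame."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)  # samples in one frame
    energies = []
    # Step through the signal one whole frame at a time; a trailing partial
    # frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies
```

Each value in the returned list stands for one frame and carries no information about how the signal changes within or between frames, which is the limitation the transition evidence above points to.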

In summary, dynamic facial speech, like vision in general, will inevitably be initially encoded in an appearance-based, viewer-centered way. However, there is evidence that more abstract object-centered and/or muscle-based levels of encodings may also be involved.

What? Meaning from Movement

We all speech-read in the general sense that we are sensitive to the relationships between facial movement and the sound of the voice. For example, we know immediately if a film has been dubbed into another language (even when we know neither of the languages), or if the audio is out of synchrony with the video. The degree to which we find these audiovisual mismatches irritating and impossible to ignore suggests that the cross-modal processing involved is automatic and obligatory [even preverbal, 10–16-month-old infants find a talking face that is out of sync by 400 ms distressing (Dodd, 1979)].

Speech-reading in hearing individuals is not the same as the silent speech-reading forced upon the profoundly deaf (or someone trying to work out what Marco Materazzi said to Zinedine Zidane at the 2006 Soccer World Cup final). The two abilities may be related and start from the same stimulus, the dynamic face, but visual speech-reading in normal-hearing individuals is closely linked to auditory processing. Indeed, seeing a silent video of a talking face activates the auditory cortex (Calvert et al., 1999), but this activity is significantly reduced in congenitally deaf people (MacSweeney et al., 2001). For hearing individuals, being able to see the speaker’s face helps us to compensate for noisy audio, an advantage that can be equivalent to a 15-dB reduction in noise (Sumby & Pollack, 1954). Even when the audio is clear, seeing the speaker helps if the material is conceptually difficult (a passage from Immanuel Kant was tested!) or spoken in a foreign accent (Reisberg, 1987). Perhaps the most dramatic illustration of speech-reading in normal-hearing individuals is the so-called McGurk effect (McGurk & MacDonald, 1976). When certain pairs of audible and visible syllables are mismatched, for example audio “ba” with visual “ga,” what we hear depends on whether our eyes are open (“da”), or closed (“ba”). This section explores the visual information that can affect what we hear.

Even a static photograph of a speaking face can provide information about what is being said. Photographs of the apical positions associated with vowels and some consonants can be matched to corresponding audio sounds (Campbell, 1986). The sound of the vowel is determined by the shape of the vocal tract, as this in turn determines the resonant frequencies or formants. Critical vocal tract parameters include overall length and the size and shape of the final aperture formed by the lips. These can be seen in photographs. The area and position of maximum constriction formed by the tongue may also be visible when the mouth is open, and is often correlated with clearly visible lip shape (Summerfield, 1987). A McGurk-like mismatching of audio sounds and visual vowels results in the perception of a vowel with intermediate vocal tract parameters, suggesting these parameters as another potential supramodal representation derivable from both vision and auditory speech (Summerfield & McGrath, 1984). The effect of visual information on even the perception of vowels is particularly compelling, given that vowels tend to be clearly audible as they are voiced and relatively temporally stable.

Consonants, in contrast, involve a transitory stopping or constriction of the airflow and, acoustically, are vulnerable to reverberation and noise. The location of the restriction, the so-called “place of articulation,” is often visible and can also be captured in a photograph, at least when it occurs toward the front of the vocal tract. Clearly visible examples include bilabials (p, b, m), labiodentals (f, v), and linguodentals (th). The “ba” and “ga” often used to illustrate the McGurk effect are clearly distinguishable in photographs. More posterior alveolar or palatal constrictions can be visible if the mouth is relatively open. The place of articulation is often difficult to determine from audio, and vision may be particularly important in providing complementary information for consonants.

The visual confusability of consonants is approximately inversely related to their auditory confusability (Summerfield, 1987). Campbell proposes a strong separation between complementary and correlated visual information, associating the former with static and the latter with dynamic properties of audiovisual stimuli, with processing carried out by different routes (Campbell, 2008). Static photographs have been reported to generate McGurk effects equivalent to those of dynamic stimuli for consonant-vowel (CV) syllables, and were even reported to support better silent speech-reading of the stimuli (Irwin, Whalen, & Fowler, 2006; but see Rosenblum, Johnson, & Saldaña, 1996). Thus for speech-reading, as for so many face-processing tasks, static photographs appear to capture much of the critical visual information, such as vocal tract parameters, including the place of articulation of consonants.

Although demonstrably useful, the high-frequency spatial information provided by photographs does not seem to be necessary for speech-reading. Studies of the perceiver’s eye movements show speech-reading benefits, even when people are looking directly at the mouth less than half of the time (Vatikiotis-Bateson, Eigsti, Yano, & Munhall, 1998). This suggests that the high spatial sensitivity of the fovea is not necessary for speech-reading. This is confirmed by studies using spatially filtered stimuli (Munhall, Kroos, Jozan, & Vatikiotis-Bateson, 2004). Instead, the temporally sensitive but spatially coarse resolution periphery appears sufficient.

Further evidence for the importance of motion over spatial form comes from single case studies. HJA, a prosopagnosic patient who is unable to process static images of faces, shows typical speech-reading advantages from video (Campbell, Zihl, Massaro, Munhall, & Cohen, 1997). Conversely, the akinetopsic patient LM, who has severe problems with motion processing, cannot speech-read natural speech but can process static faces, including being able to identify speech patterns from photographs (Campbell et al., 1997). There is also suggestive evidence that speech-reading ability is correlated with motion sensitivity (Mohammed et al., 2005). Studies involving the manipulation of video frame rates show decreases in ability at or below 12.5 fps (Vitkovich & Barber, 1994). Adding dynamic noise also reduces our ability to speech read (Campbell, Harvey, Troscianko, Massaro, & Cohen, 1996). Finally, the most direct evidence for the usefulness and indeed sufficiency of motion information comes from studies using point-light stimuli (Rosenblum et al., 1996). These displays are designed to emphasize dynamic information and limit form-based cues. They are sufficient to support McGurk effects and speech-in-noise advantages when presented as dynamic stimuli (Rosenblum & Saldaña, 1998).

What is it about speech that is captured and conveyed by visual movement? One simple temporal cue would be the time of onset of various events. The onset of speech is characterized by both mouth and head movements (Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004), and this can highlight the onset of the corresponding auditory signal that might otherwise be lost in noise. Another temporal cue that is available multimodally is duration. In many languages, including Japanese, Finnish, and Maori, though not typically in English, differences in vowel duration serve to distinguish between phonemes and can change meaning. In an experiment recording facial motion during the production of long or short minimal pairs (pairs of words differing in only one phoneme) in Japanese, differences in facial movement that would provide visible cues to duration are clearly apparent. For example, figure 6.1 shows trajectories associated with koshou (breakdown) and koushou (negotiations). As can be seen, durations between corresponding features of the plotted trajectories of jaw and lip movement contain visual information about duration. The segments indicated were defined on the basis of the audio speech, but the corresponding changes in the trajectories would be visible.

Figure 6.1 An example of (a) unfocused (left) and focused (right) audio waveforms and (b) spectrograms and motion trajectories for a number of components of dynamic facial speech. The Japanese phrase spoken is the short member of a minimal pair, koshou, defined in terms of a long or short linguistic distinction and contained in a carrier phrase. For details, please see the text.

The interpretation of duration is highly dependent on overall speech rate, another temporal property that is provided by facial as well as audible speech. When auditory and visible speech rates are deliberately mismatched, the rate seen can influence heard segmental distinctions associated with voice onset time and manipulated on a continuum between /bi/ and /pi/ (Green & Miller, 1985). As well as being unable to speech-read, LM cannot report differences in the rate of observed speech (Campbell et al., 1997).

Velocities of facial movements are also available from facial speech; they reflect aerodynamics and are diagnostic of phonetic differences, as in, for example, the difference between p and b (Munhall & Vatikiotis-Bateson, 1998). Rapid movement is also associated with, and helps to signal, changes from constricted consonants to open vowels. The overall amount of lip movement can also distinguish between different diphthongs (Jackson, 1988).

Dynamic facial speech potentially provides information about transitions. The most dramatic evidence for the importance of formant transitions, as opposed to more complex, temporally localized spectral features, is sine wave speech (Remez, Rubin, Pisoni, & Carrell, 1981). This is “speech” synthesized by using three sine waves that correspond to the amplitudes and frequencies of the first three formants of the original audio. The resultant temporally distributed changes in formant frequencies can be sufficient for understanding speech despite the almost complete absence of traditional transitory acoustic features. These auditory parameters have been shown to be correlated with three-dimensional face movement (Yehia et al., 1998).
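To make the construction of sine wave speech concrete, the following Python sketch sums one sinusoid per formant track, with the frequency and amplitude of each sinusoid updated frame by frame. It is an illustration only and not the procedure of Remez et al.: the input format (a list of frequency and amplitude pairs per track), the 10 ms frame length, the 16 kHz sampling rate, and the function name are all assumptions, and the formant tracks themselves would have to come from a separate analysis of real speech.

```python
import math

def sine_wave_speech(formant_tracks, sample_rate_hz=16000, frame_ms=10):
    """Sum one time-varying sinusoid per formant track.

    formant_tracks: hypothetical input format. Each track is a list of
    (frequency_hz, amplitude) pairs, one pair per analysis frame.
    """
    samples_per_frame = int(sample_rate_hz * frame_ms / 1000)
    n_frames = min(len(track) for track in formant_tracks)
    output = [0.0] * (n_frames * samples_per_frame)
    for track in formant_tracks:
        phase = 0.0  # running phase, so frequency changes stay continuous
        for frame_index in range(n_frames):
            freq_hz, amp = track[frame_index]
            step = 2.0 * math.pi * freq_hz / sample_rate_hz
            for k in range(samples_per_frame):
                i = frame_index * samples_per_frame + k
                output[i] += amp * math.sin(phase)
                phase += step
    return output
```

Passing three tracks, one each for the first three formants, gives the three-tone signal described above; passing a single track gives the single-tone condition used in some perceptual experiments.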

Perceptual experiments also show advantages, in terms of syllables correctly reported, for presenting sine wave speech in combination with corresponding video over presenting audio or video alone (Remez, Fellowes, Pisoni, Goh, & Rubin, 1998). For single tones, the combination of video and F2 was found to be particularly effective. F2 is the formant most highly correlated with facial movement (Grant & Seitz, 2000) owing to its association with lip spreading or rounding (Jackson, 1988). This highlights the importance of information common to both audio and video in audiovisual speech processing, in contrast to the traditional emphasis on complementary cues. Both F2 and facial speech are informative about place of articulation. If complementary audio and visual information were critical, F0 might have been expected to receive the most benefit from the addition of visual information. This is because F0 provides voicing information, which is determined primarily by the vibration of the vocal cords rather than the shape of the vocal tract and so is not directly available from dynamic facial speech. Taken together, these findings suggest audiovisual integration at the level of shared spectrotemporal patterns suited to encoding patterns of transitions.

Studies of the perception of speech in noise show that even rigid head movements can increase the number of syllables recovered under noisy conditions (Munhall et al., 2004). These rigid movements do not provide information about the shape of the vocal tract aperture or the place of articulation. Indeed, for a viewer-centered encoding and many automatic systems, head movements would be expected to interfere with the encoding of the critical nonrigid facial movements. However, head movements are correlated with the fundamental frequency, F0, and absolute intensity. Both of these are important in conveying prosody, and this may be the mechanism by which they facilitate recognition of syllables. Prosody could help at the level of initial lexical access to words, and at the level of sentences by providing cues to syntactic structure (Cutler, Dahan, & van Donselaar, 1997). Prosody is the central theme of the next section, the how of facial speech.

In summary, speech can to an extent be treated as piecewise static in both auditory and visual domains. Within this framework, photographs can clearly capture critical visual information. However, this information will always be limited to the level of individual segments, and the effects of coarticulation mean that even segmental cues will be temporally spread out. Such temporally distributed information can never be captured by a single photograph, but is associated with patterns of movement. Thus the durations, rates, velocities, amounts, and spatiotemporal patterns of movements that make up dynamic facial speech all help us to know what is being said.

How? It’s Not What You Say, but How You Say It

Prosody allows us to convey many different messages even when speaking the same words and is critical to the message communicated. The contrast between good and bad acting amply illustrates how important these differences can be. Acoustically, prosody is conveyed by the differences in pitch, intensity, and duration that determine overall patterns of melody and rhythm. It is suprasegmental; that is, its effects are spread out over the utterance and have meaning only in relation to each other. At the level of individual words, characteristic prosodic patterns, particularly syllable stress, are often associated with accent but can also change the meaning or part of speech, as with convict the noun as opposed to convict the verb. These differences are often visible. Prosody also conveys syntax, emphasis, and emotion, as well as helping to regulate turn-taking and reflecting social relationships, all of which are fundamental to communication. Many of these functions are associated with acoustic patterning of the pitch and intensity of the voice. Although the vibrations of the vocal cords that determine F0 are not visible on a face, they are correlated with head movements (Munhall et al., 2004). Intensity is strongly associated with face motion (Yehia et al., 1998) and to a lesser extent, head movement. Visible movements and expressions that are not directly related to audio, for example, the raising of an eyebrow or widening of the eyes, also provide an additional channel by which facial speech can modulate the spoken message.

Work on visual prosody has looked at contrastive focus, the use of prosodic cues to emphasize one part of a sentence over another. There are known acoustic indicators of focus, but perceptual experiments show that it is also possible to speech-read emphasis from video alone (Dohen, Loevenbruck, Cathiard, & Schwartz, 2004). Production studies have involved recording different people’s movements while they produce examples of contrastive focus. These show a number of visual correlates, including increases in lip and jaw opening, cheek movements, and duration for the focal syllable, coupled with corresponding decreases for the remaining parts of the sentence (Dohen, Loevenbruck, & Hill, 2005). Individual differences are also found, for example, in head movements and anticipatory strategies, which may have an important role in answering the question who?

Emotion produces large effects in people’s speech. We looked at how this was conveyed by face and head movements using silent animations (Hill, Troje, & Johnston, 2005). In particular we were interested in whether differences in either the timing or the spatial extent of movement would be more important. We found that exaggerating differences in the spatial extent of movement relative to the grand average across emotion reliably increased the perceived intensity of the emotion. This is again consistent with the importance of apical positions in the perception of facial motion, although moving greater or lesser distances in the same time will also change velocities and accelerations. In natural speech, peak velocity and peak amplitude tend to be linearly related (Vatikiotis-Bateson & Kelso, 1993). In our studies, directly changing timing while leaving spatial trajectories unchanged was less effective in exaggerating emotion. This may have been because the changes in timing used were restricted to changes in the durations of segments relative to the average.
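The spatial exaggeration described above amounts to scaling the difference between a given movement and the average movement. A minimal Python sketch of that idea follows. It assumes two time-aligned trajectories of equal length, each a list of positions along one coordinate; the function name and the gain parameter are hypothetical, and the sketch is not the procedure used in Hill, Troje, and Johnston (2005).

```python
def exaggerate(trajectory, average_trajectory, gain=1.5):
    """Scale the difference between a trajectory and the average by `gain`.

    gain > 1 exaggerates, gain = 1 leaves the trajectory unchanged, and
    gain = 0 returns the average. Timing is untouched: only positions move.
    """
    if len(trajectory) != len(average_trajectory):
        raise ValueError("trajectories must be time-aligned and equal in length")
    return [avg + gain * (pos - avg)
            for pos, avg in zip(trajectory, average_trajectory)]
```

Because the number of samples is unchanged, a larger displacement is covered in the same time, so velocities and accelerations scale along with spatial extent, as noted in the text.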

Still convinced that timing is important, we decided to use contrastive focus to look at how people naturally exaggerate differences in duration for emphasis. As noted earlier, vowel length is important for distinguishing between phonemes in Japanese, among other languages. We made use of a set of “minimal pairs,” that is, pairs of words that differ in only one phoneme, in this case based on a contrast in duration. The speakers whose movements were being recorded read a simple carrier sentence, “Kare kara X to kikimashita” (“I heard X from him”) where X was one of the members of a minimal pair. An experimenter then responded “Kare kara Y to kikimashita?” (“You heard Y from him?”), where Y was the other member of the pair. The speaker being recorded then responded, “Ie, kare kara X to kikimashita!” (“No, I heard X from him!”). We were primarily interested in how the second instance of X would differ from the first, given that X differs from Y in terms of the duration of one of its segments.

Using a two-alternative forced-choice (2AFC) task, we found that observers could distinguish which version of X had been emphasized with 95% accuracy from audio alone and with 70% accuracy from video alone (where chance was 50%). The effect on facial movement for one speaker is shown in figure 6.1. In this case the first syllable of the keyword, “ko,” is the short version of the minimal pair, which consisted of koshou (breakdown) and koushou (negotiations). The “ko” corresponds with the segment between the first and second vertical lines, the positions of which were defined from the audio. When focused, as in the right half of the figure, the absolute duration of this part does not change, but its proportion as a part of the whole word (contained in the section between the first and third vertical lines) is clearly reduced; i.e., relative duration is exaggerated. This was borne out by an analysis of all the minimal pairs for the clearest speaker (see figure 6.2a). It is clear that focus increased all durations, including the carrier phrase, which is consistent with our generally speaking more slowly when emphasizing something. Indeed, focus on average increased the durations of the short keywords in each minimal pair, although not as much as it increased the duration of the long keywords. Overall the relative duration of the critical syllable was exaggerated. These effects of emphasis on duration are consistent with similar effects of speaking rate, and with the importance of relative rather than absolute duration as the cue for phonemic vowel length (Hirata, 2004). The relevance to dynamic facial speech is that relative duration can be recovered from visual as well as auditory cues.
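The distinction between absolute and relative duration can be made concrete with a small worked example in Python. The durations below are hypothetical values invented for illustration and are not measurements from the study; the function name is likewise an assumption. The point is only that a syllable whose absolute duration stays the same occupies a smaller share of a word that has lengthened overall.

```python
def relative_duration(segment_ms, word_ms):
    """Return the fraction of the word's duration taken up by one segment."""
    if word_ms <= 0:
        raise ValueError("word duration must be positive")
    return segment_ms / word_ms

# Hypothetical durations in milliseconds, invented for illustration only.
# The critical syllable keeps its absolute duration while the word lengthens.
unfocused = relative_duration(segment_ms=100, word_ms=400)
focused = relative_duration(segment_ms=100, word_ms=500)
print(unfocused, focused)  # 0.25 0.2
```

In this made-up case the short syllable falls from a quarter to a fifth of the word under focus, so its shortness relative to the rest of the word is exaggerated even though its absolute duration is unchanged.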

There were also effects on the range of vertical jaw displacements (see figure 6.2b). Jaw movement corresponds to the first principal component of variation for facial movement in Japanese (Kuratate, Yehia, & Vatikiotis-Bateson, 1998). Again, relative encoding is critical, and focus reduces the amplitude of the first part of the carrier phrase, an example of the hypoarticulation often used for contrastive focus (Dohen et al., 2004, 2005). There was also an increase in the amplitude of movement, hyperarticulation, for the long version of the focused keyword. Short and long syllables did not differ in amplitude without focus, but emphasizing duration naturally leads to an increase in the amplitude of the movement. With focus, there were also changes in the amplitude of other movements, including additional discontinuities in trajectories, facial postures held even after audio production had ceased, and accompanying head nods, all of which can contribute to conveying visual prosody.

Thus there are many ways in which dynamic facial and head movements can signal changes in how the same words are spoken, including differences in the extents and durations of the movements associated with production of sounds. Visible differences in the spatial extent of movements play a role in the perception of both the how and the what of audiovisual speech. However, dynamic differences in timing, particularly differences in the duration and rate of speech, can also be perceived from visual motion as well as from audio sounds. Thus dynamic facial speech can convey the spatiotemporal patterning of speech, the temporal patterning that is primarily associated with the opening and closing of the vocal tract. Even rigid head movements not directly associated with the shaping of the vocal tract reflect these patterns of intensity as well as pitch. It is this patterning that carries prosody, the visible as well as auditory how of speech, which in turn affects both what and who.

Figure 6.2 Visible correlates of phonetic distinctions. (a) This shows the mean durations for three segments of twenty-two sentences of the kind shown in figure 6.1. Average values are shown for short and long versions of each minimal pair when spoken either with or without contrastive focus intended to emphasize the difference in duration between the pairs. Note how the focus increases overall duration and also the relative difference in duration between the long and short keywords. (b) Maximum range of jaw movements corresponding to the same segments and sentences as in (a). Note in particular the increased jaw movement associated with emphasizing the long keyword in each pair and hypoarticulation of the preceding part of the carrier phrase. Error bars show standard errors of means.

Who? Dynamic Facial Speech and Identity

Other chapters in this volume present evidence that movement of a face can be useful for recognizing people. Much of this evidence is drawn from examples of facial speech. In this section the focus is on cues to identity that are shared by face and voice.

Facial speech is different for different individuals. Some people hardly move their lips and rarely show their teeth, while others (especially U.S. television news presenters) speak with extreme ranges of motions. These differences affect how well someone can be speech-read and relate to the number of phonemes that can be distinguished visually (visemes) for that person (Jackson, 1988). This variety is also mirrored by voices, which can range from dull monotones to varying widely in speed and pitch over utterances. We wanted to know if these differences could provide a cross-modal cue to identity, and whether people could predict which face went with which voice and vice versa (Kamachi, Hill, Lander, & Vatikiotis-Bateson, 2003; Lander, Hill, Kamachi, & Vatikiotis-Bateson, 2007).

The answer was that people perform at chance in matching voices to static photographs, suggesting that fixed physical characteristics, such as the length of the vocal tract, that relate to properties of a voice are not sufficient for the task. The performance of participants was above chance when the face was seen moving naturally. Performance dropped off if the movement was played backward (all the voices were played forward and matching was always sequential in time). Playing movement backward is a strong control for demonstrating an effect of movement over and above the associated increase in the amount of static information available with videos compared with photographs. The result shows that neither static apical positions, nor direction-independent temporal properties such as speech rate, or speed, or amount of movement, are sufficient for this task. Instead, direction-dependent, dynamic patterns of spatiotemporal movement support matching.

Performance generalized across conditions in which the face and the voice spoke different sentences, showing that identical segmental information was not necessary. The audio could even be presented as sine wave speech, which again is consistent with the importance of spatiotemporal patterns over segmental cues. Previous work has also shown that the rigid head movements that convey the spatiotemporal patterning of prosody (Munhall et al., 2004) are more useful than segment-related face movements in conveying identity (Hill & Johnston, 2001). Performance at matching a face to a voice was disrupted less by changes in what was being said than by changes in how the person was speaking the words. In this case, we used the same sentences spoken as statements or questions and with casual, normal, or clear styles of speech. Artificial uniform changes in overall speaking rate did not affect identity matching, again ruling out many simple temporal cues (Lander et al., 2007).

In summary, dynamic cues associated with different manners of speaking convey supramodal cues to identity. These cues appear relatively independent of what is being said. It remains to consider how theories of dynamic facial speech encoding can provide a unified account for performance on these different tasks.

Conclusion: Encoding Dynamic Facial Speech

We have seen that dynamic facial speech provides a wealth of information about what is being said, how it is being said, and who is saying it. In this section we will consider the information, encoding, and brain mechanisms that allow dynamic facial speech to provide answers to these questions.

Even a static photograph can capture a considerable proportion of facial speech. Photographs of apical position in particular can be matched to sounds and convey cues to emotion that can be made stronger by exaggerating relative spatial positions. Functional magnetic resonance imaging studies show that static images activate brain areas associated with biological motion, the superior temporal sulcus, the premotor cortex, and Broca’s area, suggesting that they activate circuits involved in the perception and production of speech (Calvert & Campbell, 2003). This is consistent with static facial configurations playing a role in the representation of dynamic facial speech, much as animators use static keyframes of particular mouth shapes when generating lip-synchronized sequences (Thomas & Johnston, 1981). One issue is how these keyframes could be extracted from a naturally moving sequence, given that they will not occur at regular times. Optic flow fields may play a crucial role in this process, with changes in the direction of patterns signaling extreme positions. A test of whether these keyframes do have a special role would be to compare facial speech perception from sequences composed of selected apical positions with sequences containing an equal number of different frames sampled at random or regular intervals.
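The stimulus selection for such a test can be sketched as follows. This is an illustration under strong simplifying assumptions: a single lip-aperture value per frame stands in for a full optic flow field, a reversal in the sign of frame-to-frame change stands in for a change in the direction of flow, and the function names and data are hypothetical.

```python
def apical_frames(aperture):
    """Indices of frames where the direction of movement reverses."""
    keyframes = []
    for i in range(1, len(aperture) - 1):
        before = aperture[i] - aperture[i - 1]
        after = aperture[i + 1] - aperture[i]
        if before * after < 0:  # sign change marks a local extreme
            keyframes.append(i)
    return keyframes

def regular_frames(n_frames, count):
    """An equal number of frames sampled at regular intervals (control)."""
    if count == 0:
        return []
    # Integer arithmetic places each sample at the middle of its interval.
    return [(2 * k + 1) * n_frames // (2 * count) for k in range(count)]

# Hypothetical lip aperture per video frame, in arbitrary units (assumed data).
aperture = [2, 4, 7, 9, 8, 5, 3, 1, 2, 6, 10, 9, 6, 4]

extremes = apical_frames(aperture)                       # [3, 7, 10]
control = regular_frames(len(aperture), len(extremes))   # [2, 7, 11]
```

The extremes fall at irregular intervals, which is why a control with the same number of regularly or randomly sampled frames is needed to separate the contribution of apical positions from that of frame count alone.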

An issue for any encoding based on static images or optic flow is the role of head movements. Rigid movements of the whole head might be expected to disrupt the recovery of flow patterns associated with the relative facial movements of the oral cavity most closely linked to speech sounds, and so to need factoring out. Perhaps surprisingly, for human perception such head movements, rather than being a problem, actually appear to facilitate the perception of facial speech (Munhall et al., 2004). In addition, the perception of facial speech is relatively invariant to viewpoint (Jordan & Thomas, 2001), and cues to identity from nonrigid facial movement generalize well between views (Watson, Johnston, Hill, & Troje, 2005). This evidence suggests view-independent primitives or face-centered representations. Temporally based cues provide a number of possible candidates for view-independent primitives and are considered in more detail later. Spatiotemporal patterns of motion have a view-dependent spatial component that would not be invariant but could conceivably be encoded in a head-centered coordinate system. Rigid movements are not simply noise and may be encoded independently because they are also a valuable source of information in their own right. There is suggestive neuropsychological evidence that this may be the case (Steede, Tree, & Hole, 2007).

As noted, temporal cues are inherently view independent. They are also available multimodally, thus providing a potential medium for cross-modal integration. The temporal relationship between auditory and facial speech is not exact. In terms of production, preshaping means that movement can anticipate sound and, at the other extreme, lips continue to move together after sound has ceased (Munhall & Vatikiotis-Bateson, 1998). Perceptually, audiovisual effects persist with auditory lags of as much as 250 ms (Campbell, 2008). Possible temporal cues include event onsets, durations, the timing and speed of transitions, and speaking rates. Evidence has been presented that all of these are involved in the perception of dynamic facial speech. Global temporal patterns of timing and rhythm associated with prosody and carried by head and face movements also play an important part in the perception of dynamic facial speech.

Movement-based point-light sequences, lacking static high spatial-frequency cues but including spatial as well as temporal information, convey facial speech (Rosenblum & Saldaña, 1998), emotion (Bassili, 1978; Pollick, Hill, Calder, & Patterson, 2003; Rosenblum, 2007), and identity (Rosenblum, 2007). Eye movement studies suggest that this low-spatial-frequency, high-temporal-frequency information plays an important role in everyday facial speech perception, when we tend to be looking at the eyes rather than the mouth (Vatikiotis-Bateson et al., 1998). Patterns of cognitive deficits also suggest that preservation of motion processing is more important to facial speech perception than preservation of static face perception (Campbell et al., 1997). Silent moving speech, unlike still images, activates the auditory cortex (Calvert & Campbell, 2003), and only moving silent speech captures dynamic cues to identity shared with voice (Kamachi et al., 2003). How is the kinematic information encoded? As noted, view-independent patterns of performance suggest it may be represented in a head-centered coordinate system.

Articulatory gestures are an appealing candidate for the representation of movement (Liberman & Mattingly, 1985; Summerfield, 1987). An example of a proposed articulatory gesture might be a bilabial lip closure, which would typically involve motion of both lips and the jaw as well as more remote regions such as the cheeks. Such gestures are the underlying cause of the correlation between sound and movements. They are face-centered, spread out in time, and directly related to linguistic distinctions but allow for the effects of coarticulation and show individual and prosodic variability. They would also naturally complete the perception-production loop, with mirror neurons providing a plausible physiological mechanism for such a model (Skipper, Nusbaum, & Small, 2005, 2006).

Gestures are determined by what is being said, but how do they relate to who and how? One might expect that differences associated with speaker and manner would be noise for recovering what is being said and vice versa. Instead, speaker-specific characteristics appear to be retained and beneficially affect the recovery of linguistic information (Yakel, Rosenblum, & Fortier, 2000). We appear to “tune in” to facial as well as auditory speech. Evidence from face-voice matching suggests that cues to identity are also closely tied to manner, i.e., how something is said (Lander et al., 2007). The words to be said (what) determine the articulator movements required, which in turn determine both facial movement and the sound of the speech. Recordings of facial movement show broadly equivalent patterns of movement across speakers (who) and manners (how) for equivalent utterances. Individual differences and changes in manner affect the timing and patterns of displacement associated with the underlying trajectory. It is these differences that allow us to answer who and how, but how are the differences encoded? When analyzing motion-capture data, the whole trajectory is available and equivalent features can be identified and temporally aligned for comparison. This is clearly not the case for interpreting speech viewed in real time. One possibility is to find underlying dynamic parameters, including masses, stiffnesses, and motion energy, that are characteristic of different individuals or manners and that affect trajectories globally. This would truly require the perception of dynamic facial speech.
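To make concrete what offline temporal alignment of complete trajectories involves, the sketch below uses dynamic time warping. This is a standard alignment technique chosen here for illustration, not necessarily the method used for the motion-capture data discussed in this chapter; the function name `dtw_align` and the two trajectories are hypothetical.

```python
def dtw_align(a, b):
    """Dynamic time warping of two 1-D trajectories.

    Returns the total alignment cost and the list of matched
    (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Trace back from the end along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

# Hypothetical jaw-opening trajectories for the same utterance (assumed data):
# the second speaker dwells longer at the start and at the peak.
speaker_a = [0, 1, 3, 1, 0]
speaker_b = [0, 0, 1, 3, 3, 1, 0]

distance, path = dtw_align(speaker_a, speaker_b)
```

The alignment needs both complete sequences before it can start, which is the contrast with real-time perception drawn above. Once aligned, what remains to be compared is the timing difference encoded in the path itself.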

In conclusion, while both audio and visual speech can be treated as piecewise static to an extent, both strictly temporal properties and global spatiotemporal patterns of movement are important in helping us to answer the questions of who, what, and how from dynamic facial speech.

Acknowledgments

The original data referred to in this chapter are from experiments carried out at Advanced Telecommunication Research in Japan and done with the support of National Institute of Information and Communications Technology. None of the work would have been possible without the help of many people, including Kato Hiroaki for the sets of minimal pairs and linguistic advice, Marion Dohen for advice on visible contrastive focus, Julien Deseigne for doing much of the actual work, Miyuki Kamachi for her face and voice, and Takahashi Kuratate and Erik Vatikiotis-Bateson for inspiration and pioneering the motion-capture setup. Ruth Campbell and Alan Johnston kindly commented on an earlier draft, but all mistakes are entirely my own.

Notes

(1.)
“Speech-reading” is a more accurate term for the processes involved in recovering information about speech from the visible appearance of the face because this involves not only the lips but also the tongue, jaw, teeth, cheeks, chin, eyebrows, eyes, forehead, and the head and quite possibly the body as a whole.
