This book is concerned with how multiple sources of information are processed in speech perception and, more generally, in pattern recognition. It is based upon an important research programme conducted by Massaro and his colleagues over the last two decades. The book focuses on the perception of so-called bimodal speech, addressing a wide range of issues about the way in which visual information (as provided by the speaker's face) and auditory information are combined with each other by the perceptual system. The scope of the book is much larger, however, as Massaro's purpose here is to describe and defend a new psychological law relevant to a wide variety of domains. In contrast to already well-established laws of the same kind (e.g. Weber's law of perception), which are all unidimensional, the new principle is multidimensional, in that it describes how several factors impact behaviour. This principle is embodied in a computational model of pattern recognition, the Fuzzy Logical Model of Perception (FLMP), whose latest version is presented and discussed in detail. The FLMP is systematically contrasted with alternative computational models, using a broad perceptual database as benchmark throughout the book. In a separate part, the book also deals with methods for synthesizing talking faces in experiments on bimodal speech perception, and introduces Baldi, the talking face developed by Massaro and his coworkers. The book is accompanied by a CD-ROM which contains a series of demonstrations relating to many of the topics dealt with.

The book is divided into four main sections. Section 1, "Perceiving Talking Faces", focuses on the perception of speech by ear and eye. Massaro reviews the most significant empirical findings in that domain, discusses the main methodological issues, and presents a general classification of the existing computational models of bimodal speech perception. Central to this section is the idea that speech perception obeys a general behavioural principle of integration between different sources of information. Section 2, "Broadening the Domain", aims at assessing how well this principle holds up across broad individual and situational variability. The author demonstrates that inter-individual variations in how bimodal speech is perceived, depending on the listener's age or native language for instance, can be accounted for within the FLMP framework. Using examples taken from different perceptual and cognitive situations, Massaro also defends the idea that the FLMP adequately describes information processing irrespective of these situational differences. Section 3, "Broadening the Framework", opens with a presentation of an extended and more explicit version of the FLMP, designed in particular to account for the dynamics of speech processing. The section also includes a detailed analysis of the methodological issues involved in assessing quantitative predictions in psychology, along with a discussion of the critiques expressed by other investigators about the FLMP over the years. Finally, Section 4, "Creating Talking Faces", is specifically dedicated to the synthesis of visual speech.

1.2 The new behavioural principle

Although many readers may already be familiar with Massaro's Fuzzy Logical Model of Perception, I shall here assume the contrary, and proceed to present a brief outline of the model.

A central assumption of the FLMP is that pattern recognition involves a common set of processes regardless of the specific nature of the patterns. Speech is not seen as being associated with a dedicated processing module, as in the motor theory of speech perception (Liberman, 1996) for instance. On the contrary, the sensory information is assumed to be processed in the same way whether our brain is busy recognizing speech sounds, letters, or manual gestures, to take but a few examples. In any of these cases, the FLMP postulates that mapping a stimulus into a unique perceptual category entails going through three main stages of processing, the feature evaluation stage, the feature integration stage, and the decision stage.

The evaluation stage consists of converting the available sources of information into a set of properties referred to as features. Each feature is given a continuous (fuzzy truth) value, and represents the degree to which the stimulus corresponds to each of a set of internal prototypical patterns, along a particular perceptual dimension. Thus, one important visual feature in the perception of CV syllables is the degree of opening of the lips. The model therefore assumes that the internal prototypes available to the perceptual system will specify that the lips are open at the onset of the syllable for /da/, closed for /ba/, etc. In a second stage, the features are integrated with each other, so as to determine the overall degree of match of the sensory input with each of the prototypes (e.g. each of the syllables known to the receiver). In the third and final stage, a decision is taken, on the basis of the relative goodness of match of the input with each prototype.

The FLMP makes a number of specific assumptions at each stage in this process. First, it hypothesizes that all of the available sources of information are simultaneously brought into play in pattern recognition. Thus, visible speech and auditory speech are both assumed to have an influence on how bimodal speech is perceived. Second, different sources of information are assumed to be evaluated independently of each other. This means for example that visible speech does not have any effect on how auditory speech is converted into a set of features, the two sources of information being combined at a later stage of processing only. The model also makes specific assumptions about how sources of information are integrated with each other (multiplicative rule), and about how decisions are taken (relative goodness rule).
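
As a minimal sketch of how the multiplicative rule and the relative goodness rule fit together in the two-alternative case, the Python function below takes an auditory and a visual fuzzy truth value for /da/ and returns the predicted probability of a /da/ response. The names `flmp_two_alternatives`, `a_da` and `v_da` are illustrative assumptions, as is the convention that the support for /ba/ is one minus the support for /da/. This is the form usually given for the two-alternative FLMP, not something spelled out in the present review.

```python
def flmp_two_alternatives(a_da: float, v_da: float) -> float:
    """Predicted probability of a /da/ response for one bimodal stimulus.

    a_da, v_da: fuzzy truth values in [0, 1] giving the auditory and visual
    support for /da/ (assumption: support for /ba/ is 1 minus that value).
    """
    # Integration (multiplicative rule): overall support for each prototype.
    support_da = a_da * v_da
    support_ba = (1.0 - a_da) * (1.0 - v_da)
    total = support_da + support_ba
    if total == 0.0:
        raise ValueError("sources give zero support to both alternatives")
    # Decision (relative goodness rule): support relative to all alternatives.
    return support_da / total


# Ambiguous audio paired with a clear visual /da/, then two agreeing sources.
print(flmp_two_alternatives(0.5, 0.9))
print(flmp_two_alternatives(0.9, 0.9))
```

Note that the two sources are evaluated separately (each arrives as its own value) and are only combined inside the function, which mirrors the independence assumption described above.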

A major prediction of the model is that "the influence of one source of information is greatest when the other source is neutral or ambiguous" (19). This prediction is best illustrated by an experiment whose results served as a database for testing models of pattern recognition on several occasions in the book (chapters 2 and 11). In this experiment, synthetic auditory stimuli ranging on a continuum between /ba/ and /da/ were crossed with visual stimuli also varying between /ba/ and /da/. The bimodal stimuli were presented to subjects in a forced-choice identification task, along with each of the unimodal stimuli. (This expanded factorial design is shown by Massaro to be the most appropriate experimental design for determining how two sources of information are combined with each other in pattern recognition.) For the bimodal stimuli, the main results are typically depicted as a two-factor plot, with the proportion of /da/ responses on the ordinate, the levels of the auditory source of information on the abscissa, and a different curve for each of the levels of the visual source of information. When represented in that way, the results clearly show a statistical interaction between the two sources of information. Specifically, the influence of one source of information proves to be larger in the middle, ambiguous range of the other source. This interaction graphically takes the shape of an American football, which is for this reason presented throughout the book as the hallmark of the Fuzzy Logical Model of Perception.
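
The interaction described above can be illustrated numerically. The sketch below assumes the usual two-alternative form of the FLMP and uses entirely hypothetical feature values for a five-step auditory continuum and two visual levels. It computes the size of the visual effect (the gap between the two visual curves) at each auditory level.

```python
def p_da(a: float, v: float) -> float:
    """Two-alternative FLMP prediction for P(/da/) given supports a and v."""
    return (a * v) / (a * v + (1.0 - a) * (1.0 - v))


# Hypothetical feature values along a five-step auditory continuum.
auditory_levels = [0.05, 0.25, 0.50, 0.75, 0.95]
# Hypothetical visual support for /da/ from a visual /ba/ and a visual /da/.
visual_ba, visual_da = 0.2, 0.8

# Size of the visual effect at each auditory level: the vertical distance
# between the two curves in the two-factor plot.
visual_effect = [p_da(a, visual_da) - p_da(a, visual_ba) for a in auditory_levels]

for a, effect in zip(auditory_levels, visual_effect):
    print(f"auditory support {a:.2f}: visual effect {effect:.3f}")
```

Under these assumed values the two curves lie close together at both ends of the auditory continuum and are furthest apart in the middle, which is what gives the plot its football-like outline.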

In summary, Massaro proposes a universal principle of perceptual cognitive performance to explain pattern recognition. According to this principle, "people are influenced by multiple sources of information in a diverse set of situations. In many cases, these sources of information are ambiguous and any particular source alone does not usually specify the appropriate interpretation. The perceiver appears to evaluate the multiple sources of information in parallel for the degree to which each supports various interpretations, integrate them together to derive the overall support for each interpretation, assess the support of each alternative based on all of the alternatives, and select the most appropriate response" (p. 291).

2 CRITICAL EVALUATION

2.1 General evaluation

This book is clearly a major contribution to the study of speech perception and, more generally, to cognitive psychology. It is admirably clear and is written in quite an elegant manner.

I do not doubt that the book will be read with great interest by research scientists from many different fields. This work is the result of an ambitious intellectual endeavour aimed at introducing a new behavioural law, which is placed by Massaro on an equal footing with Weber's law of perception, or the power law of learning. Speech scientists are presented with an extensive series of experiments on the perception of bimodal speech. Whatever stance they take in that domain, they should find quite challenging Massaro's view that speech perception constitutes but one aspect of a much more general form of cognitive processing, namely pattern recognition. Computer scientists working in the field of speech technology should be particularly interested in the book's final section about the synthesis of visual speech.

Regardless of their background, readers should also find the book worth using as a tutorial on the experimental methods available for investigating speech perception. A great variety of experimental paradigms and tasks are discussed at length by Massaro, who also extensively discusses the methods for assessing computational models of pattern recognition and, in particular, for fitting these models to observed results. In that respect, using the results of the experiment described above as a reference database was quite a good initiative in my view, as this allows the reader to understand Massaro's point more easily as new issues are raised, without having to go through the details of the experimental design again each time.

The book should also prove an invaluable resource for teaching. Care was taken to select prototypical results, as well as to set this work in its historical context. A number of rather fascinating anecdotes and historical references are given, going from McGurk's personal account of the discovery of the McGurk effect, to an audio-visual rendition of the introduction to George Miller's seminal article on the ubiquitousness of the number 7 plus or minus 2, with Miller's face texture-mapped onto Baldi's wire-frame head. The CD-ROM that accompanies the book enables the reader directly to experience the psychological illusions associated with the perception of bimodal speech, and constitutes as such a most useful research and teaching tool.

On the negative side, Massaro's use of the /ba/-/da/ experiment as a leading strand throughout obviously results in the book being focused on the perception of nonsense syllables. Although the interaction of visible speech and audible speech in word recognition is mentioned on a number of occasions (e.g. pp. 21-23 and pp. 181-182), the book contains few suggestions as to how we perceive isolated words, let alone connected speech. I was also surprised that little space was devoted to presenting other current theories and models of speech perception. Although models such as TRACE are mentioned on several occasions in the book, I think it is fair to say that the FLMP is still given the lion's share.

The book also has some minor defects such as the absence of a list of figures, and the fact that some of the CD-ROM bands (1.4, 1.5 and 1.6) are referred to incorrectly in the text. The list of the CD-ROM selections should have pointed to the pages where each band is referred to. In another domain, it would have been quite interesting to have the perceptual database used in the book made available on the CD-ROM. Although this would have probably required a substantial amount of additional work, I should also have found it useful to be provided with an interactive version of the main computational models discussed in the book (FLMP, the RACE model, the Single Channel model, etc.). The FLMP model can be downloaded from Massaro's laboratory Web site at Santa Cruz (http://mambo.ucsc.edu), but it is currently distributed in FORTRAN code which has to be modified and recompiled for each new set of data, an operation which is probably out of reach of many students in psychology or linguistics.

2.2 Specific comments

I am not familiar with all of the areas dealt with in this book, and will not hide the fact that this review is biased towards my own interests, namely the production and perception of auditory speech. The following comments more specifically concentrate on two issues relating to this area of research, the role of features in speech processing and the time course of speech processing.

2.2.1 Features

Most useful are the extensive comments made by Massaro about the status of features in his model (see in particular Chapter 2 and Chapter 10). I have long found it difficult to determine how close these features were to classical phonetic features. The book makes it clear to me that there is no direct relation between the former and the latter.

As indicated above, the FLMP postulates that there are three main stages of processing in pattern recognition: the feature evaluation stage, the feature integration stage, and the decision stage. Specific assumptions are made in the model about how features are integrated with each other, and how a decision is taken depending on the outcome of this integration. From a set of feature values, therefore, the model will predict the probability of occurrence of each possible response (e.g. "ba" and "da").

However, attention should be paid to the fact that these feature values are in no way derived from the stimulus. They are actually determined in an a posteriori manner, from the subjects' observed responses, using an algorithm (STEPIT) which minimizes the deviation between these responses and the predicted ones. Features are seen in the model as *free parameters*, whose values are set on the basis of the actual performance of the subject in the pattern recognition task, so as to make the model perform at its best, i.e. to maximize its goodness of fit. According to Massaro, "[the model is] *predicting* the exact *form* of the results, but *postdicting* the actual quantitative *values* that make up the overall predictions" (p. 294, his emphasis).
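
To make the idea of features as free parameters more concrete, the following Python sketch estimates feature values from response proportions by minimizing the root mean squared deviation (RMSD) between predicted and observed values. It is an illustration under stated assumptions, not a reimplementation of STEPIT: a plain coordinate search is swapped in as the minimizer, the two-alternative form of the FLMP is assumed, and the "observed" proportions are generated from made-up feature values. All function and variable names are hypothetical.

```python
import math


def p_da(a, v):
    """Two-alternative FLMP prediction for P(/da/)."""
    return (a * v) / (a * v + (1.0 - a) * (1.0 - v))


def predict(params, n_aud):
    """Predictions for an expanded factorial design.

    Order: unimodal auditory, unimodal visual, then every bimodal pairing.
    With a single source, the two-alternative prediction is the feature value.
    """
    aud, vis = params[:n_aud], params[n_aud:]
    return list(aud) + list(vis) + [p_da(a, v) for a in aud for v in vis]


def rmsd(params, observed, n_aud):
    """Root mean squared deviation between predicted and observed proportions."""
    predicted = predict(params, n_aud)
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


def fit(observed, n_aud, n_vis, sweeps=200):
    """Coordinate search over feature values (a stand-in, not STEPIT)."""
    params = [0.5] * (n_aud + n_vis)   # start from fully ambiguous features
    best = rmsd(params, observed, n_aud)
    step = 0.25
    for _ in range(sweeps):
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                # Keep values inside [0.01, 0.99] so p_da never divides by zero.
                trial[i] = min(0.99, max(0.01, trial[i] + delta))
                error = rmsd(trial, observed, n_aud)
                if error < best:
                    params, best, improved = trial, error, True
        if not improved:
            step /= 2.0                # refine the search once no move helps
    return params, best


# Hypothetical "observed" proportions, generated from made-up feature values:
# three auditory levels followed by two visual levels.
true_params = [0.1, 0.5, 0.9, 0.2, 0.8]
observed = predict(true_params, n_aud=3)

initial_rmsd = rmsd([0.5] * 5, observed, n_aud=3)
fitted_params, best_rmsd = fit(observed, n_aud=3, n_vis=2)
print(f"RMSD before fitting: {initial_rmsd:.3f}, after fitting: {best_rmsd:.3f}")
```

The point of the sketch is that nothing in `fit` looks at the stimulus itself: the feature values are chosen only because they bring the predictions close to the responses.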

In other words, the stimulus is on no occasion explicitly mapped onto the internal features of the FLMP model. In that respect, features as defined in the FLMP look markedly different from phonetic features. Let us take for example the opposition between /ba/ and /da/, on which much emphasis is put in the book. Acoustically, /b/ and /d/ are said to differ from each other according to the feature grave-acute, /b/ being classified as grave and /d/ as acute. As is the case with FLMP features, grave and acute can be viewed as target values referring to prototypical stops. However, the grave-acute feature is explicitly defined in acoustical terms (e.g. slope of the short-term spectrum at the release of the stop, see Stevens & Blumstein, 1978). On the contrary, the exact nature of the FLMP features remains undetermined, their values being subject to one main constraint which is to make the model account for the subjects' responses as accurately as possible. Thus, the acoustic structure of the stimulus is not directly taken into consideration in the estimation of the feature values.

In the experiments using audible speech, FLMP features do lend themselves to an acoustic interpretation. In the /ba/-/da/ experiment for example, the prototypes for /ba/ and /da/ are assumed to include one auditory feature, namely the variations in frequency of the second (F2) and third (F3) formants at the onset of the vowel (slightly falling F2-F3 for /da/, rising F2-F3 for /ba/). However, this interpretation stems from the fact that F2 and F3 onset frequencies were precisely the acoustic parameters manipulated by the experimenters to synthesize the auditory continuum between /ba/ and /da/. In other words, the acoustic significance of the FLMP features is derived from the way in which the experiment has been designed. The model does rely on a particular system of acoustic features (see for example Stevens & Blumstein, 1978, for an alternative system), but this system is embodied in the experimental design, and is as such external to the model itself.

In practice, therefore, the issue of how speech sounds are mapped onto features is not addressed in the model. Why this is so is not clear to me. On several occasions, Massaro suggests that determining in advance how a given individual will convert a given stimulus into a set of feature values is simply out of our reach. This stimulus-to-feature mapping shows a variability which is said to be analogous to the variability of the weather: there are just too many previous contributions and influences to allow quantitative prediction (135). A fundamental distinction is in fact established in the FLMP between the intake of *information*, i.e. the stimulus-to-feature mapping, and *information processing*, i.e. how features are combined with each other and mapped into a response (cf. p. 135). While the FLMP predicts that the information will be processed in the same way from one individual to the other, regardless of whether it relates to speech sounds, facial movements, manual gestures, etc., it is assumed that the way in which this information is extracted from the stimulus is on the contrary subject to too many sources of variations to be accurately characterized ahead of time. In my understanding, this means that the so-called evaluation stage cannot be accounted for by the model, or at least not with much accuracy.

However, on at least one occasion Massaro does suggest that this limitation is not inherent in every model of perception and pattern recognition, and could be circumvented in some way. According to him, one could indeed "easily hypothesize functions relating the feature values to the stimulus levels, [although] that would represent a *model of information* in addition to one of information processing" (294, my emphasis). This suggests that building such a model of information is feasible. Whether there is a possibility of the FLMP being completed with a model of this kind, i.e. an explicit stimulus-to-feature mapping stage, is an issue which remains to be addressed.

2.2.2 The time course of speech processing

Time plays quite a central role in different ways in the book. First, Massaro shows how the FLMP can be explicitly formalized to account for the dynamics of perceptual processing (chap. 9). This formalization is presented in reply to criticisms expressed by a number of investigators (e.g. McClelland, 1991), who have pointed out that the FLMP accurately characterizes the asymptotic outcome of the perceptual system (e.g. the probability for a particular response to occur), but has little to say about the time course of processing. The dynamic version of the FLMP is intended to address these reactions. In this version, the stimulus-to-feature mapping is assumed to take a certain amount of time. During this interval, the information about the stimulus gradually accumulates, and becomes increasingly accurate. It is assumed that accuracy increases as a negatively accelerated function of processing time, so that more information is gleaned early than late in the processing of the stimulus. One further assumption is that "integration of the separate features [is] updated continuously as the featural information is being evaluated. Similarly, decision [can] occur at any time after the stimulus presentation" (259). Thus, there is a partial temporal overlap between the different stages of processing, in the sense that one process can begin before a previous process is finished (see also Figure 2.1, p. 41).
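
What "negatively accelerated" means here can be shown with a short numerical sketch. The exponential growth curve below is one standard example of such a function and is used purely as an assumption. The review does not state which function the dynamic FLMP adopts, and the function name, asymptote and rate values are hypothetical.

```python
import math


def accumulated_accuracy(t_ms: float, asymptote: float = 1.0,
                         rate: float = 0.02) -> float:
    """Accuracy of the featural information after t_ms of processing.

    Assumed form: asymptote * (1 - exp(-rate * t)), a standard negatively
    accelerated growth curve. Both parameter values are hypothetical.
    """
    return asymptote * (1.0 - math.exp(-rate * t_ms))


# Accuracy sampled at equal 50 ms intervals of processing time.
times = [0, 50, 100, 150, 200]
accuracy = [accumulated_accuracy(t) for t in times]

# Gain in accuracy over each successive 50 ms interval.
gains = [later - earlier for earlier, later in zip(accuracy, accuracy[1:])]

for start, gain in zip(times, gains):
    print(f"{start:3d}-{start + 50:3d} ms: gain {gain:.3f}")
```

Each successive interval adds less than the one before it, so most of the information is acquired early in processing, as the paragraph above describes.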

These assumptions about the time course of information processing are supported by a number of experiments concerned with the effect of backward masking in the recognition of pure tones, and in the recognition of letters. Speech obviously raises a number of specific issues in that domain, however. Unlike written words, speech is a temporal phenomenon; it is continuous (i.e. there are no systematic acoustic boundaries between phonemes, syllables, or words) and, furthermore, time per se serves as a source of information in speech, as pointed out by Massaro (e.g. vowel duration is a major cue to the voicing of the following obstruent, to take but one example). Somewhat regrettably, few indications are given about how the model could be assessed in the speech domain (see remarks p. 194 and p. 263).

In addition to discussing the dynamics of processing, Massaro examines how the temporal relations between sources of information are dealt with in pattern recognition. Chapter 3 focuses on our sensitivity to temporal asynchronies between visible and audible speech. In the experiments reported in this chapter, bimodal CV syllables with various degrees of onset asynchrony between the auditory synthetic speech and the visible synthetic speech were presented to subjects in a forced-choice identification task. The results show that integration between the two sources of information still occurs when these sources of information are made asynchronous, provided that the time shift does not exceed a certain duration.

One major challenge for phoneticians and psycholinguists alike is to characterize the relationship between what could be called the *external* dynamics of speech, i.e. the temporal organization of the speech signal, and the *internal* time course of speech processing. Both play a role in the perception of speech, and it is most difficult to tell apart their respective influences on the listener's behaviour (Samuel, 1996). For example, in a gating study investigating the role of vowel duration as a cue to the voicing of the post-vocalic stop in CVC syllables, Warren and Marslen-Wilson (1988) found that the proportion of voiced-coda responses increased as the listeners were presented with increasingly long portions of the initial CV sequence. One obvious interpretation is that longer vowels were perceived as being associated with voiced codas rather than voiceless ones. In keeping with Massaro's dynamical FLMP, however, it may also be assumed that evaluating the information provided by the vowel takes time, and that the evidence pointing to a voiced coda gradually accumulates as more processing time is made available to the listener, all other things being equal. Thus, the above finding raises the issue of how to differentiate the effect of vowel duration per se on the listener's response, from that of the internal dynamics of processing. Although this issue is not directly addressed in the book, there is no doubt that the FLMP would constitute a most appropriate framework for further investigations in this domain.

2.3 General Conclusion

This book provides us with quite an extensive review of the work carried out by the author and others on the use of multiple cues in speech perception and, more generally, pattern recognition. It is aimed at a very large audience, and constitutes a most useful tool both for teaching and research purposes. I do not doubt that it will soon become a major reference for researchers in phonetics, psycholinguistics, and cognitive psychology.

The reviewer is a lecturer in the Laboratory for Psycholinguistics, FPSE, University of Geneva, Switzerland. His current research covers a variety of topics ranging from the dynamics of articulatory movements in speech production to the phonetic bases of word recognition. Thanks are due to Uli Frauenfelder for helpful comments. A LaTeX version of this document is available upon request (nnguyen@fapse.unige.ch).