Speech Perception and Generalization Across Talkers and Accents

Summary and Keywords

The seeming ease with which we usually understand each other belies the complexity of the processes that underlie speech perception. One of the biggest computational challenges is that different talkers realize the same speech categories (e.g., /p/) in physically different ways. We review the mixture of processes that enable robust speech understanding across talkers despite this lack of invariance. These processes range from automatic pre-speech adjustments of the distribution of energy over acoustic frequencies (normalization) to implicit statistical learning of talker-specific properties (adaptation, perceptual recalibration) to the generalization of these patterns across groups of talkers (e.g., gender differences).

The overarching goal of speech perception research is to explain how listeners recognize and comprehend spoken language. One of the biggest challenges of speech perception is the lack of a one-to-one mapping between acoustic information in the speech signal and linguistic categories in memory. This so-called lack of invariance in speech (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967) stems from a host of factors (see Klatt, 1986, for an overview). The physical properties of speech sounds produced by a given talker vary across productions due to factors such as tongue position, jaw position, and the temporal coordination of articulators (Stevens, 1972; see also Marin, Pouplier, & Harrington, 2010), as well as articulatory carefulness in formal versus casual speech (Lindblom, 1990), speaking rate, emotional state (Protopapas & Lieberman, 1997), and the degree of coarticulation with adjacent sounds (see Ladefoged, 1980; Öhman, 1966). Further, when comparing across talkers, variability arises due to differences in anatomical structures such as vocal tract length and vocal fold size (Fitch & Giedd, 1999; Peterson & Barney, 1952), as well as differences due to age (Lee, Potamianos, & Narayanan, 1999), gender (Perry, Ohde, & Ashmead, 2001), idiolectal articulatory gestures (Ladefoged, 1989), and regional or non-native accent (Labov, Ash, & Boberg, 2006). As a result, the physical realization of the same speech category can differ greatly over time, especially when produced by different talkers (Hillenbrand, Getty, Clark, & Wheeler, 1995; Peterson & Barney, 1952; Potter & Steinberg, 1950). For example, an adult female’s production of /ʃ/ might be very similar to an adult male’s production of /s/ due to the influence of vocal tract size on spectral center of gravity (one of the primary cue dimensions that indicates place of articulation for fricatives). 
Similarly, one talker’s realization of the vowel /ɛ/ (as in “said”) might sound like another talker’s realization of the vowel /æ/ (as in “sad”) due to cross-dialectal differences in the realization of these vowels. Figure 1 illustrates this many-to-many mapping problem for a hypothetical category contrast. Note that Figure 1 illustrates the consequences of within- and between-talker variability along a single category-relevant acoustic cue dimension; however, speech is high-dimensional, and speech categories are often signaled by multiple cues (Fox, Flege, & Munro, 1995; Jongman, Wayland, & Wong, 2000).

Figure 1. Schematic example illustrating the lack of invariance in speech and the resulting generalization problem. Top panel: Categories are realized as distributions of cue values (C1 and C2) produced by two talkers with different distributions and category boundaries (dots indicate the point of maximal overlap between categories for each talker). For both talkers, lower cue values in the talker’s range map to C1. However, Talker 2’s productions are shifted relative to Talker 1’s, reflecting the fact that talker-related factors such as vocal anatomy and accent affect the overall range of cue values that different talkers produce. Further, Talker 2 has a more peaked distribution for C2, compared to Talker 1, indicating less variability (greater precision) in the acoustic realization of this category. Bottom panel: Categorization functions of an ideal observer model (which closely match human behavior), showing the probability of hearing C2 for each cue value (for discussion of ideal observer models, see Section 4.2). Categorization is at chance where a cue value is equally likely to have come from either category (see top panel). Note the relationship between the shaded region in the top panel and the categorization functions in the bottom panel. Cue values that fall within this shaded region are strongly associated with C1 when produced by Talker 2, but the same cue values produced by Talker 1 are more likely to map onto C2. Listeners must be able to generalize their implicit knowledge about the mapping between cues and categories across talkers, while adjusting for the fact that these categories are realized in different ways by different talkers.

The lack of invariance in speech leads to an inference and generalization problem. To achieve perceptual constancy in the face of highly variable speech input, listeners must be able to generalize knowledge about the sound structure of their language across words, phonological contexts, talkers, accents, and speaking styles. Indeed, despite the ubiquity of variability in speech, listeners tend to understand native speakers of their language with surprisingly little difficulty. That is, listeners tend to succeed in mapping acoustic input to the linguistic categories intended by the talker. Even in extreme cases, such as listening to a talker with a heavy foreign accent amid a noisy background, listeners can often overcome initial perceptual difficulties. For example, with relatively brief exposure to foreign-accented speech, listeners show improvements in both processing speed (Clarke & Garrett, 2004) and category identification accuracy (Baese-Berk, Bradlow, & Wright, 2013; Bradlow & Bent, 2008; Reinisch & Holt, 2014; see also Romero-Rivas, Martin, & Costa, 2015). Thus, the speech perception system is able to adjust for the fact that the same physical cue values map onto different categories with different probabilities depending on the talker, and conversely that the same category can map onto different cue values (or even entirely different cue dimensions; Smith & Hayes-Harb, 2011; Toscano & McMurray, 2010; see also Schertz, Cho, Lotto, & Warner, 2015).

Precisely how the systems involved in speech perception cope with variability has been, and continues to be, a central and hotly debated theoretical issue in the field (for extensive discussion, see Cutler, Eisner, McQueen, & Norris, 2010; Johnson, 2005; Pisoni, 1997; Strange, 1989). Some theories assume the existence of invariant aspects of speech (e.g., acoustic invariants or invariant phonetic-articulatory gestures) that uniquely define phonetic categories (e.g., Fant, 1960; Galantucci, Fowler, & Turvey, 2006; Myers, Blumstein, Walsh, & Eliassen, 2009; Stevens & Blumstein, 1981). According to these theories, surface variability in the form of an utterance is uninformative, if not irrelevant, with respect to the mapping of speech input to linguistic categories. Other theories assume that variability within and across talkers is highly informative and plays a fundamental role in speech perception (Elman & McClelland, 1986; Goldinger, 1998; Holt, 2005; Kleinschmidt & Jaeger, 2015b). According to the latter view, listeners are sensitive to the relationship between acoustic variability (within and across talkers) and phonetic categories. For example, the categorization functions in the bottom panel of Figure 1 show the statistically optimal cue-to-category mapping function based on talker-specific knowledge of the distribution of variability associated with each category.
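The talker-specific categorization functions described above can be illustrated with a minimal sketch. It assumes that each category is a Gaussian distribution over a single cue and that both categories are equally probable a priori. The function names (`gaussian_pdf`, `prob_c2`) and all means and standard deviations are hypothetical values chosen to mimic the qualitative pattern in Figure 1; they are not estimates from any study.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def prob_c2(cue, talker, prior_c2=0.5):
    """Posterior probability of category C2 given a cue value (Bayes' rule)."""
    like_c1 = gaussian_pdf(cue, *talker["C1"])
    like_c2 = gaussian_pdf(cue, *talker["C2"])
    numerator = like_c2 * prior_c2
    return numerator / (numerator + like_c1 * (1.0 - prior_c2))

# Hypothetical (mean, sd) for each category on an arbitrary cue scale.
talker1 = {"C1": (20.0, 8.0), "C2": (50.0, 8.0)}
talker2 = {"C1": (35.0, 8.0), "C2": (65.0, 5.0)}  # shifted upward; C2 more peaked

cue = 45.0  # one physical cue value, heard from two different talkers
p1 = prob_c2(cue, talker1)
p2 = prob_c2(cue, talker2)
print(f"P(C2 | cue, Talker 1) = {p1:.3f}")
print(f"P(C2 | cue, Talker 2) = {p2:.3f}")
```

Under these assumed distributions, the same cue value is categorized as C2 with high probability for Talker 1 and with low probability for Talker 2, which is the generalization problem in miniature.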

This article provides an overview of research on the role of variability in speech perception. We focus primarily on talker variability and the issue of cross-talker generalization. However, the mechanisms for coping with talker variability also play a role in how listeners cope with within-talker contextual variability (for further discussion of this point, see Nusbaum & Magnuson, 1997). We aim to provide an overview of the critical concepts and debates in this domain of research; to chart significant historical developments; to emphasize underlying assumptions and the evidence that supports or opposes those assumptions; and to highlight overlap among lines of research that are often viewed as orthogonal or opposing. We organize the current discussion around four questions that have guided research on talker variability: (a) To what extent are there invariant aspects of speech? (b) How is the speech signal adjusted during (the early stages of) processing? (c) How do listeners make use of the statistical distribution of variability across talkers, such as systematic variation due to a talker’s accent, sex, or other social group membership? And (d) how is such information learned? We conclude by indicating important directions for future research.

2. To What Extent Are There Invariant Aspects of Speech?

Early approaches to the issue of talker variability in speech perception assumed that acoustic variability in the realization of speech sounds was perceptual “noise” that obscures the abstract symbolic content of the linguistic message. To understand how listeners achieve perceptual constancy when confronted with noisy input, a large body of research focused on identifying invariant aspects of speech that uniquely identify phonemic categories (e.g., Fowler, 1986; Ladefoged & Broadbent, 1957; Peterson, 1952; Shankweiler, Strange, & Verbrugge, 1975). According to this approach, listeners’ ability to generalize the sound structure of their language across talkers—i.e., to recognize physically different speech signals from different talkers as instances of the same speech category—is the result of the perceptual system focusing on or extracting invariant aspects of speech and ignoring variability due to the talker’s idiolect, accent or vocal anatomy/physiology.

2.1 Acoustic Invariance

Within this tradition, some researchers aimed to identify invariant acoustic information in the speech signal: that is, category-specific acoustic cues that are produced the same by all talkers (Cole & Scott, 1974; Fant, 1960). The theory of acoustic invariance is most fully elaborated for stop consonants (e.g., /b/, /d/, /g/, etc.; Kewley-Port, 1983; Lahiri, Gewirth, & Blumstein, 1984; Walley & Carrell, 1983). Several studies argued that the shape of the spectrum (the distribution of energy as a function of frequency) at the release of a stop consonant is an invariant cue to place of articulation (Blumstein & Stevens, 1979; Halle, Hughes, & Radley, 1957; Stevens & Blumstein, 1977, 1978; Zue, 1976). As shown in Figures 2 and 3, the gross shape of the short-term spectrum at the time of burst release is diffuse and either falling or flat for labial consonants (e.g., /b/), diffuse and rising for alveolar consonants (e.g., /d/), and compact for velar consonants (e.g., /g/). Thus, it was proposed that a perceptual mechanism that samples the short-term spectra at the time of burst release can reliably distinguish stop consonants with a place of articulation contrast (see, e.g., Stevens & Blumstein, 1981). Indeed, early automatic phoneme recognition systems that implemented such a mechanism achieved considerable accuracy in classifying stop consonants produced by different talkers (Searle, Jacobson, & Rayment, 1979). For further elaboration of acoustic invariance for stop consonant place of articulation, see research on formant transitions (e.g., Delattre, Liberman, & Cooper, 1955; Story & Bunton, 2010).

Figure 2. Examples of waveforms and short-term spectra (boxes) sampled at the release of three voiced and voiceless stop consonants as indicated. Superimposed on two of the waveforms (for /ba/ and /pu/) is the time window of width 26 msec used for sampling the spectrum. Short-time spectra are determined for the first difference of the sampled waveform (for details, see source).

Figure 3. Schematization of the diffuse-rising, diffuse-falling, and compact templates designed to capture the gross spectral shapes characteristic of alveolar (e.g., /d/, /t/), labial (/b/, /p/), and velar (/g/, /k/) places of articulation, respectively. The diffuse templates require a spread of spectral peaks across a range of frequencies, with increasing energy at higher frequencies for the diffuse-rising template and a falling or flat spread of energy for the diffuse-falling template. The compact template requires a single spectral peak in the mid-frequency range.

For a theory of acoustic invariance to provide a sufficient account of talker variability and speech perception, invariant category-distinguishing acoustic cues must be identified for the full set of sounds in a language. To date, however, sufficient cues have not been identified for speech sounds like vowels and fricatives, in part because the physical properties of these sounds are highly dependent on the talker’s vocal anatomy and accent, as we discuss further below. It should be noted that while the search for acoustic invariance has not yielded a viable account of speech perception, this line of research markedly advanced the understanding of the spectral properties of speech, which has had far reaching benefits: inter alia, improving the quality of synthesized speech (see Klatt, 1987).

2.2 Articulatory/Motor Invariance

Another approach to invariance has focused on articulatory gestures, arguing that the invariant aspects of speech are not part of the acoustic signal but rather part of the production processes that generate the signal (Iskarous, Fowler, & Whalen, 2010; Sussman, Fruchter, Hilbert, & Sirosh, 1998). The central tenet of the motor theory of speech perception is that the objects of speech perception are the “intended phonetic gestures” of a talker (Liberman, 1982; Liberman et al., 1967; Liberman & Mattingly, 1985). This claim is based on several assumptions about the architecture of the speech processing system. First, motor theory assumes that speech production and speech perception are tightly linked and share the same representations. Second, this theory assumes that speech sounds are represented in the brain as “invariant motor commands that call for movements of the articulators through certain linguistically significant configurations” (Liberman & Mattingly, 1985, p. 2, emphasis added): e.g., the category [m] is described as a combination of a labial gesture and a velum-lowering gesture. While the abstract category-specific motor commands are assumed to be invariant, the physical execution of these commands naturally varies across utterances and talkers. Thus, Liberman and Mattingly (1985, p. 3) argue that:

[t]o perceive an utterance is to perceive a specific pattern of intended gestures. We have to say “intended gestures,” because, for a number of reasons (coarticulation being merely the most obvious), the gestures are not directly manifested in the acoustic signal or in the observable articulatory movements. It is thus no simple matter . . . to define specific gestures rigorously or to relate them to their observable consequences.

On this view, speech perception involves reconstructing the production plan. In other words, speech input is perceived as the intended phonetic gestures by internally deriving the gestures involved in producing the speech signal (e.g., analysis by synthesis). As the quote above indicates, however, one of the challenges faced by motor theory is to provide an explicit account of how speech input is translated into ostensibly invariant motor commands.

One answer to this challenge holds that the objects of speech perception are the actual physical gestures produced by a talker, rather than the intended gestures (see the direct realist view of speech perception, which is broadly related to the motor theory, but differs in many of the basic assumptions; Best, 1995; Fowler, 1986, 1991; Gibson, 1966). For physical gestures to be the objects of speech perception, these gestures must be “perceivable” even when listeners have no visual information about the physical production of speech sounds (e.g., when talking on the phone or listening to the radio). Research in the field of automatic speech recognition has demonstrated that articulatory gestures can be recovered from the acoustic signal, without any corresponding visual articulatory information (for a review, see Schroeter & Sondhi, 1994) and that these recovered gestures can indeed guide speech recognition (e.g., Mitra, Nam, Espy-Wilson, Saltzman, & Goldstein, 2012).

Any theory of speech perception based on articulatory recovery must (at minimum) account for anatomical and postural differences between talkers; otherwise, acoustic variation resulting from such factors might be wrongly attributed to differences in the movement of articulators, or vice versa. One proposal concerning talker variability and articulatory recovery (see, e.g., McGowan & Berger, 2009; McGowan & Cushing, 1999) starts from the assumption that speech perception relies on an internal articulatory model, which comprises a talker-independent representation of the human vocal tract, along with knowledge of the acoustic consequences that result from different gestural configurations. When listeners hear speech, talker-specific anatomical features are estimated from the speech signal, and these estimates are used to adjust the internal vocal tract representation (for estimation methods, see, e.g., Hogden, Rubin, McDermott, Katagiri, & Goldstein, 2007). Articulatory gestures can then be recovered by comparing the observed speech input to the output of different configurations of the adjusted internal model. McGowan and Cushing (1999) demonstrated that this approach aids the recovery of category-relevant articulatory movements from male and female talkers who differ, inter alia, in vocal tract length and palate height.

There is a considerable body of evidence showing that the perception of speech sounds is indeed influenced by information about the physical production of those sounds (for extensive discussion, see Galantucci et al., 2006; Vroomen & Baart, 2012), such as visual information about articulatory movements, as in the classic McGurk effect (McGurk & MacDonald, 1976), or haptic information gathered by touching a talker’s face during articulation (Fowler & Dekle, 1991; Sato, Cavé, Ménard, & Brasseur, 2010). These findings indicate that speech production can provide information that guides speech perception (as claimed by auditory theories of speech processing that include a role for motor knowledge; see Figure 4a). However, these findings are insufficient to support the claim that articulatory gestures form the sole basis of speech perception (see Figure 4b). In fact, there are several reasons to doubt this claim.

One of the main issues with gesture-based theories is that speech production and speech perception can be disrupted independently, which calls into question the assumption that production is required for perception. For example, expressive aphasia, also known as Broca’s aphasia, is a language disorder that is characterized by severe disruption of speech production processes as a result of brain damage (e.g., a brain lesion or stroke), but often only mild, if any, disruption to perception and comprehension processes (see, e.g., Naeser, Palumbo, Helm-Estabrooks, Stiassny-Eder, & Albert, 1989). The linguistic abilities of expressive aphasics are particularly problematic for the motor theory, given that this theory assumes motor-based representations of speech categories that are shared between production and perception (Hickok, Costanzo, Capasso, & Miceli, 2011; Lotto, Hickok, & Holt, 2009; though see Wilson, 2009, for a counterargument).

Further evidence for a dissociation between perception and production comes from studies that demonstrate human-like speech perception phenomena in animals, despite the fact that the animals being tested lack the anatomical apparatus to produce speech. For example, chinchillas show human-like categorical perception of speech sounds (e.g., abrupt rather than gradual changes in the perception of voiced /d/ and voiceless /t/ when tokens are varied along a voice onset time continuum; Kuhl & Miller, 1975). Many animals also show a human-like ability to differentiate phonological categories while ignoring talker-related variability in the realization of those categories: e.g., zebra finches (Ohms, Gill, Van Heijningen, Beckers, & ten Cate, 2010); ferrets (Bizley, Walker, King, & Schnupp, 2013); rats (Eriksson & Villa, 2006); chinchillas (Burdick & Miller, 1975); and cats (Dewson, 1964). It is unlikely that these animals evolved to have a mental representation of the human vocal tract or specialized knowledge of the motor commands used to produce speech sounds. Thus, the findings from these animal studies pose a challenge for theories that place gesture-based knowledge at the center of speech perception (for further discussion, see Kriengwatana, Escudero, & ten Cate, 2015).

3. How Is the Speech Signal Adjusted During the Early Stages of Processing?

A third approach to variability in the realization of speech categories proposes that invariance is achieved via perceptual processes that warp or transform the speech signal. This approach is often referred to as normalization: speech perception is taken to effectively normalize variability by interpreting certain aspects of the speech signal in relation to other aspects: e.g., adjusting the perception of voice onset times based on the talker’s speaking rate (Newman & Sawusch, 1996) or adjusting the perceived distribution of energy at different frequencies based on an estimate of the talker’s vocal tract size (see Johnson, 2005, for a review of talker normalization). In other words, unlike the accounts in the previous section that were—at least in their original conceptions—concerned with absolute acoustic or articulatory invariance, normalization accounts are often concerned with relational invariance (e.g., Sussman, 1989).

Before discussing the merits and limitations of normalization approaches, we begin with a detailed example that illustrates one of the core phenomena addressed by this line of research: variability resulting from the talker’s vocal anatomy/physiology. We focus specifically on vocal tract-related vowel variability, which has played a central role in the normalization literature (for discussion, see reviews by Adank, Smits, & van Hout, 2004; Johnson, 2005). Note, however, that normalization accounts have also been developed for consonants (Holt, 2006; Johnson, 1991; Mann & Repp, 1980) and lexical tone (Fox & Qi, 1990; Huang & Holt, 2009; Moore & Jongman, 1997).

Adult men tend to have longer vocal tracts than adult women due to laryngeal descent (i.e., the lowering of the larynx in the throat) during puberty (Fitch & Giedd, 1999). Longer vocal tracts resonate at lower frequencies than shorter vocal tracts (Chiba & Kajiyama, 1941). Thus, vowel productions from adult men tend to be characterized by formants (acoustic resonances of the vocal tract) at lower frequencies than corresponding vowels produced by adult women (see the top left panel of Figure 5; see also Huber, Ash, & Johnson, 1999). This biological difference has consequences for vowel perception. The first and second formants (F1 and F2) vary systematically by vowel type (Ladefoged, 1980) and are two of the primary cue dimensions used in vowel identification (Fox et al., 1995; Verbrugge, Strange, Shankweiler, & Edman, 1976; Yang & Fox, 2014): for example, the vowel /u/ (as in the word suit) is characterized by a lower F1 and (in many varieties of English) a lower F2 than the vowel /ʊ/ (as in the word soot). As a result of laryngeal descent during puberty, and hence formant lowering, the F1 and F2 of an adult male’s realization of /ʊ/ (the vowel with the relatively higher formants) might match the F1 and F2 of an adult female’s realization of /u/, as shown in the bottom left panel of Figure 5 (see also Hillenbrand et al., 1995; Peterson & Barney, 1952).1
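The relationship between vocal tract length and resonance frequency can be illustrated with the standard textbook idealization of the vocal tract as a uniform tube closed at one end (the glottis) and open at the other (the lips), whose resonances fall at odd multiples of c/4L. This is only a rough approximation of a neutral, schwa-like vowel, and the tube lengths and the function name `tube_resonances` below are illustrative assumptions rather than values from the studies cited above.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # approximate speed of sound in warm, humid air

def tube_resonances(length_cm, n_formants=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Quarter-wavelength resonator: F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]

long_tract = tube_resonances(17.5)   # hypothetical longer vocal tract
short_tract = tube_resonances(14.5)  # hypothetical shorter vocal tract

for label, formants in (("17.5 cm", long_tract), ("14.5 cm", short_tract)):
    print(label, [round(f) for f in formants])
```

Every resonance of the shorter tube is higher than the corresponding resonance of the longer tube by the same proportion, which is why a talker's overall formant range carries information about vocal tract length.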

Figure 5. Example of cross-talker vowel variability before (left) and after (right) normalizing for differences in vocal-tract length based on F3. Top panel: the average vowel space for adult male and adult female talkers in the vowel corpus collected by Hillenbrand et al. (1995). Talkers in this corpus are from the northern dialect region of American English. Bottom panel: the degree of overlap among adult male productions of /ʊ/ and adult female productions of /u/. Plots show individual vowel tokens (small dots), category means (large dots), and 95% confidence ellipses. Note that the unnormalized male and female vowel spaces (top left panel) have approximately the same geometric configuration, but the male vowels are characterized by comparatively lower absolute F1 and F2 values, which reflects the fact that longer vocal tracts resonate at lower frequencies than shorter vocal tracts. As a result, the distribution of adult male tokens of /ʊ/ is highly overlapping with the distribution of adult female tokens of /u/ in F1 × F2 space (bottom left panel). That is, the same acoustic information maps onto different phonological categories with different probabilities depending on the talker’s sex. Hence neither F1 nor F2 provides reliable information for discriminating these vowel categories across talker sex. Normalizing F1 and F2 based on F3 (which is correlated with vocal tract length) considerably reduces the difference between the average male and female vowel spaces (top right panel), while preserving the overall shape of the space (i.e., the relative position of vowels). As a result of F3-normalization, tokens of /ʊ/ and /u/ are less overlapping (bottom right panel) and, hence, more discriminable despite vocal tract differences across talkers.

The example in Figure 5 highlights several facts. First, there is not a one-to-one mapping between category-relevant acoustic cues and linguistic categories: e.g., one talker’s [u] is another talker’s [ʊ]. Second, anatomical differences across talkers have a systematic influence on the acoustic realization of speech sounds: e.g., there is a relationship between vocal tract length and resonance frequencies. Third, talkers with the same dialect maintain the same structural relationships among speech sounds: e.g., the contrast between the vowels /u/ and /ʊ/ and the relative position of these vowels in acoustic-phonetic space. According to the normalization approach, the speech perception system copes with acoustic variability by capitalizing on relational aspects of speech. In other words, it is not the absolute value of category-relevant speech cues that matters for speech perception, but rather the relationship among various cues.

3.1 Normalization as an Automatic Auditory Process

Normalization has a long history in research on speech perception, and a number of related normalization mechanisms have been proposed (see, e.g., Barreda, 2012; Irino & Patterson, 2002; Joos, 1948; Lloyd, 1890a; Lobanov, 1971; Nearey, 1989; Nordström & Lindblom, 1975; Strange, 1989; Zahorian & Jagharghi, 1993). Some normalization accounts are purely auditory, such as ratio-based accounts2 in which category-relevant speech cues (e.g., F1 and F2 for vowels) are normalized based on acoustic correlates of the talker’s vocal tract length, such as a talker’s fundamental frequency (F0) or third formant (F3; Bladon, Henton, & Pickering, 1984; Claes, Dologlous, Bosch, & van Compernolle, 1998; Halberstam & Raphael, 2004; Miller, 1989; Monahan & Idsardi, 2010; Nordström & Lindblom, 1975; Peterson, 1961; Sussman, 1986; Syrdal & Gopal, 1986). Figure 5 shows an example of F3-normalization in which the F1 and F2 from a set of American English vowels (left panel) are converted to F1/F3 and F2/F3 ratios (right panel), which dramatically reduces vocal tract-related variability across talkers with the same accent (see Monahan & Idsardi, 2010, for further discussion). Other normalization accounts assume an articulatory basis for speech perception. For example, in McGowan and Cushing’s (1999) articulatory recovery model (see above), vocal tract normalization is the first step in extracting category-relevant gestural information from the speech signal. What the majority of these proposals share is the belief that normalization of the speech signal results from automatic pre-categorical auditory processes (Huang & Holt, 2011; Sjerps, McQueen, & Mitterer, 2013; Sussman, 1986; Watkins & Makin, 1996).
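The ratio-based F3-normalization described above can be sketched in a few lines. The formant values are hypothetical, and the second talker is constructed by scaling the first talker's formants uniformly, which is an idealization: real differences between talkers are not perfectly uniform, so real data show reduced rather than eliminated variability. The function name `f3_normalize` is likewise an assumption made for illustration.

```python
def f3_normalize(f1, f2, f3):
    """Express F1 and F2 as ratios to F3 (an acoustic correlate of vocal tract length)."""
    return f1 / f3, f2 / f3

# Hypothetical (F1, F2, F3) in Hz for the same vowel category from two talkers.
talker_long = (450.0, 1100.0, 2400.0)
talker_short = tuple(f * 1.15 for f in talker_long)  # idealized uniform scaling

ratios_long = f3_normalize(*talker_long)
ratios_short = f3_normalize(*talker_short)

print("raw F1 difference (Hz):", talker_short[0] - talker_long[0])
print("normalized:", ratios_long, ratios_short)
```

In this idealized case the raw formants differ across talkers while the F1/F3 and F2/F3 ratios coincide, so a classifier operating on the ratios would treat the two tokens as the same vowel.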

Early work on normalization was motivated in part by how the peripheral auditory system encodes the frequency content of speech. In humans (and other mammals), sound frequency discrimination begins in the cochlea. The basilar membrane, which is part of the cochlea, is tonotopically organized, meaning that different regions respond to different frequencies. Specifically, hair cells that are positioned further along the membrane respond to progressively lower frequencies. Building on these basic aspects of sound perception, Potter and Steinberg (1950, p. 812) proposed that “within limits, a certain spatial pattern of stimulation on the basilar membrane may be identified as a given sound regardless of position along the membrane.” In other words, Potter and Steinberg proposed a ratio-based account of normalization in which the peripheral auditory system perceives the relationship among co-occurring formants, rather than perceiving individual formants. In a similar vein, Sussman (1986) developed a simulation-based model of vowel normalization and representation that involved “combination-sensitive neurons,” which integrate information from multiple formants before mapping the normalized input to abstract representations of vowel categories. This line of research suggests that normalization is a low-level process, both in terms of the point in the processing stream at which the adjustments occur (i.e., pre-categorical adjustments to the speech signal, as opposed to higher-level adjustments that affect the mapping of phonetic percepts to phonological categories) and in terms of the perceptual systems that are responsible for these adjustments (i.e., the peripheral auditory system).

It is worth noting that the frequency-position map of the basilar membrane—the relationship between acoustic frequency and position along the membrane—appears to be logarithmic over most of the cochlea’s range of frequency sensitivity (Greenwood, 1961). Thus, log-transforming frequency—or converting the frequency scale to Bark or mel, which are units of measurement that are (approximately) logarithmically related to frequency—is a type of normalization that aims to capture the biological structure of the inner ear and the psychophysics of sound perception (Adank et al., 2004; Sussman, 1986; Syrdal & Gopal, 1986).
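As a concrete illustration of such auditory scales, the sketch below converts frequency in Hz to mel and to Bark using two widely used closed-form approximations. These particular formulas are an assumption of this example (several variants exist in the literature) and are not necessarily the ones used in the work cited above.

```python
import math

def hz_to_mel(f_hz):
    """A common mel-scale approximation: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    """A common Bark-scale approximation: 26.81 * f / (1960 + f) - 0.53."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# The same 100 Hz step covers less perceptual distance at high frequencies.
low_step = hz_to_mel(1100.0) - hz_to_mel(1000.0)
high_step = hz_to_mel(4100.0) - hz_to_mel(4000.0)

for f in (250.0, 1000.0, 4000.0):
    print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2))
```

Both scales compress high frequencies relative to low frequencies, so equal steps in Hz correspond to progressively smaller steps in mel or Bark as frequency increases.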

Some of the most compelling behavioral evidence for speech normalization comes from experiments showing context-dependent shifts in the perception of speech sounds (Holt, 2005; Laing, Liu, Lotto, & Holt, 2012; Lindblom & Studdert-Kennedy, 1967; Mann, 1980; Mann & Repp, 1980). These studies provide evidence for normalization as a pre-categorical process, though not necessarily a cochlear or peripheral auditory process (see, e.g., Holt & Lotto, 2002; Sjerps et al., 2013). In a seminal study, Ladefoged and Broadbent (1957) found that perception of a target utterance (a word that was relatively ambiguous between “bit,” with an [ɪ], and “bet,” with an [ɛ], due to the frequency of F1) shifted depending on the formant structure of the preceding carrier phrase. Ladefoged and Broadbent (1957) manipulated the F1 and F2 of all vowels in the carrier phrase, either lowering or raising them. The lowered or raised F1 and F2 across vowels in the carrier phrase thus suggested a talker with a relatively longer or shorter vocal tract, respectively. Ladefoged and Broadbent found that this manipulation had a spectrally contrastive effect on the perception of the target vowel: when the carrier phrase had raised vowel formants, the vowel in the target word tended to be heard as [ɪ] (as in “bit”), which has a lower F1 than [ɛ]; when the carrier phrase had lowered formants, the target tended to be heard as [ɛ] (as in “bet”), the vowel with the relatively higher F1. That is, listeners interpreted the vowel in the target word relative to the talker’s vowel space. Thus, the speech perception system compensated for talker-related variability as revealed in the preceding utterance (see also Ladefoged, 1989).

Building on this seminal finding, Holt and colleagues (Holt, 2005, 2006; Huang & Holt, 2009) demonstrated that the same spectrally contrastive perceptual shift occurs even when the carrier phrase is replaced with a series of non-speech sine-wave tones. In these experiments, a constant speech target was interpreted as relatively higher when preceded by a series of pure tones sampled from a distribution of low-frequency tones, but as relatively lower when preceded by a series of tones drawn from a distribution of high-frequency tones. The fact that both speech and non-speech contexts elicit this shift suggests that pre-categorical normalization results from general, rather than speech-specific, auditory processes that are sensitive to the relational properties of the acoustic input (Laing et al., 2012). The finding of spectrally contrastive perceptual shifts suggests that speech perception is sensitive to the statistical distributions of spectral information in the local context, even if these distributions include nonlinguistic spectral information.

Findings by Holt and colleagues further indicate that normalization does not result solely from peripheral auditory mechanisms. Holt and Lotto (2002), for example, found that the spectrally contrastive effect of context occurred even when the preceding context and the target utterance were presented in different ears. This finding suggests that normalization is due, at least in part, to central auditory processes because information from both ears must have been integrated in order for the context-dependent perceptual shift to emerge.

3.2 Category-Intrinsic vs. Category-Extrinsic Normalization

One dimension along which normalization proposals differ is the type of information that is used to perform the normalization (for discussion, see Ainsworth, 1975; Johnson, 1990). Category-extrinsic procedures involve normalizing a category token based on an external frame of reference, such as information from the preceding utterance or context (Holt, 2005, 2006; Sjerps et al., 2013). In an early but highly influential proposal, Joos (1948, p. 61) argued that vowel information is perceptually evaluated on a talker-specific “coordinate system” that the listener “swiftly constructs” based on information from other vowels from the same talker. By contrast, category-intrinsic procedures rely exclusively on information from a given category token to normalize that token. An example of category-intrinsic normalization is calculating the interval between adjacent formants of a given vowel token in order to isolate the relative pattern of formants independent of the absolute formant frequencies (e.g., F2−F1, F1−F0; Syrdal & Gopal, 1986).
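A category-intrinsic computation of the kind just described can be sketched as follows. This is a minimal illustration under stated assumptions: the Bark formula is a common approximation rather than necessarily the one used in the cited work, the function names are hypothetical, and the formant values are illustrative rather than measurements from any cited study.

```python
def hz_to_bark(f_hz: float) -> float:
    """Approximate Bark value (assumed formula: 26.81 * f / (1960 + f) - 0.53)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def intrinsic_pattern(f0_hz: float, f1_hz: float, f2_hz: float) -> dict:
    """Re-express one vowel token as intervals between adjacent components.

    Only information from the token itself is used, which is what makes
    the procedure category-intrinsic.
    """
    f0, f1, f2 = (hz_to_bark(v) for v in (f0_hz, f1_hz, f2_hz))
    return {"F1-F0": f1 - f0, "F2-F1": f2 - f1}


# Illustrative (not measured) values for a high front vowel from two talkers.
talker_a = intrinsic_pattern(f0_hz=136.0, f1_hz=270.0, f2_hz=2290.0)
talker_b = intrinsic_pattern(f0_hz=235.0, f1_hz=310.0, f2_hz=2790.0)

print(talker_a)
print(talker_b)
```

The output for each token depends only on that token, so no preceding context or talker history is needed.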

Research on the role of F0 in talker normalization serves to highlight the complementarity of category-intrinsic and category-extrinsic approaches. F0 results from the periodic pulsing of the vocal folds and is correlated with vocal fold size and (indirectly) with vocal tract size. F0 therefore provides a vowel-intrinsic reference point for normalizing variability due to the talker’s vocal anatomy (for various instantiations of this approach, see, e.g., Hirahara & Kato, 1992; Johnson, 2005; Katz & Assmann, 2001; Syrdal & Gopal, 1986). However, one limitation of vowel-intrinsic F0 normalization is that listeners can recognize vowels with a high degree of accuracy even when F0 is not present in the signal (Tartter, 1991), as in the case of whispered speech (i.e., there is no periodic pulsing of the vocal folds during whispered speech because the vocal folds are held tight). Vowel-extrinsic F0 normalization (Miller, 1989) provides a potential solution because the formants in whispered vowels could be normalized based on an aggregate measure of the talker’s fundamental frequency, calculated over previous tokens in which F0 was present.
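The fallback from vowel-intrinsic to vowel-extrinsic F0 can be sketched as follows. This is a minimal illustration under stated assumptions: the class name is hypothetical, the aggregate is assumed to be a simple mean of the talker's earlier F0 values, and the log ratio of F1 to the reference F0 is only one of several possible ways to scale a formant.

```python
import math
from typing import List, Optional


class TalkerF0Reference:
    """Hypothetical helper that tracks one talker's F0 over voiced tokens."""

    def __init__(self) -> None:
        self._f0_values: List[float] = []

    def normalize_f1(self, f1_hz: float, f0_hz: Optional[float]) -> float:
        """Return log(F1 / reference F0) for one vowel token.

        If the token carries its own F0, that value is the reference
        (intrinsic) and is stored for later use. If F0 is missing, as in
        whispered speech, the mean of earlier F0 values is used (extrinsic).
        """
        if f0_hz is not None:
            self._f0_values.append(f0_hz)
            reference = f0_hz
        elif self._f0_values:
            reference = sum(self._f0_values) / len(self._f0_values)
        else:
            raise ValueError("no F0 information available yet for this talker")
        return math.log(f1_hz / reference)


talker = TalkerF0Reference()
talker.normalize_f1(f1_hz=300.0, f0_hz=200.0)  # voiced token
talker.normalize_f1(f1_hz=320.0, f0_hz=220.0)  # voiced token
whispered = talker.normalize_f1(f1_hz=310.0, f0_hz=None)  # falls back to mean F0
print(whispered)
```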

The evidence discussed above suggests that normalization involves sensitivity to both syntagmatic relationships (e.g., the relationship between a given speech sound and aspects of the surrounding context) and paradigmatic relationships (e.g., the relationships among category-internal sources of information). This leaves open the possibility that listeners draw on a wide variety of cues, possibly weighting them in accordance with their informativity. This possibility receives some support from a review of proposed vowel normalization algorithms, which found that vowel-extrinsic procedures performed better than vowel-intrinsic procedures in achieving relational invariance (Adank et al., 2004). Since category-extrinsic information is more available and plentiful than category-intrinsic information (the latter is limited, by definition), extrinsic cues are a priori more likely to yield reliable information about talker-related sources of variation, and hence to provide a stable baseline for normalization.
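One way to see why extrinsic information can yield relational invariance is a per-talker z-score over a talker's own vowel tokens. The sketch below is illustrative only: z-scoring is used here as an assumed example of a vowel-extrinsic procedure, the F1 values are hypothetical, and the second talker is an idealized uniformly scaled copy of the first.

```python
from statistics import mean, pstdev
from typing import List


def z_score_by_talker(values_hz: List[float]) -> List[float]:
    """Standardize one talker's formant values against that talker's own
    mean and standard deviation (an extrinsic frame of reference)."""
    mu = mean(values_hz)
    sd = pstdev(values_hz)
    return [(v - mu) / sd for v in values_hz]


# Hypothetical F1 values (Hz) for the same four vowels from two talkers.
talker_a_f1 = [270.0, 530.0, 730.0, 300.0]
# Idealization: talker B's values are talker A's scaled by a constant factor.
talker_b_f1 = [1.2 * v for v in talker_a_f1]

normalized_a = z_score_by_talker(talker_a_f1)
normalized_b = z_score_by_talker(talker_b_f1)

for a, b in zip(normalized_a, normalized_b):
    print(f"{a:+.3f} {b:+.3f}")
```

Under this idealization the raw values differ by 20 percent while the standardized values coincide. Real talkers differ in more than a uniform scale factor, so the match would only be approximate in practice.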

3.3 Normalization and Learning

While normalization algorithms have been shown to reduce talker-related acoustic variability, particularly due to vocal anatomy (see, e.g., Figure 5), this approach has been met with a number of criticisms. We briefly review some of the most important criticisms (for extensive discussion, see Johnson, 1997, 2005). Then we discuss an aspect of these criticisms that has received comparatively little attention: the relationship between normalization and learning.

One criticism of normalization accounts is that instance-specific details of perceived speech are retained in long-term memory and influence subsequent speech processing (Bradlow, Nygaard, & Pisoni, 1999; Goldinger, 1996; Palmeri, Goldinger, & Pisoni, 1993; Schacter & Church, 1992), which indicates that acoustic variability is not “filtered out” during the early stages of processing. These findings spurred a tremendous body of research into speech perception. According to episodic (Goldinger, 1996, 1998) and exemplar-based approaches (Johnson, 1997; Pierrehumbert, 2002; Pisoni, 1997), detailed representations of speech episodes play a central role in how listeners cope with talker variability.

Two related criticisms of normalization accounts come from cross-linguistic studies. First, the exact difference between adult male and adult female vowel formants varies across languages and cannot be reduced to differences in vocal anatomy (Johnson, 2006; see also Bladon et al., 1984; Johnson, 2005). This suggests that cultural factors such as gender-norms contribute to patterns of variation in speech, above and beyond biologically-determined variation, such as sex-based vocal tract differences after puberty. As further evidence of this point, boys and girls in some cultures show adult-like differences in speech production (e.g., boys producing lower formants) long before laryngeal descent during puberty, and thus before biological factors would explain the difference (Johnson, 2005; Lee et al., 1999; Perry et al., 2001). Second, normalization procedures that are effective in reducing inter-talker variability when applied to data from one language are not necessarily equally effective when applied to corresponding data from another language (see, e.g., Disner, 1980).

These cross-linguistic findings raise questions about how normalization processes come into existence. A priori, there are at least three logically possible scenarios: (a) normalization involves a genetically-determined invariant mapping from genetically-determined cues to categories; (b) normalization involves a variable mapping function from genetically-determined cues to categories, with the mapping function being learned through exposure; (c) normalization is simply the use of an invariant mapping function to relate cues to categories, but both the cues and the nature of the mapping function are learned from exposure (e.g., learning that F0 and F3 are related to the talker’s vocal anatomy/physiology and hence can help normalize source-related variability; and further learning how these cues vary due to cultural factors in the listeners’ target language). The first of these scenarios seems unlikely in light of the cross-linguistic evidence cited above. The other two scenarios involve some degree of learning, which is typically not discussed in the normalization literature and is sometimes taken to be incompatible with accounts of automatic low-level processes. However, there is increasing evidence that even some of the lowest level cellular mechanisms in the human perceptual system appear to learn and adapt (Brenner, Bialek, & de Ruyter van Steveninck, 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001). Taken together, these criticisms suggest that learning processes (e.g., learning of language-specific or talker-specific variation) play an important role in how the speech perception system copes with variability and how listeners are able to generalize knowledge of the sound structure of their language across talkers and utterances. Further, these criticisms suggest that there might be no clear division between perception and learning. We turn to the issue of learning in the next section.

4. How Do Listeners Make Use of the Statistical Distribution of Variability?

A prominent line of recent research on talker variability and perceptual constancy capitalizes on the fact that variability in speech is the rule, rather than the exception, by adopting a view of human perception that is dynamic, adaptive, and context-sensitive (see Bradlow & Bent, 2008; Clayards, Tanenhaus, Aslin, & Jacobs, 2008; Eisner & McQueen, 2005; Kraljic & Samuel, 2006; Maye, Werker, & Gerken, 2002; Pisoni, 1997; Pisoni & Levi, 2007). Indeed, this is increasingly how cognitive scientists see all of the brain, even low-level perceptual areas (Gutnisky & Dragoi, 2008; Sharpee et al., 2006; Stocker & Simoncelli, 2006). Instead of searching for inherently invariant properties of speech, this approach seeks to understand how the systems involved in speech perception track, learn, and respond to patterns of variation in the environment (for discussion, see Elman & McClelland, 1986; Kleinschmidt & Jaeger, 2015b; Samuel & Kraljic, 2009). This approach is based on the fundamental belief that the distribution of variability associated with speech categories—and the fact that different talkers can have different distributions (see Figure 1)—is highly informative (see also Heald & Nusbaum, 2015). As Pisoni (1997, p. 10) explains, “stimulus variability is, in fact, a lawful and highly informative source of information for the perceptual process; it is not simply a source of noise that masks or degrades the idealized symbolic representation of speech in human long-term memory.”

A similar point was noted by Liberman and Mattingly (1985): “systematic stimulus variation is not an obstacle to be circumvented or overcome in some arbitrary way; it is, rather, a source of information about articulation that provides important guidance to the perceptual process” (pp. 14–15, emphasis added). For Liberman and Mattingly, who were proponents of motor theory, the primary focus was on the types of information provided by phonological context: e.g., in the case of coarticulation, systematic variation in formant transitions between a stop consonant and vowel provide information about consonant place of articulation. The research discussed below extends beyond sources of information provided by phonological context to include any source of systematic variation in speech: e.g., a talker’s age, sex, gender, accent, speaking rate, or idiosyncratic speech patterns.

We begin by discussing evidence that speech perception is guided by listeners’ knowledge of how variability is distributed in the world (e.g., how patterns of pronunciation variation are distributed across talkers and social groups). We then discuss research concerned with the learning mechanisms that enable listeners to achieve this sensitivity to the distributional aspects of speech.

4.1 Talker Perception and Speech Processing

Sociolinguistic research over the last several decades has shown that listeners have rich and structured knowledge about the distribution of variability across groups of talkers (see, e.g., Campbell-Kibler, 2007; Clopper & Pisoni, 2004b, 2007; Labov, 1966; Preston, 1989). Listeners use this social knowledge to help generalize knowledge of the sound structure of their language across talkers (see Foulkes & Hay, 2015, for a recent overview). This line of research has demonstrated that speech perception can be influenced by expectations about the talker’s dialect background (Hay, Nolan, & Drager, 2006; Niedzielski, 1999), age (Drager, 2011; Hay, Warren, & Drager, 2006; Koops, Gentry, & Pantos, 2008; see also Walker & Hay, 2011), socio-economic status (Hay, Warren, & Drager, 2006), and ethnicity (Staum Casasanto, 2008) in cases where these social attributes covary statistically with patterns of pronunciation variation in the target language.

For example, Hay and colleagues found that unprompted expectations about an unfamiliar talker (based on visually cued social attributes of the talker) influenced perception of vowel variation in New Zealand English (Hay, Warren, & Drager, 2006). In New Zealand English, the diphthongs /iə/ and /eə/ (as in the words near and square, respectively) are in the process of merging. This change-in-progress is most advanced among younger speakers and members of lower socio-economic groups, whereas older and more affluent speakers tend to maintain the vowel contrast. In one study, Hay, Warren, and Drager (2006) presented listeners with minimal pairs like beer and bare produced by New Zealand talkers who maintained the vowel distinction. Photographs were used to manipulate the perceived age and socio-economic status of the talkers. Results of a two-alternative forced-choice identification task (e.g., Did the talker say the word beer or bare?) showed that identification accuracy was worse when the talker appeared to be younger or from a lower socio-economic group than when the talker appeared to be older or more affluent (see also Drager, 2011). That is, when the talker appeared to belong to a social group with merged vowels, the target stimuli tended to be treated as homophonous, creating uncertainty about the intended word and resulting in a higher rate of identification “errors.” Crucially, since the speech stimuli were identical across conditions, the difference in identification accuracy can only stem from listeners’ expectations based on the visually cued attributes of the talker.

Relatedly, Niedzielski (1999) found that simply informing listeners about a talker’s ostensible regional background led to differences in how the same physical vowel stimulus was perceived (see also Hay & Drager, 2010). In this study, listeners from Detroit, Michigan, heard target words containing a raised variant of the diphthong /aw/ (e.g., about pronounced more like “a boat”), a phenomenon known as Canadian raising. The listeners’ task was to identify the best match between the vowel in the stimulus word and one of six synthesized vowel tokens, which ranged from a standard-sounding realization of /aw/ to a raised vowel variant. When told the speaker was from Canada, rather than Detroit, listeners were more likely to match the target vowel to one of the raised variants on the synthesized continuum, reflecting the fact that Detroit residents attribute Canadian raising to the speech of Canadians and are virtually unaware of this feature in their own speech.

Sensitivity to the covariance between social factors and the realization of speech categories does not stop at the level of social group membership. Listeners have also been found to be sensitive to talker-specific patterns of variation (Creel, 2014; Goldinger, 1996; Kraljic, Brennan, & Samuel, 2008; Kraljic & Samuel, 2006, 2007; for relevant discussion, see Creel & Bregman, 2011). Using an exposure-test paradigm, Nygaard, Sommers, and Pisoni (1994) found that listeners were better able to recognize new words produced by familiar talkers than words produced by unfamiliar talkers, as indicated by identification accuracy at test for words in noise (see also Nygaard & Pisoni, 1998). The fact that the benefits of exposure generalize to new words from the familiar talkers indicates that listeners learn and use knowledge of talker-specific pronunciation patterns to guide processing of new tokens from those talkers. Trude and Brown-Schmidt (2012) found that when listening to multiple familiar talkers who produce different variants of the same speech category, providing listeners with talker-indexical cues on each trial (e.g., a picture of the talker or a snippet of speech that did not contain the target speech category) facilitated the use of knowledge of talker-specific pronunciation variation.

Taken together, the findings discussed above indicate that listeners’ knowledge about the distribution of variability within and across talkers and social groups provides a rich backdrop against which to evaluate speech. These findings have been successfully accommodated in episodic and exemplar-based models of speech perception (Goldinger, 1998, 2007; Johnson, 1997; Pisoni & Levi, 2007; Sumner, Kim, King, & McGowan, 2014), as well as certain Bayesian approaches (e.g., the ideal adapter; Kleinschmidt & Jaeger, 2015a, b). One of the central tenets of these approaches is that listeners draw on their experience with category exemplars to learn how linguistic variability (e.g., the realization of the near and square vowels in New Zealand English) covaries with social factors. Listeners then leverage this knowledge to predict the likelihood with which certain speech cues map to higher-level linguistic categories. On this view, listeners are expected to be less certain about the cue-to-category mapping when social factors suggest that the talker is likely to have the near-square merger, resulting in more categorization errors, which is what Hay, Warren, and Drager (2006) found.

4.2 Adaptation and Perceptual/Distributional Learning

In order for talker-specific or group-based information to be useful in speech perception, listeners must first learn the patterns of variation that are associated with particular talkers or groups of talkers. A large body of research—much of it in recent years—has investigated the mechanisms that track and respond to patterns of variation in speech input (see Aslin & Newport, 2012; Samuel & Kraljic, 2009). The conceptual foundations of this research can be traced in part to seminal work on perceptual learning by James and Eleanor Gibson in the 1950s and ’60s (Gibson, 1969; Gibson & Gibson, 1955). Gibson (1969, p. 3) defined perceptual learning as “an increase in the ability to extract information from the environment, as a result of experience and practice with stimulation coming from it.” The premise of this view is that perception is fundamentally shaped by the perceiver’s existing knowledge and past experiences in such a way as to facilitate processing of the input, rather than being an objective translation of the physical world into units of perception. The appeal of this view for theories of speech perception is that speech categories (e.g., /b/ vs. /d/, /u/ vs. /ʊ/) need not be distinguished by a fixed set of acoustic, articulatory or relational invariants. Rather, through experience with specific talkers or groups of talkers, listeners can learn the cue dimensions and distributions of cue values that are relevant for distinguishing speech categories produced by those talkers (Clayards et al., 2008; Idemaru & Holt, 2011; Kleinschmidt & Jaeger, 2015b; Liu & Holt, 2015; Maye et al., 2002; Theodore & Miller, 2010). In other words, the speech perception system adapts. For the present discussion, we use the term adaptation to refer to the outcome of a learning mechanism (see Goldstone [1998] for an ontology of perceptual learning mechanisms).3

One of the classic demonstrations of perceptual learning for speech is that listeners dynamically recalibrate phonetic category boundaries in response to variation in the speech input (Bertelson, Vroomen, & de Gelder, 2003; Norris, McQueen, & Cutler, 2003). For example, when listeners encounter a talker whose realization of /s/ is acoustically ambiguous between [s] and [f], listeners adjust their category boundary to perceive the otherwise ambiguous stimulus as /s/ (i.e., as an instance of the category intended by the talker). This phonetic recalibration effect can be driven by lexical knowledge, such as hearing the ambiguous sound in a disambiguating lexical context: e.g., hearing “platypu[?sf]” for platypus, an /s/-final word with no /f/-final counterpart (Kraljic & Samuel, 2005; McQueen, Cutler, & Norris, 2006; Norris et al., 2003). Phonetic recalibration can also be driven by visual information: e.g., hearing a sound that is acoustically ambiguous between [b] and [d], but seeing the talker produce the labial closure for [b] (Bertelson et al., 2003; see Vroomen & Baart, 2012, for a recent review), and by statistical knowledge about contingencies among acoustic-phonetic cues (Idemaru & Holt, 2011).
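The boundary shift described above can be illustrated with a deliberately simple toy model. It is not the model used in any of the cited studies: it assumes a single hypothetical cue continuum (0 for a clear [s], 1 for a clear [f]), two categories with equal variance and equal prior probability (so that the boundary sits midway between the category means), and a running-average update of the /s/ mean when lexical context labels ambiguous tokens as /s/.

```python
from typing import List


def boundary(mean_s: float, mean_f: float) -> float:
    """Category boundary for two equal-variance, equally likely categories."""
    return (mean_s + mean_f) / 2.0


def categorize(cue: float, bound: float) -> str:
    """Label a cue value by which side of the boundary it falls on."""
    return "s" if cue < bound else "f"


def update_mean(prior_mean: float, prior_count: int, observations: List[float]) -> float:
    """Running average that treats prior experience as prior_count pseudo-tokens."""
    total = prior_mean * prior_count + sum(observations)
    return total / (prior_count + len(observations))


mean_s, mean_f = 0.0, 1.0   # hypothetical starting category means
ambiguous = 0.55            # hypothetical ambiguous token, slightly [f]-like

before = boundary(mean_s, mean_f)
label_before = categorize(ambiguous, before)

# Exposure: 20 ambiguous tokens heard in /s/-final words such as "platypus".
mean_s_after = update_mean(mean_s, prior_count=20, observations=[ambiguous] * 20)
after = boundary(mean_s_after, mean_f)
label_after = categorize(ambiguous, after)

print(label_before, label_after)
```

In this toy setting the same token changes category after exposure because the /s/ mean, and therefore the boundary, has moved toward the ambiguous region.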

A central question in this line of research concerns the conditions under which pattern abstraction is talker-specific versus talker-independent (see Bradlow & Bent, 2008; Kraljic & Samuel, 2007; Reinisch & Holt, 2014). Both can be beneficial. For example, when a property of speech is idiosyncratic to a talker, an ideal adapter should learn talker-specific expectations, adapting expectations for only that talker. However, when patterns of variation occur across talkers (e.g., dialect or accent variation), an ideal adapter should learn talker-independent but group-specific expectations in order to generalize learning to new talkers with the same dialect or accent (Kleinschmidt & Jaeger, 2015b). There is some evidence that human listeners behave in ways that are qualitatively and quantitatively similar to ideal adapters (Bejjanki, Clayards, Knill, & Aslin, 2011; Clayards et al., 2008; Kleinschmidt & Jaeger, 2011, 2012, 2015a; Kleinschmidt, Raizada, & Jaeger, 2015). For example, exposure to multiple talkers with the same accent or dialect facilitates cross-talker generalization by helping listeners distinguish talker-independent patterns of variation from inter-talker variability in the realization of those patterns. This effect of exposure conditions on learning outcomes has been observed for a range of perceptual learning phenomena: adapting to foreign-accented speech (Bradlow & Bent, 2008; Gass & Varonis, 1984; Sidaras, Alexander, & Nygaard, 2009); learning new perceptual categories, such as Japanese learners of English acquiring the /l/ − /r/ contrast (Lively, Logan, & Pisoni, 1993; Logan, Lively, & Pisoni, 1991); and learning to classify talkers by regional dialect (Clopper & Pisoni, 2004a).

Some results suggest that multi-talker exposure is a necessary pre-condition for talker-independent adaptation (Bradlow & Bent, 2008; Lively et al., 1993). For example, Bradlow and Bent (2008) found that listeners who were familiarized with five Mandarin-accented English talkers were subsequently able to generalize learning to a novel talker with this accent, indicating talker-independent adaptation. However, when listeners were initially familiarized with a single Mandarin-accented English talker, adaptation was talker-specific: i.e., listeners were subsequently better able to understand the trained talker’s speech in noise, but accent adaptation did not generalize across talkers. By contrast, several studies concerned with phonetic recalibration have found cross-talker generalization of perceptual learning following exposure to a single talker (Eisner & McQueen, 2005; Kraljic & Samuel, 2006, 2007; see also Weatherholtz, 2015). The likelihood of cross-talker generalization following exposure to a single talker seems to depend largely on the acoustic similarity between the familiar and new talkers (Eisner & McQueen, 2005; Kraljic & Samuel, 2007; Reinisch & Holt, 2014). Thus, when listeners do not have evidence that a particular pattern of variation is systematic across talkers (as in the case of single-talker exposure), listeners appear to adapt talker-specifically and only generalize learning to acoustically similar tokens produced by other talkers (i.e., generalization based on stimulus similarity). But when listeners have evidence of variation that is systematic across talkers (as in the case of multi-talker exposure), listeners adapt by learning talker-independent patterns of variation (see Kleinschmidt & Jaeger, 2015b; Weatherholtz, 2015, for additional discussion).

One of the current goals of research in this domain is to provide a formal account of plasticity in speech perception that accounts for adaptation and generalization. One approach, the ideal adapter, is a computational-level framework for understanding the processes involved in reliably mapping acoustic cues to speech categories across talkers (Kleinschmidt & Jaeger, 2015b). According to the ideal adapter framework, speech perception is a process of inference under uncertainty: listeners infer the category that the talker intended to produce based on the observed acoustic cue values and relevant prior knowledge. Critically, prior knowledge is assumed to also comprise distributional statistics that capture how variability in the acoustic realization of speech sounds covaries with indexical information:4 e.g., how the distribution of acoustic cue values associated with /s/ vs. /ʃ/ or with the vowels in near vs. square vary across talkers and social groups. By drawing on implicit knowledge of talker-specific and group-specific distributional statistics, listeners are able to infer whether an observed constellation of acoustic cue values map to /s/ or /ʃ/, for example, based on indexical information about the talker (see Figure 1 for a schematic example of this logic). When listeners lack relevant (or robust) talker-specific knowledge—as when encountering a novel talker—they can generalize based on prior experience with similar talkers (e.g., talkers with the same or similar accent). The ideal adapter model further predicts that listeners continuously update their beliefs about talker-specific and group-specific distributions as they experience new input from familiar and new talkers. Thus, category inferences are predicted to change as the short-term and long-term statistics of the environment change (Kleinschmidt & Jaeger, 2015b).
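The inference described above can be sketched for the /s/ vs. /ʃ/ case as follows. This is a minimal illustration of the general logic, not the published ideal adapter model: it assumes a single cue (spectral center of gravity in Hz), Gaussian cue distributions for each category within each talker group, equal category priors, and hypothetical means and standard deviations chosen only so that one group's /ʃ/ overlaps the other group's /s/. The label "sh" stands for /ʃ/.

```python
import math
from typing import Dict

# Hypothetical group-specific cue distributions: (mean Hz, standard deviation Hz).
CUE_MODELS = {
    "male": {"s": (5500.0, 500.0), "sh": (3800.0, 500.0)},
    "female": {"s": (7000.0, 500.0), "sh": (5200.0, 500.0)},
}


def gaussian_pdf(x: float, mu: float, sd: float) -> float:
    """Density of a normal distribution with mean mu and standard deviation sd."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def category_posterior(cue_hz: float, group_beliefs: Dict[str, float]) -> Dict[str, float]:
    """P(category | cue), marginalizing over beliefs about the talker's group.

    group_beliefs maps each group to the listener's probability that the
    talker belongs to it (e.g., based on indexical information).
    """
    categories = ("s", "sh")
    category_prior = {c: 0.5 for c in categories}  # assumed equal priors
    joint = {c: 0.0 for c in categories}
    for group, p_group in group_beliefs.items():
        for c in categories:
            mu, sd = CUE_MODELS[group][c]
            joint[c] += p_group * category_prior[c] * gaussian_pdf(cue_hz, mu, sd)
    total = sum(joint.values())
    return {c: joint[c] / total for c in categories}


token = 5400.0  # the same physical cue value in every call below
as_male = category_posterior(token, {"male": 1.0, "female": 0.0})
as_female = category_posterior(token, {"male": 0.0, "female": 1.0})
unsure = category_posterior(token, {"male": 0.5, "female": 0.5})
print(as_male, as_female, unsure)
```

With these hypothetical values, the same token is most likely /s/ if the talker is believed to be male and most likely /ʃ/ if the talker is believed to be female, and the listener is close to chance when the group is unknown.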

A closely related view, known as computing cues relative to expectations (C-CuRE), focuses on the cues themselves—rather than the cue-to-category mapping process—and suggests that the perceptual encoding of acoustic cues is a dynamic and talker-contingent process (Cole, Linebaugh, Munson, & McMurray, 2010; McMurray & Jongman, 2011; see also Kleinschmidt & Jaeger, 2015a). As listeners identify other category-relevant sources of variability (e.g., identifying the talker or the talker’s social group membership), the acoustic cue values are recoded in terms of their difference from expected values. C-CuRE can be seen as a specific algorithm (potentially one of several) by which the computational-level ideal adapter framework is implemented. Consider the minimal pair beach-peach. Fundamental frequency (F0) at vowel onset is a secondary cue to the voiced-voiceless (e.g., /b/ − /p/) contrast in English, with relatively higher F0 for voiceless segments. Since F0 varies systematically as a function of vocal tract length, an F0 value that is high for a male talker might be low for a female talker. Thus, knowing the raw F0 value at vowel onset is not particularly informative about whether the preceding segment is voiced or voiceless. If listeners have independently identified the talker’s gender (or the identity of the talker), raw F0 can be recoded in terms of its difference from the expected gender-specific (or talker-specific) mean F0. C-CuRE is thus also related to the normalization accounts discussed above: recoding acoustic values in terms of their difference from expected values emphasizes the spectrally contrastive nature of speech perception (see, e.g., Holt, 2005; Holt & Lotto, 2002; Huang & Holt, 2009), and computing the difference relative to expected talker means partials out (normalizes) talker variability.
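The recoding step in the beach-peach example can be sketched as follows. This is a minimal illustration, assuming that the expectation is simply a group-specific mean F0 and that recoding is a plain subtraction. The mean values and function name are hypothetical and are not taken from the cited studies.

```python
# Hypothetical expected mean F0 (Hz) for each talker group.
EXPECTED_F0_HZ = {"male": 120.0, "female": 210.0}


def recode_relative_to_expectation(raw_f0_hz: float, group: str) -> float:
    """Return F0 as a signed difference from the expected value for the group.

    Positive values mean higher than expected for this kind of talker,
    negative values mean lower than expected.
    """
    return raw_f0_hz - EXPECTED_F0_HZ[group]


raw_f0 = 180.0  # the same raw F0 at vowel onset in both cases
relative_if_male = recode_relative_to_expectation(raw_f0, "male")
relative_if_female = recode_relative_to_expectation(raw_f0, "female")
print(relative_if_male, relative_if_female)
```

The same raw value is higher than expected for one group and lower than expected for the other, so only the recoded value bears on whether the preceding segment is more likely voiceless or voiced.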

5. Open Questions and Directions for Future Research

Research in speech perception has come a long way in understanding how variable speech input is mapped to linguistic categories in memory: from simply assuming invariance to modeling a highly complex, layered system that draws on statistical information in the speech signal. We have discussed several aspects of the speech perception system that enable listeners to cope with talker variability: sensitivity to articulatory gestures (and recovery of articulatory information from the speech signal), normalization, memory for episodic detail, and perceptual and distributional learning mechanisms that are sensitive to patterns of variation in speech. At least some of these mechanisms are automatic and appear to operate during the early stages of processing (i.e., pre-categorical, pre-speech mechanisms). It is an open question to what extent these early processes involve learning: for example, learning early in development that F0 and F3 are correlated with vocal tract length and, hence, learning that these cue dimensions can be used to normalize variability resulting from individual differences in vocal anatomy. Similarly, it is an open question as to how flexible low-level processes are. While there is evidence that low-level auditory processes engage in distributional learning (for discussion, see Kleinschmidt & Jaeger, 2015a), it is not yet known whether the types of distributional learning involved in adapting to talker and accent variability are neurally coded early (for discussion, see Goslin, Duffy, & Floccia, 2012) and further whether sensitivity to covariation between social factors and speech cues is partly due to low-level processes.

The debate between normalization accounts and episodic/exemplar-based accounts of speech perception warrants further discussion. The fact that episodic details of speech stimuli are retained in memory and affect speech perception is one of the primary challenges to normalization accounts (Johnson, 2005). However, it is important to note that, in principle, abstracting away from variance is orthogonal to whether fine acoustic details are retained in memory. That is, normalization accounts that aim to identify relational invariants (e.g., vowel formant ratios that are stable across talkers) do not, in principle, require fine-grained stimulus details to be “filtered out,” forgotten or otherwise inconsequential for speech perception. Thus, the fact that speech perception is sensitive to episodic detail indicates that normalization accounts, as typically formulated, are insufficient, but does not rule out normalization altogether. Likewise, episodic and exemplar-based theories (e.g., Goldinger, 1996; Johnson, 1997) do not provide a straightforward account for some of the strongest evidence of normalization—i.e., that speech sounds are interpreted relative to frequency information in the surrounding context, even when this “context” is non-speech sine-wave tones (Laing et al., 2012). Thus, like normalization accounts, episodic models alone are insufficient to explain how the speech perception system copes with variability, despite evidence that episodic information plays an important role in recognition and categorization processes (Bradlow et al., 1999; Palmeri et al., 1993).

We take the integration of these sometimes conflicting (though not necessarily incompatible) views to be one of the big open questions in research on speech perception. Similar open questions pertain to the relation between abstractionist prototype accounts of speech perception (such as parametric Bayesian accounts) and non-prototype accounts (such as exemplar-based accounts). It is now increasingly assumed that the representations that subserve speech perception likely involve both storage of specific exemplars and abstractions over these exemplars (Goldinger, 2007; Kleinschmidt & Jaeger, 2015b; McQueen et al., 2006; Pierrehumbert, 2001). Integrating these views will be critical in understanding how low-level pre-speech and higher-level speech processes jointly achieve relative invariance, that is, the ability to robustly recognize speech categories across talkers.

Acknowledgments

This work was partially funded by NICHD grant R01 HD075797 to T. Florian Jaeger and by a Graduate Enrichment Fellowship from The Ohio State University to Kodi Weatherholtz. We are grateful to Arthur Samuel, Sheila Blumstein, Beth Hume, Jessamyn Schertz, and two anonymous reviewers for comments on an early version of this article. All errors or oversights are our own. The views expressed here are those of the authors and not necessarily those of the funding agencies.

Koops, C., Gentry, E., & Pantos, A. (2008). The effect of perceived speaker age on the perception of PIN and PEN vowels in Houston, Texas. University of Pennsylvania Working Papers in Linguistics: Selected Papers from NWAV 36, 12, 91–101.

Notes:

(1.)
As we discuss below, biological changes are not sufficient to explain cross-linguistic variance in the formant structure between male and female talkers (Johnson, 2006).

(2.)
The general ratio-based proposal traces to the work of Richard John Lloyd in the late 1800s (Lloyd, 1890a, b, 1891, 1892). To quote from Lloyd’s doctoral thesis: “the great implied postulate of the organic system of phonetics . . . [is that] like articulations produce like sounds . . . For if half-a-dozen human beings, of identical type but widely differing size, all articulate a given vowel exactly in a given way, it is then clear that mathematically speaking, these six examples of the configuration of that vowel will be a series of similar figures. Now if this be true, whether vowel resonance be single or double or even more complex, it is certain that the pitch of that resonance or body of resonances will vary exactly in proportion to the relative size of the configuration from which it proceeds” (Lloyd, 1890a, p. 172, emphasis added). Here, Lloyd outlines the earliest account of normalization in which vowel formants are scaled by the talker’s fundamental frequency (F0), which is the acoustic correlate of perceived pitch.
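
The proportional-scaling idea in this note can be illustrated with a minimal sketch. It assumes an idealized case in which every frequency produced by a talker (F0 and all formants) is multiplied by one common factor. The function name `scale_by_f0`, the factor `k`, and all frequency values are hypothetical and chosen only for illustration; they are not measurements from any study cited here. Under that assumption, formants expressed relative to F0 come out the same for both talkers.

```python
def scale_by_f0(formants_hz, f0_hz):
    """Express each formant as a ratio to the talker's fundamental frequency."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    return [f / f0_hz for f in formants_hz]


# Hypothetical talker A: illustrative values only, not real measurements.
talker_a = {"f0": 120.0, "formants": [300.0, 2300.0]}

# Hypothetical talker B: every frequency is talker A's multiplied by k.
# Uniform scaling is an idealizing assumption, not an empirical claim.
k = 1.2
talker_b = {
    "f0": talker_a["f0"] * k,
    "formants": [f * k for f in talker_a["formants"]],
}

ratios_a = scale_by_f0(talker_a["formants"], talker_a["f0"])
ratios_b = scale_by_f0(talker_b["formants"], talker_b["f0"])

# The raw formants differ between talkers, but the F0-relative values match.
for a, b in zip(ratios_a, ratios_b):
    print(round(a, 3), round(b, 3))
```

The sketch shows only why a ratio-based representation is invariant when talkers differ by a single scale factor. As Note 1 indicates, actual cross-talker differences are not fully explained by such scaling, so this should not be read as a complete normalization procedure.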

(3.)
The terms perceptual learning and adaptation have been variously defined in speech perception research (and more generally in the field of cognitive psychology). Sometimes perceptual learning refers to long-lasting changes in how the perceptual system processes incoming stimulus information, while adaptation is taken to refer to relatively short-term adjustments (see Goldstone, 1998), based on bottom-up information (Eisner, 2012; but see Kleinschmidt & Jaeger, 2015a). In yet other cases, adaptation is considered the behavioral outcome of any type of learning mechanism that tracks and responds to properties of the environment (see Kleinschmidt & Jaeger, 2015b; see also Bradlow & Bent, 2008; Fine, Jaeger, Farmer, & Qian, 2013; Maye, Aslin, & Tanenhaus, 2008).