Purpose: A feedback perturbation paradigm was used to investigate whether prosodic cues are controlled independently or in an integrated fashion during sentence production.

Method: Twenty-one healthy speakers of American English were asked to produce sentences with emphatic stress while receiving real-time auditory feedback of their productions. The fundamental frequency (F0) of the stressed word in each four-word sentence was selectively shifted in a sensorimotor adaptation protocol. Speakers experienced either an upward or a downward shift of the stressed word, which gradually altered the perceived stress of the sentence.

Results: Participants in the Up and Down groups adapted to F0 shifts by altering the contrast between stressed and unstressed words differentially, such that the two groups deviated from each other in the perturbation phase. Furthermore, selective F0 perturbation in sentences with emphatic stress resulted in compensatory changes in both F0 and intensity.

Conclusions: Present findings suggest that F0 and intensity are controlled in an integrated fashion to maintain the contrast between stressed and unstressed words. When one cue is impaired through perturbation, speakers not only oppose the perturbation but enhance other prosodic cues to achieve emphatic stress.
Keywords: prosody, fundamental frequency, auditory feedback, perturbation, sensorimotor adaptation

Prosodic adaptations to pitch perturbation in running speech

Despite the importance of prosody in conveying numerous linguistic and attitudinal contrasts, models of speech production largely focus on segmental and not prosodic control (Guenther, Ghosh, & Tourville, 2006; Saltzman & Munhall, 1989). One such model of speech acquisition and production is known as DIVA (Directions Into Velocities of Articulators; Guenther, 1994, 1995; Guenther et al., 2006). DIVA is a biologically plausible adaptive neural network in which acoustic feedback is used to acquire sensory and motor targets for speech sounds. Currently, DIVA lacks a representation of prosodic control, limiting its scope as a comprehensive model of spoken communication. Furthermore, modeling prosody may lead to improved assessment and intervention of neuromotor speech disorders that are characterized by prosodic deficits (Darley, Aronson, & Brown, 1969, 1975; Duffy, 2005).

The current study is designed to extend the DIVA model to include the control of speech prosody. Minimally, this requires representations of the acoustic cues associated with prosody: fundamental frequency (F0), intensity, and syllable duration, perceived by listeners as pitch, loudness, and length, respectively (Bolinger, 1989; Lehiste, 1970, 1976; Shattuck-Hufnagel & Turk, 1996). It is unclear, however, whether these cues should be represented in an independent or integrated fashion. An Independent Channel Model would posit that F0, intensity, and duration are controlled separately, while in an Integrated Model, two or more acoustic cues would be jointly controlled. The current study aims to distinguish between these opposing models as a first step toward representing the complex phenomenon of prosody.

Such cross-speaker cue trading is consistent with both an Integrated Model and an Independent Channel Model of prosodic feedback control. The two models can be differentiated by examining the effects of auditory perturbations during speech production. Perturbation paradigms show the importance of auditory feedback for online vocal control during speaking tasks. Numerous studies have investigated gradual or sudden perturbations to F0 (Burnett, Freedland, Larson, & Hain, 1998; Chen, Liu, Xu, & Larson, 2007; Jones & Munhall, 2002, 2005; Larson, Burnett, Kiran, & Hain, 2000; Xu, Larson, Bauer, & Hain, 2004), as well as to intensity (Bauer, Mittal, Larson, & Hain, 2006; Chang-Yit, Pick, & Siegel, 1975; Heinks-Maldonado & Houde, 2005) and to vowel formant frequencies (Houde & Jordan, 1998; Tourville, Reilly, & Guenther, 2008; Villacorta, Perkell, & Guenther, 2007). A consistent finding in perturbation studies is a compensatory response: speakers alter their production of the perturbed feature in the direction opposite to the perturbation. This opposing response is noted both for adaptation paradigms and for paradigms that use brief, unexpected perturbations to auditory feedback. Adaptation paradigms involve persistent exposure to the same perturbation, allowing subjects to adapt their feedforward commands (“adaptation”) such that they continue to respond to the perturbation even after it has been removed. In contrast, unexpected perturbation studies use one or more brief, unpredictable perturbations to elicit a compensatory response within a given trial (“rapid compensation”).

Most F0 perturbation studies have examined rapid compensations during sustained vowel phonation rather than in linguistic contexts (Burnett et al., 1998; Larson et al., 2000; Xu et al., 2004). While recent work has examined linguistically-relevant perturbations to tones and tone sequences in Mandarin (Jones & Munhall, 2002, 2005; Xu et al., 2004), meaningful prosodic contrasts remain largely unexplored in English. A notable exception is the work of Chen et al. (2007) which examined brief, unexpected upward and downward F0 perturbations as speakers produced the question “you know Nina?” The authors note that upward perturbations, which were not at odds with the rising intonation contour of the target question, resulted in a smaller compensatory response than downward perturbations. Although the perturbation had linguistic relevance, the use of an imitation paradigm may have influenced speaker responses. Further work on eliciting a range of prosodic contrasts in linguistically-motivated communicative contexts is warranted. Additionally, speakers tend to use multiple acoustic cues to signal prosodic contrasts, yet compensatory responses have only been examined within the perturbed parameter, e.g., measuring compensations in F0 for pitch-shifted feedback.

The present study extends the F0 auditory perturbation literature in two main directions. First, meaningful prosodic contrasts in English are elicited by providing contextual scenarios that cue the location of stress within each utterance. Thus, during perturbed trials, speakers must compensate for F0 shifts of the stressed word to preserve the intended prosodic contrast. This linguistically-motivated task may better resemble auditory feedback control during running speech. Second, compensatory responses to F0 perturbation are examined across multiple cues. In light of cue trading relations, changes in intensity and duration may also contribute to the compensatory response, which would be consistent with the Integrated Model. Alternatively, compensatory responses limited to F0 alone would be evidence for an Independent Channel Model.

In summary, the present study aimed to investigate the prosodic cues used to convey emphatic stress under conditions of near real-time pitch perturbation. Specifically, the following research questions were addressed:

1. Do speakers adapt to targeted F0 perturbations of stressed words within an utterance?

2. Does this adaptation response occur in other features besides F0 (e.g. intensity, duration)?
Method

Participants

Twenty-five monolingual speakers of American English between the ages of 20 and 28 (12 M, 13 F; mean age = 22.0 years), with normal hearing and no known speech, language, or neurological disorders, were recruited. Participants were randomly assigned to either the upward shift (Up, hereafter) protocol (6 M, 6 F; mean age = 22.2 years) or the downward shift (Down, hereafter) protocol (6 M, 7 F; mean age = 21.9 years). All participants passed a hearing screening with thresholds at or below 25 dB in at least one ear for 250, 500, 1000, 2000, 4000, and 8000 Hz tones, and reported having vision within correctable limits.

Procedures

Participants were seated in a sound-treated booth and wore a head-mounted cardioid microphone (AKG C420) and over-the-ear headphones (AKG K240), which were used to record productions and present auditory feedback, respectively. The microphone-to-mouth distance was held constant at one inch away from the left-hand corner of the mouth. A customized graphical interface presented stimuli that participants read aloud. Four sentences were used, each consisting of four monosyllabic words. To control for vowel-dependent differences in F0, vowel nuclei were kept relatively constant across the sentence (Lehiste & Peterson, 1961; Peterson & Barney, 1952). In each trial, participants produced the four-word sentence with stress on either the first or the second word. The stressed word was cued visually (i.e. using a capitalized, red font) and by providing a contextual scenario. For example, the context sentence “Who caught a dog?” would prompt the target sentence “BOB caught a dog” on the screen. Conversely, “What did Bob do to a dog?” prompted the sentence “Bob CAUGHT a dog.” (The remaining three sentences were Dick bit a kid, Doug cut a bud, and Dad pat a cat.) Participants were instructed to produce emphatic stress such that a naive listener could identify the intended stress location.

Given that stressed words tend to have a higher F0 than unstressed words (e.g., Cooper et al., 1985; Eady & Cooper, 1986; Morton & Jassem, 1965; O’Shaughnessy, 1979), participant-specific F0 thresholds allowed for selective F0 perturbation of stressed words alone. A brief pre-test consisting of 16 practice sentences, identical to the experimental stimuli, was used to determine the perturbation threshold for each participant. The threshold was operationally defined as the F0 value that optimally separated stressed words from unstressed words across all 16 trials. The experimenter visually determined the lowest F0 value that exceeded all unstressed F0 values. F0 values below the threshold were never perturbed.

In the experimental protocol, each participant produced a total of 480 sentences across four phases: a baseline phase with no perturbation; a ramp phase during which the perturbation was applied to the auditory feedback in increments; a perturbation phase involving full feedback perturbation on the stressed word; and a post phase with no perturbation. In the ramp and perturbation phases, F0 of the stressed word was scaled in proportion to the amount it exceeded the threshold. The formulae used to calculate the scaling factors that transformed input F0 to output F0 were:

Up: pitchscale = 1 + ((F0/threshold - 1) * pertval);

Down: pitchscale = 1 - ((F0/threshold - 1) * pertval);

The coefficient pertval was set to 0 during the baseline phase, gradually increased to .5 during the ramp phase, held constant at .5 during the perturbation phase, and reset to 0 during the post phase.

For example, if a subject were assigned to the Down group and her threshold was 200 Hz, a 220 Hz production during the perturbation phase would result in a scaling factor of 1 – ((220/200 – 1) * .5), or 0.95. Scaling the input F0 of 220 Hz by 0.95 would result in an output F0 of 209 Hz, an apparent decrease in F0 which would cause the stressed word to sound less stressed. On the other hand, if the same subject were assigned to the Up group, the scaling factor for the same utterance would be 1.05 and would increase the perceived F0 to 231 Hz, thereby increasing the apparent F0 contrast between the stressed word and the unstressed words (see Figure 1).
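
The scaling rule and the worked example above can be sketched in a few lines of Python. This is an illustrative sketch only, not the DSP implementation used in the study; the function name `pitch_scale` and its argument names are hypothetical, and the treatment of F0 values at or below the threshold (scale factor of 1) follows the statement that such values were never perturbed.

```python
def pitch_scale(f0_hz, threshold_hz, pertval, direction):
    """Return the factor by which feedback F0 is scaled (hypothetical helper).

    direction is "up" or "down"; pertval is 0 in the baseline and post
    phases and 0.5 in the full perturbation phase.
    """
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    # F0 values at or below the threshold are left unperturbed.
    if f0_hz <= threshold_hz:
        return 1.0
    # Proportion by which the produced F0 exceeds the threshold.
    excess = f0_hz / threshold_hz - 1.0
    sign = 1.0 if direction == "up" else -1.0
    return 1.0 + sign * excess * pertval


# Worked example from the text: threshold 200 Hz, produced F0 220 Hz.
down_scale = pitch_scale(220.0, 200.0, 0.5, "down")  # approximately 0.95
up_scale = pitch_scale(220.0, 200.0, 0.5, "up")      # approximately 1.05
heard_down_hz = 220.0 * down_scale                   # approximately 209 Hz
heard_up_hz = 220.0 * up_scale                       # approximately 231 Hz
```

Because the scale factor depends on how far the produced F0 exceeds the threshold, a speaker who raises F0 further above threshold hears a proportionally larger shift.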

Insert Figure 1 about here
Perturbation was implemented using a Texas Instruments (TI DSK 6713) digital signal processing (DSP) board with only minimal processing delay (~26ms). An audio mixer split the subjects’ speech signal into two channels, one sent to a computer for recording and one sent to the DSP board. The DSP board used a near-real-time autocorrelation algorithm to track and shift the F0 of each participant. This F0-shifted output was further split and sent both to the subjects’ headphones and to the recording computer. Thus, each experimental session produced a stereo waveform consisting of one channel of microphone-recorded data (i.e. what the participant produced) and one channel of feedback-perturbed data (i.e. what the participant heard). The two channels were compared with and without perturbation to ensure that the F0 shift had no effect on intensity.

Acoustic analysis

Customized software implemented in Matlab (CadLab acoustic analysis suite (CLAAS)) was used to derive estimates of F0, relative intensity, and duration for each word across all utterances. Each utterance was manually annotated to demarcate word boundaries (r = 0.984 interlabeler reliability for 10% of the data). CLAAS used the Praat autocorrelation algorithm to estimate time-stamped F0 values (Boersma & Weenink, 2009). Similarly, time-stamped intensity values were derived via a root-mean-square calculation of the acoustic waveform. The software operated on the annotations and the time-stamped pitch and intensity values to calculate word duration, average F0, and average intensity across stressed and unstressed words. All analyses were performed on the original spoken utterance, not on the F0-perturbed feedback. The perturbed signal was compared with the microphone-recorded signal to ensure perturbation occurred on the intended trials.

A total of 12,000 utterances were acoustically analyzed (480 trials x 25 participants). A subset of the utterances was examined by hand to ensure correct pitch tracking of all words. Pitch tracking errors, when found, were manually corrected. Errors in pitch tracking were especially problematic for females, particularly for the third and fourth words, which were often in the glottal fry register. Manual correction of automatically generated F0 values was required on 8.3% of the total dataset; 2.7% were excluded. Two female subjects had greater than 100 mistracked trials (>20%) and were excluded from further analysis. Furthermore, one male subject was excluded due to corrupted acoustic data, and one female subject was excluded because she produced incorrect stress on greater than 40% of trials (incorrect stress for all other subjects ranged from 0-13 tokens (0-2.7%), with an average of 2.46 (0.5%) errors in the Down group and 1.33 (0.3%) errors in the Up group). The resultant dataset after exclusions was 9752 utterances from 21 participants (Up: 6 M, 5 F, mean age = 22.0 years; Down: 5 M, 5 F, mean age = 22.2 years).

Although acoustic measures were obtained for all four words within an utterance, analyses were restricted to the first and second word (W1 and W2) for two main reasons. First, W1 and W2 were the only two word locations that were counterbalanced in both stressed and unstressed conditions. Second, the word length and sentence position of W3 and W4, respectively, led to variable and imprecise acoustic measurements. Specifically, W3, which was the word “a” in all stimuli, was often reduced or even omitted, while W4 was often glottalized or excessively lengthened owing to phrase-final boundary effects.

Results

Speaker responses to F0 perturbation were examined in three acoustic variables: mean F0, mean intensity, and word duration. For each trial, the dependent measure was the difference in a given acoustic variable between the stressed word (W1 or W2) and the unstressed word (W2 or W1). This difference was normalized by the mean stressed–unstressed difference in baseline. For simplicity, this normalized value will be referred to as the contrast distance, since it represents the degree to which speakers contrasted the stressed and unstressed words within an utterance.
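
As an illustration of the dependent measure, the following Python sketch computes contrast distance for one acoustic variable. It assumes that "normalized by" means division by the mean baseline difference, which the text does not state explicitly; the function name, argument names, and example values are hypothetical.

```python
from statistics import mean


def contrast_distances(stressed, unstressed, n_baseline):
    """Per-trial stressed-minus-unstressed difference, normalized to baseline.

    stressed, unstressed: per-trial values of one acoustic variable
    (e.g., mean F0 in Hz), ordered by trial.
    n_baseline: number of leading trials that belong to the baseline phase.
    Assumption: normalization is division by the mean baseline difference.
    """
    diffs = [s - u for s, u in zip(stressed, unstressed)]
    baseline_mean = mean(diffs[:n_baseline])
    return [d / baseline_mean for d in diffs]


# Hypothetical values: two baseline trials followed by one later trial.
example = contrast_distances([220.0, 230.0, 240.0], [200.0, 200.0, 200.0], 2)
```

Under this assumption, a value of 1 corresponds to the speaker's average baseline contrast, values above 1 to an enhanced contrast, and values below 1 to a reduced contrast.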

Broadly, the Up and Down groups responded differently to the F0 perturbation, altering their contrast distances in opposite directions with only a short delay from perturbation onset. However, individual speakers were noted to use differing strategies to compensate for perturbations and had individualized time courses for adaptation. Additionally, there was a group-wise trend of a slow increase in F0 and intensity of stressed words across the experiment. Covariance of F0 and intensity due to tracheal pressure changes would result in a trial-by-trial correlation of these two variables, since this physiological coupling would necessarily be present for every trial. To ensure that variations in intensity were not simply the result of a passive physiological correlation with F0, the correlation of these two measures was calculated on a trial-by-trial basis for each participant, and the resulting r scores were Fisher z-transformed before averaging across the group. This analysis yielded weak correlations (z = 0.14 averaged across participants; back-converted to r = 0.139), suggesting that changes in subglottal pressure required to modulate intensity had little direct influence on F0 changes in this study.
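
The averaging step can be sketched as follows: each participant's r is converted to z with the inverse hyperbolic tangent, the z values are averaged, and the mean is converted back to r with the hyperbolic tangent. The function name is hypothetical and the sketch is not the analysis code used in the study.

```python
import math


def mean_correlation_fisher(r_values):
    """Average correlations via the Fisher z-transform.

    Returns (mean_z, mean_r), where mean_r is the back-transformed mean.
    """
    # Fisher z-transform: z = atanh(r).
    z_values = [math.atanh(r) for r in r_values]
    mean_z = sum(z_values) / len(z_values)
    # Back-transform the averaged z to the r scale.
    return mean_z, math.tanh(mean_z)


# A mean z of 0.14 back-converts to r of about 0.139, matching the text.
reported_r = math.tanh(0.14)
variance_explained = reported_r ** 2  # just under 2% of the variance
```

Squaring the back-converted r gives the proportion of shared variance, which is under 2% here.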

To quantify the changes in contrast distance between and within subject groups, paired and independent samples t-tests were conducted on conditions of interest. Between-group (Up vs. Down) differences were compared at all four experimental phases. Because the upward drift of both F0 and intensity over the course of the experiment affects interpretation of within-group changes, the analysis focuses on these between-group differences. Additionally, within each perturbation direction (Up or Down), differences between all phases (baseline, ramp, perturbation, and post) were compared; therefore, there were six comparisons for each perturbation direction, or twelve within-subjects comparisons. In total, sixteen t-tests were carried out for each acoustic variable. To account for multiple comparisons, the Bonferroni correction was used to adjust the α-level to 0.003.
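
The count of comparisons and the adjusted significance level can be checked with a short calculation. The family-wise level of 0.05 is an assumption (it is not stated in the text, but it is the conventional value and reproduces the reported 0.003); variable names are illustrative.

```python
import math

FAMILY_ALPHA = 0.05  # assumed family-wise significance level

n_phases = 4                              # baseline, ramp, perturbation, post
n_between = n_phases                      # Up vs. Down at each phase
n_within_per_group = math.comb(n_phases, 2)  # pairwise phase contrasts: 6
n_tests = n_between + 2 * n_within_per_group  # 4 + 12 = 16

# Bonferroni correction: divide the family-wise level by the number of tests.
adjusted_alpha = FAMILY_ALPHA / n_tests   # 0.003125, reported as 0.003
```
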

Mean fundamental frequency (F0)

Between-group comparisons show evidence of adaptation to the upward and downward F0 perturbations (Figure 2). In the baseline phase, in which no perturbation was applied, there was no significant difference between the Up and Down groups (p = 0.45). However, the two groups diverged in the ramp phase (p = 0.0014) and remained significantly different in the perturbation phase (p < 0.0001) before falling back below the adjusted significance level in the post phase (p = 0.02). Thus, the perturbation resulted in a difference in F0 contrast distance between the two groups. Specifically, speakers altered F0 to enhance or reduce emphatic stress, with the Down group increasing the F0 difference between stressed and unstressed words as compared to the Up group.

Insert Figure 2 about here
Contrasts between phases were used to examine the time course of adaptation within a perturbation direction group. In the Down group, there was a difference between baseline and every other phase (ramp, p = ***; perturbation, p = ***; and post, p = ***), as well as between ramp and every other phase (perturbation, p = ***; and post, p = ***). There was no difference between perturbation and post phases (p = ***). Thus, the F0 contrast distance increased from the baseline through to the perturbation phase and then stabilized in the post phase. In the Up group, only differences between the post phase and every other phase (baseline, ramp, and perturbation) were statistically significant. In other words, the F0 contrast distance did not change from baseline values until the upward perturbation was removed in the post phase. Patterns of adaptation within groups may be due in part to the overall upward drift of F0 during the course of the experiment.

Mean intensity

As with fundamental frequency, there was evidence of adaptation in intensity (see Figure 3). Speakers who received a downward perturbation increased the intensity contrast between stressed and unstressed words more than speakers who received an upward perturbation, even though speakers’ intensities were unaffected by the perturbation. The two perturbation direction groups significantly differed in intensity contrast distance during the perturbation phase (p < 0.0001); however, they were not significantly different in any other phase (baseline: p = 0.5; ramp: p = 0.018; post: p = 0.006).

Insert Figure 3 about here
With regard to within-group contrasts, in the Down group only the baseline phase was significantly different from other phases (ramp, perturbation, and post). In other words, the intensity contrast distance increased during the ramp phase and remained increased throughout the experiment. In the Up group, the only significant phase contrast was that between baseline and post phases, again suggesting a slow drift in intensity contrast distance over the course of the experiment.

Word duration

Unlike in fundamental frequency and intensity, there were no significant between-subjects differences (Up vs. Down) in duration contrast distance (see Figure 4). In other words, the perturbation did not effect a durational change between the stressed and unstressed words.

Insert Figure 4 about here
While there were no between-subject differences, there was a difference in the Down group between baseline and perturbation phases (p = 0.0003), as well as between baseline and post phases (p < 0.0001). In the Up group, however, experimental phase had no effect on duration contrast distance.
Discussion

A major aim of the current study was to distinguish between two representations of speech prosody, one involving independent control of pitch, loudness, and duration, and one involving integrated control of these cues. The emphatic stress task required speakers to modulate prosody while keeping segmental units constant. Introducing a gradual F0 perturbation altered the influence of F0 as a stress-bearing cue. To maintain the appropriate degree of contrast between stressed and unstressed words, speakers might alter only F0, consistent with the Independent Channel Model, or they might alter a combination of prosodic cues to oppose the F0 shift, consistent with the Integrated Model.

Speaker responses to pitch-shifted auditory feedback were measured in three acoustic variables: F0, intensity, and duration. Results indicated that the Up and Down groups adapted to shifts in F0 by altering the contrast between stressed and unstressed words. Specifically, when participants heard their stressed F0 shifted downwards, they increased F0 contrast compared with when they heard their stressed F0 shifted upwards. The interaction between the two F0 manipulations supports the conclusion that speakers are sensitive to upward and downward shifts of F0 in a meaningful prosodic context. Furthermore, compensatory effects were not restricted to F0 but also extended to intensity: speakers altered intensity contrast distance during perturbation, making stressed words relatively louder in the Down group compared with the Up group. This change occurred even though (i) the auditory feedback preserved intensity differences between stressed and unstressed words, and (ii) intensity and F0 were only very weakly correlated in individual trials. These findings provide support for the Integrated Model in that speakers modified both F0 and intensity (although not duration) in response to F0 shifts.

While we did not observe perturbation-induced changes in word duration, we do not rule out the possibility that stress-related timing may be affected by F0 shifts. It is possible that an effect of duration would be seen in other prosodic contexts, or in other measures of duration (e.g. in the length of the pauses between words).

In both experimental groups, Up and Down, speakers’ F0 in the perturbation phase was the same as or higher than their initial baseline values. This is consistent with the gradual increase in F0 noted in past sensorimotor adaptation studies (Jones & Munhall, 2000, 2002; Villacorta et al., 2007), where it was attributed to vocal fatigue. During F0 perturbations to single-word productions, subjects were found to increase F0 over many trials (Jones & Munhall, 2000, 2002), even in the baseline phase, with no perturbation yet introduced. In the present sentence production task, only the stressed word showed a similar drift, resulting in an apparent increase in the contrast distance over the course of the experiment. Given an expected gradual increase in contrast distance, it appears that the Up group was in fact opposing the perturbation by decreasing F0 relative to an upward drift. Because of the use of the Hertz scale, which is acoustically and not psychophysically defined, the F0 shifts were slightly greater perceptually in the Down condition than in the Up condition. However, the differences in this region of acoustic space are small relative to the cross-subject variation in threshold values and in upward deviations from those thresholds.

Unlike Jones & Munhall (2002), who found no significant differences in intensity between Up and Down groups, our present results show an increased intensity for the Down group as well as a trend towards increasing intensity over many trials. This may be due to the prosodic nature of the emphatic stress task. Loudness can be used as a stress cue, and thus participants had reason to manipulate intensity in line with F0 to maintain emphatic stress (Fry, 1955; Kochanski et al., 2005).

While the gradual upward drift in F0 and intensity is consistent with an adaptive response from both Up and Down groups, an alternative explanation is that the two groups adapted to different degrees. The increase in F0 by the Down group may be indicative of a larger compensation than that of the Up group, whose F0 did not decrease from baseline values. The perturbation experienced by the Up group did not interfere with the planned intonation pattern of a stressed–unstressed contrast; thus, there may have been less of a need for a corrective response. That is, there may be more of an incentive to restore a stress contrast that has been attenuated (Down group) than to decrease a stress contrast that has been enhanced (Up group), since the enhancement is aligned with the speaker’s goals. This interpretation is also supported by previous work in which downward perturbations elicited a larger rapid compensation response than upward perturbations during question productions (Chen et al., 2007).

While the compensatory changes in both F0 and intensity are in line with the Integrated Model, they might also be attributed to a correlation between F0 and intensity (Gramming, Sundberg, Ternström, Leanderson, & Perkins, 1988; Dromey & Ramig, 1998), as F0 has been found to increase at higher than habitual loudness levels associated with a high tracheal pressure (though not at soft levels). However, our analysis of the trial-by-trial F0-intensity correlation suggests this was not the case. The correlation explains less than 2% of the variance observed in F0; thus, physiological dependencies alone are unlikely to explain the commensurate intensity increase in the group data.

The present results support an Integrated Model of prosodic control in which the motor system modulates F0 and intensity in combination to convey stress. However, extending this protocol to perturbation of other prosodic cues will help to generalize the findings. In a planned follow-up to the current study, subjects will undergo intensity perturbations rather than F0 perturbations, allowing a direct comparison of the F0-intensity interaction in each direction. Future experiments will also help disambiguate the roles of word type and word position within an utterance. All of the stimuli in the current study followed the same pattern of word types across the sentence (Name verbed a noun). It will be important to vary this pattern to assess the influence of different parts of speech and sentence positions on adaptation responses. Finally, given that perturbation paradigms can induce enhanced linguistic contrasts in healthy talkers, it may be possible to leverage this paradigm for therapeutic benefit in disordered populations.

Acknowledgements

This research was supported by NIH grant # R01 DC02852. We thank Katherine Alexander, Pamela Campellone and Demetris Roumis for their time and dedication to data collection and annotation, as well as Matthew Fortier, Mark Tiede and Jonathan Brumberg for their efforts in implementing the customized acoustic and statistical analysis tools.