1 Neuroscience and Cognitive Science Program, University of Maryland, College Park, MD, USA; Department of Linguistics, University of Maryland, College Park, MD, USA.

2 Silicon Speech, Hidden Valley Lake, CA, USA.

3 Department of Information and Communication Sciences, Sophia University, Tokyo, Japan.

4 Neuroscience and Cognitive Science Program, University of Maryland, College Park, MD, USA; Department of Biology, University of Maryland, College Park, MD, USA; Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA; Institute for Systems Research, University of Maryland, College Park, MD, USA.

5 Neuroscience and Cognitive Science Program, University of Maryland, College Park, MD, USA; Department of Linguistics, University of Maryland, College Park, MD, USA; Department of Psychology, New York University, New York, NY, USA; Department of Neuroscience, Max Planck Institute, Frankfurt, Germany.

Abstract

How speech signals are analyzed and represented remains a foundational challenge for both cognitive science and neuroscience. A growing body of research, employing a range of behavioral and neurobiological experimental techniques, now points to the perceptual relevance of both phoneme-sized (10–40 Hz modulation frequency) and syllable-sized (2–10 Hz modulation frequency) units in speech processing. However, it is not clear how information associated with such different time scales interacts in a manner relevant for speech perception. We report behavioral experiments on speech intelligibility employing a stimulus that allows us to investigate how distinct temporal modulations in speech are treated separately and whether they are combined. We created sentences from which the slow (~4 Hz; Slow) and rapid (~33 Hz; Shigh) modulations, corresponding to ~250 and ~30 ms (the average durations of syllables and certain phonetic properties, respectively), were selectively extracted. Although Slow and Shigh have low intelligibility when presented separately, dichotic presentation of Shigh with Slow results in supra-additive performance, suggesting a synergistic relationship between low and high modulation frequencies. A second experiment desynchronized presentation of the Slow and Shigh signals. Desynchronizing the signals relative to one another had no impact on intelligibility for delays of less than ~45 ms; longer delays resulted in a steep decline in intelligibility, providing further evidence of integration, or binding, of information within restricted temporal windows. Our data suggest that human speech perception uses multi-time-resolution processing: signals are concurrently analyzed on at least two separate time scales, the intermediate representations of these analyses are integrated, and the resulting bound percept has significant consequences for speech intelligibility. This view is compatible with recent insights from neuroscience implicating multi-timescale auditory processing.

Signal processing block diagram. Signals were low-pass filtered at 6 kHz, sampled at 16 kHz, and quantized with 16-bit resolution. The spectrum of the speech signal was partitioned into 14 frequency bands with a linear-phase FIR filter bank (slopes of 60 dB/100 Hz or greater) spanning the range 0.1–6 kHz, spaced in 1/3-octave steps (approximately one critical band wide) across the acoustic spectrum. The Hilbert transform was used to decompose the signal in each band into a slowly varying temporal envelope and a rapidly varying fine structure. The temporal envelope was low-pass filtered with a cutoff frequency of 40 Hz and then either low-pass (0–4 Hz; blue blocks) or band-pass (22–40 Hz; red blocks) filtered. The time delays introduced by the filtering, relative to the original signal, were compensated by shifting the filter outputs. After filtering, the envelope was recombined with the carrier signal (fine structure) by multiplying the original band by the ratio of the filtered envelope to the original envelope. The result for each original signal (S) is Slow or Shigh, containing only low or high modulation frequencies, respectively. The inset shows the effect of the signal processing on a sample sentence: the attenuation caused by the filtering is plotted as a function of the frequency content of the envelopes. Blue: envelopes (for each of the 14 frequency bands) of Slow; red: envelopes of Shigh.
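A minimal Python/SciPy sketch of this pipeline appears below. It substitutes zero-phase filtering for the explicit delay compensation described above; the band edges, filter lengths, and all function names are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
from scipy.signal import firwin, filtfilt, hilbert

FS = 16000  # sampling rate in Hz, per the caption

def band_edges(f_lo=100.0, f_hi=6000.0, n_bands=14):
    """Approximate 1/3-octave (roughly critical-band) edges, 0.1-6 kHz."""
    return np.logspace(np.log10(f_lo), np.log10(f_hi), n_bands + 1)

def extract_modulations(sig, mod_lo, mod_hi, numtaps=1025):
    """Return a version of `sig` whose per-band envelopes contain only
    modulation frequencies in [mod_lo, mod_hi] Hz (mod_lo = 0 gives the
    low-pass envelope filter used for S_low)."""
    out = np.zeros_like(sig, dtype=float)
    lp40 = firwin(numtaps, 40.0, fs=FS)  # 40 Hz envelope pre-smoothing
    edges = band_edges()
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Band-pass the speech into one near-critical-band channel.
        bp = firwin(numtaps, [lo, hi], pass_zero=False, fs=FS)
        chan = filtfilt(bp, [1.0], sig)  # zero-phase: no delay to compensate
        # Hilbert decomposition: slow envelope x fast fine structure.
        env = np.abs(hilbert(chan))
        env40 = filtfilt(lp40, [1.0], env)
        # Select the desired modulation band of the envelope.
        if mod_lo <= 0:
            mod = firwin(numtaps, mod_hi, fs=FS)  # low-pass (e.g., 0-4 Hz)
        else:
            mod = firwin(numtaps, [mod_lo, mod_hi], pass_zero=False, fs=FS)
        env_f = filtfilt(mod, [1.0], env40)
        # Re-impose the filtered envelope: scale the original band by the
        # ratio of the filtered envelope to the original envelope.
        out += chan * (env_f / np.maximum(env40, 1e-8))
    return out

# s_low  = extract_modulations(speech, 0.0, 4.0)    # ~syllable-rate modulations
# s_high = extract_modulations(speech, 22.0, 40.0)  # ~phonetic-rate modulations
```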

Results of Experiment 1 (analysis by subjects). Shigh and Slow, when presented separately (the HIGH and LOW conditions, respectively), have low intelligibility. Dichotic presentation of Shigh with Slow (BOTH) results in significantly better performance than would be predicted from the combined performance in the HIGH and LOW conditions (“Linear combination”). Error bars denote ±1 standard error.
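As an illustration, one common way to derive such a baseline is probability summation over the two separately presented signals. The independence assumption and the function below are ours, not necessarily the predictor used in the paper's analysis.

```python
def linear_combination(p_low, p_high):
    """Predicted intelligibility if the words recovered from S_low and
    S_high were extracted independently (probability summation). This
    independence baseline is an assumption for illustration; the
    paper's exact predictor may differ."""
    return p_low + p_high - p_low * p_high

# Hypothetical example: word-report proportions of 0.10 (LOW) and
# 0.15 (HIGH) predict 0.235 for BOTH; observed performance above this
# value would count as supra-additive.
```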

Results of Experiment 2. Top: intelligibility performance as a function of onset asynchrony. The “Slow leading” and “Shigh leading” conditions yielded qualitatively similar performance, so the data were collapsed across the two conditions in subsequent analyses. Bottom: solid black line: intelligibility performance as a function of onset asynchrony (collapsed over ear of presentation and over Shigh/Slow leading). Performance divides into three intervals: asynchronies of less than 45 ms have no effect on intelligibility; performance declines sharply between 45 and 150 ms; and it remains constant beyond that interval. Yellow lines show repeated (100 iterations) computations of the above measure using random subsets of 50% of the data points contributing to each delay condition. Dashed lines are the standard error of the mean derived from this procedure. The inset presents the same analysis using 20% of the data points in each delay condition. The main effects are preserved in these analyses, indicating that they reflect a stable phenomenon.
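The random-subset stability check can be sketched as follows in Python; the data structure and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def subset_curves(scores_by_delay, frac=0.5, n_iter=100):
    """Recompute the intelligibility-vs-asynchrony curve on random
    subsets of trials, as in the caption's stability check.
    `scores_by_delay` maps delay in ms -> 1-D array of per-trial
    intelligibility scores (a hypothetical structure)."""
    delays = sorted(scores_by_delay)
    curves = np.empty((n_iter, len(delays)))
    for i in range(n_iter):
        for j, d in enumerate(delays):
            s = np.asarray(scores_by_delay[d])
            k = max(1, int(round(frac * len(s))))          # 50% (or 20%) subset
            pick = rng.choice(len(s), size=k, replace=False)
            curves[i, j] = s[pick].mean()
    mean = curves.mean(axis=0)                             # yellow-line average
    sem = curves.std(axis=0, ddof=1) / np.sqrt(n_iter)     # dashed lines
    return delays, mean, sem
```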