The Joy of Peer Review

If you’ve submitted academic papers for publication in peer-reviewed journals, you’ve very likely received a frustrating review or three. It’s all too easy to feel anger and frustration at the apparent sloppiness or boneheadedness of a reviewer. But if a review has something of substance (and most do, in my experience), my approach is to put the burden on myself as much as I can. Unless compelled to believe otherwise, if a reviewer takes issue with something I write, I assume that I have failed to communicate an idea as clearly as I could have, and I revise (and resubmit or repackage for another journal) accordingly. This is emotionally more satisfying, and it often produces better papers.

It shouldn’t surprise you that I am writing this because I have recently received a frustrating review (of my paper “Syllable structure and perception of voicing and place of articulation in English stop consonants”, PDF linked on my CV page, though with no direct link from here because the link would die once the preprint draft eventually becomes a publication). Actually, two of the reviews I received were frustrating (if also substantive), though I will focus narrowly here on one small part of one of them.

First, some background. In the paper in question, I used general recognition theory to model phonological feature perception. I was looking at how voicing and place of articulation interact (or not) in phoneme identification. More concretely, I had people identify simple syllables consisting of a single consonant – either [p], [b], [t], or [d] – and a single vowel – [a] as in ‘father’. Abstractly, voicing distinguishes [p] from [b] and [t] from [d]; [p] and [t] are voiceless, while [b] and [d] are voiced. Analogously, place of articulation distinguishes [p] from [t] and [b] from [d]; [p] and [b] are labial, while [t] and [d] are alveolar. The set of four consonants consists of the factorial combination of {voiceless, voiced} and {labial, alveolar}, which maps nicely onto a two-dimensional perceptual space (voicing x place) with two levels on each dimension. (Naturally, the paper has as much additional detail as you are ever likely to want, so you know where to go if you’re short on technical reading material.)
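To make the factorial structure concrete, here is a minimal sketch in Python. It is only an illustration of the 2 x 2 design described above, not code from the paper, and the names (VOICING, PLACE, CONSONANTS, stimulus_categories, contrast) are made up for this post.

```python
from itertools import product

# The two levels on each phonological dimension, as described in the text.
VOICING = ("voiceless", "voiced")
PLACE = ("labial", "alveolar")

# Each (voicing, place) pair picks out exactly one of the four consonants.
CONSONANTS = {
    ("voiceless", "labial"): "p",
    ("voiced", "labial"): "b",
    ("voiceless", "alveolar"): "t",
    ("voiced", "alveolar"): "d",
}


def stimulus_categories(vowel="a"):
    """Return the syllable for every voicing x place combination."""
    return {
        (voicing, place): CONSONANTS[(voicing, place)] + vowel
        for voicing, place in product(VOICING, PLACE)
    }


def contrast(c1, c2):
    """List the dimensions on which two consonants differ."""
    features = {cons: feats for feats, cons in CONSONANTS.items()}
    (v1, p1), (v2, p2) = features[c1], features[c2]
    differing = []
    if v1 != v2:
        differing.append("voicing")
    if p1 != p2:
        differing.append("place")
    return differing
```

Calling contrast("p", "b") returns only the voicing dimension and contrast("p", "t") only the place dimension, while a pair like [p] and [d] differs on both.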

All too often, in my opinion, when phoneticians go to study perception of, say, the voicing distinction between [t] and [d] or [p] and [b], they use synthetic stimuli. Of course, there are good reasons to use synthetic stimuli, the most obvious being that they enable isolation of the effects of acoustic properties of interest. However, creating synthetic stimuli requires the experimenter to make strong assumptions about the set of acoustic cues that are relevant to a phonological distinction. Among the many assumptions that must be made in synthesizing [pa], [ba], [ta] and [da] syllables, for example, the experimenter has to determine voice onset time (VOT) duration, release burst amplitude, aspiration noise amplitude, various spectral properties of the burst and aspiration noise, fundamental frequency (f0) at the consonant-vowel boundary, f0 in the rest of the vowel (and how it changes over time), vowel duration, and formant values (and changes therein). We have to make exactly the same kinds of assumptions if we’re interested in, say, the place difference between [pa] and [ta] or between [ba] and [da], or pretty much any other pair of (minimally) contrastive phones.

If you’re synthesizing ‘from scratch’, you have to tell your speech synthesizer exactly what you want for all of these acoustic properties (and more), or you have to trust that it has reasonable default settings. Of course, you don’t have to synthesize from scratch. You can manipulate natural speech, as well. You could re-synthesize natural speech with STRAIGHT, or you could (again focusing on voicing and VOT) take a token from one category (e.g., [ta]) and systematically shorten the VOT to produce, bit by bit, stimuli that sound more like [da]. But neither of these gets you off the hook for making strong assumptions. Some of what you feed into STRAIGHT is left alone, which is to say that you’re (at least implicitly) assuming that the values on these acoustic dimensions are appropriate to your goals. And if you start from [ta] and only manipulate VOT, your [da] stimuli aren’t going to sound very [da]-like, since you’ll be stuck with the release burst, aspiration noise, and f0 from the [t]. Starting with [da] and lengthening VOT has the same basic problem.

So, on the one hand, if you’re interested in probing a single phonological distinction, you have to deal with the fact that any given distinction is cued by differences in multiple acoustic properties. Fixing all but one of these at a predetermined value is just one way to do so. On the other hand, if you’re interested in probing multiple phonological dimensions simultaneously, as I am, then the situation is even more complicated, because it’s not just that any given contrast maps onto multiple cues, but that any given acoustic cue can also provide information about multiple phonological distinctions.

The fact that there is a many-to-many mapping between abstract distinctions and associated acoustic cues is important. VOT is longer for voiceless sounds (p, t) than for voiced (b, d), but it’s also longer for alveolars (t, d) than it is for labials (p, b). Similarly, release burst amplitude is lower for voiced sounds than it is for voiceless, but it’s also lower for labials than alveolars. Some acoustic cues are relevant to one but not another contrast, some are irrelevant to both voicing and place, and some of them have, as far as I know, an unknown relationship to these phonological distinctions.
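As a rough sketch of what "many-to-many" means here, the two cues just mentioned can be written down as a small lookup table in Python. Only the qualitative directions stated above are encoded (no measurements), and the names CUE_DIRECTIONS, cues_for, contrasts_for, and is_many_to_many are hypothetical, chosen just for this illustration.

```python
# For each cue and each contrast, the pair is (level with the higher value,
# level with the lower value), following the directions given in the text.
CUE_DIRECTIONS = {
    "VOT": {
        "voicing": ("voiceless", "voiced"),
        "place": ("alveolar", "labial"),
    },
    "burst_amplitude": {
        "voicing": ("voiceless", "voiced"),
        "place": ("alveolar", "labial"),
    },
}


def cues_for(contrast):
    """Return every cue in the table that carries information about a contrast."""
    return sorted(cue for cue, dirs in CUE_DIRECTIONS.items() if contrast in dirs)


def contrasts_for(cue):
    """Return every contrast that a given cue carries information about."""
    return sorted(CUE_DIRECTIONS[cue])


def is_many_to_many():
    """True if some cue informs several contrasts and some contrast has several cues."""
    contrasts = {c for dirs in CUE_DIRECTIONS.values() for c in dirs}
    cue_to_many = any(len(dirs) > 1 for dirs in CUE_DIRECTIONS.values())
    contrast_to_many = any(len(cues_for(c)) > 1 for c in contrasts)
    return cue_to_many and contrast_to_many
```

Even with just these two cues, fixing VOT at some value to study voicing also fixes a cue to place, which is the crux of the problem with holding "everything else" constant.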

When you use synthetic, re-synthesized, or manipulated stimuli, you get experimental control, and you make choices about the dimension(s) you’re interested in. But you also make choices about every other dimension, some of which are relevant to the contrast of interest, some of which are not, and some of which play an unknown role.

My approach to addressing this issue in the experiments under consideration was to be “cue agnostic.” I produced four tokens of each syllable, letting all the relevant acoustic cues do what they do, mapping many-to-many between the phonological and the acoustic-phonetic levels, and letting the irrelevant acoustic properties do whatever they do when they’re busy being irrelevant (e.g., vary randomly from token to token). I used four tokens (i.e., more than one) per category because I found in piloting the experiments that with only one token per category, a unique and phonologically irrelevant acoustic feature of a stimulus (e.g., a slight uptick in f0 at the end of the [pa] syllable) could be used to correctly identify that stimulus without paying any attention to any of the phonologically relevant acoustic cues (e.g., VOT, burst amplitude, formant transitions). I used four tokens (i.e., not tens or hundreds) because I wanted enough data in response to each token to feel confident in the inferences drawn from the fitted statistical models (which map stimulus categories onto distributions in perceptual space).

I also provide a number of acoustic measurements of the stimuli in Appendix B of the paper, so that the reader can get a sense of the variability between tokens and the extent to which the stimuli are typical of the intended categories. (As an aside, Appendix A is also indirectly relevant to the discussion, as it provides the raw data and statistical evidence that there are few, if any, consistent differences between response patterns to the different tokens within each category.)

All of which brings me, finally, to the frustrating bit of Reviewer #4’s comments that precipitated this post (emphasis mine):

The author in this case only has one speaker and not very many tokens. This creates a confound in that the results cannot be generalised to other speakers. On p. 9, the author states, “multiple tokens of naturally produced (and so naturally variable) speech stimuli are presented (embedded in noise; cf. the array of synthetic vowel stimuli presented in quiet in Nearey, 1997).” While this is definitely “cue agnostic”, it is certainly not speaker agnostic and it is likely that idiolectal properties of the speaker compromise the validity of the conclusions. A synthetic stimulus series is likely to contribute much more information than a small number of naturally produced tokens. Later, on p. 25, the author writes, “In order to ensure that the subjects were not simply able to attend to some irrelevant acoustic feature(s) of a particular token of a particular category, a small degree of within-category variability was introduced by using four tokens of each type.” However, the subjects are still more likely to attend to some irrelevant acoustic feature that is particular to a single speaker.

To be clear, these comments are substantive. The single speaker and small number of tokens per category limit the generality of the results. But the same basic problem puts equal or even more severe limits on the generality of studies that use synthetic or manipulated stimuli.

If you’re using manipulated stimuli from a single speaker, how do you know if your results generalize to other speakers? If you’re using stimuli synthesized from scratch, how do you know if your results generalize to any speakers? You can test the generality of your findings directly, of course, or you can make the case that the myriad fixed and presumed irrelevant acoustic properties of your stimuli are sufficiently representative of speech in the language you’re studying. With my non-synthetic stimuli, I chose the latter approach, providing the reader with a number of acoustic measurements in an effort to establish that the stimuli are, in fact, reasonably typical of the categories in question. And then Reviewer #4’s final comment was this gem: “One final note: What purpose does Appendix B serve?”

I appreciate that Reviewer #4 took the time to think carefully about my paper, but it’s frustrating that the careful thinking stopped where it did. My results have limited generality, without a doubt. But limited generality isn’t at all atypical of speech perception research (or, for that matter, experimental research of any kind). And the same issues that limit my findings also limit the generality of findings based on stimuli proffered as a solution to these issues by Reviewer #4.

So, how do I place this burden on myself? I need to communicate, clearly and concisely, essentially what I wrote in this post. The fact of the matter is that any choice you make about stimuli in speech perception research forces you to limit your findings in one way or another.