The "Text-to-Speech Synthesis Technology" ASA Standards working group (S3-WG91) is conducting a web-based test that applies the method it will be proposing as an ANSI standard for evaluating TTS intelligibility. It is an open-response test ("type what you hear"). The test uses Semantically Unpredictable Sentences (SUS): sentences that are syntactically correct but semantically meaningless.

+ For each TTS system, intelligibility will be evaluated at six speaking rates spanning a wide range.

+ There are 60 short sentences presented to a listener during a test session in blocks of 10 sentences at each of six speaking rates. The test generally takes 15-20 minutes.

+ Several synthesis techniques will be tested, including formant synthesis, small inventory diphone concatenation, unit selection, and HMM-based synthesizers. Each synthesis technique is represented by at least two TTS systems. Both female and male American English voices will be tested for each system.

+ A listener will hear only one synthesizer and voice during a test session. The rate of speech will change from its default rate to increasingly faster rates as the test session progresses.

+ Synthesizers will be varied over different test sessions.

+ The set of 60 sentences tested will be varied over different test sessions. Eventually, across sessions and listeners, all systems will have been tested with the same larger set of test sentences.

+ Two human speech reference conditions will also be tested following the same procedures as used for the TTS systems. In one, the talker spoke at different speaking rates; in the other, natural speech originally spoken at a moderate rate was sped up through signal processing.
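The session layout described in the list above (one system and voice per session, 60 sentences in six blocks of 10, each block at a faster rate than the last) can be sketched roughly as follows. This is only an illustration, not the working group's actual software: the function name `build_session`, the field names, and the use of random sampling to pick the 60 sentences are all assumptions, and rate index 0 stands in for the default rate.

```python
import random


def build_session(sentence_pool, system, voice, n_rates=6, block_size=10, seed=None):
    """Return an ordered list of trials for one listener session.

    Hypothetical sketch: the source only says that 60 sentences are
    presented in blocks of 10 at each of six rates, from the default
    rate to increasingly faster ones. How sentences are drawn from the
    larger pool is an assumption (simple random sampling here).
    """
    rng = random.Random(seed)
    needed = n_rates * block_size
    if len(sentence_pool) < needed:
        raise ValueError(f"need at least {needed} sentences, got {len(sentence_pool)}")

    # Draw the sentences for this session without repetition.
    sentences = rng.sample(sentence_pool, needed)

    trials = []
    for rate_index in range(n_rates):  # 0 = default rate, higher = faster
        start = rate_index * block_size
        for sentence in sentences[start:start + block_size]:
            trials.append({
                "system": system,        # one synthesizer per session
                "voice": voice,          # one voice per session
                "rate_index": rate_index,
                "sentence": sentence,
            })
    return trials
```

Varying `system`, `voice`, and `seed` from one session to the next would correspond to the points about synthesizers and sentence sets being varied over different test sessions.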

This study represents a large collaborative effort. We plan to share results with the synthesis community in the form of conference papers and a journal article. Some related studies are being conducted with blind listeners and with aided hard-of-hearing listeners.

31 Comments

Sandy Nicholson said,

‘Some of the sentences may be difficult to understand and remember’, it says. Some?! I could hardly make out any of it! I’d be interested to know how other people found this exercise.

It could just be because I’m Scottish and therefore not a native speaker of any variety of American English. (In that case, I hope I haven’t skewed the results by self-identifying as a native speaker of English.) American accents are usually pretty comprehensible to me, though, given the exposure that we get to them through the media in particular, so I wonder whether my non-native status is really enough to tip me over the edge for comprehension. Perhaps, when listening to AmE speakers, I rely to a far greater extent than I thought on contextual cues, which are largely absent in these semantically anomalous utterances.

[(myl) Note that "A listener will hear only one synthesizer and voice during a test session." As a result, the luck of the draw may assign you a rather bad one. Also, in each case the speaking rate is adjusted over a range whose high end is likely to be well out of the system's comfort zone.

And remember that the point of the exercise is to validate a proposal for rating the intelligibility of synthetic speech. In order to accomplish this, it's appropriate for roughly equal portions of the test materials to be highly intelligible, marginally intelligible, and unintelligible. This was my experience as a listener — though the overall quality of different system choices is obviously likely to be different. In any case, the test's authors are not trying to test any particular system, but rather to demonstrate that they can use a method of this sort effectively to characterize the expected performance of any system at all.]

mary said,

I began wondering whether they were testing something else, like to what extent people create context and meaning, and possibly thereby distort the sent message, in the face of an unintelligible utterance. In some sentences it was perfectly clear until the last word or syllable, and I wondered if the last formant had been cut off and the effect of that, after having heard the preceding intelligible words, was somehow being tested. Or maybe it just takes longer to process the distorted speech so by the time you get to the end, you don't have enough cognitive energy to hear, remember, and process it. At times I was 'listening' to a short term memory iteration of what I had heard; I guess the space between hearing the original, and remembering and retrieving the string of words, is where the 'interpretation' would come in.

Peter Taylor said,

Doesn't seem to work in Chromium. My apologies for posting it here, but it doesn't have a button to press to report brokenness and at least here there's a chance that the person who asked you to help will read the comments.

[(myl) I'll send her a note. But it worked for me in Chrome on Ubuntu Linux — can you be more specific about browser and OS details, and also about what "doesn't seem to work" means?]

Mia said,

Yeah, I would have liked to see what some of those were supposed to be…and I think it would be amusing to see the range of what other people said (a bit like playing Telephone). At any rate, some of the ones I heard were truly hilarious.

I've done the first fifteen of sixty so far, after which a break seemed appropriate.

Having read in the comments here that you don't get to see your answers side-by-side with the real ones at the end, I am not sure I will bother to resume.

I was dealt a pretty good voice. The only times I needed to resort to xxx's were in the superfast recordings, where I could often recall the first few syllables, the last few syllables, but not the syllables between. (For the practice recordings, I used xxx only for #6.)

Deborah Pickett said,

Technical difficulties prevented me from taking the test too, like Peter. My setup is Firefox 8 on Mac OS X 10.6. The samples are all completely silent. Quicktime tells me that it knows it's missing a component, but there's no suggestion as to what component I need to install in order to hear the samples.

I suggest that before the test proper, there be the ability to play a fixed sample so that I can weed out my audio problems before I get to the unrepeatable bits.

(On doing a bit of hack-work, I see that the samples are WAV files, and they do play if I paste their URL from the source directly into Firefox. So either it's a JavaScript compatibility problem, or my ISP's transparent proxy getting in the way, or something along those lines.)

pj said,

@Peter Taylor & myl – it wouldn't play the audio for me using Chrome on Vista earlier: it showed I was missing a plugin and wanted me to install QuickTime, which I wasn't about to.
I've just been back using IE instead and it was fine*. To be fair, the opening page does say 'Best viewed on Internet Explorer or Firefox'; and if I were running QuickTime I dare say it'd be ok in Chrome too.

*that is, I could hear the audio; quite a bit of what I heard I couldn't be sure was words, let alone which ones, but that's another matter.

Eric P Smith said,

@mary: "At times I was 'listening' to a short term memory iteration of what I had heard." I have a good short-term auditory memory, and that is what I was doing throughout the test. It was as though I could hear everything two or three times. My speech perception is non-standard and slightly sub-standard; I tumbled to that at the age of about 10 (I am 62). I have Asperger syndrome, which may be related. I wish there had been an opportunity to say so on the test.

Mel said,

Four of them didn't play anything at all for me. I put some variation of xxx or xxxxxxxxx in those four boxes. I think number 37, 20 something, 40 something, and 50 something. Not very helpful, but they were spread out. Otherwise, I still had a lot of trouble with some of the fast ones. I'm using IE.

@Eric: Yeah, I'm surprised they didn't ask about that either as an additional factor. I have an auditory processing disorder that affects my perception of speech, and especially gives me trouble with fast-paced speech and with certain synthesizers. (The synthesizer that I got in the test had a very mumbled quality to its consonants, which does not help matters!)

And like pretty much everyone else, I'd *love* to see a data file released comparing people's interpretations to the actual sentences once this survey is finally closed, because I have a feeling it would be quite hilarious.

David Fried said,

Did anyone else find, as I did, that a fair number of the utterances did not seem to be syntactically coherent? It was not clear to me whether I was supposed to designate a word as "unintelligible" (xxx), when what I seemed to be hearing could not be correct, because it was the wrong part of speech, the wrong number, etc. In practice I wrote down what I heard regardless of syntax.

These problems made me feel, like other commenters here, that the test was really about psychological aspects of perception, rather than the intelligibility of TTS. I suspected that what was being tested was my willingness to confabulate to arrive at syntactically correct utterances, rather than reproduce what I heard.

Ethan said,

@David Fried: Yes, I found that for many of the samples I could either place best-guess words into a series of the right length or I could create a syntactically coherent partial string with xxx's. But I could not complete a coherent syntactic string even allowing for nonsense.

Squish said,

How bizarre. I think I heard about 3 words I was actually sure of (10 if I'm not exaggerating) and no full sentences. One was Onceler, and one was Bowdlerise. I think there must've been too much American in it, I couldn't tell if one word was laughs or laps.

Peter Taylor said,

Sorry, I should have written a full bug report. Chromium 9.0.597.107 (75357) Ubuntu 10.04. The button which was supposed to play the sample instead opened a new window containing about:blank the first time I pressed it, and the second time it told me I'd already heard the sample.

Ellen K. said,

Mine sounded a lot like those animated videos with TTS voices. So easy to understand except when speed (too fast) was an issue. Alas, I'd forgotten about those and answered "never" to the question about listening to TTS in the questions beforehand.

Sili said,

I had trouble making even the human utterances grammatical when they were sped up.

@mary: "I began wondering whether they were testing something else, like to what extent people create context and meaning, and possibly thereby distort the sent message, in the face of an unintelligible utterance."

I don't think that's the intention, but it's going to be a confounding factor.

Galaxy Zoo had a similar issue in the first version, when they asked people to judge the sense of rotation of galaxies. Turns out that our brains are biased to see one sense over the other in blurry images (I don't recall which one). The second generation of the test corrected for this by randomly mirroring the images before presenting them to the human judge – they also removed colour randomly.

I don't know what one would do to improve this test, though.

Incidentally, the lack of reward to the listener is going to make it harder to make this test go viral. The beauty of the Zooniverse is that they manage to trigger our reward circuitry somehow.

The distribution shows a strong relationship between the proportion of correct sentences (ps) and of correct words (pw). The ratio r = Log(ps)/Log(pw) seems to be a powerful index for measuring the complexity of a spoken message. Data replotted from the literature confirm the hypothesis that the higher the contextual (semantic, syntactic, etc.) information in a sentence, the lower this index r. A sentence can be considered as a sequence of more or less linguistically related symbolic units (phonemes, syllables, words, etc.), but the comprehension of a message by listeners depends on an unknown number of subjective units, which Miller called “decision units in the perception of speech”, and which result from various bottom-up and top-down strategies of identification and verification at the acoustic-phonetic level.

As a result, for a given set of sentences, the sentence-level score can be used without much loss of information; and because this is a rather hard test, even with recordings of human speakers, it is generally possible to avoid the ceiling effects that otherwise tend to plague attempts to make intelligibility comparisons.
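To make the index concrete, here is a minimal sketch of how ps, pw, and r = Log(ps)/Log(pw) might be computed from typed responses. The scoring rule is an assumption, not the procedure used in the paper or in the web test: words are lower-cased, stripped of punctuation, and compared position by position, and a sentence counts as correct only if every word matches. The function names are hypothetical. One way to read r: if the words of an n-word sentence were recognized independently of one another, ps would equal pw to the power n and r would equal n, so context that helps listeners lowers r below the sentence length.

```python
import math
import string

# Translation table that deletes ASCII punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lower-case, drop ASCII punctuation, and split into words."""
    return text.lower().translate(_STRIP_PUNCT).split()


def score_responses(pairs):
    """Return (ps, pw) for a list of (reference, response) pairs.

    Assumed scoring rule (not taken from the source): words are
    compared position by position against the reference sentence.
    """
    total_words = correct_words = correct_sentences = 0
    for reference, response in pairs:
        ref, resp = normalize(reference), normalize(response)
        correct_words += sum(1 for a, b in zip(ref, resp) if a == b)
        total_words += len(ref)
        correct_sentences += int(ref == resp)
    return correct_sentences / len(pairs), correct_words / total_words


def complexity_index(ps, pw):
    """r = log(ps) / log(pw); undefined when ps is 0 or pw is 0 or 1."""
    if not (0 < ps <= 1 and 0 < pw < 1):
        raise ValueError("need 0 < ps <= 1 and 0 < pw < 1")
    return math.log(ps) / math.log(pw)


if __name__ == "__main__":
    # Made-up example sentences; "xxx" marks a word the listener missed.
    pairs = [
        ("The table walked through the blue truth.", "the table walked through the blue truth"),
        ("The strong way drank the day.", "the strong way drank the xxx"),
    ]
    ps, pw = score_responses(pairs)
    print(ps, pw, complexity_index(ps, pw))
```

Position-by-position matching is the crudest possible choice; a real scorer would presumably need to handle inserted or dropped words and spelling variants.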

The claim of the 1996 paper is that

The various experiments that were run with the SUS test in order to evaluate the intelligibility of speech synthesizers lead us to believe that it is a highly valuable test for the assessment of text-to-speech synthesizers at the sentence level. This test is suitable for detecting even subtle differences in intelligibility, for example between different prosody modules for a given synthesiser and between two diphone-based synthesisers based on the same speaker. However, like with any kind of ‘standardized test’, we recommend that great care be taken when setting up the SUS experimental procedure, since even a slight difference between two protocols might affect overall intelligibility scores. One of our objectives is that we encourage researchers in the area of text-to-speech synthesis to use a common standardized procedure in the evaluation of any new version of a system, or of a module of it.

Sili said,

Sorry for not reading the references before spouting off. Thank you for the education.

–o–

That's how I interpreted it, Anand. I started having English lessons around 12, but I never put in any actual effort to use the language till around 18, and I certainly did not acquire anything resembling fluency until some time in my mid twenties.

Victor Maurice Faubert said,

Mac OS 10.6.8 using Firefox 8.0.1: Got a message directing me to install some unknown QuickTime component on the first test sample, which I did not do. The first test sample then played, but was so badly garbled I understood nothing (using headphones); none of the remaining test samples produced any sound at all, so I quit after the 6th test sample.

Mariann Davis said,

1. i found that i needed a brief period of silence and preparation before playing each clip. if i tried to go too fast i wasn't prepared for the phrase to go into my brain in addition to my ears. 2. in some cases i was more willing to construct words that i wasn't sure i heard because they made sense; other times i didn't have a clue and didn't try to make any sense.

it was fun! thanks!

Shiny said,

I too really would have liked to be able to compare what I heard/wrote and the correct answer after the test. I also wonder if there are patterns in the incorrect answers – mistakes people are likely to make, what sounds are mistaken for each other, etc.