English is the lingua franca of business, medicine, government, technology, and many other industries; and the need to speak intelligibly is increasing across the world with the expansion of global business. Therefore, the poor intelligibility of non-native speakers is a problem in both the academic world and in the workforce.

In order to train for better intelligibility, we need to be able to quickly and accurately judge that intelligibility. Human judges of intelligibility need extensive training and their judgments are often biased and inconsistent. Technology is stepping in here to provide quicker and more objective ratings of speaker intelligibility. This article introduces a variety of such technologies available today and the areas in which they are particularly critical.

Reduced Intelligibility Can Lead to Fatal Miscommunications

Miscommunication can occur in any human interaction, as medical institutions know to their cost. Anecdotes of such miscommunications are very common, particularly in the airline industry, where the results can be fatal.

Communication in the air is generally carried out in English. Indeed, nothing underscores the subtle complexities of speech communication more strikingly than the miscommunications that occur among pilots, crewmembers, and air traffic controllers. When different words or phrases sound exactly or nearly alike, it can be problematic. Confusion is possible, for example, because “left” can sound very much like “west”.

In a Federal Air Surgeon’s Medical Bulletin article entitled “Thee…Uhhmm…Ah…, ATC-Pilot Communications,” Mike Wayda says, “When you produce these hesitations while speaking, you are using … ‘place holders,’ or ‘filled pauses’, a type of speech dysfluency especially common in pilot-controller exchanges.” Until recently, such speech dysfluencies and other mistakes were not considered to be important; however, new research suggests that there is a correlation between miscommunications and mistakes.

What is Intelligibility?

How do we define intelligibility and how is it measured? Intelligibility refers to the ability of a listener to recognize and understand a word, phrase, or sentence of a non-impaired speaker. Intelligibility is influenced by the social and linguistic context of the speech. If the listener is familiar with the topic under discussion, intelligibility will be higher. In addition, intelligibility is higher if the speaker is in a noise-free background. Finally, intelligibility varies according to how familiar the listener is with the speech pattern of the speaker. (A well-known phenomenon is the miraculous improvement in intelligibility of a non-native speaker over time in the view of his/her teacher, when objective testing shows no real improvement!)

Intelligibility is often measured by the number of phonemes that can be accurately transcribed from listening to recorded speech. It is also often rated on Likert scales, where the listener selects from options ranging from, for example, “totally unintelligible” to “completely intelligible.”
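As a rough illustration of the transcription-based measure, the sketch below scores a listener's phoneme transcription against the intended phonemes using edit distance. The function names, the phoneme symbols, and the choice of edit distance as the scoring rule are assumptions made for this example; they are not taken from any of the tests described in this article.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # phoneme missing from transcription
                            curr[j - 1] + 1,      # extra phoneme in transcription
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_accuracy(ref, hyp):
    """Fraction of intended phonemes recovered by the listener (0.0 to 1.0)."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    errors = edit_distance(ref, hyp)
    return max(0.0, 1.0 - errors / len(ref))


# Hypothetical example: "rice" produced without its final /s/.
intended = ["r", "ay", "s"]
heard = ["r", "ay"]
print(round(phoneme_accuracy(intended, heard), 2))
```

Here one of three intended phonemes is lost, so the score is about 0.67. A Likert rating, by contrast, would be a single listener judgment for the whole utterance.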

What is a “Foreign Accent”?

We are interested in foreign accent to the extent that it reduces intelligibility. (We concentrate only on pronunciation and ignore vocabulary and grammar.) Non-native speakers are often unintelligible because the speech patterns of their first language interfere with their pronunciation of American English. Indian speakers, for example, often substitute /v/ for /w/. Some languages, such as Mandarin Chinese, do not allow obstruents (sounds created by restricting air-flow through the oral cavity) at the end of a word or syllable so the final consonant is omitted – in the word “rice” the final /s/ sound is left off. In some languages the /t/ sound is produced more like a /d/. This can lead to meaning confusions such as English listeners hearing “die” instead of “tie”!

Prosodic effects are also important. Prosody covers a number of systems that affect intelligibility, including intonation and sentence stress or accent, which in English is determined mostly by the speaker’s focus and by whether this is the first mention of an item in the conversation. Unfortunately, there are few simple rules to guide the learner of English; word stress patterns must generally be learned on a word-by-word basis. In addition, speakers of tone languages, such as Chinese and Korean, have difficulty carrying an uninterrupted pitch contour over an utterance and assigning correct sentence stress to the most important word/s in a sentence. To the ears of native speakers, their productions sound “jerky”.

How Did Speech Assessment Evolve?

Human-Scored Testing

Initially, all speech testing relied on the judgments of a human listener, who is, of course, prone to fatigue, bias, and unreliability. This is probably still the most common way to evaluate speaking effectiveness and intelligibility. Speakers are evaluated in reading, responding to prompts, or in free conversation.

The SPEAK Test (www.toeflgoanywhere.org)

The Speaking Proficiency English Assessment Kit (SPEAK) is an oral test developed by the Educational Testing Service (ETS) and perhaps epitomizes the traditional way of evaluating speech. Its aim is to evaluate the examinee’s proficiency in spoken English. ETS also developed the four-skills (listening, reading, speaking, and writing) TOEFL iBT test. The Speaking portion of the test is scored by human listeners and, according to ETS, has undergone extensive statistical and reliability analysis. The Speaking section of the TOEFL is not available separately from the other sections, but institutions wishing to test speaking skills only may choose to use the TOEIC (Test of English for International Communication) Speaking Test, also developed by ETS and available as a stand-alone assessment.

Acoustic Analysis of Speech

Since acoustic analysis methods became readily available in the 1960s, there has been a steady stream of research documenting particular features of standard American English speech in single words and sentences and, more recently, of non-native speech, allowing comparison of the two. These studies have allowed the computer analysis of speech in programs such as the Versant Testing System, Carnegie Speech Assessment, and the Automated Pronunciation Screening Test (APST). These use large-scale statistical studies on native and non-native speech as the basis for assessments. Because of the difficulty of training listeners to achieve reasonable reliability with each other, and the time it takes to score spoken tests, computer-based testing offers the hope of more rapid and reliable intelligibility assessment. The three computer-based tests noted above are described further below.

The Versant Testing System (www.versant.com)

Versant Technology originally developed a telephone-based test in which the speaker repeated items or responded to prompts. This first test primarily evaluated speaker fluency. More recently, Versant has developed a system presented on a computer, described on the website:

“The Versant testing system, based on the patented Ordinate® technology, uses a speech processing system that is specifically designed to analyze speech from native and non-native speakers of the language tested. In addition to recognizing words, the system also locates and evaluates relevant segments, syllables, and phrases in speech. The Versant testing system then uses statistical modeling techniques to assess the spoken performance.”

“Base measures are then derived from the linguistic units (segments, syllables, words), based on statistical models built from the performance of native and non-native speakers. The base measures are combined into four diagnostic sub-scores using advanced statistical modeling techniques. Two of the diagnostic sub-scores are based on the content of what is spoken, and two are based on the manner in which the responses are spoken. An Overall Score is calculated as a weighted combination of the diagnostic sub-scores.”

Carnegie Speech Assessment (www.carnegiespeech.com)

This system “uses speech recognition and pinpointing technology under license from Carnegie Mellon University to assess an individual’s speech. By pinpointing exactly what was correct and incorrect in the speaker’s pronunciation, grammar and fluency, accurate and objective English assessments can be made.” Specific features, as described on the website, include:

Rapid assessment of spoken English by analyzing each student’s speech against a statistical composite voice model of native speakers.

APST uses knowledge-based speech analysis and is based on the careful study and acoustic analysis of the target speech. It is designed to test large groups of non-native speakers quickly, accurately, and objectively. Speakers first practice recording items and then read words and sentences, which are recorded into the computer. These recordings are sent to Phonologics via the web, where they are automatically scored and a report is made available to the test administrator within minutes. The test provides sub-scores on particular aspects of speech and a summary score that indicates the intelligibility of the speaker to American English listeners.

The initial human-scored version of APST was developed to screen the large numbers of non-native speakers at Northeastern University in Boston, MA. The program provided a summary and sub-scores and was used with standard TOEFL scores to determine whether international teaching assistants should be allowed into the lab or classroom or first receive intelligibility training. This first version showed the need for a more objective and quickly scored version of the test. A second automated prototype was developed with funding from NIH. Further development of APST has been under the auspices of Speech Technology and Applied Research Corp.

How Well Do Automated Intelligibility Tests Correspond with Human Judgments?

It is important to test how well automated tests correspond with the judgments of human listeners. To check this, the authors first obtained APST intelligibility rankings for three non-native speakers and one native speaker. They then took the recordings used for the APST analysis and asked five native English listeners to judge the speech. The judges were asked to do two things: rate each speaker on a nine-point intelligibility scale, and place each speaker in the top, middle, or bottom position for intelligibility. On both measures, the human evaluators all rated the speakers consistently with their APST scores. (A full version of this study is available on the Phonologics website.) In this small study, then, APST agreed well with human judges, which supports using the test with confidence.
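One common way to quantify this kind of agreement is a rank correlation between the automated scores and the listeners' mean ratings. The sketch below computes Spearman's rho in plain Python. The numbers are invented for illustration and are not the study's data, and the source does not say which statistic, if any, the authors used.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Invented scores for four speakers (not data from the study).
automated_scores = [92, 61, 48, 75]
mean_listener_ratings = [8.6, 5.2, 4.8, 6.9]  # nine-point scale
print(spearman(automated_scores, mean_listener_ratings))
```

With only four speakers, a perfect rank agreement like the one in this invented example is easy to obtain by chance, so a larger sample would be needed before reading much into the coefficient.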

These new technologies offer the prospect of accurate results that agree with the judgments of human listeners, but without the labor and time commitments, and with the promise of more objective results. This allows us to place speakers in classes or positions more quickly and accurately, and without the bias that unfortunately can often creep into the human-scored process.

The goal of this paper is to describe the current version of an Automated Pronunciation Screening Test (APST) that analyzes digitized speech samples from Non-Native Speakers of English (NNS) who speak with a foreign accent, and provides a norm-referenced intelligibility score. The test, which has been under development for over ten years, determines whether a speaker has reached an acceptable level of intelligibility in standard American English for a given communication setting or whether s/he requires intelligibility training. Other tests generally used for this purpose rely almost exclusively on perceptual rating scales of different dimensions of speech intelligibility. Such rating scales are subjective, time-consuming, and unreliable.

The innovations in APST allow us to establish an objective standard of intelligibility in speakers of accented English. Our knowledge-based technology may also provide the basis for future diagnostic tests and intervention programs. The project uses data initially collected primarily on Chinese- and Spanish-accented speakers, and later augmented with many Ukrainian, Vietnamese, and South Asian speakers. It correlates the judgments of American English listeners listening to conversation with objective acoustic measures of the pronunciation of test items. Newer research tests the current version of the program using a small number of naïve listeners. This research utilizes the statistical findings of a previous NIH-funded study and, in an iterative design, establishes the validity of the current test.

Literature Review

Population of Non-Native Speakers of English

The foreign-born population of the United States exceeded 33 million in 2002, slightly more than the entire population of Canada, according to the U.S. Census Bureau’s latest American Community Survey (ACS). In particular, there is a huge flux of medical professionals worldwide with the largest flow being towards the Anglophone countries, with the US the largest magnet.

These global population flows are projected to increase over the next 20 years, because population ageing and changing technologies are likely to contribute to an increase in the demand for health workers, while workforce ageing will decrease the supply as the baby-boom generation of health workers reaches retirement age, according to the International Migration Outlook: SOPEMI 2007 Edition.

Students studying in US colleges and universities

Foreign enrollment in U.S. universities and colleges increased by 3% in fall 2009 to 586,000, rising 2% for non-science/engineering (S&E) fields (to 327,000) and 4% for science and engineering (to 259,000) (table 1). The increase in S&E enrollment was larger than in recent years, but for the 2006–09 period, S&E students accounted for a steady 44% of total foreign enrollment.

Total Number of Workers and Percentage of Foreign Born in Health-care Occupations, 2005

Occupation | Total number of workers | Share of foreign born (%) | Share of foreign born who arrived since 1990 (%)
Physicians and surgeons | 803,824 | 14.5 | 43.5
Registered nurses | 2,457,701 | 13.2 | 39.5
Health diagnosing and treating practitioners | 1,356,884 | 12.3 | 38.3
Health-care technologists and technicians | 2,286,571 | 10.4 | 43.9
Nursing, psychiatric and home health | 1,998,000 | 19.2 | 52.3
Other health care support | 1,250,000 | 11.6 | 45.2
Total | 10,025,000 | 14.5 | 43.5

US Based – Estimated Number, Foreign Born Workers by Job Title

Occupations | Total Population | Foreign Born
Physicians and Surgeons | 803,824 | 201,000
Nursing | TBD | 31,000
Pharmacy | 269,000 | 57,000
Universities and Colleges | TBD | 20,000
Engineers | 7,990,000 | 41,000
Foreign Born Engineers | 1,949,000 | 1,250,000
Military | 5,000,000 | 68,711
Technical Support & Customer Service | TBD | 500,000
Airlines | 20,000 | 2,000
Total | | 2,170,711

Foreign Accent

As described by Chen (2010, p. 183), “The term ‘foreign accent’ might be characterized as the subjective impression of a native listener or an advanced student of a foreign language. The precise nature of a foreign accent remains mostly unexplored even though it has intrigued an increasing number of second language acquisition researchers.”

We are interested in foreign accent to the extent that it reduces intelligibility. This reduction may be caused by language interference (the substitution of features of one phonology for those of a second), processing effects (the slowing or other effects of listening to speech in real time due to increased processing load), or irritation or prejudice against the perceived speaker group, which may reduce attention to the message. The relative effects of these various factors are just beginning to be determined in the research literature.

A. Segmental effects
NNSs are often unintelligible due to a variety of interference effects between their native languages (L1) and English (L2). These effects depend on the characteristics of the first language and affect multiple aspects of pronunciation of L2: syllable structure, phonemes, subphonemic characteristics, and suprasegmentals. In particular, NNSs often distort L2 by substituting familiar L1 phonemes for those in L2. For example, they omit or insert phonemes at certain points in the syllable in order to produce a structure that conforms to a familiar pattern. They also produce phonemes that occur in both languages, but with acoustic parameters (such as voice onset time (VOT), vowel length, and stress) more appropriate to L1.

B. Interference effects
This effect also influences discrimination of sounds in L2. For example, the identification of vowels in L2 depends partly on the number and type of vowels in L1. The number and type of vowels in L1 also affect the capacity to respond to training (Flege, 1986; Best & Strange, 1992). Similarly, with respect to consonants, Flege and Wang (1989) found that Cantonese Chinese speakers performed better after training on discrimination of word-final /t/ and /d/, as in bat vs. bad, than speakers of the Shanghai dialect of Chinese or Mandarin speakers, because Cantonese, like English, permits unreleased unvoiced stops /p, t, k/ in final position. The Shanghai dialect of Chinese allows only final glottal stops and Mandarin allows no final obstruents at all. The performance of speakers of the Shanghai dialect was, as might be expected, better than that of Mandarin speakers.

Research on speech production (Flege, 1987, 1988) indicates that the phonetic space of bilingual speakers is restructured during L2 learning. This restructuring can negatively affect the production of L2 sounds, because the speaker sometimes classifies them as equivalent to ones in L1, a process referred to by Flege (1991) as identification. Conversely, sounds in L2 that are perceived as “new” or outside those in L1 are more likely to be produced correctly.

An example of a phoneme that can be assimilated to an English sound is Spanish /t/. Spanish /t/ is implemented as an unaspirated stop with short-lag VOT, similar to an English /d/; English /t/ is aspirated and has a long-lag VOT (Flege, 1987, 1995). Most Spanish speakers produce English /t/ with “compromise” VOTs, which are between the values typical for English and Spanish, with the result that native English listeners often hear the /t/ as a /d/. An example of a “new” sound is the French vowel /y/ for English speakers. Flege (1987) found that native English speakers with different degrees of exposure to French produced this vowel with only slightly different F2 frequencies than a group of French monolinguals, suggesting that totally new sounds are less difficult to acquire.
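The idea of a “compromise” VOT can be made concrete by locating a measured value on a line between the L1-typical and L2-typical values. The sketch below is only an illustration: the function names, the tolerance, and the millisecond values in the example are placeholders chosen for this sketch, not measurements from Flege or from APST.

```python
def vot_position(vot_ms, l1_typical_ms, l2_typical_ms):
    """Locate a VOT between the L1-typical (0.0) and L2-typical (1.0) values."""
    if l1_typical_ms == l2_typical_ms:
        raise ValueError("L1 and L2 typical values must differ")
    return (vot_ms - l1_typical_ms) / (l2_typical_ms - l1_typical_ms)


def describe_vot(vot_ms, l1_typical_ms, l2_typical_ms, tolerance=0.25):
    """Label a production as L1-like, L2-like, or a compromise between them.

    `tolerance` is an arbitrary illustrative cut-off, not an empirical boundary.
    """
    position = vot_position(vot_ms, l1_typical_ms, l2_typical_ms)
    if position <= tolerance:
        return "L1-like"
    if position >= 1.0 - tolerance:
        return "L2-like"
    return "compromise"


# Placeholder values in milliseconds (assumed, not measured):
SHORT_LAG_T = 20.0  # stands in for a short-lag /t/, as in Spanish
LONG_LAG_T = 70.0   # stands in for a long-lag /t/, as in English

print(describe_vot(45.0, SHORT_LAG_T, LONG_LAG_T))
```

A production halfway between the two placeholder values is labeled a compromise, which is the region where the text notes that native English listeners often hear /t/ as /d/.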

Phonotactic differences also exert an effect. For example, Chinese dialects differ markedly from English with respect to syllable structure (Cheng, 1987). Chinese generally prefers open syllables, which end in vowels or sonorant consonants such as nasals, whereas English frequently closes syllables with a final consonant. Chinese speakers of English frequently omit word-final consonants, particularly non-sonorants; e.g., producing /experti/ for expertise.

English includes a number of phonemes that rarely occur in other languages but are easily identified with phonemes that are common. For example, the dental fricatives in words like “this” and “three” are often identified with [d] and [t]. Consequently, NNSs with different first languages may all experience difficulty with such sounds. Compton (1983), using groups of 28-40 speakers each, who spoke Spanish, Filipino, Cantonese, Mandarin, and Japanese, found that all had difficulty with these sounds in word-initial and -final positions. Cardosa (2011) found that Portuguese NNSs of English, as well as inserting an epenthetic /i/ vowel after a word-final voiced consonant, also perceived such insertions in a forced-choice phone identification task, suggesting a strong link between perception and production in this regard.

Such interference effects appear to be related to the age at which the speaker was exposed to L2 (Flege, 1990; Flege & Schmidt, 1995; Flege, Munro, & MacKay, 1995) and to the amount of conversational experience they had (Best & Strange, 1992). A more recent study by Huang and Jun (2011) indicates that the age of arrival in an English-speaking country (and hence years of exposure to the language) affects aspects of prosody differently.

In summary, the type and number of interference effects depend on the L1, the speaker’s discrimination of sounds in L2, his/her reclassification of L2 sounds in production according to the grid imposed by the phonology of L1, and the amount and age of exposure to L2. In addition, there is wide individual variation in the ability to discriminate and produce new sounds — an area still largely unexplored. From the perspective of this research, one task is to determine whether a given speaker’s phonemes deviate from standard American English productions enough to be labeled non-standard to a native listener. Such aberrant segmental productions will reduce the speaker’s overall intelligibility, as found by Goldhor (1989), who writes, “Phonetic distortion and speaker intelligibility are intimately connected.” This observation was key to our initial focus on phonetic measures.

C. Prosodic effects
Prosody covers a number of suprasegmental systems that affect intelligibility (Crystal, 1969). These include stress in polysyllabic words (lexical stress), sentence stress or accent, and intonation. English has a complex derivational system which results in alternations such as “SIMple~simPLIcity~simplifiCAtion.” NNSs frequently produce such words with lexical stress assigned to the wrong syllable, a source of confusion to native listeners. Unfortunately, there are few simple rules to guide the learner of English; word stress patterns must generally be learned on a word-by-word basis (Flege & Bohn, 1989). To complicate matters, even among linguists there is no agreement on the number of stress levels that should be used to describe this phenomenon. A generally accepted description appears to be a ternary distinction: 1° stress vs. 2° stress vs. unstressed (Ladefoged, 2001).

The three most important acoustic correlates of stress are changes in pitch or fundamental frequency (F0), syllable duration, and loudness. From an articulatory perspective, there is greater articulatory effort associated with stressed than with unstressed syllables (Kent & Netsell, 1971). In addition, vowel duration is longer in stressed than unstressed position. (At the phonological level, this is related to vowel reduction, the centralization of unstressed vowels.) While these three acoustic factors interact in production (Couper-Kuhlen, 1986), it appears that listeners attend primarily to change in F0, secondly to increased duration, and finally to increased intensity (Lehiste, 1970). This suggests a focus on F0 change as a measurement of stress, with other factors coming into play more in words-within-sentences.

Research interest in prosody as it relates to the perception of foreign accent has increased over the last twenty years. In a study of Korean- and Chinese-accented speech versus that of American English (AE) speakers, Baker, Baese-Gahot, Kim, Van Engen, and Bradlow (2011) found that “AE speech had shorter durations, greater within-speaker word duration variance, greater reduction of function words [such as prepositions and articles], and less between-speaker variance than non-native speech. However, both American English (AE) and non-native speakers showed sensitivity to lexical predictability by reducing second mentions and high-frequency words. Non-native speakers with more “native-like” word durations, greater within-speaker word duration variance, and greater function word reduction were perceived as less accented.” This durational factor contributes to the perception of AE speakers that the speech of Chinese and Korean speakers is “jerky” or lacking in a feature we call “smoothness.”

D. Effects on the listener
Researchers have examined a number of listener variables, including intelligibility, speech naturalness, perception of foreign accent, and irritation. While many of these variables are clearly related, we focused most closely on the issue of intelligibility. It is our position (and that of the American Speech-Language-Hearing Association, 1994) that social and geographic dialects should be accepted as forms of linguistic variation. Intervention is required only to remedy deficiencies in intelligibility. For those individuals with reduced intelligibility, a strong dialect can be a handicapping condition. (Individuals may choose, however, to speak with strong dialects for whatever purpose.) Evaluations of foreign accent will, of course, frequently be associated with different levels of intelligibility and with affective responses, including irritation (Flege, et al. 1995), and social prejudice (Hewstone, 2002).

Intelligibility

Intelligibility refers to the ability of a listener to recognize and understand a word, phrase or sentence of a normal speaker. (In this case, we use “normal” to refer to a speaker who is not hearing impaired and has no impairment of voice, fluency, or articulation.) Intelligibility is impacted by the social and linguistic context of the speech, as well as the clarity of the speech signal in relation to the level of background noise. In this instance we are mostly concerned with the decontextualized speech that a listener might hear on the telephone. Intelligibility can be defined as what is understood by the listeners of the phonetic realization of speech (Yorkston, Strand, & Kennedy, 1996). It is often measured by the number of phonemes that can be accurately transcribed from listening to recorded speech. It is often also rated on Likert scales.

Next to the ratings of the four dimensions, the judges indicated the most dominant dimension affecting intelligibility for each patient. The judgments of overall intelligibility and each of the dimensions were compared. Again, articulation and prosody show the strongest correlation with intelligibility, with nasality the lowest.

While voice quality and nasality are frequently disordered in dysarthric speech, as opposed to the speech of non-dysarthric NNSs, it is still perhaps significant that articulation and prosody were the major contributors to reduced intelligibility. This research also suggests that a complete evaluation of speech intelligibility should include an evaluation of prosody. It also suggests that intelligibility can be captured by a combination of a number of ratings of dimensions that make up intelligibility.

Intelligibility is also related to the familiarity of the listener with the speech pattern of the speaker (Kennedy & Trofimovich, 2008). (A well-known phenomenon is the miraculous improvement in intelligibility of a non-native speaker over time in the view of his/her teacher, when objective testing shows no real improvement!) We therefore select listeners, in our intelligibility testing, who are dialectally naïve in the sense of having limited exposure to accented speech. (Given the large number of non-native speakers in the USA, this is a theoretical ideal to which we approximate.)

A further important issue is that meaning, once decoded from an utterance, is then retrievable from the listener’s memory for a considerable period of time. It is this initial ability to abstract meaning that we are interested in, rather than the ability to subsequently recall the utterance itself. Hence, each subject in the native-speaker listener group of our research can be employed only once to listen to each speaker utterance.

While our concern is with intelligibility, rather than directly with level of accent, or accentedness, we must assume they are related, both objectively in terms of the number and type of errors the speaker makes and subjectively in terms of the ease, speed, and accuracy with which a native listener decodes a word or utterance. It awaits a larger study than ours to determine the relation between the two.

Comprehensibility

A related variable is comprehensibility. We take it that intelligibility and comprehensibility both describe the ability of the listener to extract meaning from or understand speech but differ in scope and focus. Comprehensibility is usually used to refer to understanding of larger units of speech such as discourse; intelligibility is used to refer to smaller units — the sound, sentence, or word. Comprehensibility is generally used to refer to the perception that speech is understandable rather than an actual measurement of the amount that is understood. It also takes into account the semantic and linguistic context. Sentences and discourse tend to be evaluated for vocabulary, morphosyntactic correctness, and fluency, as well as for the intelligibility of single words. Words and sentences are generally judged for intelligibility in relation to their phonological or phonetic correctness at the levels of segments and prosodic features.

Since most real-life intelligibility “tests” occur in spontaneous-speech situations (lectures, conversations, air-traffic dialogues), APST’s results must correlate well with larger measures of intelligibility. A screening test like ours must be easy and quick to administer and take, so single words are sometimes appropriate test items. APST’s intelligibility score must also correlate with measures of conversational and sentential intelligibility.

Understanding is subjective and depends on a variety of contextual variables, including the speaker’s level of language proficiency. It can be measured and defined in a variety of ways. Measurements of comprehensibility have largely used rating scales and error hierarchies. Gynan (1985), in a study of Spanish-speaking English learners, found that morphosyntactic errors such as pluralization endings contributed less to comprehensibility than phonological factors, at least for more proficient speakers. Elsewhere, Gass and Veronis (1983) investigated the contribution to comprehensibility of familiarity with topic, with nonnative speech in general, with a particular accent, and with the accent of a particular speaker. Topic familiarity was most significantly related to comprehensibility, but all had an effect. Clearly, in situations such as classrooms or clinics, where listeners will not necessarily be familiar with the topic at hand, accent issues will be more prominent.

In another study that employed rating scales of a variety of features, including grammar, pronunciation, intonation, word-choice, and intelligibility, Fayer and Krasinski (1987) found that pronunciation and hesitations were the most distracting features to both non-native and native listeners. Anderson, Johnson, and Koehler (1992) used the SPEAK test to investigate the relationship between raters’ judgments of pronunciation and deviance in segmentals, prosody, and syllable structure. They found that, while all variables were significantly related to pronunciation ratings, prosody had the greatest effect.

In an important study, Munro and Derwing (1995) used a sentence-verification task to determine the effects of a foreign accent on sentence processing time in native speakers. Response latency times were longer when American English listeners were required to evaluate Mandarin-accented utterances than when they evaluated those produced by native English speakers. In a later study (2006), these authors examined the impact of the functional load (FL) of segmental errors on perceptions of accentedness and comprehensibility. Thirteen native English listeners judged 23 Cantonese-accented sentences that exhibited various combinations of high and low FL errors. The high FL errors had relatively large effects on both perceptual scales, while the low FL errors had only a minimal impact on comprehensibility. The only cumulative effects of errors seen in the data occurred with high FL errors in the judgments of accentedness. While this research is helpful in suggesting that high-FL phonemes be included in our test, the study did not address intelligibility directly.

It appears, then, that phonological factors contribute to comprehension and distraction on the part of listeners, that prosody is an important component of these in addition to the functional load of the content phonemes, and that these factors impose a processing burden on the listener. More research is needed, however, to determine the relative weightings given to segments versus prosodic factors and to processing factors.

Potentially Fatal Miscommunications

Miscommunication can occur in any human interaction, as medical institutions know to their cost. There is regrettably little systematic data on miscommunications that occur in medical settings, such as during surgical procedures, but the results can be clearly of great consequence. Anecdotes of such miscommunications are, however, legion.

Likewise, according to the recent Federal Air Surgeon’s Medical Bulletin, entitled, Thee…Uhhmm…Ah…, ATC-Pilot Communications by Mike Wayda, “When you produce these hesitations while speaking, you are using … ‘place holders,’ or ‘filled pauses’, a type of speech dysfluency especially common in pilot-controller exchanges. Until recently, such speech dysfluencies and other mistakes were not considered to be important; however, new research suggests that there is a correlation between miscommunications and mistakes, says CAMI scientist Dr. Veronika Prinzo-Roberts”.

Indeed, nothing underscores the subtle complexities of speech communication more strikingly than the miscommunications that occur among pilots, crew members, and air traffic controllers. Problems stemming from mistaking one word for another happen even to native speakers; the difficulties are only compounded when accented speech is involved.

There are many relevant examples. Homophony and, more commonly, near-homophony, in which different words or phrases sound exactly or nearly alike, can be just as problematic as prosody. Phonological confusion is possible, for example, because “left” can sound very much like “west”. An outbound pilot who was told to receive his clearance from the Air Traffic Control Center when he was “on the deck” misheard this as “off the deck” and proceeded with his takeoff, consequently finding himself head-on with an inbound aircraft. One wide-body airplane barely missed colliding with another after landing, because the pilot heard “Hold short” as “Oh sure” in response to his asking the controller “May we cross” in reference to a runway. The words “to” and “two” (a stress issue) are especially problematic. Confusions between the differently stressed “two” and “to” led to a fatal accident in another incident when a pilot read back the instruction “Descend two four zero zero” as “O.K. Four zero zero” and then proceeded, without correction from the controller, to descend to four hundred, rather than twenty-four hundred, feet.

Existing Knowledge of Acoustic Features of Standard American English Speech

Since acoustic analysis methods became readily available in the 1960s, there has been a steady stream of research documenting features of standard American English speech in single words and sentences. These studies have examined phonetic features and their effects on perception with subjects reading the stimuli, as in (Clear) SpeechWorks. The alveolar stops /t,d/, for example, which are among the most frequently used and which are produced with different lag times in different languages, have been extensively researched in all word positions and many languages (Lisker & Abramson, 1964; Fischer-Jorgensen, 1954; Klatt, 1975; Zue, 1976; Sharf, 1962; Umeda, 1977; Zue & Laferriere, 1979).

Research on the acoustics of American English was further encouraged by the practical goal of developing speaker-independent automatic speech-recognition systems. The goal of achieving isolated word recognition provided the motivation for exact measurement of the statistical properties, and constraints on the phonemic properties, of single words. To give just a few examples of specific measurements from Phase I: the average duration of a stressed vowel is about 130 msec; the average duration for unstressed vowels, including schwa, is about 70 msec; and the average duration for a consonant is 70 msec (Klatt, 1976). An acoustic correlate of emphatic or contrastive stress is an increase in the duration of a word by 10% to 20% (Coker, Umeda, & Browman, 1973).

Some researchers have used template-building techniques (see, for example, Blumstein (1986) and references contained therein). Labial consonants, for instance, are characterized by either a flat distribution of spectral energy between the release burst and onset of voicing, or by sustained spectral energy at low frequencies (about 1500 Hz). By contrast, dental and alveolar consonants evince greater concentration of energy in high frequencies (about 3500 Hz) at release. These researchers were dealing with the problem of variability in speech production across different speakers and with the careful measurement of contextual effects (Oshika, et al., 1975). However, speech recognition has the added problem of not knowing the target word the speaker intends. APST does not need to deal with this complex issue, since the words tested are from a short list and known in advance.

More recently, Chen (2010), in an extension of Chen and Chung (2008), investigated the difficulties encountered by Taiwanese learners in English speech timing patterns and identified critical variables that affected native listeners’ perceptions of foreign accents. Thirty Taiwanese learners and 10 native speakers of American English read an article. Six variables (syllable duration, vowel reduction, pause duration, linking, consonant cluster simplification, and speech rate) were acoustically analyzed in five sentences. In addition, ten English listeners rated these speech samples for degree of foreign accent. A regression analysis was used to analyze the relation among these variables.

On all six acoustic variables, the speech patterns of the Taiwanese learners differed markedly from those of native English speakers. The perceptual ratings of the six individual variables showed a very strong positive correlation with the overall ratings, suggesting that timing patterns were perceived as a holistic impression rather than as discrete components. Speech rate was the primary predictor of native listeners’ perception of foreign accent. If this overall fluency variable was excluded, then vowel reduction and linking duration became the two most heavily weighted variables. The author proposed a temporal perception model to account for the effects of timing variables on native English listeners’ judgment of foreign accents. While this study did not examine intelligibility, it does highlight the need to incorporate measures of prosody into an analysis of foreign-accented speech.

Kang (2010) studied the relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness. The study examined international teaching assistant (ITA) classroom speech for a variety of suprasegmental features as well as native speakers’ judgments of that speech. In particular, speech samples were acoustically analyzed for measures of speech rate, pauses, stress, and pitch range. Fifty-eight U.S. undergraduate students evaluated the ITAs’ oral performance and commented on their ratings. The results showed that suprasegmental features independently contributed to listeners’ perceptual judgments. Accent ratings were best predicted by pitch range and word stress measures, whereas comprehensibility scores were mostly associated with speaking rates. These last two studies highlight the significance of speech rate in the perception and understanding of NNS speech.

History of APST

The initial version of the test was a Macintosh system developed to screen the large numbers of non-native speakers who were to be employed as international teaching assistants or lab assistants at Northeastern University in Boston, MA. NNSs read the stimuli and their versions were recorded onto the computer using a high quality microphone, fitted appropriately and adjusted to the speaker’s volume. This first test version was scored by trained graduate students, who, having achieved a reasonable degree of inter-scorer reliability (95% or better), listened to the speech samples and entered their scores into the computer program. The program then provided a summary score. This was used with standard TOEFL scores to determine whether the NNS should be allowed into the lab or classroom or whether s/he should receive accent modification training. This first version showed the need for a more objective and quickly scored version of the test.

A second automatic prototype was developed with funding from an NIH grant. In this Phase I study, we developed statistical speaker models of NNSs and listener models of American English (AE) listeners. Major conclusions from Phase I that underlie this current research are:

Segmental errors of NNS speech in isolated words are major predictors of reduced intelligibility. See Figure 1 for an example of such errors, for the word pulled.

Lexical stress does not significantly independently contribute to intelligibility ratings.

Word intelligibility correlates with intelligibility of longer units of speech, particularly with an intelligibility rating carried out by listeners on American English speech.

A test using only selected objective acoustic measures can distinguish highly intelligible native speech from less intelligible NNS speech.

Figure 1. Example of an isolated-word formant rule for one phoneme.
(Formants are characteristic frequencies for vowels and liquids; each phoneme has several of them. They arise from the differing positions of the tongue, lips, and jaw when producing these phonemes.) In this case, the line shows the division between pairs of measured formant values in examples that were judged correct by a phonetician (filled symbols) and those in productions judged incorrect. The values plotted are the formant pairs F1 (lowest frequency) and F2 (second lowest) for the /l/ of pulled. Correct productions correspond to a 450-Hz or smaller difference between F1 and F2, shown by the dividing line.
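The rule described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration of the stated 450-Hz threshold, not APST's actual implementation; the function name and the sample formant values are hypothetical, and formant extraction itself is assumed to have happened elsewhere.

```python
# Threshold taken from the Figure 1 caption: F2 - F1 of 450 Hz or less is "correct".
F2_F1_THRESHOLD_HZ = 450.0


def l_formant_rule(f1_hz: float, f2_hz: float,
                   threshold_hz: float = F2_F1_THRESHOLD_HZ) -> bool:
    """Return True when the F1/F2 separation falls on the 'correct' side of the line.

    Hypothetical helper: f1_hz and f2_hz are assumed to be formant
    measurements (in Hz) taken from the /l/ of "pulled".
    """
    return (f2_hz - f1_hz) <= threshold_hz


# Made-up formant pairs, for illustration only (not measurements from the study).
close_formants = l_formant_rule(400.0, 800.0)    # separation of 400 Hz
spread_formants = l_formant_rule(350.0, 1100.0)  # separation of 750 Hz
print(close_formants, spread_formants)
```

A single crisp threshold like this is the simplest possible form of such a rule; a fuzzy-logic system would presumably grade productions near the boundary rather than splitting them sharply.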

Figure 2. Empirical performance of an early version of APST. (Left) Predicted intelligibility on a 100-point scale, using only automatically detected segment-error rules, as in Figure 1. (Right) Predicted intelligibility of full APST, which combines segmental and other components of speech. As shown on the right, APST compares very favorably with listener evaluations of the same speakers. (Note: Listeners evaluated native speakers in this data set as having scores of 88 and above, NNSs as 79 and below.)

Further development of the APST system has been under the auspices of Speech Technology and Applied Research Corp. (S.T.A.R.). It is this newest version of the system that we test in this most recent work.

Description of Current System

Technologically, APST is built upon knowledge-based speech analysis (KBSA), physics and acoustics, and the physiology of human speech production, modification, and enhancement. The phrase knowledge-based refers to a system that is based on the careful study and analysis of the target; in this case, speech. Two key components of such a system include a knowledge base (derived from the expertise of a human “domain expert”) and inference mechanisms (a decision or classification engine).

In this research, Prof. Ferrier-Reid served as the domain expert. Her expertise was used to determine which items to test for (e.g., a word like river) as well as key features within those items that might have special significance (the first syllable, or letter combination, ri-). APST’s decision-classification system is a fuzzy logic system built on low-level acoustical measurements. APST uses KBSA to determine the phonetic (single-consonant or single-vowel), prosodic (phrase), and fluency (smoothness) components of intelligibility.

Research Design

Our own research and a recent review of the literature lead us to ask the following research question:

Will the current version of APST and the judgments of a small group of naïve Native English Listeners agree on whether preselected non‐native speakers are most intelligible, least intelligible, or in the middle of the intelligibility range?

Particularly:

Will APST results agree with 1) intelligibility ratings and 2) position rankings of a small group of naïve American English listeners who listen to digitized recordings of sentences produced by NNSs?

In this iterative design, we rely upon the findings of APST Phase I, showing that

Sentence Intelligibility (i.e., the number of words in sentences heard correctly) correlated highly with rating scales of intelligibility (nine‐point anchored scale) in sentences by naïve listeners. (This finding gives us confidence in the use of Likert scales to measure intelligibility.)

Phonetic intelligibility (i.e., the percentage of phonemes correct in single words assessed by a phonetician) also correlated highly with rating scales of intelligibility, supplying additional confidence in intelligibility rating scales.

Listeners provided rankings of NNSs that can be used to sort speakers into most intelligible, least intelligible, and mid‐range. These speaker recordings will be used again in the current study.
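The sentence-intelligibility measure mentioned above (the number of words in a sentence heard correctly) can be illustrated with a short Python sketch. The scoring procedure used in Phase I is not described here, so the in-order word alignment below is an assumption, as are the function names and the example sentence, which extends the “Hold short”/“Oh sure” mishearing with invented words.

```python
from difflib import SequenceMatcher


def words_correct(target: str, transcription: str) -> tuple[int, int]:
    """Count target words that the listener transcribed correctly, in order.

    Assumed scoring scheme: words are lower-cased, split on whitespace,
    and aligned with difflib so insertions or omissions do not shift
    every later word out of place.
    """
    target_words = target.lower().split()
    heard_words = transcription.lower().split()
    matcher = SequenceMatcher(a=target_words, b=heard_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched, len(target_words)


def percent_words_correct(target: str, transcription: str) -> float:
    """Express the word count as a percentage of the target sentence."""
    matched, total = words_correct(target, transcription)
    return 100.0 * matched / total if total else 0.0


# Hypothetical example sentence and transcription, for illustration only.
score = percent_words_correct("hold short of the runway", "oh sure of the runway")
print(score)
```

Phoneme-level scoring could follow the same pattern with phoneme symbols in place of words, although a trained phonetician's judgment involves more than string matching.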

Method

Five native speakers of American English, consisting of four untrained listeners and one trained pronunciation coach, listened to four sets of recordings in order to evaluate (using the nine‐point Likert scale) and rank (top/middle/bottom) their overall intelligibility. The recordings consisted of one Spanish male (SMJPA5) with a mid‐range intelligibility score from the APST, two Chinese females, one with a very high score (CFJLO5) and the other with a very low score (CFXLO9), and one male native speaker with a very high APST score (EMSGA21). The APST ranking is thus EMSGA21, CFJLO5, SMJPA5, CFXLO9.

Results and Analysis

On both measures, the evaluators all rated the recording subjects in a fashion consistent with their APST scores.

Note: To calculate Mean and Median in the graphs below, T/M/B was converted to 0/1/2.
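As a minimal sketch of the conversion described in this note, the following Python maps T/M/B rankings to 0/1/2 and computes their mean and median. The listener rankings shown are invented for illustration and are not data from the study.

```python
from statistics import mean, median

# Mapping stated in the note: Top/Middle/Bottom become 0/1/2.
RANK_TO_SCORE = {"T": 0, "M": 1, "B": 2}


def summarize_ranks(ranks):
    """Convert a list of T/M/B rankings to numbers and return (mean, median)."""
    scores = [RANK_TO_SCORE[rank] for rank in ranks]
    return mean(scores), median(scores)


# Hypothetical rankings given to one speaker by five listeners.
example_ranks = ["T", "T", "M", "T", "M"]
average, middle = summarize_ranks(example_ranks)
print(average, middle)
```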

Using both the Likert scale and the T/M/B ranking system, we tested for:

H0=Random Rating Selection, independent across listeners and speakers

Using this null hypothesis we were able to determine the likelihood that listeners would score subjects in a fashion consistent with the APST by chance. Specifically, it is straightforward to compute the probability that, of two speakers who are adjacent in APST scores, a listener would randomly assign increasing (4/9 chance), equal (1/9), or decreasing (4/9) Likert values on the 9‐point scale. (Such a probability is computed from the binomial distribution. Note that there are three adjacent pairs: EMSGA21/CFJLO5, CFJLO5/SMJPA5, and SMJPA5/CFXLO9.)
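The chance probabilities quoted above can be checked by brute-force enumeration. The sketch below assumes that, under the null hypothesis, each rating is drawn uniformly and independently from the nine scale points. It reproduces the 4/9, 1/9, 4/9 split for a pair of speakers, and then, as a separate illustrative calculation that is not necessarily the authors' binomial computation, the chance that one listener's four random ratings fall in strict APST order.

```python
from fractions import Fraction
from itertools import product

SCALE = range(1, 10)  # the nine-point Likert scale


def pair_outcome_probabilities():
    """Chance that the first of two random ratings is higher than, equal to, or lower than the second."""
    counts = {"higher": 0, "equal": 0, "lower": 0}
    for first, second in product(SCALE, repeat=2):
        if first > second:
            counts["higher"] += 1
        elif first == second:
            counts["equal"] += 1
        else:
            counts["lower"] += 1
    total = len(SCALE) ** 2
    return {name: Fraction(count, total) for name, count in counts.items()}


def prob_strict_order(n_speakers=4):
    """Chance that random ratings for n speakers (listed in APST order) strictly decrease."""
    hits = sum(
        all(a > b for a, b in zip(ratings, ratings[1:]))
        for ratings in product(SCALE, repeat=n_speakers)
    )
    return Fraction(hits, len(SCALE) ** n_speakers)


pair_probs = pair_outcome_probabilities()
one_listener = prob_strict_order(4)
all_five_listeners = one_listener ** 5  # assumes listeners are independent

print(pair_probs)
print(one_listener, float(all_five_listeners))
```

Enumerating whole four-rating sequences avoids treating the three adjacent pairs as independent events, which they are not, since neighboring pairs share a speaker.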

For the T/M/B ranking system we were able to reject the null hypothesis with p<0.003. For the Likert scale we were able to reject the null hypothesis with p<0.001. Our findings are all the more remarkable due to their exact correspondence on the Likert scale with the speaker ratings based upon their APST scores: Every listener scored every speaker in the same order as APST. Similarly, the T/M/B ranking system also corresponds as closely as possible to the APST scores. Of course, with four speakers and only three ranks, every listener is forced to assign identical ranks to at least one pair of speakers; but even then, the identically ranked speakers were adjacent in APST rank in every case.

References

American Speech-Language and Hearing Association position paper on social dialects. (1989).