
Speech Synthesis: Present and Future

Christian Benoît
ICP, Grenoble, France

In: European Studies in Phonetics & Speech Communication, G. Bloothooft et al. (Eds), OTS Publications, Utrecht.

Speech synthesis: a pluridisciplinary and multilingual challenge

Speech synthesis is undoubtedly a technological challenge with many potential applications in human-machine communication. More basically, it is a crossroads where researchers with many different backgrounds collaborate to pool their knowledge in computational linguistics, phonetics, prosody, physiology, vocal tract modelling, signal processing, image synthesis, experimental psychology, etc. It is thus a wide area of inquiry which appeals to all those who are attracted by pluridisciplinarity and, first of all, by the most impressive property of humans: how do we speak?

Synthesizing speech has been a human dream forever, long before silicon was mastered by electronics engineers. At the turn of our century, digital signal processing allows synthetic speech to be widely used. It already seems quite normal to us to have several machines speak to us every day. Speaking toys are now widespread. However, we are still far from the ultimate reading machine, as fluent, intelligible, and natural as a human speaker. A great deal of research remains to be done before progress is made towards this goal. Moreover, if a few synthesisers have now reached a satisfactory quality for most of the applications they can be used in, this only holds for English, French, and a few other European languages.

The objective of speech synthesis is clearly to make machines sound as human as possible. Two scenarios can thus be foreseen for our grandchildren and our text-to-speech synthesisers in the next century, in a huge open world: either they both speak like today's English robots when they communicate across linguistic boundaries, or they are both fluent multilingual speakers. It is our role as speech scientists to avoid the former scenario and to promote the latter!

Why synthesise speech?

Like the desire to fly, it has always been a human dream in many civilisations to give inanimate objects a voice. Because speech is specific to humans, and because having the ability to speak is considered as having a soul, many mystics throughout history have pretended to give speech to machines. In ancient Greece, some priests addressed their audience through a statue's mouth so that the congregation would believe this was God's voice. A few centuries ago, geniuses in mechanics and acoustics, such as Vaucanson or Von Kempelen, built machines equipped with artificial lungs and a mouth made of mechanical articulators. When expertly played, those instruments could produce human-like sounds. The sentences so generated were almost intelligible, as long as the listeners were first aware of the linguistic content.

In the same vein, ventriloquists animate their puppet's mouth while producing invisible vocal tract gestures, so that a young audience believes the puppet is alive and able to speak. Even today, many synthetic voices are mistaken for human ones: speaking clocks over the telephone, departure announcements in railway stations, etc.

The three faces of Speech Synthesis

Etymologically, the Greek word synthesis refers to the building up of separate elements into a connected whole. It is thus "the artificial production of compounds from their constituents" (from the Concise Oxford Dictionary). There is no fixed definition of speech synthesis among speech scientists. The term has been given different meanings in the past, mostly depending on the capacities of technology and on what the "constituents" of speech were taken to be. However, we can distinguish three categories of "speech synthesisers".

A dozen years ago, the first chips that allowed real-time generation of an acoustic waveform were given the name of "speech synthesizers". A digital-to-analog converter (DAC) builds up a continuous signal from the concatenation of several thousand bytes per second. An LPC decoder, or a formant "synthesizer", builds up a continuous signal from the concatenation of several hundred spectrum coefficients or frequency parameters per second. In both cases, no speech has ever been "synthesized" at all: only pre-recorded messages can be spoken by such systems.

The second category of speech synthesis involves systems that concatenate pre-recorded sentences or words so that a new sentence can be generated even though it has never been uttered entirely before. Such systems require more or less sophisticated linguistic rules to work properly. Only a limited (although sometimes huge) number of new sentences can be generated in this way.

The last category involves text-to-speech synthesis, that is, the generation of speech from unlimited text within a given language. It is today the kind of "speech synthesis" one primarily refers to within the scientific community.

a) Pre-stored digital messages

Old myths live on: ten to twenty years ago there was a strong belief that all the machines we use in our everyday life would benefit from being able to speak. A wide market was expected for speaking coffee-machines, speaking vacuum-cleaners and the like. Some cars still have the capacity to tell you orally that your back door is unlocked. Experience shows, however, that visible displays are much less tiresome than a repetitive, robot-like voice. Nowadays, the use of digital voices to produce a small set of sentences or words is mostly limited to the entertainment industry (speaking dolls, games, etc.) or to learning purposes (multimedia dictionaries, etc.).

b) Synthesis from concepts

It is possible to generate a large number of new sentences from a relatively limited set of pre-stored speech segments. For instance, a speaking clock can be created with fewer than a hundred pre-recorded numbers, or with a few dozen word segments, as sketched below.
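To make the concatenation idea concrete, the sketch below assembles a time announcement by joining a handful of pre-recorded waveform segments. The file names, the segment inventory, and the carrier phrase are hypothetical illustrations, not those of any actual speaking clock; only the mechanism of selecting and concatenating stored units is the point.

```python
# Minimal sketch of "synthesis from concepts": a speaking clock built from
# a small inventory of pre-recorded segments. File names and the phrase
# structure are invented placeholders for illustration.

import wave

def segment_files_for_time(hour: int, minute: int) -> list[str]:
    # Hypothetical inventory: one recording per segment,
    # e.g. "it_is.wav", "hour_14.wav", "hours.wav", "minute_05.wav".
    return [
        "it_is.wav",                 # carrier phrase: "It is ..."
        f"hour_{hour:02d}.wav",      # e.g. "fourteen"
        "hours.wav",                 # "hours"
        f"minute_{minute:02d}.wav",  # e.g. "oh five"
    ]

def concatenate_wavs(paths: list[str], out_path: str) -> None:
    """Concatenate mono WAV files that share the same sample rate."""
    frames, params = [], None
    for p in paths:
        with wave.open(p, "rb") as w:
            if params is None:
                params = w.getparams()
            frames.append(w.readframes(w.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for f in frames:
            out.writeframes(f)

if __name__ == "__main__":
    concatenate_wavs(segment_files_for_time(14, 5), "announcement.wav")
```

A deployed system would go well beyond this: as the text stresses, the units must be recorded in contexts that match their intended use, since duration, intensity and melodic structure all depend on that context.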

An answering machine that reads aloud the balance of any bank account requires no more than a hundred speech units to work properly. However, those basic units have to be recorded, labelled, and edited with great care, and only experts can define appropriate corpora, since the context in which the units are recorded strongly affects their duration, intensity and melodic structure. This technique is often referred to as "synthesis from concept", in the sense that adequate rules may turn a simple concept into a useful spoken message. There are many existing applications on the telephone network, such as "You are trying to call Mr X. He has moved to Y. From where you are calling, please dial now Z". All possible user names and places must be recorded beforehand so that all possible concepts can be synthesized. Otherwise, the list of spoken words must be updated with extra recordings. Text-to-speech synthesis can also be used to generate those few non-existing words.

c) Text-to-Speech Synthesis

This is the most sophisticated part of the game. It stands for the conversion of any text into speech in a given language. There are several levels in the whole process.

Generally, a first step, called text pre-processing, transforms numbers, abbreviations, acronyms, etc. into orthographic segments. Capital letters, hyphenations, etc., are also processed at this level. Then, text analysis aims at determining the grammatical characteristics of the words and the syntactic structure of the sentence, which are necessary both for the phonemic conversion (disambiguation of heterophone homographs, liaisons, lexical stress, etc.) and for the prosodic modeling.

The grapheme-to-phoneme conversion transforms the text into a string of phonemes. Today, most converters make wide use of dictionaries with orthographic inputs and phonetic outputs. In most cases, it is somewhat difficult to separate text analysis from phonemic conversion, since large word dictionaries may simultaneously give the phonemic conversion, the grammatical category, the stress position, and so forth.

Prosodic modeling assigns durations and melodic contours to the sentence and to its constituent syllables. The prosodic patterns applied depend on observations made in a language on the one hand, and on a symbolic annotation of the syntactic structure to be synthesized on the other hand. The duration and the melodic contours of a segment can be calculated by rules. They can also be duplicated from values measured on similar structures, serving as references, which have been previously stored in prosodic lexicons.

Finally, sound generation is the process that transforms a string of symbols into an acoustic output. Rule-based synthesis assigns target values to the formants of each phonetic unit and then smoothes the transitions in between, depending on unit durations. A widely used alternative makes use of coded speech segments pre-stored in dictionaries. Elementary speech segments can be syllables, diphones, polyphones, etc., or a mixture thereof. Those units can be simply digitized, or coded in terms of LPC coefficients, formant frequencies, articulatory parameters, etc. If the elementary units are simply digitised, signal processing techniques (e.g., PSOLA) are first applied to the speech units so that fundamental frequency, duration, and intensity match those given by the prosodic model. If the dictionary is made of coded speech segments, the prosodic transformation is applied to the parameters before they are concatenated, and then decoded into a speech signal.
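The front end of this pipeline can be made concrete with a toy sketch. The dictionary entries, the letter-to-sound fallback, and the flat declarative prosody rule below are invented placeholders rather than the modules of any particular synthesiser; they only mirror the ordering of the steps described above (text pre-processing, dictionary-based grapheme-to-phoneme conversion with a fallback, and assignment of duration and pitch targets for the sound-generation stage).

```python
# Toy text-to-speech front end: text pre-processing, dictionary-based
# grapheme-to-phoneme conversion with a naive letter-to-sound fallback,
# and a crude prosody model assigning duration and F0 targets.
# All entries and rules are invented placeholders for illustration.

import re
from dataclasses import dataclass

LEXICON = {            # hypothetical orthography -> phoneme strings
    "twenty": "t w eh n t iy",
    "three": "th r iy",
    "euros": "y uh r ow z",
}
LETTER_TO_SOUND = {"a": "ae", "b": "b", "c": "k", "e": "eh", "o": "ow",
                   "r": "r", "s": "s", "t": "t", "u": "ah", "y": "iy"}
NUMBERS = {"23": "twenty three"}   # tiny stand-in for number expansion

@dataclass
class Target:
    phoneme: str
    duration_ms: int
    f0_hz: float

def preprocess(text: str) -> list[str]:
    """Text pre-processing: expand digits and lower-case the input."""
    words = re.findall(r"[A-Za-z]+|\d+", text)
    expanded = []
    for w in words:
        expanded.extend(NUMBERS.get(w, w).lower().split())
    return expanded

def to_phonemes(word: str) -> list[str]:
    """Dictionary lookup first, naive letter-to-sound rules as fallback."""
    if word in LEXICON:
        return LEXICON[word].split()
    return [LETTER_TO_SOUND.get(ch, ch) for ch in word]

def add_prosody(phonemes: list[str]) -> list[Target]:
    """Flat declarative contour: F0 falls linearly, final lengthening."""
    n = len(phonemes)
    targets = []
    for i, ph in enumerate(phonemes):
        dur = 80 if i < n - 1 else 140          # lengthen the final phoneme
        f0 = 120.0 - 30.0 * i / max(n - 1, 1)   # 120 Hz falling to 90 Hz
        targets.append(Target(ph, dur, f0))
    return targets

if __name__ == "__main__":
    words = preprocess("23 euros")
    phonemes = [p for w in words for p in to_phonemes(w)]
    for t in add_prosody(phonemes):
        print(t)
```

A real synthesiser would replace every one of these stubs: the lexicon covers tens of thousands of entries, the fallback is a carefully written or trained rule set, and the prosody module draws on the rules or prosodic lexicons mentioned above before the targets are handed to rule-based formant synthesis or to concatenation of stored units.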
Finally, it is important to emphasize that technical expertise alone is not sufficient to improve the quality of a speech synthesizer. As pointed out by Louis Pols in his contribution, the performance of every module of a system must be evaluated by human listeners through a battery of subjective tests. This is where experimental psychology is most useful: psycholinguists are needed for speech synthesis to make progress. In turn, speech synthesis is a magnificent tool for psychological inquiry, since there is nothing better than highly controlled synthetic stimuli to investigate how human listeners perceive speech.
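As a very small illustration of such subjective testing, the snippet below tabulates listener ratings per system into a mean opinion score. The rating scale, the listener identifiers, and the condition names are invented; real evaluation campaigns of the kind discussed in Louis Pols' companion chapter use far richer designs (intelligibility and comprehension tests, controlled materials, statistical analysis).

```python
# Minimal mean-opinion-score tabulation for a listening test.
# Listener ratings (1 = bad ... 5 = excellent) and system names are
# invented placeholders; real test designs are far more elaborate.

from collections import defaultdict
from statistics import mean, stdev

# (listener, system/condition, rating on a 5-point scale)
ratings = [
    ("L01", "diphone_psola", 4), ("L01", "formant_rules", 3),
    ("L02", "diphone_psola", 5), ("L02", "formant_rules", 2),
    ("L03", "diphone_psola", 4), ("L03", "formant_rules", 3),
]

by_system = defaultdict(list)
for _, system, score in ratings:
    by_system[system].append(score)

for system, scores in sorted(by_system.items()):
    spread = stdev(scores) if len(scores) > 1 else 0.0
    print(f"{system}: MOS = {mean(scores):.2f} "
          f"(sd = {spread:.2f}, n = {len(scores)})")
```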

The speech synthesis community today

As pointed out above, speech synthesis is at a scientific crossroads. Text-to-speech synthesis is by nature a challenging meeting place for related areas, ranging from linguistic and phonological modelling to signal and symbolic processing, and from acoustic and articulatory phonetics to psycholinguistics. Speech synthesis has long been but a small part of any general congress on speech communication. However, things are changing quickly. In 1990, the First International Workshop on Speech Synthesis gathered more than a hundred researchers from all over the world in Autrans (France), under the aegis of the European Speech Communication Association. In 1992, the book Talking Machines (Bailly and Benoît, 1992) was published, with forty contributions entirely devoted to speech synthesis. In 1994, the Second International Workshop on Speech Synthesis was organized in New Paltz (NY, USA) with the same success as the first edition. A second book, Progress in Speech Synthesis (van Santen et al., 1995), is currently in press. Speech synthesis is thus an increasingly active field of research. Its pluridisciplinary and multilingual aspects make it especially attractive to all those who are motivated by both basic and applied research.

The future of Speech Synthesis

What are the trends towards higher-quality speech synthesis? Rather than giving my personal views of the future, I recently distributed a questionnaire to several experts in the field. The 38 responses (20 from Europe, 15 from North America and 3 from Japan) gave promising insight into the future of speech synthesis. They are summarised below.

Text analysis is considered to be the field where a very strong effort should be made in the future. Although grapheme-to-phoneme conversion is already considered a solved problem by a large number of people, several researchers expect that self-learning systems will to some extent replace the dictionaries which are now widespread. Stochastic techniques are also seen by many as a promising way to build up prosodic databases. Although most of the prosodic models today are rule-based, it is considered that prosodic lexicons will take an increasingly important place. However, the most appropriate type of acoustic unit should not be very different from that used today: a mixed set of polyphones and demisyllables is still considered the best compromise, both for today and for the future.

It is certainly in the domain of acoustic coding that things will change dramatically. So far, PSOLA is seen as the best coding technique by two thirds of the respondents, who thus favour the high quality of a synthesis method based upon the concatenation of speech segments stored in lexicons. Nevertheless, more than half the responses anticipate that articulatory modelling will give the best results in the future. This expectation is somewhat surprising considering the relatively small number of studies carried out in this area. There is little doubt that the effort in articulatory modelling will grow within the next few years. This new trend is particularly interesting because it will bring a lot of energy back into a scientific domain much closer to the human sciences than to signal processing. Moreover, research in articulatory modelling is receiving renewed support from the emerging technology of visual speech synthesis. Facial animation is foreseen by many respondents as a promising area of inquiry in the next few years.
Like acoustic synthesis, visible speech can make use of pre-stored images or of parametric models of the face. The latter technique is clearly the one favored today, and it is also expected to be the best method in the future. Visible and invisible parts of the vocal tract and of the face thus have to be carefully studied in order for adequate articulatory parameters to be defined and controlled over time. Real-time 3D analysis of the vocal tract is not yet accessible, but the technology is progressing very fast. It is thus expected that reading machines will strongly benefit from research in articulatory dynamics. In turn, articulatory modelling should also help us better understand how humans produce speech. And this is certainly the most challenging part of the game!

Acknowledgements

I am especially indebted to all the specialists in speech synthesis who spent some time answering the questionnaire that I circulated. Their responses were enlightening. I hope they will be useful to those students and young researchers eager to investigate new trends in this fascinating area of inquiry.

References

Bailly, G. and Benoît, C. (Eds) (1992). Talking Machines: Theories, Models, and Designs. Amsterdam: Elsevier.

van Santen, J. et al. (Eds) (1995). Progress in Speech Synthesis. Berlin: Springer-Verlag.
