Description

Abstract

For a product to deliver value to society, it must deliver its expected benefits reliably to an ever-increasing number of users. The strong correlation between the size of the market and the reliability of the product can be observed in many new product introductions. In the field of speech technology, early adoption of speech recognition occurred among populations of users who gained large benefits from investing the time to learn the system, such as users with accessibility needs and busy professionals such as lawyers and doctors. Lately, speech use has expanded to the general population, particularly through speech interfaces on mobile devices. However, even in fairly successful products such as Siri, most users still rely on it for relatively easy tasks such as name dialing. To successfully deliver an automated intelligent assistant to the general population, the product must be designed to deliver its expected value reliably.

Speaker Biography

Eric Chang joined Microsoft Research Asia (MSRA) in July 1999 to work in the area of speech technologies. Eric is currently the Senior Director of Technology Strategy at MSR Asia, where his responsibilities include communications, IP portfolio management, and driving new research themes such as eHealth. Before taking on his current responsibilities at MSR Asia, Eric co-founded the Microsoft Advanced Technology Center (ATC) in 2003 as the Assistant Managing Director. At ATC, Eric led teams to ship features for Windows and Windows Mobile and started a multi-disciplinary incubation team. Before joining ATC, Eric was the research manager of the speech group at MSRA and the acting University Relations Director for one year. A technology transfer result from his group is the Chinese version of Office XP, which incorporates the Mandarin speech recognition engine developed at Microsoft Research Asia. Prior to joining Microsoft Research, Eric was one of the founding members of the Research group at Nuance Communications, a pioneer in natural speech interface software for telecommunication systems. While at Nuance, Eric worked on various projects involving confidence score generation, acoustic modeling, and robust speech detection. He also led the technical effort to develop the Japanese version of the Nuance product. This project led to the world's first deployed Japanese natural language speech recognition system.

Eric has also developed speech recognition algorithms at M.I.T. Lincoln Laboratory, invented a new circuit optimization technique at Toshiba ULSI Research Center, and conducted pattern recognition research at General Electric Corporate Research and Development Center.

Eric graduated from M.I.T. with Ph.D., Master's, and Bachelor's degrees, all in the field of electrical engineering and computer science. While at M.I.T., he was inducted into the honorary societies Tau Beta Pi and Sigma Xi. Eric is also a Senior Member of the IEEE.

Eric has published papers in the fields of speech recognition, neural networks, and genetic algorithms in various journals and conferences. He is the author of several granted and pending patents. His research interests are spoken language understanding, machine learning, and signal processing.

Abstract

Automatic Speech Recognition (ASR) systems classify structured sequence data, where the label sequences (sentences) must be inferred from the observation sequences (the acoustic waveform). The sequential nature of the task is one of the reasons why generative classifiers, based on combining hidden Markov model (HMM) acoustic models and N-gram language models using Bayes' rule, have become the dominant technology used in ASR. Conversely, the machine learning (ML) and natural language processing (NLP) research areas are increasingly dominated by discriminative approaches, where the class posteriors are modeled directly. This talk describes recent work on applying discriminative models to ASR. To handle continuous, variable-length observation sequences, the approaches applied to ML and NLP tasks must be modified. This talk discusses the issues in applying discriminative models to ASR and possible solutions. The nature of the models, possible sets of features, and options for optimizing the parameters of the models will all be described. Examples of applying these approaches to continuous ASR tasks will also be given.
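The contrast between the two modeling styles mentioned in the abstract can be summarized with the standard decoding equations (a sketch using conventional ASR notation, not taken from the talk itself). The generative approach selects the word sequence via Bayes' rule, combining the HMM acoustic likelihood with the N-gram language model prior:

\[
\hat{W} \;=\; \underset{W}{\arg\max}\; P(W \mid O)
\;=\; \underset{W}{\arg\max}\; p(O \mid W)\, P(W)
\]

where \(O\) is the acoustic observation sequence, \(p(O \mid W)\) is the HMM acoustic model, and \(P(W)\) is the N-gram language model. Discriminative models instead parameterize the posterior directly, for example in a log-linear form over joint features:

\[
P(W \mid O; \boldsymbol{\alpha}) \;=\;
\frac{\exp\!\big(\boldsymbol{\alpha}^{\mathsf{T}} \boldsymbol{\Phi}(O, W)\big)}
{\sum_{W'} \exp\!\big(\boldsymbol{\alpha}^{\mathsf{T}} \boldsymbol{\Phi}(O, W')\big)}
\]

Here \(\boldsymbol{\Phi}(O, W)\) is a feature function over the observation and label sequences and \(\boldsymbol{\alpha}\) the parameters to be trained; the choice of features and the training criterion for \(\boldsymbol{\alpha}\) are among the design issues the talk addresses.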

Speaker Biography

Mark Gales studied for the B.A. in Electrical and Information Sciences at the University of Cambridge from 1985 to 1988. Following graduation he worked as a consultant at Roke Manor Research Ltd. In 1991 he took up a position as a Research Associate in the Speech, Vision and Robotics group in the Engineering Department at Cambridge University. In 1995 he completed his doctoral thesis, Model-Based Techniques for Robust Speech Recognition, supervised by Professor Steve Young. From 1995 to 1997 he was a Research Fellow at Emmanuel College, Cambridge. He was then a Research Staff Member in the Speech group at the IBM T. J. Watson Research Center until 1999, when he returned to the Cambridge University Engineering Department as a University Lecturer. He is currently a Professor of Information Engineering and a Fellow of Emmanuel College. Mark Gales is a Fellow of the IEEE and was a member of the Speech Technical Committee from 2001 to 2004. He was an associate editor for IEEE Signal Processing Letters from 2009 to 2011 and is currently an associate editor for IEEE Transactions on Audio, Speech, and Language Processing. He is also on the Editorial Board of Computer Speech and Language. Mark Gales was awarded a 1997 IEEE Young Author Paper Award for his paper on Parallel Model Combination and a 2002 IEEE Paper Award for his paper on Semi-Tied Covariance Matrices.

Abstract

In recent years we have made a number of proposals for a paradigm for the automatic analysis-by-synthesis of speech prosody, which aims to characterise the length, pitch and loudness of the individual speech sounds which make up utterances. This paradigm has already been applied successfully to Western European languages (in particular English and French). In this presentation we look at some of the problems involved in applying the paradigm to a lexical tone language like Mandarin Chinese. For this preliminary investigation of read speech, we recorded a Chinese version of the 40 continuous five-sentence passages of the Eurom1 corpus, read by ten speakers (a total of just under 4 hours of speech). The speech was aligned at the phoneme, syllable and word levels using the SPASS automatic speech aligner, and the pitch was modelled using the Momel and INTSINT algorithms. The results obtained for Chinese will be compared with those available for English, which is characterised as a language with lexical accent, and French, which is characterised as a language with no lexical prosody.

Speaker Biography

Daniel Hirst has been working in the field of speech prosody and phonology for the past forty years. In this time he completed two doctoral theses (Doctorat de 3e Cycle 1974; Doctorat d'Etat 1987) and is at present Emeritus Research Director at the CNRS laboratory "Parole et Langage" at Aix-Marseille University. Daniel Hirst has published numerous articles in several major journals (including Linguistics, Phonetica, Journal of Semantics, Phonology, Mind and Language, Linguistic Inquiry, Speech Communication, Journal of Computer Science, Journal of the International Phonetic Association, and Journal of Speech Sciences) and has contributed chapters to numerous international collaborative volumes. He was responsible (with Albert Di Cristo) for editing Intonation Systems: a Survey of Twenty Languages, a major study of the intonation of the languages of the world published by Cambridge University Press, to which he contributed the chapter on British English as well as an 80-page introduction (co-authored by Albert Di Cristo) in which he proposed a new international transcription system for intonation (INTSINT). Daniel Hirst has developed software for the automatic modelling of fundamental frequency curves and is at present working on an automatic prosodic labelling system for speech synthesis using the INTSINT transcription system. In 2000 he founded an international working group (SProSIG: the Speech Prosody Special Interest Group), affiliated with ISCA and supported by 70 specialists from all over the world. This group was responsible for organising the First International Conference on Speech Prosody, held in Aix-en-Provence in March 2002, followed by biennial international conferences held in Japan 2004, Germany 2006, Brazil 2008, USA 2010 and China 2012. Since April 2012 Daniel Hirst has also been Lecture Professor at Tongji University, Shanghai, China.

Abstract

We are experiencing two disruptive technological developments: neurotechnology as a man-machine interface and telecommunication as a time-space integrator. Hearing and speech enhancement is at the forefront of this technological revolution. As a prime example, I will use cochlear implants and hearing aids to illustrate the convergence of these two technologies and their impact not only on human communication but also on our quality of life and lifestyle.

Speaker Biography

Fan-Gang Zeng is a leading researcher in auditory science and technology, unraveling brain mechanisms in loudness coding and speech recognition while translating research into two commercial products for hearing loss and tinnitus treatment. He has published more than 100 peer-reviewed journal articles, with 5000 citations and an h-index of 36 (Google Scholar, June 2012). He is a Professor of Anatomy and Neurobiology, Biomedical Engineering, Cognitive Sciences, and Otolaryngology and Director of the Center for Hearing Research at the University of California, Irvine. He is a Fellow of the American Institute for Medical and Biological Engineering, the Collegium Oto-Rhino-Laryngologicum, the IEEE, and the Acoustical Society of America.