AT&T WATSON℠ converts between different communication modalities, allowing humans and devices to interact more readily. It consists of a general-purpose engine and a collection of plugins, each of which performs a conversion or analysis task. These tasks, many involving speech and language, can be combined in various ways, depending on what information is being communicated.

One common use of WATSON is to convert human speech to text that can be readily interpreted by a device or other machine. In this case, the output might be simple text, or WATSON can perform the additional step of parsing the text so the human’s intent can be determined and communicated to the device. It works the other way, too; WATSON can take content generated by a machine and convert it to speech or text for humans to understand.

Essentially, WATSON takes some input, analyzes it, performs one or more services, and returns a result, all in real time.

WATSON not only converts speech to text but can also combine speech with other modalities, such as a touch-screen tap (“show me the closest Starbucks, here”) or other gesture, and send the information to a device. WATSON also performs speech-to-speech translation, even across multiple languages: speech input in one language can be converted to text in real time, followed shortly by a text translation, and then by the spoken translated sentence once the sentence ends.

The diversity of possibilities on a single platform is due to a plugin architecture where each subtask is contained in its own plugin. Depending on the task to be performed, WATSON selects the right plugins at run time, assembles them into a working engine, and coordinates the information exchange between the plugins. It also takes care of feeding the input media into the engine and forwarding partial or final results to the end device.
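The run-time assembly described above can be sketched in a few lines. This is a hypothetical illustration of the plugin pattern, not the actual WATSON API; the names `Engine`, `assemble`, and the toy plugins are invented for the example.

```python
# Hypothetical sketch of a plugin-style pipeline (not the actual WATSON API):
# each subtask lives in its own plugin, and an engine assembles the selected
# plugins at run time and streams input media through them.

from typing import Callable, Dict, List

class Engine:
    def __init__(self, registry: Dict[str, Callable[[str], str]]):
        self.registry = registry  # plugin name -> transformation

    def assemble(self, task: List[str]) -> Callable[[str], str]:
        """Chain the plugins named in `task` into one working engine."""
        plugins = [self.registry[name] for name in task]
        def run(media: str) -> str:
            for plugin in plugins:
                media = plugin(media)  # hand each plugin's result to the next
            return media
        return run

# Toy plugins standing in for real conversion/analysis tasks.
registry = {
    "asr": lambda audio: f"text({audio})",    # speech -> text
    "parse": lambda text: f"intent({text})",  # text -> intent
}

engine = Engine(registry)
pipeline = engine.assemble(["asr", "parse"])
print(pipeline("utterance"))  # -> intent(text(utterance))
```

A real engine would additionally stream partial results back to the end device; the chaining of independently registered subtasks is the point of the sketch.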

WATSON has been used within AT&T for IVR customers, including AT&T's VoiceTone® service, for over 20 years, during which time the ASR algorithms, tools, and plugin architecture have been refined to increase accuracy, convenience, and ease of integration. Besides customer-care IVR, AT&T WATSON℠ has been used for speech analytics, speech translation (including the AT&T Translator app), mobile voice search of multimedia data, video search, voice remote, voice-mail-to-text, web search, and SMS.

Increasingly, AT&T WATSON℠ is being integrated into web-based, speech-enabled devices and services under development in Research, including the speech mashups and the AT&T WMSSP (WATSON Mobile Speech Services Platform), which currently supports Speak4it (local business search) and will support future production applications.

This talk first gives a brief tour of the components of an interactive speech system -- speech recognition, language understanding, dialog control, language generation, and text-to-speech. It then covers some of the key challenges of using this technology in practice: unavoidable speech recognition errors, the variability of human speech, the curse of history, and the theory-of-mind problem. Finally, it gives a few pointers for getting started with the technology on mobile devices, covering local vs. cloud-based speech recognition, guidance on creating speech recognition grammars, and an introduction to some of the parameters that may need to be tuned.
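For readers unfamiliar with speech recognition grammars, here is a minimal example in the W3C SRGS XML format. This is an illustrative fragment written for this text (the phrases and rule names are invented), not material from the talk:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="search">
  <!-- Matches e.g. "show me coffee shops near me" or just "pizza" -->
  <rule id="search">
    <item repeat="0-1">show me</item>
    <ruleref uri="#business"/>
    <item repeat="0-1">near me</item>
  </rule>
  <rule id="business">
    <one-of>
      <item>coffee shops</item>
      <item>gas stations</item>
      <item>pizza</item>
    </one-of>
  </rule>
</grammar>
```

A grammar like this constrains the recognizer to a small set of phrasings, which typically improves accuracy for narrow tasks at the cost of coverage.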

In previously published work, we proposed a novel feature extraction algorithm that approximates some characteristics of human audition, along with a robust alternative energy estimation scheme. Here, we examine the performance of the proposed features under additive noise and suggest how to predict the deviations of the noisy cepstral coefficients by estimating the subband SNR values. We then examine the efficiency of the proposed features in the framework of a state-of-the-art LVCSR system, namely the AT&T WATSON system, on a mobile voice search task, the Speak4it application. The proposed feature extraction scheme yields a 6% relative improvement in overall performance with the AM and LM training held fixed. Additional improvements have been reported when this frontend is combined with advanced training techniques.

This paper reports on the development of and advances in automatic speech recognition for the AT&T Speak4it voice-search application. With Speak4it as a real-life example, we show the effectiveness of acoustic model (AM) and language model (LM) estimation (adaptation and training) on relatively small amounts of field data. We then introduce algorithmic improvements concerning the use of sentence length in the LM, of non-contextual features in AM decision trees, and of the Teager energy in the acoustic front-end. The combination of these algorithms yields substantial accuracy improvements: LM and AM estimation on samples of field data increases word accuracy from 66.4% to 77.1%, a relative word-error reduction of 32%, and the algorithmic improvements raise accuracy to 79.7%, an additional 11.3% relative error reduction.

The AT&T Statistical Dialog Toolkit V1.0 is now available to the research community. This toolkit simplifies building statistical dialog systems, which maintain a probability distribution over multiple hypothesized dialog states rather than tracking a single state.
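To make "a distribution over multiple dialog states" concrete, such a system keeps a belief over hidden user goals and updates it from noisy ASR evidence. Below is a minimal sketch of a Bayesian belief update; it is an illustration written for this text, not the toolkit's API, and the state names and likelihood values are invented.

```python
# Minimal sketch of belief tracking over dialog states (illustrative only;
# not the AT&T Statistical Dialog Toolkit API). The belief is a probability
# distribution over hypothesized user goals, updated with Bayes' rule using
# noisy ASR observation likelihoods.

def update_belief(belief, likelihood):
    """Multiply the prior belief by the observation likelihood, renormalize."""
    posterior = {state: belief[state] * likelihood.get(state, 0.0)
                 for state in belief}
    total = sum(posterior.values())
    if total == 0.0:
        return dict(belief)  # observation carried no information; keep prior
    return {state: p / total for state, p in posterior.items()}

# Uniform prior over three hypothesized user goals.
belief = {"coffee": 1 / 3, "gas": 1 / 3, "pizza": 1 / 3}

# ASR hypothesis favoring "coffee", with confusion mass on the alternatives.
likelihood = {"coffee": 0.8, "gas": 0.1, "pizza": 0.1}

belief = update_belief(belief, likelihood)
print(belief["coffee"])  # roughly 0.8 after one update
```

Because the full distribution is kept, a low-confidence recognition merely shifts probability mass instead of committing the dialog to a possibly wrong state.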