Standards Make the World Smaller

My previous columns have focused on specific standards and how they can be used in speech applications. I’ve written about the Multimodal Interaction (MMI) Architecture, State Chart XML (SCXML), Call Control eXtensible Markup Language (CCXML), the Pronunciation Lexicon Specification (PLS), and Extensible MultiModal Annotation (EMMA). But by looking at standards as a whole, especially speech and multimodal standards, we see plenty of opportunities for new services and products, particularly for smaller enterprises. Standards break the world into smaller, distinct pieces with agreed-on functions and boundaries.

In his insightful book The Pebble and the Avalanche: How Taking Things Apart Creates Revolutions, Moshe Yudkowsky discusses how breaking the world into smaller pieces creates opportunities. He refers to this process as disaggregation. Standards lead to a specific kind of disaggregation in which an industry agrees on what the new, smaller pieces of the world will be composed of and how they will interact. This makes assembling a system out of components much easier.

Think about installing computer hardware before the days of the USB port and plug and play. Remember IRQs? Remember fighting with drivers? Remember the three hours of hair-pulling to get your new mouse ready to use? Now think of all the hardware, like the mouse, keyboard, display, memory, sound cards, Webcams, etc., that no longer has to be bought from the computer manufacturer.

Another example, and one a little closer to the speech industry, is the Media Resource Control Protocol (MRCP), standardized by the Internet Engineering Task Force. MRCP defines the functions provided by speech services (speech recognition, text-to-speech, and speaker verification) and how they can be accessed by other components, such as voice browsers.

Before MRCP, voice browsers and speech services were monoliths. Automatic speech recognition (ASR) and text-to-speech (TTS) options for a particular voice browser were fixed by the platform vendor, and communication between the ASR and the voice browser differed from platform to platform. The upshot was that ASR, TTS, and voice browser technology all came together in a package from the same vendor. They had to be used together regardless of whether that combination resulted in the best overall solution. MRCP standardizes this connection, and that standardization creates opportunities.

With a standard interface to speech services, a smaller vendor that might not have the resources to build a full voice platform can play in this market by offering a stand-alone MRCP server or voice browser.
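To make the idea concrete, here is a minimal Python sketch of what an agreed-on boundary buys you. It is an analogy only: the `SpeechRecognizer` interface, the two vendor classes, and the `VoiceBrowser` class are hypothetical names invented for illustration, and none of this is the actual MRCP protocol or any real product's API. The point is simply that the voice browser depends on the interface, not on a particular vendor.

```python
from typing import Protocol


class SpeechRecognizer(Protocol):
    """Hypothetical stand-in for an agreed-on interface (not the real MRCP API)."""

    def recognize(self, audio: bytes) -> str: ...


class VendorARecognizer:
    """Imaginary recognizer from one small vendor."""

    def recognize(self, audio: bytes) -> str:
        return "vendor A heard %d bytes" % len(audio)


class VendorBRecognizer:
    """Imaginary recognizer from a competing vendor."""

    def recognize(self, audio: bytes) -> str:
        return "vendor B heard %d bytes" % len(audio)


class VoiceBrowser:
    """Imaginary voice browser that works with any conforming recognizer."""

    def __init__(self, recognizer: SpeechRecognizer) -> None:
        self.recognizer = recognizer

    def handle_call(self, audio: bytes) -> str:
        # The browser only knows the interface, so vendors are interchangeable.
        return self.recognizer.recognize(audio)
```

Swapping `VoiceBrowser(VendorARecognizer())` for `VoiceBrowser(VendorBRecognizer())` changes the supplier without touching the browser, which is the kind of opening a standard interface gives a smaller vendor.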

Adding Riches

The importance of these considerations is increasing rapidly as more technologies become part of the everyday user interface. As the user experience becomes more natural and intuitive, the array of supporting technologies becomes richer and more complex. If a particular application requires input from speech, camera, handwriting, fingerprint reader, GPS, and accelerometers, few companies will have the expertise to start from scratch with all of these technologies. But if standards break down the system into components that can be provided individually by smaller companies that have experience in some of these specific areas, then these companies can enter the picture.

Speech recognition by itself is a very sophisticated technology. What if you wanted to combine speech with other inputs? An application might let you circle an area on a screen and put a caption on that area by speaking. You might say, “Mary’s house,” and that caption would appear in the circled area. And what if you wanted to clear the caption by shaking your phone? If speech, haptic, and pen experts were all required, the application might be too complex for any single company to build.

Multimodal standards, like EMMA and the MMI Architecture, make integration easier by breaking down input technologies into components and communication standards. These standards will underlie many innovations.
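As a rough illustration of the captioning scenario, the sketch below shows how an application might stay simple when every input component reports its result in one shared format. Everything here is an assumption made for the example: `InputEvent` is a hypothetical envelope only loosely inspired by the idea behind EMMA (real EMMA is an XML annotation format with a much richer schema), and `CaptionApp` is an invented class, not part of the MMI Architecture.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InputEvent:
    """Hypothetical common envelope, loosely inspired by EMMA (not its real schema)."""

    mode: str     # which component produced it: "pen", "speech", or "motion"
    meaning: str  # the interpretation that component arrived at


@dataclass
class CaptionApp:
    """Imaginary application that combines pen, speech, and motion input."""

    selected_area: Optional[str] = None
    captions: dict = field(default_factory=dict)

    def handle(self, event: InputEvent) -> None:
        if event.mode == "pen":
            # The pen component reports which area was circled.
            self.selected_area = event.meaning
        elif event.mode == "speech" and self.selected_area is not None:
            # The speech component supplies the caption text.
            self.captions[self.selected_area] = event.meaning
        elif event.mode == "motion" and event.meaning == "shake":
            # Shaking the phone clears the caption on the selected area.
            self.captions.pop(self.selected_area, None)
```

Because the application sees only interpreted events, the pen, speech, and motion components could each come from a different specialist vendor.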

Just consider the possibilities. We could have products like speech recognition or TTS components, handwriting recognition components, and biometric components like fingerprint identification. Vendors could provide services to integrate different components into larger applications. Services could be provided on the Web, and other complex technologies, like natural language understanding or translation, could also be provided as stand-alone Web services. The range and variety of innovations built on these standards could be remarkable.

Deborah Dahl, Ph.D., is the principal at speech and language technology consulting firm Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interaction Working Group. She can be reached at dahl@conversational-technologies.com.