Microsoft Speech API overview

The cloud-based Microsoft Speech API gives developers an easy way to add speech-enabled features to their applications, such as voice command control, natural spoken dialog with users, and speech transcription and dictation. The Microsoft Speech API supports both Speech to Text and Text to Speech conversion.

The Speech to Text API converts human speech to text that can be used as input or as commands to control your application.

The Text to Speech API converts text to audio streams that can be played back to the user of your application.

Speech to text (speech recognition)

The Microsoft speech recognition API transcribes audio streams into text that your application can display to the user or act upon as command input. Developers can add speech recognition to their apps in two ways: REST APIs or WebSocket-based client libraries.

REST APIs: Developers can use HTTP calls from their apps to the service for speech recognition.

Client libraries: For advanced features, developers can download the Microsoft Speech client libraries and link them into their apps. The client libraries are available for several platforms (Windows, Android, iOS) and languages (C#, Java, JavaScript, Objective-C). Unlike the REST APIs, the client libraries use a WebSocket-based protocol.
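To make the REST option concrete, the following is a minimal sketch of how an app might package recorded audio into an HTTP request. The endpoint URL, the query parameters, and the `Subscription-Key` header name are placeholders assumed for illustration, not the service's documented values; consult the REST API reference for the real ones. The function only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint used only for illustration; the real URL comes from
# the service's REST API reference.
ASSUMED_ENDPOINT = "https://speech.example.com/recognition/v1"


def build_recognition_request(audio_bytes, subscription_key, language="en-US"):
    """Build (but do not send) an HTTP POST that carries audio to recognize."""
    # Assumed query parameters: the spoken language and a result format.
    query = urllib.parse.urlencode({"language": language, "format": "simple"})
    return urllib.request.Request(
        url=f"{ASSUMED_ENDPOINT}?{query}",
        data=audio_bytes,  # raw audio payload in the request body
        method="POST",
        headers={
            "Content-Type": "audio/wav",
            # Hypothetical header name for the API key.
            "Subscription-Key": subscription_key,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; under these assumptions the response body would carry the recognized text.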

Advanced speech recognition technologies. The API is built on the same Microsoft speech recognition technologies that are used by Cortana, Office Dictation, Office Translator, and other Microsoft products.

Real-time continuous recognition. The speech recognition API transcribes audio into text in real time and can return intermediate results containing the words recognized so far. The speech service also supports end-of-speech detection. In addition, users can choose formatting options such as capitalization and punctuation, profanity masking, and text normalization.

Support for many spoken languages and dialects. For the full list of supported languages in each recognition mode, see recognition languages.

Integration with language understanding. Besides converting the input audio into text, the Speech to Text API gives applications the additional capability to understand what the text means. It uses the Language Understanding Intelligent Service (LUIS) to extract intents and entities from the recognized text.
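The real-time behavior described above (intermediate results, final results, and end-of-speech detection) can be sketched as a small event handler on the application side. The event shapes and the type names `partial`, `final`, and `end_of_speech` are hypothetical and chosen only for this illustration; the actual message format is defined by the service protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Tracks the live partial text and the phrases finalized so far."""
    partial: str = ""
    phrases: list = field(default_factory=list)
    ended: bool = False

    def handle(self, event):
        """Apply one recognition event (hypothetical shape) to the transcript."""
        kind = event["type"]
        text = event.get("text", "")
        if kind == "partial":
            # Intermediate result: replaces the previous guess for this phrase.
            self.partial = text
        elif kind == "final":
            # Final result: commit the phrase and clear the live guess.
            self.phrases.append(text)
            self.partial = ""
        elif kind == "end_of_speech":
            # The service detected that the speaker has stopped talking.
            self.ended = True
```

An app would typically redraw `partial` on every event so the user sees words appear as they speak, and act on `phrases` only after a final result arrives.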

Text to speech (speech synthesis)

The Text to Speech APIs use REST to convert structured text to an audio stream. The APIs provide fast text to speech conversion in various voices and languages. Users can also change audio characteristics such as pronunciation, volume, and pitch by using SSML tags.
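As an illustration of the SSML tags mentioned above, the sketch below assembles a small SSML document that selects a voice and adjusts pitch and volume with the standard `prosody` element. The helper name `build_ssml` and the voice name `ExampleVoice` are hypothetical; the voices actually available, and any extra attributes the service requires, are listed in the service documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, voice_name, lang="en-US", pitch="medium", volume="medium"):
    """Return an SSML string that speaks `text` with the given voice settings."""
    # Root element of an SSML document, with the language of the content.
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": lang})
    # The voice element selects which synthetic voice reads the text.
    voice = ET.SubElement(speak, "voice", {"name": voice_name})
    # The prosody element adjusts audio characteristics such as pitch and volume.
    prosody = ET.SubElement(voice, "prosody", {"pitch": pitch, "volume": volume})
    prosody.text = text  # special characters are escaped during serialization
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Your order has shipped.", "ExampleVoice", pitch="high", volume="loud")
```

The resulting string would be sent as the body of the Text to Speech REST request; building it with an XML library rather than string concatenation keeps characters such as `&` and `<` in the input text correctly escaped.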