Powered by machine learning

Apply the most advanced deep-learning neural network algorithms to audio
for speech recognition with unparalleled accuracy. Cloud Speech-to-Text
accuracy improves over time as Google improves the internal speech
recognition technology used by Google products.

Recognizes 120 languages and variants

Cloud Speech-to-Text can support your global user base, recognizing 120 languages and variants. You can also filter inappropriate content
in text results for all languages.

Automatically identifies spoken language

Cloud Speech-to-Text can identify which language is spoken in an utterance,
choosing from up to four candidate languages that you specify. This is useful for
voice search (such as “What is the temperature in Paris?”) and command use cases
(such as “Turn the volume up.”).

Returns text transcription in real time for short-form or long-form audio

Cloud Speech-to-Text can stream text results, immediately returning
text as it’s recognized from streaming audio or as the user is speaking.
Alternatively, Cloud Speech-to-Text can return recognized text from audio stored in
a file. It’s capable of analyzing short-form and long-form audio.

Cloud Speech-to-Text is tailored to work well with real-life speech and can accurately
transcribe proper nouns (such as Sundar Pichai) and appropriately format language
(such as dates and phone numbers). Google supports more than 10 times as many proper
nouns as there are words in the entire Oxford English Dictionary.

phone_call

Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).

video

Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16 kHz or greater sampling rate. This is a premium model that costs more than the standard rate.

default

Best for audio that does not fit one of the specific audio models, for example long-form audio. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate.
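As a rough illustration of how an application might choose among these models, the sketch below maps a coarse description of the audio source to a model name and places it in a request configuration. The helper `build_config` and its `source` labels are hypothetical, and the identifier `phone_call` and the JSON field names (`model`, `sampleRateHertz`, `languageCode`) are assumptions based on the public REST API that should be checked against the current reference.

```python
from typing import Any, Dict

# Hypothetical mapping from an application-level source label to a model name.
# "phone_call" is assumed to be the identifier of the phone-call model.
MODEL_BY_SOURCE = {"phone": "phone_call", "video": "video"}


def build_config(source: str, sample_rate_hz: int,
                 language_code: str = "en-US") -> Dict[str, Any]:
    """Return a recognition config dict, falling back to the default model."""
    model = MODEL_BY_SOURCE.get(source, "default")
    return {
        "languageCode": language_code,
        "sampleRateHertz": sample_rate_hz,
        "model": model,
    }


phone_config = build_config("phone", 8000)       # typical phone-call audio
podcast_config = build_config("podcast", 44100)  # no specific model applies
```

Anything that is not recognized as phone or video audio falls through to `default`, which mirrors the description of the default model above.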

Speech recognition can be customized to a specific context by providing a set of
words and phrases that are likely to be spoken. This is especially useful for adding
custom words and names to the vocabulary and in voice-control use cases.
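A minimal sketch of supplying such words and phrases follows. It assumes the request configuration accepts a `speechContexts` list whose entries carry a `phrases` array, as in the public REST API; the helper name `with_phrase_hints` is hypothetical.

```python
from typing import Any, Dict, List


def with_phrase_hints(config: Dict[str, Any], phrases: List[str]) -> Dict[str, Any]:
    """Return a copy of config with cleaned, de-duplicated phrase hints attached."""
    cleaned: List[str] = []
    for phrase in phrases:
        phrase = phrase.strip()
        # Skip blanks and repeats while keeping the original order.
        if phrase and phrase not in cleaned:
            cleaned.append(phrase)
    updated = dict(config)  # shallow copy so the caller's dict is not modified
    updated["speechContexts"] = [{"phrases": cleaned}]
    return updated


base = {"languageCode": "en-US"}
hinted = with_phrase_hints(base, ["Sundar Pichai", " ", "Turn the volume up", "Sundar Pichai"])
```

Keeping the hints in one place makes it easier to vary them per screen or per command set in a voice-control application.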

Real-time Streaming or Prerecorded Audio Support

Audio input can be streamed from an application’s microphone or sent from a
prerecorded audio file (inline or through Google Cloud Storage). Multiple audio
encodings are supported, including FLAC, AMR, PCMU, and Linear-16.
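The two ways of sending prerecorded audio can be sketched as follows. This is an illustration only: it assumes the REST request body takes either base64 `content` or a Cloud Storage `uri`, and that the encodings named above correspond to the enum strings shown (PCMU is assumed to be `MULAW`, Linear-16 to be `LINEAR16`). The helper names and the example URI are hypothetical.

```python
import base64
from typing import Any, Dict, Union

# Assumed mapping from the encoding names on this page to API enum strings.
ENCODINGS = {"FLAC": "FLAC", "AMR": "AMR", "PCMU": "MULAW", "Linear-16": "LINEAR16"}


def build_audio(source: Union[bytes, str]) -> Dict[str, str]:
    """Return the audio part of a request: inline bytes or a gs:// reference."""
    if isinstance(source, bytes):
        return {"content": base64.b64encode(source).decode("ascii")}
    if isinstance(source, str) and source.startswith("gs://"):
        return {"uri": source}
    raise ValueError("expected raw audio bytes or a gs:// URI")


def build_request(source: Union[bytes, str], encoding: str,
                  sample_rate_hz: int, language_code: str = "en-US") -> Dict[str, Any]:
    """Combine config and audio into one request body (no network call is made)."""
    config = {
        "encoding": ENCODINGS[encoding],
        "sampleRateHertz": sample_rate_hz,
        "languageCode": language_code,
    }
    return {"config": config, "audio": build_audio(source)}


inline_request = build_request(b"abc", "Linear-16", 16000)
stored_request = build_request("gs://example-bucket/call.flac", "FLAC", 16000)
```

Inline content suits short clips, while a storage reference avoids pushing large files through the request body.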

Auto-Detect Language BETA

When you need to support multilingual scenarios, you can now specify two to four language codes and Cloud Speech-to-Text will identify the correct language spoken and provide the transcript.
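One way to express this in a request is sketched below, assuming the configuration takes a primary `languageCode` plus a list of `alternativeLanguageCodes`, as in the beta REST API. The helper `build_language_config` is hypothetical; it only enforces the two-to-four limit stated above.

```python
from typing import Any, Dict, List


def build_language_config(candidates: List[str]) -> Dict[str, Any]:
    """Split two to four language codes into a primary code and alternatives."""
    if not 2 <= len(candidates) <= 4:
        raise ValueError("specify between two and four language codes")
    primary, *alternatives = candidates
    return {
        "languageCode": primary,
        "alternativeLanguageCodes": alternatives,
    }


config = build_language_config(["en-US", "fr-FR", "de-DE"])
```

Listing the most likely language first is a design choice of this sketch, not a documented requirement.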

Choose from a selection of four pre-built models: default, voice commands and search, phone calls, and video transcription.

Speaker Diarization BETA

Know who said what: you can now get automatic predictions about which speaker in a conversation spoke each utterance.
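To show how such predictions might be consumed, the sketch below groups recognized words into speaker turns. It assumes each word in the response carries a `speakerTag` integer alongside the `word` text, as in the public REST API; the sample data and the helper `speaker_turns` are made up for illustration.

```python
from itertools import groupby
from typing import Any, Dict, List, Tuple


def speaker_turns(words: List[Dict[str, Any]]) -> List[Tuple[int, str]]:
    """Merge consecutive words with the same speaker tag into one turn."""
    turns = []
    for tag, group in groupby(words, key=lambda w: w["speakerTag"]):
        turns.append((tag, " ".join(w["word"] for w in group)))
    return turns


# Hypothetical word list in the assumed response shape.
sample_words = [
    {"word": "Hello", "speakerTag": 1},
    {"word": "there", "speakerTag": 1},
    {"word": "Hi", "speakerTag": 2},
    {"word": "Thanks", "speakerTag": 1},
]

for tag, text in speaker_turns(sample_words):
    print(f"Speaker {tag}: {text}")
```

Because `groupby` only merges adjacent items, a speaker who talks again later starts a new turn rather than being folded into the earlier one.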

Multichannel Recognition BETA

In multiparticipant recordings where each participant is recorded in a separate channel (e.g., phone call with two channels or video conference with four channels), Cloud Speech-to-Text will recognize each channel separately and then annotate the transcripts so that they follow the same order as in real life.
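A small sketch of both sides of this feature follows: a configuration that requests per-channel recognition, and a helper that labels each result with its channel. The field names `audioChannelCount`, `enableSeparateRecognitionPerChannel`, and `channelTag` are assumptions based on the public REST API, and the sample results are invented.

```python
from typing import Any, Dict, List


def multichannel_config(channel_count: int, language_code: str = "en-US") -> Dict[str, Any]:
    """Return a config asking for each channel to be recognized separately."""
    if channel_count < 2:
        raise ValueError("multichannel recognition needs at least two channels")
    return {
        "languageCode": language_code,
        "audioChannelCount": channel_count,
        "enableSeparateRecognitionPerChannel": True,
    }


def label_by_channel(results: List[Dict[str, Any]]) -> List[str]:
    """Prefix the top transcript of each result with its channel number."""
    return [
        f"[channel {r['channelTag']}] {r['alternatives'][0]['transcript']}"
        for r in results
    ]


# Hypothetical results, already in spoken order as the service returns them.
sample_results = [
    {"channelTag": 1, "alternatives": [{"transcript": "How can I help?"}]},
    {"channelTag": 2, "alternatives": [{"transcript": "My order is late."}]},
]
lines = label_by_channel(sample_results)
```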

Cloud Speech-to-Text API pricing

Powerful speech recognition.

Cloud Speech-to-Text is priced per 15 seconds of audio processed after a 60-minute
free tier. For details, please see our pricing guide.

| Feature | 0-60 minutes | Over 60 minutes, up to 1 million minutes |
| --- | --- | --- |
| Speech Recognition (all models except video) | Free | $0.006 USD / 15 seconds* |
| Video Speech Recognition | $0.006 | $0.012 USD / 15 seconds* |

This pricing is for applications on personal systems (e.g., phones, tablets, laptops,
desktops). Please contact us for approval and pricing to use the Speech-to-Text API
on embedded devices (e.g., cars, TVs, appliances, or speakers).

* Each request is rounded up to the nearest increment of 15 seconds.
For example, if you make three separate requests, each containing 7 seconds of audio,
you are billed $0.018 USD for 45 seconds (3 × 15 seconds) of audio. Fractions of
seconds are included when rounding up to the nearest increment of 15 seconds.
That is, 15.14 seconds are rounded up and billed as 30 seconds.
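
The rounding rule above can be checked with a short calculation. This sketch assumes the free tier has already been used up and applies the $0.006 rate for all models except video; the function names are hypothetical.

```python
import math
from decimal import Decimal
from typing import Iterable

INCREMENT_SECONDS = 15
# Rate per 15-second increment for all models except video (from this page).
STANDARD_RATE = Decimal("0.006")


def billed_seconds(duration_seconds: float) -> int:
    """Round one request's audio length up to the next 15-second increment."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return math.ceil(duration_seconds / INCREMENT_SECONDS) * INCREMENT_SECONDS


def total_cost(durations: Iterable[float], rate: Decimal = STANDARD_RATE) -> Decimal:
    """Sum the cost of several requests, rounding each request up on its own."""
    increments = sum(billed_seconds(d) // INCREMENT_SECONDS for d in durations)
    return rate * increments


three_short_requests = total_cost([7, 7, 7])  # three requests of 7 seconds each
```

Using `Decimal` for the rate avoids binary floating-point drift when multiplying currency amounts.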

A product or feature listed on this page is in beta. For more information on
our product launch stages, see here.