Speech Recognition API

iOS 10 brings a brand new Speech Recognition API that allows you to perform rapid and contextually informed speech recognition in both file-based and realtime scenarios. In this video, you will learn all about the new API and how to bring advanced speech recognition services into your apps.

A quick overview of what speech recognition is.
Speech recognition is the automatic process
of converting audio of human speech into text.

It depends on the language of the speech.
English will be recognized differently
than Chinese, for example.

On iOS, most people think of Siri
but speech recognition is also useful for many other tasks.
Since Siri was released with iPhone 4S,
iOS has also featured keyboard dictation.
That little microphone button next
to your iOS keyboard spacebar triggers speech recognition
for any UIKit text input.
Tens of thousands of apps use this feature every day.
In fact, about a third of all requests come
from third party apps.
It's extremely easy to use.
It handles audio recording and recording interruption.

It displays a user interface.
It doesn't require you to write any more code than you would
to support any text input.

And it's been available since iOS 5.
But with its simplicity comes many limitations.
It often doesn't make sense for your user interface
to require a keyboard.
You can't control when the audio recording starts.
There's no control over which language is used.

It just happens to use the system's keyboard language.
There isn't even a way to know
if the dictation button is available.

The default audio recording might not make sense
for your use case and you may want more information
than just text.

So now in iOS 10,
we're introducing a new Speech framework.
It uses the same underlying technology that we use
in Siri and Dictation.
It provides fast and accurate results
which are transparently customized to the user
without you having to collect any user data.
The framework also provides more information
about recognition than just text.

For example, we also provide alternative interpretations
of what your users might have said, confidence levels,
and timing information.
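As a sketch of what that extra information looks like, assuming a hypothetical helper that receives an SFSpeechRecognitionResult:

```swift
import Speech

// Hypothetical inspection of a recognition result delivered
// to your result handler.
func inspect(_ result: SFSpeechRecognitionResult) {
    // Alternative interpretations of what the user said.
    for transcription in result.transcriptions {
        print(transcription.formattedString)
    }
    // Per-word confidence and timing from the best transcription.
    for segment in result.bestTranscription.segments {
        print(segment.substring, segment.confidence,
              segment.timestamp, segment.duration)
    }
}
```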

Audio for the API can be provided
from either pre-recorded files
or a live source like a microphone.

iOS 10 supports over 50 languages and dialects
from Arabic to Vietnamese.
Any device which runs iOS 10 is supported.

The speech recognition API typically does its heavy lifting
on our big servers which requires an internet connection.
However, some newer devices support speech recognition
at all times, even without a connection.
We provide an availability API to determine
if a given language is available at the moment.

Use this rather than looking
for internet connectivity explicitly.
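A minimal sketch of that availability check, using Arabic as an example locale:

```swift
import Speech

// Check whether a recognizer for a given language exists and is
// currently available (e.g. the server is reachable).
let locale = Locale(identifier: "ar-SA")  // Arabic, for example
if let recognizer = SFSpeechRecognizer(locale: locale),
   recognizer.isAvailable {
    // Safe to start a recognition request.
} else {
    // Unsupported locale, or temporarily unavailable.
}
```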
Since speech recognition requires transmitting the user's
audio over the internet,
the user must explicitly provide permission
to your application before speech recognition can be used.

There are four major steps
to adopting speech recognition in your app.
First, provide a usage description
in your app's info.plist.
For example, your camera app, Phromage,
might use a usage description for speech recognition like
"This will allow you to take a photo just by saying cheese."
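In the Info.plist, that description goes under the speech recognition usage key; for example:

```xml
<!-- In your app's Info.plist; the string is shown in the permission dialog. -->
<key>NSSpeechRecognitionUsageDescription</key>
<string>This will allow you to take a photo just by saying cheese.</string>
```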
Second, request authorization using the requestAuthorization
class method.
The explanation you provided earlier will be presented
to the user in a familiar dialog,
and the user will be able to decide whether
to grant your app access to speech recognition.
Next create a speech recognition request.

If you already have a recorded audio file,
use the SFSpeechURLRecognitionRequest
class.

Otherwise, you'll want
to use the SFSpeechAudioBufferRecognitionRequest class.
Finally, hand the recognition request
to an SFSpeechRecognizer to begin recognition.
You can optionally hold onto the returned recognition task
which can be useful for monitoring
and controlling recognition progress.
Let's see what this all looks like in code.
We'll assume we've already updated our Info.plist
with an accurate description of how speech recognition will be used.
Our next step is to request authorization.
It may be a good idea to wait to do this
until the user has invoked a feature of your app
which depends on speech recognition.
The requestAuthorization class method takes a completion
handler which doesn't guarantee a particular execution context.
Apps will typically want to dispatch to the main queue
if they're going to do something like enable
or disable a user interface button.
If your authorization handler receives the authorized status,
you should be ready to start recognition.

If not, recognition won't be available to your app.
It's important to gracefully disable the dependent functionality
when the user makes this decision
or when the device is otherwise restricted
from accessing speech recognition.
Authorization can be changed later
in the device's privacy settings.
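Put together, the authorization step might look like this; `recordButton` is a hypothetical UI control that should only be enabled when speech recognition is authorized:

```swift
import Speech

// A sketch of requesting authorization. The completion handler may
// be called on any queue, so hop to the main queue before touching UI.
SFSpeechRecognizer.requestAuthorization { status in
    OperationQueue.main.addOperation {
        switch status {
        case .authorized:
            recordButton.isEnabled = true
        case .denied, .restricted, .notDetermined:
            // Gracefully disable the dependent functionality.
            recordButton.isEnabled = false
        }
    }
}
```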
Let's see what it looks like to recognize a prerecorded
audio file.

We'll assume we already have a file url.
Recognition requires a speech recognizer
which only recognizes a single language.

The default initializer for SFSpeechRecognizer is failable.
So it'll return nil if the locale is not supported.
The default initializer uses the device's current locale.

In this function, we'll just return in that case.
While speech recognition may be supported,
it may not be available, perhaps due
to having no internet connectivity.
Use the isAvailable property on your recognizer to monitor this.
Now we create a recognition request
with the recorded file's url and give
that to the recognition task method of the recognizer.
This method takes a completion handler
with two optional arguments, result and error.
If result is nil, that means recognition has failed
for some reason.

Check the error parameter for an explanation.
Otherwise, we can read the speech recognized so far
by looking at the result.

Note that the completion handler may be called more than once
as speech is recognized incrementally.
You can tell the recognition is finished
by checking the isFinal property of the result.
Here we'll just print the text of the final recognition.
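The file-based flow just described can be sketched as follows, assuming `url` points to a recorded audio file:

```swift
import Speech

// A sketch of recognizing a prerecorded audio file.
func recognizeFile(url: URL) {
    // Failable: returns nil if the current locale isn't supported.
    guard let recognizer = SFSpeechRecognizer() else { return }
    // Supported but temporarily unavailable (e.g. no connectivity).
    guard recognizer.isAvailable else { return }

    let request = SFSpeechURLRecognitionRequest(url: url)
    recognizer.recognitionTask(with: request) { result, error in
        guard let result = result else {
            // Recognition failed; check `error` for the reason.
            print("Recognition failed: \(error?.localizedDescription ?? "unknown error")")
            return
        }
        // Called more than once as speech is recognized incrementally.
        if result.isFinal {
            print(result.bestTranscription.formattedString)
        }
    }
}
```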
Recognizing live audio from the device's microphone is very
similar but requires a few changes.
We'll make an audio buffer recognition request instead.
This allows us to provide a sequence
of in-memory audio buffers instead of a file on disk.
Here we'll use AVAudioEngine to get a stream of audio buffers,
then append them to the request.

Note that it's totally okay to append audio buffers
to a recognition request both before
and after starting recognition.

One difference here is that we no longer ignore the return
value of the recognition task method.
Instead we store it in a variable property.

We'll see why in a moment.
When we're done recording, we need to inform the request
that no more audio is coming so that it can finish recognition.

Use the endAudio method to do this.
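A sketch of the live-audio variant, with hypothetical start/stop methods wrapping the engine and the stored task:

```swift
import Speech
import AVFoundation

// A sketch of live recognition from the microphone, assuming
// authorization has already been granted.
final class LiveRecognizer {
    private let audioEngine = AVAudioEngine()
    private let recognizer = SFSpeechRecognizer()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    func startRecording() throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        self.request = request

        // Feed microphone buffers into the request as they arrive.
        let node = audioEngine.inputNode
        let format = node.outputFormat(forBus: 0)
        node.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        // Keep the returned task so we can monitor or cancel it later.
        task = recognizer?.recognitionTask(with: request) { result, error in
            if let result = result, result.isFinal {
                print(result.bestTranscription.formattedString)
            }
        }
    }

    func stopRecording() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        // Tell the request no more audio is coming.
        request?.endAudio()
    }
}
```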
But what if the user cancels our recording
or the audio recording gets interrupted?
In this case, we really don't care about the results
and we should free up any resources still being used
by speech recognition.

Just cancel the recognition task
that we stored when we started recognition.
This can also be done for prerecorded audio recognition.
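A sketch of that cancellation path, assuming `task` is the stored SFSpeechRecognitionTask:

```swift
// Called when the user cancels or recording is interrupted.
func cancelRecording() {
    // Discard pending results and free speech recognition resources.
    task?.cancel()
    task = nil
}
```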

Now just a quick talk about some best practices.
We're making speech recognition available for free to all apps
but we do have some reasonable limits in place
so that the service remains available to everyone.
Individual devices may be limited in the number
of recognitions that can be performed per day.

Apps may also be throttled globally
on a request per day basis.
Like other service-backed APIs, for example CLGeocoder,
be prepared to handle network and rate limiting failures.
If you find that you're routinely hitting your
throttling limits, please let us know.

It's also important to be aware
that speech recognition can have a high cost in terms
of battery drain and network traffic.

For iOS 10 we're starting with a strict audio duration limit
of about one minute which is similar to that
of keyboard dictation.

A few words about being transparent
and respecting your user's privacy.
If you're recording the user's speech, it's a great idea
to make this crystal clear in your user interface.
Playing recording sounds and/or displaying a visual recording
indicator makes it clear to your users
that they're being recorded.
Some speech is a bad candidate for recognition.
Passwords, health data, financial information,
and other sensitive speech should not be given
to speech recognition.
Displaying speech as it is recognized like Siri
and Dictation do can also help users understand what your app
is doing.
It can also be helpful for users so that they can see
when recognition errors occur.
So, developers:
your apps now have free access
to high-performance speech recognition in dozens of languages.
But it's important to gracefully handle cases
where it's not available
or the user doesn't want your app to use it.
Transparency is the best policy.
Make it clear to the user
when speech recognition is being used.
We're incredibly excited to see what new uses
for speech recognition you come up with.

For more information and some sample code,
check out this session's webpage.
You might also be interested in some sessions on SiriKit.

There's one on Wednesday and the more advanced one later
on Thursday.
Thank you for your time and have a great WWDC.