Not so long ago, automatic speech recognition was a niche technology used by very few people. That changed in 2011 with the release of Siri and Google Voice Search, when both Apple and Google made speech technology a key feature of their smartphones. Despite recent improvements, speech recognition remains a difficult problem, and even the best systems make errors.

It’s hard to appreciate why this is, given that understanding speech is something even young children can do with relative ease. Though if you know any young kids, you’ll know that they don’t always let on that they’ve understood!

Here are just a few of the things that make speech recognition hard for computers:

Peopledonotleavegapsbetweenwords. While your brain hears speech as a series of discrete words, closer listening to the audio shows that the boundary between words is fuzzy. Just try listening to someone speaking a foreign language to hear how difficult it is to pick out the individual words.
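To get a feel for the boundary problem, here is a toy sketch in text form (real recognisers work on audio, not letters, so this is purely illustrative): given an unspaced string and a small hand-picked dictionary, a short dynamic-programming routine can recover one possible word segmentation.

```python
def segment(text, vocab):
    """Split an unspaced string into dictionary words, if possible.

    parses[i] holds a list of words covering text[:i], or None if
    no segmentation of that prefix exists.
    """
    parses = [None] * (len(text) + 1)
    parses[0] = []  # the empty prefix is trivially segmented
    for i in range(1, len(text) + 1):
        for j in range(i):
            # Extend a known parse of text[:j] with a dictionary word.
            if parses[j] is not None and text[j:i] in vocab:
                parses[i] = parses[j] + [text[j:i]]
                break
    return parses[len(text)]


# A tiny illustrative vocabulary, chosen just for this example.
vocab = {"people", "do", "not", "leave", "gaps", "between", "words"}
print(segment("peopledonotleavegapsbetweenwords", vocab))
# → ['people', 'do', 'not', 'leave', 'gaps', 'between', 'words']
```

With a realistic vocabulary there are usually many competing segmentations, which is exactly why recognisers need more than a dictionary to pick the right one.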

People speak sloppily and ungrammatically. They slur words together, mispronounce them, stop part-way through words to correct themselves, and don’t always speak in coherent sentences.

Building a speech recognition system requires many hours of transcribed audio to train a model of the acoustics of speech. This data is expensive to obtain, and even expert transcribers make mistakes.

Today’s speech recognisers need to cope with many different speakers, each of whom has their own accent, speaking rate and style.

Background noise obscures speech, making it harder to recognise. Different types of noise have different characteristics – street noise, for example, is very different from the noise inside a car – and a computer must be able to cope with all of them.

Our smartphones and devices all have different makes and models of microphone, each of which distorts speech in a subtly different way.

Homophones – words or phrases that sound the same, such as “recognise speech” and “wreck a nice beach” – are impossible to tell apart without more information.

The human ear is really good at filtering out noise and the subtle changes from using different microphones. We’re also great at quickly adapting to how a new speaker talks, and using context and real world knowledge to disambiguate things that we haven’t heard properly. On the other hand, we need to explicitly tell computers how to do these things, which means inventing and trialling a bunch of ways to find out what works and what doesn’t.
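One way to give a computer a crude sense of context, sketched here as a toy: score competing transcriptions of the same audio with a smoothed bigram language model. The corpus, counts and the homophone pair below are all made up for illustration – real systems train on vastly more text and combine the language model with acoustic scores.

```python
import math
from collections import Counter

# A tiny stand-in for real training text (illustrative only).
corpus = (
    "it is hard to recognise speech . speech recognition is hard . "
    "they went to the beach . a nice beach is hard to find ."
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)


def score(sentence, alpha=0.5):
    """Add-alpha smoothed bigram log-probability of a word sequence."""
    words = sentence.split()
    vocab_size = len(unigrams)
    logp = 0.0
    for prev, word in zip(words, words[1:]):
        # Counter returns 0 for unseen keys, so smoothing handles
        # bigrams and words the model has never encountered.
        numerator = bigrams[(prev, word)] + alpha
        denominator = unigrams[prev] + alpha * vocab_size
        logp += math.log(numerator / denominator)
    return logp


# Two acoustically near-identical hypotheses for the same audio:
a = "it is hard to recognise speech"
b = "it is hard to wreck a nice beach"
print(score(a) > score(b))  # → True: the model prefers the likelier word sequence
```

The acoustics alone cannot separate the two hypotheses, but the language model can, because one word sequence is far more plausible than the other given what it has seen before.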