Google’s AI watched thousands of hours of TV to learn how to read lips better than you

Researchers from Google’s UK-based artificial intelligence division DeepMind have collaborated with scientists from the University of Oxford to develop the world’s most advanced lip-reading software – and it probably reads lips better than you.

To accomplish this, the researchers fed thousands of hours of TV footage from the BBC to a neural network, training it to annotate videos based on mouth movement analysis with an accuracy of 46.8 percent.

For context, when tasked with captioning the same video, a professional human lip-reader proved to be almost four times less accurate, identifying the right word only 12.4 percent of the time.

The research builds on previously published work from the University of Oxford, which used similar techniques to build LipNet, a lip-reading program that could read video recordings of volunteers speaking simple sentences with an accuracy of over 90 percent.

However, unlike Oxford’s program, DeepMind’s software – dubbed “Watch, Listen, Attend, and Spell” – was trained and tested on much more challenging footage.

During training, Google’s neural network watched 5,000 hours of footage from popular TV shows including Newsnight, Question Time and The World Today. The videos featured over 110,000 different sentences and approximately 17,500 unique words. By comparison, LipNet read a total of 51 unique words.

Here’s how the Google researchers sum up the scope and goals of their study:

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem – unconstrained natural language sentences, and in the wild videos

DeepMind speculates that, besides helping people with impaired hearing, the newly developed software could support a wide range of applications, including annotating films and communicating with digital assistants like Siri and Alexa using lip movements alone.