Robots can now learn household tasks by watching YouTube videos

This site may earn affiliate commissions from the links on this page. Terms of use.

Astute followers of artificial intelligence may recall a moment from three years ago, when Google announced it had birthed unto the world a computer able to recognize cats using only videos uploaded by YouTube users. At the time, this represented something of a high water mark in AI. To get an idea for how far we have come since then, one has only to reflect on recent advances in the RoboWatch project, an endeavor that is teaching computers to learn complex tasks using instructional videos posted on YouTube.

That innocent “learn to play guitar” clip you posted on your YouTube video feed last week? It may someday contribute to putting Carlos Santana out of a job. That’s probably pushing it; it’s more likely that thousands of home nurses and domestic staff will be axed long before guitar gods have to compete with robots. A recent groundswell of interest in bringing robots into the marketplace as caregivers for the elderly and infirm, in part fueled by graying population bases throughout the developed world, has created the necessity for teaching robots simple household tasks. Enter the RoboWatch project.

Most advanced forms of AI currently in use rely upon a branch of supervised machine learning, which requires large datasets to be “trained” on. The basic idea is that when provided with a sufficiently large database of labeled examples, the computer can learn to recognize what differentiates the items within the training set, and later apply that classifying ability to new instances it encounters. The one drawback to this form of artificial intelligence is that it requires large databases of labeled examples, which are not always available or require much human curation to create.

RoboWatch is taking a different tack, using what’s called unsupervised learning to discover the important steps in YouTube instructional videos without any previous labeling of data. Take for instance a YouTube video on omelet making. Using the RoboWatch method, the computer successfully parsed the video on omelet creation and catalog the important steps without having first been trained with labeled examples.

Color code activity steps and automatically generated captions, all created by the RoboWatch algorithm for making an omelet.

It was able to do this by looking at a large amount of instructional omelet-making videos on YouTube and creating a universal storyline from their audio and video signals. As it turns out, most of these videos will contain certain identical steps, such as cracking the eggs, whisking them in a bowl, and so on. When presented with enough video footage, the RoboWatch algorithm can tease out what the essential parts of the process are and what is arbitrary, creating a kind of archetypal omelet formula. It’s easy to see how unsupervised learning could quickly enable a robot to gain a vast assortment of practical household know-how while keeping human instruction to a minimum.

The RoboWatch project follows similar advances in video captioning pioneered at Carnegie Mellon University. Earlier this year, we reported on a project headed by Dr. Eric Xing, which seeks to use real-time video summarization to detect unusual activity in video feeds. This could lead to surveillance cameras with the built-in ability to detect suspicious activity. Putting these developments together, it’s clear unsupervised learning models using video footage are likely to pave the way for the next breakthrough in artificial intelligence, one that will see robots entering our lives in ways that are likely to both scare and fascinate us.

Tagged In

I wonder what happens if robots watch people beating each other up or shooting someone on youtube… maybe it’ll ‘tease out what the essential parts of the process are and what is arbitrary, creating a kind of archetypal -death kill- formula’. In time they’ll be self-aware and start to have feelings.

Mr_Blastman

What about all those ED 209 clips on youtube?

Gil G

Hopefully suicide booths will be appearing on every street corner as per “Futurama.”

XenoSilvano

It would be rather creepy to live in society where video analysis algorithms via surveillance cameras get so good that they could accurately detect more about you from your minute body language than any human being ever could, very 1984.

Jeffrey Davis

Agreed to extreme. This video analysis tech better not go Skynet… :(

XenoSilvano

Sad thing is that technological advancements usually always are, they have to test it out somehow, what better way than the populous.

Rajab Natshah

Good Good Goooood

Garu Derota

youtube is full of stupid or wrong stuff. being italian, I can assure you many many video on “how to cook italian food” are totally wrong (just to remain in the omelette range)

This site may earn affiliate commissions from the links on this page. Terms of use.

ExtremeTech Newsletter

Subscribe Today to get the latest ExtremeTech news delivered right to your inbox.

Email

This newsletter may contain advertising, deals, or affiliate links. Subscribing to a newsletter indicates your consent to our
Terms of Use and
Privacy Policy. You may unsubscribe from the newsletter at any time.