Carl Vondrick is a doctoral candidate and researcher at MIT, where he studies computer vision and machine learning. His research focuses on leveraging large-scale data with minimal annotation, with applications to predictive vision and scene understanding.

Recently, his work has received wide media attention, including features in Forbes, Wired, CNN, PopSci, and other outlets worldwide. As part of his work with MIT CSAIL, Carl built a deep learning vision system that lets AI learn and understand human behaviour and interactions, using YouTube videos and popular TV shows like The Office and Desperate Housewives. The resulting algorithm analyzes videos, then uses what it learns to predict how humans will behave.

My research studies computer vision and machine learning, a field that teaches machines to understand images and videos. Visual perception models today typically excel with large amounts of labeled data; however, this is difficult to scale because annotating data is expensive. My work seeks to efficiently leverage unlabeled data and human supervision in order to train machines to understand complex scenes. By developing methods to economically scale computer vision to huge datasets, I believe we can build more powerful and more accurate perception systems.

What do you feel are the leading factors enabling recent advancements in machine vision?

I think it is the combination of two factors. The first factor is the availability of massive annotated datasets and the capability to process them. However, data alone is not enough, which leads to my second factor: the understanding of how to use large-scale data.

A few years before deep learning proved to be state-of-the-art, my coauthors and I tried collecting massive datasets in an attempt to “brute force” the vision problem. But we ran into several issues because our models did not have the capacity to take advantage of this large volume of data. Deep learning provides a class of models with a high learning capacity, but this power doesn’t usually surface until you throw massive datasets at it. You had to put together both the data and the models.

In other words, machines need both brains (efficient algorithms) and brawn (big data). We have a better grasp on these now.

What present or potential future applications of machine vision excite you most?

I am most excited by how computer vision may become the main interface for artificial agents to understand our world. Vision is one of the richest natural senses for humans, so I expect it will also be a crucial sensor for robots as they fulfill more human roles. Machines are fairly capable at solving problems where structure exists already (such as board games). Visual data is challenging because we have to infer this structure from a high-dimensional, unconstrained image. However, if we can develop algorithms to convert images into a structured format, then I think artificial intelligence may start discovering interesting insights about people and the world.

One potential application that excites me is machines inside our home that analyze human behavior and help us develop a healthier and more effective lifestyle. My collaborator Hamed and I worked on a vision system that watches Olympic sports and gives feedback on how the athletes might be able to improve. There are a couple of efforts to commercialize ideas like this, but usually you have to wear a sensor or manually enter data, which becomes tedious (and I personally forget to charge batteries). Vision is an unobtrusive interface in this respect. If machines can understand vision and speech, we wouldn’t have to adjust our daily routines for machines to start making sense of us.

Which industries will be most disrupted by machine intelligence?

I think machine perception can revolutionize industries where insight can be acquired from data at scale. This will likely happen soon in places like medicine, food, and inside the home. For example, computer vision may provide an affordable, more accurate procedure to screen people for medical issues. I don’t know if vision algorithms will completely replace doctors or farmers or any other professional, but they can give humans better information.

What developments can we expect to see in machine intelligence in the next 5 years?

One major trend will be machines starting to learn without much human supervision. There is a tremendous amount of rich yet unlabeled data available, and it would be a breakthrough if we could fully capitalize on it. Moore’s law suggests computation will keep scaling exponentially over time, but model accuracy usually scales logarithmically in the amount of labeled data. This implies the bottleneck to machine intelligence will not be computing power, but rather how much data we can label. We need to leverage unlabeled data and efficient annotation algorithms.
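
Editor's illustration (not part of the interview): the scaling argument above can be sketched with a toy model. It assumes accuracy follows a + b * log10(n) for n labeled examples. The constants a and b, and the function names, are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from any real system.

```python
import math

# Hypothetical constants for illustration only; not fitted to real data.
A = 0.2   # assumed accuracy with a single labeled example
B = 0.1   # assumed accuracy gained per tenfold increase in labels


def accuracy(num_labels, a=A, b=B):
    """Toy model: accuracy grows with the log10 of the labeled-set size."""
    return a + b * math.log10(num_labels)


def labels_needed(target, a=A, b=B):
    """Invert the toy model to get the labeled-set size for a target accuracy."""
    return 10 ** ((target - a) / b)


if __name__ == "__main__":
    # Each fixed step in accuracy multiplies the labeling budget by the same factor.
    for target in (0.5, 0.6, 0.7):
        print(f"target accuracy {target:.1f}: about {labels_needed(target):,.0f} labels")
```

Under these assumed constants, every additional 0.1 of accuracy costs ten times more labels than the previous step. That is the sense in which labeling, rather than computation, could become the bottleneck.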

I think machines that reason about several modalities are around the corner. While vision is one of the most important senses, one strength of machine learning is to integrate information across many different domains. For example, my collaborator Yusuf and I are working on a project to transfer knowledge between modalities, and our experiments suggest cross-modal learning is a powerful signal for teaching machines. This capability would allow machines to learn in one domain yet apply it elsewhere, opening up possibilities such as learning from text or from virtual worlds.

Finally, I believe that predictive vision is going to start working well soon. A grand challenge in AI is to build machines that can take actions in the world. I think anticipation will be a crucial ingredient for machine action. In order to plan, robots will need to understand what outcome their actions will have, and anticipate the future state of the world. This is a challenging problem, and anticipation is a key subproblem. Fortunately, unlabeled video is available on massive scales and naturally provides a signal about time. Our work takes some early steps at leveraging these signals, and I can’t wait to see how far we can push it.