New computer vision challenge wants to teach robots to see in 3D

Computer vision is ready for its next big test: seeing in 3D. The ImageNet Challenge, which has boosted the development of image-recognition algorithms, will be replaced by a new competition next year that aims to help robots see the world in all its depth.

Since 2010, researchers have trained image recognition algorithms on the ImageNet database, a go-to set of more than 14 million images hand-labelled with information about the objects they depict. The algorithms learn to classify the objects in the photos into different categories, such as house, steak or Alsatian. Almost all computer vision systems are trained like this before being fine-tuned on a more specific set of images for different tasks.
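That pretrain-then-fine-tune recipe can be sketched in miniature. In the toy example below, a frozen random projection stands in for an ImageNet-pretrained backbone, and only a small linear "head" is trained on a new task; the data, network sizes and learning rate are all illustrative assumptions, not drawn from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": stands in for a network pretrained on ImageNet.
# Its weights are never updated during fine-tuning.
W_backbone = rng.normal(size=(64, 16))

def features(x):
    return np.tanh(x @ W_backbone)  # fixed feature extractor

# Tiny synthetic "specific task": binary labels from a hidden rule.
X = rng.normal(size=(200, 64))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Trainable head: logistic regression on the frozen features.
w = np.zeros(16)
b = 0.0
lr = 0.1

def loss(w, b):
    p = 1 / (1 + np.exp(-(features(X) @ w + b)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

initial_loss = loss(w, b)
for _ in range(200):
    p = 1 / (1 + np.exp(-(features(X) @ w + b)))
    grad_w = features(X).T @ (p - y) / len(y)
    w -= lr * grad_w
    b -= lr * np.mean(p - y)
final_loss = loss(w, b)
```

Only the head's handful of parameters move during training, which is why fine-tuning needs far less labelled data than the original pretraining.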

In 2015, a team from Microsoft built a system that was over 95 per cent accurate, surpassing human performance for the first time in the challenge’s history. And photo apps from Google and Apple allow people to search their photo collections using terms like food or baby. Google Photos even classifies images by abstract concepts like “happiness”.

“When we were starting the project, these were not things that industry had done yet,” says Alex Berg at the University of North Carolina at Chapel Hill, who is one of the competition’s organisers. “Now they are products that millions of people are using.”

Introducing the real world

So the ImageNet team say it's time for a fresh challenge in 2018. Although the details of this competition are still being worked out, it will tackle a problem computer vision has not yet mastered: making systems that can classify objects in the real world, not just in 2D images, and describe them using natural language.


“There is very little work on putting a 3D scene through a machine-learning algorithm,” says Victor Prisacariu at the University of Oxford. Building a large database of images complete with 3D information would allow robots to be trained to recognise objects around them and map out the best route to get somewhere. This database would largely comprise images of scenes inside homes and other buildings.

The existing ImageNet database consists of images collected from across the internet and then labelled by hand, but these lack the depth information needed to understand a 3D scene. The database for the new competition could consist of digital models that simulate real-world environments or 360-degree photos that include depth information, says Berg. But first someone must make these images. As this is difficult and costly, the data set is likely to be a lot smaller than the one for the original challenge.
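The difference a depth channel makes can be shown with a short sketch: given a per-pixel depth map and the pinhole camera model, every pixel lifts to a 3D point, which a plain 2D photo cannot provide. The camera intrinsics (fx, fy, cx, cy) and the flat-wall scene below are illustrative assumptions only.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an H x W depth map (in metres) into an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx  # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy scene: a flat wall 2 metres away, seen by a 4x4 depth sensor.
depth = np.full((4, 4), 2.0)
cloud = depth_to_points(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
```

A robot trained on data like this sees geometry directly, not just pixel colours, which is what makes route-planning and object manipulation tractable.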

Robot vision is ready for its ImageNet moment, says Andrew Davison at Imperial College London. He is already working on the next generation of in-home robots that will take over from devices such as the floor-cleaning Roomba. These will need to know how to deal with objects and manipulate the world around them, he says. “I really think you need this detailed 3D understanding both of the shape of the world, but also a semantic understanding of what’s in it,” he says.

The new challenge will also assist augmented and virtual reality, says Davison. Knowing where objects are in the real world will help augmented reality systems like the Microsoft HoloLens depict virtual objects within it. “It’s very much the same capability,” he says.

Berg isn’t expecting major progress in the first couple of years of the new challenge, but he has an idea of what success might look like. Eventually, he would like to see robots that can consistently understand the environment around them and explain what they see just as well as a human can. However, achieving either of these things is more than five years away, he says.