Eyes on the Road: How Autonomous Cars Understand What They’re Seeing

To an untutored eye, they’re a multicolored collection of boxes, blocks and numbers. To a seasoned observer, they’re data readings from fisheye cameras, lidar and other sensors. But to an autonomous car, they’re details of the profoundly complex world that it’s navigating.

You’ll see a flood of demos at this year’s International Consumer Electronics Show, in Las Vegas, from carmakers and researchers that show the road from the perspective of the new breed of autonomous driving systems.

While the technology these new systems unleash is awesome, marshalling some 24 trillion deep learning operations a second, the feats they achieve may need a bit of explaining.

So, here’s a quick guide to what you’re seeing — and what you’re not seeing — when viewing an autonomous driving demo.

What You’re Seeing

There are two big categories of recognition you’ll see demonstrated:

Semantic segmentation: This is the ability to label the pixels (the tiny dots that make up a computer image) that belong to particular classes of object. We can see that in the example below. The road is blue. People are orange. Cars are red. If the computer can figure out what’s in the image at this level of detail, then we have greater confidence in the ability of an autonomous system to navigate safely.
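To make the idea of per-pixel labels concrete, here is a minimal sketch in plain Python. The class ids, the tiny 3x4 "image" and the helper names are all made up for illustration; only the color choices (blue road, orange people, red cars) come from the description above. A real system would get the label grid from a trained network rather than typing it in by hand.

```python
# Hypothetical class ids mapped to RGB colors. The colors follow the text;
# the ids themselves are an assumption for this sketch.
CLASS_COLORS = {
    0: (0, 0, 0),        # background: black
    1: (0, 0, 255),      # road: blue
    2: (255, 165, 0),    # person: orange
    3: (255, 0, 0),      # car: red
}


def colorize(label_map):
    """Turn a 2D grid of class ids into a 2D grid of RGB tuples."""
    return [[CLASS_COLORS[class_id] for class_id in row] for row in label_map]


def class_pixel_counts(label_map):
    """Count how many pixels were assigned to each class id."""
    counts = {}
    for row in label_map:
        for class_id in row:
            counts[class_id] = counts.get(class_id, 0) + 1
    return counts


# A toy 3x4 segmentation result: one class id per pixel.
labels = [
    [0, 0, 3, 3],
    [1, 2, 3, 3],
    [1, 1, 1, 1],
]

overlay = colorize(labels)
counts = class_pixel_counts(labels)
print(overlay[1][1])  # color of the single "person" pixel
print(counts)
```

The point of the sketch is that segmentation output has the same shape as the image itself: every pixel gets a class, and the colored overlay you see in a demo is just that grid painted with one color per class.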

Object detection: This is the ability to bound the location of an object with a box. You’ll see videos that show our ability to detect many classes of objects at the same time. In the example below, we have a detector that we’ve designed to identify people and cars. Bounding boxes are a simpler way than segmentation to describe the location of an object.
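One way to see why a box is the simpler description: a box is just four numbers, while segmentation needs a label for every pixel. The sketch below collapses a toy grid of per-pixel class ids into a box for one class. The grid, the class ids and the `bounding_box` helper are hypothetical, and a real detector predicts boxes directly from the image rather than deriving them from a segmentation.

```python
def bounding_box(label_map, class_id):
    """Return (x_min, y_min, x_max, y_max) in inclusive pixel indices
    covering every pixel labeled class_id, or None if the class is absent."""
    xs, ys = [], []
    for y, row in enumerate(label_map):
        for x, value in enumerate(row):
            if value == class_id:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Toy per-pixel labels (assumed ids: 0 background, 1 road, 2 person, 3 car).
labels = [
    [0, 0, 3, 3],
    [1, 2, 3, 3],
    [1, 1, 1, 1],
]

car_box = bounding_box(labels, 3)
person_box = bounding_box(labels, 2)
print(car_box, person_box)
```

Note what gets lost along the way: the box around the road would cover the person pixel too, which is exactly the detail that segmentation keeps and a box throws away.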

What You’re Not Seeing

So, semantic segmentation and object detection are what you’ll notice on the screen during most demos of how autonomous driving systems see the road.

But here’s what you’re not seeing: the key role played by deep learning in driving all this. Deep learning lets us do what programmers can’t do by hand. Writing explicit rules that could recognize every class of object on the road is just not practical. There’s simply too much stuff.

The solution is to teach machines to teach themselves. Deep learning allows us to specify a complex goal for a neural network. If the goal is formulated in the right way, and paired with the right recipes, or algorithms, the network can figure out a way to do complicated things.

We use the tremendous computational power of NVIDIA GPUs to train these networks. GPUs are ideally suited for deep learning because they can tackle many tasks at once, or in parallel.

We’re using a one-shot detection and segmentation architecture based on recent advanced deep learning networks like GoogLeNet and VGG. One-shot means the network takes in a full image and spits out detections or pixel labeling for segmentation.

That lets automakers quickly train systems, using video from real-world driving, to recognize objects in a vast array of situations. The more data they throw into these deep learning systems, the smarter they get. And it enables them to measure their results against real-world image recognition benchmarks managed by independent groups of researchers, such as the KITTI benchmark suite, and so see how they’re doing relative to their peers.
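Detection benchmarks commonly score a predicted box by how much it overlaps a hand-labeled box, a measure known as intersection over union (IoU). The sketch below shows that calculation on two made-up boxes. The box coordinates and the `iou` helper are illustrative assumptions, not the evaluation code of any particular benchmark.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as
    (x_min, y_min, x_max, y_max) with continuous edge coordinates."""
    # Edges of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: what a detector predicted vs. what a person labeled.
predicted = (0.0, 0.0, 10.0, 10.0)
labeled = (5.0, 0.0, 15.0, 10.0)

score = iou(predicted, labeled)
print(round(score, 3))
```

A score of 1.0 means the two boxes match exactly and 0.0 means they don’t touch. A benchmark then picks an overlap threshold above which a detection counts as correct, which is what makes results from different teams comparable.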

How to Judge What You’re Seeing

So, the next time you see a demo, ask three questions. How are these systems trained to understand situations they’ve never seen before? Are they running in real time? How do these results compare to benchmarks computer scientists use to gauge the accuracy of these computer vision systems?
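The real-time question can be made measurable. As a rough sketch, time how long the system takes to process a batch of frames and compare the resulting frame rate with the rate at which the camera delivers them. Everything here is an assumption for illustration: the 30 frames per second camera rate, the `fake_perception` stand-in for a real network, and the helper names.

```python
import time


def measure_fps(process_frame, frames):
    """Run process_frame on every frame and return frames processed per second."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def keeps_up(measured_fps, camera_fps=30.0):
    """True if processing is at least as fast as the camera (assumed 30 fps)."""
    return measured_fps >= camera_fps


def fake_perception(frame):
    """Stand-in for a real network: just sums the pixel values."""
    return sum(frame)


# Fifty dummy "frames" of 1,000 pixels each.
frames = [[0] * 1000 for _ in range(50)]

fps = measure_fps(fake_perception, frames)
print(f"{fps:.0f} frames per second, keeps up with camera: {keeps_up(fps)}")
```

The number only means something when it is measured on the hardware that will ride in the car, with the full pipeline running, so it is worth asking where and how a demo’s frame rate was measured.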