Designing Convolutional Neural Networks for Urban Scene Understanding

Abstract

Semantic segmentation is a fundamental problem in the computer vision community. The task is particularly challenging for urban street scenes, where object scales vary significantly. Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvements over previous semantic segmentation systems. In this work, we show how to improve semantic understanding of urban street scenes by manipulating convolution-related operations in ways that are better suited for practical use. First, we implement dense upsampling convolution (DUC) to generate pixel-level predictions, which is able to capture and decode detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields of the network to aggregate global information; and 2) alleviates what we call the "gridding" issue caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset and achieve a new state-of-the-art result of 80.1% mIoU on the test set. We also achieve state-of-the-art results on the KITTI road estimation benchmark and the PASCAL VOC 2012 segmentation task.
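The two operations named above can be sketched in plain Python. The first function illustrates the DUC decode step: a feature map whose channels hold r*r sub-pixel predictions per location is rearranged into a prediction r times larger in each spatial dimension (the exact channel ordering here is an assumption, not the paper's specification). The second function is a 1-D toy that shows why HDC helps: it computes which input offsets can reach a single output unit through a stack of 3-tap dilated convolutions, exposing the gridding holes left by a constant dilation rate.

```python
def duc_decode(feat, r):
    """DUC decode step (sketch): rearrange an h x w grid of (r*r*c)-channel
    vectors into an (h*r) x (w*r) grid of c-channel vectors.
    The (di, dj) -> channel-block mapping below is an illustrative assumption."""
    h, w = len(feat), len(feat[0])
    c = len(feat[0][0]) // (r * r)
    out = [[None] * (w * r) for _ in range(h * r)]
    for i in range(h):
        for j in range(w):
            for di in range(r):
                for dj in range(r):
                    base = (di * r + dj) * c
                    out[i * r + di][j * r + dj] = feat[i][j][base:base + c]
    return out


def receptive_positions(dilations, k=3):
    """1-D view of the receptive field: which input offsets feed one output
    unit after a stack of k-tap convolutions with the given dilation rates.
    Gaps in the returned set are exactly the 'gridding' artifact."""
    covered = {0}
    for d in reversed(dilations):
        covered = {i + t * d for i in covered
                   for t in range(-(k // 2), k // 2 + 1)}
    return sorted(covered)


# A constant dilation rate of 2 sees only every other input position:
print(receptive_positions([2, 2, 2]))  # even offsets only -> gridding
# An HDC-style schedule of rates (1, 2, 3) covers the field densely:
print(receptive_positions([1, 2, 3]))  # contiguous offsets, no holes
```

Running the gridding check shows that three stacked rate-2 convolutions touch only even offsets in a 13-pixel window, while rates (1, 2, 3) cover all 13 positions with the same overall receptive field size, which is the intuition behind choosing varying dilation rates in HDC.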