In a previous post, we studied various open datasets that could be used to train a model for pixel-wise semantic segmentation of urban scenes. Here, we take a look at various deep learning architectures that cater specifically to time-sensitive domains like autonomous vehicles. In recent years, deep learning has surpassed traditional computer vision algorithms by learning a hierarchy of features from the training dataset itself. This eliminates the need for hand-crafted features and thus such techniques are being extensively explored in academia and industry.

Deep Learning Architectures

Prior to deep learning, semantic segmentation models relied on hand-crafted features fed into classifiers such as Random Forests and SVMs. Once deep networks had proved their mettle in image classification, researchers began using them as backbones for semantic segmentation. Their feature-learning capabilities, along with further algorithmic and network design improvements, have since helped produce fine, dense pixel predictions. We introduce one such pioneering work below, the Fully Convolutional Network (FCN), on which most later models are loosely based.

Contribution
This work reinterpreted the final fully connected layers of various ILSVRC (ImageNet Large Scale Visual Recognition Challenge) networks such as AlexNet and VGG16 as convolutional layers, turning those classifiers into fully convolutional networks. Skip-layer fusion, which decodes low-resolution feature maps into pixel-wise predictions, allowed the network to be trained end to end.
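
The reinterpretation can be checked with simple arithmetic. The sketch below assumes VGG16's first fully connected layer, which maps a 7×7×512 feature map to 4096 units; the function names are hypothetical and bias terms are ignored. It shows that the same weights can be read as a 7×7 convolution, which then yields a grid of predictions when the input feature map is larger.

```python
def fc_weight_count(in_h, in_w, in_ch, out_units):
    """Weights in a fully connected layer over a flattened feature map."""
    return in_h * in_w * in_ch * out_units


def conv_weight_count(k_h, k_w, in_ch, out_ch):
    """Weights in a convolution with a k_h x k_w kernel (bias ignored)."""
    return k_h * k_w * in_ch * out_ch


def conv_output_size(in_size, kernel, stride=1):
    """Spatial output size of an unpadded convolution along one axis."""
    return (in_size - kernel) // stride + 1


# Assumed shapes: 7x7x512 input feature map, 4096 output units.
fc6 = fc_weight_count(7, 7, 512, 4096)
conv6 = conv_weight_count(7, 7, 512, 4096)
print(fc6 == conv6)  # the weight count is identical, only the shape differs

# On the classification-sized feature map the convolution gives one output...
print(conv_output_size(7, 7))
# ...while on a larger (hypothetical) 16x16 feature map it gives a grid.
print(conv_output_size(16, 7))
```

The larger input is what turns a single class prediction into a coarse map of predictions, which the decoder then upsamples.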

Architecture
As seen in the above image, the upsampled outputs of a deeper layer are fused (by element-wise summation in the original paper) with the outputs of the preceding, shallower layer to improve the accuracy of the output. Thus, appearance information (edges) from the shallower layers is combined with coarse, semantic information from the deeper layers. The upsampling applied to the deeper layers' feature maps is also trainable, unlike conventional upsampling operations that use fixed mathematical interpolation.
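
A minimal sketch of one skip-fusion step follows, in plain Python on single-channel score maps. It assumes the element-wise summation used in the FCN paper and, for simplicity, swaps the learned upsampling for fixed nearest-neighbour repetition; the function names and the toy values are illustrative only.

```python
def upsample_nearest_2x(grid):
    """Double height and width by repeating each value (not learned)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_by_sum(coarse, fine):
    """Upsample the coarse score map 2x and add it to the finer one."""
    up = upsample_nearest_2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the coarse map size")
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[1.0, 2.0],
          [3.0, 4.0]]                     # deeper layer: 2x2 scores
fine = [[0.5] * 4 for _ in range(4)]      # shallower layer: 4x4 scores
fused = fuse_by_sum(coarse, fine)
for row in fused:
    print(row)
```

In a real network the repetition step would be a transposed convolution whose weights are trained with the rest of the model.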

Drawbacks
The authors stopped adding decoder stages once there was no further accuracy gain, so the highest-resolution features were ignored. Also, keeping the encoder feature maps available at inference time makes the process memory-intensive.

Real-Time Semantic Segmentation

After FCN, various other networks such as DeepLab (which introduced atrous convolutions) and UNet (which introduced an encoder-decoder structure) made pioneering contributions to the field of semantic segmentation. Building on these networks, state-of-the-art models such as RefineNet, PSPNet and DeepLabv3 have achieved an IoU (Intersection over Union) above 80% on benchmark datasets like Cityscapes and PASCAL VOC.
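
For readers unfamiliar with the metric, here is a small sketch of per-class IoU and its mean over classes, computed on flat lists of pixel labels. The function names and the toy labels are made up for illustration; benchmark tooling typically accumulates intersections and unions over the whole dataset rather than a single image.

```python
def iou_per_class(pred, target, num_classes):
    """IoU for each class, given flat lists of predicted and true labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        # A class absent from both prediction and ground truth has no IoU.
        ious.append(inter / union if union else None)
    return ious


def mean_iou(ious):
    """Average over the classes that actually occur."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


pred = [0, 0, 1, 1, 1, 2]      # toy predicted labels for six pixels
target = [0, 1, 1, 1, 2, 2]    # toy ground-truth labels
ious = iou_per_class(pred, target, num_classes=3)
print(ious, mean_iou(ious))
```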

But real-time domains like autonomous vehicles need to make decisions within milliseconds. As can be seen from Figure 2, the aforementioned networks are quite time-intensive. Table 1 also details the memory requirements of various models. This has encouraged researchers to explore novel network designs that contain fewer parameters and achieve output rates above 10 fps.

SegNet has far fewer trainable parameters, since its decoder layers use the max-pooling indices from the corresponding encoder layers to perform sparse upsampling. This reduces inference time at the decoder stage since, unlike in FCNs, the encoder feature maps are not involved in the upsampling. The technique also eliminates the need to learn parameters for upsampling, again unlike FCNs.
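
The idea of upsampling with pooling indices can be sketched in plain Python on a single-channel feature map. This is an illustration under simplifying assumptions (2×2 windows, stride 2, even dimensions), with hypothetical function names; it is not the authors' implementation.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max-pooling with stride 2; also returns each maximum's position."""
    pooled, indices = [], []
    for i in range(0, len(fmap), 2):
        p_row, i_row = [], []
        for j in range(0, len(fmap[0]), 2):
            window = [(fmap[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, pos = max(window, key=lambda item: item[0])
            p_row.append(value)
            i_row.append(pos)
        pooled.append(p_row)
        indices.append(i_row)
    return pooled, indices


def unpool_with_indices(pooled, indices, out_h, out_w):
    """Put each pooled value back at its stored position; the rest stay 0."""
    out = [[0] * out_w for _ in range(out_h)]
    for p_row, i_row in zip(pooled, indices):
        for value, (r, c) in zip(p_row, i_row):
            out[r][c] = value
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 1]]
pooled, idx = max_pool_2x2_with_indices(fmap)
sparse = unpool_with_indices(pooled, idx, 4, 4)
for row in sparse:
    print(row)
```

Only the positions need to be stored, not the full encoder feature map, and the unpooling step itself has no weights. In SegNet the resulting sparse map is then densified by trainable convolutions in the decoder.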

Architecture

The SegNet architecture adopts the VGG16 network along with an encoder-decoder framework wherein it drops the fully connected layers of the network. The decoder sub-network is a mirror copy of the encoder sub-network, both containing 13 layers. Figure 3(B) shows how SegNet and FCN carry out their feature map decoding.

The ENet authors created a lightweight network using an asymmetric encoder-decoder architecture. In addition, they made various other architectural decisions such as early downsampling, dilated and asymmetric convolutions, omitting bias terms, parametric ReLU activations and Spatial Dropout.
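
To see why asymmetric convolutions help, compare the weight count of a full n×n convolution with that of an n×1 convolution followed by a 1×n convolution. The sketch below is plain arithmetic; the kernel size of 5 and the channel count of 128 are illustrative assumptions, and bias terms are left out, as in the network described here.

```python
def full_conv_weights(n, channels):
    """Weights in one n x n convolution, channels in and channels out."""
    return n * n * channels * channels


def asymmetric_conv_weights(n, channels):
    """Weights in an n x 1 convolution followed by a 1 x n convolution."""
    return 2 * n * channels * channels


full = full_conv_weights(5, 128)
asym = asymmetric_conv_weights(5, 128)
print(full, asym, asym / full)
```

The saving grows with the kernel size, because the cost goes from n² to 2n per input-output channel pair.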

Architecture

Efficient Neural Network (ENet) aims to reduce inference time by cutting down the large number of floating point operations present in previous architectures. This too is an encoder-decoder based architecture, with the difference that the encoder is much larger than the decoder. Here, the authors take inspiration from ResNet-based architectures: there is a single main branch, with extensions (convolutional filters) that separate from it and merge back using element-wise addition, as shown in Figure 4(B)(2).

The authors also omit bias terms and observed no loss of accuracy from doing so. They employ early downsampling (Figure 4(C)), which reduces the image dimensions and hence saves on costly computations at the beginning of the network. Using dilated convolutions gives the network a wider receptive field without having to downsample the feature maps aggressively.
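
The effect of dilation on the receptive field can be worked out directly. A kernel of size k with dilation d covers k + (k - 1)(d - 1) input positions along each axis, and for stride-1 layers each layer adds that span minus one to the receptive field. The sketch below uses hypothetical dilation rates of 1, 2, 4 and 8 for illustration; they are not taken from the paper.

```python
def effective_kernel(kernel, dilation):
    """Span covered along one axis by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    layers is a list of (kernel, dilation) pairs, first layer first.
    """
    rf = 1
    for kernel, dilation in layers:
        rf += effective_kernel(kernel, dilation) - 1
    return rf


plain = receptive_field([(3, 1)] * 4)
dilated = receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)])
print(plain, dilated)
```

Both stacks have the same number of weights and keep the feature map resolution, yet the dilated stack sees a much larger neighbourhood.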

This model has been tested by the authors on the Nvidia Jetson TX1 Embedded Platform and code may be found here.

Figure 5 [Source] : ICNet architecture with its three branches for multi-scale inputs

Contribution

ICNet (Image Cascade Network) cascades feature maps from various resolutions. This is done to exploit the processing efficiency of low-resolution images and high inference quality of high-resolution images. A representation of this logic is shown in Figure 5.

Instead of trying intuitive strategies such as downsampling inputs (as in ENet), downsampling feature maps or model compression (removing feature maps entirely), the authors use a CFF (Cascade Feature Fusion) unit to merge feature maps of low resolutions with those of high resolutions.

Architecture

Here the input image is fed into three different branches at resolutions of ¼, ½ and 1. Each branch reduces the spatial resolution of its input to ⅛, and hence the outputs of the three branches are ¹/₃₂, ¹/₁₆ and ⅛ of the original image size. The output of branch1 (at ¹/₃₂) is fused with the output of branch2 (at ¹/₁₆) using the previously mentioned CFF unit. A similar operation is performed for branch2 and branch3, and the final output of the network is a map that is ⅛ of the original size.
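
The resolution bookkeeping above can be verified with exact fractions. The sketch below assumes a 1024×2048 input, the size of a Cityscapes image; the function names are hypothetical.

```python
from fractions import Fraction


def branch_output_scales(input_scales, downsample_factor=8):
    """Output scale of each branch relative to the original image."""
    return [scale / downsample_factor for scale in input_scales]


def feature_map_size(height, width, scale):
    """Feature map height and width at a given relative scale."""
    return int(height * scale), int(width * scale)


inputs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 1)]
outputs = branch_output_scales(inputs)
print(outputs)

# Assumed input size: 1024 x 2048 pixels.
for scale in outputs:
    print(scale, feature_map_size(1024, 2048, scale))

# Neighbouring branches differ by a factor of 2, so the lower-resolution
# map has to be upsampled 2x before the two can be fused.
print(outputs[1] / outputs[0], outputs[2] / outputs[1])
```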

Since convolutional parameters are shared between the ¼ and ½ resolution branches, the network size is also reduced. Also, during inference, branch1 and branch2 are completely discarded, and only branch3 is utilised. This leads to a computation time of a mere 9ms.

The branches use the design of PSPNet50 (a 50-layer deep ResNet for semantic segmentation). They contain 50, 17 and 3 layers respectively.

Final Thoughts

Various architectures have made novel improvements in the way 2-dimensional data is processed through data graphs. Although embedded platforms continue to improve with more memory and FLOPS capability, the above architectural and mathematical improvements have led to major leaps in semantic segmentation network outputs. With state-of-the-art networks, we can now achieve an output rate (in fps) that is close enough to image acquisition rates, and with acceptable quality (in mIoU) for autonomous vehicles.

Bio: Prerak Mody is a Computer Vision researcher at Playment, which partners with computer vision teams to annotate their data with a fully managed enterprise-grade solution at scale. He has previously worked as a Data Scientist at Postdot Tech (the famous Postman API client) and Leaf Tech (a home automation company).