READING

Hayder et al. introduce a framework for instance-level semantic segmentation called the Shape-Aware Instance Segmentation (SAIS) network. The main idea is to combine the Region Proposal Network (RPN) of [] with the newly proposed Object Mask Network (OMN). The motivation is to let the network produce good instance-level segmentations even when the bounding box proposal does not fully cover the object, as illustrated in Figure 1.

Figure 1: Illustration of the motivation for proposing the so-called Object Mask Networks (OMNs). First column: the original image with ground truth instance segmentation. Second column: bounding box proposal for the object (top) and the retrieved instance segmentation when predicting binary masks restricted to the bounding box (bottom). Right column: illustration of the distance transform based representation (top) used to allow instance segmentations beyond the initial bounding box (bottom).

Object Mask Networks (OMNs) are regular neural networks that predict what Hayder et al. call a shape-aware mask representation. In particular, given a bounding box, they predict not a binary mask but a per-pixel distance transform (as illustrated in Figure 1). A distance transform stores, at each pixel, the Euclidean distance to the nearest object boundary pixel. Hayder et al. additionally truncate the distance transform at a maximum value $R$. The truncated values are then quantized into $K$ bins, i.e., each distance transform value is represented by a $K$-dimensional binary vector such that the distance $D(p)$ for pixel $p$ can be expressed as

$D(p) = \sum_{n = 1}^K r_n \cdot b_n(p)$, $\sum_{n = 1}^K b_n(p) = 1$

where $r_n$ is the distance value associated with the $n$-th bin and the $b_n(p)$ form a one-hot binary vector that is predicted by the network. Given the per-pixel distance transform, the final object mask is obtained by placing a disk of radius $D(p)$ at each pixel $p$ and taking the union of these disks. This operation can be expressed as a convolution, which allows it to be integrated into the overall network structure; see the paper for details.
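The encoding and decoding described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, distances are measured to the nearest background pixel (a stand-in for the boundary distance) and rounded down, the bin values are assumed to be $r_n = n - 1$ with $K = R + 1$, and the disks are taken to be open so that they never spill into the background.

```python
import numpy as np

def truncated_distance_transform(mask, R):
    """Distance from each object pixel to the nearest background pixel,
    rounded down and capped at R. Background pixels get 0. (Assumed convention.)"""
    mask = np.asarray(mask, dtype=bool)
    D = np.zeros(mask.shape, dtype=int)
    oy, ox = np.nonzero(mask)      # object pixels
    by, bx = np.nonzero(~mask)     # background pixels
    if oy.size == 0:
        return D
    if by.size == 0:               # no background at all: everything is capped
        D[mask] = R
        return D
    # Brute-force pairwise squared distances: fine for small crops only.
    d2 = (oy[:, None] - by[None, :]) ** 2 + (ox[:, None] - bx[None, :]) ** 2
    d = np.floor(np.sqrt(d2.min(axis=1))).astype(int)
    D[oy, ox] = np.minimum(d, R)
    return D

def one_hot_encode(D, K):
    """Return bin values r (assumed 0..K-1) and maps b with b[n, y, x] = 1 iff D[y, x] == r[n]."""
    r = np.arange(K)
    b = (D[None, :, :] == r[:, None, None]).astype(int)
    return r, b

def decode_distance(r, b):
    """D(p) = sum_n r_n * b_n(p)."""
    return np.tensordot(r, b, axes=(0, 0))

def union_of_disks(D):
    """Place an open disk of radius D(p) at every pixel p with D(p) > 0 and take the union."""
    H, W = D.shape
    yy, xx = np.mgrid[0:H, 0:W]
    out = np.zeros((H, W), dtype=bool)
    for y, x in zip(*np.nonzero(D)):
        out |= (yy - y) ** 2 + (xx - x) ** 2 < D[y, x] ** 2
    return out

# Toy example: a square object with a small protrusion.
mask = np.zeros((12, 12), dtype=bool)
mask[2:9, 3:10] = True
mask[9:11, 5:7] = True

R = 3
K = R + 1
D = truncated_distance_transform(mask, R)
r, b = one_hot_encode(D, K)
recon = union_of_disks(decode_distance(r, b))
```

With these conventions every object pixel has $D(p) \geq 1$, so its own disk covers it, and no disk reaches a background pixel; the union of disks therefore recovers the toy mask exactly despite truncation and quantization.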

An OMN takes proposals from a Region Proposal Network, warps the corresponding features and predicts $K$ feature maps corresponding to $b_1,\ldots,b_K$. These feature maps are then fed into a deconvolution module that transforms them into a binary mask as explained in the paper (essentially expressing the idea of placing a disk at every pixel location in terms of network layers).
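To make the "disks as network layers" idea concrete, the following NumPy sketch convolves each map $b_n$ with a fixed disk-shaped kernel of radius $r_n$ and combines the responses. It is a simplified illustration, not the paper's actual layers: the helper names, the open-disk kernels and the bin values $r = (0, 1, 2)$ are assumptions made for this example.

```python
import numpy as np

def disk_kernel(radius):
    """Binary kernel that is 1 at offsets strictly closer than `radius` to the centre."""
    k = int(np.ceil(radius))
    yy, xx = np.mgrid[-k:k + 1, -k:k + 1]
    return (yy ** 2 + xx ** 2 < radius ** 2).astype(float)

def conv2d_same(image, kernel):
    """Zero-padded 'same' 2D correlation using plain NumPy shifts.
    Disk kernels are symmetric, so correlation and convolution coincide."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    H, W = image.shape
    out = np.zeros((H, W), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + H, j:j + W]
    return out

def mask_from_bins(b, r):
    """b: (K, H, W) one-hot maps, r: (K,) bin radii. Returns the boolean union of disks."""
    score = np.zeros(b.shape[1:], dtype=float)
    for b_n, r_n in zip(b, r):
        if r_n > 0:  # the zero-radius bin is treated as background here
            score += conv2d_same(b_n.astype(float), disk_kernel(r_n))
    return score > 0

# Toy input: one pixel in the radius-2 bin, one in the radius-1 bin, rest background.
r = np.array([0, 1, 2])
b = np.zeros((3, 7, 7), dtype=int)
b[0] = 1
b[0, 3, 3] = 0
b[2, 3, 3] = 1
b[0, 0, 0] = 0
b[1, 0, 0] = 1

mask = mask_from_bins(b, r)
```

A pixel of the output is switched on whenever some disk centred at a pixel of bin $n$ reaches it, which is exactly the union of disks; in a network the thresholding step would be replaced by a differentiable operation.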

The overall SAIS network puts a one-layer classifier on top of the OMN, based on the binary mask and the bounding box features from the RPN. The full network architecture is illustrated in Figure 2.

Figure 2: Illustration of the full architecture consisting of RPN, OMN and final classifier as discussed in the text.

They demonstrate the effectiveness of the proposed model on PASCAL VOC 2012 and Cityscapes. Some qualitative results are shown in Figure 3. For quantitative results and a comparison to state-of-the-art techniques, see the paper.