READING

Pohlen et al. introduce Full Resolution Residual Networks (FRRN) for semantic segmentation of urban street scenes, e.g. on Cityscapes []. The architecture builds largely on the idea of residual units []. The proposed full resolution residual unit extends this idea, motivated by the following observation: current state-of-the-art deep networks for semantic segmentation are often based on pre-trained models that excel at recognition but lack localization accuracy. Low-level features are crucial for localization, but they are often neglected or lost in traditional architectures with several pooling layers. Therefore, Pohlen et al. propose a network architecture with two streams: a residual stream that successively applies full resolution residual units at the full image resolution, and a pooling stream that follows a traditional encoder/decoder architecture [] with several pooling stages. The latter is supposed to learn the high-level features while the former provides low-level features for better localization (i.e. more accurate boundaries).

The residual unit is illustrated in Figure 1 and can generally be described as computing

$x_n = x_{n - 1} + \mathcal{F}(x_{n-1};\mathcal{W}_n)$

where $x_n$ is the output of layer $n$ and $\mathcal{W}_n$ denotes the parameters of layer $n$; that is, each layer is only responsible for representing a residual with respect to its input. The idea is to ease training by making the gradient partly independent of the depth (see the paper for details). A full resolution residual unit takes the form depicted in Figure 2 and computes

$z_n = z_{n - 1} + \mathcal{H}(y_{n - 1},z_{n - 1};\mathcal{W}_n)$

$y_n = \mathcal{G}(y_{n - 1}, z_{n - 1}; \mathcal{W}_n)$

where $z_n$ is the output of the residual stream at layer $n$, $y_n$ the output of the pooling stream at layer $n$, and $\mathcal{W}_n$ the weights of layer $n$. The full resolution residual unit is implemented as in Figure 3. In particular, the residual input $z_{n-1}$ is first pooled to the resolution of the pooling stream, where it is processed together with $y_{n-1}$ by two convolutional layers, each followed by a batch normalization layer and a rectified linear unit layer. Finally, the residual output is up-scaled using an unpooling layer and added to $z_{n-1}$.
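The data flow of one unit can be traced in a few lines of plain Python. This is only a minimal sketch, not the authors' implementation: the names `frru`, `max_pool`, `unpool` and `toy_layer` are hypothetical, joining the two inputs by channel concatenation and projecting before unpooling are assumptions, and `toy_layer` is a crude stand-in for a learned convolution with batch normalization and ReLU.

```python
from typing import Callable, List, Sequence, Tuple

Grid = List[List[float]]   # one channel: rows of pixel values
Feature = List[Grid]       # a feature map: a list of channels
Layer = Callable[[Feature], Feature]


def max_pool(x: Feature, k: int) -> Feature:
    """k x k max pooling with stride k (height and width divisible by k)."""
    return [[[max(ch[i + di][j + dj] for di in range(k) for dj in range(k))
              for j in range(0, len(ch[0]), k)]
             for i in range(0, len(ch), k)]
            for ch in x]


def unpool(x: Feature, k: int) -> Feature:
    """Nearest-neighbour up-scaling by k, a simple stand-in for unpooling."""
    return [[[ch[i // k][j // k] for j in range(len(ch[0]) * k)]
             for i in range(len(ch) * k)]
            for ch in x]


def frru(y_prev: Feature, z_prev: Feature, k: int,
         conv_units: Sequence[Layer], project: Layer) -> Tuple[Feature, Feature]:
    """One unit: returns (y_n, z_n) from (y_{n-1}, z_{n-1})."""
    # Bring the residual stream down to the pooling stream's resolution.
    pooled = max_pool(z_prev, k)
    # Assumption: the two inputs are joined by channel concatenation.
    h = y_prev + pooled
    for unit in conv_units:
        h = unit(h)
    y_next = h                                # y_n = G(y_{n-1}, z_{n-1})
    # Assumption: project to the residual stream's channel count, then unpool.
    residual = unpool(project(y_next), k)     # H(y_{n-1}, z_{n-1})
    z_next = [[[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
              for ca, cb in zip(z_prev, residual)]  # z_n = z_{n-1} + H(...)
    return y_next, z_next


def toy_layer(out_channels: int) -> Layer:
    """Hypothetical stand-in for conv + batch norm + ReLU: averages the input
    channels pixel-wise, applies ReLU, and copies the result to every output
    channel. It has no learned weights and performs no normalization."""
    def layer(x: Feature) -> Feature:
        mean = [[max(0.0, sum(ch[i][j] for ch in x) / len(x))
                 for j in range(len(x[0][0]))]
                for i in range(len(x[0]))]
        return [[row[:] for row in mean] for _ in range(out_channels)]
    return layer


# Residual stream: 1 channel at 4 x 4; pooling stream: 2 channels at 2 x 2.
z = [[[float(i + j) for j in range(4)] for i in range(4)]]
y = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]
y_next, z_next = frru(y, z, 2, [toy_layer(2), toy_layer(2)], toy_layer(1))
```

The point of the sketch is the shapes: `y_next` stays at the pooled resolution, while `z_next` keeps the full resolution and only receives an additive update, which is what makes the residual stream behave like a chain of residual units.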

Figure 1: High-level illustration of the residual unit.

Figure 2: High-level illustration of the full resolution residual unit.

Figure 3: Detailed description of the full resolution residual unit as explained in the text.

The two evaluated architectures, called FRRN A and FRRN B, are summarized in Figure 4. Training is done using Adam and the bootstrapped cross-entropy loss []; see the paper for details.
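For intuition, the bootstrapped loss can be sketched as follows, assuming the common definition in which the per-pixel cross entropy is averaged over only the `k` pixels with the highest loss. The function name, the flat list-of-pixels layout and the example value of `k` are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List, Sequence


def bootstrapped_cross_entropy(probs: Sequence[Sequence[float]],
                               labels: Sequence[int], k: int) -> float:
    """Mean cross entropy over the k hardest pixels.

    probs[i] holds the predicted class probabilities of pixel i,
    labels[i] its ground-truth class index.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Per-pixel cross entropy: negative log-probability of the true class.
    losses: List[float] = [-math.log(p[c]) for p, c in zip(probs, labels)]
    # Keep only the k largest losses, i.e. the worst-classified pixels.
    hardest = sorted(losses, reverse=True)[:k]
    return sum(hardest) / len(hardest)


# Three pixels, two classes; the true class is 0 everywhere.
example_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
example_labels = [0, 0, 0]
loss = bootstrapped_cross_entropy(example_probs, example_labels, k=2)
```

With `k=2`, the well-classified first pixel is ignored and the loss is the mean over the other two, so easy pixels do not dominate the gradient.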

Figure 4: Illustration of the evaluated architectures. The pooling and residual streams are illustrated in red and blue, respectively; the proposed full resolution residual units span both streams.

The presented results look promising, especially as no pre-training is necessary. The authors achieve state-of-the-art performance while training FRRN B at half resolution only and up-scaling the results using bilinear interpolation. Qualitative results are shown in Figure 5. Also check out Tobias Pohlen's webpage for details and source code.
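As a reminder of what the up-scaling step involves, here is a minimal sketch of bilinear interpolation on a single 2D score map. The function name is hypothetical, and the corner-aligned sampling convention is an assumption; libraries differ on this point, and the source does not say which convention the authors use.

```python
from typing import List


def bilinear_upscale(grid: List[List[float]], out_h: int, out_w: int) -> List[List[float]]:
    """Resize a 2D map to out_h x out_w with corner-aligned bilinear interpolation."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional input row (corners aligned).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


scores = [[0.0, 2.0], [4.0, 6.0]]
upscaled = bilinear_upscale(scores, 3, 3)
```

For segmentation, one would typically apply this to each class score map separately and take the per-pixel argmax afterwards, so that class boundaries are interpolated rather than label indices.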