Depth Estimation with Deep Neural Networks, Part 2

This is our second mini-blog about depth estimation. If you haven’t read part 1, I strongly recommend reading it first.

Today we will talk about the paper “Deeper Depth Prediction with Fully Convolutional Residual Networks”. It is a really good paper, and it held up well when we used it in some fun applications that depend on fine depth (we will talk about these later). We also provide a TensorFlow implementation of the paper.

First, we must admit that image-depth datasets are much scarcer than the datasets for popular tasks like classification or object detection, so we should use transfer learning no matter which architecture we are using. That way we do not spend our precious, rare labeled data on learning basic scene features that have already been learned in other tasks.

The important question now is which encoder to use, the encoder being a pre-trained model that converts the image into its basic features. The most important criterion is the receptive field at the last convolutional layer: the larger it is, the better the basic features we can get from the model.

The Architecture:

Deeper Depth 2016 Architecture

They used ResNet-50 as the encoder because it has a 483*483 receptive field, which is enough to fully capture the 304*228 input image.
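To see where a number like 483 can come from, here is a small sketch that accumulates the receptive field layer by layer. It assumes the original ResNet-50 layout, where the stride-2 step of each stage sits in the first 1*1 convolution of the stage's first block, and it ignores the shortcut branches. The function names are mine, not from the paper.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) layers, input to output."""
    rf, jump = 1, 1  # jump = distance in input pixels between neighbouring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def resnet50_layers():
    """(kernel, stride) pairs along the main path of ResNet-50 (assumed layout)."""
    layers = [(7, 2), (3, 2)]  # conv1 (7*7, stride 2) and max-pool (3*3, stride 2)
    for stage, n_blocks in enumerate([3, 4, 6, 3]):
        for block in range(n_blocks):
            # Assumption: stages 2-4 downsample in the first 1*1 conv of their first block.
            first_stride = 2 if (stage > 0 and block == 0) else 1
            layers += [(1, first_stride), (3, 1), (1, 1)]  # bottleneck block
    return layers


print(receptive_field(resnet50_layers()))  # 483
```

Under these assumptions the result matches the 483*483 figure quoted above. If the stride is placed in the 3*3 convolution instead, as some later implementations do, the number comes out smaller.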

The main contribution is that they used residual up-convolutions instead of fully-connected (FC) layers, for two reasons: FC layers are discriminative by nature, so they do not suit a regression problem like depth estimation, and they are also really memory-consuming.

The paper shows a great comparison between different up-sampling blocks. Let’s dig deep into this comparison.

a) Vanilla Up-Convolution: It uses an un-pooling layer followed by a conv layer. Un-pooling is the reverse operation of pooling: each cell value in the input feature map is mapped to 2*2 cells in the output map, where the input value occupies the top-left cell and the other cells are zeros. However, un-pooling greatly weakens the resulting feature map, and it is hard to learn anything useful from such a sparse map.
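A tiny NumPy sketch makes the sparsity obvious. It only handles a single-channel 2D map, and `unpool2x` is a name I made up for illustration.

```python
import numpy as np


def unpool2x(x):
    """Map each input cell to the top-left corner of a 2*2 output cell, zeros elsewhere."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=x.dtype)
    out[::2, ::2] = x
    return out


x = np.array([[1, 2],
              [3, 4]])
up = unpool2x(x)
print(up)
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```

Three out of every four cells in the un-pooled map are zeros by construction, which is why the following convolution has so little to work with.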

b) Vanilla Up-Projection: It is much like the first block, but it adds a projection connection to make learning easier for the model, an idea inspired by the projection blocks in ResNet. Two independent branches are fused to get a denser feature map. Still, the resulting feature map is not great.
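Here is a rough single-channel sketch of such a two-branch block. The exact layer layout is my assumption (a 5*5 conv, ReLU, and 3*3 conv in the main branch, a single 5*5 conv in the projection branch, then a sum and a ReLU), the weights are random, and all function names are hypothetical. A real block works on many channels and has learned weights.

```python
import numpy as np


def unpool2x(x):
    """2*2 un-pooling: values go to the top-left cell, the rest stays zero."""
    out = np.zeros((2 * x.shape[0], 2 * x.shape[1]), dtype=float)
    out[::2, ::2] = x
    return out


def conv_same(x, k):
    """Zero-padded 'same' cross-correlation for an odd-sized kernel."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for a in range(kh):
        for b in range(kw):
            out += k[a, b] * xp[a:a + x.shape[0], b:b + x.shape[1]]
    return out


def up_projection(x, k_main5, k_main3, k_proj5):
    """Assumed layout: un-pool, then fuse a main branch and a projection branch."""
    u = unpool2x(x)
    main = conv_same(np.maximum(conv_same(u, k_main5), 0.0), k_main3)
    proj = conv_same(u, k_proj5)
    return np.maximum(main + proj, 0.0)  # fuse the branches, then ReLU


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
y = up_projection(x,
                  rng.standard_normal((5, 5)),
                  rng.standard_normal((3, 3)),
                  rng.standard_normal((5, 5)))
print(y.shape)  # (6, 8)
```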

c) Fast Up-Convolution: Here they made a great contribution by splitting the 5*5 conv filter weights into four non-overlapping groups, indicated by different colors and labeled A (3*3), B (3*2), C (2*3), and D (2*2) in the figure. Each group produces a separate feature map, and the final feature map is an interleaving of the four groups’ outputs. This skips the un-pooling step, so no multiplications are wasted on its zeros.

They call it fast because they found that using this block decreases the training time by 15%.
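To convince ourselves that the interleaving trick gives the same result as un-pooling followed by a 5*5 convolution, here is a single-channel NumPy check. It is a sketch under my own assumptions (zero padding, convolution implemented as cross-correlation, helper names invented for this post), not the paper's code.

```python
import numpy as np


def correlate_valid(x, k):
    """Plain 'valid' cross-correlation of a 2D map with a 2D kernel."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for a in range(kh):
        for b in range(kw):
            out += k[a, b] * x[a:a + oh, b:b + ow]
    return out


def slow_up_conv(x, k5):
    """Un-pool (zeros everywhere except top-left cells), then a zero-padded 5*5 conv."""
    u = np.zeros((2 * x.shape[0], 2 * x.shape[1]))
    u[::2, ::2] = x
    return correlate_valid(np.pad(u, 2), k5)


def fast_up_conv(x, k5):
    """Four small convs on the low-resolution input, interleaved into the output."""
    A, B = k5[0::2, 0::2], k5[0::2, 1::2]  # 3*3 and 3*2 groups
    C, D = k5[1::2, 0::2], k5[1::2, 1::2]  # 2*3 and 2*2 groups
    out = np.zeros((2 * x.shape[0], 2 * x.shape[1]))
    out[0::2, 0::2] = correlate_valid(np.pad(x, ((1, 1), (1, 1))), A)
    out[0::2, 1::2] = correlate_valid(np.pad(x, ((1, 1), (0, 1))), B)
    out[1::2, 0::2] = correlate_valid(np.pad(x, ((0, 1), (1, 1))), C)
    out[1::2, 1::2] = correlate_valid(np.pad(x, ((0, 1), (0, 1))), D)
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
k5 = rng.standard_normal((5, 5))
same = np.allclose(slow_up_conv(x, k5), fast_up_conv(x, k5))
print(same)  # True
```

The four groups together hold exactly the 25 weights of the 5*5 filter, but each output pixel now needs at most 9 multiplications instead of 25.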

Figure

d) Fast Up-Projection: They simply applied the new interleaving technique to the up-projection block.

The Loss function: They used the reverse Huber (berHu) loss. It equals the L1 norm when the absolute depth difference is at or below a certain threshold and switches to a scaled L2 norm once the difference crosses that threshold. This puts more weight on pixels with large errors, while small errors still get the stronger L1 gradient.

Reverse Huber (berHu) loss function
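A minimal NumPy sketch of the reverse Huber loss could look like this. I am assuming the threshold is 20% of the largest absolute error in the batch and that the per-pixel values are averaged. The function name is mine.

```python
import numpy as np


def berhu_loss(pred, target):
    """Reverse Huber: |x| for |x| <= c, (x^2 + c^2) / (2c) otherwise."""
    diff = np.abs(pred - target)
    c = 0.2 * diff.max()  # assumed threshold: 1/5 of the largest error
    if c == 0:
        return 0.0  # perfect prediction, nothing to penalise
    l2_part = (diff ** 2 + c ** 2) / (2 * c)
    return float(np.where(diff <= c, diff, l2_part).mean())


pred = np.zeros(4)
target = np.array([0.1, 0.5, 1.0, 0.0])
print(berhu_loss(pred, target))  # approximately 0.85625
```

In this toy example the threshold is 0.2, so the 0.1 error is counted linearly while the 0.5 and 1.0 errors fall in the quadratic part. The two pieces meet at the threshold, so the loss stays continuous.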

The output: The resulting depth map is good, but it is blurry. This might be an issue in a cool application (we will discuss it later, so stay tuned).