READING

Sevilla-Lara et al. build upon the layered optical flow model of Sun et al. [1] by incorporating semantic segmentation; in addition, they alter the model of Sun et al. into what they call "localized layers". Their approach relies on two pre-processing steps:

compute a semantic segmentation using the approach of [2] (i.e. the VGG network [3] is converted into a fully-convolutional model and fine-tuned to predict the required classes);

compute an initial flow field using the approach of [4].

Given the semantic segmentation, the initial flow field is refined depending on the class:

Planes (e.g. roads, sky and water) are modeled using planar motion: a homography is fitted using RANSAC, which then defines the motion of each pixel belonging to the plane class;

Things (common foreground objects of bounded extent, e.g. cars, pedestrians and animals) are modeled using affine motion as in [1], with the difference that the graphical model is not applied globally but only to the patch containing the object at hand. Note that individual objects are obtained by refining the semantic segmentation using a CRF and computing connected components.
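The planar case above can be illustrated with a small sketch: given correspondences between plane pixels and their flow-displaced positions, a homography is fitted with RANSAC and then defines the motion of every plane pixel. This is a minimal numpy implementation of the general homography-plus-RANSAC idea, not the authors' code; the function names, iteration count and inlier threshold are my own assumptions.

```python
import numpy as np

def fit_homography(src, dst):
    # Direct linear transform: estimate H mapping src -> dst (each Nx2, N >= 4).
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    # Map Nx2 points through H in homogeneous coordinates.
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, iters=500, thresh=2.0, rng=None):
    # Repeatedly fit H to minimal 4-point samples and keep the
    # hypothesis with the most inliers (reprojection error < thresh).
    rng = rng if rng is not None else np.random.default_rng(0)
    best_H, best_inliers = None, np.zeros(len(src), bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(apply_homography(H, src) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    if best_inliers.sum() >= 4:  # refit on all inliers for stability
        best_H = fit_homography(src[best_inliers], dst[best_inliers])
    return best_H, best_inliers
```

In this setting, `src` would be the pixel coordinates of a plane region and `dst = src + flow` the positions given by the initial flow field; the refitted homography then overrides the flow for the whole region.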
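For things, the pipeline extracts individual objects as connected components of the (CRF-refined) segmentation mask and models each one with affine motion. The sketch below, again a numpy-only illustration under my own assumptions rather than the authors' implementation, labels 4-connected components and fits a least-squares affine flow model per component:

```python
import numpy as np
from collections import deque

def connected_components(mask):
    # 4-connected components of a boolean mask; returns an int label
    # map with 0 for background and labels 1..K for the components.
    labels = np.zeros(mask.shape, dtype=int)
    next_label = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue
        next_label += 1
        labels[sy, sx] = next_label
        q = deque([(sy, sx)])
        while q:  # breadth-first flood fill
            y, x = q.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = next_label
                    q.append((ny, nx))
    return labels

def fit_affine_flow(flow, labels, label):
    # Least-squares affine motion model for one component:
    # u = a1*x + a2*y + a3 and v = b1*x + b2*y + b3.
    ys, xs = np.nonzero(labels == label)
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(float)
    U = flow[ys, xs]  # (N, 2): horizontal and vertical flow samples
    params, *_ = np.linalg.lstsq(A, U, rcond=None)
    return params  # (3, 2): columns hold the u- and v-coefficients
```

Here `flow` is an H x W x 2 flow field; in the paper the per-object model is further refined by the layered graphical model restricted to the object's patch, which this sketch does not attempt to reproduce.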

At the time of publication, Sevilla-Lara et al. reported the leading result on the KITTI dataset [6] (see the leaderboard), and qualitative results look promising on both the KITTI dataset and selected YouTube sequences, see Figure 1.