view

In this talk, I will describe a deep architecture to learn to find good
correspondences for wide-baseline stereo. Given a set of putative sparse
matches and the camera intrinsics, we train our network in an end-to-end
fashion to label the correspondences as inliers or outliers, while
simultaneously using them to recover the relative pose, as encoded by the
essential matrix. Our architecture is based on a multi-layer perceptron
operating on pixel coordinates rather than directly on the image, and is
thus simple and small. We introduce a novel normalization technique,
called Context Normalization, which allows us to process each data point
separately while imbuing it with global information, and also makes the
network invariant to the order of the correspondences. Our experiments on
multiple challenging datasets demonstrate that our method is able to
drastically improve the state of the art with little training data.