2019

The motion of the world is inherently dependent on the spatial structure of the world and its geometry. Therefore, classical optical flow methods try to model this geometry to solve for the motion. However, recent deep learning methods take a completely different approach. They try to predict optical flow by learning from labelled data. Although deep networks have shown state-of-the-art performance on classification problems in computer vision, they have not been as effective in solving optical flow. The key reason is that deep learning methods do not explicitly model the structure of the world in a neural network, and instead expect the network to learn about the structure from data. We hypothesize that it is difficult for a network to learn about motion without any constraint on the structure of the world. Therefore, we explore several approaches to explicitly model the geometry of the world and its spatial structure in deep neural networks.

The spatial structure in images can be captured by representing it at multiple scales. To represent multiple scales of images in deep neural nets, we introduce a Spatial Pyramid Network (SPyNet). Such a network can leverage global information for estimating large motions and local information for estimating small motions. We show that SPyNet significantly improves over previous optical flow networks while also being the smallest and fastest neural network for motion estimation. SPyNet achieves a 97% reduction in model parameters over previous methods and is more accurate.
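The multi-scale idea can be illustrated with a minimal sketch of a coarse-to-fine scheme over an image pyramid. This is not the SPyNet implementation: the function names are hypothetical, images are plain lists of rows whose dimensions are assumed to be divisible by 2^(levels-1), and `estimate_residual` is a stand-in for whatever per-level estimator (for example a small network) predicts a flow correction.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block (image is a list of rows)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def build_pyramid(img, levels):
    """Return the pyramid ordered from coarsest to finest, with `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid[::-1]


def upsample_flow(flow):
    """Nearest-neighbour 2x upsampling; vectors are doubled because pixels shrink by half."""
    out = []
    for row in flow:
        wide = [(2.0 * u, 2.0 * v) for (u, v) in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def coarse_to_fine(pyr1, pyr2, estimate_residual):
    """Accumulate flow from the coarsest level to the finest.

    estimate_residual(img1, img2, flow) is a hypothetical callable that returns
    a per-pixel (du, dv) correction given the current flow estimate.
    """
    h, w = len(pyr1[0]), len(pyr1[0][0])
    flow = [[(0.0, 0.0)] * w for _ in range(h)]
    for level, (a, b) in enumerate(zip(pyr1, pyr2)):
        if level > 0:
            flow = upsample_flow(flow)
        residual = estimate_residual(a, b, flow)
        flow = [[(fu + ru, fv + rv) for (fu, fv), (ru, rv) in zip(frow, rrow)]
                for frow, rrow in zip(flow, residual)]
    return flow
```

Large motions become small at the coarse levels, so each level only has to estimate a small correction to the upsampled flow from the level below.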

The spatial structure of the world extends to people and their motion. Humans have a very well-defined structure, and this information is useful in estimating optical flow for humans. To leverage this information, we create a synthetic dataset for human optical flow using a statistical human body model and motion capture sequences. We use this dataset to train deep networks and see significant improvement in the ability of the networks to estimate human optical flow.

The structure and geometry of the world affect its motion. Therefore, learning about the structure of the scene together with the motion can benefit both problems. To facilitate this, we introduce Competitive Collaboration, where several neural networks are constrained by geometry and can jointly learn about structure and motion in the scene without any labels. We show that jointly learning single view depth prediction, camera motion, optical flow and motion segmentation using Competitive Collaboration achieves state-of-the-art results among unsupervised approaches.

Our findings provide support for our hypothesis that explicit constraints on structure and geometry of the world lead to better methods for motion estimation.

2013

We introduce Puppet Flow (PF), a layered model describing the optical flow of a person in a video sequence. We consider video frames composed of two layers: a foreground layer corresponding to a person, and a background layer. We model the background as an affine flow field. The foreground layer, being a moving person, requires reasoning about the articulated nature of the human body. We thus represent the foreground layer with the Deformable Structures model (DS), a parametrized 2D part-based human body representation. We call the motion field defined through articulated motion and deformation of the DS model a Puppet Flow. By exploiting the DS representation, Puppet Flow is a parametrized optical flow field, where parameters are the person's pose, gender and body shape.
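As an illustration of what an affine flow field for the background looks like, here is a minimal sketch. The six-parameter ordering (a1..a6) and the function name are assumptions made for this example and are not taken from the paper.

```python
def affine_flow(params, width, height):
    """Dense affine flow: u = a1 + a2*x + a3*y, v = a4 + a5*x + a6*y.

    `params` is (a1, ..., a6); this ordering is an assumption of the sketch.
    Returns a height x width grid of (u, v) tuples.
    """
    a1, a2, a3, a4, a5, a6 = params
    return [[(a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)
             for x in range(width)]
            for y in range(height)]


# Pure translation: every pixel moves by (1, 2).
translation = affine_flow((1.0, 0.0, 0.0, 2.0, 0.0, 0.0), width=3, height=2)

# Uniform expansion about the origin: displacement grows with position.
expansion = affine_flow((0.0, 0.5, 0.0, 0.0, 0.0, 0.5), width=3, height=2)
```

Six numbers describe the motion of the whole background layer, which is what makes the layered model compact compared with a free per-pixel flow field.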

Statistical models of non-rigid deformable shape have wide application in many fields, including computer vision, computer graphics, and biometry. We show that shape deformations are well represented through nonlinear manifolds that are also matrix Lie groups. These pattern-theoretic representations lead to several advantages over other alternatives, including a principled measure of shape dissimilarity and a natural way to compose deformations. Moreover, they enable building models using statistics on manifolds. Consequently, such models are superior to those based on Euclidean representations. We demonstrate this by modeling 2D and 3D human body shape. Shape deformations are only one example of manifold-valued data. More generally, in many computer-vision and machine-learning problems, nonlinear manifold representations arise naturally and provide a powerful alternative to Euclidean representations. Statistics is traditionally concerned with data in a Euclidean space, relying on the linear structure and the distances associated with such a space; this renders it inappropriate for nonlinear spaces. Statistics can, however, be generalized to nonlinear manifolds. Moreover, by respecting the underlying geometry, the statistical models result in not only more effective analysis but also consistent synthesis. We go beyond previous work on statistics on manifolds by showing how, even on these curved spaces, problems related to modeling a class from scarce data can be dealt with by leveraging information from related classes residing in different regions of the space. We show the usefulness of our approach with 3D shape deformations. To summarize our main contributions: 1) We define a new 2D articulated model -- more expressive than traditional ones -- of deformable human shape that factors body-shape, pose, and camera variations. Its high realism is obtained from training data generated from a detailed 3D model. 2) We define a new manifold-based representation of 3D shape deformations that yields statistical deformable-template models that are better than the current state-of-the-art. 3) We generalize a transfer learning idea from Euclidean spaces to Riemannian manifolds. This work demonstrates the value of modeling manifold-valued data and their statistics explicitly on the manifold. Specifically, the methods here provide new tools for shape analysis.
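A toy example may help show why Euclidean statistics can be inappropriate on a matrix Lie group. The sketch below uses 2D rotations, SO(2), purely as a stand-in; it is not the deformation representation used in the work, and all function names are hypothetical. It assumes the rotation angles are clustered well away from the ±π wrap-around, so that averaging in the Lie algebra is well defined.

```python
import math


def rot(theta):
    """2x2 rotation matrix for angle theta (an element of SO(2))."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def log_so2(R):
    """Map a rotation matrix to its Lie-algebra coordinate (the angle)."""
    return math.atan2(R[1][0], R[0][0])


def det(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def euclidean_mean(mats):
    """Element-wise average; the result generally leaves the group."""
    n = len(mats)
    return [[sum(M[i][j] for M in mats) / n for j in range(2)] for i in range(2)]


def lie_mean(mats):
    """Average in the Lie algebra, then map back to the group."""
    return rot(sum(log_so2(M) for M in mats) / len(mats))


samples = [rot(0.0), rot(math.pi / 2)]
flat = euclidean_mean(samples)   # determinant is about 0.5: not a rotation
curved = lie_mean(samples)       # a valid rotation by about pi/4
```

The element-wise mean shrinks the matrix and is no longer a rotation, whereas the mean computed through the group structure stays on the manifold, which is the sense in which synthesis remains consistent.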

This chapter introduces the concept of a Steerable Random Field (SRF). In contrast to traditional Markov random field (MRF) models in low-level vision, the random field potentials of a SRF are defined in terms of filter responses that are steered to the local image structure. This steering uses the structure tensor to obtain derivative responses that are either aligned with, or orthogonal to, the predominant local image structure. Analysis of the statistics of these steered filter responses in natural images leads to the model proposed here. Clique potentials are defined over steered filter responses using a Gaussian scale mixture model and are learned from training data. The SRF model connects random fields with anisotropic regularization and provides a statistical motivation for the latter. Steering the random field to the local image structure improves image denoising and inpainting performance compared with traditional pairwise MRFs.
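The steering step can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the function names are hypothetical, the neighbourhood is passed as a plain list of (gx, gy) image gradients, and the tensor is an unweighted sum rather than a smoothed one.

```python
import math


def structure_tensor(gradients):
    """Sum the outer products g g^T over a neighbourhood of (gx, gy) gradients."""
    jxx = sum(gx * gx for gx, gy in gradients)
    jxy = sum(gx * gy for gx, gy in gradients)
    jyy = sum(gy * gy for gx, gy in gradients)
    return jxx, jxy, jyy


def dominant_orientation(jxx, jxy, jyy):
    """Angle of the tensor's principal eigenvector (dominant gradient direction)."""
    return 0.5 * math.atan2(2.0 * jxy, jxx - jyy)


def steer(gx, gy, theta):
    """Rotate a gradient into the local frame.

    Returns (response orthogonal to the local structure,
             response aligned with the local structure).
    """
    c, s = math.cos(theta), math.sin(theta)
    return c * gx + s * gy, -s * gx + c * gy


# A patch crossing a diagonal edge: all gradients point along (1, 1).
patch = [(1.0, 1.0)] * 9
theta = dominant_orientation(*structure_tensor(patch))
across, along = steer(1.0, 1.0, theta)
```

In this example the whole gradient falls into the orthogonal response and the aligned response vanishes, which is the kind of asymmetry that potentials defined on steered responses can exploit.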
