Learning Invariance through Imitation

Graham W. Taylor, Ian Spiro, Chris Bregler and Rob Fergus

Sample retrieval results. Each row is a query. We select a test
image (column 1) and find its 10 nearest neighbors using our method,
which we call Pose Sensitive Embedding (PSE).
The blue text in each image indicates the seed id (left) and the
distance (in embedded space) from the query (right).

Non-scientific abstract (for geeks and non-geeks)

Computer vision has hit the mainstream with applications such as cars that detect pedestrians, motion capture for animation, and apps that let you cash a cheque by snapping a picture with your mobile phone. A great example of computer vision in the consumer market is Microsoft's Kinect gaming system, which can accurately detect the pose of one or more players, allowing gameplay to be controlled using just the body. Such a system must detect pose reliably under a wide variety of conditions: different players, unusual clothing, poor lighting, cluttered backgrounds, and other sources of variation.

One way to perform pose estimation is to keep a large database of examples of people in a variety of poses, along with labels indicating the configuration of the body in 2D or 3D. When presented with a new, unlabeled example, we can compare it against the database to find the best match, and then assign the labels of the best match to the new example. However, the matching (or similarity) problem is a very tough one, especially given the large amount of input variability due to the factors described above. If we had many examples of people in similar poses but under differing conditions, we could use machine learning to construct an algorithm that matches based on the important information (e.g. pose) and ignores the distracting information (e.g. lighting, clothing, background). But how do we collect such data?
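The database-lookup idea above can be sketched in a few lines: embed the query, find its nearest database entry, and transfer that entry's pose labels. This is a minimal illustration using NumPy, not the authors' actual pipeline; the toy database, labels, and function name are placeholders.

```python
import numpy as np

def nearest_neighbor_pose(query_embedding, db_embeddings, db_pose_labels):
    """Return the pose label of the database entry closest to the query.

    query_embedding : (d,) vector for the unlabeled image
    db_embeddings   : (n, d) matrix of embedded database images
    db_pose_labels  : length-n sequence of pose annotations (2D/3D)
    """
    # Euclidean distance from the query to every database entry
    dists = np.linalg.norm(db_embeddings - query_embedding, axis=1)
    best = int(np.argmin(dists))
    # Transfer the best match's labels to the query
    return db_pose_labels[best], dists[best]

# Toy usage: 3 database entries in a 2-D embedded space
db = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
labels = ["standing", "arms raised", "crouching"]
label, dist = nearest_neighbor_pose(np.array([0.9, 1.1]), db, labels)
```

The hard part, as the text explains, is choosing the space in which these distances are measured, so that distance reflects pose rather than clothing or lighting.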
In a somewhat unusual move for computer scientists, we turned to the
Dutch progressive-electro band C-Mon & Kypski. Their music video/crowdsourcing project "One Frame of Fame" asks people on the web to replace one frame of the band's music video for the song "More or Less" with a capture from a webcam. A visitor to the band's website is shown a single frame of the video and asked to perform an imitation in front of the camera. The new contribution is spliced into the video, which is updated once an hour. This turns out to be the perfect data source for learning an algorithm that computes similarity based on pose. Armed with the band's data and a few machine learning tricks up our sleeves, we built a system that is highly effective at matching people in similar poses under widely different settings.

Scientific abstract (for geeks)

Supervised methods for learning an embedding aim to map
high-dimensional images to a space in which perceptually similar
observations have high measurable similarity.
Most approaches rely on binary similarity, typically defined by class
membership, where labels are expensive to obtain and/or difficult to
define. In this paper we propose crowd-sourcing similar images by
soliciting human imitations. We exploit temporal coherence in video
to generate additional pairwise graded similarities between the
user-contributed imitations. We introduce two methods for learning
nonlinear, invariant mappings that exploit graded similarities. We
learn a model that is highly effective at matching people in similar
pose. It exhibits remarkable invariance to identity, clothing,
background, lighting, shift and scale.
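The paper's two objectives are not reproduced here; as a rough sketch of the general idea, the following extends a standard contrastive loss (Hadsell et al.) so that each pair carries a graded similarity in [0, 1] rather than a binary same/different label. The function name, margin, and weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def graded_contrastive_loss(za, zb, similarity, margin=1.0):
    """Contrastive-style loss on embedded pairs with graded similarity.

    za, zb     : (n, d) embeddings of the two images in each pair
    similarity : (n,) graded similarity (1 = imitation pair,
                 intermediate values from temporal coherence,
                 0 = dissimilar pair)
    """
    # Pairwise Euclidean distances in the embedded space
    d = np.linalg.norm(za - zb, axis=1)
    # Pull similar pairs together, weighted by how similar they are
    attract = similarity * d**2
    # Push dissimilar pairs at least `margin` apart
    repel = (1.0 - similarity) * np.maximum(0.0, margin - d)**2
    return float(np.mean(attract + repel))

# Toy usage: one perfectly similar pair and one dissimilar pair,
# both currently at distance zero in the embedding
za = np.zeros((2, 2))
zb = np.zeros((2, 2))
loss = graded_contrastive_loss(za, zb, np.array([1.0, 0.0]))
```

Minimizing such an objective over the embedding parameters θ is what drives f(X|θ) to collapse nuisance variation (identity, clothing, lighting) while preserving pose.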

Schematic of our approach. We assume that for each frame
of video there exists an unobserved low-dimensional representation of pose,
Z. A seed image is generated by mapping from pose space to pixel space,
X, through an unobserved interpretation function. Our method learns a nonlinear embedding,
f(X|θ), which approximates Z with a
low-dimensional vector. In the example above, users are asked to
imitate seed images taken from a music video (http://oneframeoffame.com).