My research is focused on algorithms for obtaining 3D models of
objects from visual data such as digital photographs or video. One motivation
for this research is that visual data is very easy to acquire, typically much
easier than data from other 3D sensors. Most importantly, however, 3D models that have been
obtained from visual cues are, by construction, visually faithful to the real
world objects they are modelling. Therefore, these techniques offer our best
chance of bridging the gap between the real world and the virtual worlds of
Computer Graphics, eventually leading to photorealistic 3D content in both
films and computer games.

News

NEW: OCT'18: Together with
Dr Diego Faria we have been awarded a CHIST-ERA project entitled "InDex - Robot In-hand Dexterous manipulation by extracting data from human manipulation of objects to improve robotic autonomy and dexterity" (total award EUR 1.2 million, EUR 441k to Aston). The project aims to capture and understand how humans perform in-hand object manipulation in order to replicate the observed skilled movements with dexterous artificial hands. Aston's part will involve experience transfer in Deep Reinforcement Learning. InDex is a collaboration between Aston (coordinators), Sorbonne, TU Wien, U of Genoa and U of Tartu.

NEW: SEP'18:
PhD studentship available on the topic of Deep Reinforcement Learning for Robot Dexterous Manipulation. More details here.

APR'18:
PhD studentship available on the topic of Deep Reinforcement Learning for Robot Perception, Reasoning and Interaction. More details here and online application form here.

MAR'12: I was invited to give a talk at the Microsoft Research UK New Faculty Summit 2012. During this event, 8 new faculty members in Computer Vision had the opportunity to present their work to an audience of junior and senior CV researchers. This was followed by a lovely dinner and lots of very stimulating discussion. Thanks MSR Cambridge for organizing this! The slides I presented can be found here.

NOV '11: During ICCV'11 in Barcelona, we co-organized a workshop entitled "Live Dense Reconstruction with Moving Cameras" together with Andy Davison and Richard Newcombe. The workshop was a great success, attended by about 200 people. Check out the website here.

JUNE '10: A half-day tutorial entitled "3d Shape Reconstruction from
Photographs: a Multi-View Stereo Approach" was presented at CVPR 2010
together with Carlos Hernandez
and Yasutaka Furukawa. Check out its
webpage containing slides and other material.

MAY '10: Two book chapters published in an edited volume entitled Computer Vision:
Detection, Recognition and Reconstruction, Cipolla, Battiato, Farinella (Eds.),
2010 Springer-Verlag.

The Middlebury evaluation for multi-view stereo has been really successful in motivating researchers, inspiring numerous publications and driving the field of MVS forward.
In this work, eight years after the publication of the Middlebury dataset, we are releasing a large scale, realistic and very challenging new dataset.
The data has been captured using DTU's robotic arm apparatus and ground truth is obtained using a structured light scanner.
The data acquisition is very accurate (0.1 pixels reprojection error and 0.2 mm in geometric accuracy) and covers a very wide range of scenes including well-textured, partially untextured, diffuse, shiny and metallic objects.
As a first benchmark, we evaluated three established state-of-the-art MVS methods from the top performers at Middlebury.
The main finding was that all methods perform to a very high standard, illustrating how much the technology has progressed.
However, there were also serious performance limitations, mainly to do with challenging scenes.
This will hopefully inspire a second round of improvements until the technology is finally ready for real-world use.

The project has its own website that can be found here. Also, check out a video showing the acquisition setup.

In this project we looked at the problem of fusing depth-map measurements probabilistically. The results show our method outperforming competitors in some regimes, especially under heavy noise/outlier measurements.
However, the key merit of the approach is the principled variational Bayesian framework which shows great promise and paves the way for more complex models. More on this soon!

The paper describes a probabilistic, online, depth map fusion framework, whose generative model for the sensor measurement process accurately incorporates both long-range visibility constraints and a spatially varying, probabilistic outlier model. In addition, we propose an inference algorithm that updates the state variables of this model in linear time each frame. Our detailed evaluation compares our approach against several others, demonstrating and explaining the improvements that this model offers as well as highlighting a problem with all current methods: systemic bias.

We investigate the problem of obtaining a dense reconstruction in real-time, from a live video stream. In recent years, Multi-view stereo (MVS) has received considerable attention and a number of methods have been proposed. However, most methods operate under the assumption of a relatively sparse set of still images as input and unlimited computation time. Video based MVS has received less attention despite the fact that video sequences offer significant benefits in terms of usability of MVS systems. In this paper we propose a novel video based MVS algorithm that is suitable for real-time, interactive 3d modeling with a hand-held camera. The key idea is a per-pixel, probabilistic depth estimation scheme that updates posterior depth distributions with every new frame. The current implementation is capable of updating 15 million distributions per second. We evaluate the proposed method against the state-of-the-art real-time MVS method and show improvement in terms of accuracy.
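The per-pixel update idea can be illustrated with a minimal sketch. It is not the implementation from the paper: it assumes a discretised depth grid for a single pixel and a measurement model that mixes a Gaussian inlier term with a uniform outlier term. The function name, the parameters (sigma, inlier_prob) and the sample measurements are all hypothetical.

```python
import math


def update_depth_posterior(prior, depths, measurement, sigma, inlier_prob, depth_range):
    """One Bayesian update of a discretised depth distribution for a single pixel.

    Assumed measurement model (an illustration only): with probability
    inlier_prob the measurement is the true depth plus Gaussian noise of
    standard deviation sigma; otherwise it is uniform over depth_range.
    """
    d_min, d_max = depth_range
    uniform = 1.0 / (d_max - d_min)
    gauss_norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    unnormalised = []
    for p, d in zip(prior, depths):
        gauss = gauss_norm * math.exp(-0.5 * ((measurement - d) / sigma) ** 2)
        likelihood = inlier_prob * gauss + (1.0 - inlier_prob) * uniform
        unnormalised.append(p * likelihood)
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Candidate depths 1.0, 1.1, ..., 3.0 with a flat prior.
depths = [1.0 + 0.1 * i for i in range(21)]
belief = [1.0 / len(depths)] * len(depths)

# One noisy depth measurement per frame; 2.9 plays the role of an outlier.
for z in [2.02, 1.97, 2.9, 2.01, 2.05]:
    belief = update_depth_posterior(belief, depths, z, 0.05, 0.7, (1.0, 3.0))

best_depth = depths[max(range(len(depths)), key=belief.__getitem__)]
```

Because the outlier term keeps every likelihood above zero, a single bad measurement lowers the confidence of the posterior without eliminating the correct depth.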

This paper addresses the problem of automatically obtaining
the object/background segmentation of a rigid 3D object
observed in a set of images that have been calibrated for
camera pose and intrinsics. Such segmentations can be used
to obtain a shape representation of a potentially texture-less
object by computing a visual hull. We propose an automatic
approach where the object to be segmented is identified by the
pose of the cameras instead of user input such as 2D bounding
rectangles or brush-strokes.
The key behind our method is a pairwise MRF framework that
combines (a) foreground/background appearance models, (b)
epipolar constraints and (c) weak stereo correspondence into a
single segmentation cost function that can be efficiently solved
by Graph-cuts. The segmentation thus obtained is further
improved using silhouette coherency and then used to update
the foreground/background appearance models which are fed
into the next Graph-cut computation. These two steps are
iterated until the segmentation converges.
Our method can automatically provide a 3D surface
representation even in texture-less scenes where MVS
methods might fail. Furthermore, it confers improved
performance in images where the object is not readily
separable from the background in colour space, an area that
previous segmentation approaches have found challenging.

This paper addresses the problem of obtaining detailed 3d
reconstructions of human faces in real-time and with
inexpensive hardware. We present an algorithm based on
a monocular multi-spectral photometric-stereo setup. This
system is known to capture highly detailed deforming 3d surfaces
at high frame rates and without having to use any
expensive hardware or synchronized light stage. However,
the main challenge of such a setup is the calibration stage,
which depends on the lights setup and how they interact
with the specific material being captured, in this case, human
faces. For this purpose we develop a self-calibration
technique where the person being captured is asked to perform
a rigid motion in front of the camera, maintaining a
neutral expression. Rigidity constraints are then used to
compute the head's motion with a structure-from-motion algorithm.
Once the motion is obtained, a multi-view stereo
algorithm reconstructs a coarse 3d model of the face. This
coarse model is then used to estimate the lighting parameters
with a robust estimator which allows for detailed real-time
3d capture of faces. The calibration procedure is validated
with two real sequences.

The previous 3DPVT'10 version can be found here. The capture system is identical to the one we presented
in ECCV'08.
The main difference is the clever calibration method for
photometric stereo that was inspired by our earlier
CVPR'06 work.
Check out the supplementary video
here.

This paper proposes an improvement to a large class of Multi-View Stereo algorithms that fuse stereo depth maps. We show that if individual depth-maps are filtered for outliers prior to the fusion stage, good performance can be maintained in sparse data-sets. Our strategy is to collect a list of good hypotheses for the depth of each pixel. We then choose the optimal depth for each pixel by enforcing consistency between neighbouring pixels in a depth-map. A crucial element of the filtering stage is the introduction of a possible unknown depth hypothesis for each pixel, which is selected by the algorithm when no consistent depth can be chosen. This pre-processing of the depth-maps allows the global fusion stage to operate on fewer outliers and consequently improve the performance under sparsity of data.

Shadows present a significant challenge for Photometric Stereo methods. When four or more images are available, local surface orientation is overdetermined and the shadowed pixels can be discarded. In this paper we look at the challenging case when only three images under three different illuminations are available. In this case, when one of the three pixel intensity constraints is missing due to shadow, a 1 dof ambiguity per pixel arises. We show that using integrability one can resolve this ambiguity and use the remaining two constraints to reconstruct the geometry in the shadow regions. As the problem becomes ill-posed in the presence of noise, we describe a regularization scheme that improves the numerical performance of the algorithm while preserving data.
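To make the one-parameter ambiguity concrete, here is a short derivation under a Lambertian model. The notation is assumed for this sketch and is not taken from the paper: l_i are the light directions scaled by intensity, c_i the measured pixel intensities, b the albedo-scaled normal, and z the depth map.

```latex
% Lambertian constraints for a pixel lit by all three lights
c_i = \mathbf{l}_i^\top \mathbf{b}, \qquad i = 1, 2, 3, \qquad \mathbf{b} = \rho\,\mathbf{n}.

% If light 3 is shadowed, two constraints remain and the solutions form a line
\mathbf{b}(t) = \mathbf{b}_0 + t\,(\mathbf{l}_1 \times \mathbf{l}_2), \qquad t \in \mathbb{R},

% where b_0 is any particular solution of the two remaining constraints.
% With p = \partial z / \partial x, q = \partial z / \partial y and n \propto (-p, -q, 1):
p = -\frac{b_1}{b_3}, \qquad q = -\frac{b_2}{b_3}, \qquad
\frac{\partial p}{\partial y} = \frac{\partial q}{\partial x}.
```

The last equation is the integrability condition. It ties the free parameter t at one pixel to its neighbours, which is why the per-pixel ambiguity can be resolved across a shadow region.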

When three lights illuminate a surface from three different angles and with three different colors, there is a one-to-one mapping between the RGB color measured by a camera and the surface orientation. If we illuminate a complex object under this setup, we can invert the mapping to get surface orientations from an RGB image, then integrate those to get a depth-map. In this paper, this idea, previously used only with static objects, is applied to the reconstruction of a deforming object, such as a moving cloth. We capture color videos of complex motions of fabrics, from which we extract sequences of depth maps. We propose a simple scheme with which these depth maps can be registered to a canonical pose and this allows complex applications such as texture mapping or avatar skinning. A video showing the system in action can be found in the following links: short avi, longer version and from YouTube part 1 and part 2.
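The inversion of the colour-to-orientation mapping can be sketched as follows. This is a minimal illustration, not the calibration or code used in the paper: it assumes a Lambertian surface of uniform albedo, no shadows, and a known 3x3 matrix (called MIXING here, a hypothetical name with made-up values) whose rows map a unit normal to the red, green and blue responses.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, b):
    """Solve the 3x3 linear system m x = b with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("mixing matrix is singular; the lights must be non-coplanar")
    solution = []
    for col in range(3):
        replaced = [[b[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def rgb_from_normal(mixing, normal):
    """Forward model: predicted RGB colour for a unit surface normal."""
    return [sum(mixing[r][c] * normal[c] for c in range(3)) for r in range(3)]


def normal_from_rgb(mixing, rgb):
    """Inverse model: unit surface normal recovered from one RGB pixel."""
    v = solve3(mixing, rgb)
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


# Hypothetical setup: red, green and blue lights placed 120 degrees apart
# around the viewing axis, each tilted 30 degrees away from it.
MIXING = [
    [0.5, 0.0, 0.866],
    [-0.25, 0.433, 0.866],
    [-0.25, -0.433, 0.866],
]

true_normal = [0.1, 0.2, math.sqrt(1.0 - 0.1 ** 2 - 0.2 ** 2)]
pixel_rgb = rgb_from_normal(MIXING, true_normal)
recovered = normal_from_rgb(MIXING, pixel_rgb)
```

Applying normal_from_rgb to every pixel of a frame gives a normal map, which would then be integrated to obtain the depth-map for that frame.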

This is an extension and consolidation of our CVPR 2006 work on multi-view photometric stereo. The main difference of this paper is that significant albedo variations in the surface of the reconstructed object can be tolerated. In the case where albedo variation is present on the object, we can usually obtain reconstructions with classic multi-view dense stereo. We show, however, how our work produces results of much higher geometric detail than multi-view stereo, by exploiting the change in illumination. An earlier version of this work appeared in

Many multi-view stereo methods are faced with the problem of segmenting 20-100 calibrated images of a 3D object. These segmentations are used to create a visual hull which is a first approximation to the object's geometry. In this paper we propose a simple technique for automatically segmenting these images. Our idea is based on two observations: (1) In each image the camera will usually fixate on the object of interest and (2) the segmentations are not independent because of the silhouette coherence constraint. We use (1) to initialise an object color model. We then perform a series of simultaneous segmentations using (2). In each iteration we update the color model based on previous results. The process converges to the correct segmentations after just a few iterations.
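As a small illustration of how segmentations give a visual hull, the sketch below keeps the 3D sample points that project inside the silhouette in every view. It is a toy example under stated assumptions, not the paper's pipeline: the two axis-aligned orthographic views, the silhouettes and all names are hypothetical, and a real system would use the calibrated perspective cameras.

```python
def visual_hull(points, views):
    """Return the points whose projection lies inside every silhouette.

    views is a list of (project, silhouette) pairs, where project maps a 3D
    point to integer pixel coordinates and silhouette is the set of
    foreground pixels in that view.
    """
    return [p for p in points
            if all(project(p) in silhouette for project, silhouette in views)]


# Sample points on a 5 x 5 x 5 voxel grid.
grid = [(x, y, z) for x in range(5) for y in range(5) for z in range(5)]

# View along the z axis: the object appears as a disc of radius 2.
disc = {(u, v) for u in range(5) for v in range(5)
        if (u - 2) ** 2 + (v - 2) ** 2 <= 4}

# View along the x axis: the object appears as a 3 x 3 square.
square = {(u, v) for u in range(1, 4) for v in range(1, 4)}

views = [
    (lambda p: (p[0], p[1]), disc),    # orthographic projection onto the xy plane
    (lambda p: (p[1], p[2]), square),  # orthographic projection onto the yz plane
]

hull = visual_hull(grid, views)
```

A point rejected by any single silhouette is carved away, so an error in one segmentation affects the hull directly. This is one way to see why the segmentations of the different images are not independent.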

Here we revisit our CVPR 2005 work and develop a much improved formulation. The object surface is defined as a partition of 3D space into 'inside' and 'outside' regions. The cost functional, which we optimise using Graph-cuts, is a combination of a simple ballooning force and an occlusion-independent Normalised Cross Correlation cost. The advantages of our approach are the following:
(1) Objects of arbitrary topology can be fully represented and computed as a single surface with no self-intersections.
(2) The representation and geometric regularisation are image and viewpoint independent.
(3) Global optimisation is computationally tractable, using existing max-flow algorithms.
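For readers unfamiliar with Normalised Cross Correlation, the sketch below computes the plain NCC score between two image patches given as flat lists of intensities. It shows only this basic building block; the occlusion-independent aggregation over views described above is not reproduced, and the function name and sample patches are hypothetical.

```python
import math


def ncc(patch_a, patch_b):
    """Normalised Cross Correlation of two equally sized patches, in [-1, 1]."""
    if len(patch_a) != len(patch_b) or not patch_a:
        raise ValueError("patches must be non-empty and of equal size")
    mean_a = sum(patch_a) / len(patch_a)
    mean_b = sum(patch_b) / len(patch_b)
    dev_a = [a - mean_a for a in patch_a]
    dev_b = [b - mean_b for b in patch_b]
    denom = math.sqrt(sum(a * a for a in dev_a) * sum(b * b for b in dev_b))
    if denom == 0.0:
        return 0.0  # a constant patch carries no texture to correlate
    return sum(a * b for a, b in zip(dev_a, dev_b)) / denom


reference = [10.0, 20.0, 30.0, 25.0]
brighter = [2.0 * v + 5.0 for v in reference]  # same texture, different gain and offset
reversed_patch = list(reversed([1.0, 2.0, 3.0, 4.0]))

score_same = ncc(reference, brighter)
score_opposite = ncc([1.0, 2.0, 3.0, 4.0], reversed_patch)
```

The score is unchanged by a gain and offset applied to either patch, which makes it a convenient photo-consistency measure when exposure differs between views.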

In this work we explore how the photo-consistency criterion can be used to obtain a likelihood for a given 3D location being 'empty'. We observe that if a 3D point is considered as photo-consistent from a certain camera, then all 3D locations between that point and the camera are likely to be empty. In fact the degree of likelihood for 'emptiness' is related to the degree of photo-consistency. We formalise this observation probabilistically and show how it can be used to reconstruct difficult concavities in objects.

In this work we have obtained
full 3D reconstructions of single-albedo, near-Lambertian objects such as white porcelain
from 36 views under changing but unknown lighting (single distant light-source assumed). This work is
the first to generalise uncalibrated photometric stereo in the multi-view setting. For a more detailed
look at some other models we reconstructed using this technique, look
here.
All you will need is a Java-enabled browser.

Frontier Points are a robust geometrical
feature extracted from the silhouettes. They are points on the surface of the
object with a known 3D location and known local surface orientation. In this paper we have
shown how they can be used to recover information about the surface
reflectance of the object as well as illumination.

The object surface is defined as a boundary separating
the Visual Hull surface from an inner surface at a constant offset from and
inside the Visual Hull. The volume between these two surfaces is discretized
into voxels and for each voxel we compute a photo-consistency cost. Using Graph-Cuts
and a specially defined weighted graph, we compute the surface that optimally
separates voxels inside and outside the scene.

The object surface is defined as a height field on
top of a coarse approximation of the scene surface (typically the visual
hull). The height field is formulated as a Markov Random Field incorporating
photo-consistency and surface smoothness constraints. The resulting cost
function is optimized using Loopy Belief Propagation.

Code

A simple, pattern-based camera pose estimation toolbox for Matlab, suitable for Multi-view stereo reconstructions of small objects. This was shown at our recent MVS tutorial at CVPR 2010.