Publications

We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
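In the discrete case, the classic volume rendering step the abstract refers to reduces to a simple quadrature along each camera ray. A minimal sketch, with made-up per-sample values standing in for the network's outputs:

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """Composite per-sample densities and RGB colors along one camera
    ray using the standard volume-rendering quadrature. In the method,
    sigmas and colors would come from the network; here they are toy
    values."""
    # Opacity of each segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    # Expected ray color is the weighted sum of the sample colors
    return (weights[:, None] * colors).sum(axis=0)

# A nearly transparent green sample in front of a dense red one:
# the rendered color is dominated by the red sample
sigmas = np.array([0.01, 50.0])
colors = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
deltas = np.array([0.1, 0.1])
print(volume_render(sigmas, colors, deltas))
```

Because every step here is differentiable, gradients flow from rendered pixel colors back to the per-sample densities and colors, which is what makes end-to-end optimization from posed images possible.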

Printed and digitally displayed photos have the ability to hide imperceptible digital data that can be accessed through internet-connected imaging systems. Another way to think about this is physical photographs that have unique QR codes invisibly embedded within them. This paper presents an architecture, algorithms, and a prototype implementation addressing this vision. Our key technical contribution is StegaStamp, a learned steganographic algorithm to enable robust encoding and decoding of arbitrary hyperlink bitstrings into photos in a manner that approaches perceptual invisibility. StegaStamp comprises a deep neural network that learns an encoding/decoding algorithm robust to image perturbations approximating the space of distortions resulting from real printing and photography. We demonstrate real-time decoding of hyperlinks in photos from in-the-wild videos that contain variation in lighting, shadows, perspective, occlusion and viewing distance. Our prototype system robustly retrieves 56-bit hyperlinks after error correction -- sufficient to embed a unique code within every photo on the internet.
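The role of the error-correction layer can be illustrated with a toy repetition code: redundancy lets a majority vote survive the bit flips that printing and re-imaging introduce. This is only a stand-in for the system's actual error-correcting code; everything except the 56-bit payload size is an illustrative choice:

```python
import numpy as np

def encode_repeat(bits, r):
    # Repeat each payload bit r times before embedding
    return np.repeat(bits, r)

def decode_repeat(noisy, r):
    # Majority vote over each group of r received copies
    return (noisy.reshape(-1, r).sum(axis=1) > r // 2).astype(int)

rng = np.random.default_rng(0)
payload = rng.integers(0, 2, 56)      # a 56-bit "hyperlink" payload
tx = encode_repeat(payload, r=5)      # 280 bits embedded in the photo
noisy = tx.copy()
noisy[::5] ^= 1                       # corrupt one copy of every bit
decoded = decode_repeat(noisy, r=5)
print((decoded == payload).all())     # True: majority vote recovers all
```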

We present a deep learning solution for estimating the incident illumination at any 3D location within a scene from an input narrow-baseline stereo image pair. Previous approaches for predicting global illumination from images either predict just a single illumination for the entire scene, or separately estimate the illumination at each 3D location without enforcing that the predictions are consistent with the same 3D scene. Instead, we propose a deep learning model that estimates a 3D volumetric RGBA model of a scene, including content outside the observed field of view, and then uses standard volume rendering to estimate the incident illumination at any 3D location within that volume. Our model is trained without any ground truth 3D data and only requires a held-out perspective view near the input stereo pair and a spherical panorama taken within each scene as supervision, as opposed to prior methods for spatially-varying lighting estimation, which require ground truth scene geometry for training. We demonstrate that our method can predict consistent spatially-varying lighting that is convincing enough to plausibly relight and insert highly specular virtual objects into real images.
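The "standard volume rendering" step amounts to front-to-back alpha compositing of RGBA samples along a ray; in the method this would be repeated for rays cast in every direction from a query point to form its incident-illumination map. A minimal sketch with illustrative sample values:

```python
import numpy as np

def composite_ray(rgba):
    """Front-to-back alpha compositing of RGBA samples along one ray."""
    color = np.zeros(3)
    trans = 1.0                        # transmittance so far
    for r, g, b, a in rgba:
        color += trans * a * np.array([r, g, b])
        trans *= 1.0 - a               # each sample occludes those behind
    return color

# A translucent white sample in front of an opaque blue one
samples = np.array([[1.0, 1.0, 1.0, 0.3],
                    [0.0, 0.0, 1.0, 1.0]])
print(composite_ray(samples))          # a desaturated blue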

Eye movements provide insight into what parts of an image a viewer finds most salient, interesting, or relevant to the task at hand. Unfortunately, eye tracking data, a commonly-used proxy for attention, is cumbersome to collect. Here we explore an alternative: a comprehensive web-based toolbox for crowdsourcing visual attention. We draw from four main classes of attention-capturing methodologies in the literature. ZoomMaps is a novel zoom-based interface that captures viewing on a mobile phone. CodeCharts is a self-reporting methodology that records points of interest at precise viewing durations. ImportAnnots is an annotation tool for selecting important image regions, and cursor-based BubbleView lets viewers click to deblur a small area. We compare these methodologies using a common analysis framework in order to develop appropriate use cases for each interface. This toolbox and our analyses provide a blueprint for how to gather attention data at scale without an eye tracker.

Imaging through fog has important applications in industries such as self-driving cars, augmented driving, airplanes, helicopters, drones and trains. Current solutions are based on radar that suffers from poor resolution (due to the long wavelength), or on time gating that suffers from low signal-to-noise ratio. Here we demonstrate a technique that recovers reflectance and depth of a scene obstructed by dense, dynamic, and heterogeneous fog. For practical use cases in self-driving cars, the imaging system is designed in optical reflection mode with minimal footprint and is based on LIDAR hardware. Specifically, we use a single photon avalanche diode (SPAD) camera that time-tags individual detected photons. A probabilistic computational framework is developed to estimate the fog properties from the measurement itself, and distinguish between background photons reflected from the fog and signal photons reflected from the target.
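The separation idea can be sketched with a toy 1D example: fog backscatter produces a broad, slowly varying arrival-time distribution, while the target adds a sharp peak, so subtracting a smooth background estimate exposes the target's depth. The distributions, bin counts, and smoothing window below are illustrative stand-ins for the paper's per-pixel probabilistic model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fog backscatter: many photons with a broad arrival-time spread
fog = rng.gamma(shape=2.0, scale=15.0, size=5000)
# Target return: a sharp peak of photons at time bin ~120 (its depth)
signal = rng.normal(loc=120.0, scale=1.5, size=600)

hist, _ = np.histogram(np.concatenate([fog, signal]),
                       bins=200, range=(0, 200))

# Estimate the slowly varying fog background with a wide moving
# average, then locate the sharp residual peak left by the target
kernel = np.ones(25) / 25.0
background = np.convolve(hist, kernel, mode="same")
residual = hist - background
depth_bin = int(np.argmax(residual))
print(depth_bin)                       # close to the true depth bin, 120
```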

Vehicles, search and rescue personnel, and endoscopes use flash lights to locate, identify, and view objects in their surroundings. Here we show the first steps of how all these tasks can be done around corners with consumer cameras. We introduce a method that couples traditional geometric understanding and data-driven techniques. To avoid the limitation of large dataset gathering, we train the data-driven models on rendered samples to computationally recover the hidden scene on real data. The method has three independent operating modes: 1) a regression output to localize a hidden object in 2D, 2) an identification output to identify the object type or pose, and 3) a generative network to reconstruct the hidden scene from a new viewpoint.

We demonstrate that the imaging optics of an ultrafast camera (or a depth camera) can be dramatically different from the imaging optics of a conventional photography camera. More specifically, we demonstrate that by folding the optical path in time, one can collapse the conventional photography optics into a compact volume or multiplex various functionalities into a single imaging optics piece without losing spatial or temporal resolution. By using time-folding at different regions of the optical path, we achieve an order of magnitude lens tube compression, ultrafast multi-zoom imaging, and ultrafast multi-spectral imaging. Each demonstration was done with a single image acquisition without moving optical components.

Widely used in news, business, and educational media, infographics are handcrafted to effectively communicate messages about complex and often abstract topics including ‘ways to conserve the environment’ and ‘understanding the financial crisis’. Composed of stylistically and semantically diverse visual and textual elements, infographics pose new challenges for computer vision. While automatic text extraction works well on infographics, computer vision approaches trained on natural images fail to identify the stand-alone visual elements in infographics, or ‘icons’. To bridge this representation gap, we propose a synthetic data generation strategy: we augment background patches in infographics from our Visually29K dataset with Internet-scraped icons which we use as training data for an icon proposal mechanism. Combining our icon proposals with icon classification and text extraction, we present a multi-modal summarization application. Our application takes an infographic as input and automatically produces text tags and visual hashtags that are textually and visually representative of the infographic’s topics respectively.
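The synthetic-data strategy (compositing scraped icons onto background patches, which yields bounding-box labels for free) can be sketched as follows; the array shapes and uniform placement are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def paste_icon(patch, icon, rng):
    """Composite an icon onto a background patch at a random location
    and return the augmented patch plus its ground-truth box."""
    H, W, _ = patch.shape
    h, w, _ = icon.shape
    y = int(rng.integers(0, H - h + 1))
    x = int(rng.integers(0, W - w + 1))
    out = patch.copy()
    out[y:y + h, x:x + w] = icon
    return out, (x, y, w, h)           # box label for the icon proposer

rng = np.random.default_rng(0)
patch = np.full((64, 64, 3), 255, dtype=np.uint8)   # blank background
icon = np.zeros((16, 16, 3), dtype=np.uint8)        # a dark 16x16 icon
aug, (x, y, w, h) = paste_icon(patch, icon, rng)
print((aug[y:y + h, x:x + w] == icon).all())        # True
```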

Traditional cameras require a lens and a mega-pixel sensor to capture images. The lens focuses light from the scene onto the sensor. We demonstrate a new imaging method that is lensless and requires only a single pixel for imaging. Compared to previous single-pixel cameras, our system allows significantly faster and more efficient acquisition. This is achieved by using ultrafast time-resolved measurement with compressive sensing. The time-resolved sensing adds information to the measurement, thus fewer measurements are needed and the acquisition is faster. Lensless and single-pixel imaging computationally resolves major constraints in imaging systems design. Notable applications include imaging in challenging parts of the spectrum (like infrared and THz), and in challenging environments where using a lens is problematic.
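The compressive-sensing part can be sketched in a few lines: the single pixel records random projections of the scene, and a sparse solver recovers more pixels than there are measurements. The sizes and the ISTA solver below are illustrative choices; the paper's time-resolved measurement adds information beyond this purely compressive toy:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 100, 40, 5                   # pixels, measurements, sparsity
scene = np.zeros(N)
scene[rng.choice(N, K, replace=False)] = 1.0        # K-sparse scene
A = rng.normal(size=(M, N)) / np.sqrt(M)            # random sensing patterns
y = A @ scene                                       # single-pixel readings

# ISTA: gradient step on the data term, then soft-threshold for sparsity
x = np.zeros(N)
step = 1.0 / np.linalg.norm(A, 2) ** 2              # step below 1/L
lam = 0.01
for _ in range(500):
    x = x + step * (A.T @ (y - A @ x))
    x = np.sign(x) * np.maximum(np.abs(x) - lam * step, 0.0)

print(np.linalg.norm(x - scene))       # small: recovered from M < N readings
```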

Object Classification through Scattering Media with Deep Learning on Time Resolved Measurement

We present a deep learning method for object classification through scattering media. Traditional techniques to see through scattering media rely on a physical model that has to be precisely calibrated; computationally overcoming the scattering depends heavily on such physical models and on the calibration accuracy, so these systems require an accurate and lengthy calibration process. Our method instead trains on synthetic data with variations in the calibration parameters, which allows the network to learn a model that is invariant to the calibration of the lab experiments.
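This training idea is a form of domain randomization, which can be sketched as follows; the blur-plus-gain forward model and the parameter ranges are stand-ins for the actual scattering simulation and calibration parameters:

```python
import numpy as np

def render_synthetic(scene, rng):
    """Render one synthetic training sample under randomly perturbed
    calibration parameters (here a random blur width and gain), so a
    classifier trained on many such samples cannot latch onto one
    exact calibration."""
    sigma = rng.uniform(1.0, 4.0)      # randomized scattering width
    gain = rng.uniform(0.5, 2.0)       # randomized detector gain
    t = np.arange(-10, 11)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    blurred = np.convolve(scene, kernel, mode="same")
    return gain * blurred + rng.normal(0.0, 0.01, scene.size)

rng = np.random.default_rng(0)
scene = np.zeros(64)
scene[30] = 1.0                        # a point object
batch = np.stack([render_synthetic(scene, rng) for _ in range(8)])
# Every sample shows the same scene through a different "calibration"
print(batch.shape)
```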