Double Seminar by Tamara and Alex Berg

Tamara L. Berg and Alex Berg

About the Event

"Rethinking Object Detection in Computer Vision" (Alex Berg)
Object detection is one of the core problems in computer vision and a lens through which to view the field. It brings together machine learning (classification, regression), the variation in appearance of objects and scenes with pose and articulation (lighting, geometry), and the difficulty of deciding what to recognize for what purpose (semantics). All of this happens in a setting where computational complexity is not an abstract concern: it matters at every millisecond of inference, and it can take exaflops to train a model (computation).
I will talk about our ongoing work attacking the detection problem on all fronts. One is the speed-accuracy trade-off, which determines the settings in which detection is practical to use. Our work on single shot detection (SSD) is currently the leading approach [1,2]. Another direction is moving beyond detecting the presence and location of an object to detecting its 3D pose. We are working both on learning deep-network models of how visual appearance changes with pose and object [3] and on integrating pose estimation as a first-class element in detection [4].
Going beyond presence and position to estimating pose is especially important for object detection in the world around us, e.g. in robotics, as opposed to on isolated internet images without context. I call this setting "situated recognition". A key illustration that this setting is under-addressed is the lack of work in computer vision on active vision, where perception is integrated in a loop with sensor platform motion, a key challenge in robotics. I will present our work on a new approach to collecting datasets for training and evaluating situated recognition. It allows computer vision researchers to study active vision, for instance by training networks with reinforcement learning on densely sampled real RGBD imagery, without the difficulty of operating a robot in the training loop. This is a counterpoint to recent work using simulation and CG for such reinforcement learning: our use of real images allows studying and evaluating real-world perception.
I will also briefly mention our lower-level work on computation for computer vision and deep learning algorithms, on building tools for implementation on GPUs and FPGAs, and on other ongoing projects.
Collaborators for major parts of this talk:
UNC students: Wei Liu, Cheng-Yang Fu, Phil Ammirato, Ric Poirson, Eunbyung Park
Outside academic collaborator: Prof. Jana Kosecka (George Mason University)
Adobe: Duygu Ceylan, Jimei Yang, Ersin Yumer
Google: Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed
Amazon: Ananth Ranga, Ambrish Tyagi
[1]
SSD: Single Shot MultiBox Detector
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy,
Scott Reed, Cheng-Yang Fu, Alexander C. Berg
ECCV 2016
https://arxiv.org/pdf/1512.02325.pdf
[2]
DSSD: Deconvolutional Single Shot Detector
Cheng-Yang Fu*, Wei Liu*, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg
arXiv preprint arXiv:1701.06659
https://arxiv.org/pdf/1701.06659.pdf
[3]
Transformation-Grounded Image Generation Network for Novel 3D View Synthesis
Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, Alexander C. Berg
To appear in CVPR 2017
[4]
Fast Single Shot Detection and Pose Estimation
Patrick Poirson, Philip Ammirato, Cheng-Yang Fu, Wei Liu, Jana Kosecka, Alexander C. Berg
3DV 2016
https://arxiv.org/pdf/1609.05590
[5]
A Dataset for Developing and Benchmarking Active Vision
Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Kosecka, and Alexander C. Berg
ICRA 2017
***************************************************************************************************************************************
"Image Description and Beyond..." (Tamara Berg)
Much of everyday language and discourse concerns the visual world around us, making it an important challenge for AI to understand the relationship between the physical world and the language describing it. Comprehending the complex and subtle interplay between the visual and linguistic domains will have broad applicability toward inferring human-like understanding of images, producing natural human-robot interactions, and grounding natural language. In computer vision, along with improvements in deep-learning-based visual recognition, there has been an explosion of recent interest in methods to automatically generate natural language outputs for images and videos. In this talk I will describe our group's efforts to understand and produce relevant natural language about images, from developing early methods to generate complete and human-like image descriptions to moving beyond general image descriptions toward more focused natural language, such as referring expressions and question answering.

Biography

Tamara L. Berg:
I received my B.S. in Mathematics and Computer Science from the University of Wisconsin, Madison in 2001. I then completed a PhD in Computer Science at the University of California, Berkeley in 2007 under the advisorship of Professor David Forsyth as a member of the Berkeley Computer Vision Group. Afterward, I spent one year as a research scientist at Yahoo! Research. From 2008 to 2013 I was an Assistant Professor in the Computer Science department at Stony Brook University and a core member of the consortium for Digital Art, Culture, and Technology (cDACT). I joined the computer science department at the University of North Carolina at Chapel Hill (UNC) in Fall 2013 and am currently a tenured Associate Professor. I am the recipient of an NSF CAREER Award, two Google Faculty Awards, the 2013 Marr Prize, and the 2016 UNC Hettleman Award.
***************************************************************************************************************************************
Alex Berg:
I joined UNC Chapel Hill in August 2013 and have been an associate professor there since July 2016. I am also the CTO of Shopagon Inc. Previously I was an assistant professor at Stony Brook University. I completed a Ph.D. at U.C. Berkeley with Jitendra Malik, and have had the chance to work with many wonderful people.
I am interested in all aspects of computer vision and related problems in other fields. My thesis was on shape and object recognition in images using a new take on deformable templates. I also work on large-scale machine learning algorithms for object recognition and detection, image retrieval, recognizing and synthesizing human action in video, recovering human body poses from photographs, detecting and identifying human faces in images, detecting vehicles in images, and more.