Object Detection, Segmentation, Tracking, and Recognition

Detection and Tracking of Objects of Variable Shape Structure

Profs. Betke and Sclaroff and their students proposed a new method for
object detection and tracking. The method, described in IEEE
Trans. PAMI 2008, uses Hidden State Shape Models (HSSMs), a
generalization of Hidden Markov Models introduced to model object
classes of variable shape structure, such as hands, within a
probabilistic framework. The proposed polynomial-time inference
algorithm automatically determines hand location, orientation, scale, and finger
structure by finding the globally optimal registration of model states
with the image features, even in the presence of clutter. Betke's
student Jingbin Wang conducted experiments that showed that the
proposed method is significantly more accurate than previously
proposed methods based on chamfer-distance matching.
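The idea of a globally optimal registration of model states with image features can be illustrated with a toy dynamic-programming (Viterbi-style) matcher. This is a simplified sketch, not the published HSSM inference algorithm: it assumes an ordered sequence of feature vectors, one prototype vector per hidden shape state, and a fixed transition-cost matrix, all of which are illustrative choices.

```python
import numpy as np

def register_states(features, prototypes, trans_cost):
    """Globally optimal assignment of hidden shape states to an ordered
    sequence of image features via dynamic programming (Viterbi).
    features:   (T, d) observed feature vectors
    prototypes: (S, d) one prototype vector per hidden shape state
    trans_cost: (S, S) cost of moving from state i to state j
    Returns the minimum-cost state sequence as a list of state indices."""
    T, S = len(features), len(prototypes)
    # emission cost: squared distance of each feature to each state prototype
    emit = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    cost = np.full((T, S), np.inf)
    back = np.zeros((T, S), dtype=int)
    cost[0] = emit[0]
    for t in range(1, T):
        total = cost[t - 1][:, None] + trans_cost  # (S, S) cumulative costs
        back[t] = total.argmin(0)
        cost[t] = total.min(0) + emit[t]
    # backtrack the optimal path from the cheapest final state
    path = [int(cost[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because every combination of previous state and current state is scored at each step, the returned path is globally optimal for the given costs, which mirrors the global-registration property described above.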

The group extended the model to Dynamic Hidden-State Shape Models
(DHSSMs) to track and recognize the non-rigid motion of objects with
structural shape variation, in particular, human hands. Betke's
student Zheng Wu conducted experiments that showed that the proposed
method can recognize the digits of a hand while the fingers are being
moved and curled to various degrees. The method was also shown to be
robust to various illumination conditions, the presence of clutter,
occlusions, and some types of self-occlusions.

Shape-based Object Segmentation

Jingbin Wang, Erdan Gu, and Margrit Betke proposed a method that
combines shape-based object recognition and image segmentation
techniques. Given a shape prior represented in a multiscale curvature
form, the proposed method identifies the target objects in images by
grouping oversegmented image regions. The problem is formulated in a
unified probabilistic framework, and object segmentation and
recognition are accomplished simultaneously by a stochastic Markov
Chain Monte Carlo (MCMC) mechanism. Within each sampling move during
the simulation process, probabilistic region grouping operations are
influenced by both the image information and the shape similarity
constraint. The latter constraint is measured by a partial shape
matching process. A generalized cluster sampling algorithm, combined
with large sampling jumps and other implementation improvements,
greatly speeds up the overall stochastic process. The proposed method
supports the segmentation and recognition of multiple occluded objects
in images. Betke's student Jingbin Wang conducted experiments using
both synthetic and real images.
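The sampling loop can be illustrated with a minimal Metropolis sketch over binary region labelings. This is a toy version, not the authors' generalized cluster sampling algorithm: it flips one region per move and assumes the caller supplies an `energy` function that combines image evidence with the shape-similarity constraint.

```python
import math
import random

def mcmc_grouping(regions, energy, steps=2000, temp=1.0, seed=0):
    """Toy Metropolis sampler over region labelings (1 = object).
    `regions` is a list of region ids; `energy(labels)` scores a
    labeling (lower is better) using image evidence plus a shape term.
    Returns the lowest-energy labeling found during the simulation."""
    rng = random.Random(seed)
    labels = {r: 0 for r in regions}
    e = energy(labels)
    best, best_e = dict(labels), e
    for _ in range(steps):
        r = rng.choice(regions)  # propose flipping one region's label
        labels[r] ^= 1
        e_new = energy(labels)
        # accept with Metropolis probability, otherwise undo the flip
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temp):
            e = e_new
            if e < best_e:
                best, best_e = dict(labels), e
        else:
            labels[r] ^= 1
    return best
```

In the actual method the proposal moves regroup whole clusters of regions at once, which is what makes the chain mix much faster than this one-region-at-a-time sketch.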

The Kernel-Subset-Tracker

Sam Epstein and Margrit Betke extended the Semi-least Squares problem
defined by Rao and Mitra in 1971 to the Kernel Semi-least Squares
problem. They introduced "subset projection," a technique that
produces a solution to this problem. They showed how the results of
such a subset projection can be used to approximate a computationally
expensive distance metric. The general problem of distance metric
learning is common in real-world applications such as computer vision
and content retrieval. The proposed variant of the distance learning
problem has particular applications to exemplar-based video tracking.
Epstein and Betke developed the Kernel-Subset-Tracker, which
determines the distance (similarity) of the tracked object to each
training image of the object. From these distances, a template is
created and used to find the next position of the object in the video
frame. Experiments with test subjects showed that augmenting the
Camera Mouse with the Kernel-Subset-Tracker yields a statistically
significant improvement in communication bandwidth. Tracking of facial features was
accurate, without feature drift, even during rapid head movements and
extreme head orientations. The Camera Mouse augmented with the
Kernel-Subset-Tracker enabled a stroke victim with severe motion
impairments to communicate via an on-screen keyboard.
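The template-building step can be sketched as a kernel-weighted average of training exemplars: each training image is weighted by its similarity to the currently tracked patch, and the weighted average becomes the template for the next frame. This illustration is not the published Kernel-Subset-Tracker; the flattened-patch representation and the RBF kernel are assumptions made for the sketch.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel between two flattened image patches;
    larger values mean the patches are more similar."""
    d = a - b
    return np.exp(-gamma * np.dot(d, d))

def build_template(patch, exemplars, gamma=0.5):
    """Weight each training exemplar by its kernel similarity to the
    current patch and return the normalized weighted average as the
    template used to search the next video frame."""
    w = np.array([rbf_kernel(patch, e, gamma) for e in exemplars])
    w /= w.sum()  # normalize weights so the template stays in range
    return (w[:, None] * exemplars).sum(0)
```

Because the weights are renormalized every frame, the template adapts to the current appearance while staying inside the span of the training images, which is one way to avoid feature drift.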

Statistical Object Recognition

Margrit Betke, in collaboration with Eran Naftali and Nick Makris,
derived analytic conditions that are necessary for the maximum
likelihood estimate to become asymptotically unbiased and attain
minimum variance for estimation problems in computer vision. In
particular, they investigated problems of estimating the parameters
that describe the 3D structure of rigid objects or their motion. It is
common practice to compute Cramer-Rao lower bounds (CRLB) to
approximate the mean-square error in parameter estimation problems,
but the CRLB is not guaranteed to be a tight bound and typically
underestimates the true mean-square error. The necessary conditions
for the Cramer-Rao lower bound to be a good approximation of the
mean-square error are derived. The tightness of the bound depends on
the noise level, the number of pixels on the surface of the object,
and the texture of the surface. They examined these analytical results
experimentally using polyhedral objects that consist of planar surface
patches with various textures moving in 3D space, and provided
necessary conditions for the CRLB to be attained that depend on the
size, texture, and noise level of the surface patch.
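A one-dimensional toy example shows how the CRLB relates to empirical mean-square error and why the bound tightens with more pixels and less noise. The estimation problem here, a constant intensity observed in additive Gaussian noise, is a simplification chosen because the sample mean is an efficient estimator that attains the bound exactly.

```python
import numpy as np

def crlb_intensity(sigma, n_pixels):
    """Cramer-Rao lower bound for estimating a constant intensity
    from n_pixels observations corrupted by additive Gaussian noise
    with standard deviation sigma. The Fisher information per pixel
    is 1/sigma**2, so the bound is sigma**2 / n_pixels."""
    return sigma**2 / n_pixels

# Monte-Carlo check: the sample mean is efficient for this problem,
# so its empirical mean-square error should match the CRLB.
rng = np.random.default_rng(0)
sigma, n = 0.2, 400
estimates = np.array([rng.normal(1.0, sigma, n).mean() for _ in range(2000)])
mse = np.mean((estimates - 1.0) ** 2)
bound = crlb_intensity(sigma, n)
```

Doubling the noise level quadruples the bound, while adding pixels shrinks it, matching the dependence on noise level and pixel count described above; for biased or inefficient estimators the empirical error can exceed the bound substantially, which is exactly the tightness question the analysis addresses.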

The problem of recognizing objects subject to affine
transformation in images is examined from a physical
perspective using the theory of statistical estimation. Focusing
first on objects that occlude zero-mean scenes with additive
noise, we derive the Cramer-Rao lower bound on the
mean-square error in an estimate of the six-dimensional
parameter vector that describes an object subject to affine
transformation and so generalize the bound on
one-dimensional position error previously obtained in radar
and sonar pattern recognition. We then derive two useful
descriptors from the object's Fisher information that are
independent of noise level. The first is a generalized coherence
scale that has great practical value because it corresponds to
the width of the object's autocorrelation peak under affine
transformation and so provides a physical measure of the
extent to which an object can be resolved under affine
parameterization. The second is a scalar measure of an object's
complexity that is invariant under affine transformation and
can be used to quantitatively describe the ambiguity level of a
general six-dimensional affine recognition problem. This
measure of complexity has a strong inverse relationship to the
level of recognition ambiguity. We then develop a method for
recognizing objects subject to affine transformation imaged in
thousands of complex real-world scenes. Our method exploits
the resolution gain made available by the brightness contrast
between the object perimeter and the scene it partially
occludes. The level of recognition ambiguity is shown to
decrease exponentially with increasing object and scene
complexity. Ambiguity is then avoided by conditioning the
permissible range of template complexity above a priori
thresholds. Our method is statistically optimal for recognizing
objects that occlude scenes with zero-mean background.
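The coherence-scale idea can be illustrated in one dimension: measure the width of a signal's autocorrelation main lobe at half its peak value. A sharply localized object yields a narrow lobe and can be resolved precisely; a smooth, spread-out object yields a broad lobe. This 1D sketch is a simplification of the generalized coherence scale under full affine parameterization.

```python
import numpy as np

def coherence_scale(signal):
    """Width (in samples) of the autocorrelation main lobe at half its
    peak value: a 1D analogue of the coherence scale, measuring how
    sharply an object can be localized by correlation matching."""
    s = signal - signal.mean()
    ac = np.correlate(s, s, mode='full')
    ac = ac / ac.max()            # peak of an autocorrelation is at zero lag
    center = len(ac) // 2
    # walk outward from the zero-lag peak until correlation drops below 1/2
    w = 0
    while center + w < len(ac) and ac[center + w] >= 0.5:
        w += 1
    return 2 * w - 1
```

A single-pixel spike gives the minimum width of one sample, while a smooth Gaussian bump gives a lobe roughly as wide as the bump itself, which is why high-frequency texture improves localization.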

Intelligent Vehicles

We have developed methods for multiple vehicle detection and tracking
from a moving vehicle. Our methods include real-time analysis of
highway and city traffic scenes under reduced visibility conditions.
We also investigated how to evaluate the focus of attention of a
driver in city traffic. The constantly changing lighting conditions
made automated face analysis very challenging.

The project started at the University of Maryland, where Margrit Betke
was a postdoctoral researcher, working with Larry S. Davis and Esin
Haritaoglu. She later collaborated with Huan Nguyen from Boston
College on some extensions of the project. Her student William
Mullally eventually started a project on analyzing the face and eyes
of a driver.

In the accompanying demonstration videos, each detected car is
automatically annotated with a white frame. Each detected object that is potentially a car is
annotated with a black frame. The linear approximations of lane
markers are used to establish regions of interest for the car
detector. Cars in front are detected reliably. Cars in other lanes
can be detected reliably only in highway scenes or scenes with
sufficient visibility. In the tunnel sequence, the apparent size of
passing cars is larger than the modeled size. Passing cars are thus
not detected correctly.
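The lane-based region of interest can be sketched as follows, assuming each lane marker is approximated by a line x = m*y + b in image coordinates; this parameterization is a hypothetical illustration, not the project's actual code.

```python
def lane_roi(left, right, row):
    """Horizontal search window for the car detector at image `row`,
    bounded by two lane markers approximated as lines x = m*y + b.
    `left` and `right` are (m, b) pairs (hypothetical representation);
    returns the (x_left, x_right) extent of the lane at that row."""
    (ml, bl), (mr, br) = left, right
    return ml * row + bl, mr * row + br
```

Restricting the detector to this window at each row is what keeps cars in the driver's own lane reliably detectable while discarding much of the background clutter.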