Tony directs the Columbia Machine Learning
Laboratory. Our research area is machine learning, a field that
develops novel algorithms that use data to model complex real-world
phenomena and to make accurate predictions about them. Our work spans
both the applied and the fundamental aspects of the field. We have
made contributions by migrating generalized matching and other
graph-theoretic and combinatorial methods into the field of machine
learning. We have also introduced unifying frameworks that combine two
separate methodologies in the field: discriminative approaches and
generative approaches. The machine learning applications we have
worked on have had real-world impact, achieving state-of-the-art
results in vision, face recognition and spatio-temporal tracking, and
leading to a successful startup which applies machine learning
to spatio-temporal data. Our papers are available online here.

Learning with Matchings and Perfect Graphs

Learning with Generalized Matchings

We are interested in the application of generalized matching and other
graph-theoretic / combinatorial methods to machine learning
problems. Matching is an important tool in many fields and
settings. One famous example of generalized matching is Google's
AdWords system which approximately matches advertisers to search terms
(AdWords generates a significant portion of Google's revenue). Until
recently, matching has had limited impact in machine learning. In the
past 5 years, we have introduced and explored the connections between
machine learning problems and generalized matching. In 1965, Jack
Edmonds began pioneering work on the combinatorics of matching and its
many extensions, including b-matching. This work was recognized when
Edmonds received the 1985 John von Neumann Theory Prize, whose citation
read: "Jack Edmonds has been one of the creators of the field of
combinatorial optimization and polyhedral combinatorics. His 1965 paper
'Paths, Trees, and Flowers' was one of the first papers to suggest the
possibility of establishing a mathematical theory of efficient
combinatorial algorithms."

Matchings and generalized matchings are, like trees, a family of
graphs that enjoy fascinating combinatorics and fast
algorithms. Matching goes by many names and variations including
permutation (or permutation-invariance), assignment, auction,
alignment, correspondence, bipartite matching, unipartite matching,
generalized matching and b-matching. In algebra, the group formed by
perfect bipartite matchings (permutations) is known as the symmetric
group. We introduced the use of matching in
machine learning problems and showed how it improves the state of the
art in machine learning problems like classification, clustering,
semisupervised learning, collaborative filtering, tracking, and
visualization. We have also shown how matching leads to theoretical
and algorithmic breakthroughs in MAP estimation, Bayesian networks and
graphical modeling, which are important topics in machine learning.

We initially explored matching from a computer vision
perspective where it was already a popular tool. For instance, to
compare two images for recognition purposes, it is natural to first
match pixels or patches in the images by solving a bipartite matching
problem. We first explored matching for this type of image alignment
and interleaved it into unsupervised and supervised learning
problems. For instance, we explored matching within unsupervised
Gaussian modeling and principal components analysis settings
(ICCV 03, AISTAT 03,
COLT 04).
We also explored matching in classification frameworks by
building kernels that were invariant to correspondence
(ICML 03)
and classifiers that were invariant to permutation
(ICML 06b).
These methods significantly outperformed other approaches, particularly
on image datasets where images were represented as vector sets or bags
of pixels (ICCV
03). The paper "A Kernel between Sets of Vectors" (ICML
03) won an award at the International Conference on Machine
Learning in 2003 out of a total of 370 submitted papers.
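
To make the bag-of-pixels comparison concrete, here is a minimal
sketch in Python (with hypothetical random features standing in for
real pixel or patch descriptors): each image is an unordered set of
vectors, and similarity is scored through the best bipartite matching
between the two sets, with SciPy's Hungarian-algorithm solver standing
in for a generic matching solver.

    # Compare two "bags of pixels" via minimum-cost bipartite matching.
    # Hypothetical toy features; any feature extractor could be substituted.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    A = rng.random((8, 3))                    # image 1: 8 pixel/patch features
    B = rng.random((8, 3))                    # image 2, in arbitrary order

    cost = cdist(A, B)                        # pairwise feature distances
    rows, cols = linear_sum_assignment(cost)  # min-cost perfect matching
    dist = cost[rows, cols].sum()             # permutation-invariant distance
    print("matched pairs:", list(zip(rows, cols)))
    print("set-to-set distance:", round(dist, 3))

Because the score is computed over the best correspondence, it is
invariant to the order of the pixels, which is precisely the invariance
the kernels above build into classifiers.
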
Given the empirical success of matching in image problems, we
considered its application to other machine learning problems. One key
insight was that matching acts as a general tool for turning data into
graphs. Once data is in the form of a graph, a variety of machine
learning problems can be directly handled by algorithms from the
graph theory community. We began pioneering the use of b-matching, a
natural extension of matching, in a variety of machine learning
problems. We showed how it can lead to state-of-the-art results
visualization or graph embedding
(ICML 09,
AISTATS 07d),
clustering (ECML 06),
classification (AISTAT 07b),
semisupervised learning (ICML 09b),
collaborative filtering (NIPS 08b),
and tracking (AISTAT 07a).

We were initially motivated by the KDD 2005 Challenge, a
competition organized by the CIA and NSA to cluster anonymized
documents by authorship. By using b-matching prior to spectral
clustering algorithms, we obtained the best average accuracy on the
ER1B task, competing against 8 funded teams from various other
universities. The work was published in a classified journal
(Jebara, Shchogolev and Kondor in JICRD 2006) and a related unclassified publication can
be found in (ECML 06).
A simple way of seeing the usefulness of b-matching is to view it
as a principled competitor to the simplest and most popular machine
learning algorithm: k-nearest neighbors. For instance, consider
connecting the points in the figure below using 2-nearest-neighbors.
Typically, when computing k-nearest-neighbors, many nodes will
actually have more than k neighbors since, in addition to choosing
k neighbors, each node can itself be selected by other nodes during
the greedy procedure. This situation is depicted in Figure (a) below.
Here, several two-ring datasets are shown which
should be clustered into the two obvious components: the inner ring
and the outer ring. However, when the smaller ring's radius is similar
to the larger ring's radius or when the number of samples on each ring
is low, k-nearest-neighbors does not recover the two-ring
connectivity correctly. This is because nodes in the smaller inner
ring get over-selected by nodes in the outer ring and eventually have
large neighborhoods. Ultimately, k-nearest-neighbor algorithms do
not produce regular graphs as output, and the degree of many nodes
(especially on the inner ring) is larger than k. This
irregularity in k-nearest-neighbors can get exponentially worse in
higher-dimensional datasets, as shown by (Devroye and Wagner, 1979):
the degree can grow with the so-called kissing number in d-dimensional
space, limiting the stability and generalization power of k-nearest
neighbors. Conversely, b-matching methods
guarantee that a truly regular graph (of minimum total edge length) is
produced. Figure (b) shows the output of b-matching, and
therein it is clear that each node has exactly 2 neighbors. Even
spectral clustering will not tease apart these two rings at some
scales and sampling configurations.
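
The contrast is easy to reproduce in a few lines. The sketch below
uses hypothetical random points and, rather than the message-passing
solvers described in the next section, solves the same
degree-constrained objective as a small integer program (SciPy 1.9+
for the milp solver); the k-NN graph comes out irregular after
symmetrization while every b-matching degree is exactly b.

    # k-NN vs b-matching on toy data. Hypothetical random points; requires
    # SciPy >= 1.9 for milp. The exact message-passing b-matching solver from
    # the papers is not reproduced here; an equivalent small integer program
    # optimizes the same degree-constrained objective for illustration.
    import numpy as np
    from scipy.optimize import Bounds, LinearConstraint, milp
    from scipy.spatial.distance import pdist, squareform

    rng = np.random.default_rng(0)
    X = rng.standard_normal((12, 2))            # 12 points in the plane
    D = squareform(pdist(X))                    # pairwise Euclidean distances
    n, b = len(X), 2

    # Greedy k-NN graph: each node picks its k closest neighbors (directed).
    knn = np.zeros((n, n), dtype=int)
    for i in range(n):
        knn[i, np.argsort(D[i])[1:b + 1]] = 1   # skip self at index 0
    sym = np.maximum(knn, knn.T)                # symmetrize (union of edges)
    print("k-NN degrees:      ", sym.sum(axis=1))   # often exceed k = b

    # b-matching: minimum total edge length subject to every degree == b.
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    cost = np.array([D[i, j] for i, j in edges])
    A = np.zeros((n, len(edges)))
    for e, (i, j) in enumerate(edges):
        A[i, e] = A[j, e] = 1                   # edge e touches nodes i and j
    res = milp(cost, constraints=LinearConstraint(A, b, b),
               integrality=np.ones(len(edges)), bounds=Bounds(0, 1))
    print("b-matching degrees:", (A @ res.x).round().astype(int))  # all b
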

After showing how matching improves spectral clustering in practice
(ECML 06),
we investigated its use in other machine learning
settings. For instance, in
(AISTAT 07b),
we showed how it can
perform more accurate classification than k-nearest neighbors
without requiring much more computational effort. In visualization
problems, where high-dimensional datasets are typically turned into
k-nearest neighbor graphs prior to low-dimensional embedding
(AISTATS 07d),
we showed how b-matching yields more
balanced connectivity and more faithful visualizations
(Snowbird 06,
ICML 09).
The latter paper, "Structure Preserving Embedding", won the best paper award
at the International Conference on Machine
Learning in 2009 out of a total of 600 submitted papers. In addition, in
semisupervised learning and so-called graph transduction methods
(ICML 08),
we showed how b-matching consistently improves
accuracy
(ICML 09b).
In collaborative filtering problems, we
applied a generalization of b-matching where each node need not have
constant degree b but rather any customizable distribution over
degrees; this yielded more accurate recommendation results
(NIPS 08b).
Finally, by realizing that perfect bipartite matchings (permutations)
form a finite group in algebra known as the symmetric group, we have
managed to apply group-theoretic tools such as generalized fast Fourier
transforms to solve problems over distributions on matchings
(AISTAT 07a).
Such distributions over matchings are useful in
multi-target tracking since, at each time point, the identities of
current targets need to be matched to previously tracked
trajectories. However, since there are n! matchings, these
distributions need to be band-limited using Fourier analysis to remain
efficient (polynomial). To our knowledge, this was the first
application of group theory to matchings in a machine learning or
tracking setting, and it led to improvements over previous methods.
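
To give a flavor of the band-limiting idea, the toy sketch below
(hypothetical, with n = 3 targets so all n! matchings can be
enumerated) keeps only the first-order summary of a distribution over
permutations: the n-by-n matrix of marginals P(target i is assigned to
track j). This doubly stochastic matrix is the information carried by
the lowest-order Fourier coefficients on the symmetric group and costs
n^2 numbers instead of n!.

    # Band-limiting a distribution over matchings to first-order marginals.
    # Toy scale: n = 3, so all 3! = 6 permutations can be enumerated exactly.
    import itertools
    import numpy as np

    n = 3
    perms = list(itertools.permutations(range(n)))
    rng = np.random.default_rng(0)
    p = rng.random(len(perms))
    p /= p.sum()                  # a distribution over all n! matchings

    # First-order summary: M[i, j] = P(target i is matched to track j).
    M = np.zeros((n, n))
    for prob, sigma in zip(p, perms):
        for i, j in enumerate(sigma):
            M[i, j] += prob
    print(M)                      # doubly stochastic
    print(M.sum(axis=0), M.sum(axis=1))   # rows and columns each sum to 1
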

Matching, MAP Estimation and Message Passing

One of the obstacles in applying off-the-shelf generalized matching
solvers to real machine learning problems is computation time. For
instance, the fastest solvers from the combinatorial optimization
community were too computationally demanding to handle more than a few
hundred data points (ECML
06). This led us to build our own algorithms for solving
b-matchings using a generally fast machine learning approach known
as message passing (also called belief propagation; we use the two
terms interchangeably here). This method is extremely popular in
machine learning and is used to perform approximate inference
in graphical models and decomposable probability distributions. In
(AISTAT
07b), we realized that b-matching problems could be written as a
graphical model, a probability density function that decomposes into a
product of functions over the cliques of a sparse graph. To find the maximum a posteriori
(MAP) estimate of this graphical model, we implemented message
passing, a clever and fast way of approximating the b-matching problem
by sending messages along the graph. Message passing works for MAP
estimation on tree-structured graphical models but only gives
approximate answers on graphical models with loops. Since the
b-matching graphical model had many loops, all we could hope for was
suboptimal convergence some of the time. Amazingly, message passing on
this loopy graphical model efficiently produced the correct
answer all the time. We published a proof guaranteeing the
efficient and exact convergence of message passing for b-matching problems in
(AISTAT
07b), and showed one of the few cases where optimality is provably
preserved despite loops in the graphical model. The optimality of
message passing for trees (Pearl 1988) and single-loop graphs (Weiss
2000) was previously known; however, the extension to b-matching
represents an important additional case. This was significant since
defining which graphical models admit exact inference has been an open
and exciting research area in the field of machine learning (and other
communities such as decoding) for many years. Graphical models
corresponding to matching, b-matching and generalized matching extend
known guarantees beyond trees and single loop graphs to a wider set of
models and problems. Today, the connection between message passing and
matching is a promising area of theoretical research with
contributions from multiple theory researchers (Bayati, Shah and
Sharma 2005, Sanghavi, Malioutov and Willsky 2008, as well as Wainwright and
Jordan 2008).
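
For readers unfamiliar with the routine itself, here is a minimal
sketch of max-product message passing (run in log space, so max-sum)
for MAP estimation on a toy three-node chain, the classical tree case
where it is exact; the potentials are hypothetical and a brute-force
check confirms the answer. The b-matching graphical model is loopy,
and the proof in (AISTAT 07b) is what extends this exactness to that
case.

    # Max-sum message passing for MAP estimation on a chain x0 - x1 - x2.
    # All potentials are random and hypothetical; on a tree the procedure
    # is exact, which the brute-force check at the end confirms.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    K = 3                                          # states per variable
    th = [rng.random(K) for _ in range(3)]         # unary log-potentials
    p01 = rng.random((K, K))                       # pairwise log-potentials
    p12 = rng.random((K, K))

    # Leaves-to-root pass with root x1: each message eliminates one variable.
    m01 = (th[0][:, None] + p01).max(axis=0)       # m_{0->1}(x1) = max over x0
    m21 = (th[2][None, :] + p12).max(axis=1)       # m_{2->1}(x1) = max over x2
    x1 = int(np.argmax(th[1] + m01 + m21))         # root decodes its MAP state
    x0 = int(np.argmax(th[0] + p01[:, x1]))        # backtrack to each leaf
    x2 = int(np.argmax(th[2] + p12[x1, :]))

    # Brute force over all K^3 joint states: message passing matches exactly.
    score = lambda x: (th[0][x[0]] + th[1][x[1]] + th[2][x[2]]
                       + p01[x[0], x[1]] + p12[x[1], x[2]])
    best = max(itertools.product(range(K), repeat=3), key=score)
    assert (x0, x1, x2) == best
    print("MAP assignment:", (x0, x1, x2))
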

Simultaneously, several applied researchers
have downloaded our code (available online) and are efficiently
solving large-scale (bipartite and unipartite) b-matching problems
using message passing. Our group uses the code regularly for the
various papers described in the previous section. In our original
article we proved that the algorithm requires O(n^3) time for solving
maximum weight b-matchings on dense graphs with n nodes, yet
empirically our code seemed to be much more efficient. Recently,
(Salez and Shah 2009) proved that, under mild assumptions, message
passing takes O(n^2) time on dense graphs. This result indicates
that our implementation is one of the fastest around and applies to
problems of a scale comparable to k-nearest neighbors. Currently, we are
working on other theoretical guarantees for b-matching in terms of its
generalization performance when applied to independently sampled
data. To do so, we are attempting to characterize the stability of
b-matching compared to k-nearest neighbors by following the approach
of (Devroye and Wagner 1979) which should yield theoretical guarantees
on its accuracy in machine learning settings.

Matchings, Trees and Perfect Graphs

Matchings, b-matchings and generalized matchings are just families
of graphs in the space of all possible undirected graphs on n nodes. There
are n! possible matchings and even more possible
b-matchings. Amazingly, however, maximum weight matchings
can be recovered efficiently either by using Edmonds' algorithm
or by using matching-structured graphical models which
admit exact inference via message passing. Similarly, there are
n^(n-2) possible trees, and tree-structured graphical models admit exact
inference via message passing.

We have worked on tree-structured graphical modeling in several
settings because of these interesting computational advantages. For
instance, in
(UAI 04),
we developed tree hierarchies of
dynamical systems that allow efficient structured mean field
inference. These were useful for multivariate time series tracking and
other applied temporal data. Another remarkable property of trees is
that summation over a distribution over all n^(n-2) trees can be
done in cubic time using Kirchhoff's and Tutte's matrix tree theorems. In
(NIPS 99)
we proposed inference over the distribution of all
possible undirected trees over n nodes using Kirchhoff's matrix tree
theorem. In
(UAI 08)
we developed inference over the distribution
of all possible out-directed trees over n nodes using Tutte's matrix
tree theorem. These methods are useful in modeling non-i.i.d. data such
as phylogenetic data, disease spread, language arborescence models and
so forth.
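
A small self-contained illustration of the Kirchhoff machinery, on a
hypothetical toy graph and checked by brute-force enumeration
(feasible only at this tiny scale): the weighted sum over all spanning
trees equals a single determinant, an O(n^3) computation.

    # Kirchhoff's matrix-tree theorem on a toy weighted complete graph.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    W = rng.random((n, n))
    W = (W + W.T) / 2             # symmetric edge weights
    np.fill_diagonal(W, 0)

    # Sum over spanning trees of the product of edge weights equals any
    # cofactor of the weighted graph Laplacian.
    L = np.diag(W.sum(axis=1)) - W
    via_kirchhoff = np.linalg.det(L[1:, 1:])

    # Brute force over all edge subsets of size n-1 (tiny graphs only).
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    def spans(sub):               # n-1 edges + connected <=> spanning tree
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                a = parent[a]
            return a
        for i, j in sub:
            parent[find(i)] = find(j)
        return len({find(v) for v in range(n)}) == 1

    via_brute = sum(np.prod([W[i, j] for i, j in sub])
                    for sub in itertools.combinations(edges, n - 1)
                    if spans(sub))
    print(round(via_kirchhoff, 8), round(via_brute, 8))  # agree
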

Trees, then, not only have fascinating combinatorial and computational
properties, they also lead to important empirical results on real-world
machine learning problems; and the same two statements hold for
matchings. Is there some larger family of graphs which subsumes these
two families? The answer is yes, and it leads to one of the most
exciting topics in combinatorics: perfect graphs. In
(UAI 09)
we were the first to identify this contact point between perfect graphs,
machine learning and probabilistic inference.

Recently, mathematicians and combinatorialists achieved a
breakthrough in their field through the work of (Chudnovsky et
al. 2006), who proved the decades-old strong perfect graph
conjecture and developed algorithms for testing graphs to determine if
they are perfect. We have shown that perfect graphs have important
implications for machine learning and decoding by using
perfect graphs to identify when inference in general graphical models
is exact and when it can be solved efficiently via message passing
(UAI 09).
This is done by converting graphical models (with
loops) and the corresponding MAP estimation problem into what we call a
nand Markov random field. The topology of this structure is then
efficiently diagnosed to see if it forms a perfect graph. If it is
indeed perfect, then MAP estimation is guaranteed exact and efficient,
and message passing will provide the optimal solution. As noted above,
defining which graphical models admit exact inference has been an open
and exciting research area in machine learning (and other communities
such as decoding) for many years; perfect graph theory is now a
valuable tool in the field's arsenal. This recent work extends the
graphical model and Bayesian network inference guarantees known for
trees and matchings to the more general and more fundamental family of
perfect graphs.
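
The strong perfect graph theorem also gives a transparent, if
exponential, way to test perfection on toy graphs: a graph is perfect
if and only if neither it nor its complement contains an induced odd
cycle of length five or more. The brute-force sketch below illustrates
that characterization; it is not the polynomial-time recognition
algorithm mentioned above.

    # Brute-force perfection test via the strong perfect graph theorem.
    # Only viable for tiny graphs; for illustration only.
    import itertools
    import numpy as np

    def is_induced_odd_hole(A, S):
        """Does vertex subset S induce a chordless odd cycle of length >= 5?"""
        k = len(S)
        if k < 5 or k % 2 == 0:
            return False
        sub = A[np.ix_(S, S)]
        if not np.all(sub.sum(axis=1) == 2):   # a cycle is 2-regular
            return False
        seen, frontier = {0}, [0]              # check it is one connected cycle
        while frontier:
            v = frontier.pop()
            for u in np.flatnonzero(sub[v]):
                if u not in seen:
                    seen.add(u)
                    frontier.append(int(u))
        return len(seen) == k

    def is_perfect(A):
        n = len(A)
        comp = 1 - A - np.eye(n, dtype=int)    # complement graph
        for k in range(5, n + 1, 2):           # odd subset sizes only
            for S in itertools.combinations(range(n), k):
                if (is_induced_odd_hole(A, list(S))
                        or is_induced_odd_hole(comp, list(S))):
                    return False
        return True

    C5 = np.roll(np.eye(5, dtype=int), 1, axis=1)
    C5 = C5 | C5.T                             # the 5-cycle
    print(is_perfect(C5))                      # False: smallest imperfect graph
    print(is_perfect(1 - np.eye(4, dtype=int)))  # True: complete graphs are perfect
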

Discriminative and Generative Learning

Before our work in matchings and graphs, we began contributing to
machine learning by combining discriminative (large-margin) learning
with generative (probabilistic) models. The machine learning community
had two schools of thought: discriminative methods (such as large-margin
support vector classification) and generative methods
(such as Bayesian networks, Bayesian learning (Pattern Recognition 00) and
probability density estimation). Generative models summarize data and
are more flexible since the practitioner can introduce various
conditional independence assumptions, priors, hidden variables and
myriad parametric assumptions. Meanwhile, discriminative models learn
from data only to make accurate input-to-output predictions and offer
less modeling flexibility, limiting the practitioner to kernel
methods, which can be less natural.

The combination of generative and discriminative learning was explored
in the book
(Jebara 04)
and provided several tools for combining these goals. These include
variational approaches to maximum conditional likelihood estimation
(NIPS 98,
NIPS 00)
as well as
maximum entropy discrimination or MED
(NIPS 99,
UAI 00,
ICML 04).
These methods work directly
with generative models yet estimate the models' parameters to reduce
classification error and maximize the resulting classifier's margin.
In fact, the MED approach subsumes
support vector machines and also leads to important SVM extensions by
using probabilistic machinery. This MED framework has led to the first
convex multitask SVM approach, which performs shared feature and kernel
selection while estimating multiple SVM classifiers
(ICML 04,
JMLR 09).
Similarly, nonstationary kernel selection and
classification with hidden variables can also be derived within
this framework
(ICML 06).
In current work, we are exploring
variants of conic kernel selection by learning mixtures of
transformations, a method that also subsumes SVM kernel learning
(NIPS 07).

Another approach to combining discriminative and generative learning
is to use probability distributions to form kernels which are then
used directly by standard large margin learning algorithms such as
support vector machines.
We proposed probability
product kernels
(ICML 03,
COLT 03,
JMLR 04) and hyperkernels
(NIPS 06), which involve the integral of the product of two
probability functions over the sample space. In a special case, this
kernel is the affinity corresponding to the Hellinger divergence. This
approach allows kernels to be built from exponential family models,
mixtures, Bayesian networks and hidden Markov models. Since then, other
researchers have extended this idea and developed kernels from
probabilities by exploring other information divergences beyond the
Hellinger divergence.
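
For a concrete instance, here is a minimal sketch of the
Hellinger-affinity case for two univariate Gaussians, where the
integral of the square-rooted product has a closed form; the parameter
values are hypothetical and numerical quadrature is included only as a
sanity check.

    # Probability product kernel at rho = 1/2 (Bhattacharyya affinity)
    # between two univariate Gaussians: k(p, q) = int sqrt(p(x) q(x)) dx.
    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    def bhattacharyya_gauss(m1, s1, m2, s2):
        """Closed form of int sqrt(N(x; m1, s1^2) N(x; m2, s2^2)) dx."""
        v = s1**2 + s2**2
        return np.sqrt(2 * s1 * s2 / v) * np.exp(-(m1 - m2)**2 / (4 * v))

    m1, s1, m2, s2 = 0.0, 1.0, 1.5, 2.0        # hypothetical parameters
    closed = bhattacharyya_gauss(m1, s1, m2, s2)
    numeric, _ = quad(lambda x: np.sqrt(norm.pdf(x, m1, s1)
                                        * norm.pdf(x, m2, s2)),
                      -np.inf, np.inf)
    print(closed, numeric)                     # agree up to quadrature error
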

While large margin estimation has led to consistent improvements in
the classification accuracy of generative models and probabilistic
kernels, our most recent work investigates even more
aggressive discriminative criteria. This has led to a more powerful
criterion for discrimination we call large relative
margin. There, the margin formed by the classifier is maximized while
also controlling the spread of the projections of the data onto the
classification direction. This work has led to significant improvements over the
performance of support vector machines without making any additional
assumptions about the data. The extension of large margin approaches
to the relative margin case is straightforward and efficient, has
theoretical guarantees, and delivers strong empirical performance
(AISTATS 07c,
NIPS 08).
We are currently investigating its
applicability to large margin structured prediction problems, variants
of boosting as well as the estimation of even more discriminative
generative models.
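
As a rough illustration of the relative-margin idea (a sketch, not the
exact formulation from the papers), the toy below augments a standard
soft-margin SVM objective with a constraint |w'x + b| <= B that bounds
the spread of the data's projections onto the decision direction; the
data, C and B are hypothetical, and letting B grow unbounded recovers
the ordinary SVM.

    # Sketch of a relative-margin classifier as a small convex program.
    # Hypothetical 2-D toy data; requires cvxpy.
    import cvxpy as cp
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([-1] * 20 + [1] * 20)
    C, B = 1.0, 3.0                            # trade-off and spread bound

    w, b = cp.Variable(2), cp.Variable()
    xi = cp.Variable(40, nonneg=True)          # slack variables
    margin = cp.multiply(y, X @ w + b)
    prob = cp.Problem(
        cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
        [margin >= 1 - xi,                     # usual large-margin constraints
         cp.abs(X @ w + b) <= B])              # bound the projection spread
    prob.solve()
    print("w =", w.value, "b =", b.value)
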

Application Areas

In addition to the above fundamental machine learning work, we are also
dedicated to real-world applications of machine learning and its
widespread impact on industry and other fields. In older work, we
built machine-learning-inspired computer vision systems that achieved
top-performing face recognition results and obtained an honorable
mention from the Pattern Recognition Society
(Pattern Recognition 00,
NIPS 98b).
We have also worked on highly cited
(300 citations) vision-based person tracking
(CVPR 97), behavior
recognition systems
(ICVS 99)
as well as mobile and
wearable computing (ISWC 97,
IUW
98). In early 2006, we helped
found Sense Networks, a
startup that applies machine learning to location and mobile phone
data. Other startups we are involved in
include
Agolo, Ninoh,
and Bookt.