Abstracts

Clustering: Probably Approximately Useless?, Rich Caruana (MSR)

Clustering never seems to live up to the hype. To paraphrase the popular saying, clustering looks good in theory, yet often fails to deliver in practice. Why? You would think that something so simple and elegant as finding groups of similar items in data would be incredibly useful. Yet often it isn't. The problem is that clustering rarely finds the groups *you* want, or expected, or that are most useful for the task at hand. There are so many good ways to cluster a dataset that the odds of coming up with the clustering that is best for what you're doing are small. How do we fix this and make clustering more useful in practice? How do we make clustering do what you want, while still giving it the freedom to "do its own thing" and surprise us?

In recent years, manifold learning has become increasingly popular as
a tool for performing non-linear dimensionality reduction. This has
led to the development of numerous algorithms of varying degrees of
complexity that aim to recover the underlying low-dimensional
parameters of the data using either local or global features.

It is also widely recognized that low-dimensional parametrizations
will typically distort the geometric properties of the original data,
like distances, angles, areas and so on. These distortions depend both
on the data and on the algorithm used.

Building on the Laplacian Eigenmap framework, we propose a new
paradigm that offers a guarantee, under reasonable assumptions, that
*any* manifold learning algorithm will preserve the geometry of a data
set. Our approach is based on augmenting the output of an algorithm
with geometric information, embodied in the Riemannian metric of the
manifold. The Riemannian metric allows us to compute geometric
quantities (such as angle, length, or volume) for any coordinate
system or embedding of the manifold. This geometric faithfulness,
which is not guaranteed for most algorithms, allows us to define
geometric measurements that are independent of the algorithm used,
and hence move seamlessly from one algorithm to another. In this work,
we provide an algorithm for estimating the Riemannian metric from data
and demonstrate the advantages of our approach in a variety of
examples.
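As a rough illustration of the idea (the data, kernel bandwidth, and constants below are made up for the sketch; the paper gives the actual estimator), the dual Riemannian metric can be estimated by applying a graph Laplacian, which approximates the Laplace-Beltrami operator, to products of the embedding coordinates. Here a deliberately distorting "embedding" stretches one axis by 3, and the estimated metric detects the distortion:

```python
import numpy as np

# Hypothetical setup: data sampled flatly from a 2-D square, and an
# "embedding" that stretches the first axis by 3, mimicking the kind of
# distortion a manifold learning algorithm can introduce.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1500, 2))
Y = X * np.array([3.0, 1.0])

# Gaussian-kernel graph Laplacian with density renormalization: one standard
# estimator of the Laplace-Beltrami operator (constants simplified here).
eps = 0.05
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / eps)
q = K.sum(1)
K = K / np.outer(q, q)                 # remove sampling-density effects
P = K / K.sum(1, keepdims=True)
L = (P - np.eye(len(X))) / (eps / 4)

def dual_metric(L, Y, i):
    # h^{kl}(i) = 0.5 * [L(f_k f_l) - f_k L f_l - f_l L f_k](i),
    # with f_k the k-th embedding coordinate (k-th column of Y)
    s = Y.shape[1]
    h = np.empty((s, s))
    for k in range(s):
        for l in range(s):
            h[k, l] = 0.5 * (L[i] @ (Y[:, k] * Y[:, l])
                             - Y[i, k] * (L[i] @ Y[:, l])
                             - Y[i, l] * (L[i] @ Y[:, k]))
    return h

i = int(np.argmin((X ** 2).sum(1)))    # an interior point, away from the boundary
h = dual_metric(L, Y, i)               # dual metric: reveals the 9x stretch on axis 0
g = np.linalg.pinv(h)                  # Riemannian metric in the embedding coordinates
```

With the metric g attached to the embedding, lengths and volumes computed in the stretched coordinates agree with those of the original data, which is the geometric faithfulness described above.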

As an application of this new framework, we develop a new, principled,
unsupervised approach to selecting the scale parameter in manifold
learning, based on optimizing the geometric self-consistency with
respect to the scale.

This talk will not require any knowledge of advanced mathematics or
manifold learning.

Joint work with Dominique Perrault-Joncas.

Deep Learning (but not the kind you were thinking of), Ran Gilad-Bachrach (MSR)

Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic. As an alternative, we define the depth of hypotheses and show that generalization and robustness can be bounded as a function of this depth. Therefore, we suggest using the median hypothesis, which is a deep hypothesis, and present algorithms for approximating it.
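The equivalence claimed in the first sentences can be made concrete: exponentiating the negated objective yields a Boltzmann posterior whose mode is exactly the objective's minimizer, while the posterior median is a different summary of the same distribution. A minimal sketch over a 1-D hypothesis grid (the loss, data, and temperature are invented for illustration):

```python
import numpy as np

# 1-D hypothesis space on a grid; regularized squared loss as the objective
hs = np.linspace(-3.0, 6.0, 901)
data = np.array([0.9, 1.0, 1.1, 5.0])         # toy sample with one outlier

def objective(h):
    return ((data - h) ** 2).mean() + 0.1 * h ** 2

obj = np.array([objective(h) for h in hs])
post = np.exp(-obj)                            # Boltzmann distribution (beta = 1)
post /= post.sum()

h_map = hs[np.argmax(post)]                    # MAP hypothesis = objective minimizer
cdf = np.cumsum(post)
h_med = hs[np.searchsorted(cdf, 0.5)]          # posterior median of the same distribution
```

Because the objective is quadratic here, mode and median nearly coincide; the talk's point is that in general they do not, and the median-style (deep) hypothesis enjoys robustness guarantees the mode lacks.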

One contribution of this work is an efficient method for approximating the Tukey median. The Tukey median, which is often used for data visualization and outlier detection, is a special case of the family of medians we define; however, computing it exactly is exponentially slow in the dimension. Our algorithm approximates such medians in polynomial time while making weaker assumptions than those required by previous work.
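For intuition only (this brute-force direction-sampling scheme is not the paper's algorithm, and the data are invented), Tukey depth in 2-D is the smallest fraction of points on either side of any line through a candidate point, and the deepest point is a robust center that, unlike the mean, is barely moved by an outlier cluster:

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.normal(size=(300, 2))
pts[:30] += 8.0                              # contaminating outlier cluster

def tukey_depth(p, pts, n_dir=200):
    # depth(p) = min over halfspaces through p of the fraction of points inside,
    # approximated here by scanning n_dir fixed directions
    ang = np.linspace(0.0, np.pi, n_dir, endpoint=False)
    U = np.stack([np.cos(ang), np.sin(ang)], axis=1)
    side = ((pts - p) @ U.T >= 0).mean(0)
    return float(np.minimum(side, 1.0 - side).min())

depths = np.array([tukey_depth(p, pts) for p in pts])
t_med = pts[np.argmax(depths)]               # deepest sample point: a robust center
```

The mean of this sample is dragged noticeably toward the outliers, while the deepest point stays near the bulk of the data.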

The Bayesian Information Criterion (BIC) is a widely used model
selection technique that is inspired by the large-sample asymptotic
behavior of Bayesian approaches to model selection. In this talk we
will consider such approximate Bayesian model choice for problems that
involve models whose Fisher-information matrices may fail to be
invertible along other competing submodels. When models are singular
in this way, the penalty structure in BIC generally does not reflect
the large-sample behavior of their Bayesian marginal likelihood.
While large-sample theory for the marginal likelihood of singular
models has been developed recently, the resulting approximations
depend on the true parameter value and lead to a paradox of circular
reasoning. Guided by examples such as determining the number of
components of mixture models, the number of factors in latent factor
models or the rank in reduced-rank regression, we propose a resolution
to this paradox and give a practical extension of BIC for singular
model selection problems.
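For concreteness, here is the standard BIC computation that the talk takes as its starting point: choosing the number of mixture components by penalized log-likelihood, BIC(k) = d_k log n - 2 log L_k. The toy 1-D data, hand-rolled EM, and all constants below are illustrative assumptions; mixtures are precisely the singular case where this off-the-shelf penalty can be miscalibrated:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 0.5, 400), rng.normal(2.0, 0.5, 400)])
n = len(x)

def fit_loglik(x, k, iters=200):
    # minimal EM for a 1-D Gaussian mixture; returns the final log-likelihood
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # deterministic spread-out init
    sig = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
        r = dens / dens.sum(1, keepdims=True)       # E-step: responsibilities
        nk = r.sum(0) + 1e-9
        w, mu = nk / n, (r * x[:, None]).sum(0) / nk
        sig = np.maximum(np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk), 0.05)
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
    return np.log(dens.sum(1)).sum()

def bic(k):
    d = 3 * k - 1                      # free parameters: k means, k scales, k-1 weights
    return d * np.log(n) - 2.0 * fit_loglik(x, k)

scores = {k: bic(k) for k in (1, 2, 3)}
best = min(scores, key=scores.get)     # BIC-selected number of components
```

On this well-separated example the usual penalty works; the talk's subject is how to correct it when the fitted model is singular relative to competing submodels.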

Joint work with Martyn Plummer.

Bio: Mathias Drton is Professor of Statistics at the University of Washington. A native of Germany, he received his PhD in Statistics
from UW in 2004. After a Postdoc in Mathematics at UC Berkeley, he spent seven years at the University of Chicago before returning to Seattle in 2012.

Learning from large-scale datasets with high-cardinality attributes is very common in information retrieval, online services and telemetry. Training on many examples of unbounded dimensionality requires adaptive yet efficient feature representations suitable for distributed computing platforms. We focus on one such representation that has been successfully used by practitioners for many years, yet has received only casual mentions in the literature: inducing features from aggregated conditional statistics. The talk will describe a generalization of this approach derived via gradient descent in functional space, and its connections to several alternative representations: feature hashing, matrix factorization and deep networks. The algorithm also provides a useful illustration of the requirements imposed by industrial engineering needs on large-scale learning methods: convenience of stage-wise training, ease of model monitoring and accuracy debugging, and suitability for conventional distributed computing infrastructure.
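A sketch of the basic idea (the synthetic clickstream and the smoothing constant are illustrative assumptions, not from the talk): replace a high-cardinality attribute by a smoothed conditional label statistic that needs only one aggregation pass of per-value counts and label sums, which is what makes it friendly to map-reduce-style platforms:

```python
import numpy as np

rng = np.random.default_rng(3)
n_ids, n_rows = 10_000, 50_000
site = rng.integers(0, n_ids, n_rows)            # high-cardinality attribute
base = rng.uniform(0.01, 0.3, n_ids)             # hidden per-site click rate
y = (rng.random(n_rows) < base[site]).astype(float)

# one aggregation pass: per-value count and label sum (map-reduce friendly)
cnt = np.bincount(site, minlength=n_ids)
tot = np.bincount(site, weights=y, minlength=n_ids)

prior, alpha = y.mean(), 20.0                    # alpha: smoothing strength (a free choice)

# smoothed conditional mean: rare or unseen values shrink toward the global prior
enc = (tot + alpha * prior) / (cnt + alpha)
feat = enc[site]                                 # the induced feature column
```

Values never seen in training map exactly to the prior, and the induced feature tracks the hidden per-value rates despite each value appearing only a handful of times; stage-wise training then treats this column like any other dense feature.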

Many real-world problems involve negative interactions; we might want
search results to be diverse, sentences in a summary to cover distinct
aspects of the subject, or objects in an image to occupy different
regions of space. However, traditional structured probabilistic
models tend to deal poorly with these kinds of situations; Markov random
fields, for example, become intractable even to approximate.
Determinantal point processes (DPPs), which arise in random matrix
theory and quantum physics, behave in a complementary fashion: while
they cannot encode positive interactions, they define expressive
models of negative correlations that come with surprising and exact
algorithms for many types of inference, including conditioning,
marginalization, and sampling. I'll present our recent work on a
novel factorization and dual representation of DPPs that enables
efficient and exact inference for exponentially-sized structured sets.
We develop an exact inference algorithm for DPPs conditioned on
subset size and derive efficient parameter estimation for DPPs from
several types of observations, as well as approximation algorithms for
large-scale non-linear DPPs. I'll illustrate the advantages of DPPs
on several natural language and computer vision tasks: document
summarization, image search, and multi-person pose estimation in
images.
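To make the "exact sampling" claim concrete, here is a sketch of the standard eigendecomposition-based DPP sampler (the toy kernel over points on a line is an invented example): sample an elementary DPP from the eigenvalues, then repeatedly pick an item and project the basis onto the subspace consistent with that choice.

```python
import numpy as np

# L-ensemble over five items on a line; items 0/1 and 2/3 are near-duplicates,
# so a DPP should rarely select both members of a pair.
x = np.array([0.0, 0.05, 1.0, 1.05, 2.0])
L = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.02)

def sample_dpp(L, rng):
    # exact DPP sampling via the eigendecomposition of the L-ensemble kernel
    lam, V = np.linalg.eigh(L)
    V = V[:, rng.random(len(lam)) < lam / (lam + 1.0)]   # pick an elementary DPP
    Y = []
    while V.shape[1] > 0:
        p = (V ** 2).sum(axis=1) / V.shape[1]            # item selection probabilities
        i = int(rng.choice(len(p), p=p))
        Y.append(i)
        # condition on i: remove one basis vector and zero out coordinate i
        j = int(np.argmax(np.abs(V[i])))
        Vj = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        if V.shape[1]:
            V = V - np.outer(Vj, V[i] / Vj[i])
            V, _ = np.linalg.qr(V)                       # re-orthonormalize the basis
    return set(Y)

rng = np.random.default_rng(4)
samples = [sample_dpp(L, rng) for _ in range(2000)]
```

Empirically, the near-duplicate pair (0, 1) co-occurs far less often than independence would predict, which is exactly the negative correlation structure the abstract describes.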

Bio: Ben Taskar received his bachelor's and doctoral degrees in Computer Science from Stanford University. After a postdoc at the University of
California at Berkeley, he joined the faculty at the University of Pennsylvania. In the spring of 2013, he joined the University of Washington Computer Science & Engineering Department. His research interests include machine learning, natural language processing and computer vision. He has been awarded the Sloan Research Fellowship, the NSF CAREER Award, and selected for the Young Investigator Program by the Office of Naval Research and the DARPA Computer Science Study Group. His work on structured prediction has received best paper awards at several conferences.

Multi-Relational Latent Semantic Analysis, Scott Yih (MSR)

We present Multi-Relational Latent Semantic Analysis (MRLSA), which generalizes Latent Semantic Analysis (LSA). MRLSA provides an elegant approach to combining multiple relations between words by constructing a 3-way tensor. Similar to LSA, a low-rank approximation of the tensor is derived using a tensor decomposition. Each word in the vocabulary is thus represented by a vector in the latent semantic space, and each relation is captured by a latent square matrix. The degree to which two words hold a specific relation can then be measured through simple linear algebraic operations. We demonstrate that by integrating multiple relations from both homogeneous and heterogeneous information sources, MRLSA achieves state-of-the-art performance on existing benchmark datasets for two relations, antonymy and is-a.
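A highly simplified sketch of the construction (the four-word vocabulary and two relations are invented, and the paper's Tucker-style decomposition is approximated here by a truncated SVD of the tensor's mode-1 unfolding): each word gets a latent vector, each relation a latent square matrix, and relation strength is scored by a bilinear form.

```python
import numpy as np

vocab = ["hot", "warm", "cold", "chilly"]
idx = {w: i for i, w in enumerate(vocab)}
n, rels = len(vocab), ["synonym", "antonym"]

# 3-way tensor: word x word x relation
T = np.zeros((n, n, 2))
T[:, :, 0] = np.eye(n)                        # every word is its own synonym
for a, b in [("hot", "warm"), ("cold", "chilly")]:
    T[idx[a], idx[b], 0] = T[idx[b], idx[a], 0] = 1
for a, b in [("hot", "cold"), ("hot", "chilly"), ("warm", "cold"), ("warm", "chilly")]:
    T[idx[a], idx[b], 1] = T[idx[b], idx[a], 1] = 1

# word vectors: truncated SVD of the mode-1 unfolding (words x (words * relations))
r = 2
U, s, Vt = np.linalg.svd(T.reshape(n, -1), full_matrices=False)
W = U[:, :r]                                  # one latent vector per word

# one latent square matrix per relation, fit by least squares: T_k ~ W M_k W^T
M = {k: W.T @ T[:, :, ki] @ W for ki, k in enumerate(rels)}

def score(w1, rel, w2):
    # degree to which w1 and w2 hold the relation, via simple linear algebra
    return float(W[idx[w1]] @ M[rel] @ W[idx[w2]])
```

Even at rank 2, the scores separate the relations: "hot" is judged an antonym of "cold" but not of "warm", and a synonym of "warm" but not of "cold".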

The field of brain-computer interfacing has seen rapid advances in recent years, with applications ranging from cochlear implants for the deaf to brain-controlled prosthetic arms for the paralyzed. This talk will provide a brief overview of the various types of brain-computer interfaces (BCIs) and the techniques they use for mapping brain signals to control outputs. I will then highlight some opportunities as well as challenges for machine learning in helping facilitate the transition of BCIs from the laboratory to the real world.

Bio: Rajesh Rao is an Associate Professor in the Computer Science & Engineering department at the University of Washington. He is the recipient of an NSF CAREER award, an ONR Young Investigator Award, a Sloan Faculty Fellowship, and a Packard Fellowship for Science and Engineering. He is the author of the new textbook Brain-Computer Interfacing: An Introduction (Cambridge University Press, 2013) and the co-editor of two volumes, Probabilistic Models of the Brain (MIT Press, 2002) and Bayesian Brain (MIT Press, 2007). With Adrienne Fairhall, he offered the first MOOC on computational neuroscience (about 50K registered students). His research spans the areas of computational neuroscience, AI, and brain-computer interfacing. His other interests include the 4000-year-old undeciphered Indus script (a topic on which he has given a TED talk) and Indian miniature painting.