Abstract

Despite progress in visual perception tasks such as image classification and
detection, computers still struggle to understand the interdependency of
objects in the scene as a whole, e.g., relations between objects or their
attributes. Existing methods often ignore global context cues capturing the
interactions among different object instances, and can only recognize a
handful of types by exhaustively training individual detectors for all
possible relationships. To capture such global interdependency, we propose a
deep Variation-structured Reinforcement Learning (VRL) framework to
sequentially discover object relationships and attributes in the whole image.
First, a directed semantic action graph is built using language priors to
provide a rich and compact representation of semantic correlations between
object categories, predicates, and attributes. Next, we use a
variation-structured traversal over the action graph to construct a small,
adaptive action set for each step based on the current state and historical
actions. In particular, an ambiguity-aware object mining scheme is used to
resolve semantic ambiguity among object categories that the object detector
fails to distinguish. We then make sequential predictions using a deep RL
framework, incorporating global context cues and semantic embeddings of
previously extracted phrases in the state vector. Our experiments on the
Visual Relationship Detection (VRD) dataset and the large-scale Visual Genome
dataset validate the superiority of VRL, which can achieve significantly
better detection results on datasets involving thousands of relationship and
attribute types.
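
As a rough illustration of the action-graph idea, here is a minimal Python
sketch (the triples, categories, and helper names are invented for
illustration, not the authors' implementation). It builds a directed graph
from language-prior triples and reads off a small, state-dependent action set:

    # Minimal sketch of a directed semantic action graph (names hypothetical).
    # Nodes are object categories; edges carry predicates seen between them in
    # language priors, so each state needs only a small, adaptive action set.
    from collections import defaultdict

    triples = [("person", "ride", "horse"),
               ("person", "wear", "hat"),
               ("horse", "on", "grass")]

    graph = defaultdict(set)
    for subj, pred, obj in triples:
        graph[subj].add((pred, obj))

    def action_set(current_category):
        """Variation-structured action set: only predicates/objects that
        language priors connect to the current category."""
        return sorted(graph[current_category])

    print(action_set("person"))  # [('ride', 'horse'), ('wear', 'hat')]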

Abstract

In this talk, I will discuss the problem of recovering an incomplete m\times n
matrix of rank r whose columns arrive online over time. This is known as the
problem of life-long matrix completion, and it is widely applied in
recommender systems, computer vision, system identification, etc. The
challenge is to design provable algorithms that tolerate a large amount of
noise with small sample complexity. I will give algorithms achieving strong
guarantees under two realistic noise models. Under bounded deterministic
noise, an adversary can add arbitrary bounded yet unstructured noise to each
column. For this problem, I will present an algorithm that returns a matrix
with small error, with sample complexity almost as small as the best prior
results in the noiseless case. For sparse random noise, where the corrupted
columns are sparse and drawn randomly, I will show an algorithm that exactly
recovers a \mu_0-incoherent matrix with probability at least 1-\delta, with
sample complexity as small as O(\mu_0 r n \log(r/\delta)). This result
advances the state of the art and matches the lower bound in the worst case.
I will also discuss the scenario where the hidden matrix lies on a mixture of
subspaces and show that the sample complexity can be even smaller. The
proposed algorithms perform well experimentally on both synthetic and
real-world datasets.
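
For intuition about the online setting, here is a hedged sketch (illustrative
only, not the speaker's algorithm): each arriving column is observed on a few
random entries, completed by least squares against the current subspace
estimate, and flagged for basis expansion when the residual is large:

    # Hedged sketch of the online (life-long) setting; tolerance and sampling
    # rate are illustrative.
    import numpy as np

    def complete_column(U, obs_idx, obs_vals, tol=1e-3):
        coeffs, *_ = np.linalg.lstsq(U[obs_idx], obs_vals, rcond=None)
        residual = np.linalg.norm(U[obs_idx] @ coeffs - obs_vals)
        scale = max(np.linalg.norm(obs_vals), 1e-12)
        return U @ coeffs, residual > tol * scale   # filled column, expand?

    m, r = 50, 3
    U = np.linalg.qr(np.random.randn(m, r))[0]      # current subspace estimate
    col = U @ np.random.randn(r)                    # new column in the subspace
    obs = np.random.choice(m, size=4 * r, replace=False)  # few sampled entries
    filled, needs_expand = complete_column(U, obs, col[obs])
    print(np.linalg.norm(filled - col), needs_expand)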

Abstract

Data driven approaches to modeling time-series are important in a variety of
applications from market prediction in economics to the simulation of robotic
systems. However, traditional supervised machine learning techniques designed
for i.i.d. data often perform poorly on these sequential problems. We propose
that time series and sequential prediction, whether for forecasting,
filtering, or reinforcement learning, can be effectively achieved by directly
training recurrent prediction procedures rather than building generative
probabilistic models.

To this end, we introduce a new training algorithm for learned time-series
models, Data as Demonstrator (DaD), that theoretically and empirically
improves multi-step prediction performance on model classes such as recurrent
neural networks, kernel regressors, and random forests. Additionally,
experimental results indicate that DaD can accelerate model-based
reinforcement learning for control tasks. We next show that latent-state
time-series models, where a sufficient state parametrization may be unknown,
can be learned effectively in a supervised way. Our approach, Predictive State
Inference Machines (PSIMs), directly optimizes – through a DaD-style training
procedure – the inference performance by identifying the recurrent hidden
state as a predictive belief over statistics of future observations.
Fundamental to our learning frameworks is that the prediction of observable
quantities is a lingua franca for building AI systems.
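
To make the DaD idea concrete, here is a minimal sketch under simplifying
assumptions (a single trajectory and a ridge regressor as the model class);
the actual algorithm differs in its details:

    # Hedged sketch of the DaD idea: after fitting a one-step model, roll it
    # out along a training trajectory and add corrective pairs (the model's
    # own prediction -> the true next observation), then refit.
    import numpy as np
    from sklearn.linear_model import Ridge

    traj = np.cumsum(np.random.randn(200, 2), axis=0)  # one training trajectory
    X, Y = traj[:-1], traj[1:]
    model = Ridge().fit(X, Y)

    for _ in range(5):                                 # DaD-style iterations
        extra_X, extra_Y = [], []
        state = traj[0]
        for t in range(len(traj) - 1):
            state = model.predict(state[None])[0]      # roll out the model
            extra_X.append(state)                      # where the model *is*
            extra_Y.append(traj[t + 1])                # where the data says to go
        model.fit(np.vstack([X, extra_X]), np.vstack([Y, extra_Y]))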

Abstract

Social network services generally allow users to post, forward, share, or
“like” a piece of information; order or comment on a product or hotel; or
“check in” at a place of interest. All of these behaviors can be organized
into temporal sequences that record which users act and when; such sequences
are called temporal cascades. A temporal cascade thus contains users’ actions
on a specific piece of information, product, hotel, place, etc. Since such
actions are publicly visible, or are deliberately pushed to related users or
communities by the systems, users in a cascade influence each other, much
like a contagion of behaviors. Building on this contagion phenomenon, people
study viral marketing, influence maximization, and opinion formation. These
studies and applications need to know the causality of who influences whom,
as well as accurate values of those influences. A fundamental problem in
these applications is therefore that users' influences are difficult to
quantify. We propose a model that equips every user with a latent influence
vector and a latent susceptibility vector. Such low-dimensional, distributed
representations naturally capture the dependencies among interpersonal
influences and reduce the model complexity compared to previous models. We
conduct extensive experiments on real Microblog data, showing that our model
with distributed representations achieves better performance.
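
To illustrate the parametrization, a minimal sketch (the sigmoid scoring form
is an assumption for illustration, not necessarily the model's exact
likelihood):

    # Hedged sketch: each user u carries a latent influence vector I[u] and a
    # susceptibility vector S[u]; the chance that an earlier actor u triggers
    # a later actor v is scored by their inner product (illustrative form).
    import numpy as np

    rng = np.random.default_rng(0)
    n_users, dim = 100, 8
    I = rng.normal(scale=0.1, size=(n_users, dim))   # influence vectors
    S = rng.normal(scale=0.1, size=(n_users, dim))   # susceptibility vectors

    def activation_prob(u, v):
        """Probability that user u's action influences user v."""
        return 1.0 / (1.0 + np.exp(-I[u] @ S[v]))

    print(activation_prob(3, 7))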

Abstract

We focus mainly on the offline 1-dimensional multiple changepoint detection
problem with Gaussian errors where we want to detect locations where the
underlying mean changes. In this problem, we prove that any procedure with a
fast enough \ell_2 error rate, in terms of its estimation of the underlying
piecewise constant mean vector, automatically has an (approximate) changepoint
screening property. Specifically, each true changepoint in the underlying mean
vector has an estimated changepoint nearby. We also show, again assuming only
knowledge of the \ell_2 error rate, that a simple post-processing step can
eliminate spurious estimated changepoints, and thus deliver asymptotic bounds
on the Hausdorff distance between the true and estimated set of changepoints.
Specifically, in addition to the screening property described previously, we
are assured that each estimated changepoint has a true changepoint nearby.
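
As a small companion to the screening and post-processing discussion, the
Hausdorff distance between the true and estimated changepoint sets can be
computed directly:

    # Bidirectional Hausdorff distance between two changepoint sets.
    import numpy as np

    def hausdorff(true_cps, est_cps):
        true_cps, est_cps = np.asarray(true_cps), np.asarray(est_cps)
        d = np.abs(true_cps[:, None] - est_cps[None, :])
        screening = d.min(axis=1).max()  # every true cp has an estimate nearby
        spurious = d.min(axis=0).max()   # every estimate has a true cp nearby
        return max(screening, spurious)

    print(hausdorff([100, 250], [98, 255, 400]))  # 150: spurious cp at 400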

As a special case, we show how an improved \ell_2 error rate for the
1-dimensional fused lasso, for a fixed number of changepoints, can yield
bounds on the Hausdorff distance that are competitive with other changepoint
methods (e.g., binary segmentation, wild binary segmentation, SMUCE). If time
remains, I will discuss our simulations and
extensions to other variants of offline changepoint problems (changepoint
detection on graphs, trend filtering).

Paper: https://arxiv.org/abs/1606.06746. This is joint work with James Sharpnack (UC Davis), Alessandro Rinaldo (CMU) and Ryan Tibshirani (CMU).

Abstract

Halloween is a spooky time of year, and forensic science can be a spooky area
of application. As a special Halloween treat, this week's machine learning
lunch will feature statistical methods for analyzing two types of forensic
evidence: bullet cartridges and shoe prints.

Bullet Cartridges: When a gun is fired, it leaves marks on cartridges that are
believed to be unique to the gun. This means that cartridges collected from
crime scenes can be compared to cartridge images stored in a database, to
determine if they were fired from the same gun. In this talk, we will describe
fully automated methods for comparing 2D cartridge images. We pre-process and
register the images, and use correlation to measure the similarity between
them. We quantify the uncertainty in making any statement of a match using a
hypothesis test.
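
A minimal sketch of the similarity step (illustrative; the full pipeline also
pre-processes and registers the images first):

    # Normalized cross-correlation between two registered cartridge images.
    import numpy as np

    def ncc(a, b):
        """Normalized cross-correlation between two registered images."""
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        return float((a * b).mean())

    img1 = np.random.rand(64, 64)
    img2 = img1 + 0.1 * np.random.rand(64, 64)   # a noisy "match"
    print(ncc(img1, img2))                       # near 1 for matching images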

Shoe prints: The President’s Council of Advisors on Science and Technology
recently released a report which criticized the science behind latent footwear
impression analysis based on accidentals (e.g. cuts, holes, and debris on a
shoe’s sole which accumulate on a shoe through wear). In particular, the
report acknowledged that the assumptions of existing methods are inappropriate
and need to be empirically evaluated. Acting on these recommendations, we
propose a new nonparametric Bayesian hierarchical model for the distribution
of accidentals on shoe surfaces. This model features nonparametric mixtures of
Bernstein polynomials hierarchically modelled using compound random measures.
To train the model, we are using a database of 386 shoes collected and marked
by the Israeli Police Department. Wearing of costumes to the talk is
encouraged. Best costume gets a free lunch. Okay everyone gets a free lunch.

Abstract

In many settings, predictions must be made over structured output spaces.
Examples include both discrete structures such as sequences and clusterings,
as well as continuous ones such as trajectories. The conventional machine
learning approach to such "structured prediction" problems is to learn over a
holistically pre-specified structured model class (e.g., via conditional
random fields or structural SVMs). In this talk, I will discuss recent work
along an alternative direction of using learning reductions, or "learning to
optimize".

In learning to optimize, the goal is to reduce the structured prediction
problem into a sequence of standard prediction problems that can be solved via
conventional supervised learning. Such an approach is attractive because it
can easily leverage powerful function classes such as random forests and deep
neural nets. The main challenge lies in identifying a good learning reduction
that is both principled and practical. I will discuss two projects in detail:
contextual submodular optimization, and smooth online sequence prediction.

Abstract

These days, computer vision and machine learning research is strongly shaped
by the availability of datasets. For example, we have seen great achievements
in image classification and object detection by exploiting publicly available
datasets such as ImageNet and PASCAL VOC. However, comparable datasets for
understanding human interaction and non-verbal behavior are extremely rare.
One of the major obstacles in building such a dataset is that measuring the
non-verbal cues of multiple interacting people is itself challenging.

The Panoptic Studio is a system composed of more than 500 diverse sensors,
specifically designed to measure the subtle non-verbal signals of multiple
interacting people. The system takes, as input, 480 synchronized video streams
of
multiple people engaged in social activities, and produces, as output, the
time-varying 3D structure of anatomical body landmarks. Using this system, we
have collected and processed various interesting scenes where multiple people
are naturally interacting, and recently we have released the CMU Panoptic
Studio Dataset. In this dataset, we publicly share 500+ video inputs, fully
automatically reconstructed 3D body pose, and calibration data for all the
sequences, with a toolbox for a quick start. In this talk, I will introduce
our system, reconstruction algorithms, and the dataset with various potential
applications.

Abstract

Empirical risk minimization is a key tool in modern machine learning which is
well understood when the underlying problem is well-behaved. Many emerging
problems, however, are not so well-behaved and contain complicated nonconvex,
nonsmooth, and unbounded structures. Examples include graphical models, neural
networks, dictionary learning, and matrix factorization. For such problems,
existing techniques are unsuitable and new principles are necessary to solve
them. Using directed graphical models as a motivating example, this talk will
explore the problem of regularized empirical risk minimization with nonconvex
and nonsmooth parameter constraints as well as unbounded loss functions. We
will focus on both computational and theoretical aspects of this problem in a
high-dimensional setting, motivated by applications to computational biology
and human genetics. The main result is an efficient algorithm for
approximating directed graphs with millions of free parameters that is several
orders of magnitude faster than existing approaches and comes with strong
statistical guarantees. We will also discuss progress towards a general
nonasymptotic framework for understanding the statistical behaviour of
estimators defined via nonconvex empirical risk minimization.

Bio: Bryon recently joined CMU as a postdoc in the SAILING Lab after receiving
his PhD in Statistics from UCLA. His research interests include graphical
models, multi-task learning, and high-dimensional statistics with applications
to computational biology, genomics, and causal inference.

Abstract

We present an algorithm for estimating bounds on causal effects from
observational data which combines graphical model search with simple linear
regression. We assume that the underlying system can be represented by a
linear structural equation model with no feedback, and we allow for the
possibility of latent variables. Under assumptions standard in the causal
search literature, we use conditional independence constraints or greedy
search to learn an equivalence class of ancestral graph Markov models. Then,
for each model in the equivalence class, we perform the appropriate regression
(using causal structure information to guide covariate adjustment) to estimate
a set of possible causal effects. Our approach is based on the "IDA" procedure
of Maathuis et al. (2009), which assumes that all relevant variables have been
measured (i.e., no unmeasured confounders). We generalize their work by
relaxing this assumption, which is often violated in applied contexts. We
validate the performance of our algorithm on simulated data and demonstrate
improved precision over IDA when latent variables are present.
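
A hedged sketch of the per-model regression step (toy data; the enumeration
of candidate adjustment sets from the learned equivalence class is assumed
given, and the exact procedure differs):

    # For each candidate parent set of X consistent with the equivalence
    # class, regress Y on X plus that adjustment set, collecting the set of
    # possible causal effects; bounds are its min and max.
    import numpy as np

    def effects_over_class(X, Y, covariates, parent_sets):
        out = []
        for ps in parent_sets:                     # one model per member
            Z = np.column_stack([np.ones(len(X)), X]
                                + [covariates[:, j] for j in ps])
            beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
            out.append(beta[1])                    # coefficient on X
        return out

    rng = np.random.default_rng(0)
    C = rng.normal(size=(500, 3))
    X = C[:, 0] + rng.normal(size=500)
    Y = 2.0 * X + C[:, 1] + rng.normal(size=500)
    print(effects_over_class(X, Y, C, parent_sets=[[], [0], [1], [0, 1]]))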

Bio: Daniel Malinsky is a PhD student in Logic, Computation, and Methodology
at Carnegie Mellon University. He works on causal inference with graphical
models, especially structure learning from time series data and applications
in policy and health.

Abstract

We explore architectures for general pixel-level prediction problems, from
low-level edge detection to mid-level surface normal estimation to high-level
semantic segmentation. Convolutional predictors, such as the
fully-convolutional network (FCN), have achieved remarkable success by
exploiting the spatial redundancy of neighboring pixels through convolutional
processing. Though computationally efficient, we point out that such
approaches are not statistically efficient during learning precisely because
spatial redundancy limits the information learned from neighboring pixels. We
demonstrate that (1) stratified sampling allows us to add diversity during
batch updates and (2) sampled multi-scale features allow us to explore more
nonlinear predictors (implemented through multiple fully-connected layers
followed by ReLU) that improve overall accuracy. We demonstrate that our
single architecture produces state-of-the-art results for semantic
segmentation on PASCAL-Context, surface normal estimation on the NYUD dataset, and
edge detection on BSDS without contextual post-processing.
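
A minimal sketch of the stratified sampling idea (shapes and class counts are
invented for illustration):

    # Stratified pixel sampling for a batch update: rather than all (highly
    # redundant) neighboring pixels, draw a few pixels per semantic class so
    # every class contributes to the gradient.
    import numpy as np

    def stratified_pixels(label_map, per_class=16,
                          rng=np.random.default_rng(0)):
        coords = []
        for c in np.unique(label_map):
            ys, xs = np.nonzero(label_map == c)
            pick = rng.choice(len(ys), size=min(per_class, len(ys)),
                              replace=False)
            coords.extend(zip(ys[pick], xs[pick]))
        return coords

    labels = np.random.randint(0, 5, size=(128, 128))  # toy segmentation map
    print(len(stratified_pixels(labels)))              # 5 classes * 16 pixels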

Abstract

Given a stream of multimodal sensory data, an autonomous robot must
continuously refine its understanding of itself and its environment as it
makes decisions on how to act to achieve a goal. These are difficult problems
that roboticists have attacked using classical tools from mechanics and
controls and, more recently, machine learning. However, the classical methods
and machine learning algorithms are often seen to be at odds, and researchers
continue to debate the merits of engineering vs. learning.

A recurring theme in this talk will be that prior knowledge and domain
insights can make learning and inference easier. I will discuss several
fundamental robotics problems including continuous-time motion planning,
localization, and mapping from a unified probabilistic inference perspective.
I will show how models from statistical machine learning like Gaussian
Processes can be tightly integrated with insights from engineering expressed
as differential equations to solve these problems efficiently. Finally, I will
demonstrate the effectiveness of these algorithms on several real-world
robotics platforms.

Bio: Byron Boots is an assistant professor in the School of Interactive
Computing and the Institute for Robotics and Intelligent Machines at Georgia
Tech. Prior to joining Georgia Tech, Byron was a postdoctoral researcher
working with Dieter Fox in the Robotics and State Estimation Lab at the
University of Washington. He received his Ph.D. in Machine Learning from
Carnegie Mellon in 2012 where he was advised by Geoff Gordon. Byron’s work on
learning models of dynamical systems received the 2010 Best Paper award at
ICML. His current research focuses on developing theory and systems that
integrate perception, learning, and decision making.

Abstract

When human annotators are given a choice about what to label in an image, they
apply their own subjective judgments on what to ignore and what to mention.
We refer to these noisy ``human-centric'' annotations as exhibiting human
reporting bias. Examples of such annotations include image tags and keywords
found on photo sharing sites, or in datasets containing image captions. In
this paper, we use these noisy annotations for learning visually correct image
classifiers. Such annotations do not use consistent vocabulary, and miss a
significant amount of the information present in an image; however, we
demonstrate that the noise in these annotations exhibits structure and can be
modeled. We propose an algorithm to decouple the human reporting bias from the
correct visually grounded labels. Our results are highly interpretable for
reporting ``what's in the image'' versus ``what's worth saying.'' We
demonstrate the algorithm's efficacy along a variety of metrics and datasets,
including MS COCO and Yahoo Flickr 100M. We show significant improvements over
traditional algorithms for both image classification and image captioning,
doubling the performance of existing methods in some cases.
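
A hedged sketch of the factorization at the heart of the decoupling (the
paper's exact parameterization differs; this only shows how a mention
probability can split into a visual term and a reporting term):

    # "Mentioned" is modeled as visually-present AND worth-saying, so the
    # mention probability factorizes into a visual term and a reporting term.
    def p_mentioned(p_present, p_said_if_present):
        """P(label mentioned) = P(visually present) * P(mentioned | present)."""
        return p_present * p_said_if_present

    # A wall is visually present almost always but rarely "worth saying":
    print(p_mentioned(p_present=0.95, p_said_if_present=0.05))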

Abstract

We propose a Laplace approximation that creates a stochastic unit from any
smooth monotonic activation function, using only Gaussian noise. This paper
investigates the application of this stochastic approximation in training a
family of Restricted Boltzmann Machines (RBM) that are closely linked to
Bregman divergences. This family, which we call the exponential family RBM
(Exp-RBM), is a subset of the exponential family Harmoniums that expresses
family members through a choice of smooth monotonic non-linearity for each
neuron. Using contrastive divergence along with our Gaussian approximation, we
show that Exp-RBM can learn useful representations using novel stochastic
units.
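
A hedged sketch of one such stochastic unit, assuming the Laplace
approximation yields a Gaussian with mean f(eta) and variance f'(eta) (shown
for the softplus nonlinearity; treat the exact form as an assumption):

    # Stochastic unit from a smooth monotonic activation via Gaussian noise.
    import numpy as np

    def stochastic_unit(eta, rng=np.random.default_rng(0)):
        f = np.logaddexp(0.0, eta)             # softplus activation f(eta)
        f_prime = 1.0 / (1.0 + np.exp(-eta))   # its derivative, sigmoid(eta)
        return f + np.sqrt(f_prime) * rng.standard_normal(np.shape(eta))

    print(stochastic_unit(np.array([-2.0, 0.0, 2.0])))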

Abstract

How can we enable computers to process natural language in multilingual
settings (think airports, social media, etc.)? Thanks to decades of natural
language processing (NLP) research, we have been able to develop language
analyzers in many languages, including languages with little or no training
data. I will review some of the key developments in multilingual NLP research,
with an emphasis on syntax.

Despite this progress, the mainstream approach to developing multilingual NLP
models (for high-resource languages) has been to independently train one model
for each language, which is unsatisfactory for practical as well as
theoretical reasons. To that end, I will describe a general framework for
training language-universal models, and how to instantiate it for dependency
parsing.

Abstract

To address several important issues in latent variable models (LVMs), such
as effectively capturing long-tail patterns, shrinking model size without
sacrificing modeling power, and improving interpretability, several studies
have been devoted to ``diversifying'' LVMs, i.e., encouraging the
components in LVMs to be diverse. Most existing studies are performed in a
frequentist-style regularization framework, where the components are learned
via point estimation. In this work, we investigate how to ``diversify'' LVMs
in another learning paradigm -- Bayesian learning -- which has advantages
complementary to point estimation, such as alleviating overfitting via model
averaging and quantifying uncertainty. We propose two approaches that have
complementary advantages. One is to define diversity-promoting mutual angular
priors, which assign larger density to components with larger mutual angles
and are built on Bayesian networks and the von Mises-Fisher distribution, and
use these priors to shape the posterior via Bayes' rule. We develop two efficient
approximate posterior inference algorithms based on variational inference and
Markov chain Monte Carlo sampling. The other approach is to impose
diversity-inducing regularization directly over the post-data distribution of
components. These two methods are applied to the Bayesian mixture of experts
model to encourage the ``experts'' to be diverse and experimental results
demonstrate the effectiveness and efficiency of our methods.

Bio: Pengtao Xie is a PhD student in the Machine Learning Department at
Carnegie Mellon University. His primary research focus is latent variable
models and distributed machine learning systems. He received a M.E. from
Tsinghua University in 2013 and a B.E. from Sichuan University in 2010. He is
the recipient of Siebel Scholarship, Goldman Sachs Global Leader Scholarship
and National Scholarship of China.

Abstract

We propose a novel high dimensional nonparametric model named ATLAS
which is a generalization of the sparse additive model. The ATLAS
model assumes the high dimensional regression function can be locally
approximated by a sparse additive function, while such an
approximation may change from the global perspective. We aim to
estimate the high dimensional function using a novel kernel-sieve
hybrid regression estimator that combines local kernel regression with
a B-spline basis approximation. We establish the estimation rate of the
true function in the supremum norm. We also propose two types of
confidence bands for the true function. Both procedures proceed in two
steps: (1) a
novel bias correction method is applied to remove the shrinkage
introduced by the model selection penalty and (2) quantiles of the
normalized de-biased estimator are approximated by quantiles of the
limiting distribution or a Gaussian multiplier bootstrap. We further
show that the covering probability of the bootstrap confidence bands
converges to the nominal one at a polynomial rate.

Abstract

In many animation projects, the animation artist typically spends significant
time animating the face. This process involves many labor-intensive tasks that
offer relatively little potential for creative expression. One particularly
tedious task is speech animation: animating the face to match spoken audio.
Indeed, the often prohibitive cost of speech animation has limited the types
of animations that are feasible, including localization to different
languages.

In this talk, I will show how to view speech animation through the lens of
data-driven sequence prediction. In contrast to previous sequence prediction
settings, visual speech animation is an instance of contextual spatiotemporal
sequence prediction, where the output is continuous and high-dimensional
(e.g., a configuration of the lower face), and also depends on an input
context (e.g., audio or phonetic input).

I will present a decision tree framework for learning to generate
context-dependent spatiotemporal sequences given training data. This approach
enjoys several attractive properties, including ease of training, fast
performance at test time, and the ability to robustly tolerate corrupted
training data using a novel latent variable approach. I will showcase this
approach in a case study on speech animation, where our approach outperforms
several competitive baselines in both quantitative and qualitative
evaluations, and also demonstrates strong robustness to corrupted training
data.
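
To make the setup concrete, a minimal sketch with invented shapes (a sliding
window of phonetic features in, lower-face parameters out; the actual
framework and its latent-variable handling are richer):

    # Decision tree regressor mapping phonetic context windows to a
    # continuous lower-face configuration.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    T, window, face_dim = 1000, 11, 30
    phonemes = np.random.randn(T, 8)              # per-frame phonetic features
    X = np.stack([phonemes[t:t + window].ravel()  # context window around t
                  for t in range(T - window)])
    Y = np.random.randn(T - window, face_dim)     # lower-face parameters
    model = DecisionTreeRegressor(max_depth=12).fit(X, Y)
    print(model.predict(X[:1]).shape)             # (1, 30)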

This is joint work with Taehwan Kim, Sarah Taylor, Barry-John Theobald, and Iain Matthews.

Abstract

Intelligent agents acting in the real world need advanced vision capabilities
to perceive, learn from, reason about and interact with their environment. In
this talk, I will explore the role that humans play in the design and
deployment of computer vision systems. Large-scale manually labeled datasets
have proven instrumental for scaling up visual recognition, but they come at a
substantial human cost. I will first briefly talk about strategies for making
optimal use of human annotation effort for computer vision progress. However,
no dataset can foresee all the visual scenarios that a real-world system might
encounter. I will argue that seamlessly integrating in human expertise at
runtime will become increasingly important for open-world computer vision. I
will introduce both mathematical frameworks for human-machine collaboration as
well as deep reinforcement learning models that open up new avenues for
human-in-the-loop exploration.

Bio: Olga Russakovsky recently completed her PhD in computer science at
Stanford and is now a postdoctoral fellow at Carnegie Mellon University. Her
research is in computer vision, closely integrated with machine learning and
human-computer interaction. Her work was featured in the New York Times and
MIT Technology Review. She served as an Area Chair for WACV’16, led the
ImageNet Large Scale Visual Recognition Challenge effort for two years, and
organized multiple workshops and tutorials on large-scale recognition at
premier computer vision conferences ICCV’13, ECCV’14, CVPR’15, ICCV’15 and
CVPR’16. In addition, she founded and directs the Stanford AI Laboratory’s
outreach camp SAILORS (featured in Wired and published in SIGCSE’16) designed
to expose high school students in underrepresented populations to the field of
AI.

Abstract

Infection and diffusion processes over networks arise in many domains. These
introduce many challenging prediction tasks, such as influence estimation,
trend prediction, and epidemic source localization. The standard approach to
such problems is generative: assume an underlying infection model, learn its
parameters, and infer the required output. In order to learn efficiently, the
chosen infection models are often simple, and learning is focused on inferring
the parameters of the model rather than on optimizing prediction accuracy.
Here we argue that for prediction tasks, a discriminative approach is more
adequate.

We introduce DIMPLE, a novel discriminative learning framework for training
classifiers based on dynamic infection models. We show how highly non-linear
predictors based on infection models can be "linearized" by considering a
larger class of prediction functions. Efficient learning over this class is
performed by constructing "infection kernels" based on the outputs of
infection models, and can be plugged into any kernel-supporting framework.
DIMPLE can be applied to virtually any infection-related prediction task and
any infection model for which the desired output can be calculated or
simulated. For influence estimation in well-known infection models, we show
that the kernel can either be computed in closed form, or reduces to
estimating co-influence of seed pairs.
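
A hedged sketch of the kernel construction (the infection model here is a toy
stand-in; any simulable model could be plugged in):

    # "Infection kernel" idea: run or simulate an infection model from each
    # seed set, and define the kernel between two seed sets as the inner
    # product of their per-node infection-frequency vectors.
    import numpy as np

    def simulate_infection(seed_set, n_nodes, n_runs=200,
                           rng=np.random.default_rng(0)):
        """Toy stand-in for an infection model: per-node infection frequency."""
        counts = np.zeros(n_nodes)
        for _ in range(n_runs):
            infected = set(seed_set)
            for s in seed_set:                    # one-hop toy cascade
                nbrs = rng.choice(n_nodes, size=3, replace=False)
                infected.update(nbrs[rng.random(3) < 0.3])
            counts[list(infected)] += 1
        return counts / n_runs

    def infection_kernel(seeds_a, seeds_b, n_nodes=50):
        return float(simulate_infection(seeds_a, n_nodes)
                     @ simulate_infection(seeds_b, n_nodes))

    print(infection_kernel([0, 1], [1, 2]))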

We apply DIMPLE to the tasks of influence estimation on synthetic and real
data from Digg, and to predicting customer network value in Polly, a viral
phone-based development-related service deployed in low-literate communities.
Our results show that DIMPLE outperforms strong baselines.

Abstract

Computer vision is currently undergoing a period of rapid progress,
brought in part through the integration of machine-learning techniques
with big training datasets. This talk will attempt to examine some of
the modeling insights behind this progress, as well as open challenges
that remain. A well-known but under-appreciated observation is that
visual phenomena follow a long-tail distribution: a few modes of
appearance are common, while many rare modes are in the tail. As an
example, people commonly stand or walk, but can contort their body
into many more poses. I will argue that the "tail" remains the open
challenge because training data is limited (even in the big-data
setting). I will describe some promising methods that address this
difficulty by synthesizing new data examples, either explicitly with a
computer graphics pipeline or implicitly through compositional
representations. The latter view suggests novel variants of deep
architectures that reason about compositional variables. I will
conclude by demonstrating such architectures on various visual
recognition tasks, including perceptual grouping, object recognition,
and people tracking.

Abstract

Deep learning (DL) has achieved notable successes in many machine learning
tasks. A number of software frameworks have been developed to expedite the
process of designing and training deep neural networks (DNNs), such as Caffe,
Torch, and Theano. Currently these frameworks can harness multiple GPUs on the
same machine, but are unable to use GPUs that are distributed across multiple
machines; as even average-sized deep networks can take days to train on a
single GPU with 100s of GBs to TBs of data, distributed GPUs present a prime
opportunity for scaling up DL. However, the limited inter-machine bandwidth
available on commodity Ethernet networks presents a bottleneck to distributed
GPU training, and prevents its trivial realization. To investigate how to
adapt existing software frameworks to efficiently support distributed GPUs, we
propose Poseidon, a scalable system architecture for distributed inter-machine
communication in existing DL frameworks. We integrate Poseidon into the Caffe
framework and evaluate its performance at training convolutional neural
networks for object recognition in images. Poseidon features three key
contributions that accelerate DNN training on clusters: (i) a three-level
hybrid architecture that allows Poseidon to support both CPU-only and
GPU-equipped clusters, (ii) a distributed wait-free backpropagation (DWBP)
algorithm to improve GPU utilization and to balance communication, and (iii) a
structure-aware communication protocol (SACP) to minimize communication
overheads. We empirically show that Poseidon converges to the same objectives
as a single machine, and achieves state-of-the-art training speedup across
multiple models and well-established datasets, using a commodity GPU cluster
of 8 nodes (e.g. 4.5x speedup on AlexNet, 4x on GoogLeNet, 4x on CIFAR-10). On
the much larger ImageNet 22K dataset, Poseidon with 8 nodes achieves better
speedup and accuracy competitive with recent CPU-based distributed deep
learning systems such as Adam and the system of Le et al., which use 10s to
1000s of nodes.

Abstract

As both a computer scientist and a musician, I design intelligent systems to
understand and extend human musical expression. To understand means to model
the musical expression conveyed through acoustic, gestural, and emotional
signals. To extend means to use this understanding to create expressive,
interactive, and autonomous agents, serving both amateur and professional
musicians. In particular, I create interactive artificial performers that are
able to perform expressively in concert with humans by learning musicianship
from rehearsal experience. This study unifies machine learning and knowledge
representation of music structure and performance skills in an HCI framework.
In this talk, I will go over the learning techniques and present robot
musicians capable of playing collaboratively and reacting to musical nuance
with facial and body gestures.

Abstract

As data acquisition methods improve and the volume of data for analysis
increases in brain science, computer scientists and neuroscientists have begun to
work more closely together to extract fundamental principles that govern
the organization and function of neural circuits. To facilitate these
interactions, BrainHub will sponsor a hackathon for CS and ML graduate
students using experimental data acquired by Carnegie Mellon
neuroscientists. The basic goal will be to use the data to find something
interesting, and the prize will involve graduate student support for the
following academic year. Please come to this open discussion to help
organize the event -- what datasets are most useful, how much time should
be allotted, how much faculty involvement would be optimal, as well as
other important details that will help you and your peers get the most out
of this competition.

Abstract

Learning to reason and understand the world’s knowledge is a fundamental
problem in Artificial Intelligence (AI). While it has long been hypothesized
that both symbolic and statistical approaches are necessary to tackle complex
problems in AI, in practice, bridging the two in a combined framework can
bring intractability: most probabilistic first-order logics are simply not
efficient enough for real-world sized tasks. In this talk, I will describe
some of my recent progress on the theory and practice of statistical
relational learning: 1) a scalable learning and reasoning framework called
ProPPR, whose inference time does not depend on the size of the knowledge
graph; 2) a meta-reasoning theory that learns structures from relational
data; and 3) a
joint approach for scalable information extraction and relational reasoning.
This is joint work with William Cohen and Katie Mazaitis.

Abstract

In this talk I am going to focus on the distribution regression problem:
regressing to vector-valued outputs from probability measures. Many important
machine learning and statistical tasks fit into this framework, including
multi-instance learning or point estimation problems without analytical
solution such as hyperparameter or entropy estimation. Despite the large
number of available heuristics in the literature, the inherent two-stage
sampled nature of the problem makes the theoretical analysis quite
challenging: in practice only samples from sampled distributions are
observable, and the estimates have to rely on similarities computed between
sets of points. To the best of our knowledge, the only existing technique with
consistency guarantees for distribution regression requires density estimation
as an intermediate step (which often performs poorly in practice), and the
domain of the distributions to be compact Euclidean. I propose a simple,
analytically computable, ridge regression based alternative to distribution
regression by embedding the distributions to a reproducing kernel Hilbert
space, and learning the regressor from the embeddings to the outputs. I am
going to present the main ideas why this scheme is consistent in the two-stage
sampled setup under mild conditions (on separable topological domains enriched
with kernels) and present an exact computational-statistical efficiency
tradeoff description showing that the studied estimator is able to match the
one-stage sampled minimax optimal rate. Specifically, this result answers a
16-year-old open question by establishing the consistency of the classical set
kernel [Haussler, 1999; Gartner et al., 2002] in regression, and also covers
more recent kernels on distributions, including those due to [Christmann and
Steinwart, 2010]. [Joint work with Bharath Sriperumbudur, Barnabas Poczos,
Arthur Gretton.]
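
A minimal sketch of the two building blocks, assuming a Gaussian base kernel:
the mean-embedding inner product between bags, followed by kernel ridge
regression (toy bags and targets; the talk's analysis covers much more):

    # Set kernel (mean embedding inner product) + kernel ridge regression.
    import numpy as np

    def set_kernel(bag_a, bag_b, gamma=1.0):
        """Average Gaussian kernel over all cross-bag pairs."""
        d2 = ((bag_a[:, None, :] - bag_b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2).mean()

    rng = np.random.default_rng(0)
    bags = [rng.normal(loc=m, size=(50, 2))
            for m in rng.uniform(-2, 2, size=30)]
    y = np.array([b.mean() for b in bags])        # a functional of each bag

    K = np.array([[set_kernel(a, b) for b in bags] for a in bags])
    alpha = np.linalg.solve(K + 1e-3 * np.eye(len(bags)), y)  # ridge weights
    k_new = np.array([set_kernel(bags[0], b) for b in bags])
    print(k_new @ alpha, y[0])                    # prediction vs. target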

Bio: Zoltan Szabo is a Research Associate at the Gatsby Unit, University
College London (2013 - present). He holds a double PhD in Computer Science and
Applied Mathematics from the Eotvos Lorand University (2009-2012; Budapest,
Hungary). His primary research interests are information theory, statistical
machine learning, empirical processes and kernel methods with applications in
remote sensing (sustainability), distribution regression, structured sparsity,
independent subspace analysis and its extensions, collaborative filtering.

Abstract

This talk emphasizes the use of structured methods to model Natural Language
Understanding (NLU) problems. We argue that many NLU tasks can benefit from
using models that are capable of incorporating not just linguistic cues but
also the contexts in which these cues appear. In this talk, we use a
structured approach to model the 'flow of information' in text to solve two
problems: (i) analyzing a paragraph to identify whether a desire expressed in
the paragraph was fulfilled, and (ii) predicting the need for instructor
intervention in MOOC discussion forums.

We first address the problem of reading and understanding a textual paragraph
containing an expression of a desire, to identify whether the desire was fulfilled.
The method reads the paragraph as a story from the perspective of the
protagonist - the entity that expressed the desire. We track the protagonist's
actions and emotional states to make the binary prediction.

We then analyze contents of online educational discussion forums to
automatically suggest threads to instructors that require their intervention.
This can alleviate the need for the instructor to manually peruse all threads
of the forum, and help students who need to interact with the instructor. Our
method incorporates thread structure for the problem by using latent variables
that abstract contents of individual posts and model the flow of information
in the thread.

Abstract

We propose a class of nonparametric two-sample tests with a cost linear in the
sample size. Two tests are given, both based on an ensemble of distances
between analytic functions representing each of the distributions. The first
test uses smoothed empirical characteristic functions to represent the
distributions, the second uses distribution embeddings in a reproducing kernel
Hilbert space. Analyticity implies that differences in the distributions may
be detected almost surely at a finite number of randomly chosen
locations/frequencies. The new tests are consistent against a larger class of
alternatives than the previous linear-time tests based on the (non-smoothed)
empirical characteristic functions, while being much faster than the current
state-of-the-art quadratic-time kernel-based or energy distance-based tests.
Experiments on artificial benchmarks and on challenging real-world testing
problems demonstrate that our tests give a better power/time tradeoff than
competing approaches, and in some cases, better outright power than even the
most expensive quadratic-time tests. This performance advantage is retained
even in high dimensions, and in cases where the difference in distributions is
not observable with low order statistics.
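
A simplified, hedged sketch of the random-frequency idea (without the
smoothing step the actual tests use):

    # Evaluate empirical characteristic-function features at a few random
    # frequencies and form a Hotelling-style statistic from the differences.
    import numpy as np

    def cf_features(x, freqs):
        proj = x @ freqs.T                              # (n, J)
        return np.hstack([np.cos(proj), np.sin(proj)])  # real CF features

    rng = np.random.default_rng(0)
    x = rng.normal(size=(500, 3))
    y = rng.normal(loc=0.5, size=(500, 3))
    freqs = rng.normal(size=(5, 3))                     # J random frequencies

    z = cf_features(x, freqs) - cf_features(y, freqs)   # paired differences
    zbar, S = z.mean(0), np.cov(z.T)
    stat = len(z) * zbar @ np.linalg.solve(S, zbar)     # ~ chi^2(2J) under H0
    print(stat)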

Abstract

Several scientific and engineering problems can be cast as the optimisation of
an expensive black box function. Bayesian Optimisation (BO), a method which
models this function as a sample from a Gaussian Process, is used quite
successfully in a plethora of applications. In Machine Learning, BO is fast
becoming the method of choice to tune hyperparameters for expensive Machine
Learning algorithms (e.g. Neural Networks).

In this talk, I will start with a general introduction to the techniques used
in BO and survey existing theoretical results. We will look at the statistical
and computational challenges in scaling BO to high dimensions. Then I will
present some of our recent work which tackles these challenges using additive
Gaussian processes. We show that regret improves from exponential in dimension
to linear in dimension for additive functions. Empirically, our methods
outperform naive BO and other global optimisation methods on several synthetic
and real problems.
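
A minimal sketch of the additive kernel (the decomposition into groups is
assumed known here; learning it is part of the challenge):

    # Additive GP kernel: a sum of RBF kernels, each acting on a disjoint
    # group of input dimensions.
    import numpy as np

    def additive_rbf(x, y, groups, lengthscale=1.0):
        """Sum of RBF kernels over disjoint coordinate groups."""
        return sum(np.exp(-np.sum((x[g] - y[g]) ** 2)
                          / (2 * lengthscale ** 2))
                   for g in groups)

    x, y = np.random.randn(6), np.random.randn(6)
    groups = [[0, 1], [2, 3], [4, 5]]     # a disjoint decomposition of dims
    print(additive_rbf(x, y, groups))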

Abstract

I will introduce the Non-Metric Space Library, which is an ongoing effort to
develop and evaluate methods for generic non-metric spaces. First, I will
review applications of nearest-neighbor (NN) search in ML/NLP and explain why
non-metric spaces are important. Second, I will survey the state of the art
and explain how ML can be used to devise more efficient search methods. Third,
I will talk about the library itself (i.e., technical details, use cases, and
underlying engineering decisions) and how it compares to the state of the art.
The talk may be useful to those planning to apply NN-methods in their work.
Furthermore, individuals convinced that locality sensitive hashing methods are
superior to all other approaches may also find this talk interesting.
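
A minimal usage sketch of the library's Python bindings (API as in recent
released versions; index parameters are illustrative):

    # Build an HNSW index and query nearest neighbors with nmslib
    # (pip install nmslib).
    import numpy as np
    import nmslib

    data = np.random.randn(1000, 64).astype(np.float32)
    index = nmslib.init(method='hnsw', space='cosinesimil')
    index.addDataPointBatch(data)
    index.createIndex({'M': 16, 'efConstruction': 100})
    ids, dists = index.knnQuery(data[0], k=10)   # nearest neighbors of a point
    print(ids[:3], dists[:3])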

Abstract

Lexical semantics is a sub-field of natural language processing that aims to
obtain and represent the meaning of words in a format that is mutually
intelligible to humans and computers. In this talk we will explore two
different forms of word meaning representations: word vector representations
and word lexicons. In the first part of the talk, I will present a method to
improve existing models of word vector representations with explicit knowledge
from semantic lexicons using a graph-based model. In the second part of the
talk, I will use graph-based semi-supervised learning to construct
wide-coverage morpho-syntactic lexicons with high quality from a small seed
lexicon. The automatically constructed lexicons provide features that
significantly improve performance in two downstream tasks: morphological
tagging and dependency parsing.
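
A hedged sketch of the graph-based refinement in the first part (an
illustrative variant of the update, with toy vectors and lexicon):

    # Nudge each word vector toward the average of its lexicon neighbors
    # while staying close to its original distributional estimate.
    import numpy as np

    def retrofit(vecs, neighbors, alpha=1.0, iters=10):
        new = {w: v.copy() for w, v in vecs.items()}
        for _ in range(iters):
            for w, nbrs in neighbors.items():
                nbrs = [n for n in nbrs if n in new]
                if nbrs:
                    nbr_sum = np.sum([new[n] for n in nbrs], axis=0)
                    new[w] = (alpha * vecs[w] + nbr_sum) / (alpha + len(nbrs))
        return new

    vecs = {w: np.random.randn(5) for w in ["happy", "glad", "joyful"]}
    lexicon = {"happy": ["glad", "joyful"], "glad": ["happy"],
               "joyful": ["happy"]}
    print(retrofit(vecs, lexicon)["happy"])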

Abstract

Trend filtering is a recently developed tool (Steidl et al., 2006; Kim et
al., 2009) for nonparametric regression. Given n points, the trend filtering
estimate is defined as the minimizer of a penalized least squares criterion,
where the penalty is the l1-norm of the kth order discrete derivatives over
the input points. We will give an overview of some interesting connections
between these estimates and adaptive spline estimation through a new function
basis called the "Falling Factorial Basis", and illustrate the provable
statistical superiority of trend filtering over other common nonparametric
regression tools, such as smoothing splines and kernel smoothing.
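
Written out, the estimate is a single penalized least squares problem; a
minimal cvxpy sketch for k = 1 (piecewise linear fits, so the penalty hits
second differences; the tuning parameter is illustrative):

    # Trend filtering (k = 1) as penalized least squares in cvxpy.
    import cvxpy as cp
    import numpy as np

    n = 200
    x = np.linspace(0, 1, n)
    y = np.abs(x - 0.5) + 0.05 * np.random.randn(n)  # piecewise-linear signal

    beta = cp.Variable(n)
    lam = 1.0
    obj = 0.5 * cp.sum_squares(y - beta) + lam * cp.norm1(cp.diff(beta, 2))
    cp.Problem(cp.Minimize(obj)).solve()
    print(beta.value[:5])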

I will also present a generalization of trend filtering to nonparametric
estimation over graphs. This approach is more locally adaptive compared to the
standard methods such as wavelet smoothing and Laplacian smoothing and can be
generically used in many applications. This approach is closely related to TV
denoising and graph-cut-based image segmentation, and is applied to event
detection and semi-supervised learning.

Finally, I will talk about statistical and computational challenges for this
problem that remain open to date.

Abstract

Sparse Gaussian graphical model estimation has become a modern workhorse for
learning networks from high-dimensional datasets of a single modality, such as
gene-expression levels. However, we often would like to learn a cascade of
networks that combines multiple types of data into a single statistical
analysis. In biology, this desire is motivated by the hope that integrating
various kinds of genomic and clinical data will allow us to better dissect the
underpinnings of complex diseases.

This talk describes a principled, scalable approach to learning and
interpreting such cascades of networks with sparse Gaussian chain graph
models. We propose a general recipe for learning such models using sparse
conditional Gaussian graphical models as components. I will also discuss how
our method can be used to discover structured sparsity and can naturally
handle partially-available data. Finally, I will describe recent work on
learning such networks with a million variables on a single machine in about a
day.

Abstract

Apache Spark is a fast and general engine for large-scale data processing, and
it is one of the fastest-growing open-source projects in big data, with ~1K
contributors from academia and industry. This talk will discuss MLlib, the
Machine Learning library built on top of Spark, and our approach to
distributed ML. We will overview the project's current status and future
directions, including algorithmic coverage and scaling/speed improvements.
Finally, we will discuss one or two algorithms in more detail, mentioning
challenges in distributed computing.

This talk will provide background on Spark and MLlib for a general audience,
as well as provide roadmap and implementation details interesting to
experienced Spark users.
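
For a flavor of the API, a minimal pyspark.ml sketch (the dataset path is a
placeholder; the API shown is the standard DataFrame-based one):

    # Train a logistic regression model with Spark MLlib.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
    train = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(train)                       # distributed training
    print(model.coefficients.size)
    spark.stop()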

Bio: Joseph Bradley is an Apache Spark Committer working on MLlib at
Databricks, the startup founded by the creators of Spark. He worked with
Prof. Carlos Guestrin at CMU, receiving a Ph.D. in Machine Learning in 2013
(conditional random fields, parallel sparse regression). He spent a year as a
postdoc at UC Berkeley (sparse models, peer grading in MOOCs) before joining
Databricks.

Abstract

Diversity regularization of latent variable models (LVMs) aims to encourage
the latent factors in LVMs to be different from each other, which can 1)
capture long-tail latent factors; 2) reduce the complexity of models without
losing their modeling power. In this talk, I will introduce 1) how the
diversity regularizer is defined; 2) how to optimize it; 3) a theoretical
justification on why it works; 4) its applications in document modeling and
distance learning.
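
A hedged sketch of one angle-based regularizer in this spirit (mean minus
variance of pairwise non-obtuse angles; treat the exact form as
illustrative):

    # Favor component vectors whose pairwise angles are large on average and
    # uniform; larger values mean more diverse components.
    import numpy as np

    def diversity(components):
        A = components / np.linalg.norm(components, axis=1, keepdims=True)
        angles = []
        for i in range(len(A)):
            for j in range(i + 1, len(A)):
                cos = min(1.0, abs(float(A[i] @ A[j])))
                angles.append(np.arccos(cos))
        angles = np.array(angles)
        return angles.mean() - angles.var()

    W = np.random.randn(5, 20)                 # 5 latent components
    print(diversity(W))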

Bio: Pengtao Xie is a PhD student in the machine learning department of CMU.
His research interests lie in the diversity regularization and scalability of
latent variable models. He obtained a M.E. from Tsinghua University in 2013
and a B.E. from Sichuan University in 2010. He is the recipient of Siebel
Scholarship, Goldman Sachs Global Leader Scholarship and National Scholarship
of China.

Abstract

Convex optimization has developed a wide variety of useful tools critical to
many applications in machine learning. However, unlike linear and quadratic
programming, general convex solvers have not yet reached sufficient maturity
to fully decouple the convex programming model from the numerical algorithms
required for implementation. Especially as datasets grow in size, there is a
significant gap in speed and scalability between general solvers and
specialized algorithms.

This talk addresses this gap with a new model for convex programming based on
an intermediate representation of convex problems as a sum of functions with
efficient proximal operators. This representation serves two purposes: 1) many
problems can be expressed in terms of functions with simple proximal
operators, and 2) the proximal operator form serves as a general interface to
any specialized algorithm that can incorporate additional l2-regularization.
On a single CPU core, numerical results demonstrate that the sum-of-prox form
results in significantly faster algorithms than existing general solvers based
on conic forms. In addition, splitting problems into separable sums is
attractive from the perspective of distributing solver work amongst multiple
cores and machines. We develop a system that scales to 100s of CPU cores and
gigabyte-scale data, enabling general convex programming frameworks to be
applied to a much larger class of problems.
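
The key primitive is the proximal operator; for the l1 norm it is the
familiar soft-thresholding map, shown in a minimal sketch below. A problem
expressed as a sum of such prox-friendly functions is then split so each term
is handled this way:

    # Proximal operator of lam * ||.||_1: soft-thresholding.
    import numpy as np

    def prox_l1(v, lam):
        """prox(v) = argmin_x 0.5 * ||x - v||^2 + lam * ||x||_1."""
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    v = np.array([-1.5, -0.2, 0.1, 2.0])
    print(prox_l1(v, 0.5))   # [-1. -0.  0.  1.5]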

We apply large-scale convex programming to several problems arising from
building the next-generation, information-enabled electrical grid. In these
problems (as is common in many domains) large, high-dimensional datasets
present opportunities for novel data-driven solutions. We present approaches
based on convex models for several problems: probabilistic forecasting of
electricity generation and demand, model predictive control for device energy
management, and source separation for whole-home energy disaggregation.

Abstract

Data clustering and representation learning are two closely related tasks that
can mutually benefit each other. On one hand, the feature vectors produced by
representation learning are the inputs of data clustering and largely affect
the clustering performance. On the other hand, the cluster labels generated
by clustering algorithms can supervise representation learning to make the
learned representations tailored to clustering tasks. In this work, we aim to
integrate these two tasks into one unified framework to perform them
simultaneously and enable them to mutually promote each other. Specifically,
for text data, we design a Multi-grain Clustering Topic Model to
simultaneously perform document clustering and topic learning; for image data,
we design a Double Layer Gaussian Mixture Model to integrate image clustering
and codebook learning. Experiments on various datasets demonstrate that
integrating data clustering and representation learning can effectively
improve the performances of both tasks.

Bio: Pengtao Xie is a graduate student in the Language Technologies Institute,
working with Professor Eric Xing. His primary research interests lie in latent
space models and large scale distributed machine learning. He received a M.E.
from Tsinghua University in 2013 and a B.E. from Sichuan University in 2010.
He is the recipient of Siebel Scholarship, Goldman Sachs Global Leader
Scholarship and National Scholarship of China.

Abstract

An accurate model of a patient's individual survival distribution can
help determine the appropriate treatment and care of terminal
patients. The common practice of estimating such survival
distributions uses only population averages for (say) the site and
stage of cancer; however, this is not very precise, as it ignores many
important individual differences among patients. This paper describes
a novel technique, PSSP (patient-specific survival prediction), for
estimating a patient's individual survival curve, based on the
characteristics of that specific patient, using a model that was
learned from earlier patients. We describe how PSSP works, and explain
how PSSP differs from the more standard tools for survival analysis
(Kaplan-Meier, Cox Proportional Hazard, etc). We also show that PSSP
is "calibrated", which means that its probabilistic estimates are
meaningful. Finally, we demonstrate, over many real-world datasets
(various cancers, and liver transplantation), that PSSP provides
survival estimates that are helpful for patients, clinicians and
researchers. This tool is freely available at
http://pssp.srv.ualberta.ca/.

Bio:
After earning a PhD from Stanford, Russ Greiner worked in both
academic and industrial research before settling at the University of
Alberta, where he is now a Professor in Computing Science and the
founding Scientific Director of the Alberta Innovates Centre for
Machine Learning, which won the ASTech Award for "Outstanding
Leadership in Technology" in 2006. He has been Program Chair for the
2004 "Int'l Conf. on Machine Learning", Conference Chair for 2006
"Int'l Conf. on Machine Learning", Editor-in-Chief for "Computational
Intelligence", and is serving on the editorial boards of a number of
other journals. He was elected a Fellow of the AAAI (Association for
the Advancement of Artificial Intelligence) in 2007, and was awarded a
McCalla Professorship in 2005-06 and a Killam Annual Professorship in
2007. He has published over 200 refereed papers and patents, most in
the areas of machine learning and knowledge representation, including
4 that have been awarded Best Paper prizes. The main foci of his
current work are (1) bioinformatics and medical informatics; (2)
learning and using effective probabilistic models and (3) formal
foundations of learnability.

Abstract

In this talk, I will describe some theoretical results that provide new perspectives on why kernel methods might work well in practice. In particular, we provide some partial answers to the question (Q): why might the Gaussian RBF kernel (which projects the data into infinite dimensions) often yield high classification accuracies without overfitting in practice?

To answer this question, I first present results from a simpler setting called two-sample testing (TST), a hypothesis testing problem that has close ties with classification. For TST, we prove that* the power of an RBF kernel test statistic equals the power of the linear kernel test statistic in high dimensions, whenever the linear kernel actually suffices to differentiate the two distributions. I will then relate the linear kernel TST to the Fisher LDA classifier, and present a result claiming that* this classifier also performs optimally for TST.

Based on this theory, my conjectured answer for (Q) has two parts - the first is that practical examples have fairly simple decision boundaries (low polynomials), and the second is that the RBF kernel is "automatically adaptive" to such simple boundaries (i.e. it behaves similar to a linear kernel whenever a linear classifier suffices).

* means under some reasonable assumptions.

I am currently proving similar results for the related problems of independence testing and regression. Many people have contributed to this sequence of works (some in preparation), including Larry Wasserman, Aarti Singh, Sashank Reddi, and Barnabas Poczos.

Abstract

Many emerging applications of machine learning involve time series and
spatio-temporal data. In this talk, I will discuss a collection of machine
learning approaches to effectively analyze and model large-scale time series
and spatio-temporal data, including temporal causal models, sparse
extreme-value models, and fast tensor-based forecasting models. Experiment
results will be shown to demonstrate the effectiveness of our models in
climate science and social media applications.

Abstract

Many developments in Statistics and Machine Learning involve the computation
of higher order derivatives of Gaussian density functions. In the
multidimensional case, it is necessary to first establish a convenient
formulation to assemble all the partial derivatives. Theoretically, it is
possible to derive succinct explicit expressions for the multidimensional
Gaussian density derivatives of an arbitrarily high order if we formally
arrange all the partial derivatives into a high-dimensional vector. However,
the huge matrices involved in these general expressions have traditionally
made them of little practical use. In this talk we propose several recursive
algorithms that overcome these difficulties and allow us to compute these
higher-order derivatives very efficiently. We will also highlight the
underlying connections between higher order derivatives of Gaussian density
functions, the expected value of products of quadratic forms in Gaussian
random variables, and V-statistics of degree two based on Gaussian density
functions, and their applications to nonparametric kernel smoothing.
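
A 1-D illustration of the recursive flavor (the talk's algorithms address the
much harder multivariate, arbitrary-order case): derivatives of the standard
Gaussian density satisfy d^k/dx^k \phi(x) = (-1)^k He_k(x) \phi(x), where the
probabilists' Hermite polynomials obey a two-term recursion:

    # Gaussian density derivatives via the Hermite recursion
    # He_{k+1}(x) = x He_k(x) - k He_{k-1}(x).
    import numpy as np

    def gaussian_derivatives(x, order):
        phi = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
        he_prev, he = np.ones_like(x), x.copy()    # He_0, He_1
        derivs = [phi, -he * phi]
        for k in range(1, order):
            he_prev, he = he, x * he - k * he_prev
            derivs.append((-1.0) ** (k + 1) * he * phi)
        return derivs[: order + 1]

    x = np.linspace(-3, 3, 5)
    print(gaussian_derivatives(x, 3)[2])           # second derivative values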

Abstract

Proteins are the workhorses of the cellular machinery. Disease-causing
pathogens, such as bacteria and viruses, introduce their proteins into host
cells. There these proteins interact with the host's proteins and enable the
pathogen to obtain nutrients, replicate and survive inside the host. Often,
multiple diseases involve organisms that are related phylogenetically or share
some biological properties. For instance, related viruses will employ similar
strategies to infect the host cells. Therefore, knowledge can be shared across
diseases to better understand various biological phenomena.

In this talk I will present two approaches towards integrating host-pathogen
protein interactions from various diseases, with the goal of building good
predictive models for each disease. While one approach uses the commonality in
the infection mechanisms to enable sharing of information across tasks, the
other enforces a shared low-rank structure along with a task-specific sparse
structure. We see significant improvements in the prediction performance. This
is joint work with Jaime Carbonell and Judith Klein-Seetharaman.

Abstract

Contextual bandit learning is a fairly recent learning paradigm that captures
a fundamental tension between exploration and exploitation in many real-world
learning problems. In this setting, a learner, for several rounds, observes a
context, makes a decision, and receives a reward for that decision, with the
goal of having low regret relative to a class of policies mapping contexts to
actions.

In this talk, I will give a brief overview of the many algorithms and
techniques for contextual bandit learning, highlighting the positive and
negative aspects of these algorithms. I will also, time permitting, talk about
several extensions to the classical setup, and the approaches used for these
settings.

Abstract

I will present a method that detects compact low-dimensional structures if
they exist, and uses them to construct compact interpretable models for
different machine learning tasks that can benefit practical applications. To
start with, I will formalize Informative Projection Recovery, the problem of
extracting a small set of low-dimensional projections of data that jointly
support an accurate model for a given learning task. Our solution to this
problem is a regression-based algorithm that identifies informative
projections by optimizing over a matrix of point-wise loss estimators. It
generalizes to multiple types of machine learning problems, offering solutions
to classification, clustering, regression, and active learning tasks.
Experiments show that our method can discover and leverage low-dimensional
structures in data, yielding accurate and compact models. Our method is
particularly useful in applications in which expert assessment of the results
is essential, such as classification tasks in the healthcare domain.

Additionally, we developed an active learning framework that works with the
obtained compact models to find the most informative unlabeled data. For
this purpose, we enhance standard active selection criteria using the
information encapsulated by the trained model. The advantage of our approach
is that the labeling effort is expended mainly on samples that benefit models
from the hypothesis class we are considering. Experiments show that this
results in an improved learning rate over standard selection criteria for data
from the clinical domain, while the comprehensible view of the data supports
the labeling process and helps preempt labeling errors.

Abstract

The development of sparse methods for high-dimensional supervised and
unsupervised learning has been one of the main directions of research in
machine learning and statistics for several years. Recently, this trend has
seen the proposal of several feature-sparse algorithms for the twin problems
of clustering and Gaussian Mixture Model (GMM) learning. The motivation for
the development of these methods is twofold. First is the typical statistical
benefit, namely the potential reduction of the effects of the curse of
dimensionality. Second, as clustering and GMM learning are often used for
exploratory data analysis, sparse results may improve the intuitive
interpretability of results in a high-dimensional setting.

In this talk, I will give an overview of some of the existing feature-sparse
algorithms for clustering and GMM learning, focusing mainly on what is known
about their statistical properties. I will describe recent advances made
towards developing methods with proven statistical benefits, and the large
gaps left to fill to make these methods practical.

Abstract

Unsupervised discovery of synonymous phrases is useful in a variety of tasks
ranging from text mining and search engines to semantic analysis and machine
translation. The Near Synonym System (NeSS) presents an unsupervised
corpus-based conditional model for finding phrasal synonyms and near synonyms
that requires only a large monolingual corpus. The method is based on
maximizing information-theoretic combinations of shared contexts and is
parallelizable for large-scale processing. An evaluation framework with
crowd-sourced judgments is proposed and the results are compared with
alternate methods, demonstrating considerably superior results to the
literature and to thesaurus lookup for multi-word phrases. Moreover, the
results show that the statistical scoring functions and overall scalability
of the system are more important than language-specific NLP tools. The method
is language-independent and practically usable thanks to its accuracy and
real-time performance via parallel decomposition.

Abstract

How is information organized in the brain when it reads? Where and when do the
required processes occur, such as perceiving the individual words, combining
them with the previous words and maintaining a representation of the overall
meaning?

I will present results from a recent experiment in which we align
context-based neural network language models and brain activity during
reading. When processing a text word by word, both the brain and the neural
networks perform the same processes. They both maintain a representation for
the previous context. They both represent the properties of the incoming word
and then integrate it with context. We study the alignment between the latent
vectors used by these neural networks and the brain activity observed via
Magnetoencephalography (MEG) when subjects read a chapter from Harry Potter
and the Sorcerer's Stone. For that purpose we apply the neural network to the
same chapter the subjects are reading, and explore the ability of these vector
representations to predict the observed word-by-word brain activity.
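
One common way to implement such a predictive test is sketched below, under
assumptions of my own (a cross-validated ridge map from per-word network
vectors to per-word sensor recordings; the experiment's actual pipeline is
not specified in the abstract).

    import numpy as np
    from sklearn.linear_model import RidgeCV

    def alignment_score(word_vectors, brain_activity, n_train):
        """Fit a ridge map from the network's per-word latent vectors to
        the recorded activity, then score prediction of held-out words
        (R^2, averaged over outputs)."""
        model = RidgeCV(alphas=np.logspace(-2, 4, 13))
        model.fit(word_vectors[:n_train], brain_activity[:n_train])
        return model.score(word_vectors[n_train:], brain_activity[n_train:])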

Our novel results reveal that context is more predictive of brain activity
than the properties of the current word, hinting that more brain activity is
involved in storing context than in perceiving the current word. We uncover
the time-line of how the brain updates its representation of context. We
demonstrate the incremental perception of every new word starting early in the
visual cortex, moving next to the temporal lobes and finally to the frontal
regions. We show the integration process occurring in the temporal lobes after
the new word has been perceived.

This is joint work with Ashish Vaswani, Kevin Knight and Tom Mitchell, and is
a part of a larger effort to understand how the brain organizes information in
natural reading. I will describe this general research direction. I will also
mention results from a sister experiment in which we demonstrate how the brain
areas involved in reading are processing different types of information (such
as syntax, semantics or narrative information). This second experiment is
joint work with Brian Murphy, Partha Talukdar, Alona Fyshe, Aaditya Ramdas and
Tom Mitchell.

Abstract

One of the fundamental challenges in reinforcement learning (RL) is to
guarantee that a newly proposed policy that has not yet been deployed will be
an improvement upon the current policy---that the RL algorithm is "safe". Such
an algorithm would be a significant step towards widespread application of RL
to real-life problems where deployment of an RL algorithm can be costly or
dangerous if the policies it proposes perform worse than the current policy.

I will discuss my recent work (in collaboration with Adobe Research) on such a
safe policy search algorithm. The viability of our approach hinges on a new
concentration inequality particularly well suited to this application. I will
present results from a large real-world digital marketing application before
concluding with discussion of potential future research directions.

Bio: Philip is a PhD student supervised by Professor Andrew Barto at the
University of Massachusetts Amherst. During his PhD he has been an intern with
Adobe Research and Kenji Doya's Neural Computation Unit. He received his B.S.
and M.S. in computer science from Case Western Reserve University, where he
was supervised by Professor Michael Branicky and where he was a member of Team
Case in the DARPA Urban Challenge.

Abstract

We study the distributed computing setting in which there are multiple
servers, each holding a set of points, who wish to compute functions on the
union of their point sets. We first discuss two popular center-based
clustering objectives, k-median and k-means. Following a classic approach in
clustering by Har-Peled and Mazumdar, we reduce the problem of finding a
clustering with low cost to the problem of finding a coreset (i.e., a summary)
of small size. We provide a distributed method for constructing a global
coreset which improves over the previous methods by reducing the communication
cost.

We then consider Principal Component Analysis (PCA) where the servers would
like to compute a low dimensional subspace capturing as much of the variance
of the union of their point sets as possible. We provide a computation and
communication efficient algorithm for distributed PCA and also demonstrate how
this can be used to further improve the communication and computational costs
of k-means clustering and related problems. Our empirical study on real world
data shows a speedup of orders of magnitude, preserving communication with
only a negligible degradation in solution quality.
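
A minimal sketch of one standard recipe for distributed PCA (each server
sends a small SVD summary and the coordinator re-decomposes the stacked
summaries); details such as the summary size t are assumptions, not the exact
protocol from the talk.

    import numpy as np

    def local_summary(A, t):
        """A server compresses its n_i x d point set into t rows: the top
        right singular directions scaled by the singular values."""
        _, s, Vt = np.linalg.svd(A, full_matrices=False)
        return np.diag(s[:t]) @ Vt[:t]

    def distributed_pca(point_sets, t, k):
        """Coordinator stacks per-server summaries (t rows each, so
        communication is O(t*d) per server) and extracts k directions."""
        stacked = np.vstack([local_summary(A, t) for A in point_sets])
        _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
        return Vt[:k]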

Abstract

Communication costs, resulting from synchronization requirements during
learning, can greatly slow down many parallel machine learning algorithms. We
present a parallel Markov chain Monte Carlo (MCMC) algorithm in which subsets
of data are processed independently, with very little communication. First, we
arbitrarily partition data onto multiple machines. Then, on each machine, any
classical MCMC method may be used to draw samples from a posterior
distribution given the data subset. Finally, the samples from each machine are
combined to form samples from the full posterior. This embarrassingly parallel
algorithm allows each machine to act independently on a subset of the data
(without communication) until the final combination stage. We prove that our
algorithm generates asymptotically exact samples and empirically demonstrate
its ability to parallelize burn-in and sampling in several models.
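
A minimal sketch of the combination stage, assuming a Gaussian approximation
to each machine's subposterior (one simple choice; the algorithm's actual
combination rules may differ):

    import numpy as np

    def combine_subposteriors(subposterior_samples, n_out=1000):
        """Multiply Gaussian approximations of the M subposteriors: the
        product has precision = sum of precisions and mean given by the
        precision-weighted average of the subposterior means."""
        precs, weighted = [], []
        for S in subposterior_samples:           # S: (num_samples, dim)
            prec = np.linalg.inv(np.atleast_2d(np.cov(S, rowvar=False)))
            precs.append(prec)
            weighted.append(prec @ S.mean(axis=0))
        cov = np.linalg.inv(sum(precs))
        mean = cov @ sum(weighted)
        return np.random.multivariate_normal(mean, cov, size=n_out)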

This is joint work with Chong Wang and Eric Xing.

Abstract

This talk will be about a common practical problem - estimating piecewise
constant/linear/quadratic fits to low-dimensional data. I will first introduce
Trend Filtering, a recently proposed tool for this problem, and compare it to
the popular smoothing splines. Given its theoretical optimality (that I will
briefly touch upon), the only roadblock to using it in practice is having
robust and efficient algorithms. We take a major step in overcoming this
problem with a state-of-the-art specialized "ADMM" algorithm. Furthermore, the
proposed implementation is very simple, and importantly, it is flexible enough
to extend to many interesting related problems, such as sparse trend filtering
and isotonic trend filtering, as I will demonstrate with an application to a
neuroscience problem.

Software for our method is easy to use, and it is made freely available,
highly optimized in C++, and interfaced through an R package (function
"trendfilter" in package "genlasso").

This is joint work with Ryan Tibshirani (Statistics and ML, Carnegie Mellon
University).

Abstract

Dependency trees used in syntactic parsing often include a root node
representing a dummy word prefixed or suffixed to the sentence, a device that
is generally considered a mere technical convenience and is tacitly assumed to
have no impact on empirical results. We demonstrate that this assumption is
false and that the accuracy of data-driven dependency parsers can in fact be
sensitive to the existence and placement of the dummy root node. In
particular, we show that a greedy, left-to-right, arc-eager transition-based
parser consistently performs worse when the dummy root node is placed at the
beginning of the sentence (following the current convention in data-driven
dependency parsing) than when it is placed at the end or omitted completely.
Control experiments with an arc-standard transition-based parser and an
arc-factored graph-based parser reveal no consistent preferences but
nevertheless exhibit considerable variation in results depending on root
placement. We conclude that the treatment of dummy root nodes in data-driven
dependency parsing is an underestimated source of variation in experiments and
may also be a parameter worth tuning for some parsers.

Miguel is a visiting lecturer and postdoc at Pompeu Fabra University,
Barcelona, Spain. He works on natural language processing and machine
learning, with a special interest in linguistic structure prediction
problems such as dependency parsing and phrase structure parsing. He
completed his B.Sc., M.Sc., and PhD at the Universidad Complutense de
Madrid. In recent years, he has been a visiting researcher at the
universities of Uppsala, Birmingham, and Singapore.

Abstract

A common assumption in Economics is that agents are utility maximizers: an
agent, facing prices for a set of goods, will choose to buy the bundle of
goods that she most prefers among all bundles that she can afford, according
to some concave, non-decreasing utility function. In the classical revealed
preference analysis, the goal is to fit some concave, non-decreasing model to
data from an agent's past purchases. However, due to the richness of this
model class, this approach will not generalize well to predicting future
purchases. A recent line of work, starting with Beigman and Vohra (2006) and
Zadimoghaddam and Roth (2012), has addressed this issue by suggesting to learn
restricted classes of utility function from revealed preference data.

This talk will present recent advances in this line of work. We provide
sample complexity guarantees and efficient algorithms for learning linear
utility functions from revealed preference data. At a technical level, our
work establishes connections between learning from revealed preferences and
problems of multi-class learning, combining recent advances on intrinsic
sample complexity of multi-class learning based on compression schemes with a
new algorithmic analysis yielding time- and sample-efficient procedures. Our
technique yields numerous generalizations including the ability to learn
other well-studied classes of utility functions, to deal with a misspecified
model, and with non-linear prices. We believe it may lead to new solutions to
a variety of learning problems in economic and game theoretic contexts.

This talk is based on joint work with Nina Balcan, Amit Daniely, Ruta Mehta
and Vijay V. Vazirani, that will appear at WINE '14.

Abstract

Datamining--i.e. finding repeated, informative patterns in large datasets--has
proven extremely difficult for visual data. A key issue in visual data mining
is the lack of a reliable way to tell whether two images or image patches even
depict the same thing. In this talk, I'll cover two recent works which can
successfully mine discriminative sets of image patches in both
weakly-supervised and fully unsupervised settings.

Our first work proposes discriminative mode seeking, an extension of Mean
Shift to weakly-labeled data. Instead of finding the local maxima of a
density, we exploit the weak label to partition the data into two sets and
find the maxima of the density ratio. Given a dataset of image patches, and
weak labels such as scene categories, these 'discriminative modes' correspond
to remarkably meaningful visual patterns, including objects and object parts.
Using these discriminative patches as an image representation, we obtain
state-of-the-art results on a challenging indoor scene classification
benchmark.

In the second part of the talk, I will discuss how we can extend this
formulation to a fully unsupervised setting. Instead of using weak labels as
supervision, we use the ability of an object patch to predict the rest of the
object (its context) as supervisory signal to help discover visually
consistent object clusters. The proposed method outperforms previous
unsupervised as well as weakly-supervised object discovery approaches, and is
shown to provide correspondences detailed enough to transfer keypoint
annotations, even for extremely difficult datasets intended for benchmarking
fully supervised object detection algorithms (e.g. Pascal VOC).

Abstract

Recently, many cloud based machine learning (ML) services have been
launched, including Microsoft Azure Machine Learning, GraphLab, Google
Prediction API and Ersatz Labs. Cloud ML makes machine learning very easy to
use for common users. However, it invades the privacy and security of users'
data. How to protect users' privacy in cloud ML is a big challenge. In this
work, we focus on neural networks, a backbone model in machine learning,
and investigate how to perform privacy-preserving neural network
prediction on encrypted data. Users encrypt their data before uploading
them to the cloud. The cloud performs neural network predictions over the
encrypted data and obtains results that are also in encrypted form, which
the cloud cannot decipher. The encrypted results are sent back to the
users, who decrypt them to obtain the plaintext results. In this process,
the cloud never learns the users' input data or output results, since both
are encrypted. This achieves strong protection of users' privacy.
Meanwhile, with the help of homomorphic encryption, predictions made on
encrypted data are nearly the same as those on plaintext data, so the
predictive performance of the neural network is preserved.
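
To make the flow concrete, the toy below uses a (deliberately tiny)
Paillier cryptosystem, which is additively homomorphic, to let a "cloud"
evaluate one linear layer on ciphertexts. This illustrates only the
encrypt/compute/decrypt round trip; the talk's actual scheme and its
handling of full networks are not specified here.

    import random
    from math import gcd

    # Toy Paillier keys (tiny primes; illustration only, not secure).
    p, q = 293, 433
    n, n2 = p * q, (p * q) ** 2
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)
    g = n + 1
    L = lambda u: (u - 1) // n
    mu = pow(L(pow(g, lam, n2)), -1, n)

    def enc(m):
        r = random.randrange(2, n)
        while gcd(r, n) != 1:
            r = random.randrange(2, n)
        return pow(g, m, n2) * pow(r, n, n2) % n2

    def dec(c):
        return L(pow(c, lam, n2)) * mu % n

    x = [3, 5, 2]                        # user's private inputs
    w = [2, 1, 4]                        # cloud's (integer) weights
    cts = [enc(v) for v in x]            # user uploads ciphertexts only
    acc = 1
    for c, wi in zip(cts, w):
        acc = acc * pow(c, wi, n2) % n2  # Enc(m)^w = Enc(w*m); products add
    print(dec(acc))                      # user decrypts: 2*3 + 1*5 + 4*2 = 19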

Abstract

Inferring the "tree of life" from genetic data is an important problem in
evolutionary biology. It has been customary to think about this problem as one
of learning the evolutionary history of a single gene. However, individual
genes might have evolutionary histories that are (topologically) distinct from
each other and, of course, from the underlying tree of life. In this talk, I
will discuss a probabilistic model of evolution that takes this into account.
I will then outline our recent theoretical explorations into questions such as
reliable algorithms for learning the tree of life from several genes and the
amount of data required for such algorithms to succeed.

Abstract

Vector space models (VSMs) represent word meanings as points in a high
dimensional space. VSMs are typically created using large text corpora, and
so represent word semantics as observed in text. We present a new algorithm
(JNNSE) that can incorporate a measure of semantics not previously used to
create VSMs: brain activation data recorded while people read words. The
resulting model takes advantage of the complementary strengths and weaknesses
of corpus and brain activation data to give a more complete representation of
semantics. Evaluations show that the model 1) matches a behavioral measure of
semantics more closely, 2) can be used to predict corpus data for unseen words
and 3) has predictive power that generalizes across brain imaging technologies
and across subjects. We believe that the model is thus a more faithful
representation of mental vocabularies.
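
A stripped-down sketch of the joint-factorization idea, assuming plain
alternating least squares with a shared word embedding A (JNNSE itself
additionally imposes non-negativity and sparsity, which this sketch omits):

    import numpy as np

    def joint_embedding(X_corpus, X_brain, k, iters=100, seed=0):
        """Learn one per-word representation A (words x k) that
        reconstructs both data views: X_view ~ A @ S_view."""
        rng = np.random.default_rng(seed)
        A = rng.random((X_corpus.shape[0], k))
        for _ in range(iters):
            S1 = np.linalg.lstsq(A, X_corpus, rcond=None)[0]
            S2 = np.linalg.lstsq(A, X_brain, rcond=None)[0]
            # update the shared A against both stacked reconstructions
            S = np.hstack([S1, S2])
            X = np.hstack([X_corpus, X_brain])
            A = np.linalg.lstsq(S.T, X.T, rcond=None)[0].T
        return A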

Abstract

We consider the question of how unlabeled data can be used to estimate the
true accuracy of learned classifiers. This is an important question for any
autonomous learning system that must estimate its accuracy without
supervision, and also when classifiers trained from one data distribution must
be applied to a new distribution (e.g., document classifiers trained on one
text corpus are to be applied to a second corpus). We show how to accurately
estimate error rate from unlabeled data when given a collection of competing
classifiers that make independent errors, based on the agreement rates between
subsets of these classifiers. We further show that even when the classifiers
do not make independent errors, both their accuracies and error dependencies
can be estimated in a multitask learning setting under practical assumptions.
Experiments on two data sets demonstrate accurate estimates of accuracy from
unlabeled data. These results are of practical significance in situations
where labeled data is scarce, and shed light on the more general question of
how the consistency among multiple functions is related to their true
accuracies.
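
For three binary classifiers with independent errors, the agreement-rate
idea admits a closed form, sketched below (my own minimal instantiation,
assuming all accuracies exceed 1/2):

    import numpy as np

    def accuracies_from_agreement(pred_a, pred_b, pred_c):
        """With d_i = 2*accuracy_i - 1 and independent errors,
        P(i and j agree) = (1 + d_i d_j) / 2, so each pairwise agreement
        rate reveals a product d_i d_j, and each d_i follows by
        elimination."""
        g_ab = 2 * np.mean(pred_a == pred_b) - 1   # = d_a * d_b
        g_ac = 2 * np.mean(pred_a == pred_c) - 1   # = d_a * d_c
        g_bc = 2 * np.mean(pred_b == pred_c) - 1   # = d_b * d_c
        d_a = np.sqrt(g_ab * g_ac / g_bc)          # sign fixed by acc > 1/2
        d_b, d_c = g_ab / d_a, g_ac / d_a
        return [(d + 1) / 2 for d in (d_a, d_b, d_c)]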

Abstract

Commanding robots through unconstrained natural language directions is
intuitive, flexible, and does not require specialized interfaces or training.
Providing this capability would enable effortless coordination in human robot
teams that operate in non-specialized environments. However, natural language
direction following through unknown environments requires understanding the
meaning of language, using a partial semantic world model to generate actions
in the world, and reasoning about the environment and landmarks that have not
yet been detected.

We address the problem of robots following natural language directions through
complex unknown environments. By exploiting the structure of spatial
language, we can frame direction following as a problem of sequential decision
making under uncertainty. We learn a policy using imitation learning from
demonstrations of people following directions. The trained policy predicts a
sequence of actions that follow the directions, explores the environment
(discovering new landmarks), backtracks when necessary, and explicitly
declares when it has reached its destination. By training explicitly in
unknown environments we can generalize to situations that have not been
encountered previously.

Abstract

Ensemble methods are widely used in practice with the hope of obtaining better
predictive performance than could be obtained from any of the constituent
classifiers in the ensemble. Most of the existing literature is concerned with
learning ensembles in a supervised setting. In this paper we propose an
unsupervised iterative algorithm to combine the discriminant scores from
different binary classifiers. We prove that (under certain assumptions) the
Area Under the ROC Curve (AUC) of the resulting ensemble is greater than or
equal to the AUC of the best classifier (with maximum AUC). We also
experimentally validate this claim on a number of datasets, and show that
the resulting performance is better than that of supervised ensembles.

Abstract

In this talk, I will present some of my recent work with my collaborators on
building models for inferring political ideologies from text. Given political
candidate speeches from the 2008 and 2012 US elections, we seek to measure
their ideological positioning. To accomplish this, we infer ideological cues
from a corpus of political writings annotated with known ideologies. We then
represent the speeches of U.S. presidential candidates as sequences of cues
and lags (filler distinguished only by its length in words). We apply a
domain-informed Bayesian HMM to infer the proportions of ideologies each
candidate uses in each campaign. The results are validated against a set of
preregistered, domain expert authored hypotheses. I will also present some
preliminary results on our work with briefs and data from the US Supreme
Court, studying the latent behaviors of amicus brief filers from a utility
maximizing perspective.

Abstract

Graph-based Semi-supervised learning (SSL) algorithms have been successfully used in a large
number of applications. These methods classify initially unlabeled nodes by propagating label
information over the structure of the graph, starting from seed nodes. Graph-based SSL algorithms
usually scale linearly with the number of edges (|E|) and with the number of distinct labels (m),
and require O(m) space on each node. Unfortunately, there exist many applications of practical
significance with very large m over large graphs, demanding better space and time complexity.
In this talk, we propose MAD-Sketch, a novel graph-based SSL algorithm which compactly
stores label distribution on each node using Count-min Sketch, a randomized data structure. We
present theoretical analysis showing that under mild conditions, MAD-Sketch can reduce space
complexity at each node from O(m) to O(log m), and achieve similar savings in time complexity
as well. We support our analysis through experiments on multiple real world datasets. We
observe that MAD-Sketch achieves similar performance as existing state-of-the-art graph-based
SSL algorithms, while requiring smaller memory footprint and at the same time achieving up to
10x speedup. We find that MAD-Sketch is able to scale to datasets with one million labels, which
is beyond the scope of existing graph-based SSL algorithms.
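
The space saving comes from the Count-min Sketch itself; a minimal version
is sketched below (illustrative only; MAD-Sketch additionally propagates and
combines per-node sketches over the graph):

    import numpy as np

    class CountMinSketch:
        """Approximate label->weight map in O(depth * width) space,
        independent of the number of distinct labels m; queries may
        overestimate slightly due to hash collisions."""
        def __init__(self, width=272, depth=3, seed=0):
            rng = np.random.default_rng(seed)
            self.table = np.zeros((depth, width))
            self.salts = [int(s) for s in rng.integers(1, 2**31, size=depth)]
            self.width = width

        def _cells(self, label):
            return [(row, hash((salt, label)) % self.width)
                    for row, salt in enumerate(self.salts)]

        def add(self, label, weight=1.0):
            for row, col in self._cells(label):
                self.table[row, col] += weight

        def query(self, label):
            # min over rows damps overestimates from hash collisions
            return min(self.table[row, col] for row, col in self._cells(label))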

Abstract

In many modern applications built on massive data and using high-dimensional
models, such as web-scale content extraction via topic models, genome-wide
association mapping via sparse regression, and image understanding via deep
neural networks, one needs to handle BIG machine learning problems that
threaten to exceed the limit of current infrastructures and algorithms.
While the ML community continues to strive for new scalable algorithms, and
several attempts at developing new system architectures for BIG ML have
emerged to address the challenge on the backend, good dialogue between ML
and systems remains difficult: most algorithmic research remains
disconnected from the real systems and data it will face, and the
generality, programmability, and theoretical guarantees of most systems on
ML programs remain largely unclear. In this talk, I will present Petuum, a
general-purpose framework for distributed machine learning, and demonstrate
how innovations in scalable algorithms and distributed systems design work
in concert to achieve multiple orders of magnitude of scalability on a
modest cluster for a wide range of large-scale problems in social networks
(mixed-membership inference on 40M nodes), personalized genome medicine
(sparse regression on 100M dimensions), and computer vision (classification
over 20K labels), with provable guarantees on the correctness of
distributed inference.

Abstract

In recent years, with the advancement of large-scale data acquisition
technology in various engineering, scientific, and socio-economic
domains, traditional machine learning and statistical methods have
started to struggle with massive amounts of increasingly
high-dimensional data. Luckily, in many problems there is a simple
structure underlying high-dimensional data that can be exploited to
make learning feasible.

In this talk, I will focus on the problem of detection and
localization of a contiguous block of weak activation in a large
matrix, from a small number of noisy, possibly adaptive, compressive
measurements. This is closely related to the problem of compressed
sensing, where the task is to estimate a sparse vector using a small
number of linear measurements. Contrary to results in compressed
sensing, where it has been shown that neither adaptivity nor
contiguous structure help much, we show that for reliable localization
the magnitude of the weakest signals is strongly influenced by both
structure and the ability to choose measurements adaptively, while for
detection neither adaptivity nor structure reduces the requirement on
the magnitude of the signal. We characterize the precise tradeoffs
between the various problem parameters, the signal strength and the
number of measurements required to reliably detect and localize the
block of activation. The sufficient conditions are complemented with
information theoretic lower bounds.

Joint work with Sivaraman Balakrishnan, Alessandro Rinaldo and Aarti Singh.

Abstract

Human language is the result of cognitive processes whose contours
are---at best---incompletely understood. Given the incomplete
information we have about the processes involved, the frequently
disappointing results obtained from attempts to use unsupervised
learning to uncover latent linguistic structures (e.g., part-of-speech
sequences, syntax trees, or word alignments in parallel data) can be
attributed---in large part---to model misspecification.

This work introduces a novel framework for unsupervised learning of
structured predictors with overlapping, global features. Each input's
latent representation is predicted conditional on the observable data
using a feature-rich conditional random field. Then a reconstruction
of the input is generated, conditional on the latent structure, as
drawn from cheaply-estimated multinomials. The autoencoder structure
enables efficient inference without unrealistic independence
assumptions, enabling us to incorporate the often conflicting,
overlapping theories (in the form of hand-crafted features) about how
latent structures relate to observed data in a coherent model. We
contrast our approach with traditional joint unsupervised models that
are learned to maximize the marginal likelihood of observed data. We
show competitive results with instantiations of the model for two
canonical NLP tasks: part-of-speech induction and bitext word
alignment, and show that training our model is substantially more
efficient than training feature-rich models.

Abstract

Modern applications awaiting next generation machine intelligence systems
have posed unprecedented scalability challenges. These scalability needs
arise from at least two aspects: 1) massive data volume, such as
societal-scale social graphs with up to hundreds of millions of nodes; and
2) massive model size, such as the Google Brain deep neural network
containing billions of parameters. Although there exist means and theories
to support reductionist approaches like subsampling data or using small
models, there is an imperative need for sound and effective distributed ML
methodologies for users who cannot be well-served by such shortcuts. To this
end, we propose a parameter server system for distributed ML, which follows
a Stale Synchronous Parallel (SSP) model of computation that maximizes the
time computational workers spend doing useful work on ML algorithms, while
still providing correctness guarantees. The parameter server provides an
easy-to-use shared interface for read/write access to an ML model's values
(parameters and variables), and the SSP model allows distributed workers to
read older, stale versions of these values from a local cache, instead of
waiting to get them from a central storage. This significantly increases the
proportion of time workers spend computing, as opposed to waiting.
Furthermore, the SSP model ensures ML algorithm correctness by limiting the
maximum age of the stale values. We provide a proof of correctness under
SSP, as well as empirical results demonstrating that the SSP model achieves
faster algorithm convergence on several different ML problems, compared to
fully-synchronous and asynchronous schemes.
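
The heart of SSP is the staleness-bounded read; the toy sketch below (with a
hypothetical dict-backed central store standing in for the parameter server)
shows the cache logic:

    class CentralStore:
        """Stand-in for the parameter server's authoritative copy."""
        def __init__(self):
            self.values = {}
        def get(self, key):
            return self.values.get(key, 0.0)

    class SSPWorkerCache:
        """Serve a read from the local cache if the cached copy is at
        most `staleness` clocks older than the worker's own clock;
        otherwise refresh from the central store."""
        def __init__(self, store, staleness):
            self.store, self.staleness = store, staleness
            self.cache = {}                      # key -> (clock, value)

        def read(self, key, my_clock):
            if key in self.cache:
                clock, value = self.cache[key]
                if clock >= my_clock - self.staleness:
                    return value                 # stale, but within bound
            value = self.store.get(key)
            self.cache[key] = (my_clock, value)
            return value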

Abstract

Being able to effectively model latent structure in data is a key challenge in
modern AI research, particularly in Natural Language Processing (NLP) where it
is crucial to discover and leverage syntactic and semantic relationships that
may not be explicitly annotated in the training set. Unfortunately, while
incorporating latent variables to represent hidden structure can substantially
increase representation power, the key problems of model design and learning
become significantly more complicated. For example, unlike fully observed
models, latent variable models can suffer from non-identifiability, making it
difficult to distinguish the desired latent structure from the others.
Moreover, learning is usually formulated as a non-convex optimization problem,
leading to the use of local search methods that may become trapped in local
optima.

In this talk, we take a different perspective and approach two key problems in
NLP, unsupervised parsing and language models, through the lens of linear
algebra. By exploiting the connection between hidden variables and low rank
factorization, we propose a method for unsupervised constituent parsing that
has theoretical guarantees on latent structure recovery. Empirically our
approach performs favorably to the Constituent Context Model of Klein and
Manning (2002) without the need for careful initialization.

In our on-going work in language modeling, we leverage matrix factorization to
generalize existing n-gram language models to non-integer n. Our method
includes existing smoothing methods such as Absolute Discounting and Kneser
Ney as special cases, and gives us noticeable improvements in perplexity over
state-of-the-art Kneser Ney baselines.

This is joint work with Shay Cohen, Avneesh Saluja, Chris Dyer, and Eric Xing.

Abstract

In this talk I will try to convince all the fervent believers in proximal
gradient methods, ADMM, or coordinate descent that there is a better method
for optimizing general (smooth) objectives with an L1 penalty: Newton
coordinate descent. The method is the current state-of-the-art in tasks like
sparse inverse covariance estimation, and I will highlight two examples from
my group's work that use this approach to achieve substantial speedups over
existing algorithms. In particular, I will discuss how we use this algorithm
to learn sparse Gaussian conditional random field models (applied to energy
forecasting), and to design sparse optimal control laws (applied to
distributed control in a smart grid).
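
A compact sketch of the recipe for L1-regularized logistic regression (my
own minimal rendering, omitting the line search and active-set tricks that
real implementations such as GLMNET or QUIC rely on):

    import numpy as np

    def soft(a, t):
        return np.sign(a) * max(abs(a) - t, 0.0)

    def newton_cd(X, y01, lam, outer=15, inner=20):
        """Each outer step builds a quadratic model of the smooth logistic
        loss at w; coordinate descent with soft-thresholding then
        minimizes model + lam * ||w||_1.  Labels y01 are in {0, 1}."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(outer):
            prob = 1 / (1 + np.exp(-X @ w))
            grad = X.T @ (prob - y01)
            H = X.T @ (X * (prob * (1 - prob))[:, None])
            z, r = w.copy(), grad.copy()       # r tracks grad + H @ (z - w)
            for _ in range(inner):
                for j in range(d):
                    a = H[j, j] + 1e-12
                    zj = soft(a * z[j] - r[j], lam) / a
                    r += H[:, j] * (zj - z[j])
                    z[j] = zj
            w = z                              # (no line search in sketch)
        return w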

Abstract

Being part of the Machine Learning team at Amazon is one of the most
exciting engineering job opportunities in the world today. Our Machine
Learning (ML) team is comprised of technical leaders with different
backgrounds who create and develop novel and infinitely-scalable
applications that optimize Amazon's systems using cutting-edge machine
learning techniques. We develop innovative algorithms that model patterns
within data to drive automated decisions at scale in all corners of the
company, including our e-Commerce site and subsidiaries, Amazon Web
Services, Seller and Buyer Services and Digital Media including Kindle. In
this talk, I will give you an overview of some of the technical challenges
we face and I will present some examples of ML-based solutions that had a
significant impact inside the company.

Glenn Fung received a B.S. in pure mathematics from Universidad Lisandro
Alvarado in Barquisimeto, Venezuela. He then earned an M.S. in applied
mathematics from Universidad Simon Bolivar, Caracas, Venezuela, where later
he worked as an assistant professor for two years. He also earned an M.S.
degree and a Ph.D. degree in computer sciences from the University of
Wisconsin-Madison. His main interests are optimization approaches to
machine learning and data mining, with emphasis in kernel methods. In the
summer of 2003 he joined the computer aided diagnosis group at Siemens
Healthcare in Malvern, PA where he worked for 10 years developing and
applying novel machine learning techniques to solve challenging problems
that arise in the medical domain. In September of 2013 he joined Amazon,
where he has been working on applying novel machine learning techniques to
solve challenging problems that arise in e-commerce retail.

Abstract

Determinantal Point Processes (DPPs) are random point processes well-suited
for modelling repulsion. In machine learning and statistics, DPPs are a
natural model for subset selection problems where diversity is desired. For
example, they can be used to select diverse sets of sentences to form document
summaries, or to return relevant but varied text and image search results, or
to detect non-overlapping multiple object trajectories in video. Among many
remarkable properties, they offer tractable algorithms for exact inference,
including computing marginals, computing conditional probabilities, and
sampling. In our recent work, we extended these algorithms to approximately
infer non-linear DPPs defined over a large amount of data, as well as DPPs
defined on continuous spaces using low-rank approximations. We demonstrated
the advantages of our models on several machine learning and statistical
tasks: motion capture video summarization, repulsive mixture modelling and
synthesizing diverse human poses. Given time, I will also briefly touch on our
other related works such as extending DPPs into a temporal process that
sequentially selects multiple diverse subsets across time, and how we go about
learning the parameters of a DPP kernel. These are joint works with Emily Fox,
Ben Taskar and Alex Kulesza.
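
As a taste of why exact inference is tractable, the marginal computation is
a few lines of linear algebra (standard DPP facts, shown here as a small
sketch rather than anything specific to the extensions above):

    import numpy as np

    def inclusion_probabilities(L):
        """For a DPP with kernel L, the marginal kernel is
        K = L @ (L + I)^{-1}; then P(i in S) = K[i, i] and, more
        generally, P(A subset of S) = det(K restricted to A)."""
        K = L @ np.linalg.inv(L + np.eye(L.shape[0]))
        return np.diag(K)

    # five points on a line; nearby (similar) points repel each other
    x = np.linspace(0, 1, 5)
    L = np.exp(-(x[:, None] - x[None, :])**2 / 0.1)
    print(inclusion_probabilities(L))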

Abstract

We study the problem of learning in the presence of a drifting target concept.
Specifically, we provide bounds on the expected number of mistakes on a
sequence of i.i.d. points, labeled according to a target concept that can
change by a given amount on each round. Some of the results also describe an
active learning variant of this setting, and provide bounds on the number of
queries for the labels of points in the sequence sufficient to obtain the
stated bounds on the number of mistakes.

Abstract

We develop upper and lower bounds for the probability of Boolean functions by
treating multiple occurrences of variables as independent and assigning them
new individual probabilities. We call this approach dissociation and give an
exact characterization of optimal oblivious bounds, i.e. when the new
probabilities are chosen independent of the probabilities of all other
variables. Our motivation comes from the weighted model counting problem (or,
equivalently, the problem of computing the probability of a Boolean function),
which is #P-hard in general. By performing several dissociations, one can
transform a Boolean formula whose probability is difficult to compute, into
one whose probability is easy to compute, and which is guaranteed to provide
an upper or lower bound on the probability of the original formula by choosing
appropriate probabilities for the dissociated variables. Our new bounds shed
light on the connection between previous relaxation-based and model-based
approximations and unify them as concrete choices in a larger design space. In
the second part of the talk, we focus on the problem of query evaluation over
probabilistic databases. We show how our theory allows a standard relational
database management system (DBMS) to both upper and lower bound hard
probabilistic queries in guaranteed polynomial time.
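
A tiny numerical example of dissociation, verified by brute-force
enumeration (an illustration of the bounding behavior on one
OR-dissociation; the optimal oblivious choices in general are characterized
in the work above):

    import numpy as np
    from itertools import product

    def prob(f, ps):
        """Exact probability of Boolean function f over independent bits."""
        return sum(np.prod([p if b else 1 - p for b, p in zip(bits, ps)])
                   for bits in product([0, 1], repeat=len(ps)) if f(*bits))

    p, q, r = 0.5, 0.6, 0.7
    # F = (x and y) or (x and z): x occurs twice, so the two disjuncts
    # are correlated and naive independent treatment is inexact.
    exact = prob(lambda x, y, z: (x and y) or (x and z), [p, q, r])
    # Dissociate x into fresh copies x1, x2.  For this OR-dissociation,
    # keeping each copy at probability p gives an upper bound, while
    # 1 - (1 - p)**0.5 gives a lower bound.
    diss = lambda px: prob(lambda x1, x2, y, z: (x1 and y) or (x2 and z),
                           [px, px, q, r])
    print(diss(1 - (1 - p)**0.5), "<=", exact, "<=", diss(p))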

Abstract

Equipping machines with knowledge, through the construction of machine-readable
knowledge bases, presents a key asset for semantic search, question answering
and other applications. In this talk, I will present methods for knowledge
base construction. The first is a method for fact extraction. The second is a
method for generating a large collection of binary relations organized by
synonym sets and into a taxonomy based on relation subsumptions. The third is
a method for discovering and semantically typing new entities as they emerge in
dynamic Web sources such as news articles and social media.

Abstract

Over the past few years, I have been developing and deploying
interactive crowd-powered systems that solve characteristic "hard"
problems to help people get things done in their everyday lives. For
instance, VizWiz answers visual questions for blind people in less
than a minute, Legion drives robots in response to natural language
commands, Chorus holds helpful conversations with human partners, and
Scribe converts streaming speech to text in less than five seconds.

The future envisioned by my research is one in which the intelligent
systems that we have dreamed about for decades, which have inspired
generations of computer scientists from its beginning, are brought
about for the benefit of people. My work illustrates a path for
achieving this vision by leveraging the on-demand labor of people to
fill in for components that we cannot currently automate, and by
building frameworks that allow groups to do together what even expert
individuals cannot do alone. A crowd-powered world may seem counter to
the goals of computer science, but I believe that it is precisely by
creating and deploying the systems of our dreams that we will learn how
to advance computer science to create the machines that will someday
realize them.

Abstract

A fundamental challenge in robotics is the so-called "critter problem": a
robot, capable of performing actions and receiving observations, is placed in
an unknown environment. The robot has no interpretation for its actions or
observations and no knowledge of the structure of the environment. The problem
is to program the robot to learn about its observations, actions, and
environment well enough to make predictions of future observations given
sequences of actions.

This modeling problem is especially challenging given the wide variety of
sensors, actuators, and domains that are encountered in modern robotics. As a
result, researchers have largely abandoned the critter problem and have instead
focused on leveraging extensive domain knowledge to develop special-purpose
tools for special cases of the problem: e.g., system identification to learn
Kalman filters, body schema learning for manipulators, structure-from-motion in
vision, or the many methods for simultaneous localization and mapping.

In this talk I will revisit the original critter problem from a modern machine
learning perspective. I will discuss how spectral learning algorithms can unify
disparate tools and special cases encountered in different sub-areas of
robotics into a single general-purpose toolkit. Finally, I will show how
spectral methods have achieved state-of-the-art performance on real-world
robotics problems in several different domains.

Abstract

Bayes nets are not only useful for developing AI systems, but can also
help to explain how humans reason under uncertainty. I will present a
Bayes net framework for reasoning about multiple causal systems and will
apply it to two aspects of human reasoning. The first application
considers how people make inferences that depend on both causal
relationships between features (e.g. animals with wings often fly) and
similarity relationships between objects (e.g. eagles and hawks are rather
similar). The second application explores how people make counterfactual
inferences, or inferences about scenarios that differ from the real world
in some respect.

Charles Kemp is an associate professor in CMU's psychology department. His
research focuses on probabilistic models of human learning and inference.

Abstract

In this talk, I will be summarizing our work on enabling robots to produce
motion that is suitable for human-robot collaboration and co-existence. Most
motion in robotics is purely functional: industrial robots move to package
parts, vacuuming robots move to suck dust, and personal robots move to clean up
a dirty table. This type of motion is ideal when the robot is performing a task
in isolation. Collaboration, however, does not happen in isolation. In
collaboration, the robot's motion has an observer, watching and interpreting
the motion. In this work, we move beyond functional motion, and introduce the
notion of an observer into motion planning, so that robots can generate motion
that is mindful of how it will be interpreted by a human collaborator. We
formalize predictability and legibility as properties of motion that naturally
arise from the inferences that the observer makes, drawing on action
interpretation theory in psychology. We propose models for these inferences
based on the principle of rational action, and use a combination of constrained
trajectory optimization and machine learning techniques to enable robots to
plan motion for collaboration.

Bio: Anca Dragan is a PhD candidate at Carnegie Mellon's Robotics Institute,
and a member of the Personal Robotics Lab. She was born in Romania and received
her B.Sc. in Computer Science from Jacobs University Bremen in 2009. Her
research lies at the intersection of robotics, machine learning, and
human-computer interaction: she is interested in enabling robots to seamlessly
work with and around people. Anca is an Intel PhD Fellow for 2013, a Google
Anita Borg Scholar for 2012, and serves as General Chair in the Quality of Life
Technology Center's student council.

Abstract

Many models in machine learning, computer vision or speech processing have the
form of a sequence of nested, parameterized functions, such as a multilayer
neural net, an object recognition pipeline, or a "wrapper" for feature
selection. Joint estimation of the parameters of all the layers and selection
of an optimal architecture is widely considered to be a difficult numerical
nonconvex optimization problem, difficult to parallelize for execution in a
distributed computation environment, and requiring significant human expert
effort, which leads to suboptimal systems in practice. We describe a general
mathematical strategy to learn the parameters and, to some extent, the
architecture of nested systems, called the method of auxiliary coordinates
(MAC). MAC has provable convergence, is easy to implement reusing existing
algorithms for single layers, can be parallelized trivially and massively,
applies even when parameter derivatives are not available or not desirable (so
computing gradients with the chain rule does not apply), and is competitive
with state-of-the-art nonlinear optimizers even in the serial computation
setting, often providing reasonable models within a few iterations.

If time permits, I will illustrate how to use MAC to derive training algorithms
for a range of problems, such as deep nets, best-subset feature selection,
joint dictionary and classifier learning, supervised dimensionality reduction,
and others.
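
To show the shape of the method, here is a toy MAC-style loop for a
two-layer linear model, where both the weight steps and the
auxiliary-coordinate step have closed forms (linearity is my simplifying
assumption; MAC itself handles nonlinear layers and schedules the penalty
mu):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    Y = rng.normal(size=(200, 1))
    k, mu = 5, 1.0                           # bottleneck size, penalty

    # Objective: ||Y - Z W2||^2 + mu * ||Z - X W1||^2 over W1, W2, Z.
    W1 = rng.normal(size=(10, k))
    W2 = rng.normal(size=(k, 1))
    Z = X @ W1                               # initialize the coordinates
    for _ in range(50):
        # W-step: with Z fixed, each layer is an independent least squares
        W1 = np.linalg.lstsq(X, Z, rcond=None)[0]
        W2 = np.linalg.lstsq(Z, Y, rcond=None)[0]
        # Z-step: with the weights fixed, each row of Z has a closed form
        A = W2 @ W2.T + mu * np.eye(k)
        Z = np.linalg.solve(A, (Y @ W2.T + mu * X @ W1).T).T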

This is joint work with Weiran Wang.

BIOGRAPHY:
Miguel Ã. Carreira-Perpinan is an associate professor in Electrical Engineering
and Computer Science at the University of California, Merced. He received the
degree of "licenciado en informÃ¡tica" (MSc in computer science) from the
Technical University of Madrid in 1995 and a PhD in computer science from the
University of Sheffield in 2001. Prior to joining UC Merced, he did
postdoctoral work at Georgetown University (in computational neuroscience) and
the University of Toronto (in machine learning), and was an assistant professor
at the Oregon Graduate Institute (Oregon Health and Science University). He is
the recipient of an NSF CAREER award, a Google Faculty Research Award and a
best student paper award at Interspeech. He is an associate editor for the IEEE
Transactions on Pattern Analysis and Machine Intelligence and an area chair for
NIPS. His research interests lie in machine learning, in particular
unsupervised learning problems such as dimensionality reduction, clustering and
denoising, with an emphasis on optimization aspects, and with applications to
speech processing (e.g. articulatory inversion and model adaptation), computer
vision, sensor networks and other areas.

Abstract

While machine learning plays a significant role in the community of practice
known as "computational social science", one area ripe for interdisciplinary
work but fraught with its own challenges is the humanities, which encompass
such domains as English, Literary Studies, History and Archaeology (among many
others). These areas have a long history of engaging with quantitative and
computational methods (pre-dating modern notions of the "digital humanities"),
and offer a fascinating, complex proving ground for classic ML problems of
learning and inference.

In this talk, I will discuss recent and ongoing work into two probabilistic
latent variable models that fall in this domain: the first is a model for
inferring character types (or "personas") in text, where a "persona" is defined
as a set of mixtures over fine-grained latent lexical classes. These lexical
classes capture the stereotypical actions of which a character is the agent and
patient (villains "kill" and "are foiled"), as well as the attributes by which
they are described (e.g., "evil"); I present results applying this model to a
collection of movie plot summaries.

The second model addresses the problem of jointly inferring the identity and
social rank of members of an Old Assyrian trade network from the 2nd millennium
BCE, leveraging evidence in the form of local, partial ranks over observed name
mentions in cuneiform tablets to learn a global rank over (latent) individuals.

Abstract

The rapid growth in the size and scope of datasets in science and technology
has created a need for novel foundational perspectives on data analysis that
blend the statistical and computational sciences. That classical perspectives
from these fields are not adequate to address emerging problems in "Big Data"
is apparent from their sharply divergent nature at an elementary level---in
computer science, the growth of the number of data points is a source of
"complexity" that must be tamed via algorithms or hardware, whereas in
statistics, the growth of the number of data points is a source of "simplicity"
in that inferences are generally stronger and asymptotic results can be
invoked. Indeed, if data are a data analyst's principal resource, why should
more data be burdensome in some sense? Shouldn't it be possible to exploit the
increasing inferential strength of data at scale to keep computational
complexity at bay? I present three research vignettes that pursue this theme,
the first involving the deployment of resampling methods such as the bootstrap
on parallel and distributed computing platforms, the second involving
large-scale matrix completion, and the third introducing a methodology of
"algorithmic weakening," whereby hierarchies of convex relaxations are used to
control statistical risk as data accrue.

Abstract

The internet has revolutionized the way we communicate, leading to a constant
flood of informal text available in electronic format, including: email,
Twitter, SMS and the clinical text found in electronic medical records. This
presents a big opportunity for Natural Language Processing (NLP) and
Information Extraction (IE) technology to enable new large scale data-analysis
applications by extracting machine-processable information from unstructured
text at scale.

In this talk I will discuss several challenges and opportunities which arise
when applying NLP and IE to informal text, focusing specifically on Twitter,
which has recently risen to prominence, challenging the mainstream news media as
the dominant source of realtime information on current events. I will describe
several NLP tools we have adapted to handle Twitter's noisy style, and present
a system which leverages these to automatically extract a calendar of popular
events occurring in the near future.

I will further discuss fundamental challenges which arise when extracting
meaning from such massive open-domain text corpora. Several probabilistic
latent variable models will be presented, which are applied to infer the
semantics of large numbers of words and phrases and also enable a principled
and modular approach to extracting knowledge from large open-domain text
corpora.

Abstract

Large amounts of data are routinely collected in the Electronic Medical Record
yet their use in informing patient care is limited to manual assessment by the
caregivers. In this talk, we discuss probabilistic approaches motivated by
clinical practice for analyzing continuously measured physiologic data. We
develop our models on data collected from instrumenting a neonatal intensive
care unit. Based on insights derived from modeling this data, we tackle the
application of risk stratification. As part of routine care, every infant at
birth is risk stratified based on the Apgar score. We develop a cheap,
non-invasive and simple risk stratification tool using markers from physiologic
data for predicting infants at risk, dubbed by Science News the modern
electronic Apgar.

Bio:
Suchi Saria is an Assistant Professor at Johns Hopkins University within the
Schools of Engineering and Public Health. She received her PhD in machine
learning from Stanford University. Her research interests span computational
modeling of diverse, large temporal data, and in particular those from sensing
devices and the electronic health record for improving patient management. Her
work on predictive modeling for clinical data from infants has been covered by
numerous national and international press sources. She is the recipient of
multiple awards including a best student paper award, a best paper finalist, Microsoft Full
Scholarships, Rambus Corporation fellowship, an NSF Computing Innovation
fellowship, and a Gordon and Betty Moore foundation award.

Abstract

Regularized ERM is the engine driving numerous machine learning techniques such
as SVMs and the LASSO, and is to thank for many modern results on fast
optimization, model recovery, and noise tolerance. How then is it that the
popular algorithm AdaBoost not only corresponds to an unregularized ERM
problem, but moreover AdaBoost's optimization problem fails many of the basic
sanity checks --- e.g., existence of minimizers --- provided by regularization?

This talk bridges this apparent rift by presenting the two theoretical
guarantees substantiating the strong performance of AdaBoost --- convergence to
the Bayes predictor in the general case, and fast convergence in the margin
/ weak-learnable case --- in a way that emphasizes the connection to
regularization. In particular, in lieu of an algorithm-specified
regularization parameter, the data itself constrains the behavior of the
algorithm.

(Technical note: some results will focus on Lipschitz losses (e.g., the
logistic loss), but the exponential loss will be discussed throughout.)
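
The unregularized-ERM view is easy to state in code: over a fixed dictionary
of weak hypotheses, AdaBoost is greedy coordinate descent on the exponential
loss with an exact line search. A minimal sketch (my own rendering of this
standard correspondence):

    import numpy as np

    def adaboost(H, y, rounds=50):
        """H: n x d matrix of weak-hypothesis outputs in {-1,+1};
        y: labels in {-1,+1}.  Each round picks the coordinate (weak
        hypothesis) with lowest weighted error and takes the exact
        line-search step alpha on sum_i exp(-y_i (H w)_i)."""
        n, d = H.shape
        w = np.zeros(d)
        D = np.ones(n) / n                   # current example weights
        for _ in range(rounds):
            errs = np.array([D @ (H[:, j] != y) for j in range(d)])
            j = int(np.argmin(errs))         # assumes some err < 1/2
            eps = min(max(errs[j], 1e-12), 1 - 1e-12)
            alpha = 0.5 * np.log((1 - eps) / eps)
            w[j] += alpha
            D *= np.exp(-alpha * y * H[:, j])  # reweight toward mistakes
            D /= D.sum()
        return w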

Abstract

In the first part of this talk, we investigate information validation tasks
that are initiated as queries from either automated agents or humans. We
introduce OpenEval, a new online information validation technique, which uses
information on the web to automatically evaluate the truth of queries that are
stated as multi-argument predicate instances (e.g., DrugHasSideEffect(Aspirin,
GI Bleeding)). OpenEval gets a small number of instances of a predicate as
seed positive examples and automatically learns how to evaluate the truth of a
new predicate instance by querying the web and processing the retrieved
unstructured web pages. We show that OpenEval is able to respond to the queries
within a limited amount of time while also achieving high F1 score. In
addition, we show that the accuracy of responses provided by OpenEval is
increased as more time is given for evaluation.

In the second part of the talk, we explain how OpenEval can be used to provide
knowledge to anytime intelligent agents, in particular for a find-deliver task
in a real mobile robot (CoBot), for a trip planner agent, and for the
knowledge-on-demand system which is part of the NELL project.

These projects are joint work with Manuela Veloso, Manuel Blum, Thomas Kollar,
and Tom Mitchell.

Abstract

In many real-world applications, data come as discrete metric spaces sampled
around 1-dimensional linear structures that can be seen as metric trees or
graphs. In this talk we will consider the reconstruction problem of such
filamentary structures from data sampled around them. We will present two
reconstruction methods. The first one comes with topological and metric
guarantees but relies on parameter choices that might be tricky. The second one
comes with metric guarantees but is much less sensitive to parameter choices
and can be very efficiently implemented.

Abstract

Models of prediction from text have seen an increase in interest with the rise
of Social Media. Applications vary from predicting political sentiment,
financial indicators or flu rates. Most of the approaches treat the problem as
linear regression, learning to relate word frequencies to the response
variable.

In this talk I will present a bilinear approach to text regression with a
sparse regulariser. This model learns a sparse representation of two distinct
sets of variables, in our case the words and the users. Our method is inspired
by the fact that most Social Media users' posts are not indicative of our
prediction task. Further, we explore a multi-task learning approach in order to
exploit the relationship between the output variables.
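
A minimal sketch of the alternating view of such a bilinear model (my own
instantiation: with either factor fixed, the problem in the other factor is
an ordinary lasso regression):

    import numpy as np
    from sklearn.linear_model import Lasso

    def bilinear_text_regression(X, y, n_iter=10, alpha=0.01):
        """X: (instances, users, words) frequency tensor; model
        y_i ~ u^T X_i w with sparse user weights u and word weights w."""
        n, n_users, n_words = X.shape
        u = np.ones(n_users) / n_users
        for _ in range(n_iter):
            Fw = np.einsum('u,iuw->iw', u, X)  # features for the word step
            w = Lasso(alpha=alpha).fit(Fw, y).coef_
            Fu = np.einsum('iuw,w->iu', X, w)  # features for the user step
            u = Lasso(alpha=alpha).fit(Fu, y).coef_
        return u, w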

Abstract

Uniformity is a parsimonious description of randomness. Unfortunately, a
probability measure can come nowhere close to approximating uniformity on
unbounded sets. This talk takes up that problem in the setting of the
integers. By replacing the countable additivity axiom of probability with
finite additivity, one can assign to each integer the same probability.
Because assigning to each integer the same probability does not uniquely define
a uniform distribution, various additional properties have been proposed. In
this talk, three such properties will be reviewed, and a new one introduced.
It turns out these notions can be ordered so that one implies the next. While
stronger notions may be required to attain uniqueness for certain applications,
the weaker notions have the virtue of being more parsimonious and
interpretable. Moreover, stronger properties may be decomposed into weaker
ones, and the weaker notions may have special number theoretic properties. For
instance, the family of uniform distributions introduced in this talk may
assign positive probability to the set of prime numbers.
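
One canonical property of this kind, stated here only as an illustration
(the abstract does not spell out which notions the talk reviews), is
agreement with natural density wherever that limit exists:

    P(A) = \lim_{n \to \infty} \frac{|A \cap \{1, \dots, n\}|}{n},

a finitely additive assignment under which every finite set, and hence every
single integer, receives probability zero.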

Abstract

In some reinforcement learning problems an agent may be provided with a set of
input policies, perhaps learned from prior experience or provided by advisors.
We present a reinforcement learning with policy advice (RLPA) algorithm which
leverages this input set and learns to use the best policy in the set for the
reinforcement learning task at hand. We prove that RLPA has a sublinear regret
of $\widetilde O(\sqrt{T})$ relative to the best input policy, and that both
this regret and its computational complexity are independent of the size of the
state and action space. Our empirical simulations support our theoretical
analysis. This suggests RLPA may offer significant advantages in large domains
where some prior good policies are provided.

Abstract

Nowadays, the scale of graph data that needs to be processed is massive. For
example, in the context of online services, the Web graph amounts to at least
one trillion links, and Facebook recently reported more than 1 billion users
and 140 billion friend connections.

In the first part of the talk, we will discuss the balanced graph partitioning
problem in the context of big dynamic graph data, a key problem for enabling
efficient solutions to a wide range of computational tasks. There are two
widely used families of heuristics for graph partitioning in the streaming
setting: place the newly arrived vertex in the cluster with the largest number
of neighbors, or in the cluster with the fewest non-neighbors. We will
present a framework that unifies these two seemingly orthogonal approaches and
allows us to interpolate between them, obtaining superior performance.
Surprisingly, this performance is even comparable to non-streaming algorithms.
For instance, for the Twitter graph with more than 1.4 billion edges, our
method partitions the graph in about 40 minutes, achieving a balanced partition
that cuts as few as 6.8% of edges, whereas METIS took more than 8.5 hours to
produce a balanced partition that cuts 11.98% of edges. Finally, we provide an
$O(\log k/k)$-approximation algorithm, and we show experimentally using Apache
Giraph that the partitions we obtain result in significant gains in terms of
communication cost and runtime.
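
As a rough illustration of the interpolation idea (a sketch, not the
exact objective from the talk), the Python snippet below scores each
candidate cluster as (neighbors inside) minus gamma times (cluster
size): since the non-neighbors in a cluster C number |C| minus the
neighbors inside, gamma = 0 recovers the most-neighbors rule and
gamma = 1 the fewest-non-neighbors rule. All names and the capacity
rule are illustrative.

    def place_vertex(v, neighbors, clusters, capacity, gamma=0.5):
        """Assign arriving vertex v to one of the clusters (sets of ids).

        neighbors: the set of already-placed neighbors of v.
        gamma interpolates between the two streaming heuristics.
        """
        scores = []
        for c in clusters:
            if len(c) >= capacity:
                scores.append(float("-inf"))   # keep the partition balanced
            else:
                inside = len(neighbors & c)
                scores.append(inside - gamma * len(c))
        best = max(range(len(clusters)), key=scores.__getitem__)
        clusters[best].add(v)
        return best

Streaming the vertices once and calling this for each arrival gives a
one-pass partitioner; the talk's framework additionally tunes the
trade-off and proves the approximation guarantee.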

In the second half of the talk we will discuss the problem of finding dense
subgraphs in large-scale graphs. We introduce a general framework which
subsumes popular density functions and provide theoretical insights into it.
We then present a special instance of our framework: a density function
which favors small, dense, small-diameter subgraphs. We provide an
additive approximation algorithm, and a multiplicative approximation algorithm
with tight guarantees. We also develop applications of our method in data
mining and bioinformatic tasks, such as forming a successful team of domain
experts and finding highly correlated genes from a microarray dataset.

Abstract

Much work in optimal control and inverse control has assumed that the
controller has perfect knowledge of plant dynamics. However, if the controller
is a human or animal subject, the subject's internal dynamics model may differ
from the true plant dynamics. Here, we consider the problem of learning the
subject's internal model from demonstrations of control and knowledge of task
goals. Due to sensory feedback delay, the subject uses an internal model to
generate an internal prediction of the current plant state, which may differ
from the actual plant state. We develop a probabilistic framework and exact EM
algorithm to jointly estimate the internal model, internal state trajectories,
and feedback delay. We applied this framework to demonstrations by a nonhuman
primate of brain-machine interface (BMI) control. We discovered that the
subject's internal model deviated from the true BMI plant dynamics and provided
a significantly better explanation of the recorded neural control signals than
did the true plant dynamics.

Abstract

Story comprehension is a rich and rapid phenomenon, requiring multiple
simultaneous processes (e.g. letter recognition, word understanding, sentence
parsing...). Our goal is to study how the brain processes this complex
information, by modeling the fMRI brain activity during story reading at a
close to normal speed. This is a challenging goal, owing in part to the coarse
time-resolution of fMRI and the lack of a comprehensive model of word meaning
composition. Classically, fMRI has been used to localize brain areas that
handle specific elements of text processing, e.g. which areas are involved in
syntactic processing, but not to model how the brain represents different
instances of these elements, e.g. how those areas represent different
syntactic structures.

We created a generative model that predicts the fMRI activity recorded while
subjects read a complex story in which the words are presented serially,
for 0.5 seconds each. We then performed an exploratory analysis in which we
tested several types of story features (e.g. word length, syntax, semantics,
story characters) to search for a good basis of features for story
comprehension. We found different patterns of representation in the brain for
different types of features. These patterns align with the predictions from
the field. To test the validity of our model, we performed a classification
task that decodes what passage of the story a time segment of brain activity
corresponds to. The classification accuracy was significantly higher than
chance (p < 10^-6). Our approach has the advantage of being flexible: any feature
of language can be added to the model and tested, and features can range from
simple perceptual features, to compositional semantics, to higher order
reasoning about narrative structure and story comprehension.

Abstract

Stochastic convex optimization (SCO) under the first order oracle model deals
with approximately optimizing a convex function over a convex set, given access
to noisy function and gradient values at any point, using as few queries as
possible. Active learning of one-dimensional threshold (ALT) classifiers is a
classification problem that deals with approximately locating a "threshold" on
a subinterval of the real line (which is a point to the left of which labels
are more likely to be negative, and to the right of which labels are more
likely to be positive), given access to an oracle that returns these noisy
labels, using as few queries as possible.

Exploiting the sequential nature of both problems, we establish a concrete
similarity between the "Tsybakov Noise Condition" from ALT theory and
"Strong/Uniform Convexity Condition" from SCO theory, and show how
information-theoretic lower-bound techniques from ALT can be used to get very
similar lower bound rates in SCO (and show these are tight with matching upper
bounds). Time permitting, I will also show a kind of algorithmic reduction from
SCO to ALT for strongly/uniformly convex functions, as well as a new adaptive
ALT algorithm inspired by a recent adaptive SCO algorithm.

Abstract

Estimation methods in high-dimensional linear models, as studied in compressed
sensing and sparse linear regression, basically rely on minimizing a squared
error subject to a sparsity constraint. In the first part of this talk we
examine the problem of sparsity-constrained minimization of non-quadratic
objective functions that can arise in models with non-linearities. We propose a
greedy algorithm for these problems and prove accuracy guarantees for
objectives with a "stable restricted Hessian" or "stable restricted
linearization".

In the second part of the talk, we inspect the non-convex $\ell_p$-constrained
least squares. In particular, we obtain accuracy guarantees for the projected
gradient descent method under the restricted isometry property. We further
discuss the implication of the result and derive necessary conditions for
projection onto an $\ell_p$ ball.
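
For intuition, here is a minimal sketch of projected gradient descent
for sparsity-constrained least squares, with hard thresholding
(projection onto k-sparse vectors, the p -> 0 limit) standing in for
the $\ell_p$-ball projection analyzed in the talk; the step size and
iteration count are illustrative.

    import numpy as np

    def iht(A, y, k, iters=200):
        """Projected gradient descent for min ||Ax - y||^2 s.t. x k-sparse:
        a gradient step followed by keeping only the k largest entries."""
        m, n = A.shape
        step = 1.0 / np.linalg.norm(A, 2) ** 2    # conservative step size
        x = np.zeros(n)
        for _ in range(iters):
            x = x + step * A.T @ (y - A @ x)      # gradient step
            small = np.argsort(np.abs(x))[:-k]    # all but the k largest
            x[small] = 0.0                        # projection step
        return x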

Abstract

A common classifier for unlabeled nodes on undirected graphs uses label
propagation from the labeled nodes, a.k.a. the harmonic predictor on Gaussian
random fields (GRFs). For active learning on GRFs, the commonly used
V-optimality criterion queries nodes that reduce the L2 (regression) loss. It
has a submodularity property: greedy application of it produces a (1 - 1/e)
approximation to the global optimum. However, L2 loss may not characterize the
true nature of 0/1 loss in classification problems and thus may not be the best
choice for active learning.

We propose a new criterion we call Sigma-optimality, which queries the node that
minimizes the sum of the elements in the predictive covariance. Theoretically,
we extend submodularity guarantees from V-optimality to Sigma-optimality using
properties specific to GRFs. The proofs are interesting because we further show
that GRFs have the suppressor-free condition in addition to the conditional
independence inherited from Markov random fields.

Sigma-optimality directly optimizes the loss of the surveying problem, which is to
determine the proportion of nodes belonging to one class. We test Sigma-optimality
on several real-world graphs and show that in addition to the surveying
problem, it also outperforms V-optimality and expected error reduction on the
classification problem.
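
A minimal sketch of one greedy Sigma-optimality step, assuming the GRF
predictive covariance is the inverse of the regularized Laplacian
restricted to the unlabeled nodes; a rank-one update argument shows
that querying node i removes (sum_j C_ij)^2 / C_ii from the total sum
of covariance entries, so the greedy step maximizes that ratio. Names
and the regularizer are illustrative.

    import numpy as np

    def sigma_optimal_query(L, labeled, reg=1e-2):
        """Pick the unlabeled node whose query most reduces the sum of
        all entries of the predictive covariance (Sigma-optimality)."""
        n = L.shape[0]
        unlabeled = [i for i in range(n) if i not in labeled]
        sub = L[np.ix_(unlabeled, unlabeled)] + reg * np.eye(len(unlabeled))
        C = np.linalg.inv(sub)                 # predictive covariance
        gains = C.sum(axis=1) ** 2 / np.diag(C)
        return unlabeled[int(np.argmax(gains))]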

Abstract

Convex optimization is a key tool in computer science, with applications
ranging from machine learning to operations research. Due to the fast growth
of data sizes, the development of faster algorithms is becoming a more pressing
question. This talk aims to discuss several emerging approaches for faster and
more accurate optimization algorithms using techniques from combinatorial
algorithms, numerical analysis and spectral graph theory.

For quadratic (L_2) minimization, we will present faster algorithms when
the underlying matrix has highly uneven dimensions, or when the problem has
graph-like structure. Key to these algorithms are methods for sampling the rows
of a matrix that preserve the structure of its outer product. We will also
discuss the close connections between L_1 regression and quadratic
minimization, and describe some ongoing work in image processing using these
ideas.
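
One standard instantiation of row sampling that preserves the outer
product is leverage-score sampling; the sketch below computes exact
leverage scores via a QR factorization (itself as expensive as the
original problem; the point of the fast algorithms is to approximate
these scores cheaply). A minimal illustration, not the talk's
algorithm.

    import numpy as np

    def leverage_row_sample(A, s, rng=np.random.default_rng(0)):
        """Sample s rows with probability proportional to their leverage
        scores and rescale, so the sample's outer product matches
        A^T A in expectation."""
        Q, _ = np.linalg.qr(A)                  # orthonormal basis of range(A)
        lev = (Q ** 2).sum(axis=1)              # row leverage scores
        p = lev / lev.sum()
        idx = rng.choice(A.shape[0], size=s, p=p)
        return A[idx] / np.sqrt(s * p[idx])[:, None]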

Abstract

Nonparametric mixture models based on the Dirichlet process are an
elegant alternative to finite models when the number of underlying
components is unknown, but inference in such models can be
slow. Existing attempts to parallelize inference in such models have
relied on introducing approximations, which can lead to inaccuracies
in the posterior estimate. In this talk, I will construct auxiliary
variable representations for the Dirichlet process and the
hierarchical Dirichlet process that facilitate the development of
distributed Markov chain Monte Carlo schemes that use the correct
equilibrium distribution. Experimental analyses show that this
approach allows scalable inference without the deterioration in
estimate quality that accompanies existing methods.

Abstract

As large scale methods of measuring gene expression via images have
been developed, the question arises: how can we predict gene
interaction networks from images? The four major machine learning
challenges are (a) dealing with multiple images per data source and
(b) multiple data sources, (c) in an unsupervised manner, while
(d) analyzing global conditional independencies instead of pairwise
interactions.

In this talk, I will describe a two-step solution to this
problem. First, we present an algorithm, Gin-IM, for learning
interaction networks from images in a single data source. Gin-IM
combines multi-instance kernels with recent work in learning sparse
undirected graphical models to predict interactions between genes.

Next, we propose NP-MuScL (nonparanormal multi source learning) to
estimate a gene interaction network that is consistent with multiple
sources of data, having the same underlying relationships between the
nodes. NP-MuScL uses the semiparametric Gaussian copula to model the
distribution of the different data sources, with the different copulas
sharing the same covariance matrix.

We apply our algorithms to Drosophila embryonic ISH images from the
Berkeley Drosophila Genome Project. Data from different time steps in
Drosophila embryonic development are treated as separate data
sources. With spatial gene interactions predicted via Gin-IM, and
temporal predictions combined via NP-MuScL, we can finally predict
spatiotemporal gene networks from these images.

Parts of this work were presented at ECCV 2012, and will be presented
at RECOMB 2013. This is joint work with my advisor Eric P. Xing. No
knowledge of biology is needed to follow this talk.

Abstract

In this talk, I will discuss several interesting data mining / machine
learning challenges for making personalized recommendations of TV
shows and movies to each user at Netflix, and I will take a deeper
dive into two of them. While the presented experiments are based on
Netflix data, the challenges and approaches apply also to other
domains besides movies.

The objective of our recommender system is to rank movies according to
each user's preferences. Users' preferences can be estimated from
their feedback data, such as plays or ratings of movies, among others.
While the extreme data sparsity is a well-known problem (each user
interacts with only a small fraction of all movies), I will in
particular discuss the following two aspects: (1) the feedback data
are missing not at random (MNAR), and (2) the distribution of the data
is very skewed.

The MNAR nature of the feedback data originates from the fact that
users can choose which movies to play or to rate. As a result, the very
choice of which movies a user interacted with carries useful information.
From a statistical perspective, however, MNAR data pose interesting
challenges--not only for training a recommender system, but also for
designing meaningful test/validation procedures on such data.

The second aspect addressed in detail in this talk is the skewed
distribution of these data: while there are a few popular movies that
receive most of the attention, the majority of movies are in the long
tail of the popularity distribution. Recommendations from the long
tail are generally considered to be particularly valuable for the
user, as they are otherwise difficult to discover. On the other hand,
recommendation accuracy tends to decrease towards the long tail due to
lack of data.

I will discuss machine learning approaches to tackle both challenges,
and present empirical evidence that large improvements can be achieved
by tackling these key properties of the feedback data.

Biography

Harald Steck is a data scientist at Netflix, where he develops
personalization algorithms and recommender systems. He has over ten
years of experience in machine learning, in particular in graphical
models and more recently in recommender systems. He has conducted
research at various industrial and academic organizations, including
Bell Labs, ETH Zurich in Switzerland, MIT AI Lab, as well as Technical
University of Munich in Germany, where he obtained a PhD degree in
Computer Science in 2001.

Abstract

User feedback has become an invaluable source of training data for
optimizing recommender systems in a rapidly expanding range of
domains, most notably content recommendation (e.g., news, movies,
ads). When designing recommender systems that adapt to user feedback,
two important challenges arise. First, the system should recommend
optimally diversified content that maximizes coverage of the
information the user finds interesting (to maximize positive feedback).
Second, the system should make appropriate exploratory recommendations
in order to learn a reliable model from feedback.

In this talk, I will describe the Linear Submodular Bandits Problem,
which is a framework for jointly modeling the utility of a set of
recommendations (so as to encourage diversity), as well as the
exploration/exploitation trade-off that arises when learning from user
feedback. In particular, the utility of a set of recommendations is
modeled as a parameterized submodular function, which naturally
encodes a notion of diminishing returns that encourages diversity. For
this setting, I will also present an online learning algorithm that
can efficiently converge to a near-optimal model.
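
To make concrete how diminishing returns encourages diversity, here is
a minimal sketch of greedy set construction under a linear submodular
utility with a probabilistic-coverage basis and an optimistic
(UCB-style) weight term; the specific coverage function, bonus form,
and names are illustrative stand-ins, not the talk's exact algorithm.

    import numpy as np

    def greedy_diverse_set(X, w_hat, w_bonus, set_size):
        """X[i, j] in [0, 1]: how well item i covers topic j.  A set S
        covers topic j with phi_j(S) = 1 - prod_{i in S}(1 - X[i, j]),
        so marginal gains shrink as coverage grows (submodularity)."""
        n, d = X.shape
        covered = np.zeros(d)
        chosen = []
        for _ in range(set_size):
            gains = (1.0 - covered) * X            # per-topic marginal coverage
            scores = gains @ (w_hat + w_bonus)     # optimistic utility estimate
            scores[chosen] = -np.inf               # no repeated items
            i = int(np.argmax(scores))
            chosen.append(i)
            covered = covered + (1.0 - covered) * X[i]
        return chosen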

As with any bandit learning problem, the inefficiency (or regret) of a
recommendation algorithm is due primarily to the cost of exploration
(i.e., making exploratory recommendations due to not knowing the
user's preferences a priori). One way to reduce the cost of
exploration is by leveraging prior knowledge. Intuitively, most users
bear some similarity to "stereotypical users" that can be represented
in a low-dimensional, or coarse, feature space. I will show how to
construct a coarse-to-fine hierarchy of feature spaces from the
preference profiles of existing users, and also how to conduct bandit
learning using this feature hierarchy to drastically reduce the amount
of exploration required.

I will present a live user study, where these approaches were applied
to the setting of personalized news recommendation. Our results
demonstrate improved performance against approaches that do not
directly model diversification, do not employ exploration, or do not
incorporate prior knowledge to reduce the amount of exploration required.

Abstract

We discuss a general notion of "sparsity structure" and present a
unified framework for the recovery of "sparse-structured" signals from
their linear image of reduced dimension, possibly corrupted with noise.
This unified treatment covers usual sparse and block-sparse recoveries
via commonly used $\ell_1$ regularization as well as low-rank matrix
reconstruction via nuclear norm minimization. We present null-space
type sufficient conditions for the recovery to be exact in the
noiseless case, derive error bounds for imperfect recovery (nearly
sparse signal, presence of observation noise) and relate them to other
well-known conditions (Restricted Isometry Property, Mutual
Incoherence) from the literature. Our emphasis is on efficiently
verifiable sufficient conditions on the problem parameters (sensing
matrix and sparsity structure) for the validity of the associated
nullspace properties. While the efficient verifiability of a condition
is by no means necessary for the condition to be meaningful and
useful, we believe that verifiability has its value and is worthy of
being investigated. In particular, verifiability allows us to design
new recovery routines with explicit confidence bounds for the recovery
error, which can then be optimized over the method parameters leading
to recovery procedures with improved statistical properties.

This is joint work with Anatoli Juditsky, Arkadi Nemirovski and Boris
Polyak.

Abstract

Given a large repository of geotagged imagery, we seek to automatically
find visual elements, e.g. windows, balconies, and street signs, that are
most distinctive for a certain geo-spatial area, for example the city of
Paris. This is a tremendously difficult task as the visual features
distinguishing architectural elements of different places can be very
subtle. In addition, we face a hard search problem: given all possible
patches in all images, which of them are both frequently occurring and
geographically informative? To address these issues, we propose to use a
discriminative clustering approach able to take into account the weak
geographic supervision. We show that geographically representative image
elements can be discovered automatically from Google Street View imagery in
a discriminative manner. We demonstrate that these elements are visually
interpretable and perceptually geo-informative. The discovered visual
elements can also support a variety of computational geography tasks, such
as mapping architectural correspondences and influences within and across
cities, finding representative elements at different geo-spatial scales,
and geographically-informed image retrieval.

Abstract

We consider the high-dimensional heteroscedastic regression model,
where the mean and the log variance are modeled as a linear
combination of input variables. Existing literature on
high-dimensional linear regression models has largely ignored
non-constant error variances, even though they commonly occur in a
variety of applications ranging from biostatistics to finance. In
this paper we study a class of non-convex penalized pseudolikelihood
estimators for both the mean and variance parameters. We show that
the Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer
(HIPPO) achieves the oracle property, that is, we prove that the
rates of convergence are the same as if the true model were known. We
demonstrate the numerical properties of the procedure in a simulation
study and on real-world data.

Abstract

Semantic parsing is the problem of automatically converting natural
language into a computer-understandable formal representation
(essentially, a statement in a programming language). The promise of
semantic parsing is its generality: many language understanding tasks
can be posed as semantic parsing problems, including question
answering and information extraction. However, current approaches to
semantic parsing suffer from two serious defects: (1) semantic parsers
are trained on manually annotated sentences, which are impractical to
obtain when working at web-scale and (2) semantic parsers rely on
manually constructed knowledge bases, which are challenging to
construct and typically incomplete or nonexistent. The second problem
is especially problematic in grounded domains, such as robotics, where
language may refer to objects in the real world.

This talk presents two weakly supervised training algorithms for
semantic parsers that overcome these two limitations. The first
algorithm eliminates the need for annotated sentences; we demonstrate
this algorithm by training an accurate semantic parser for Freebase
that has the most expressive knowledge representation of any published
semantic parser. The second algorithm eliminates the need for a
prespecified knowledge representation; we demonstrate this algorithm
on a natural language grounding task, identifying the objects in an
image referred to by natural language expressions such as "the mug to
the left of the monitor." In both cases, reducing the supervision
requirements of semantic parsing allows us to tackle problems which
are infeasible in the traditional, supervised paradigm.

Abstract

One way to build natural language processing tools is to
ask human experts to annotate examples of the desired output for
real-world text inputs, then apply supervised learning. Another is to
start with text and a model family and apply unsupervised learning.
In either case, how do we know whether the human or machine
annotations are "correct"? In this talk, I'll give a little bit of
background about the research area, and I'll discuss some of the
weaknesses of our current evaluation methodology. I'll present a new
abstract framework for evaluation. The central idea is to make
explicit certain adversarial roles among researchers, so that the
different roles in an evaluation are more clearly defined and
participants in all roles are offered ways to make measurable
contributions to the larger goal. This framework can be instantiated
in many ways, simulating some familiar intrinsic and extrinsic
evaluations as well as some new evaluations. This talk is entirely
based on preliminary ideas (no theoretical or experimental results)
and is intended to spark discussion.

Abstract

Many important scientific and data-driven problems involve quantities
which vary over both space and time. Examples include
functional magnetic resonance imaging (fMRI), climate data, or
experimental studies in physics and chemistry. Principal goals of many
methods in statistics, machine learning, and signal processing are to
use this data and
i) extract informative structures and remove noisy, uninformative parts;
ii) understand and reconstruct underlying spatio-temporal dynamics that
govern these systems; and
iii) forecast the data, i.e. describe the system in the future.

In this talk I present generally applicable statistical
methods that address all three problems in a unifying approach. I
introduce two new techniques for optimal nonparametric forecasting of
spatio-temporal data: hard and mixed LICORS (Light Cone Reconstruction
of States). Hard LICORS is a consistent estimator of the predictive
state space of continuous-valued data. Mixed LICORS builds on a new,
fully probabilistic model of light cones and predictive state
mappings, and is an EM-like version of hard LICORS. These
estimators can then be used to estimate local statistical complexity
(LSC), a fully automatic technique for pattern discovery in dynamical
systems. Simulations and applications to fMRI data demonstrate
that the proposed methods work well and
give useful results in very general scientific settings.

Abstract

Timestamped data present a challenge to machine learning for
predictions in the future: ignoring timestamps and assuming data are
i.i.d. is scalable but risks distracting a model with irrelevant
``ancient history,'' while using only the most recent portion of the
data risks overfitting to current trends and missing important
time-insensitive effects. We seek a general approach to learning model
parameters that accounts for how different effects change over time.

We construct two novel prior distributions that allow parameters of
probabilistic models to vary over time. Our priors encourage
correlation between parameters at successive timesteps. We show how to
do learning and inference under these priors. We test the approaches
on several real-world datasets, demonstrating significant improvements
over time-series-ignorant priors. Moreover, inspecting feature
coefficients in the model allows us to identify trends and changes over
time.
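
One concrete instance of such a prior, offered as a sketch rather than
the talk's construction: a Gaussian random walk on regression weights,
whose MAP estimate penalizes change between successive timesteps and
can be found by one linear solve. The hyperparameters lam and tau are
illustrative.

    import numpy as np

    def fit_time_varying(Xs, ys, lam=1.0, tau=10.0):
        """MAP weights under a Gaussian random-walk prior across timesteps:
        min sum_t ||X_t w_t - y_t||^2 + lam ||w_t||^2
                  + tau ||w_t - w_{t-1}||^2,
        solved as one block-tridiagonal linear system."""
        T, d = len(Xs), Xs[0].shape[1]
        A = np.zeros((T * d, T * d))
        b = np.zeros(T * d)
        I = np.eye(d)
        for t in range(T):
            s = slice(t * d, (t + 1) * d)
            A[s, s] += Xs[t].T @ Xs[t] + lam * I
            b[s] = Xs[t].T @ ys[t]
            if t > 0:                        # couple successive timesteps
                p = slice((t - 1) * d, t * d)
                A[s, s] += tau * I
                A[p, p] += tau * I
                A[s, p] -= tau * I
                A[p, s] -= tau * I
        return np.linalg.solve(A, b).reshape(T, d)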

Abstract

Current systems for graph computation require a distributed computing
cluster to handle very large real-world problems, such as analysis on
social networks or the web graph. While distributed computation
resources have become more accessible, developing distributed graph
algorithms still remains challenging, especially to non-experts.

In this work, we present GraphChi, a disk-based system for computing
efficiently on graphs with billions of edges. By using a well-known
method to break large graphs into small parts, and a novel parallel
sliding windows method, GraphChi is able to execute several advanced
data mining, graph mining, and machine learning algorithms on very
large graphs, using just a single consumer-level computer. We further
extend GraphChi to support graphs that evolve over time, and
demonstrate that on a single computer, GraphChi can process over a
hundred thousand graph updates per second, while simultaneously
performing computation. We show, by experiments and theoretical
analysis, that GraphChi performs well on SSDs and, surprisingly, also
on rotational hard drives.

By repeating experiments reported for existing distributed systems,
we show that with only a fraction of the resources, GraphChi can solve
the same problems in very reasonable time. Our work makes large-scale
graph computation available to anyone with a modern PC.

This work has been accepted to OSDI '12, and will be presented in
Hollywood on October 8, 2012.

Abstract

We analyze the problem of partitioning a 0-1 array or bipartite graph
into subgroups (also known as co-clustering), under the relatively mild
assumption that the data is generated by a general nonparametric
process. We show that detection of co-clusters in the data implies
with high probability the existence of co-clusters of similar
proportion and connectivity in the generative process. Our main
application is the analysis of a crude model for networks -- the
stochastic co-blockmodel -- when the data is not assumed to be
generated (even approximately) by a blockmodel, but rather by an
unknown exchangeable process. Our result suggests that the stochastic
co-blockmodel and other community detection algorithms may be robust
to model misspecification.

Abstract

We consider two nonparametric hypothesis testing problems: (1) Given
samples from distributions p and q, a two-sample test determines
whether to reject the null hypothesis p=q; and (2) Given a joint
distribution p_xy over random variables x and y, an independence test
determines whether to reject the null hypothesis of independence, p_xy
= p_x p_y. In testing whether two distributions are identical, or
whether two random variables are independent, we require a test
statistic which is a measure of distance between probability
distributions. One choice of test statistic is the maximum mean
discrepancy (MMD), a distance between embeddings of the probability
distributions in a reproducing kernel Hilbert space. The kernel used
in obtaining these embeddings is critical in ensuring the test has
high power, and correctly distinguishes unlike distributions with high
probability.

In this talk, I will provide a tutorial overview of kernel distances
on probabilities, and show how these may be used in two-sample and
independence testing. I will then describe a strategy for optimal
kernel choice, and compare it with earlier heuristics (including other
multiple kernel learning approaches).
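
As background for the test statistic, here is a minimal sketch of a
two-sample test built on the (biased) empirical MMD with a Gaussian
kernel and a permutation null; picking sigma is left to the caller,
which is exactly the kernel-choice problem the talk addresses.

    import numpy as np

    def gaussian_gram(A, B, sigma):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def mmd2(X, Y, sigma):
        """Biased estimate of the squared MMD between samples X and Y."""
        return (gaussian_gram(X, X, sigma).mean()
                + gaussian_gram(Y, Y, sigma).mean()
                - 2 * gaussian_gram(X, Y, sigma).mean())

    def mmd_test(X, Y, sigma, n_perm=200, seed=0):
        """Permutation p-value for the null hypothesis p = q."""
        rng = np.random.default_rng(seed)
        stat = mmd2(X, Y, sigma)
        Z, n = np.vstack([X, Y]), len(X)
        null = [mmd2(Z[i[:n]], Z[i[n:]], sigma)
                for i in (rng.permutation(len(Z)) for _ in range(n_perm))]
        return stat, float(np.mean(np.array(null) >= stat))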

Abstract

We study the challenging problem of learning decision lists
attribute-efficiently, giving both positive and negative results.

Our main positive result is a new tradeoff between the running time
and mistake bound for learning length-k decision lists over n Boolean
variables. When the allowed running time is relatively high, our new
mistake bound improves significantly on the mistake bound of the best
previous algorithm of Klivans and Servedio.

Our main negative result is a new lower bound on the weight of any
degree-d polynomial threshold function (PTF) that computes a particular
decision list over k variables. This lower bound establishes strong
limitations on the effectiveness of the Klivans and Servedio approach
and suggests that it may be difficult to improve on our positive
result. The main tool used in our lower bound is a new variant of
Markov's classical inequality which may be of independent interest; it
provides a bound on the derivative of a univariate polynomial in terms
of both its degree and the size of its coefficients.

Abstract

Deep learning and unsupervised feature learning offer the potential to
transform many domains such as vision, speech, and NLP. However,
these methods have been fundamentally limited by our computational
abilities, and typically applied to small-sized problems. In this
talk, I describe the key ideas that enabled scaling deep learning
algorithms to train a very large model on a cluster of 16,000 CPU
cores (2000 machines). This network has 1.15 billion parameters,
which is more than 100x larger than the next largest network reported
in the literature.

Such a network, trained at this huge scale, is able to learn
abstract concepts in a much more general manner than previously
demonstrated. Specifically, we find that by training on 10 million
unlabeled images, the network produces features that are very
selective for high-level concepts such as human faces and cats. Using
these features, we also obtain significant leaps in recognition
performance on several large-scale computer vision tasks.

Abstract

This talk motivates the study of degrees of freedom in
statistical machine learning problems. Essentially, the degrees of
freedom of a fitting procedure is its effective number of parameters.
Though this is a vague concept, it has a precise definition for a wide
class of problems. I will discuss what is known for some key problems
in this class, and how some other important problems are not
particularly well-understood.
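
One member of the "wide class" with a precise answer: for a linear
smoother y_hat = H y, the degrees of freedom is tr(H). The sketch below
evaluates this for ridge regression via singular values; lam = 0
recovers the column count, i.e. ordinary least squares.

    import numpy as np

    def ridge_df(X, lam):
        """Degrees of freedom of ridge regression: tr(H) for the hat
        matrix H = X (X^T X + lam I)^{-1} X^T."""
        s = np.linalg.svd(X, compute_uv=False)
        return float((s ** 2 / (s ** 2 + lam)).sum())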

Abstract

We study the problem of active learning in a stream-based setting,
allowing the distribution of the examples to change over time. We prove
upper bounds on the number of prediction mistakes and number of label
requests for established disagreement-based active learning algorithms,
both in the realizable case and under Tsybakov noise. We further prove
minimax lower bounds for this problem.

It is often computationally hard to run these methods with the 0-1
loss. Passive learning often resolves this problem by replacing the 0-1
loss with a convex relaxation, called a surrogate loss. We examine the
extent to which this trick can also be useful in active learning. We
start with a negative result for active learning with convex losses,
where we prove that even under bounded noise constraints, the minimax
rates for optimizing a convex loss with proper active learning are
often no better than for passive learning. Then we explore a strategy
that makes use of a given surrogate loss function in a different way,
so that although the algorithm does not necessarily optimize the
surrogate loss, it does optimize the 0-1 loss under certain conditions.
We further present label complexity results for this method, showing
that it sometimes improves over the analogous passive learning method.

Abstract

Finding the k nearest neighbors (k-NNs) of a given vertex
in a graph has many applications such as link prediction,
recommendation systems and keyword search. One robust
measure of vertex proximity in graphs is the Personalized
PageRank (PPR) score, based on random walks with restarts.
Since PPR scores have long-range correlations, computing
them accurately and efficiently is challenging when the
graph is too large to fit in memory, especially when it also
changes over time. In this work, we propose ClusterRank,
an efficient algorithm to answer PPR-based k-NN queries in
large time-evolving graphs. ClusterRank represents a given
graph as a collection of dense vertex-clusters with their
interconnections. Each vertex-cluster maintains certain
information related to internal random walks and updates
this information as the graph changes. At query time,
ClusterRank combines this information from a small set
of relevant clusters and computes PPR scores efficiently.
While ClusterRank can perform exact computations, we
also propose several heuristics in order to reduce its query
response time while sacrificing only a little accuracy.
We validate the effectiveness of our method on several
synthetic and real-world graphs from diverse domains.
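
For reference, the quantity ClusterRank approximates can be computed
exactly on a small in-memory graph by power iteration; the sketch below
does that global computation, which is precisely what ClusterRank's
cluster decomposition avoids at scale. Parameters are illustrative.

    import numpy as np

    def ppr_knn(adj, source, k, alpha=0.15, iters=100):
        """k nearest neighbors of `source` by Personalized PageRank,
        via power iteration (assumes no isolated vertices)."""
        P = adj / adj.sum(axis=1, keepdims=True)     # row-stochastic transitions
        r = np.zeros(adj.shape[0])
        r[source] = 1.0                              # restart distribution
        pi = r.copy()
        for _ in range(iters):
            pi = alpha * r + (1 - alpha) * P.T @ pi  # random walk with restarts
        pi[source] = -np.inf                         # exclude the query vertex
        return np.argsort(pi)[::-1][:k]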

Biography

Leman Akoglu is a Ph.D. candidate in the Computer Science Department
at Carnegie Mellon University, advised by Prof. Christos Faloutsos.
She received her B.S. from Bilkent University in 2007. She has won two
best paper awards and published 15 refereed articles in major data mining
venues. She is one of the inventors of 3 U.S. patents (pending), filed
by IBM T. J. Watson Research Labs. Her research focuses on large-scale
data analytics, with an emphasis on anomaly and event detection in
large, time-varying graphs using scalable algorithms and tools.

Abstract

We present the first PAC bounds for learning parameters of Conditional
Random Fields (Lafferty et al., 2001) with general structures over discrete
and real-valued variables. Our bounds apply to composite likelihood
(Lindsay, 1988), which generalizes maximum likelihood and pseudolikelihood
(Besag, 1975). Moreover, we show that the only existing algorithm with a
PAC bound for learning high-treewidth discrete models (Abbeel, 2006) can be
viewed as a computationally inefficient method for computing
pseudolikelihood. We present an extensive empirical study of the
statistical efficiency of these estimators, as predicted by our bounds.
Finally, we use our bounds to show how to construct computationally and
statistically efficient composite likelihood estimators.

Abstract

How can machine learning help you write a song? In this talk, I will
discuss two preliminary projects that have grown out of February Album
Writing Month (http://fawm.org/), an online community for
thousands of international musicians that I organize. The goal of each
participant is to write 14 new songs in the month of February. First, I
will discuss simple generative language models which were used to create a
suite of online "computational creativity tools" called The Muse
(http://muse.fawm.org). Despite their simplicity, these tools were
successful at helping hundreds of participants write new songs. Second, I
will present work in modeling user behavior that begins to explain and
characterize what social interactions are most associated with successful
outcomes (number of songs written, attaining the 14-song goal, donating
money, returning the following year, etc.). I will conclude by proposing a
few open directions for how machine learning can support individual and
group creativity.

Abstract

Mitchell (2008) demonstrated that brain states can be understood in
terms of a componential semantics; that is, certain conceptual
dimensions or other semantic properties can be used to interpret, and
to predict, the activity of the brain when it is processing the meaning
of a word. Meaning can be characterized in many computational ways,
including hand-crafted ontologies, electronic thesauri, and corpus-based
distributional models like HAL or LSA. Broad-coverage "people-based"
distributional models are also emerging through crowd-sourcing of word
associations, properties and similarity judgements. Here I'll examine
corpus-based models to see which kinds of co-occurrence data are most
informative for understanding patterns of activity in the brain, using
regularized regressions. I'll also compare how labour- and
computationally-intensive they are relative to human benchmarks, and
speculate on how a combination of automatically derived and
hand-tailored resources might provide a more comprehensive description
of the human lexicon.

Abstract

Many biological processes can be viewed as algorithms that
nature has designed to solve real-world computational problems. In
this talk, we will discuss our preliminary work on connecting
"synaptic pruning" (the neurodevelopmental process in which synapses
in the brain exuberantly form and then retract) to problems in
distributed network design. We will present numerical results showing
the benefit of pruning-based algorithms compared to other strategies
for creating robust, efficient, and adaptive networks, and we will
talk about our efforts to experimentally quantify the rate of synapse
elimination in the mouse cortex.

This is joint work with Yoshio Takashima, Alison Barth, and Ziv Bar-Joseph.

Abstract

This talk will focus on the contextual bandit problem, which
essentially captures many situations from medical testing to
computational advertising. In this setting, a learner only receives
feedback about the actions it takes, creating a need to balance
exploiting good strategies and exploring new ones. I will discuss some
recent results on the contextual bandit problem, including the first
high-probability optimal algorithm, a reduction from bandits to
cost-sensitive learning, as well as an interesting generalization of
the setting. These advances point to the possibility of deploying
better and more effective algorithms in practice.

Abstract

Differential Privacy is a criterion used to judge whether a randomized
algorithm operating over a database of individuals may be deemed to
preserve privacy. In this work we apply the notion of Differential
Privacy to reproducing kernel Hilbert spaces of functions. As in the
finite-dimensional Differential Privacy literature, we achieve privacy
via noise addition where the variance is calibrated to the
"sensitivity" of the output. In our setting the noise in question is
the sample path of a Gaussian process, and the sensitivity is measured
in the RKHS norm rather than the Euclidean norm. We give examples of
private versions of kernel density estimators and support vector
machines.

This talk will be self-contained: it will not assume prior knowledge
of differential privacy, stochastic processes, etc.
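
As a toy illustration of this recipe for one of those examples, the
sketch below privatizes a one-dimensional kernel density estimator
evaluated on a grid: it adds a Gaussian-process sample path whose
covariance is the same kernel, scaled by an RKHS-sensitivity bound and
a Gaussian-mechanism-style constant. All constants and parameter names
are illustrative, not the talk's calibration.

    import numpy as np

    def private_kde(data, grid, h=0.3, eps=1.0, delta=1e-6, seed=0):
        """Release a Gaussian KDE on `grid`, plus scaled GP noise."""
        data = np.asarray(data, dtype=float)
        grid = np.asarray(grid, dtype=float)
        n = len(data)
        k = lambda a, b: np.exp(-(a - b) ** 2 / (2 * h ** 2))
        norm = h * np.sqrt(2 * np.pi)
        f = np.array([k(g, data).mean() / norm for g in grid])
        # Swapping one point moves the estimator by at most 2/(n*norm)
        # in RKHS norm (each kernel bump has unit RKHS norm).
        sens = 2.0 / (n * norm)
        c = sens * np.sqrt(2 * np.log(1.25 / delta)) / eps
        K = k(grid[:, None], grid[None, :]) + 1e-8 * np.eye(len(grid))
        noise = np.random.default_rng(seed).multivariate_normal(
            np.zeros(len(grid)), K)
        return f + c * noise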

Abstract

We derive generalization error bounds (bounds on the expected
inaccuracy of the predictions) for traditional time series
forecasting models. These bounds allow forecasters to select among
competing models and to guarantee that with high probability, their
chosen model will perform well without making strong assumptions about
the data generating process or appealing to asymptotic theory.
Extending results from statistical learning theory, we demonstrate how
these techniques can benefit time-series forecasters interested in
choosing models which behave well under uncertainty and
misspecification.

Abstract

Hierarchical Bayesian models have become a popular tool for analyzing
large-scale real-world data, such as text and images. Through these
models, people can build useful tools for latent structure discovery,
browsing and recommendations. In this talk, I will present several
of my work in the area of hierarchical Bayesian modeling, with an
emphasis on topic modeling, Bayesian nonparametrics and approximate
posterior inference. Specifically, I will first talk about an online
variational inference framework that can scale up to millions of
articles on a single machine and also does model selection using
Bayesian nonparametric models. Then I will describe a novel
recommendation model on scientific articles that provides an
interpretable latent structure for both users and items. This is
important for end-user interactions, however usually difficult to
obtain and ignored in the literature. A demo is at
http://www.cs.princeton.edu/~chongw/citeulike/
.

Abstract

The problem of estimating high-dimensional network structures arises
naturally in the analyses of many physical, biological and
socio-economic systems. Examples include stock price fluctuations in
financial markets and gene regulatory networks representing effects of
regulators (transcription factors) on regulated genes in genetics. In
many of these applications, the variables have inherent grouping
structures that, when incorporated, can result in improved estimation
and prediction. We aim to learn the structure of the network over time,
employing the framework of Granger causal models, under the
assumptions of sparsity of its edges and inherent grouping structure
among its nodes. I introduce a truncated penalty variant of the group
lasso to discover the Granger causal interactions among the nodes of
the network. Asymptotic results on the consistency of the new
estimation procedure are developed. The performance of the proposed
methodology is assessed through an extensive set of simulation studies
and comparisons with existing techniques. Finally, various extensions
of the framework to more complex Granger causal structures are
discussed.

Joint work with Sumanta Basu and George Michailidis (University of Michigan)
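
For concreteness, here is a minimal sketch of the plain group-lasso
version of this estimator for a single node, fitted by proximal
gradient descent; the talk's truncated penalty modifies the threshold
step, and all parameter names are illustrative.

    import numpy as np

    def group_soft(v, t):
        nrm = np.linalg.norm(v)
        return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

    def granger_group_lasso(Z, y, groups, lam=0.1, iters=500):
        """Regress one node's series y on lagged values Z of all nodes;
        groups: one array of column indices per candidate parent node.
        A surviving group marks a Granger-causal parent."""
        step = 1.0 / np.linalg.norm(Z, 2) ** 2
        b = np.zeros(Z.shape[1])
        for _ in range(iters):
            b = b + step * Z.T @ (y - Z @ b)          # gradient step
            for g in groups:                          # group soft-threshold
                b[g] = group_soft(b[g], step * lam * np.sqrt(len(g)))
        return b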

Abstract

Most machine learning algorithms, such as classification or
regression, treat the individual data point as the object of interest.
Here we consider extending machine learning algorithms to operate on
groups of data points. We suggest treating a group of data points as
a set of i.i.d. samples from an underlying feature distribution for
the group. Our approach is to generalize kernel machines from
vectorial inputs to i.i.d. sample sets of vectors. For this purpose,
we use a nonparametric estimator that can consistently estimate the
inner product and certain kernel functions of two distributions. The
projection of the estimated Gram matrix to the cone of semi-definite
matrices enables us to employ the kernel trick, and hence use kernel
machines for classification, regression, anomaly detection, and
low-dimensional embedding in the space of distributions. Our numerical
results demonstrate that in many cases this approach can outperform
state-of-the-art competitors on both simulated and challenging
real-world datasets.
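
A minimal sketch of the pipeline described above: estimate a kernel
between i.i.d. sample sets via the empirical mean-embedding estimator
of the inner product, then project the resulting Gram matrix onto the
PSD cone before handing it to any kernel machine. Kernel choice and
bandwidth are illustrative.

    import numpy as np

    def set_kernel(A, B, sigma=1.0):
        """Empirical inner product of the kernel mean embeddings of two
        i.i.d. sample sets A and B."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2)).mean()

    def psd_project(G):
        """Clip negative eigenvalues so the estimated Gram matrix is a
        valid kernel matrix (the projection step described above)."""
        vals, vecs = np.linalg.eigh((G + G.T) / 2)
        return (vecs * np.clip(vals, 0, None)) @ vecs.T

    def distribution_gram(groups, sigma=1.0):
        n = len(groups)
        G = np.array([[set_kernel(groups[i], groups[j], sigma)
                       for j in range(n)] for i in range(n)])
        return psd_project(G)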

Abstract

How do you learn a linear predictor on a dataset with 2 trillion
nonzero features in a reasonable amount of time? All single-machine
algorithms fail, because the time to even stream the data through a
network interface is too great. I will discuss an algorithm for doing
this based on a combination of online learning, L-BFGS, parallel
learning, and a new communication infrastructure (a Hadoop-compatible
Allreduce that we created). Deployed on 1000 nodes, we can learn faster
than any single machine _ever_ will be able to learn a linear predictor
on this hardware, the first such learning algorithm for which this
claim can be made. The code behind this is open source in Vowpal
Wabbit (http://hunch.net/~vw/).
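
As a toy, single-process illustration of the statistical pattern
(per-node online passes whose weights are averaged allreduce-style,
then a warm-started batch refinement), here is a sketch; scipy's
generic L-BFGS-B stands in for the custom distributed L-BFGS, and the
sharding and hyperparameters are hypothetical.

    import numpy as np
    from scipy.optimize import minimize

    def local_pass(X, y, w, lr=0.1):
        """One online (SGD) epoch of logistic regression on a shard."""
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + np.exp(-xi @ w))
            w = w - lr * (p - yi) * xi
        return w

    def train(shards, d):
        # "Allreduce" step: average the per-shard online solutions.
        w0 = np.mean([local_pass(X, y, np.zeros(d)) for X, y in shards],
                     axis=0)
        # Batch refinement warm-started from the average.
        X = np.vstack([s[0] for s in shards])
        y = np.concatenate([s[1] for s in shards])
        loss = lambda w: np.logaddexp(0.0, -(2 * y - 1) * (X @ w)).sum()
        return minimize(loss, w0, method="L-BFGS-B").x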

Abstract

I will discuss some work on sequential decision making under
uncertainty in large, partially observable domains. During the talk I
will demonstrate that by leveraging structural properties in the
dynamics model we can scale to domains orders of magnitude larger than
generic approaches. The talk will particularly focus on the strengths
and technical challenges that arise in designing automated, adaptive
instructional sequences for computerized tutoring systems.

Abstract

Fielding ability remains a difficult quantity to estimate in baseball.
I present a sophisticated hierarchical model that uses current
ball-in-play data to evaluate individual fielders. I will discuss
continuing efforts to extend these fielding models to examine the
evolution of fielding ability over multiple seasons. Many challenges
in this area remain: our modeling efforts are constrained by the
aspects of fielding measured in the current data. These limitations
will be discussed with a look towards the potential availability of
much higher resolution data in the near future.

Abstract

Modeling the purposeful behavior of imperfect agents from a small
number of observations is a challenging task. When restricted to the
single-agent decision-theoretic setting, inverse optimal control
techniques assume that observed behavior is an approximately optimal
solution to an unknown decision problem. These techniques learn a
utility function that explains the example behavior and can then be
used to accurately predict or imitate future behavior in similar
observed or unobserved situations.

In this work, we consider similar tasks in competitive and cooperative
multi-agent domains. Here, unlike single-agent settings, a player
cannot myopically maximize its reward --- it must speculate on how the
other agents may act to influence the game's outcome. Employing the
game-theoretic notion of regret and the principle of maximum entropy,
we introduce a technique for predicting and generalizing behavior, as
well as recovering a reward function in these domains.

Abstract

Many interesting applications require solving the following
problem: by watching an incoming stream of sensor data, hypothesize a
dynamical system model which explains that data. The general problem
of learning a dynamical system from a sensor data stream is difficult:
to discover the right latent state representation and model
parameters, one must solve difficult temporal and structural credit
assignment problems, often leading to a search space with a host of
(bad) local optima. In this talk, I will discuss how to overcome these
problems by pairing an expressive class of models called Predictive
State Representations with statistically consistent spectral learning
algorithms. I will show that this framework is very general, easy to
implement, computationally efficient, and can be used to unify and
solve a number of different learning problems that are typically
addressed in isolation.

Abstract

We explore a transfer learning setting, in which a finite sequence of
target concepts are sampled independently with an unknown distribution
from a known family. We study the total number of labeled examples
required to learn all targets to an arbitrary specified expected
accuracy, focusing on the asymptotics in the number of tasks and the
desired accuracy. Our primary interest is formally understanding the
fundamental benefits of transfer learning, compared to learning each
target independently from the others. Our approach to the transfer
problem is general, in the sense that it can be used with a variety of
learning protocols. The key insight driving our approach is that the
distribution of the target concepts is identifiable from the joint
distribution over a number of random labeled data points equal to the
Vapnik-Chervonenkis dimension of the concept space. This is not
necessarily the case for the joint distribution over any smaller
number of points.

This work has particularly interesting implications when applied to
active learning methods. In particular, we study in detail the
benefits of transfer for self-verifying active learning; in this
setting, we find that the number of labeled examples required for
learning with transfer is often significantly smaller than that
required for learning each target independently.

Abstract

Latent feature models are an appropriate choice for image modeling,
since images generally contain multiple objects or features. However,
many latent feature models either do not account for the fact that
objects can appear at different locations in different images or they
require pre-segmentation of images and cluster the resulting segments.
While the recently-proposed transformed Indian buffet process (tIBP)
provides a method for modeling transformation-invariant features in
simple binary images without the need for pre-segmentation, it cannot
be applied to real images because of both computational constraints
and its modeling assumptions. In this talk, I will show how the tIBP
can be combined with an appropriate likelihood to create a model
applicable to real images, and describe a novel Metropolis-Hastings
inference algorithm that significantly improves the scalability of the
tIBP.

This is joint work with Ke Zhai, Yuening Hu and Jordan Boyd-Graber at
the University of Maryland.

Abstract

The Lemonade Stand Game Tournament is a multiagent competition focused
on understanding the issues faced when playing in a multi-agent
environment. In contrast to Robocup, the annual computer poker
competitions, and the TAC Competitions, the computational aspects of
the game are intentionally kept to a minimum, forcing players to focus
on the far more enigmatic task of understanding their opponents. While
Nash equilibria are easily computed, the game is unsolvable, meaning
that if everyone plays a different equilibrium, there is no guarantee
that their joint behavior is meaningful.

In this talk, I will discuss how, in the past three years of the
competition, new and interesting strategies have arisen which have no
single agent equivalent: strategies that attempt not only to
understand and adapt to the world, but strategies that also force the
world to adapt to them. Although the competition is about maximizing
utility, teams have designed radically new decision methodologies to
achieve this. It is a known failing of the Nash equilibrium that it is
a static concept and does not incorporate learning. The ultimate goal of
this competition is to create a "group thought experiment" where not
just an understanding of the static states of multi-agent systems can
be developed, but also new empirical laws governing their dynamics can
be explored, to be then tested and analyzed in a wider setting.

Biography

Martin Zinkevich is a Senior Research Scientist at Yahoo! Research,
where he has worked since 2007. His primary interests are anti-abuse,
large scale machine learning, and multi-agent learning. He completed
his Ph.D. with Avrim Blum at CMU in 2004 with a thesis entitled,
"Theoretical Guarantees in Multiagent Settings". In between, he
worked as a postdoc with Amy Greenwald at Brown University, looking at
multi-agent reinforcement learning, and as a postdoc with Michael
Bowling at the University of Alberta, building (among other things)
poker programs that played at a professional level, a project which
culminated in an exhibition match in Vegas at the World Series of Poker
and a Wired article. At his day job, he works on blocking spam e-mails
and spam comments under news articles.

Abstract

Facebook is the most popular social network in the world, and millions
of people check news from their friends on its home page every day. A
machine learning system creates a personalized news stream for every
user on every load of the page. Filtering and ranking stories in the
news feed are unique Facebook problems, but they are similar to many
other machine learning challenges. The large scale of the growing
social network and changes in the interface and in user behavior do not
let us find the best solution once and for all; this is an endless
competition of ideas and algorithms. The presentation contains some
details about the problem, the architecture of the system, our ideas,
and the algorithms that we use.

Abstract

Using Lipschitz extensions for classification in metric spaces was
apparently first proposed by von Luxburg and Bousquet (2004), who also
noted that algorithmically, the solution can be realized as a
nearest-neighbor search. In a COLT 2010 paper, we showed how to
exploit the intrinsic geometry of the metric space to construct highly
efficient classifiers and to derive data-dependent generalization
bounds. We employed the doubling dimension on two fronts:
information-theoretically, to control the fat-shattering dimension of
Lipschitz functions (which yields error estimates), and
algorithmically, to perform approximate nearest-neighbor search
exponentially faster than the exact one. Since then, we have extended
this technique to regression and anomaly detection. The talk, intended
for a broad audience, will present an overview of our recent results,
obtained in collaboration with Daniel Berend, Lee-Ad Gottlieb, Danny
Hendler, Eitan Menahem, and Robert Krauthgamer.
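
A minimal sketch of the underlying predictor (not the fast algorithms
the talk develops): the McShane extension of the training labels, whose
sign gives the classifier and whose evaluation is exactly a
nearest-neighbor-type search. The Lipschitz constant L and the
brute-force search are illustrative.

    import numpy as np

    def lipschitz_classifier(X_train, y_train, L=2.0):
        """McShane extension f(x) = min_i (y_i + L * ||x - x_i||) of
        labels y in {-1, +1}; classify by sign(f)."""
        def predict(x):
            d = np.linalg.norm(X_train - x, axis=1)
            return np.sign(np.min(y_train + L * d))
        return predict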

Biography

Aryeh (Leonid) Kontorovich received his undergraduate degree in
mathematics with a certificate in applied mathematics from Princeton
University in 2001. His M.Sc. and Ph.D. are from Carnegie Mellon
University, where he graduated in 2007. After a postdoctoral
fellowship at the Weizmann Institute of Science, he joined the
Computer Science department at Ben-Gurion University of the Negev in
2009 as an assistant professor; this is his current position. His
research interests are mainly in machine learning, with a focus on
probability, statistics, and automata theory.

Abstract

The variety and complexity of potentially-related data resources
available for querying --- webpages, databases, data warehouses ---
has been growing ever more rapidly. There is a growing need to pose
integrative queries across multiple such sources, exploiting foreign
keys and other means of interlinking data to merge information from
diverse sources. This has traditionally been the focus of research
within the Information Extraction (IE) and Information Integration (II)
communities, with IE focusing on converting unstructured sources into
structured sources, and II focusing on providing a unified view of
diverse structured data sources. However, most of the current IE and
II methods, which can potentially be applied to the problem of
integration across sources, require large amounts of human
supervision, often in the form of annotated data. This need for
extensive supervision makes existing methods expensive to deploy and
difficult to maintain. Faced with this challenge, in this talk, I
shall present an overview of my research into graph-based
weakly-supervised methods for IE and II.

Biography

Partha Pratim Talukdar is a Postdoctoral Fellow in the Machine
Learning Department at CMU, working with Tom Mitchell. He received his
PhD (2010) in CIS from the University of Pennsylvania, working under
the supervisions of Fernando Pereira, Mark Liberman, and Zack Ives. He
is broadly interested in Machine Learning, Natural Language
Processing, Data Integration, and Cognitive Science, with particular
interest in large-scale learning and inference over graphs. Partha has
worked at a variety of industrial research labs, including HP Labs,
Google Research, and most recently Microsoft Research where he spent a
year before coming to Carnegie Mellon.

Abstract

The saliency of regions or objects in an image can be signiï¬cantly
boosted if they recur in multiple images.
Leveraging this idea, cosegmentation jointly segments common regions
from multiple images. In this paper, we propose CoSand, a distributed
cosegmentation approach for a highly variable large-scale image
collection. The segmentation task is modeled by temperature
maximization on anisotropic heat diffusion, of which the temperature
maximization with ï¬nite K heat sources corresponds to a Kway
segmentation that maximizes the segmentation conï¬dence of every pixel
in an image. We show that our method takes advantage of a strong
theoretic property in that the temperature under linear anisotropic
diffusion is a submodular function; therefore, a greedy algorithm
guarantees at least a constant factor approximation to the optimal
solution for temperature maximization. Our theoretic result is
successfully applied to scalable cosegmentation as well as diversity
ranking and single-image segmentation. We evaluate CoSand on MSRC and
ImageNet datasets, and show that it is competitive with previous work
in segmentation performance while offering much superior scalability.

"Presented in partial fulfillment of the CSD speaking skill
requirement".

Abstract

Although spectral clustering has enjoyed considerable empirical success
in machine learning, its theoretical properties are not yet fully
developed. We analyze the performance of a spectral algorithm for
hierarchical clustering and show that on a class of hierarchically
structured similarity matrices, this algorithm can tolerate noise that
grows with the number of data points while still perfectly recovering
the hierarchical clusters with high probability. We additionally
improve upon previous results for k-way spectral clustering to derive
conditions under which spectral clustering makes no mistakes. Further,
using minimax analysis, we derive tight upper and lower bounds for the
clustering problem and compare the performance of spectral clustering
to these information-theoretic limits. We also present experiments on
simulated and real-world data illustrating the strength of our results.

Joint work with Sivaraman Balakrishnan, Akshay Krishnamurthy, and Aarti Singh.
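
As a toy illustration of the kind of algorithm analyzed here, the
following sketch runs k-way spectral clustering on a noisy
block-structured similarity matrix. It is a minimal version built on
the normalized Laplacian, not the specific algorithm from the talk:

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clusters(W, k):
        """Cluster n points given an n x n symmetric similarity matrix W."""
        d = W.sum(axis=1)
        # Normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
        # Rows of the k smallest eigenvectors embed the points; k-means
        # then reads off the clusters.
        vals, vecs = np.linalg.eigh(L)
        return KMeans(n_clusters=k, n_init=10).fit_predict(vecs[:, :k])

    # Two planted clusters plus additive noise on every similarity.
    rng = np.random.default_rng(0)
    W = np.kron(np.eye(2), np.ones((10, 10))) + 0.2 * rng.random((20, 20))
    W = (W + W.T) / 2
    print(spectral_clusters(W, k=2))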

Abstract

When dealing with time series with complex and uncertain non-stationarities,
low retrospective regret on individual realizations is in general a more
appropriate goal than low prospective risk in expectation. Online learning
algorithms provide powerful guarantees of this form and have often been
proposed for use with non-stationary processes because of their ability to
switch between different forecasters or "experts." However, existing methods
assume that this set of experts whose forecasts are to be combined is given
at the start and fixed over time, and such assumptions are not generally
plausible when dealing with genuinely historical or evolutionary systems. I
show how to modify the "fixed shares" algorithm for tracking the best expert
to handle a steadily growing set of experts, in which new experts are fitted
to new data as they become available, and obtain regret bounds for the
growing ensemble.
Joint work with Kristina Klinkner, Abigail Jacobs, and Aaron Clauset.
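
As a concrete sketch of the modification, the toy implementation below
runs a fixed-shares-style update over a pool that gains one expert per
round (squared loss; giving a newly fitted expert the current average
weight is my assumption, not necessarily the paper's rule):

    import numpy as np

    def growing_fixed_shares(expert_preds, outcomes, eta=1.0, alpha=0.05):
        """Track the best expert while a new expert joins at each step.

        expert_preds[i][t] is expert i's prediction at time t; entries
        from before an expert joins are ignored."""
        weights = np.array([1.0])                      # one expert at t = 0
        for t, y in enumerate(outcomes):
            if 0 < t < len(expert_preds):              # a new expert joins
                weights = np.append(weights, weights.mean())
                weights /= weights.sum()
            k = len(weights)
            preds = np.array([expert_preds[i][t] for i in range(k)])
            print(f"t={t} forecast={weights @ preds:.2f} outcome={y}")
            weights *= np.exp(-eta * (preds - y) ** 2)    # exponential weights
            weights /= weights.sum()
            weights = (1 - alpha) * weights + alpha / k   # fixed-shares mixing
        return weights

    outcomes = [0.1, 0.9, 0.8, 0.2]
    expert_preds = [[0.5, 0.5, 0.5, 0.5],    # expert 0, present from the start
                    [0.0, 0.9, 0.9, 0.1],    # expert 1 joins at t = 1
                    [0.0, 0.0, 0.8, 0.2],    # expert 2 joins at t = 2
                    [0.0, 0.0, 0.0, 0.2]]    # expert 3 joins at t = 3
    print(growing_fixed_shares(expert_preds, outcomes))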

Abstract

Extracting useful knowledge from large network datasets has become a
fundamental challenge in many domains, from scientific literature to
social networks and the web. We introduce Apolo, a system that uses a
mixed-initiative approach --- combining visualization, rich user
interaction and machine learning --- to guide the user to
incrementally and interactively explore large network data and make
sense of it. Apolo engages the user in bottom-up sensemaking to
gradually build up an understanding over time by starting small,
rather than starting big and drilling down. Apolo also helps users
find relevant information by specifying exemplars, and then using a
machine learning method called Belief Propagation to infer which other
nodes may be of interest. We evaluated Apolo with twelve participants
in a between-subjects study, with the task being to find relevant new
papers to update an existing survey paper. Using expert judges,
participants using Apolo found significantly more relevant papers.
Subjective feedback on Apolo was also very positive.

Abstract

The rise of social media has yielded tremendous amounts of behavioral and
textual data recording people's attitudes and interests in various social
contexts. I will present several projects, employing techniques from simple
statistics to LDA-style graphical models, that infer cultural phenomena from
these data:
(1) Relating opinion polls to Twitter sentiment analysis
(2) Geographic linguistic communities on Twitter
(3) Cultural interest groupings in the Facebook "Like" graph

Abstract

Many activities traditionally regarded as based on individual choice are in
fact strongly influenced by the social connections of each customer. In my
talk I will present a novel method for analyzing social connections called
the Group-First approach. This method exploits the structure of customer
interactions to identify social leaders, and through it predict their
behavior, as well as that of their peers.
I will demonstrate the applicability of this approach to prediction of
customer churn in cellular networks. I will present our results from several
carriers, which confirm the unique advantages of our approach. Finally, I
will discuss the parallel architecture which allows us to process the large
volumes of data required for this analysis.

Biography

Elad Yom-Tov is a Senior Research Scientist at Yahoo Research. Before
joining Yahoo in 2010, he was with the Machine Learning group at IBM
Research Haifa Lab and Rafael. Dr. Yom-Tov received his B.Sc. from Tel-Aviv
University and his M.Sc. and Ph.D. from the Technion - Israel Institute of
Technology. Dr. Yom-Tov has co-authored two books and over 40 publications
in top international conferences and journals, and filed over 30 patents (6
of which have been granted so far). His primary research interests are in
large-scale Machine Learning, Information Retrieval, and in the past few
years, social analysis.

Abstract

Every month, users spend 700 billion minutes on Facebook.
Surfacing personalized relevant content to each user at this scale, and
for each of the billions of page views, presents interesting machine
learning challenges. As a sample application for matching content to user
interests, I will describe a method for improving the relevance of online
advertising. The method is based on a user interest model that is trained
on past clickthrough data. I will also talk briefly about our work on a
recommendation system to suggest new friends. This system is responsible
for 40% of all new friend connections made on Facebook.

Abstract

This talk describes recent progress on generally characterizing the
number of label requests sufficient for active learning to achieve a
given accuracy (i.e., the label complexity), in both noisy and
noise-free settings. Specifically, we begin by discussing a
disagreement-based approach to the design and analysis of active
learning algorithms, which has recently gained popularity in the
literature. We find that the label complexities achieved by
algorithms of this type are typically well-characterized by a simple
quantity known as the disagreement coefficient. We then proceed to
discuss a newer approach to the design of active learning algorithms,
based on shatterable sets, and characterize the label complexities
achievable by these methods in terms of a quantity analogous to the
disagreement coefficient. In particular, these label complexities are
often significantly better than those achievable by disagreement-based
methods.

Abstract

Models of interacting Self-Propelled Particles (SPPs) have proven
adept at reproducing realistic-looking simulations of animal swarms
and form the basis for our understanding of how collective behaviour
emerges from individual interactions. With the continued progression
of animal tracking technology it is now possible to consider inferring
the optimal interaction model based on empirical data of individual
movements within co-moving groups.

Group simulations often result in similar patterns of collective
behaviour despite including different fine-scale interactions rules.
The emergence of similar collective patterns in simulations of animal
groups points to a restricted set of "universal" classes for these
patterns. Universality presents a challenge to inferring such
interactions from macroscopic group dynamics since these can be
consistent with many underlying interaction models. As such,
fine-scale movements of the animals must be considered if one is to
understand exactly how individuals interact.

I will present a study of using Bayesian model selection on simulated
data of animal swarms to distinguish between competing interaction
models and demonstrate the limitations posed by accepted standards of
experimental design.

Biography

Richard Mann is a postdoc in the Centre for Interdisciplinary
Mathematics at Uppsala University, Sweden. He works on applying
methods of inference and machine learning to the analysis of data in
animal behaviour experiments. He completed his PhD at the University
of Oxford, UK in 2010.

Abstract

I will discuss two projects, both of which are concerned with finding
structure in data. The first is "Graph-Valued Regression" where we
estimate an undirected graph for a random vector Y, as a function of a
second random vector X. (Joint work with Han Liu, Xi Chen, and John
Lafferty).

The second is the problem of "Minimax Estimation of Manifolds From
Noisy Data, in the Hausdorff Distance". (Joint work with Chris
Genovese, Marco Perone-Pacifico and Isabela Verdinelli).

Abstract

For naturally occurring data, the dimension of the given input space is
often very large while the data themselves have a low intrinsic
dimensionality. Spectral kernel methods are non-linear techniques for
transforming data into a coordinate system that efficiently reveals the
underlying structure -- in particular, the "connectivity" -- of the data. In
this talk, I will focus on one particular technique -- diffusion maps -- but
the analysis can be used for other spectral methods as well. I will give
examples of various applications of the method in high-dimensional
inference. I will also present a new extension of the diffusion framework to
comparing distributions in high-dimensional spaces with an application to
content-based image retrieval. (Part of this work is joint with R.R.
Coifman, S. Lafon, C. Schafer and L. Wasserman)

Abstract

As computing hardware becomes more portable and more instrumented with
sensors, algorithms to parse this data become ever more important. We
introduce a framework for building context awareness into standard
smart-phones, which allows us to personalize the behavior of the phone--or
other devices communicating remotely with the phone--using observations made
about the owner's usage patterns.

In this test-bed application, we utilize data from a GPS, microphone,
proximity sensor, and accelerometer, as well as from user interactions,
to predict whether the phone should have its ringer set to silent or
loud in the current context. The system must cope with noise in the
sensor data and inconsistent user feedback, and it must do so using
only a minimal amount of battery power.

Initial results from the In-Context system will be presented, and a variety
of other applications for this technology will be introduced.

Abstract

Graphical models or Markov random fields provide a graph-based
framework for capturing dependencies between random variables of a
large-scale multivariate distribution. This interdisciplinary topic
has found widespread application in a variety of areas including image
processing, bioinformatics, combinatorial optimization and machine
learning. Estimating the graph structure of the model using samples
drawn from it forms an important task, since the structure reveals
important relationships between the variables.
However, structure estimation has several challenges: in general
graphical models, it is NP-hard, the models are typically in the
high-dimensional regime where the number of variables is much larger
than the number of samples obtained, and there could be many latent
variables which are unobserved. I will address these challenges in the
talk and provide solutions for certain classes of models.

I will focus on latent tree models in the first part of the talk.
These are tree graphical models where there are latent variables, but
there is no knowledge of the number or the location of the latent
variables. We have developed novel algorithms which are
consistent, computationally efficient and have low sample
complexity. These algorithms are based on the presence of an additive
metric on the tree, due to the properties of correlations on a tree
model. The first algorithm uses these properties to check for sibling
relationships between node pairs and builds the tree in a recursive
fashion. The second algorithm initially builds a tree over the
observed variables, and then adds hidden nodes in a step-by-step
fashion by only operating on small subsets of variables. This leads
to considerable computational savings compared to the first algorithm.
We modify the second algorithm for experiments on real data by trading
off the number of added latent variables against the accuracy of the
resulting model fit via the Bayesian Information Criterion (BIC).
Experiments on the S&P 100 monthly returns data and on the occurrence
of words in newsgroups reveal interesting relationships.
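
The additive-metric idea behind both algorithms can be seen in a few
lines. For a Gaussian tree model, d(i, j) = -log |corr(X_i, X_j)| is
additive along tree paths, so two observed nodes share a parent exactly
when d(i, k) - d(j, k) is the same for every other node k. A toy check
(my own simulation, not the authors' code):

    import numpy as np

    def info_distance(X):
        """Additive tree metric: d(i, j) = -log |corr(X_i, X_j)|."""
        C = np.corrcoef(X)
        return -np.log(np.abs(C) + 1e-12)

    def sibling_spread(d, i, j, others):
        """Spread of d[i, k] - d[j, k] over k; near zero iff i, j are siblings."""
        diffs = [d[i, k] - d[j, k] for k in others]
        return max(diffs) - min(diffs)

    rng = np.random.default_rng(1)
    n = 20000
    h1 = rng.normal(size=n)                    # hidden parent of X0 and X1
    h2 = 0.8 * h1 + 0.6 * rng.normal(size=n)   # hidden parent of X2 and X3
    X = np.stack([0.9 * h + 0.4 * rng.normal(size=n)
                  for h in (h1, h1, h2, h2)])
    d = info_distance(X)
    print(sibling_spread(d, 0, 1, [2, 3]))     # near zero: siblings
    print(sibling_spread(d, 0, 2, [1, 3]))     # clearly positive: not siblings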

In the second part, I will talk about recent results on learning
graphical models on sparse Erdos-Renyi random graphs. These random
graphs are relevant in social networks. Since these graphs are locally
tree-like, it is a natural question if structure learning is feasible
in these models, given that learning tree models is tractable. We
provide a positive answer when the model is in the so-called
uniqueness regime, where there is a decay of long-range correlations.
The algorithm is based on a set of conditional mutual information
tests and is shown to be consistent for structure estimation with
almost order-optimal sample complexity. A simpler algorithm based on
correlation thresholding is also consistent, but under more stringent
conditions. Finally, depending on the time
availability, I will briefly mention related works on consistent
estimation of high-dimensional forest distributions and the
characterization of extremal tree structures with respect to error
rates for structure learning.

Biography

Anima Anandkumar received her B.Tech in Electrical
Engineering from the Indian Institute of Technology (IIT) Madras in
2004 and her MS and PhD degrees in Electrical Engineering from
Cornell University, Ithaca, NY in 2009. She was at the Stochastic
Systems Group at MIT, Cambridge, MA as a post-doctoral researcher. She
has been an assistant professor in the EECS Department at UC Irvine since July
2010. She is the recipient of the 2009 Best Thesis Award by the ACM
Sigmetrics Society, 2008 IEEE Signal Processing Society Young Author
Best Paper Award, 2008 IBM Fran Allen PhD fellowship, and student
paper award at the 2006 IEEE ICASSP. Her research interests are in the
areas of statistical signal processing, network theory, and information
theory.

Abstract

We sharply characterize the performance of different penalization
schemes for the problem of selecting the relevant variables in the
multi-task setting. Previous work focuses on the regression problem
where conditions on the design matrix complicate the analysis. A
clearer and simpler picture emerges by studying the Normal means
model. This model, often used in the field of statistics, is a
simplified model that provides a laboratory for studying complex
procedures. These theoretical results will be presented together with
implications for practitioners.

Abstract

Markov switching processes, such as hidden Markov models (HMMs) and
switching linear dynamical systems (SLDSs), are often used to
describe rich classes of dynamical phenomena. They describe complex
temporal behavior via repeated returns to a set of simpler models:
imagine, for example, a person alternating between walking, running
and jumping behaviors, or a stock index switching between regimes of
high and low volatility.

Traditional modeling approaches for Markov switching processes
typically assume a fixed, pre-specified number of dynamical models.
Here, in contrast, we develop Bayesian nonparametric approaches that
define priors on an unbounded number of potential Markov models.
Using stochastic processes including the beta and Dirichlet process,
we develop methods that allow the data to define the complexity of
inferred classes of models, while permitting efficient computational
algorithms for inference. The new methodology also has
generalizations for modeling and discovery of dynamic structure shared
by multiple related time series.

Interleaved throughout the talk are results from studies of the NIST
speaker diarization database, stochastic volatility of a stock index,
the dances of honeybees, and human motion capture videos.

Abstract

An embedding of probability distributions into a reproducing kernel
Hilbert space (RKHS) has been introduced: like the characteristic
function, this provides a unique representation of a probability
distribution in a high dimensional feature space. This representation
forms the basis of an inference procedure on graphical models, where
the likelihoods are represented as RKHS functions. The resulting
algorithm is completely nonparametric: all aspects of the model are
represented implicitly, and learned from a training sample. Both exact
inference on trees and loopy belief propagation on pairwise Markov
random fields are demonstrated.
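
The representation itself is compact enough to write down directly.
The sketch below (an illustration of the mean embedding, not of the
message-passing algorithm) compares two samples through the distance
between their RKHS mean embeddings, i.e., the biased MMD estimator
with a Gaussian kernel:

    import numpy as np

    def rbf_kernel(X, Y, gamma=1.0):
        """Gaussian kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def mmd2(X, Y, gamma=1.0):
        """Squared distance between the mean embeddings of two samples;
        for a characteristic kernel this vanishes iff the distributions
        coincide (biased estimator, diagonal terms included)."""
        return (rbf_kernel(X, X, gamma).mean()
                - 2 * rbf_kernel(X, Y, gamma).mean()
                + rbf_kernel(Y, Y, gamma).mean())

    rng = np.random.default_rng(0)
    same = mmd2(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
    diff = mmd2(rng.normal(size=(200, 2)), rng.normal(2.0, 1.0, (200, 2)))
    print(f"same distribution: {same:.4f}  shifted distribution: {diff:.4f}")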

Kernel message passing can be applied to general domains where kernels
are defined, handling challenging cases such as discrete variables
with huge domains, or very complex, non-Gaussian continuous
distributions. We apply kernel message passing and competing
approaches to cross-language document retrieval, depth prediction from
still images, protein configuration prediction, and paper topic
inference from citation networks: these are all large-scale problems,
with continuous-valued or structured random variables having complex
underlying probability distributions. In all cases, kernel message
passing performs outstandingly, being orders of magnitude faster than
state-of-the-art nonparametric alternatives, and returning more
accurate results.

Abstract

The number of triangles is a computationally expensive graph statistic which is frequently used in complex network analysis (e.g., transitivity ratio), in various random graph models (e.g., exponential random graph model) and in important real-world applications such as spam detection, uncovering of the hidden thematic structure of the Web and link recommendation. Counting triangles in graphs with millions and billions of edges requires algorithms which run fast, use a small amount of space, provide accurate estimates of the number of triangles and preferably are parallelizable.

In this paper we present an efficient triangle counting algorithm which can be adapted to the semistreaming model. The key idea of our algorithm is to combine the sampling algorithm of Tsourakakis et al. and the partitioning of the set of vertices into a high-degree and a low-degree subset, as in Alon, Yuster and Zwick, treating each set appropriately. We obtain a running time $O \left( m + \frac{m^{3/2} \Delta \log{n} }{t \epsilon^2} \right)$ and an $\epsilon$ approximation (multiplicative error), where $n$ is the number of vertices, $m$ the number of edges and $\Delta$ the maximum number of triangles an edge is contained in. Furthermore, we show how this algorithm can be adapted to the semistreaming model with space usage $O\left(m^{1/2}\log{n} + \frac{m^{3/2} \Delta \log{n}}{t \epsilon^2} \right)$ and a constant number of passes (three) over the graph stream. We apply our methods to various networks with several millions of edges and we obtain excellent results (e.g., for the Orkut graph with ~120M edges and ~286M triangles, our method runs in ~5 sec with 98% accuracy). Finally, we propose a random projection based method for triangle counting and provide a sufficient condition to obtain an estimate with low variance.
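
To convey the sampling ingredient, here is a simplified sketch: keep
each edge independently with probability p, count triangles exactly in
the sparsified graph, and rescale by 1/p^3, since a triangle survives
only if all three of its edges do. The full algorithm additionally
partitions the vertices by degree and handles the semi-streaming
details, which this sketch omits:

    import itertools
    import random
    from collections import defaultdict

    def triangles(adj):
        """Exact count: each triangle is seen once per edge, hence // 3."""
        return sum(len(adj[u] & adj[v])
                   for u in adj for v in adj[u] if u < v) // 3

    def estimate_triangles(edges, p=0.3, seed=0):
        """Keep each edge with probability p, then rescale by 1/p^3."""
        rng = random.Random(seed)
        adj = defaultdict(set)
        for u, v in edges:
            if rng.random() < p:
                adj[u].add(v); adj[v].add(u)
        return triangles(adj) / p ** 3

    rng = random.Random(1)
    edges = [(u, v) for u, v in itertools.combinations(range(60), 2)
             if rng.random() < 0.2]
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    print("exact:", triangles(adj), "estimate:", round(estimate_triangles(edges)))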

Abstract

Dynamic Programming, since its introduction by Richard Bellman in the 1940s, is one of the most important problem solving techniques, with numerous applications in operations research, databases (histogram construction) times series analysis, speech recognition, robotics, biology and in many other fields. In this talk we will present two new techniques for performing dynamic programming approximately, for a recurrence not treated efficiently by existing methods. The basis of our first algorithm is the definition of a constant-shifted variant of the objective function that can be efficiently approximated using state of the art methods for range searching. Our technique approximates the optimal value of our objective function within additive $\epsilon$ error and runs in $\tilde{O}(n^{4/3+\delta} \log{ (\frac{U}{\epsilon}) )}$ time, where $\delta$ is an arbitrarily small positive constant and $U = \max \{ \sqrt{C},(P_i)_{i=1,\ldots,n} \}$. The second algorithm we provide solves a similar recurrence that's within a multiplicative factor of (1+$\epsilon$) and runs in $O(n \log{n} / \epsilon )$. The new technique introduced by our algorithm is the decomposition of the initial problem into a small (logarithmic) number of Monge optimization subproblems which we can speed up using existing techniques. Finally, we demonstrate a biological application of our recurrence where we obtain results superior to leading competitors both on benchmarks and real data.

Abstract

In this work, we describe PeGaSus, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, finding the connected components, and computing the importance score of nodes. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows too. To the best of our knowledge, PeGaSus is the first such library, implemented on top of the Hadoop platform, the open source version of MapReduce.

Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PeGaSus, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V.
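
A single-machine sketch may make the abstraction concrete (the
operation names follow the paper; the code itself is only an
illustration, since PeGaSus runs this on Hadoop). The user supplies
combine2 (pair a matrix entry with a vector entry), combineAll
(aggregate the pairs) and assign (update the vector entry); PageRank
is one instance:

    import numpy as np

    def gim_v(M, v, combine2, combine_all, assign, iters=100):
        """Iterate x_i = assign(v_i, combineAll_j combine2(M_ij, v_j))."""
        for _ in range(iters):
            new = np.array([assign(v[i],
                                   combine_all([combine2(M[i, j], v[j])
                                                for j in range(len(v))
                                                if M[i, j] != 0]))
                            for i in range(len(v))])
            if np.allclose(new, v):
                break
            v = new
        return v

    # PageRank: combine2 = multiply, combineAll = sum, assign = damping.
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # A[i, j] = 1: edge i -> j
    M = (A / A.sum(axis=1, keepdims=True)).T    # M[i, j] = 1 / outdeg(j)
    n, c = len(M), 0.85
    ranks = gim_v(M, np.full(n, 1.0 / n),
                  combine2=lambda m, vj: m * vj,
                  combine_all=sum,
                  assign=lambda vi, s: (1 - c) / n + c * s)
    print(ranks.round(3))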

Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ~6.7 billion edges.

Abstract

We consider the problem of identifying an activation pattern in a
complex, large-scale network that is embedded in very noisy
measurements. This problem is relevant to several applications, such
as identifying traces of a biochemical spread by a sensor network,
expression levels of genes, and anomalous activity or congestion in
the Internet. Extracting such patterns is a challenging task, especially
if the network is large (the pattern is very high-dimensional) and the
noise is so excessive that it masks the activity at any single node.
However, typically there are statistical dependencies in the network
activation process that can be leveraged to fuse the measurements of
multiple nodes and enable reliable extraction of high dimensional
noisy patterns. In this paper, we analyze an estimator based on the
graph Laplacian eigenbasis, and establish the limits of mean square
error recovery of noisy patterns arising from a probabilistic
(Gaussian or Ising) model based on an arbitrary graph structure. We
consider both deterministic and probabilistic network evolution
models, and our results indicate that by leveraging the network
interaction structure, it is possible to consistently recover
high-dimensional patterns even when the noise variance increases with
network size.
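
A toy version of the Laplacian-eigenbasis estimator: project noisy
per-node measurements onto the smoothest eigenvectors of the graph
Laplacian, so that activation aligned with the network structure is
retained while node-independent noise is averaged out (my simulation,
with arbitrary sizes and noise levels):

    import numpy as np

    def denoise(A, y, k):
        """Project y onto the k smoothest eigenvectors of L = D - A."""
        L = np.diag(A.sum(axis=1)) - A
        vals, vecs = np.linalg.eigh(L)
        U = vecs[:, :k]                # low-frequency eigenbasis
        return U @ (U.T @ y)

    # Two communities, a pattern active on one of them, heavy noise.
    rng = np.random.default_rng(0)
    n = 40
    A = (rng.random((n, n)) < 0.05).astype(float)
    A[:20, :20] = np.maximum(A[:20, :20], rng.random((20, 20)) < 0.4)
    A = np.triu(A, 1); A = A + A.T
    signal = np.r_[np.ones(20), np.zeros(20)]
    y = signal + rng.normal(0, 1.0, n)
    print("raw MSE:     ", round(np.mean((y - signal) ** 2), 2))
    print("denoised MSE:", round(np.mean((denoise(A, y, 5) - signal) ** 2), 2))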

Biography

I am currently a Ph.D. student in Machine Learning and Statistics at
CMU. I have a master's in Statistics and a bachelor's in Math and
Physics from the Ohio State University. I have worked on exploiting
structure for statistical estimation and pattern localization.
Currently, I am working on optimal design and active learning for
Ising and infection models, structured sparsity, and density
estimation over large graphs.

Abstract

Latent variable models are an essential tool for the analysis of data in
numerous application areas, whether it be source separation, image analysis,
matrix completion or preference elicitation, since they provide a means of
understanding the inherent structure in data. This talk will, in two parts,
examine the natural evolution of latent variable models and the statistical
tools - Bayesian tools in particular - used for learning with these models.

In the first part, the motivation for generalising such models to the
exponential family will be given, which will allow for the modelling of data
that may be binary, categorical, counts or non-negative. The focus will be
on Bayesian analysis of this class of models, showing how inference is
performed using Hybrid Monte Carlo sampling, the relationship between the
generalised models and various other existing models, and examining aspects
of model identifiability.

In the second part, the discussion will switch to the priors that are used
to learn latent representations of data, and in the context of the
generalised latent variable models just discussed. The focus will be on
sparse Bayesian learning: weak sparsity achieved through priors based on the
Gaussian scale mixture construction; and strong sparsity using discrete
mixture priors. A comparison of these two methods will be given and an
efficient sampler for learning with discrete mixture priors will be
described, which has many advantages over a corresponding optimisation
approach to learning.

Biography

Shakir Mohamed is a PhD Candidate in the Machine Learning Group at the
University of Cambridge working under the supervision of Prof. Zoubin
Ghahramani. At Cambridge, he is a Commonwealth scholar and a
member of St John's College. His research interests lie in Bayesian
statistics and latent variable models, sparse Bayesian learning and its
connections to compressed sensing and learning in infinite dimensional
settings, and the probabilistic modelling of tensor data, amongst others.

Abstract

Permutations are ubiquitous in many real world problems, such as voting,
rankings and data association. Representing uncertainty over permutations
is challenging, however, since there are $n!$ possibilities. A pervasive technique
in machine learning for making large problems tractable is to exploit probabilistic
independence structures for decomposing large problems into much smaller ones.
However, it is not obvious how one might exploit independence for
permutation data due to the mutual exclusivity constraints which
disallow, for example, two items to map to the same rank.

I will talk about recent progress on tracking and ranking problems.
First, I will discuss probabilistically tracking multiple moving targets by decomposing
large sets of targets into smaller independent subsets.
Via independence decompositions, we are able to track much larger collections of targets.
Unlike multiobject tracking, distributions over rankings are typically
not amenable to independence assumptions. Instead, I will present a
novel generalization of independence, called \emph{riffled independence},
encompassing a more expressive family of distributions while retaining
many of the properties necessary for performing efficient inference and
reducing sample complexity. In riffled independence, one draws two
permutations independently, then performs the \emph{riffle shuffle},
common in card games, to combine the two permutations to form a
single permutation. In ranking, riffled independence corresponds to
ranking disjoint sets of objects independently, then interleaving those rankings.
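
A toy sampler makes the generative picture concrete: draw a ranking of
each disjoint set independently, then riffle them together. For
simplicity the interleaving below is uniform, whereas the model places
a learned distribution over interleavings:

    import random

    def riffle(perm_a, perm_b, rng):
        """Interleave two rankings, preserving each one's internal order;
        which set supplies each position is chosen uniformly here."""
        slots = ['a'] * len(perm_a) + ['b'] * len(perm_b)
        rng.shuffle(slots)
        ia, ib = iter(perm_a), iter(perm_b)
        return [next(ia) if s == 'a' else next(ib) for s in slots]

    rng = random.Random(0)
    fruits = rng.sample(['apple', 'banana', 'cherry'], 3)   # ranking 1
    veggies = rng.sample(['kale', 'leek'], 2)               # ranking 2
    print(riffle(fruits, veggies, rng))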

This is joint work with Carlos Guestrin, Leo Guibas, Xiaoye Jiang
and Ashish Kapoor.

Abstract

Given a contact network that changes over time (say, day
vs night connectivity), and the SIS (susceptible/infected/susceptible, flu
like) virus propagation model, what can we say about its epidemic threshold?
That is, can we determine when a small infection will "take-off" and
create an epidemic? Consequently then, which nodes should we immunize
to prevent an epidemic? This is a very real problem, since, e.g. people
have different connections during the day at work, and during the night
at home. Static graphs have been studied for a long time, with numerous
analytical results. Time-evolving networks are so hard to analyze that
most existing works are simulation studies.
Specifically, our contributions in this paper are: (a) we formulate the
problem by approximating it by a Non-linear Dynamical system (NLDS),
(b) we derive the first closed formula for the epidemic threshold of time-varying
graphs under the SIS model, and finally (c) we show the usefulness
of our threshold by presenting efficient heuristics and evaluate
the effectiveness of our methods on synthetic and real data like the MIT
reality mining graphs.
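
In code, the resulting threshold check reduces to an eigenvalue
computation: form the one-period system matrix, the product of
S_i = (1 - delta) I + beta A_i over the alternating graphs, and test
whether its spectral radius is below 1. This is my paraphrase of the
NLDS condition; see the paper for the precise statement:

    import numpy as np

    def dies_out(adjs, beta, delta):
        """SIS on a periodically switching network: beta is the infection
        rate, delta the recovery rate, adjs the per-phase adjacency
        matrices; the infection dies out if the spectral radius of the
        one-period system matrix is below 1."""
        n = adjs[0].shape[0]
        S = np.eye(n)
        for A in adjs:
            S = ((1 - delta) * np.eye(n) + beta * A) @ S
        return max(abs(np.linalg.eigvals(S))) < 1

    day = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)    # dense at work
    night = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], float)  # sparse at home
    print(dies_out([day, night], beta=0.1, delta=0.4))   # weak virus: True
    print(dies_out([day, night], beta=0.9, delta=0.1))   # strong virus: False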

Abstract

Topic modeling has been popularly used for data analysis in various domains including text documents. Previous topic models, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), have shown impressive success in discovering low-rank hidden structures for modeling text documents. These models, however, do not take into account the manifold structure of data, which is generally informative for the non-linear dimensionality reduction mapping. More recent models, namely Laplacian PLSI (LapPLSI) and Locally-consistent Topic Model (LTM), have incorporated the local manifold structure into topic models and have shown the resulting benefits. But these approaches fall short of the full discriminating power of manifold learning as they only enhance the proximity between the low-rank representations of neighboring pairs without any consideration for non-neighboring pairs. In this paper, we propose Discriminative Topic Model (DTM) that separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving the local consistency. We also present a novel model fitting
algorithm based on the generalized EM and the concept of Pareto improvement. As a result, DTM achieves higher classification performance in a semi-supervised setting by effectively exposing the manifold structure of data. We provide empirical evidence on text corpora to demonstrate the success of DTM in terms of classification accuracy and robustness to parameters compared to state-of-the-art techniques.

Abstract

High-level parallel frameworks like MapReduce (Hadoop) have begun to receive considerable attention within the machine learning (ML) community. However, while MapReduce is ideal for many large-scale data processing tasks, it does not naturally or efficiently express asynchronous iterative computation with sparse dependencies. Unfortunately, many popular machine learning algorithms like belief propagation, Gibbs sampling, CoEM, and the lasso (shooting algorithm) require asynchronous iterative computation and impose sparse parameter dependencies. To fill this critical void, we developed GraphLab which naturally expresses asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. As part of our recent UAI'10 publication, we released a shared-memory (multicore) implementation of the GraphLab API and are in the process of developing cluster and GPU versions.

In this talk I will briefly review the MapReduce abstraction and demonstrate how coercing popular iterative machine learning algorithms into MapReduce can lead to highly inefficient parallel algorithms. I will then introduce the GraphLab framework and explain how it addresses the critical limitations of the MapReduce framework while retaining the advantages of a high-level abstraction. I will show how the GraphLab abstraction can be used to represent efficient, provably correct versions of several popular sequential machine learning algorithms. Finally, I will present scaling results from our UAI'10 paper which describes our initial Shared-Memory API (currently available for download). I will conclude by briefly discussing some of our ongoing work on the cluster (distributed-memory) API.

Abstract

In this talk, I will describe an interactive approach for learning
ranking functions. In particular, this approach leverages
clickthrough data collected from online interleaving experiments.

Interleaving experiments are an effective methodology for eliciting
reliable implicit feedback. For any query, the rankings computed by
two retrieval functions are interleaved together and then presented to
the user. Afterward, clicks can be interpreted as a relative
preference of one retrieval function over the other.
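
One common instantiation is team-draft interleaving; the sketch below
(an illustration, not necessarily the variant used in this work)
builds the interleaved list and records which ranker "drafted" each
result, so that clicks can be credited to one side or the other:

    import random

    def team_draft_interleave(ranking_a, ranking_b, rng):
        """Rankers alternately draft their best not-yet-shown result; a
        coin flip decides who picks first in each round."""
        shown, drafted_by = [], {}
        all_docs = set(ranking_a) | set(ranking_b)
        while len(shown) < len(all_docs):
            order = [('A', ranking_a), ('B', ranking_b)]
            rng.shuffle(order)
            for team, ranking in order:
                pick = next((d for d in ranking if d not in drafted_by), None)
                if pick is not None:
                    shown.append(pick)
                    drafted_by[pick] = team
        return shown, drafted_by

    shown, drafted_by = team_draft_interleave(
        ["d1", "d2", "d3"], ["d2", "d4", "d1"], random.Random(0))
    clicks = ["d4"]                  # the user clicks one shown result
    print(shown, [drafted_by[c] for c in clicks])   # credit goes to 'B'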

I will present a novel online learning framework, called the Dueling
Bandits Problem, that characterizes the exploration/exploitation
trade-off of running online interleaving experiments (since they might
lower current retrieval quality). For example, intuitively, we should
quickly stop running experiments using low quality retrieval functions
once we have discovered that they are of low quality. The Dueling
Bandits Problem differs from existing online learning frameworks in
that only relative information (e.g., is A better than B?) is assumed
to be available to the learning algorithm.

I will also describe a learning approach to improve the statistical
power of interleaving experiments. This approach is motivated by the
intuition that not all clicks are equally informative. By learning a
weighting function on clicks collected from interleaving experiments,
we can arrive at an improved statistical test that will more
efficiently tease apart the relative quality between pairs of
retrieval functions.

Abstract

We develop a penalized kernel smoothing method for the problem of selecting non-zero elements of the conditional precision matrix, known as conditional covariance selection. This problem has a key role in many modern applications such as finance and computational biology. However, it has not been properly addressed. Our estimator is derived under minimal assumptions on the underlying probability distribution and works well in the high-dimensional setting. The efficiency of the algorithm is demonstrated on both simulation studies and the analysis of the stock market.

Abstract

With the increasing popularity of large-scale probabilistic graphical models, even "lightweight" approximate inference methods are becoming infeasible. Fortunately, often large parts of the model are of no immediate interest to the end user. Given the variable that the user actually cares about, we show how to quantify edge importance in graphical models and to significantly speed up inference by focusing computation on important parts of the model. Our algorithm empirically demonstrates convergence speedup by multiple times over the state of the art.

Abstract

Hidden Markov Models (HMMs) are important tools for modeling
sequence data. However, they are restricted to discrete latent
states, and are largely restricted to Gaussian and discrete
observations. Moreover, learning algorithms for HMMs have predominantly
relied on local search heuristics, with the exception of spectral
methods such as those described below. We propose a nonparametric
HMM that extends traditional HMMs to structured and non-Gaussian
continuous distributions. Furthermore, we derive a local-minimum-free
kernel spectral algorithm for learning these HMMs. We apply our method
to robot vision data, slot car inertial sensor data and audio event
classification data, and show that in these applications, embedded
HMMs exceed the previous state-of-the-art performance.

Abstract

One potentially effective way to deal with a large collection of data samples is to discover a subspace
(usually much lower dimensional and latent) representation, which can be used for further knowledge
discovery or data management tasks. To automatically discover a low dimensional representation,
many methods have been developed, such as the classic PCA, CCA, and the probabilistic topic models
(e.g., latent Dirichlet allocation or LDA). One of the advocated advantages of such models is that they
do not require supervision during training, which is arguably preferred over supervised learning that
would necessitate extra cost. But with the increasing availability of free on-line information such as
image tags, user ratings, etc., various forms of side information that can potentially offer “free”
supervision have led to a need for new models and training schemes that can make effective use of
such information to achieve better results, such as more discriminative topic representations of image
contents and more accurate image classifiers. The standard LDA and many other models are unsupervised
and ignore the commonly available supervision information, and thus can discover a sub-optimal
representation for prediction tasks. Extensions to supervised models which can explore side information
for discovering predictive subspace representations have been proposed, but their training is typically
performed with maximum likelihood estimation, which may not yield conclusive results or may result in
an unbalanced prediction rule. Our goal is to investigate how the arguably more discriminative maximum
margin principle can be effectively applied to discover predictive subspace representations from a large
collection of data, which can be bag-of-words text documents or images with multiple types of features.
We aim to develop a generic learning framework for discovering predictive latent subspace representations
and provide several useful tools for helping users manage the huge amount of online content. In this
talk, I will present one recent work on multi-view data analysis.

Abstract

In this talk, I introduce the density ratio framework, review methods
of density ratio estimation, and show various real-world applications
including brain-computer interface, speech recognition, image
recognition, and robot control.

Abstract

We consider the problem of re-ranking search results by
incorporating user feedback. We present a graph theoretic measure for
discriminating irrelevant results from relevant results using a few
labeled examples provided by the user. The key intuition is that nodes
relatively closer (in graph topology) to the relevant nodes than the
irrelevant nodes are more likely to be relevant. We present a simple
sampling algorithm to evaluate this measure at specific nodes of interest,
and an efficient branch and bound algorithm to compute the top k nodes
from the entire graph under this measure. On quantifiable prediction tasks
the introduced measure outperforms other diffusion-based proximity
measures which take only the positive relevance feedback into account. On
the Entity-Relation graph built from the authors and papers of the entire
DBLP citation corpus (1.4 million nodes and 2.2 million edges) our branch
and bound algorithm takes about 1.5 seconds to retrieve the top 10 nodes
w.r.t. this measure with 10 labeled nodes.

Abstract

Considerable speed-ups to machine learning problems have been achieved by two developments: distributed computing (either on multi-core or "cloud" architectures) and rapidly converging online learning algorithms. In this talk, we combine these two. Distributed computing has largely been paired with "batch" algorithms like EM and L-BFGS, in which the entire training dataset is processed once per iterative update; our approach makes more frequent online updates asynchronously, either in a pure online or mini-batch setting. Asynchronous updates can introduce error, but the approach has similar convergence guarantees to other online learning algorithms in certain settings, such as the case of online gradient-based optimization for convex objectives. We first consider this setting, and present a series of experiments exploring practical issues for a structured prediction task in natural language processing, named-entity recognition.
We also consider settings that are not yet supported by theoretical results. We apply an online version of EM (Cappe and Moulines, 2009) to two unsupervised structured learning tasks: (1) word alignment for machine translation, and (2) unsupervised part-of-speech tagging. For the former we use a model that actually has a concave log-likelihood function, while the latter fits the more common unsupervised learning scenario with a non-concave objective. In both cases we find significant speed-ups over batch algorithms with no observable problems arising from the use of asynchronous updates. In addition, we present experimental results when running asynchronous mini-batch algorithms on M45, a large cluster running the Hadoop MapReduce framework. We find that, while MapReduce is not an ideal fit for these algorithms, they do converge faster than batch algorithms on the same hardware and we expect that the MapReduce framework may become more appropriate for asynchronous learning as problem sizes continue to grow.

Abstract

Genome rearrangement refers to large structural changes on genomes
which affect their chromosomal organization. Over the past 15 years,
significant progress has been achieved for problems with two genomes.
However, problems with three or more genomes, e.g., to infer phylogeny
and ancestral genome organizations simultaneously, had not been
satisfactorily solved, due to the underlying computational
difficulties.

In this talk I present my solutions to these challenging problems. I
start with the problem with three genomes; then introduce a new
mathematical theory, which captures its essential combinatorial
structures and whose iterative application quickly finds optimal
solutions to most instances. For the general problem with any number
of genomes, I develop a method which extends this theory and
systematically exploits information from known genomes. The new method
gives extremely accurate results within a few minutes on large
datasets, which are far beyond the reach of previous methods. Finally
this talk presents results of the new method applied on UCSC and
ENSEMBL high-resolution datasets.

Biography

Andrew Wei Xu received a bachelor's degree in Biophysics (2002, Nanjing
University, China) and a master's degree in Physics (2005, McMaster
University). He then studied under David Sankoff and received his
Ph.D. in mathematics (2008, University of Ottawa), specializing in
probability theory and discrete math. He now works with Bernard
Moret as a postdoctoral fellow in the School of Computer and
Communication Sciences at the Swiss Federal Institute of Technology,
Lausanne. His research interests center on computational biology, with
emphasis on solving genetic and genomic problems using various
quantitative analysis methods, including probability analysis,
statistical inference, algorithm design and combinatorial analysis.
Now he is looking forward to developing and applying statistical
machine learning methods to solve computational biology problems.

Abstract

In dual-income families, attending to the detail required to make and
monitor transportation plans for kids’ activities requires much
parental attention. Families rely on routines as a mechanism to
support this transportation process. Families face their largest
challenges, however, on non-routine days, where the effectiveness of
routine resources and behavioral patterns is significantly
diminished. When Dad takes over a chore Mom usually manages, for
example, the task is significantly more at risk of coordination
breakdowns. Especially when making or changing plans, information
routine to Mom could easily be unknown to Dad.

In this talk, I will discuss work towards the creation of technologies
that can support busy families during such moments. In particular, I
will describe the collection of a massive dataset on family
coordination, and demonstrate how only simple GPS is needed to create
intelligent applications that can support family coordination. I will
demonstrate how learned models of location can be used to detect when
a parent has forgotten to pick up a child, and propose a variety of
applications that can capture and reveal important but invisible (to
people) aspects of family movement patterns. These applications open a
space where Machine Learning and HCI research can collaborate to
create novel and valuable services for families.

Abstract

A line of recent work has demonstrated that sparsity is a powerful
technique in signal reconstruction and in statistical estimation.

Given n noisy samples with p dimensions, where n << p, we show that the
multi-step thresholding procedure based on the Lasso -- we call it the
Thresholded Lasso -- can accurately estimate a sparse vector beta in
p-dimensional space in a linear model. We show that under the Restricted
Eigenvalue (RE) condition (Bickel-Ritov-Tsybakov 09), it is possible to
achieve the L2 loss within a logarithmic factor of the ideal mean square
error one would achieve with an oracle while selecting a sufficiently
sparse model -- hence achieving "sparse oracle inequalities"; the oracle
would supply perfect information about which coordinates are non-zero and
which are above the noise level.
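
A minimal sketch of the multi-step procedure (a common variant; the
paper's tuning of the penalty and the threshold is more careful): fit
the Lasso, threshold the estimated coefficients to select a support,
then refit ordinary least squares on that support:

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    def thresholded_lasso(X, y, alpha=0.1, tau=0.05):
        """Lasso fit, hard-threshold at tau, OLS refit on the support."""
        coef = Lasso(alpha=alpha).fit(X, y).coef_
        support = np.where(np.abs(coef) > tau)[0]
        beta = np.zeros(X.shape[1])
        if support.size:
            beta[support] = LinearRegression().fit(X[:, support], y).coef_
        return beta, support

    rng = np.random.default_rng(0)
    n, p, s = 100, 500, 5                        # n << p, s-sparse truth
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p); beta_true[:s] = 3.0
    y = X @ beta_true + rng.normal(0, 1.0, n)
    beta_hat, support = thresholded_lasso(X, y)
    print("selected:", support)
    print("L2 error:", round(float(np.linalg.norm(beta_hat - beta_true)), 3))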

Biography

Shuheng Zhou received her Ph.D. from Carnegie Mellon University in August
2006, co-advised by Professors Greg Ganger and Bruce Maggs. Her
dissertation work focused on combinatorial optimization problems in
network routing. She then continued as a postdoc fellow at CMU, working
with Professors John Lafferty and Larry Wasserman on statistical and
machine learning algorithms and theory. She has been a postdoc fellow with
the Seminar for Statistics in the Department of Mathematics at ETH Zurich
since August 2008. She is currently visiting the Department of Statistics
at UC Berkeley, hosted by Professor Bin Yu.

Abstract

A lot of research in machine learning and data mining is concentrated on
building predictive models with the best possible performance. In most cases
such models act as black boxes: they make good predictions, but do not
provide much insight into the decision making process. This is
unsatisfactory for domain scientists who also want to answer questions like:
What effects do important features have on the response variable? Which
features are involved in complex effects and should be studied only together
with some other features? How can we visualize and interpret such complex
effects? Separate post-processing techniques are needed to answer these
questions.

The term statistical interaction is used to describe the presence of
non-additive effects among two or more variables in a function. When
variables interact, their effects must be modeled and interpreted
simultaneously. Thus, detecting statistical interactions can be critical for
an understanding of processes by domain researchers.
In this talk I will describe an approach to interaction detection based on
comparing the performance of unrestricted and restricted prediction models,
where restricted models are prevented from modeling an interaction in
question. I will show that an additive model-based regression ensemble,
Additive Groves, can be restricted appropriately for use with this
framework, and thus has the right properties for accurately detecting
variable interactions.

Apart from being useful for interaction detection, Additive Groves is a
strong predictive model by itself. I will show that it consistently
outperforms other state-of-the-art ensembles of trees for regression and
yields high performance across a variety of classification problems.
In the last part of the talk I will describe several applications of
Additive Groves and interaction detection techniques to real data, such as
tracking environmental change based on bird abundance, Salmonella risk
prediction in meat-processing establishments, and successful participation
in recent KDD and ICDM data mining competitions.

Abstract

Programming robots is hard. While demonstrating a desired behavior may
be easy, designing a system that behaves this way is often difficult,
time consuming, and ultimately expensive. Machine learning promises to
enable "programming by demonstration" for developing high-performance
robotic systems. Unfortunately, many approaches that utilize the
classical tools of supervised learning fail to meet the needs of
imitation learning.

Perhaps foremost, classical statistics and supervised machine learning
exist in a vacuum: predictions made by these algorithms are explicitly
assumed to not affect the world in which they operate.
I'll discuss the problems that result from ignoring the effect of
actions influencing the world, and I'll highlight simple "reduction-
based" approaches that, both in theory and in practice, mitigate these
problems.

Additionally, robotic systems are often built atop sophisticated
planning algorithms that efficiently reason far into the future;
consequently, ignoring these planning algorithms in lieu of a
supervised learning approach often leads to poor and myopic
performance. While planners have demonstrated dramatic success in
applications ranging from legged locomotion to outdoor unstructured
navigation, such algorithms rely on fully specified cost functions
that map sensor readings and environment models to a scalar cost. Such
cost functions are usually manually designed and programmed. Recently,
our group has developed a set of techniques that learn these functions
from human demonstration by applying an Inverse Optimal Control (IOC)
approach to find a cost function for which planned behavior mimics an
expert's demonstration. These approaches shed new light on the
intimate connections between probabilistic inference and optimal
control.

I'll consider case studies in activity forecasting of drivers and
pedestrians as well as the imitation learning of robotic locomotion
and rough-terrain navigation. These case-studies highlight key
challenges in applying the algorithms in practical settings.

Abstract

The exploration-exploitation tradeoff is crucial to
reinforcement-learning (RL) agents, and a number of sample-efficiency
results guaranteeing near-optimal behavior with a relatively small
amount of exploration have been derived for agents in propositional
domains. But these algorithms and results do not cover many
traditional AI representations that rely on relational state
descriptions (like STRIPS rules). In this talk, we consider
sample-efficient RL algorithms for a rich class of representations:
relational action schemas. These high-level representations,
combined with our sample-efficient algorithms, allow us to specify
transitions in a general and compact form that is still efficiently
learnable under some conditions, and have important implications in a
number of real-world domains described using relational models.

Biography

Tom Walsh is a PhD candidate at Rutgers University being advised by
Dr. Michael Littman. His research interests include reinforcement
learning, relational representations, machine learning and data
mining. His undergraduate degree and research were done at UMBC and he
has also had stints at NIST and Siemens Corporate Research.

Abstract

Dependence measures between random variables play a fundamental role
in many subtasks of machine learning, such as clustering, feature
selection, active learning, image registration and more. One of the
best-founded measures of dependence is Renyi's mutual information, a
generalization of Shannon's mutual information. The question studied
here is how to estimate Renyi's mutual information given a finite
sample. We give a new estimator, show its consistency (under some
regularity assumptions) and demonstrate that it is both more efficient
and robust than its competitors. For illustration we use examples from
image registration and independent subspace analysis. The beauty of
the new estimator is that it uses the rank-statistics of the sample
only, and connects copulae, information theory and Euclidean graph
optimization techniques.

Biography

Barnabas Poczos received his M.Sc. degree (summa cum laude) in applied
mathematics in 2001 at the Eotvos Lorand University, Budapest,
Hungary. He won the first prize in the Hungarian national scientific
student competition in Informatics in 2001. From 2005-2007 he was an
assistant lecturer at the Eotvos Lorand University, where he received
his Ph.D. (summa cum laude) in computing science in 2007. His thesis
was about independent subspace analysis, a multidimensional
generalization of the independent component analysis problem. He is
currently a Postdoctoral Fellow at the Department of Computing Science
of the University of Alberta under the supervision of Dr. Csaba
Szepesvari. His areas of expertise include machine learning,
unsupervised learning, Bayesian methods, active learning,
bioinformatics, independent component analysis, information and
entropy estimation.

Abstract

In recent years, the blogosphere has experienced a substantial increase in the number of posts published daily, forcing users to cope with information overload. The task of guiding users through this flood of information has thus become critical. To address this issue, we present a principled approach for picking a set of posts that best covers the important stories in the blogosphere.

We define a simple and elegant notion of coverage and formalize it as a submodular optimization problem, for which we can efficiently compute a near-optimal solution. In addition, since people have varied interests, the ideal coverage algorithm should incorporate user preferences in order to tailor the selected posts to individual tastes. We define the problem of learning a personalized coverage function by providing an appropriate user-interaction model and formalizing an online learning framework for this task. We then provide a no-regret algorithm which can quickly learn a user's preferences from limited feedback.
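
The near-optimal solution comes from the classic greedy algorithm for
monotone submodular maximization, which is guaranteed to be within a
factor (1 - 1/e) of optimal. A toy sketch with hypothetical post and
story identifiers:

    def greedy_cover(posts, k):
        """Pick k posts greedily, each time adding the post whose stories
        add the most new coverage; `posts` maps a post id to the set of
        stories it covers."""
        chosen, covered = [], set()
        for _ in range(k):
            best = max(posts, key=lambda p: len(posts[p] - covered),
                       default=None)
            if best is None or not posts[best] - covered:
                break
            chosen.append(best)
            covered |= posts.pop(best)
        return chosen, covered

    posts = {
        "post1": {"election", "economy"},
        "post2": {"economy", "sports"},
        "post3": {"election", "weather", "sports"},
    }
    print(greedy_cover(posts, k=2))   # post3 first, then post1 or post2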

Abstract

Mixed integer linear programming (MILP) is a powerful representation often used to formulate decision-making problems under uncertainty. However, it lacks a natural mechanism to reason about objects, groups of objects, and relations. First-order logic (FOL), on the other hand, excels at reasoning about groups of objects, but lacks a natural representation of uncertainty. While representing propositional logic in MILP has been extensively explored, no theory exists yet for fully combining FOL with MILP. We propose a new representation, called first-order programming or FOP, which fully subsumes both FOL and MILP. We establish formal methods for reasoning about first order programs, including a sound and complete lifted inference procedure based on Gomory cuts. Since FOP can offer exponential savings in representation size compared to FOL, and since our inference procedure can directly duplicate any FOL resolution proof, we anticipate that inference in FOP will be more tractable than inference in FOL for corresponding problems.

Abstract

There has been an explosion of interest in statistical models for
analyzing network data, and considerable interest in the class of
exponential random graph (ERG) models.

In this talk I will relate the properties of ERG models to the
properties of the broader class of discrete exponential families. I
will describe a general geometric result about discrete exponential
families with polyhedral support. I will show how the properties of
these families can be well captured by some fundamental geometric
objects of polyhedral form.

I will discuss the relevance of such results to maximum likelihood
estimation, both from a theoretical and computational standpoint. I
will then apply these results to the analysis of ERG models. By means
of a detailed example, I will provide some characterization and a
partial explanation of certain pathological features of ERG models
known as degeneracy.

Abstract

Different representation languages are useful for capturing different inductive biases. What is the language that best captures human inductive biases? I will present several models and studies which suggest that human knowledge is mentally represented in a logical language, and that human learning can be characterized in terms of probabilistic inference over these logical representations. The applications presented will include category learning, relational learning, and one-shot learning.

Abstract

Inferring spatially co-located regions of interest is an important problem in several applications, such as identifying activation regions in the brain or contamination regions in environmental monitoring. In this talk, I will present multi-resolution methods for passive and active learning of sets that aggregate data at appropriate resolutions to achieve optimal bias-variance tradeoffs for set estimation. In the passive setting, we observe some data, such as a noisy fMRI image of the brain, and then extract the regions with statistically significant brain activity. The active setting, on the other hand, involves feedback: the location of each observation is decided based on the data observed in the past. This can be used for rapid extraction of set estimates, such as a contamination region in environmental monitoring, by designing data-adaptive spatial survey paths for a mobile sensor. I will describe a method that uses information gleaned from coarse surveys to focus sampling around informative regions (boundaries), thus generating successively refined multi-resolution set estimates.
I will also discuss some current research directions which aim at efficient extraction of spatially distributed sets of interest by exploiting non-local dependencies in the data.
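The following is a minimal one-dimensional sketch of the coarse-to-fine idea, assuming noisy point evaluations and a boundary that can be bracketed by sign changes; the actual estimator in the talk, and its bias-variance analysis, are considerably more refined:

    import numpy as np

    def coarse_to_fine_boundary(sample, lo, hi, gamma, levels=5, n0=9, reps=20):
        """Bracket the boundary of {x : f(x) > gamma} on [lo, hi].

        sample(x) returns a noisy observation of f(x); averaging reps
        draws reduces variance. Each level keeps only the intervals whose
        endpoints disagree about the threshold (boundary intervals) and
        halves them, so sampling effort concentrates near the boundary.
        """
        est = lambda x: np.mean([sample(x) for _ in range(reps)])
        xs = np.linspace(lo, hi, n0)                 # coarse survey
        vals = np.array([est(x) for x in xs])
        intervals = [(xs[i], xs[i + 1]) for i in range(n0 - 1)
                     if (vals[i] > gamma) != (vals[i + 1] > gamma)]
        for _ in range(levels):                      # refine around boundary
            refined = []
            for a, b in intervals:
                m = 0.5 * (a + b)
                va, vm, vb = est(a), est(m), est(b)
                if (va > gamma) != (vm > gamma):
                    refined.append((a, m))
                if (vm > gamma) != (vb > gamma):
                    refined.append((m, b))
            intervals = refined or intervals
        return intervals                             # short boundary brackets

    # e.g., recover the boundary of {sin(x) > 0} near x = pi from noisy data:
    # coarse_to_fine_boundary(lambda x: np.sin(x) + 0.1 * np.random.randn(),
    #                         lo=0.5, hi=4.0, gamma=0.0)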

Abstract

Human-computer interaction researchers use diverse methods, from eye-tracking to ethnography, to understand human activity, and machine learning is growing in popularity as a method within the community. This talk will survey projects from HCI researchers at CMU that combine machine learning with other techniques to address problems such as adapting interfaces to individuals with motor impairments, predicting routines in dual-income families, classifying controversial Wikipedia articles, and identifying the rhetorical strategies that newcomers in online support groups use to elicit responses. Researchers without strong computational backgrounds can face practical challenges as consumers of machine learning tools; this talk will highlight opportunities for tool design and collaboration across communities.

Abstract

A machine learning approach to learning to rank trains a model to optimize a target evaluation measure with respect to training data. Currently, existing information retrieval (IR) measures are impossible to optimize directly except for models with a very small number of parameters. This poses a major challenge: how can IR measures of interest be optimized directly? We have shown that LambdaRank, which smoothly approximates the gradient of the target measure, can be adapted to work with four popular IR target evaluation measures using the same underlying gradient construction. It is likely, therefore, that this construction extends to other evaluation measures. We empirically show that LambdaRank finds a locally optimal solution for mean NDCG@10, mean NDCG, MAP, and MRR with a 99% confidence rate. We also show that the amount of effective training data varies with the IR measure, and that with a sufficiently large training set, matching the training optimization measure to the target evaluation measure yields the best accuracy. In this talk, I will first review LambdaRank and then present the local optimality testing and results.
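A minimal sketch of the lambda-gradient construction for NDCG, following Burges et al.; signs, constants, and conventions here are illustrative and may differ from the talk's exact formulation:

    import numpy as np

    def lambda_gradients(scores, gains, sigma=1.0):
        """Per-document lambdas for one query (NDCG-flavored LambdaRank).

        For each pair (i, j) with gains[i] > gains[j], a pairwise logistic
        gradient is scaled by |delta NDCG|, the change in NDCG from swapping
        i and j in the current ranking; a learner then nudges each score in
        the direction of its accumulated lambda.
        """
        scores, gains = np.asarray(scores, float), np.asarray(gains, float)
        n = len(scores)
        rank = np.empty(n, dtype=int)
        rank[np.argsort(-scores)] = np.arange(n)      # current rank of each doc
        discount = 1.0 / np.log2(rank + 2.0)          # DCG position discounts
        idcg = np.sum(np.sort(gains)[::-1] / np.log2(np.arange(n) + 2.0))
        idcg = idcg if idcg > 0 else 1.0
        lambdas = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if gains[i] <= gains[j]:
                    continue                          # only pairs where i beats j
                swap = abs((gains[i] - gains[j]) * (discount[i] - discount[j])) / idcg
                lam = sigma * swap / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
                lambdas[i] += lam                     # push the better doc up
                lambdas[j] -= lam                     # and the worse doc down
        return lambdas

Swapping |delta NDCG| for the corresponding swap-change in MAP or MRR yields the other measures' gradients from the same construction, which is the adaptability the abstract refers to.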

Abstract

We consider the problem of learning structured-output regression, where the output consists of multiple response variables that are related by a graph, and correlated response variables depend on the input variables in a sparse but synergistic fashion rather than independently. Such problems arise in genetic investigations of causal DNA variations that perturb multiple related genes in a module, or in engineering problems where complex responses must be jointly mapped to causal factors to overcome the weak statistical power of analyzing individual responses independently. We propose graph-guided fused lasso methods, which incorporate couplings among response variables on top of the standard lasso penalty for overall sparsity. Our approach represents the dependency structure among the output variables explicitly as a network, and leverages this network to encode structured regularizations in a multivariate regression model, so that inputs that jointly influence subgroups of highly correlated outputs can be detected with high sensitivity and specificity. Experimental comparisons with the standard lasso on both simulated and real data show a significant advantage in detecting the true relevant inputs when the output correlation pattern is incorporated using our methods.
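One standard instantiation of such a graph-guided penalty, in the spirit of the method described (the exact weighting here is illustrative), is

    \min_{B}\ \tfrac{1}{2}\,\|Y - XB\|_F^2
      \;+\; \lambda \sum_{j,k} |\beta_{jk}|
      \;+\; \gamma \sum_{(m,l) \in E} w_{ml} \sum_{j}
            \bigl|\,\beta_{jm} - \mathrm{sign}(r_{ml})\,\beta_{jl}\,\bigr|,

where B = (\beta_{jk}) maps inputs j to outputs k, E is the output network, r_{ml} is the correlation between outputs m and l, and w_{ml} (e.g., |r_{ml}|) is the edge weight. The first penalty enforces overall sparsity as in the standard lasso, while the fusion term shrinks the coefficient profiles of correlated outputs toward each other, so jointly influential inputs are selected together.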

Abstract

We propose a new, recursive model to generate realistic graphs that evolve over time. Our model has the following properties: it is (a) flexible, capable of generating the cross product of weighted/unweighted, directed/undirected, uni/bipartite graphs; (b) realistic, giving graphs that obey eleven static and dynamic laws that real graphs follow (we formally prove this for several of the (power) laws and estimate their exponents as a function of the model parameters); (c) parsimonious, requiring only four parameters; (d) fast, being linear in the number of edges; and (e) simple, intuitively leading to the generation of macroscopic patterns. We empirically show that our model mimics two real-world graphs very well: Blognet (unipartite, undirected, unweighted) with 27K nodes and 125K edges; and Committee-to-Candidate campaign donations (bipartite, directed, weighted) with 23K nodes and 880K edges. We also show how to handle time so that edge/weight additions are bursty and self-similar.
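For intuition about the recursive generation step, here is a sketch in the style of the classic four-parameter recursive (R-MAT-like) edge placement; it illustrates only the flavor of such recursions, while the model in the abstract additionally handles weights, bipartite graphs, and bursty time stamps:

    import random

    def recursive_edge(n_levels, a, b, c, d):
        """Drop one edge into a 2**n_levels x 2**n_levels adjacency matrix.

        At each level the matrix splits into four quadrants, chosen with
        probabilities (a, b, c, d) summing to 1; after n_levels recursions
        a single (source, destination) cell remains. Cost per edge is
        O(log #nodes), so generation is linear in the number of edges.
        """
        src = dst = 0
        for _ in range(n_levels):
            src, dst = 2 * src, 2 * dst
            r = random.random()
            if r < a:
                pass                                  # top-left quadrant
            elif r < a + b:
                dst += 1                              # top-right
            elif r < a + b + c:
                src += 1                              # bottom-left
            else:
                src, dst = src + 1, dst + 1           # bottom-right
        return src, dst

    # e.g., 125K edges over 2**15 nodes, skewed toward one quadrant to
    # produce heavy-tailed degrees:
    edges = [recursive_edge(15, 0.55, 0.15, 0.15, 0.15) for _ in range(125_000)]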

Abstract

Despite the widespread use of clustering, we are only beginning to understand the general unifying principles behind clustering functions. Questions like "Are there any principles governing all clustering paradigms?" and "How should a user choose an appropriate clustering algorithm for a particular task?" are almost completely unanswered by the existing clustering literature. We take an axiomatic approach to the theory of clustering, adopting the framework of Kleinberg. By relaxing one of Kleinberg's clustering axioms, we sidestep his impossibility result and arrive at a consistent set of axioms. We use this set of axioms to put forward a theory of clustering that helps answer the questions above.
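For reference, Kleinberg's three axioms for a clustering function f that maps a distance function d on a fixed point set to a partition are, in their standard form:

    Scale invariance:  f(\alpha \cdot d) = f(d) for every \alpha > 0.
    Richness:          every partition of the point set equals f(d) for some d.
    Consistency:       if d' shrinks within-cluster distances of f(d) and
                       stretches between-cluster distances, then f(d') = f(d).

Kleinberg proved that no single clustering function satisfies all three simultaneously; relaxing one of them, as we do, removes the contradiction.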

Abstract

We present a cyclical blockwise coordinate descent algorithm for the multi-task Lasso that efficiently solves problems with thousands of features and tasks. The main result shows that a closed-form Winsorization operator can be obtained for sup-norm penalized least squares regression. This allows the algorithm to find solutions to very large-scale problems far more efficiently than existing methods, and complements the pioneering work of Friedman et al. (2007) for the single-task Lasso. As a case study, we use the multi-task Lasso as a variable selector to discover a semantic basis for predicting human neural activation. The learned solution outperforms the standard basis for this task on the majority of test participants, while requiring far fewer assumptions about cognitive neuroscience. We demonstrate how this learned basis can yield insights into how the brain represents the meanings of words.
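A minimal sketch of the kind of closed-form Winsorization operator involved, for a single coefficient block b with penalty lam on the sup-norm (scaling and naming are illustrative; the full algorithm cycles such updates over feature blocks):

    import numpy as np

    def prox_sup_norm(b, lam):
        """Closed-form minimizer of 0.5 * ||w - b||^2 + lam * ||w||_inf.

        By Moreau decomposition, w* = b - P(b) with P the projection onto
        the l1-ball of radius lam; equivalently, w* Winsorizes (clips) the
        entries of b at a data-dependent threshold tau.
        """
        b = np.asarray(b, float)
        s = np.abs(b)
        if s.sum() <= lam:
            return np.zeros_like(b)                   # whole block is zeroed out
        u = np.sort(s)[::-1]                          # waterfilling for tau:
        css = np.cumsum(u)                            # sum(max(|b_k| - tau, 0)) == lam
        k = np.nonzero(u - (css - lam) / (np.arange(len(u)) + 1.0) > 0)[0][-1]
        tau = (css[k] - lam) / (k + 1.0)
        return np.sign(b) * np.minimum(s, tau)        # clip ("Winsorize") at tau

Because each block update is closed-form, coordinate descent passes are cheap, consistent with the scalability the abstract reports.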
