The Machine Learning and Intelligence Group at Microsoft Research, Redmond

I'm a Principal Researcher in, and manager of, the Machine Learning and Intelligence (MLI) group in the Machine Learning Group at Microsoft Research. Our work covers a broad spectrum of research activities, from fundamental, theoretical machine learning to a wide variety of applied research projects, such as social network-based question answering and URL selection for Bing's index. Our machine learning technologies are used throughout Microsoft's Online Services division (in particular, Bing, AdCenter and MSN). For example, our ranking technology is used for core Web Search, for Web Search verticals like image and commerce search, and for Ads relevance. This is just one example: I invite you to browse the team's web pages to find out more.

My Own Research

I'm interested in the theory and practice of machine learning, optimization methods, and the intersection of machine learning with natural language processing and semantic modeling. I'm currently particularly interested in machine reading and semantic modeling, with models that leverage large amounts of unlabeled text data. Here are a few of the things I've been working on recently - for more information please visit my publications page.

Modeling Meaning in Text

M. Richardson, E. Renshaw and I have created a freely available dataset called MCTest for work on the machine comprehension of text. We gathered 500 short stories using a vocabulary that a typical seven-year-old would know. In this sense, the dataset is open domain, yet its scope is still limited. For each story, we also gathered four multiple-choice questions, and for each question, four answers, one of which is correct. Many of the questions are designed to require more than one sentence from the story to be answered correctly. A simple token-based baseline system gets about 60% of the questions correct (random guessing would achieve 25% in expectation). It is our hope that the gap between this number and how well a human does (which is in the high nineties) will spur research in semantic modeling. The data can be downloaded, and publications and results found, here.
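To give a feel for what a token-based baseline can look like, here is a minimal sketch. It is an illustrative assumption, not the baseline system reported for MCTest: it simply picks the answer whose tokens overlap most with the story. The function names `tokens` and `pick_answer`, and the toy story, are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def pick_answer(story, answers):
    """Return the index of the answer sharing the most tokens with the story.

    Ties are broken in favor of the earliest answer. This is a toy
    heuristic for illustration only; it ignores the question entirely.
    """
    story_tokens = tokens(story)
    scores = [len(tokens(answer) & story_tokens) for answer in answers]
    return scores.index(max(scores))


# Hypothetical example in the style of a children's story.
story = "Sam the dog ran to the park and found a red ball."
answers = ["a blue kite", "a red ball", "a green hat", "a yellow car"]
chosen = pick_answer(story, answers)
print(answers[chosen])
```

With four candidate answers per question, a system that guesses uniformly at random is right one time in four, which is where the 25% chance level comes from.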

G. Zweig and I also constructed a sentence completion challenge dataset that we hope will prove useful for testing text-based semantic modeling methods - you can find the details here.

The Predicate Depth: Finding Robust, Accurate Classifiers

Typically, one approaches a supervised machine learning problem by writing down an objective function and finding a hypothesis that minimizes it. This is equivalent to finding the Maximum A Posteriori (MAP) hypothesis for a Boltzmann distribution. However, MAP is not a robust statistic: one can construct examples where the MAP solution disagrees with the weighted majority vote on every instance. Instead of directly finding the MAP solution, one can divide the problem into a part that assigns a posterior distribution over the function class, based on training data, and a part that, given this posterior, finds a function (or ensemble of functions) that it is hoped will generalize well. We define the predicate depth, and its median, over binary classification functions, and we show that the predicate median is robust and has good generalization guarantees. It also has the advantage of being a single deterministic function, and so avoids problems with sampling models (non-deterministic solutions) and ensembles (speed, and interpretability). We present algorithms to approximate it and also show how they can be used to approximate the Tukey median in polynomial time. To appear in JMLR. This is joint work with R. Gilad-Bachrach.
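The claim that the MAP hypothesis can disagree with the weighted majority vote on every instance can be checked on a small hand-built example. The sketch below is an illustration under assumed numbers (four instances, five hypotheses, made-up posterior weights); it does not implement the predicate depth or its median.

```python
# Each hypothesis maps to (predictions on instances x0..x3, posterior weight).
# The weights are invented for illustration and sum to 1.
hypotheses = {
    "h0": ((1, 1, 1, 1), 0.28),  # the single most probable hypothesis
    "h1": ((0, 0, 0, 1), 0.18),
    "h2": ((0, 0, 1, 0), 0.18),
    "h3": ((0, 1, 0, 0), 0.18),
    "h4": ((1, 0, 0, 0), 0.18),
}


def map_hypothesis(hyps):
    """Name of the hypothesis with the largest posterior weight."""
    return max(hyps, key=lambda name: hyps[name][1])


def weighted_majority(hyps, n_instances):
    """Label each instance 1 if the posterior mass voting 1 exceeds 1/2."""
    votes = []
    for x in range(n_instances):
        mass_one = sum(w for preds, w in hyps.values() if preds[x] == 1)
        votes.append(1 if mass_one > 0.5 else 0)
    return tuple(votes)


best = map_hypothesis(hypotheses)
map_predictions = hypotheses[best][0]
majority = weighted_majority(hypotheses, 4)
print(best, map_predictions, majority)
```

On each instance, only h0 and one other hypothesis vote 1, for a total mass of 0.46, so the weighted majority votes 0 everywhere while the MAP hypothesis h0 predicts 1 everywhere.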

Ranking for Information Retrieval

Information retrieval measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR) are difficult to optimize directly since, viewed as functions of the model parameters, they are either flat or discontinuous everywhere. LambdaRank was an attempt to solve this, and in this [paper, tech report] we verified that LambdaRank does directly optimize NDCG, and we further demonstrated that it can easily be adapted to directly optimize MAP and MRR. This work was done using neural nets as the underlying model. In this paper, we showed that boosted tree classifiers can make excellent rankers; this, together with the fact that on large, artificial data sets boosted trees can do arbitrarily well, whereas neural nets cannot, suggests trying the LambdaRank idea (which applies to any model for which a gradient of the cost can be defined, not only neural nets) with boosted trees. The resulting algorithm is called LambdaMART [paper, tech report]. Our first ranking algorithm for Search, RankNet, is still an excellent method for training using pairwise preferences (for example, from clicks), and can also easily be adapted to work with boosted trees. For an overview of these algorithms see this [tech report].
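The "flat or discontinuous" behavior is easy to see numerically. The sketch below assumes the commonly used form of NDCG (gain 2^rel - 1, discount 1/log2(1 + rank)); the function names and the toy scores are hypothetical. Nudging a score leaves NDCG unchanged until two documents swap places, at which point it jumps.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(scores, relevances):
    """NDCG of the ranking induced by sorting documents by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranked = [relevances[i] for i in order]
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0


relevances = [0, 2, 1]  # toy relevance labels for three documents

before = ndcg([0.9, 0.50, 0.1], relevances)
nudged = ndcg([0.9, 0.89, 0.1], relevances)   # same ordering: no change
swapped = ndcg([0.9, 0.91, 0.1], relevances)  # documents 0 and 1 swap: jump

print(before, nudged, swapped)
```

Because the measure depends on the scores only through the sort order, its gradient with respect to the model parameters is zero wherever it is defined, which is why gradient-based training needs a surrogate such as the lambdas used by LambdaRank.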

We recently won the Yahoo! Learning to Rank Challenge (Track 1). We used an ensemble of 12 models, 8 of which were LambdaMART (plus 2 LambdaRank neural nets and 2 MART regression models). We were the only team we know of who directly optimized for the measure used in the competition (Expected Reciprocal Rank), demonstrating the flexibility of the LambdaMART approach. In Track 1, which was the standard Web search ranking problem, 312 teams submitted at least two models (and over a thousand submitted at least one). See here for more details (our team name was Ca3Si2O7, the chemical formula for Rankinite).
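For readers unfamiliar with Expected Reciprocal Rank, here is a minimal sketch of the measure as it is usually defined (a cascade model in which a user scans down the list and stops at the first satisfying document). The maximum grade of 4 matches the five-level relevance labels used in the challenge data; the function name `err` is hypothetical, and this is not our competition code.

```python
def err(ranked_relevances, max_grade=4):
    """Expected Reciprocal Rank for relevance grades listed in ranked order.

    A document with grade g satisfies the user with probability
    (2**g - 1) / 2**max_grade; the user stops at the first satisfying
    document, and the score is the expected value of 1 / (stopping rank).
    """
    p_reach = 1.0  # probability that the user reaches the current rank
    total = 0.0
    for rank, grade in enumerate(ranked_relevances, start=1):
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        total += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return total


print(err([4, 0, 0]))  # best document first
print(err([0, 4]))     # best document second
```

Like NDCG, ERR depends on the scores only through the ranking, so the same lambda-gradient idea applies: only the change in the measure under a pairwise swap needs to be computed.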

Review Articles and Talks

While MSR does not have research-specific courses per se, there are still many opportunities to teach and learn collaboratively. In MSR Redmond's Machine Learning Group (in which MLI is a subgroup) we have a Learning Theory Book Club, a Machine Learning Seminar, bi-weekly Brainstorming Tea Times, and bi-weekly group seminars. (By attending these, together with MSR-Redmond's very rich seminar series, it's quite possible to get no real work done at all.)

Here's a tutorial review article on Dimension Reduction. It covers many well-known, and some less well-known, methods for dimension reduction for which the inferred variables are continuous. Here's a lecture I gave recently at the University of Washington (part 1 of 2) on the mathematical foundations for machine learning. It's an updated version of lectures I gave at the machine learning summer school at the Max Planck Institute in Tuebingen in 2003; here's a condensed version of those lectures. Finally, here's an older review article on support vector machines.

Audio Fingerprinting

You have an incoming stream of audio and you'd like to know what's playing. Our RARE (Robust Audio Recognition Engine) system can identify any one of about a quarter million songs in real time using about 10% CPU on an 833 MHz PC. On 36 hours of noisy test audio, it achieves 0.2% false positives at a 4 × 10⁻⁶ false negative rate. Confirmation fingerprints can be used to significantly further improve these error rates, with almost no extra CPU cost. Our work is currently used in Windows Media Player and in the Zune media player. Audio fingerprinting has lots of applications: for example, to automatically construct audio thumbnails, and to automatically find duplicate audio clips on your PC. Our main innovations are a method to train for robustness to distortions, and a lookup method that is over an order of magnitude faster than competing methods for this problem. See here for details. Joint work with J. Platt, J. Goldstein, E. Renshaw, C. Herley.