Rich Caruana's Home Page

Click here
or on one of
the photos above to see photo albums with more pictures.

Click here
to see the fun we had with computational photography for infinite depth-of-field.

Click on the
thumbnails below to see a video clip of a large pod of dolphins that Dayne, Britt, Alex,
Greta, Diane, and I ran across while sailing to Catalina Island, and also to see clips of
jellyfish at the Monterey Bay Aquarium:

I joined Cornell's Department of Computer
Science in Fall 2001. Here's a
recent CV.

Research

Most of my research
is in data mining and machine learning, and the application of these to problems in medicine, ecology, and microprocessor design. I
do work on inductive transfer (a.k.a. multitask learning), ensemble
learning, probabilistic prediction, model compression, and regression. In general, I like to work on real problems,
and develop new learning methods by abstracting what is required to achieve good
performance on those problems.

We're doing new work
on what we call Model Compression, where we take a large, slow, but
accurate model and compress it into a much smaller, faster, yet still accurate
model. This allows us to separate the models used for learning from the
models used to deliver the learned function so that we can train large, complex
models such as ensembles, but later make them small enough to fit on a PDA, hearing aid, or
satellite. With model compression we can make models 1000 times smaller
and faster with little or no loss in accuracy. Here's our first paper on model
compression. Google has been funding this work.
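
Here's a minimal sketch of the compression recipe (not the code from the paper:
a random forest stands in for the slow teacher ensemble, the extra unlabeled
points come from simple Gaussian jitter rather than the paper's pseudo-data
generator, and a single small tree plays the role of the compact student model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# "Teacher": a big, slow, accurate ensemble trained on the labeled data.
teacher = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Extra unlabeled points near the training distribution (placeholder scheme:
# plain Gaussian jitter; the paper generates pseudo-data more carefully).
X_extra = X + np.random.default_rng(0).normal(scale=0.1, size=X.shape)
X_big = np.vstack([X, X_extra])
y_teacher = teacher.predict(X_big)            # teacher labels the enlarged set

# "Student": a much smaller, faster model trained to mimic the teacher.
student = DecisionTreeClassifier(max_depth=8).fit(X_big, y_teacher)
print("agreement with teacher:", (student.predict(X) == teacher.predict(X)).mean())
```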

One of my students,
Alex Niculescu-Mizil, has developed a method for multitask learning of Bayes Net
structures. The first paper on this topic was presented at AIStats07.
Here's a preprint.

We developed
a new ensemble learning method called Ensemble Selection. In
ensemble selection we train thousands of different models on the same training set
(no sampling or weighting), then carefully select from this library of models a small set
that yields the best performance when combined in an ensemble. Noteworthy
features of ensemble selection are that we train many different kinds of models (e.g.
SVMs, neural nets, bagged, boosted, and vanilla decision trees, kNN, boosted
stumps), that the
performance of the ensemble can be optimized to nearly any performance measure,
and that the method outperforms bagging, boosting, Bayesian model averaging, and all other
learning methods we've compared it to. Here's a paper on ensemble selection that was
presented at ICML 2004: caruana.icml04.crc.ps.
For an updated draft see: caruana.icml04.revised.rev2.ps. For
a bundle that contains both the revised ICML 2004 paper and a long version of
an ICDM 2006 paper that describes how to get even better performance from
ensemble selection, see: caruana.icml04.icdm06long.pdf.
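
Here's a stripped-down sketch of the selection step, assuming the model library is
already trained and `preds` holds each model's predicted probabilities on a held-out
hillclimbing set (the real method adds refinements such as sorted initialization and
bagged selection, described in the ICDM 2006 paper above):

```python
import numpy as np

def ensemble_selection(preds, y_true, metric, n_steps=50):
    """Greedy forward selection (with replacement) from a library of models.

    preds:  dict mapping model name -> predicted probabilities on the hillclimb set.
    metric: function(y_true, probs) -> score, higher is better.
    """
    chosen = []                                          # models picked so far
    ens_sum = np.zeros_like(next(iter(preds.values())))  # running sum of predictions
    for _ in range(n_steps):
        # Try adding every library model and keep the one that helps the ensemble most.
        scores = {name: metric(y_true, (ens_sum + p) / (len(chosen) + 1))
                  for name, p in preds.items()}
        best = max(scores, key=scores.get)
        chosen.append(best)
        ens_sum += preds[best]
    return chosen   # the final ensemble averages the predictions of these models

# Any metric can be plugged in, e.g. accuracy of thresholded probabilities:
accuracy = lambda y, p: float(np.mean((p > 0.5) == y))
```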

Along with the
ensemble selection work, we have been performing a comprehensive
empirical evaluation of machine learning methods. So far we have looked at
SVMs, neural nets, logistic regression, naive Bayes, many flavors of decision trees, bagged and boosted
decision trees, random forests, boosted stumps, and many k-nearest neighbor
methods. We are evaluating the performance of these
learning methods on a variety of performance metrics: accuracy, ROC area,
precision/recall break-even point, Lift, squared error, cross-entropy,
probability calibration, ... An ICML 2006 paper with the latest
results is at: modelsperf.icml.2006.pdf.
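
As a rough illustration of what scoring a model on several of these metrics looks
like (scikit-learn standing in for our PERF tool, with a synthetic placeholder
dataset and an off-the-shelf boosted-tree model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = GradientBoostingClassifier().fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]             # predicted probability of class 1

print("accuracy      :", accuracy_score(y_te, p > 0.5))
print("ROC area      :", roc_auc_score(y_te, p))
print("squared error :", brier_score_loss(y_te, p))
print("cross-entropy :", log_loss(y_te, p))
```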

While doing these experiments we discovered that boosted decision trees had
excellent performance on metrics such as accuracy, AUC, Lift, and
precision/recall, but predicted poorly calibrated probabilities and thus had
very bad
squared error and cross-entropy. By applying calibration to
the predictions made by boosting, we are able to get well-calibrated
probabilities from boosting, and boosted trees now outperform all other learning
methods we have tested on squared error and cross-entropy. Here's our AI
Stats 2005 paper on this.
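
Here's a hedged sketch of that calibration step: fit boosted trees, then learn a
mapping from their scores to well-calibrated probabilities on held-out data (Platt
scaling or isotonic regression). scikit-learn's CalibratedClassifierCV stands in
for the paper's own implementation, and the data below is only a placeholder:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

raw = GradientBoostingClassifier().fit(X_tr, y_tr)
calibrated = CalibratedClassifierCV(GradientBoostingClassifier(),
                                    method="isotonic", cv=5).fit(X_tr, y_tr)

print("squared error, raw       :",
      brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("squared error, calibrated:",
      brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```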

We have also begun
analyzing how the different performance metrics relate to each other, and
presented a paper at KDD2004 that uses multidimensional scaling and
correlation analysis to study ten metrics: perfs.kdd04.revised.rev1.pdf.
This paper compares Accuracy, F-score, Lift, AUC (Area under the ROC), Average
Precision, Precision/Recall Break-Even Point, Squared Error, Cross-Entropy, and
Probability Calibration.
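
In the same spirit, here's a small sketch of that style of analysis: given a table
of scores (rows are trained models, columns are metrics), correlate the metric
columns and embed 1 - |correlation| with multidimensional scaling so that metrics
which rank models similarly land near each other. The random scores below are
placeholders, not numbers from the paper:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
scores = rng.random((200, 9))               # 200 models x 9 metrics (placeholder)

corr = np.corrcoef(scores, rowvar=False)    # metric-by-metric correlation
dist = 1.0 - np.abs(corr)                   # turn similarity into a distance
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(coords)                               # 2-D layout of the metrics
```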

Thorsten Joachims and
I chaired the KDD-Cup in 2004. If you are interested in our code for
evaluating performance metrics (PERF), the best place to get it is from the
KDD-Cup 2004 web site: http://kodiak.cs.cornell.edu/kddcup/.
PERF calculates more than 20 different performance metrics, and can also
generate plots for AUC, precision/recall, Lift, accuracy vs. threshold, weighted
cost vs. threshold, ...

We presented a paper titled "Evaluating the C-Section Rate of Different
Physician Practices: Using Machine Learning to Model Standard Practice"
at the AMIA'2003 (American Medical Informatics Association) Conference. In
this work, bagged smoothed decision trees turned out to be the model of choice
(because they yielded probabilities with excellent calibration) for modeling the
risk of c-section for 22,157 expectant mothers. (This paper was nominated for a best paper award.)

With colleagues at CMU and the
University of Pittsburgh, we've been clustering proteins. Based on this
work we've developed a new approach to clustering called Meta
Clustering. Instead of laboriously defining a clustering distance
metric and then tuning the distance metric and clustering algorithm until you
get a useful clustering, Meta Clustering automatically generates many
qualitatively different, yet good, alternate clusterings of the data for
you. These alternate clusterings are then themselves clustered at a meta
level (yielding a clustering of clusterings) so that the user can efficiently
navigate to the clustering most useful for their purposes. This work is
supported by NSF CAREER Award #0347318. Here's the Meta Clustering web
page: http://www.cs.cornell.edu/~nhnguyen/metaclustering.htm.
Here's our
first paper on Meta Clustering: ICDM06.metaclust.caruana.pdf
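
Here's a rough sketch of the idea (not the code behind the paper or the web page
above): generate many diverse base clusterings, here by randomly re-weighting the
features before k-means, measure how similar each pair of clusterings is with the
adjusted Rand index, and then cluster the clusterings themselves so the different
views can be browsed at the meta level:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, n_features=10, centers=5, random_state=3)
rng = np.random.default_rng(3)

# 1. Many base clusterings, each seeing a randomly re-weighted feature space.
base = [KMeans(n_clusters=5, n_init=5, random_state=i)
            .fit_predict(X * rng.random(X.shape[1]))
        for i in range(30)]

# 2. Pairwise similarity between clusterings (adjusted Rand index).
n = len(base)
sim = np.array([[adjusted_rand_score(base[i], base[j]) for j in range(n)]
                for i in range(n)])

# 3. Cluster the clusterings: a clustering of clusterings at the meta level.
dist = squareform(1.0 - sim, checks=False)   # condensed distance matrix
meta = fcluster(linkage(dist, method="average"), t=4, criterion="maxclust")
print("meta-cluster of each base clustering:", meta)
```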

David Cohn, Andrew
McCallum and I did some of the first work in semi-supervised clustering back in
1999. See http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cis/TR2003-1892
for a tech report we published years later. By modern standards it's
somewhat passé, but it was cool stuff that was ahead of its time back when we did it. For a better,
more recent paper related to this topic, see the Meta Clustering paper in the
paragraph above.

Pictures

Jordan Erenrich and I spent an afternoon taking pictures of a chess board
with the camera set to different focus distances to create one combined image
with infinite depth-of-focus. See the results at: http://www.cs.cornell.edu/~erenrich/dof/
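
For the curious, here's a tiny sketch of the basic focus-stacking idea (assuming
pre-aligned grayscale frames; a real pipeline also has to register the shots and
handle color):

```python
import numpy as np
from scipy.ndimage import laplace

def focus_stack(frames):
    """Merge frames of the same scene shot at different focus distances.

    frames: list of 2-D float arrays. For each pixel, keep the value from the
    frame with the largest local Laplacian response (a cheap sharpness cue).
    """
    stack = np.stack(frames)                              # (n_frames, H, W)
    sharpness = np.abs(np.stack([laplace(f) for f in frames]))
    best = sharpness.argmax(axis=0)                       # sharpest frame per pixel
    return np.take_along_axis(stack, best[None], axis=0)[0]
```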

Conference Papers

Rich Caruana, Art Munson, and Alexandru
Niculescu-Mizil, "Getting the Most Out of Ensemble Selection," to
appear in the Proceedings of the Sixth International Conference on Data Mining
(ICDM'06), December 2006.

Rich Caruana, Mohamed Elhawary, Nam Nguyen,
and Casey Smith, "Meta Clustering," to appear in the Proceedings of
the Sixth International Conference on Data Mining (ICDM'06), December 2006.

Engin Ipek, Sally McKee, Bronis de Supinski,
M. Schulz, and Rich Caruana, "Efficiently Exploring Architectural Design
Spaces via Predictive Modeling," to appear in The Proceedings of the 12th
International Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS), October 2006.

Alexandru
Niculescu-Mizil and Rich Caruana, "Predicting Good Probabilities," The
Proceedings of the 22nd International Conference on Machine Learning (ICML'05),
pp. 625-632. (Received the best student paper award at ICML. Also an
oral presentation at the 2005 Snowbird Workshop on Machine Learning.)