My research is generally concerned with machine learning and data mining, with a particular interest in producing features and models that are robust to changes in the distribution of the data they describe (Thesis proposal, ICDM 2007). To this end, I am especially interested in transfer learning, with an emphasis on domain adaptation.

I am currently working on The Querendipity Project, whose goal
is to more accurately integrate and exploit the many heterogeneous sources of information available to a modern scientist.
Taking advantage of, among other sources, citation networks (such as CiteSeer),
full-text archives (such as PubMed Central), and curated databases (such as
the Saccharomyces Genome Database), we are able to help users discover both
relevant and novel research related to their interests (ICWSM 2009).

I am also a member of the SLIF team, working on mining text and images together for bioinformatics applications; our team was recently named one of four finalists in the $50,000 Elsevier Grand Challenge. Specifically, my work uses the text of biological journal articles (e.g., captions, abstracts, and main text) along with their associated images (depicting cells, proteins, graphs, etc.) to better identify entities in both media. Combining these two expressions (text and images) of the same underlying concept (the experiment being performed) into new features that jointly describe both media yields a closer representation of the object a user is actually interested in than disjoint text and image features alone. A related problem is transfer learning: we take models and named-entity extractors trained on one type of data (abstract text, for instance) and adapt them to a related but distinct type of data (caption text) (CIKM 2008, ACL 2008). The intuition is that it is easier to learn a concept once a related concept has already been mastered.
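As a concrete illustration of domain adaptation (a widely used baseline, not the specific method from the papers above), Daumé's "frustratingly easy" feature augmentation copies each feature into a shared version and a domain-specific version, so a single downstream model can learn which feature behaviors transfer between, say, abstract text and caption text. The feature names below are invented for the example.

```python
def augment(features, domain):
    """EasyAdapt-style feature augmentation (Daume III, 2007).

    Each input feature gets a shared copy (active in every domain)
    plus a domain-specific copy (active only in its own domain), so
    a linear model trained on the union of domains can separate
    transferable weights from domain-specific ones.
    """
    out = {}
    for name, value in features.items():
        out["shared:" + name] = value   # weight shared across domains
        out[domain + ":" + name] = value  # weight specific to this domain
    return out

# Hypothetical token features from abstract (source) vs. caption (target) text:
src = augment({"word=GAL4": 1, "is_capitalized": 1}, "abstract")
tgt = augment({"word=GAL4": 1, "inside_parens": 1}, "caption")
```

A model trained on both feature sets can then use the `shared:` weights for evidence common to abstracts and captions, while the `abstract:` and `caption:` weights absorb domain-specific quirks.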

I have also been lucky to pursue related work outside of school during summer internships. While working with Hang Li and Tie-Yan Liu in the Web Search and Mining group at Microsoft Research Asia, we developed novel semi-supervised and transfer-learning-based methods for improving internet search through query-dependent ranking (SIGIR 2008). The idea behind this work is that, regardless of the specific topic users are interested in, there are common features linking certain types of queries together. For instance, users searching for either a person or a company name might both be most interested in the corresponding home page (navigational queries), while users searching for a disease or a country name might be more interested in authoritative sources of information about those topics (informational queries). By modeling and leveraging these distributions over query types, we can better decide what, exactly, users want and deliver it to them.
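A toy sketch of the general idea (illustrative only, not the SIGIR 2008 method itself, which learns rankers jointly across related query types): once a query is labeled navigational or informational, route it to a scoring function tuned for that type. The document fields and weights here are invented for the example.

```python
# Query-dependent ranking sketch: each query type gets its own scorer.

def navigational_score(doc):
    # Navigational queries strongly favor the official home page.
    return 2.0 * doc["is_homepage"] + doc["relevance"]

def informational_score(doc):
    # Informational queries favor authoritative content pages.
    return 2.0 * doc["authority"] + doc["relevance"]

RANKERS = {
    "navigational": navigational_score,
    "informational": informational_score,
}

def rank(query_type, docs):
    """Order documents using the scorer for the query's type."""
    return sorted(docs, key=RANKERS[query_type], reverse=True)

docs = [
    {"url": "ibm.com", "is_homepage": 1, "authority": 0.6, "relevance": 0.5},
    {"url": "en.wikipedia.org/wiki/IBM", "is_homepage": 0, "authority": 0.9,
     "relevance": 0.7},
]
```

For the query "IBM", the navigational ranker puts the home page first, while the informational ranker prefers the encyclopedia article; the same documents, ordered differently depending on inferred intent.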

Relatedly, while in the Data Analytics group at IBM Research Watson, I worked with Naoki Abe and Yan Liu on methods for learning causal models from temporally ordered data (KDD 2007). We felt that the interpretability offered by a causal model is especially valuable to an end user trying to understand the process being studied. This kind of understanding is an essential component of the scientific process, since it points the researcher toward the next experiment to perform; an accurate predictive model without an interpretation provides little insight into which direction is best to pursue.
This was also the motivation behind my work with Richard Scheines and Joseph E. Beck on discovering predictive, semantically and scientifically interpretable high-level features as functions of raw, event-level data (AAAI 2006, 2005).
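As a minimal sketch of the general approach to temporal causal discovery (a Granger-style lagged-regression test, not necessarily the exact KDD 2007 formulation): variable x is flagged as a candidate cause of y if x's past values improve prediction of y beyond what y's own past provides. The threshold and single-lag setup below are simplifying assumptions; a real test would use an F-statistic and multiple lags.

```python
import numpy as np

def granger_improves(x, y, lag=1, threshold=0.9):
    """Does x's past help predict y beyond y's own past?

    Fits y[t] from y[t-lag] alone, then from (y[t-lag], x[t-lag]),
    and reports whether adding x's past shrinks the residual sum of
    squares below `threshold` times the baseline.
    """
    y_past, x_past, y_now = y[:-lag], x[:-lag], y[lag:]

    def rss(cols):
        A = np.column_stack(cols + [np.ones_like(y_now)])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y_now, rcond=None)
        return float(np.sum((y_now - A @ coef) ** 2))

    return rss([y_past, x_past]) < threshold * rss([y_past])

# Synthetic data where x drives y with a one-step delay:
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()
```

On this data the test fires in the x-to-y direction but not the reverse, recovering the asymmetry that makes such temporally ordered models readable as causal hypotheses.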