My research on data analytics focuses on the design of interactive visualizations and statistical models to enable people and algorithms to work in tandem to yield insights from complex data.

My work has appeared in top-tier venues in human-computer interaction (CHI, TOCHI), machine learning (ICML), and natural language processing (EMNLP), and has contributed to publications in multiple disciplines including social sciences and genetics research.

We surveyed and analyzed how experts organize concepts in their domain of expertise. Based on human analytic strategies, we formalized four categories of topical misalignments representing the full range of errors that can occur when matching model outputs to a set of reference concepts. We examined topic models built using 10,000 parameter settings and evaluated the performance of hyper-parameter optimization, inference algorithms, and intrinsic measures of topical quality.

Sentiment Classification and Visualization (Analytics, Vis, ML, Domain)

At EMNLP 2013, we presented a deep learning algorithm that achieves the current highest accuracy in sentiment classification of sentences.

During algorithm development, we applied visual analytics to investigate how well our technique captures the effects of sentiment negations. My visualizations enabled more rapid exploration of manually-labeled sentiment datasets, and revealed patterns of negations. I incorporated interactions, such as highlighting sentences in which the linguistic signifiers of negations are situated far apart, to help machine learning researchers gain insights about the model.

Termite: Topic Model Visualization (Analytics, Vis, ML, HCI, Domain)

I designed Termite, a visual analysis tool for builders and users of statistical topic models. It was originally introduced at AVI 2012, with updates to appear at a NIPS 2013 workshop. My tool enables more rapid and accurate evaluation of model quality.

We incorporated a matrix view to support the assessment of topical term distributions and aid the comparison of latent topics. We developed a seriation algorithm that arranges words to reveal the clustering of related terms and promote the legibility of multi-word phrases. We devised a saliency measure to highlight distinctive vocabulary.
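As an illustration of what a saliency measure for distinctive vocabulary can look like, here is a minimal sketch. It assumes saliency is the overall probability of a word multiplied by its distinctiveness, where distinctiveness is the KL divergence between the topic distribution given the word and the overall topic distribution. That definition, the function name `term_saliency`, and the input layout are assumptions made for this example and are not specified in the text above.

```python
import numpy as np


def term_saliency(topic_term: np.ndarray, topic_weights: np.ndarray) -> np.ndarray:
    """Score each word by frequency times topical distinctiveness.

    topic_term[t, w] is P(word w | topic t); each row sums to 1.
    topic_weights[t] is P(topic t); entries are positive and sum to 1.
    Assumes every word has nonzero overall probability.
    """
    prior = topic_weights[:, None]              # P(t) as a column
    joint = topic_term * prior                  # P(w, t)
    p_word = joint.sum(axis=0)                  # P(w)
    p_topic_given_word = joint / p_word         # P(t | w)

    # KL divergence of P(t | w) from P(t); a zero probability contributes 0.
    safe = np.where(p_topic_given_word > 0, p_topic_given_word, 1.0)
    distinctiveness = (p_topic_given_word * np.log(safe / prior)).sum(axis=0)

    return p_word * distinctiveness


# Toy model with two topics and three words (hypothetical numbers).
toy_topic_term = np.array([[0.5, 0.5, 0.0],
                           [0.5, 0.0, 0.5]])
toy_topic_weights = np.array([0.5, 0.5])
toy_saliency = term_saliency(toy_topic_term, toy_topic_weights)
```

In the toy model, the first word is equally likely under both topics, so it scores zero even though it is the most frequent word, while the two topic-specific words receive equal positive scores.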

Design of Model-Driven Visualizations (Analytics, Vis, ML, HCI, Domain)

At CHI 2012, I presented a set of principles and processes for designing model-driven visualizations. I described my experiences building the Dissertation Browser, a tool for investigating the impact of inter-disciplinary collaborations at Stanford University.

During tool development, we sought expert feedback, and jointly explored modeling capabilities and visual designs to address interpretation and trust issues that hinder analysis. Our iterative design process led to a novel “word borrowing” algorithm, judged as the most accurate by domain experts, outperforming all other models considered at the start of the analysis.

Mapping Intellectual Changes in Academia (Analytics, Vis, ML, Domain)

We mapped the evolution of 30 years of academic discourse based on topical analysis of 1.05 million Ph.D. dissertations. My visualizations helped machine learning researchers verify model stability, and allowed collaborating social scientists to test alternative hypotheses and verify their discoveries.

I studied how people summarize text using descriptive phrases, and developed a novel algorithm for extracting keyphrases from documents.

In my TOCHI 2012 article, we described our user study on human-generated keyphrases. We systematically examined linguistic features predictive of high-quality summary terms, and developed a model to automatically extract descriptive phrases from text. We identified issues of specificity and redundancy through crowd-sourced user evaluations, and proposed additional algorithms to support adaptive selection of keyphrases. We demonstrated novel text visualizations enabled by our algorithms.

Color Categories (Vis, HCI)

I devised a probabilistic model for quantifying the effects of languages on color perception, based on an analysis of the World Color Survey and English color naming data collected on the web.

In my CIC 2008 paper, we demonstrated that the model can identify well-named regions of the color space.

Interactive Machine Translation (ML, HCI, Domain)

While machine translation quality has improved considerably over the last decade, its adoption rate by professional translators remains low, often due to mistrust of system performance.

In this ongoing project, we are studying how to best introduce the technology, so that machine translation can contribute to translators' existing workflow and gain increased acceptance. Conversely, we are also investigating whether a language model can learn from its interactions with professional translators and improve its quality.

Word Vectors for Exploratory Text Analysis (Analytics, ML, HCI)

Our results applying deep learning to sentiment analysis demonstrate the potential of utilizing a rich word representation to improve text classification. The approach is feasible, because the underlying word vectors are invisible to end users who interact with the model only through predefined and interpretable sentiment labels.

In this ongoing project, I am investigating the use of continuous word embedding algorithms to support exploratory text analysis, such as extracting themes from documents and identifying meaningful word dimensions, in the absence of semantically well-defined axes.
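One common way to explore a word embedding without predefined axes is to ask which words lie closest to a query word. The sketch below ranks neighbors by cosine similarity. The function name `nearest_words`, the two-dimensional toy vectors, and the vocabulary are hypothetical and chosen only for illustration; they do not come from the project itself.

```python
import numpy as np


def nearest_words(query, vocab, vectors, k=3):
    """Return the k words whose vectors are most similar to the query word.

    vocab is a list of words; vectors[i] is the embedding of vocab[i].
    Similarity is the cosine of the angle between two vectors.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarities = unit @ unit[vocab.index(query)]
    ranked = np.argsort(-similarities)          # most similar first
    neighbors = [(vocab[i], float(similarities[i]))
                 for i in ranked if vocab[i] != query]
    return neighbors[:k]


# Hypothetical toy embedding; real embeddings have hundreds of dimensions.
toy_vocab = ["good", "great", "bad"]
toy_vectors = np.array([[1.0, 0.1],
                        [0.9, 0.2],
                        [-1.0, 0.1]])
toy_neighbors = nearest_words("good", toy_vocab, toy_vectors, k=2)
```

A neighbor list like this gives an analyst a starting point, but it does not by itself say which dimension of meaning makes two words similar, which is the harder question raised above.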

Multilingual News Sharing on Twitter (Analytics, ML, Domain)

In collaboration with communication researchers, we collected and analyzed news articles shared by Twitter users, based on 1.4 billion tweets over a 12-month period. By tracking references to named entities across 19 languages and measuring their effects on selective news consumption, we conducted one of the largest studies on gatekeeping.

Currently under submission.

History of Computational Linguistics (Vis, ML, Domain)

In collaboration with social scientists, we explored the citation graphs and the flow of ideas in 45 years of computational linguistics research.

My visualization enabled detailed examination of the lines of research as predicted by the Topic Flow algorithm. My tool revealed previously unknown issues in the algorithm, such as unintended accumulation of flows due to cycles in the citation graph.
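Cycles of this kind can also be checked for directly. The following sketch finds groups of papers that cite one another in a loop by computing strongly connected components with Tarjan's algorithm. It is a generic graph check, not part of the Topic Flow algorithm, and the function name `find_citation_cycles` and the edge-list input are assumptions made for this example.

```python
from collections import defaultdict


def find_citation_cycles(citations):
    """Return sorted groups of papers that lie on a common citation cycle.

    citations is an iterable of (citing, cited) pairs.
    Uses recursive depth-first search, so very deep graphs may need an
    iterative variant to stay within Python's recursion limit.
    """
    graph = defaultdict(list)
    nodes = set()
    for citing, cited in citations:
        graph[citing].append(cited)
        nodes.update((citing, cited))

    index, lowlink = {}, {}
    stack, on_stack = [], set()
    cycles = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt not in index:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index[nxt])
        if lowlink[node] == index[node]:
            # node is the root of a strongly connected component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            # keep only components that actually contain a cycle
            if len(component) > 1 or node in graph[node]:
                cycles.append(sorted(component))

    for node in sorted(nodes):
        if node not in index:
            visit(node)
    return cycles


# Hypothetical citation edges: A cites B, B cites C, C cites A and D.
toy_cycles = find_citation_cycles(
    [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
)
```

In the toy graph, papers A, B, and C form one cycle, while D is cited but cites nothing and is therefore not reported.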

Semantic Text Zooming (Vis, ML)

My text shortening algorithm can progressively shorten phrases between 2 and 8 words in length, based on examples from Wikipedia. I demonstrated that the technique enables adaptive resizing of text visualizations to fit small displays.