Linguistic Extensions of Topic Models

Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing large datasets where observations are collected into groups. Although topic modeling
has been fruitfully applied to problems in social science, biology, and computer vision,
it has been most widely used to model datasets where documents are modeled as
exchangeable groups of words. In this context, topic models discover topics: distributions over words that express a coherent theme like “business” or “politics.” While
one of the strengths of topic models is that they make few assumptions about the
underlying data, such a general approach sometimes limits the type of problems topic
models can solve.
When we restrict our focus to natural language datasets, we can use insights from
linguistics to create models that understand and discover richer language patterns. In
this thesis, we extend LDA in three different ways: adding knowledge of word meaning, modeling multiple languages, and incorporating local syntactic context. These
extensions apply topic models to new problems, such as discovering the meaning of ambiguous words; extend topic models to new datasets, such as unaligned multilingual corpora; and combine topic models with other sources of information about documents’ context.