Linguistic Extensions of Topic Models (thesis)

Abstract:

Topic models like latent Dirichlet allocation (LDA) provide a framework for analyzing
large datasets where observations are collected into groups. Although topic modeling
has been fruitfully applied to problems in social science, biology, and computer vision,
it has been most widely used to model datasets where documents are treated as
exchangeable groups of words. In this context, topic models discover topics, distributions
over words that express a coherent theme like "business" or "politics." While
one of the strengths of topic models is that they make few assumptions about the
underlying data, such a general approach sometimes limits the types of problems topic
models can solve.
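
The generative process that LDA assumes, in which each document is an exchangeable group of words drawn from a document-specific mixture of topics, can be illustrated with a minimal sketch. The function name, corpus sizes, and hyperparameter values (alpha, eta) below are hypothetical choices made only for illustration; they are not settings used in this thesis.

```python
import numpy as np

def generate_corpus(num_docs=5, doc_len=20, num_topics=3, vocab_size=10,
                    alpha=0.5, eta=0.5, seed=0):
    """Sample a toy corpus from the standard LDA generative process.

    All sizes and hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over the vocabulary.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=num_topics)
    docs = []
    for _ in range(num_docs):
        # Per-document topic proportions.
        theta = rng.dirichlet(np.full(num_topics, alpha))
        # Draw a topic assignment for every token in the document ...
        z = rng.choice(num_topics, size=doc_len, p=theta)
        # ... then draw each word from its assigned topic.
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return topics, docs

topics, docs = generate_corpus()
```

Because the words of a document are conditionally independent given its topic proportions, their order carries no information in this process, which is the exchangeability assumption described above.
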
When we restrict our focus to natural language datasets, we can use insights from
linguistics to create models that understand and discover richer language patterns. In
this thesis, we extend LDA in three different ways: adding knowledge of word meaning,
modeling multiple languages, and incorporating local syntactic context. These
extensions apply topic models to new problems, such as discovering the meaning of
ambiguous words, extend topic models for new datasets, such as unaligned multilingual
corpora, and combine topic models with other sources of information about
documents' context.
In Chapter 2, we present latent Dirichlet allocation with WordNet (LDAWN),
an unsupervised probabilistic topic model that includes word sense as a hidden variable.
LDAWN replaces the multinomial topics of LDA with Abney and Light's distribution
over meanings. Thus, posterior inference in this model discovers not only the
topical domains of each token, as in LDA, but also the meaning associated with each
token. We show that considering more topics improves performance on word sense
disambiguation.
LDAWN allows us to separate the representation of meaning from how that meaning
is expressed as word forms. In Chapter 3, we extend LDAWN to allow meanings
to be expressed using different word forms in different languages. In addition to the
disambiguation provided by LDAWN, this offers a new method of using topic models
on corpora with multiple languages.
In Chapter 4, we relax the assumptions of multilingual LDAWN. We present the
multilingual topic model for unaligned text (MuTo). Like multilingual LDAWN, it
is a probabilistic model of text that is designed to analyze corpora composed of
documents in multiple languages. Unlike multilingual LDAWN, which requires the
correspondence between languages to be painstakingly annotated, MuTo uses
stochastic EM to discover a matching between the languages
while it simultaneously learns multilingual topics. We demonstrate that MuTo allows
the meaning of similar documents to be recovered across languages.
In Chapter 5, we address a recurring problem that hindered the performance of the
models presented in the previous chapters: the lack of a local context. We develop the
syntactic topic model (STM), a non-parametric Bayesian model of parsed documents.
The STM generates words that are both thematically and syntactically constrained,
which combines the semantic insights of topic models with the syntactic information
available from parse trees. Each word of a sentence is generated by a distribution that
combines document-specific topic weights and parse-tree-specific syntactic transitions.
Words are assumed to be generated in an order that respects the parse tree. We
derive an approximate posterior inference method based on variational methods for
hierarchical Dirichlet processes, and we report qualitative and quantitative results on
both synthetic data and hand-parsed documents.
In Chapter 6, we conclude with a discussion of how the models presented in this
thesis can be applied in real-world applications such as sentiment analysis and how
the models can be extended to capture even richer linguistic information from text.