Code for Unsupervised Grammar Induction

Background

Unsupervised grammar induction is the task of learning a grammar
when the only input is sentences in natural language. Besides being
challenging and instructive in its own right, the task can have an impact
on applications that need a parser at some stage in their pipeline,
especially when these applications are intended for languages for which
annotated data is scarce or does not exist at all. The backbone of such
parsers is a probabilistic grammar, which encodes the structure of the
relevant language.

DAGEEM is a piece of software, written in C++, which is designed
to estimate the parameters of such probabilistic grammars. Currently,
DAGEEM supports only the DMV ("dependency model with valence"),
originally designed by Klein and Manning (2004). The DMV
is widely recognized as an effective model for dependency grammar induction,
and has recently been used in various settings for this task.

Central to DAGEEM is the use of a logistic normal distribution on the
grammar parameters. The logistic normal distribution, which
was successfully used for topic modeling by
Blei and Lafferty (2006),
offers advantages, both conceptually and empirically, over the
Dirichlet distribution, which is more commonly used because of its
mathematical elegance. This piece of software
extends the inference algorithm suggested by Blei and Lafferty, and
uses a variational approximation in a Bayesian setting to estimate the
grammar parameters.
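
To make the logistic normal distribution concrete, the following is a minimal
Python sketch of how one probability vector can be drawn from it: sample a
Gaussian vector, then map it onto the probability simplex with a softmax. It
is an illustration only and is not taken from DAGEEM. The function names, the
example mean, and the example Cholesky factor are all assumptions made for
this sketch. In a model such as the DMV, each multinomial in the grammar would
correspond to one such vector.

```python
import math
import random


def softmax(eta):
    """Map a real-valued vector to a probability vector."""
    m = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in eta]
    total = sum(exps)
    return [e / total for e in exps]


def sample_logistic_normal(mu, chol, rng):
    """Draw theta = softmax(eta), where eta ~ Normal(mu, chol * chol^T).

    mu   -- mean vector (hypothetical values in the example below)
    chol -- lower-triangular Cholesky factor of the covariance matrix
    rng  -- a random.Random instance
    """
    z = [rng.gauss(0.0, 1.0) for _ in mu]  # independent standard normals
    eta = [
        m + sum(row[j] * z[j] for j in range(i + 1))
        for i, (m, row) in enumerate(zip(mu, chol))
    ]
    return softmax(eta)


# Hypothetical three-event multinomial; the off-diagonal entry in the
# Cholesky factor makes the first two coordinates positively correlated.
rng = random.Random(0)
mu = [0.0, 1.0, -1.0]
chol = [
    [0.5, 0.0, 0.0],
    [0.4, 0.3, 0.0],
    [0.0, 0.0, 0.5],
]
theta = sample_logistic_normal(mu, chol, rng)
```

The covariance matrix is what distinguishes this prior from a Dirichlet: it
lets the prior express that some events in a multinomial tend to be probable
together, a kind of correlation that the Dirichlet cannot represent.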

Download

The latest version of DAGEEM can be downloaded here from github.com. A zip file can be
downloaded here. The older version (1.0) can be downloaded from Google Code
here. In 2011, the package was rewritten
in Java, and the old C++ code is no longer available for download. Contact me if you are interested in the old C++ code.
The package does not include the
data sets used in the paper. The data sets are based on the Penn Treebank,
and therefore must be licensed separately through the LDC.