TADM

The Toolkit for Advanced Discriminative Modeling

Introduction

The Toolkit for Advanced Discriminative Modeling (TADM) is a C++
implementation for estimating the parameters of discriminative models,
such as maximum entropy models. It uses the PETSc and TAO toolkits to provide
high performance and scalability. It was written by Rob Malouf and is now
being developed as an open source project on SourceForge in collaboration
with Jason Baldridge and Miles Osborne. It is licensed under the GNU Lesser
General Public License.

Background

Maximum entropy (ME) modeling is attractive because it is a
general-purpose technique that can be applied to a wide variety of
problems in natural language processing. Indeed, recent years have seen
ME techniques used for sentence boundary detection, part-of-speech
tagging, parse selection and ambiguity resolution, and stochastic
attribute-value grammars, to name just a few applications (see, e.g.,
Berger et al. 1996; Ratnaparkhi 1998; Johnson et al. 1999; Osborne
2000). However, while parameter
estimation for ME models is conceptually straightforward, in practice
ME models for typical natural language tasks are usually large, and
frequently contain thousands of free parameters. Estimation of such
large models is not only expensive, but also, due to sparsely
distributed features, sensitive to round-off errors.

Input format

The format for event files:

2
5 2 0 1 1 2
3 2 0 3 2 1
3
10 1 3 1
6 2 0 2 2 2
3 1 2 1

An event file may begin with a header, bracketed by lines containing
&header and /. The header is optional and, if present, is ignored (the
example above has none). The rest of the file is a sequence of blocks,
one per context. The first line of each block is the number of events
for that context (2 and 3 for the two contexts here). Then come the
events, one per line. Each event line has a frequency, the number of
feature-value pairs, and then the pairs themselves, each a feature
number followed by its value. For example, the line 5 2 0 1 1 2
describes an event with frequency 5 and two pairs: feature 0 with
value 1 and feature 1 with value 2. Features are numbered starting
with zero. Each feature can appear only once in an event, and must
have a value greater than zero. You can have events with a zero
frequency -- these are used in computing the normalizing constant Z(x)
for each context, but ignored for computing the entropy and KL
divergence. Any feature with an expected value of zero is ignored
(i.e., the corresponding parameter is set to 0.0).
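
To make the layout concrete, here is a minimal Python sketch of a reader
for this format. It is not part of TADM, and the names Event,
parse_events, and context_distribution are hypothetical. It assumes
whitespace-separated fields, a header that starts with a line beginning
with &header and ends with a line holding only /, and frequencies and
values that can be read as floating-point numbers. The second function
uses the standard conditional ME form, p(y|x) = exp(sum_i lambda_i *
f_i(x,y)) / Z(x), to show why zero-frequency events still matter: every
event in a context contributes to Z(x).

```python
import math
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    frequency: float            # observed count for this event (may be 0)
    features: Dict[int, float]  # feature number -> feature value


def parse_events(lines: Iterable[str]) -> List[List[Event]]:
    """Parse event-file lines into a list of contexts (each a list of events)."""
    rows = [line.split() for line in lines if line.strip()]
    pos = 0
    # Optional header (assumed syntax): skip from "&header" through the lone "/".
    if rows and rows[0][0].startswith("&header"):
        while pos < len(rows) and rows[pos] != ["/"]:
            pos += 1
        pos += 1
    contexts = []
    while pos < len(rows):
        n_events = int(rows[pos][0])  # first line of a block: number of events
        pos += 1
        context = []
        for fields in rows[pos:pos + n_events]:
            n_pairs = int(fields[1])
            pairs = fields[2:2 + 2 * n_pairs]
            features = {int(pairs[i]): float(pairs[i + 1])
                        for i in range(0, len(pairs), 2)}
            context.append(Event(float(fields[0]), features))
        contexts.append(context)
        pos += n_events
    return contexts


def context_distribution(context: List[Event],
                         params: Dict[int, float]) -> List[float]:
    """Return p(event | context) under the standard conditional ME form."""
    scores = [sum(params.get(f, 0.0) * v for f, v in e.features.items())
              for e in context]
    top = max(scores)                       # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)                        # Z(x), up to the factor exp(top)
    return [w / z for w in weights]


SAMPLE = """\
2
5 2 0 1 1 2
3 2 0 3 2 1
3
10 1 3 1
6 2 0 2 2 2
3 1 2 1
"""

contexts = parse_events(SAMPLE.splitlines())
```

Here contexts holds two lists, of two and three events respectively, and
context_distribution(contexts[0], {}) gives the uniform distribution that
results when all parameters are zero.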

Event files can be compressed using gzip. As event files tend to get very
large, this can save a lot of disk space and improve performance
dramatically.
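
If you generate event files from a script, you can write them compressed
directly. The following is a small Python sketch using the standard gzip
module; the helper names write_events_gz and open_events are hypothetical
and not part of TADM. The reader decides whether a file is compressed by
checking the two-byte gzip magic number, which is a choice made for this
sketch only; how tadm itself recognizes compressed input is not described
here.

```python
import gzip


def write_events_gz(path: str, text: str) -> None:
    """Write event-file text to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="ascii") as out:
        out.write(text)


def open_events(path: str):
    """Open an event file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")
```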

Usage

The tadm executable takes all its commands as options on
the command line. Some of the most interesting options are:

-events_in <filename>

file to read the events from (required)

-params_out <filename>

file to write parameter values to

-method <method>

optimization method to use
(reasonable choices are tao_lmvm, tao_cg_prp,
iis, gis, steep; there are other choices
but using them isn't a good idea) (default = tao_lmvm)

-monitor

display progress towards convergence

-max_it <n>

stop if the optimizer has not converged after n iterations (default = 9999)

-frtol <d>

relative stopping tolerance (if
frtol=.001 then the final log-likelihood will be accurate to about 3
significant digits) (default = 1e-7)

-fatol <d>

absolute stopping tolerance
(fatol=.001 means stop when the log-likelihood improves between iterations
by less than .001) (default = 1e-10)
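
As a rough illustration of the two tolerances, the Python sketch below
encodes one plausible reading of the descriptions above: stop when the
change in log-likelihood between iterations falls below fatol, or below
frtol times the magnitude of the current log-likelihood. The function
has_converged is hypothetical. The convergence tests actually applied are
those of the underlying TAO solvers, which may differ in detail, so treat
this only as an aid to intuition.

```python
def has_converged(prev_ll: float, curr_ll: float,
                  frtol: float = 1e-7, fatol: float = 1e-10) -> bool:
    """Return True if either stopping tolerance is met (illustrative reading)."""
    change = abs(curr_ll - prev_ll)
    if change < fatol:                    # absolute test
        return True
    return change < frtol * abs(curr_ll)  # relative test
```

With frtol=.001 and a log-likelihood near -1000, for instance, this
reading stops once successive iterations differ by less than about 1.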

Some recently added options are not yet documented. There are also
scores of other options which get passed
on to PETSc and TAO (the option -help will list some of them,
and more are listed in the documentation for the libraries), but most
of them are mainly for profiling and tuning the underlying solvers.
Feel free to tinker with the options (the SNES options look
particularly interesting and particularly daunting), and let me know
if any of them improve anything.

Most of the options have reasonable defaults (except -events_in,
which you need to give a value for, and -params_out, which you
probably want to give a value for) and can be left out. One feature
that's kind of cute is that on startup the program reads default
settings from ~/.petscrc (or a different file specified by the
option -options_file). This file can also have alias
statements, to allow abbreviations for some of the option
names. For example, my .petscrc contains:

-monitor
alias -in -events_in
alias -out -params_out
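
When tadm is driven from a script, the options described in this section
can be assembled programmatically. The Python sketch below builds an
argument list from the documented options only and prints it as a
shell-quoted command; it does not run anything. The function tadm_command
and the file names are hypothetical, and the sketch assumes the
executable is called tadm and is on your PATH.

```python
import shlex
from typing import List, Optional


def tadm_command(events_in: str,
                 params_out: Optional[str] = None,
                 method: Optional[str] = None,
                 monitor: bool = False,
                 max_it: Optional[int] = None,
                 frtol: Optional[float] = None,
                 fatol: Optional[float] = None) -> List[str]:
    """Build a tadm argument list; options left unset fall back to defaults."""
    argv = ["tadm", "-events_in", events_in]  # -events_in is required
    if params_out is not None:
        argv += ["-params_out", params_out]
    if method is not None:
        argv += ["-method", method]
    if monitor:
        argv.append("-monitor")               # flag without a value
    for flag, value in (("-max_it", max_it),
                        ("-frtol", frtol),
                        ("-fatol", fatol)):
        if value is not None:
            argv += [flag, str(value)]
    return argv


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    argv = tadm_command("train.events.gz", params_out="train.params",
                        method="tao_lmvm", monitor=True, max_it=500)
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the command as a list rather than a single string avoids quoting
problems if a file name contains spaces, and the list can be handed
directly to subprocess.run.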

Parallel processing

Since tadm uses MPI for interprocess
communication, it can easily be ported to a wide range of parallel
architectures, including SMP machines and Beowulf-type clusters. Documentation
for how to do this will come in future releases.

Changes

version 0.9.5 - First TADM release, basically Rob Malouf's
original code relicensed under the GNU Lesser General Public License.