Use Mallet 2.0.7 or later for this tutorial.
The implementation of GE training of MaxEnt models in pre-2.0.7 versions of Mallet contains a bug that often results in low accuracy when the number of constraints is small: the Gaussian prior was not always included in the objective function value, which caused problems in numerical optimization.
(The published experiments in [Druck, Mann, and McCallum 2008] used a different implementation and are not affected by this bug.)

Document Classification with Expectation Constraints

In this tutorial we describe training maximum entropy document classifiers with
expectation constraints that specify affinities between words and labels.
See [Druck, Mann, and McCallum 2008] for more
information. We assume that the task is classifying baseball and hockey documents and that we have
processed data sets baseball-hockey.train.vectors and
baseball-hockey.test.vectors.

These methods require unlabeled training data. We can hide labels using Vectors2Vectors.
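
For example, a command along these lines writes a copy of the training data with the labels hidden (we assume a hide-targets option here; run the class with --help to confirm the exact option names):

bin/mallet run cc.mallet.classify.tui.Vectors2Vectors --input baseball-hockey.train.vectors --output baseball-hockey.unlabeled.vectors --hide-targets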

If the data is truly unlabeled, then the easiest way to import it is to assign an arbitrary label to each document, ensuring that each label is used at least once.

Generalized Expectation

Suppose we know a priori that the words baseball and puck are good indicators of labels
baseball and hockey respectively. Specifically, suppose that we estimate that 90% of the
documents in which the word puck occurs should be labeled hockey, and similarly for
baseball. We may specify these constraints in a file as follows.

baseball hockey:0.1 baseball:0.9
puck hockey:0.9 baseball:0.1

The general format for a constraints file is:

feature_name label_name:probability label_name:probability ...

The number of probabilities must be equal to the number of labels. The feature and label names must match the names in the data and target alphabets exactly.

The following command trains a MaxEnt classifier with the above constraints (assumed to be in the file
baseball-hockey.constraints) using Generalized Expectation (GE), as described in [Druck, Mann, and McCallum 2008].
We specify the constraints file using
constraintsFile and specify a regularization penalty with gaussianPriorVariance.
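
A sketch of that command, assuming the standard Vectors2Classify front end and its trainer,property=value syntax (the exact spelling may vary between Mallet versions):

bin/mallet run cc.mallet.classify.tui.Vectors2Classify --training-file baseball-hockey.unlabeled.vectors --testing-file baseball-hockey.test.vectors --trainer "MaxEntGETrainer,constraintsFile=baseball-hockey.constraints,gaussianPriorVariance=1" --report test:accuracy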

L2 Penalty

By default, the difference between the target and model expectations is penalized using KL divergence (as in [Druck, Mann, and
McCallum 2008]). Instead, we can impose an L2 penalty using the L2 option.
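
For example, assuming the L2 switch is exposed as a trainer property named l2:

bin/mallet run cc.mallet.classify.tui.Vectors2Classify --training-file baseball-hockey.unlabeled.vectors --testing-file baseball-hockey.test.vectors --trainer "MaxEntGETrainer,constraintsFile=baseball-hockey.constraints,gaussianPriorVariance=1,l2=true" --report test:accuracy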

API

The underlying trainer is cc.mallet.classify.MaxEntGETrainer. New GE constraints and penalties for training MaxEnt models can be defined by implementing
cc.mallet.classify.constraints.ge.MaxEntGEConstraint.

Generalized Expectation with Target Ranges

It is also possible to specify L2 constraints that do not impose a penalty if the model expectation is within some target range.
For example, we can encourage model expectations to be in the range 90-100%.
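
For example, a range constraints file might contain the lines below, and training would then use MaxEntGERangeTrainer. Note that the label_name:lower,upper syntax and the file name baseball-hockey.range_constraints are illustrative assumptions; consult the MaxEntGERangeTrainer documentation for the exact format.

baseball baseball:0.9,1.0
puck hockey:0.9,1.0

bin/mallet run cc.mallet.classify.tui.Vectors2Classify --training-file baseball-hockey.unlabeled.vectors --testing-file baseball-hockey.test.vectors --trainer "MaxEntGERangeTrainer,constraintsFile=baseball-hockey.range_constraints" --report test:accuracy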

API

The underlying trainer is cc.mallet.classify.MaxEntGERangeTrainer. New GE constraints and penalties for training MaxEnt models can be defined by implementing
cc.mallet.classify.constraints.ge.MaxEntGEConstraint.

Posterior Regularization

There is also support for training MaxEnt models with Posterior Regularization (PR)
[Ganchev, Graça, Gillenwater, and Taskar 2010].
The following command trains a MaxEnt classifier using the above constraints (assumed to be in the file
baseball-hockey.constraints) with PR for 100 iterations. We specify the constraints file using
constraintsFile and specify a regularization penalty for each step (cf.
[Bellare, Druck, and McCallum 2009]) with pGaussianPriorVariance and qGaussianPriorVariance.
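
A sketch of that command; the maxIterations property name and the variance values are illustrative assumptions (see the Tips section for guidance on choosing the variances):

bin/mallet run cc.mallet.classify.tui.Vectors2Classify --training-file baseball-hockey.unlabeled.vectors --testing-file baseball-hockey.test.vectors --trainer "MaxEntPRTrainer,constraintsFile=baseball-hockey.constraints,maxIterations=100,pGaussianPriorVariance=1,qGaussianPriorVariance=100" --report test:accuracy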

API

The underlying trainer is cc.mallet.classify.MaxEntPRTrainer. New PR constraints and penalties for training MaxEnt models can be defined by implementing
cc.mallet.classify.constraints.pr.MaxEntPRConstraint.

Automated Methods for Obtaining Constraints

Below, we discuss machine-assisted methods for obtaining constraints. Note that these methods do not yet support target ranges.

User-provided Labeled Features

Rather than specifying target expectations directly, we may instead specify "labels" for features and have these converted into target expectations. Suppose we know that the word puck is associated with the label hockey, and the word baseball with the label baseball. We may specify these labeled features in a file (baseball-hockey.labeled_features) as follows.

baseball baseball
puck hockey

The general format for a file with labeled features is:

feature_name label_name label_name ...

Vectors2FeatureConstraints can estimate target expectations from a file with labeled features. A simple heuristic for obtaining expectations from labeled features is to divide a constant probability mass uniformly among the labels for a feature. By default, 0.9 probability is allocated to the labels for a feature. This estimation method is selected by supplying heuristic for the targets command option.
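
For example (we assume the labeled features file is passed with a features-file option; run the class with --help to confirm):

bin/mallet run cc.mallet.classify.tui.Vectors2FeatureConstraints --input baseball-hockey.unlabeled.vectors --features-file baseball-hockey.labeled_features --targets heuristic --output baseball-hockey.constraints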

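Candidate features can also be proposed automatically, for example by aggregating the top words from the topics of an LDA model. A sketch of such a command follows; the feature-selection, lda-file, and num-features option names are assumptions reconstructed from the prose below.

bin/mallet run cc.mallet.classify.tui.Vectors2FeatureConstraints --input baseball-hockey.unlabeled.vectors --feature-selection lda --lda-file baseball-hockey.lda --num-features 10 --targets none --output baseball-hockey.features
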
Here the lda-file option takes a serialized LDA model file; see the topic modeling tutorial for more information. Setting targets to none tells Vectors2FeatureConstraints to output candidate features only. baseball-hockey.features will then contain a list of ten candidate features, one per line.

The above method is unsupervised (i.e., it does not look at the true labels). We can also select
candidate features using an "oracle" information gain method (infogain) that looks at the
true labels. (Note that when using true labels to obtain constraints, baseball-hockey.train.vectors, rather than baseball-hockey.unlabeled.vectors, must be used.)
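
A sketch of the information gain variant, under the same option-name assumptions; note the use of the labeled training vectors:

bin/mallet run cc.mallet.classify.tui.Vectors2FeatureConstraints --input baseball-hockey.train.vectors --feature-selection infogain --num-features 10 --targets none --output baseball-hockey.features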

Machine-provided Target Expectations

Given a set of candidate features, we may estimate constraints using two methods. The first is to have the machine label the features (by revealing the true labels and using the method of [Druck, Mann, and McCallum 2008]) and convert these labels into expectations using the same heuristic as above. The second (oracle) method estimates target expectations directly from the true label distributions.

Note that when using --targets heuristic, the machine may discard candidate features
in the labeling process (cf. [Druck, Mann, and McCallum 2008]). However, the machine does not discard
any candidate features when using --targets oracle.
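
For example, to estimate target expectations directly from the true labels (again under the same option-name assumptions, and using the labeled training vectors as required above):

bin/mallet run cc.mallet.classify.tui.Vectors2FeatureConstraints --input baseball-hockey.train.vectors --features-file baseball-hockey.features --targets oracle --output baseball-hockey.constraints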

Tips

For GE training, a gaussianPriorVariance of 1 is a reasonable default choice.

For PR training, in our experience large values for qGaussianPriorVariance and small values for pGaussianPriorVariance work best.

The command line interfaces provide only basic functionality. In some cases it may be necessary to tweak the optimization code (for example, by setting convergence tolerances or step sizes) in order to obtain good results.

As a rule of thumb, try to specify a set of constraints that is balanced among labels and covers many documents.