Maximum entropy models are very popular, especially in natural
language processing. The software here is an implementation of
maximum likelihood and maximum a posteriori optimization of the
parameters of these models. The algorithms used are much more
efficient than the iterative scaling techniques used in almost every
other maxent package out there. A description of the algorithms is
available in the following unpublished report:

Notes on CG and LM-BFGS Optimization of Logistic Regression
These are notes on this implementation of conjugate gradient and
limited memory BFGS optimization for logistic regression (aka maximum
entropy) classifiers. The notes were created because it is actually
quite difficult to find good references on efficient implementation of
these algorithms, though discussion of them exists everywhere.

What's New!

New features: You can do passive-aggressive updates (à la
Crammer et al.) with perceptron or multitron by specifying -pa. You
can also do instance normalization with -norm1 or -norm2 (for l1 and
l2 norms respectively). You can use named classes (-nc) and finally
everything is about 3-4 times faster!

Bug fix: -nobias works better now (i.e., it works at all).

New feature: You can now train using perceptron updates for binary or
"multitron" updates for multiclass problems. For these, the -lambda parameter
is ignored (though it may not be in the future). Performance tends to be
comparable to the binary/multiclass optimization.
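For instance: "megam perceptron train.data" or "megam multitron
train.data" (train.data is a placeholder file name).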

New feature: Now, the model spits out the weights that occurred at the
highest dev performance; to restore the old way of spitting out the final
weights, use -lastweight.

New feature: You can do multilabel classification. Specify -multilabel
on the command line and replace class ids with cost vectors. E.g., "0:0:1:2" is
a four-class problem for which predicting class 0 or class 1 incurs a cost of zero,
predicting class 2 leads to a cost of one and predicting class 3 leads to a
cost of two. Only works with multiclass or multitron optimization.

New feature: Prediction can be run in a dual-pipe mode. If you use "-"
as the name of the file to predict, you can write examples to stdin and read
predictions from stdout. Flushing is forced early, so you can do prediction
one-by-one.
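For example, assuming weights were previously saved to weights.model (a
placeholder name):

megam -predict weights.model binary -

Then write one example per line to stdin and read the corresponding
prediction from stdout.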

New feature: You can now do ranking. In particular, if you use
the explicit file format together with -nobias, you can
build a ranking model. Importantly, the number of "classes" does
not need to be the same for each example.
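A made-up sketch (feature names are purely illustrative): with -explicit
and -nobias, each line is one ranking instance, each candidate gets its
own feature vector, and the label picks out the correct candidate:

1 f1 f2 # f1 f3 # f2
0 f4 # f1 f4 f5

Here the first instance has three candidates and the second only two.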

Bug fix: Various errors with the explicit file format.

New feature: All input files, whether weight files loaded via
-predict or data files, can now be compressed. If a file
ends in .Z, .gz or .bz2, it will
automatically be read as a compressed file.

New feature: the -nobias option turns off the learning of
a bias parameter.

New feature: the cap on the maximum number of training examples
has increased.

Bug fix: the saving of weights with feature selection was
broken; it now works properly. Thanks to Taichi Nakamura for pointing
this out!

Bug fix: there was a bug with predicting explicit multiclass
data; it should be fixed now. Thanks to Ashish Venugopal for pointing
this out!

Bug fix: using explicit class feature information for multiclass
problems was broken; it is fixed in the new version. Thanks to Simon
Clematide for pointing this out!

Bug fix: there was a small, but important bug with estimating the
bias (stupid bias...it gets me every time). You'll want to get the
new version ASAP. Many thanks to Shou-de Lin for pointing this
out!

There are several changes from version 1 to version 2. First, some
limits (such as the number of examples) have been raised to over four
million (and it's still more than sufficiently efficient with this
much data).
Additionally, features have been added for "density
estimation" style modeling (specifically for whole-sentence
maximum entropy language models), restarted training, naive domain
adaptation and feature selection. See options
-sprog/-sfile, -mean, -init and
-abffs below.

Code and Executables

The software can be downloaded as source code (in O'Caml) or as executables for i686
Linux or Sun4 Solaris. For executables, there are both debugging
versions and optimized versions available:

Source: megam_src.tgz
Linux: Debugging: megam_i686.gz or Optimized: megam_i686.opt.gz
Solaris: I no longer have access to a Sun machine...sorry, download the source.
Windows: I no longer have access to a Windows machine...sorry, download the source.

For people at Utah: The optimized executable can be run as
/home/hal/bin/megam.
This code is free to anyone and you can use it for any research
you want. For commercial use, please contact me. If you use it for
work in a published article, please add a footnote acknowledging the
software, or cite the article above.

Speed

I have produced some graphs comparing the efficiency of MegaM to
YASMET, a maxent software package by Franz Josef Och (now at Google),
which uses generalized iterative scaling (GIS) for optimization. This has
been done both for a binary problem and a multiclass problem, each
with about 140 thousand training instances and 30 thousand test
instances, with hundreds of thousands of features (an NP chunking
task, if you must know).

Below, we see training error and test error curves plotted against
time (not number of iterations) for GIS and for CG (our binary
optimizer). Both are done with lambda=1 smoothing. As you can see,
CG both achieves lower error rates and achieves them more quickly (it
is done after less than 100 seconds, less than the time it takes
YASMET to even start up):

If you subtract off the startup file reading time, the curves are:

We have done the same thing for multiclass. Here, we have two
versions of LM-BFGS, one which uses explicit feature vectors and one
which uses implicit feature vectors (see usage and file formats
below). Both are faster than GIS, and implicit is quite significantly
faster:

If you subtract off the startup file reading time, the curves are:

Usage

The MegaM software lists its usage if you run it with no (or with
invalid) arguments. Three types of problems can be solved using the
software: binary classification (classes are 0 or 1), binomial
regression ("classes" are real values between 0 and 1; the
model will try to match its predictions to those values), or
multiclass classification (classes are 0, 1, 2, and so on).

usage: megam [options] <model-type> <input-file>
[options] are any of:
-fvals data is not in bernoulli format (i.e., feature
values appear next to their features in the file)
-explicit this specifies that each class gets its own feature
vector explicitly, rather than using the same one
independently of class
(only valid for multiclass problems)
-quiet don't generate per-iteration output
-maxi <int> specify the maximum number of iterations
(default: 100)
-dpp <float> specify the minimum change in perplexity
(default: -99999)
-memory <int> specify the memory size for LM-BFGS (multiclass only)
(default: 5)
-lambda <float> specify the precision of the Gaussian prior
(default: 1)
-tune tune lambda using repeated optimizations (starts with
specified -lambda value and drops by half each time
until optimal dev error rate is achieved)
-sprog <prog> for density estimation problems, specify the program
that will generate samples for us (see also -sfile)
-sfile <files> for density estimation problems, instead of -sprog,
just read from a (set of) file(s); specify as file1:file2:...:fileN
-sargs <string> set the arguments for -sprog; default ""
-sspec <i,i,i> set the <burn-in time, number of samples, sample spacing>
parameters; default: 1000,500,50
-sforcef <file> include features listed in <file> in feature vectors
(even if they don't exist in the training data)
-predict <file> load parameters from <file> and do prediction
(will not optimize a model)
-mean <file> the Gaussian prior typically assumes mu=0 for all features;
you can instead list means in <file> in the same format
as is output by this program (baseline adaptation)
-init <file> initialize weights as in <file>
-abffs <int> use approximate Bayes factor feature selection; add features
in batches of (at most) <int> size
-curve <spec> produce a learning curve, where spec = "min,step"
and we start with min examples and increase (multiply!)
by step each time; eg: -curve 2,2
-nobias do not use the bias features
-repeat <int> repeat optimization <int> times (sometimes useful because
bfgs thinks it converges before it actually does)
-lastweight if there is DEV data, we will by default output the best
weight vector; use -lastweight to get the last one
-multilabel for multiclass problems, optimize a weighted multiclass
problem; labels should be of the form "c1:c2:c3:...:cN"
where there are N classes and ci is the cost for
predicting class i
<model-type> is one of:
binary this is a binary classification problem; classes
are determined at a threshold of 0.5 (anything
less is negative class, anything greater is positive)
perceptron binary classification with averaged perceptron
multitron multiclass classification with averaged perceptron
binomial this is a binomial problem; all values should be
in the range [0,1]
multiclass this is a multiclass problem; classes should be
numbered [0, 1, 2, ...]; anything < 0 is mapped
to class 0
density this is a density estimation problem and thus the
partition function must be calculated through samples
(must use -sprog or -sfile arguments, above)

A standard call for optimization would be something like
"megam binary file" or "megam multiclass
file", where file is an existing file that contains the
training data (see file formats, below). This will run for at most
100 iterations (for BFGS, when the weight vector doesn't change,
iterations will cease no matter what you specify).
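For example (file names here are placeholders):

megam binary train.data > weights.model
megam -maxi 200 -lambda 0.5 multiclass train.data > weights.model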

The program will write the learned weights to stdout and will write
progress reports to stderr. For instance, using the small2.gz data
set (unzip it first), we get the following (with stdout directed to
another file or to /dev/null):

The first line tells us how much data it has read (1000 training
instances, 600 development instances and 700 test instances). Then,
for each iteration, it gives us the iteration number, the maximum
absolute weight change and, for each of training, development and
test, the current average perplexities and error rates. It
automatically stopped after 22 iterations because dw has
dropped to zero.

In the weight output, the first column is the feature name and each
remaining column is the feature weight for that class number (in this
case, class zero through four). The weights for the first class are
always zero.

When predicting, the first and last lines are written to stderr; the
predictions go to stdout. The first column is the predicted class, and
each remaining column is the corresponding class probability. The lines
to stdout
are produced lazily (i.e., it doesn't read in the whole file before
producing them).

Here, it has mapped any class >0.5 to class 1 and anything less
than 0.5 to class 0 (so class 0 is still class 0, and all of the
classes 1,2,3,4 have become class 1). There are only two columns this
time, corresponding to the class and its probability. The weight file
looks a bit different, too:

Since there are only two classes, and hence only one weight vector,
you only get two columns, one for the feature name and one for the
corresponding weight (for class 1).

If you try to use a multiclass weight vector with a binary problem, or
vice versa, strange things will happen, so be careful (most likely, it
will die).

Specifying a different value for maxi changes the number of
iterations; putting dpp at, for instance, 0.00001 means that
if the change in (training) perplexity drops below 0.00001, the
algorithm will terminate. lambda is the precision of the
Gaussian prior on the weights: 0 means no prior, 1000000 means too
much prior. You can tune the lambda value by setting a maximum value
using -lambda and specifying -tune, which will
search for a good one by halving lambda in a sequence of iterations
until the development error rate ceases to decrease.
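For example, to start the lambda search at 10 (train.data is a
placeholder; the file should contain a DEV section, described below):

megam -lambda 10 -tune binary train.data > weights.model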

File Formats

There are roughly four file formats that are acceptable to MegaM. No
matter what, each instance gets its own line and the first column
(space or tab separated) is the class. For binary problems, this
should be 0 or 1 (anything less than 0.5 goes to 0 and anything
greater goes to 1). For binomial problems, this should be between 0
and 1 (anything less than 0 goes to 0 and anything greater goes to
1). For multiclass problems, this should be a non-negative integer
(anything less than zero goes to zero).

In the simplest file format, the so-called "bernoulli
implicit" format, after the class label, feature names are simply
listed. It is called "bernoulli" because features can take
only values of 0 or 1, and any non-present feature is assumed to have
value 0. It is called implicit because, for multiclass problems, we
assume the same feature vector for all class labels. An example is:

0 F1 F2 F3
1 F2 F3 F8
0 F1 F2
1 F8 F9 F10

The second simplest file format, "non-bernoulli implicit",
requires that the option -fvals be specified to the program.
In this, feature values are written adjacent to their feature names.
The values can be any real number. An identical example would be:
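0 F1 1 F2 1 F3 1
1 F2 1 F3 1 F8 1
0 F1 1 F2 1
1 F8 1 F9 1 F10 1

(Every value here is 1, so this encodes exactly the same data as the
bernoulli example above.)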

The "explicit" file formats are only for multiclass problems
and are useful (supposedly) if different features fire depending on
the class label. In this case, we need to specify a feature vector
for each possible class; these are separated by space-delimited pound
signs. For instance:
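0 F1-c0 F2-c0 # F1-c1 F2-c1
1 F2-c0 F3-c0 # F2-c1 F3-c1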

In this example, there are the same features (more or less) for each
class, but they are given slightly different names. This need not be
the case: you can put anything you want. When you use explicit files,
you need to specify -explicit to the program. You can use
both explicit and non-bernoulli files by specifying -explicit
and -fvals on the command line and using the obvious
corresponding file format. Note that when you use explicit format,
the weight vector output will not have a different column for each
class; it will simply have each feature with a single weight.

Finally, you can break your training file into a training,
development and test section. The first segment of the
file is automatically training data (what is used to optimize the
weights). If you then have a single line with the word DEV,
then what follows will be considered development data (what is used to
optimize the lambda parameter). Finally, a single line with the word
TEST specifies that the remainder of the file is test data,
which is (understandably) not used for any optimization.
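For example (a tiny made-up file):

0 F1 F2
1 F2 F3
DEV
1 F3 F8
0 F1
TEST
0 F2
1 F8 F9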

Density Estimation

To use the density estimation capabilities, you must provide
independent samples from a base distribution (typically an n-gram
language model). See the Chen/Rosenberg paper on whole sentence
maxent language models for more details. You still need training
data, which should consist only of "positive" examples (labeled
however you like; zeros are best). You then must tell us how to
generate samples. This can either be done by providing a file that
has one sample per line (in the same format as the training data), or
by providing the name of a program that will generate samples for us.
In the latter case, the program should take (at least) two arguments:
first, the file name to dump the samples to; second, the number of
samples to generate. In the case of a file, just specify the file
name.

You will additionally need to set the sample specs (-sspec
#,#,#) where the three comma-separated numbers are the burn-in
time, the number of samples to use, and how far apart to space the
samples (we use an independence sampler). Finally, if the samples
might contain features not in the training data and you don't want
these to be ignored, list them all in a file and provide this as an
option to -sforcef and it will include all the features.
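For example (file and program names are placeholders):

megam -sfile samples1:samples2 -sspec 1000,500,50 density train.data

or, with a sampling program that takes an output file name and a
number of samples as its arguments:

megam -sprog ./gensamples -sspec 1000,500,50 density train.data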

Naive Domain Adaptation

We provide capabilities to do naive domain adaptation as described by
Chelba and Acero in their 2004 EMNLP paper. The idea is to use the
weights learned from an "out of domain" corpus as the prior
means for the "in domain" data. There are therefore two
steps to doing such training. First, use your out of domain data to
train a model and save the model file. Now, use the in domain data to
train a new model, but use the -mean parameter to specify the
model output from the out of domain data. This tends to give better
performance than other options and is quite trivial to run.
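With placeholder file names, the two steps look like:

megam binary out-of-domain.data > ood.model
megam -mean ood.model binary in-domain.data > adapted.model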

Initialization

The optimization problem we solve is convex and so there's a unique
maximum, but sometimes getting there is slow. If you have a good
guess at parameters, or you want to initialize them to something other
than zero, write the parameters to a file in the same format as model
outputs and then use the -init option to load them. This is
also useful if you want to just run more iterations.
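For example (file names are placeholders):

megam -init old.model binary train.data > new.model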

Feature Selection

Our last "cool" capability is to do feature selection using
approximate Bayes factors (see my paper on
this topic). This is reasonably efficient, but not for huge huge data
sets. To use it, simply specify -abffs # as an option, where
# is the number of features that should be added each
iteration. I typically use 50, but YMMV. Using 1 is
"correct", but using more isn't too bad and is much more
efficient.
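For example, to add features in batches of at most 50 (train.data is
a placeholder):

megam -abffs 50 binary train.data > weights.model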

Using Weights

Sometimes you would like to mark different examples as more or less
"important" than others. To do so, simply place the string
"$$$WEIGHT #" immediately following the class
number for the corresponding example, where # is a positive
real number. For instance, saying "$$$WEIGHT 2" is
equivalent to simply including two identical lines without weight (but
will be slightly more efficient).
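For example, the line "1 $$$WEIGHT 2 F2 F3" counts the same as writing
"1 F2 F3" twice.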

Frequently Asked Questions

Why is this program called MegaM? What does the "GA"
stand for?
Excellent question. Right now, I won't tell you (ask me in person).
The optimizers here are actually part of a larger research project
that will hopefully come to fruition shortly. When that happens, this
program will support the "GA" part of its name correctly.
Check back soon!

Should I ever use iterative scaling again?
NO!

What should I do if I encounter a bug?
Please email me. I use this software too. If there's a bug, I'd like
to fix it!

Can I use this software for research?
Of course, that's the whole point. Please give me credit in any paper
that comes out, though, either as a footnote containing this URL, or
as a full citation to the paper up top.

Can I put weights on different examples?
Yes, see "Using Weights" above.

last updated twenty august, two thousand four
comments, corrections? email me