Meteor version 1.3
Code by Michael Denkowski (mdenkows at cs.cmu.edu)
Authors of previous Meteor versions:
Abhaya Agarwal (abhayaa at cs.cmu.edu)
Satanjeev "Bano" Banerjee (satanjeev at cs.cmu.edu)
Alon Lavie (alavie at cs.cmu.edu)
Carnegie Mellon University
Note: See xray/README for directions using Meteor X-Ray
1. Introduction:
================
The Meteor metric evaluates a machine translation hypothesis against a reference
translation by calculating a similarity score based on an alignment between the
two strings. When multiple references are provided, the hypothesis is scored
against each and the reference producing the highest score is used. Alignments
are formed according to the following types of matches between strings:
Exact: Words are matched if and only if their surface forms are identical.
Stem: Words are stemmed using a language-appropriate Snowball Stemmer and
matched if the stems are identical.
Synonym: Words are matched if they are both members of a synonym set according
to the WordNet database.
Paraphrase: Phrases are matched if they are listed as paraphrases in the Meteor
paraphrase tables.
Currently supported languages are English, Czech, German, French, Spanish, and
Arabic. The system is written in Java with a full API to allow easy
incorporation of Meteor scoring into existing systems such as MERT
implementations.
This release also includes:
- a standalone version of the Aligner
- a standalone version of the Sufficient Statistics Scorer
- a Trainer which can tune optimal Meteor parameters for new data
- an Extractor which can convert (possibly malformed) XML/SGML into plaintext
2. Running Meteor:
==================
The following can be seen by running the Meteor scorer with no arguments:
--------------------------------------------------------------------------------
Meteor version 1.3
Usage: java -Xmx2G -jar meteor-*.jar [options]
Options:
-l language One of: en cz de es fr ar
-t task One of: rank adq hter li tune
-p 'alpha beta gamma delta' Custom parameters (overrides default)
-m 'module1 module2 ...' Specify modules (overrides default)
Any of: exact stem synonym paraphrase
-w 'weight1 weight2 ...' Specify module weights (overrides default)
-r refCount Number of references (plaintext only)
-x beamSize (default 40)
-d synonymDirectory (if not default for language)
-a paraphraseFile (if not default for language)
-j jobs Number of jobs to run (nBest only)
-f filePrefix Prefix for output files (default 'meteor')
-norm Tokenize / normalize punctuation and lowercase
(Recommended unless scoring raw output with
pretokenized references)
-lower Lowercase only (not required if -norm specified)
-noPunct Do not consider punctuation when scoring
(Not recommended unless special case)
-sgml Input is in SGML format
-mira Input is in MIRA format
(Use '-' for test and reference files)
-nBest Input is in nBest format
-oracle Output oracle translations (nBest only)
-vOut Output verbose scores (P / R / frag / score)
-ssOut Output sufficient statistics instead of scores
-writeAlignments Output alignments annotated with Meteor scores
(written to -align.out)
Sample options for plaintext: -l -norm
Sample options for SGML: -l -norm -sgml
Sample options for raw output / pretokenized references: -l -lower
See README file for additional information
--------------------------------------------------------------------------------
The simplest way to run Meteor is as follows:
$ java -Xmx2G -jar meteor-*.jar -norm
If files are in SGML format, use:
$ java -Xmx2G -jar meteor-*.jar -sgml -norm
For example, using the sample files included with this distribution,
you can run the following test.
Score the example test and reference files using the filtered paraphrase table:
$ java -Xmx2G -jar meteor-*.jar example/test.sgm example/ref.sgm -sgml -norm
You should see the following output:
--------------------------------------------------------------------------------
Meteor version: 1.3
Eval ID: meteor.1.3-en-norm-0.85_0.2_0.6_0.75-ex_st_sy_pa-1.0_0.6_0.8_0.6
Language: English
Format: SGML
Task: Ranking
Modules: exact stem synonym paraphrase
Weights: 1.0 0.6 0.8 0.6
Parameters: 0.85 0.2 0.6 0.75
[newstest2009][cmu-combo]
Test Matches Reference Matches
Stage Content Function Total Content Function Total
1 16052 21035 37087 16052 21035 37087
2 553 13 566 555 11 566
3 899 150 1049 932 117 1049
4 3989 3275 7264 4151 2982 7133
Total 21493 24473 45966 21690 24145 45835
Test words: 64748
Reference words: 66017
Chunks: 22847
Precision: 0.6466761746295856
Recall: 0.62208024339228
f1: 0.6341398024007918
fMean: 0.6256496735692693
Fragmentation penalty: 0.5218595115532901
Final score: 0.299148440516935
--------------------------------------------------------------------------------
This includes the following in order:
- Meteor version
- Eval ID, a string that uniquely identifies all version, setting, and parameter
information to ensure that other data sets scored with Meteor can be scored
consistently and comparably.
- Header describing settings and parameters
- List of translations to be scored (in this case only the cmu-combo system on
one test set)
- Match statistics
- Summary statistics
- Final score
Score files for segment, document, and system level scores are produced,
prefixed with "meteor" or the specified prefix. The output from the above
should match the example scores.
$ diff meteor-seg.scr example/meteor-seg.scr
$ diff meteor-doc.scr example/meteor-doc.scr
$ diff meteor-sys.scr example/meteor-sys.scr
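The summary statistics in the sample output above can be related to one another
with a little arithmetic. The following Python sketch is not part of the Meteor
distribution, and the function name is hypothetical. It assumes the usual Meteor
formulation, which this README does not spell out: fMean is an alpha-weighted
harmonic mean of precision and recall, the fragmentation penalty is
gamma * frag^beta, and frag is the chunk count divided by the average of the
test and reference match totals.

```python
def meteor_from_summary(precision, recall, chunks, test_matches, ref_matches,
                        alpha, beta, gamma):
    """Recompute fMean, fragmentation penalty, and final score.

    Assumed formulation (not stated in this README):
      fMean   = P * R / (alpha * P + (1 - alpha) * R)
      frag    = chunks / average(test matches, reference matches)
      penalty = gamma * frag ** beta
      score   = fMean * (1 - penalty)
    """
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    frag = chunks / ((test_matches + ref_matches) / 2.0)
    penalty = gamma * frag ** beta
    return f_mean, penalty, f_mean * (1 - penalty)


# Figures copied from the sample output; alpha, beta, and gamma are the first
# three values on its "Parameters:" line.
f_mean, penalty, score = meteor_from_summary(
    precision=0.6466761746295856,
    recall=0.62208024339228,
    chunks=22847,
    test_matches=45966,
    ref_matches=45835,
    alpha=0.85, beta=0.2, gamma=0.6,
)
print(f_mean, penalty, score)
```

Under these assumptions the three printed values should agree closely with the
fMean, Fragmentation penalty, and Final score lines of the sample output.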
3. Options:
===========
Language: -l language
---------------------
English is assumed by default. Meteor also supports evaluation of MT output in
the following languages:
Language Available Modules
English (en) (exact, stem, synonym, paraphrase)
French (fr) (exact, stem, paraphrase)
German (de) (exact, stem, paraphrase)
Spanish (es) (exact, stem, paraphrase)
Czech (cz) (exact, paraphrase)
Arabic (ar) (exact, paraphrase)
Task: -t task
-------------
Each task specifies the modules, module weights, and parameters (alpha, beta,
gamma, delta) tuned to a specific type of human judgment data. These tasks and
their supported languages follow:
rank : Tuned to human rankings of translations from WMT09 and WMT10.
- English
- Czech
- German
- Spanish
- French
adq : Tuned to adequacy scores from NIST OpenMT09.
- English
Tuned to adequacy scores of Google translations of news into Arabic by
volunteers at Columbia University.
- Arabic
hter : Tuned to HTER scores from GALE P2 and P3.
- English
li : Language independent - exact matches only, parameters selected to
generalize well across languages
Parameters: -p 'alpha beta gamma delta'
---------------------------------------
Alternatively, the four parameters (alpha, beta, gamma, delta) can be
specified manually. This is most often used when tuning Meteor to new data.
Modules: -m 'module1 module2 ...'
---------------------------------
Meteor supports 4 matcher modules:
exact match using surface forms
stem match using stems obtained from the Snowball stemmers
synonym match based on synonyms obtained from WordNet
paraphrase match based on paraphrases from the Meteor paraphrase tables
See the Language section to determine which modules are available for each
language.
Module Weights: -w 'weight1 weight2 ...'
----------------------------------------
The module weights can also be specified manually. This is also primarily used
for tuning Meteor.
Reference Count: -r refCount
----------------------------
If the input is in plaintext, the number of references can be specified. For N
references, it is assumed that the reference file will be N times the length of
the test file, containing sets of N references in order. For example, if N=4,
reference lines 1-4 will correspond to test line 1, 5-8 to line 2, etc.
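The layout of a multi-reference plaintext file can be illustrated with a short
Python sketch. It is not part of Meteor; the function name is hypothetical and
it only demonstrates the grouping rule described above.

```python
def group_references(ref_lines, ref_count):
    """Split a flat list of reference lines into one group per test line.

    Group k (0-based) holds the ref_count references for test line k + 1.
    """
    if ref_count < 1:
        raise ValueError("ref_count must be at least 1")
    if len(ref_lines) % ref_count != 0:
        raise ValueError("reference file length is not a multiple of ref_count")
    return [ref_lines[i:i + ref_count]
            for i in range(0, len(ref_lines), ref_count)]


# Eight reference lines with N=4 cover two test lines.
refs = ["ref %d" % i for i in range(1, 9)]
groups = group_references(refs, 4)
print(groups[0])  # references for test line 1
print(groups[1])  # references for test line 2
```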
Beam Size: -x
-------------
This number, set to 40 by default, is used to limit the beam size when searching
for the highest scoring alignment. As parameters are tuned for a beam size of
40, simply increasing this number does not necessarily produce more reliable
scores.
Synonymy Directory: -d synonymDirectory
---------------------------------------
This option should only be used to test external synonymy databases. By default,
the included synonymy database will be used.
Paraphrase File: -a paraphraseFile
----------------------------------
This option should only be used to test external paraphrase tables. By default,
the included paraphrase tables will be used.
Jobs: -j jobs
-------------
This option (nBest scoring only) sets the number of jobs to use for scoring. It
is generally a good idea to set this to the number of CPUs on the machine
running Meteor.
File Prefix: -f filePrefix
--------------------------
Specify the prefix of score files in SGML mode. Files produced will be named
with the prefix followed by -seg.scr, -doc.scr, and -sys.scr. The default
prefix is "meteor". If alignments are to be written, they are written to the
prefix followed by -align.out.
Normalize: -norm
----------------
Tokenize and lowercase input lines and normalize punctuation to improve scoring
accuracy. This option is highly recommended unless scoring raw system output
against pretokenized references.
Lowercase: -lower
-----------------
Lowercase input lines (not required if -norm is also specified). This is most
commonly used when scoring cased, tokenized outputs with pretokenized
references.
Ignore Punctuation: -noPunct
----------------------------
If specified, punctuation symbols will be removed before scoring. This is
generally not recommended as parameters are tuned with punctuation included.
SGML: -sgml
-----------
This specifies that input is in SGML format. (See Input/Output section)
MIRA: -mira
-----------
Input is in cdec scoring format. Use "-" for the test and reference files;
Meteor then reads from standard in and writes to standard out. Lines are
composed of the
following:
SCORE ||| reference 1 words ||| reference n words ||| hypothesis words
Scores the hypothesis against one or more references and returns a line of
sufficient statistics.
EVAL ||| stats
Calculates final scores using output of SCORE lines.
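As an illustration of the two line types, the Python sketch below builds SCORE
and EVAL lines to send to Meteor on standard in. The helper names are
hypothetical, and the use of a single space on each side of the "|||" delimiter
is an assumption based on the line layout shown above.

```python
def mira_score_line(references, hypothesis):
    """Build a SCORE line: references first, hypothesis last."""
    if not references:
        raise ValueError("at least one reference is required")
    return " ||| ".join(["SCORE"] + list(references) + [hypothesis])


def mira_eval_line(stats):
    """Build an EVAL line from the statistics returned for a SCORE line."""
    return "EVAL ||| " + stats


line = mira_score_line(["the cat sat", "a cat sat"], "the cat sits")
print(line)
```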
N-Best: -nBest
--------------
This specifies that input is in nBest format with multiple translations for each
segment. For each segment, a line containing a single number for the count of
translations is followed by one translation per line. For example, an input file
with translations for three segments might appear as follows:
1
This is a single translation.
3
This is hypothesis one.
This is hypothesis two.
This is hypothesis three.
2
This segment has two translations.
This is the second translation.
See Input/Output section for the output format.
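The nBest layout can be read with a few lines of Python. This sketch is not
part of Meteor; the function name is hypothetical and it simply follows the
count-then-translations layout described above.

```python
def parse_nbest(lines):
    """Return one list of hypotheses per segment from nBest-format lines."""
    segments = []
    i = 0
    while i < len(lines):
        count = int(lines[i].strip())  # line holding the translation count
        hyps = [line.rstrip("\n") for line in lines[i + 1:i + 1 + count]]
        if len(hyps) != count:
            raise ValueError("file ends before all translations were read")
        segments.append(hyps)
        i += 1 + count
    return segments


sample = [
    "1",
    "This is a single translation.",
    "3",
    "This is hypothesis one.",
    "This is hypothesis two.",
    "This is hypothesis three.",
    "2",
    "This segment has two translations.",
    "This is the second translation.",
]
segments = parse_nbest(sample)
print([len(s) for s in segments])
```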
Oracle Best: -oracle
--------------------
Also output the oracle 1-best translation for each segment when scoring an
N-best list.
Verbose Output: -vOut
---------------------
Output verbose scores (Precision, Recall, Fragmentation, Score) in place of
regular scores.
Sufficient Statistics: -ssOut
-----------------------------
This option outputs sufficient statistics in place of scores and omits
all other output. The behavior is slightly different depending on
the data format.
Plaintext:
Space delimited lines are output, each having the following form:
tstLen refLen stage1tstTotalMatches stage1refTotalMatches
stage1tstWeightedMatches stage1refWeightedMatches s2tTM s2rTM s2tWM
s2rWM s3tTM s3rTM s3tWM s3rWM s4tTM s4rTM s4tWM s4rWM chunks lenCost
No system level score is output. The lines can be piped or otherwise passed to
the StatsScorer program to produce Meteor scores from the sufficient statistics.
SGML:
The output score files will contain space delimited sufficient statistics in
place of scores. Segment, Document, and System level scores are still produced.
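For the plaintext case, a line of sufficient statistics can be unpacked into
named fields as sketched below in Python. This is not part of Meteor: the
function name is hypothetical, the field names are the abbreviated forms used
in the listing above (applied to stage 1 as well), and reading every value as
a float is an assumption.

```python
# Field order follows the plaintext listing: two lengths, four values per
# matching stage (test/reference total and weighted matches), then chunks
# and lenCost.
FIELDS = ["tstLen", "refLen"]
for stage in range(1, 5):
    FIELDS += ["s%dtTM" % stage, "s%drTM" % stage,
               "s%dtWM" % stage, "s%drWM" % stage]
FIELDS += ["chunks", "lenCost"]


def parse_stats_line(line):
    """Map one space-delimited statistics line to a dict of named values."""
    values = [float(v) for v in line.split()]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


# Placeholder input: the numbers 0..19 stand in for real statistics.
stats = parse_stats_line(" ".join(str(i) for i in range(20)))
print(stats["tstLen"], stats["chunks"])
```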
Write Alignments: -writeAlignments
----------------------------------
Write alignments between hypotheses and references to meteor-align.out, or to
the file prefix followed by -align.out when a file prefix is specified.
Alignments are written in Meteor format, annotated with Meteor statistics:
Title precision recall fragmentation score
sentence1
sentence2
Line2Start:Length Line1Start:Length Module Score
...
4. Input/Output Formats:
========================
Input can be in either plaintext with one segment per line (also see -r and
-nBest for multiple references or hypotheses), or in SGML.
For plaintext, output is to standard out with scores for each segment and final
system level statistics.
If nBest is specified, a score is output for each translation hypothesis along
with system level statistics for first-sentence (first translation in each list)
and best-choice (best scoring translation in each list).
For SGML, output includes 3 files containing segment, document, and system level
scores for the systems and test sets:
meteor-seg.scr contains lines: testset system document segment score
meteor-doc.scr contains lines: testset system document score
meteor-sys.scr contains lines: testset system score
System level statistics will also be written to standard out for SGML scoring.
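Segment-level score files can be loaded for further analysis with a short
Python sketch such as the one below. It is not part of Meteor; the names are
hypothetical, and it assumes the five fields are separated by whitespace and
contain no embedded whitespace. The sample line is made up for illustration.

```python
from collections import namedtuple

SegScore = namedtuple("SegScore", "testset system document segment score")


def parse_seg_line(line):
    """Parse one line of a segment-level score file."""
    testset, system, document, segment, score = line.split()
    return SegScore(testset, system, document, segment, float(score))


# Hypothetical line, not taken from the example score files.
record = parse_seg_line("newstest2009 cmu-combo doc1 3 0.25")
print(record.system, record.score)
```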
5. Aligner:
===========
The Meteor aligner can be run independently with the following command:
$ java -Xmx2G -cp meteor-*.jar Matcher
Without any arguments, the following help text is printed.
--------------------------------------------------------------------------------
Meteor Aligner version 1.3
Usage: java -Xmx2G -cp meteor-*.jar Matcher [options]
Options:
-l language One of: en cz de es fr ar
-m 'module1 module2 ...' Specify modules (overrides default)
One of: exact stem synonym paraphrase
-t type Alignment type (coverage vs accuracy)
One of: maxcov maxacc
-x beamSize Keep speed reasonable
-d synonymDirectory (if not default)
-a paraphraseFile (if not default)
See README file for examples
--------------------------------------------------------------------------------
The aligner reads in two plaintext files and outputs a detailed line-by-line
alignment between them. Only the options (outlined in previous sections) which
apply to the creation of alignments are available. The type option determines
whether the aligner prefers coverage (better for correlation with human
judgments in evaluation) or accuracy (better for tasks requiring high accuracy
for each alignment link).
6. StatsScorer:
===============
The Meteor sufficient statistics scorer can also be run independently:
$ java -cp meteor-*.jar StatsScorer
The --help option provides the following help text.
--------------------------------------------------------------------------------
Meteor Stats Scorer version 1.3
Usage: java -cp meteor.jar StatsScorer [options]
Options:
-l language One of: en cz de es fr ar
-t task One of: adq rank hter li
-p 'alpha beta gamma' Custom parameters (overrides default)
-w 'weight1 weight2 ...' Specify module weights (overrides default)
-final Output final (system level) score
--------------------------------------------------------------------------------
The scorer reads lines of sufficient statistics from standard in and writes
Meteor scores to standard out. If -final is specified, an additional line is
written containing the aggregate score.
7. Trainer:
===========
The Meteor trainer can be used to tune Meteor parameters for new data. The
"scripts" directory contains scripts for creating training sets from many common
data formats.
Without any arguments, the following help text is printed.
--------------------------------------------------------------------------------
Meteor Trainer version 1.3
Usage: java -XX:+UseCompressedOops -Xmx2G -cp meteor-*.jar Trainer [options]
Tasks: One of: segcor rank
Options:
-a paraphrase
-e epsilon
-l language
-i 'p1 p2 p3 w1 w2 w3 w4' Initial parameters and weights
-f 'p1 p2 p3 w1 w2 w3 w4' Final parameters and weights
-s 'p1 p2 p3 w1 w2 w3 w4' Steps
--------------------------------------------------------------------------------
The Trainer will explore the parameter space bounded by the initial and final
parameters and weights using the given steps. Output should be piped to a file
and sorted to determine the best scoring point. The following tasks are
available:
segcor: Segment-level correlation: data dir can contain file triplets for any
number of systems in the form:
.tst - MT system output file (SGML)
.ref - Reference translation file (SGML)
.ter - Human score file for this system containing lines in the
form (space delimited):
example:
newswire1 12 5
example: sys1.tst sys1.ref sys1.ter
Human scores can be of any numerical measure (7 point adequacy scale,
0/1 correctness, HTER or other post-edit measure). For each point in
the parameter space, the segment-level length-weighted Pearson's
correlation coefficient is calculated across the scores for all segments
in all files.
rank: Rank consistency: data dir can contain file groups in the following form:
.rank - rank file containing lines in the form (tab delimited):
example:
3 cz-en sysA cz-en sysB
indicating that for a given segment, language pair A,
system A is preferred (higher score) to language pair B
system B. There can be multiple judgments for the same
systems on the same segments.
.ref.sgm - Reference translation file for this language pair
(SGML)
..sgm - MT system output for this language pair (SGML)
..sgm - another system
..sgm - another system
...additional systems...
example: cz-en.rank
cz-en.ref.sgm
cz-en.sysA.sgm
cz-en.sysB.sgm
cz-en.sysC.sgm
...
For each point in the parameter space, the rank consistency (proportion of
times preferred segments receive a higher metric score) is calculated.
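The rank consistency measure can be sketched in Python as follows. This is an
illustration of the definition above, not the Trainer's code: the function
name and data layout are hypothetical, language pairs are omitted for brevity,
and counting ties as inconsistent is an assumption.

```python
def rank_consistency(judgments, scores):
    """Proportion of judgments where the preferred system scores higher.

    judgments: list of (segment, preferred_system, other_system) tuples
    scores:    dict mapping (segment, system) to a metric score
    """
    if not judgments:
        raise ValueError("no judgments given")
    consistent = sum(
        1 for segment, better, worse in judgments
        if scores[(segment, better)] > scores[(segment, worse)]
    )
    return consistent / len(judgments)


# Made-up scores: the metric agrees with the judgment on segment 3 only.
scores = {(3, "sysA"): 0.4, (3, "sysB"): 0.3,
          (7, "sysA"): 0.2, (7, "sysB"): 0.5}
judgments = [(3, "sysA", "sysB"), (7, "sysA", "sysB")]
consistency = rank_consistency(judgments, scores)
print(consistency)
```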
8. SGML-to-Plaintext Converter:
===============================
This release also includes a program for reliably converting SGML test and
reference files to plain text. Resulting files are consistently ordered even
if the SGML files are not, and blank lines are added for empty or missing
segments. To run this program, use:
$ java -cp meteor-*.jar SGMtoPlaintext
9. Scripts:
===========
The scripts directory contains many useful scripts for training and debugging
Meteor. If you are brave enough to use them, most of them are reasonably
commented. You can also send email to mdenkows at cs.cmu.edu .
10. Licensing:
==============
Meteor is released under the LGPL and includes some files subject to the
(less restrictive) WordNet license. See the included COPYING files for
details.
11. Acknowledgements:
=====================
The following researchers have contributed to previous implementations of
the Meteor system (all at Carnegie Mellon University):
Rachel Reynolds
Kenji Sagae
Jeremy Naman
Shyamsundar Jayaraman