Features for Named Entity Recognition. The code here creates the features
by processing Lists of CoreLabels.
See SeqClassifierFlags for where each flag is declared and for
how option names are mapped onto those flags.

To add a new feature extractor, do the following:

1. Add a variable (boolean, int, String, etc. as appropriate) to
   SeqClassifierFlags to record whether the new extractor is turned on,
   or its value. Add it at the bottom of the list of variables
   currently in the class (this avoids breaking older serialized
   files). Make the default value of the variable false/null/0
   (again for backwards compatibility).

2. Add a clause to the big if/then/else of setProperties(Properties) in
   SeqClassifierFlags. Unless it is a macro option, make the option name
   the same as the variable name used in step 1.

3. Add code to NERFeatureFactory for this feature. First decide which
   classes (hidden states) are involved in the feature. If only the
   current class is involved, add the feature extractor to the
   featuresC code; if both the current and previous class are involved,
   add it to featuresCpC; and so on.
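
The three steps above can be sketched as follows. This is a minimal, self-contained illustration rather than the actual Stanford code: SketchFlags and SketchFeatureFactory are hypothetical stand-ins for SeqClassifierFlags and NERFeatureFactory, useWordLength is an invented flag, and the feature-name strings are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical stand-in for SeqClassifierFlags (not the real class). */
final class SketchFlags {
  boolean useWord = true;

  // Step 1: the new flag goes at the bottom of the field list and
  // defaults to false, so older serialized flags still deserialize.
  boolean useWordLength = false;

  // Step 2: one clause per option; the option name equals the field name.
  void setProperties(Properties props) {
    for (String key : props.stringPropertyNames()) {
      String val = props.getProperty(key);
      if (key.equals("useWord")) {
        useWord = Boolean.parseBoolean(val);
      } else if (key.equals("useWordLength")) {
        useWordLength = Boolean.parseBoolean(val);
      }
    }
  }
}

/** Hypothetical stand-in for NERFeatureFactory (not the real class). */
final class SketchFeatureFactory {
  private final SketchFlags flags;

  SketchFeatureFactory(SketchFlags flags) {
    this.flags = flags;
  }

  // Step 3: this feature involves only the current class, so it is
  // added in the analogue of featuresC.
  List<String> featuresC(List<String> words, int pos) {
    List<String> features = new ArrayList<>();
    String w = words.get(pos);
    if (flags.useWord) {
      features.add(w + "-WORD");          // illustrative feature name
    }
    if (flags.useWordLength) {
      features.add("LEN-" + w.length());  // illustrative feature name
    }
    return features;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("useWordLength", "true");
    SketchFlags flags = new SketchFlags();
    flags.setProperties(props);
    List<String> words = Arrays.asList("Stanford", "University");
    // prints [Stanford-WORD, LEN-8]
    System.out.println(new SketchFeatureFactory(flags).featuresC(words, 0));
  }
}
```

A feature that also depended on the previous class would instead be added in the analogue of featuresCpC.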

Parameters can be defined using a Properties file
(specified on the command line with -prop propFile),
or directly on the command line. The following properties are recognized:

Property Name

Type

Default Value

Description

loadClassifier

String

n/a

Path to serialized classifier to load

loadAuxClassifier

String

n/a

Path to auxiliary classifier to load.

serializeTo

String

n/a

Path to serialize classifier to

trainFile

String

n/a

Path of file to use as training data

testFile

String

n/a

Path of file to use as test data

map

String

see below

This applies at training time or if testing on tab-separated column data. It says what is in each column. It doesn't apply when running on plain text data. The simplest scenario for training is having words and classes in two columns. word=0,answer=1 is the default if conllNoTags is specified; otherwise word=0,tag=1,answer=2 is the default. But you can add other columns, such as for a part-of-speech tag, presence in a lexicon, etc. That would only be useful at runtime if you have part-of-speech information or whatever available and are passing it in with the tokens (that is, you can pass to classify CoreLabel tokens with additional fields stored in them).

useWord

boolean

true

Gives you feature for w

useBinnedLength

String

null

If non-null, treat the value as a sequence of comma-separated integer bounds; a word whose length is above the previous bound and up to the next bound is binned into a Len-range feature

useNGrams

boolean

false

Make features from letter n-grams, i.e., substrings of the word

lowercaseNGrams

boolean

false

Make features from lowercased letter n-grams only

dehyphenateNGrams

boolean

false

Remove hyphens before making features from letter n-grams

conjoinShapeNGrams

boolean

false

Conjoin word shape and n-gram features

useNeighborNGrams

boolean

false

Use letter n-grams for the previous and current words in the CpC clique. This feature helps for languages such as Chinese, but not so much for English

usePrev

boolean

false

Gives you feature for (pw,c), and together with other options enables other previous features, such as (pt,c) [with useTags]

useNext

boolean

false

Gives you feature for (nw,c), and together with other options enables other next features, such as (nt,c) [with useTags]

The value can be one or more filenames (names separated by a comma, semicolon or space).
If provided, gazettes are loaded from these files. Each line should be an entity class name, followed by whitespace, followed by an entity (which might be a phrase of several tokens with a single space between words).
Giving this property turns on useGazettes, so you normally don't need to specify useGazettes (but can use it to turn off gazettes specified in a properties file).

sloppyGazette

boolean

false

If true, a gazette feature fires when any token of a gazette entry matches

cleanGazette

boolean

false

If true, a gazette feature fires when all tokens of a gazette entry match

Does not use any class combination features using previous classes if this is false

useNextSequences

boolean

false

Does not use any class combination features using next classes if this is false

useLongSequences

boolean

false

Use plain higher-order state sequences out to minimum of length or maxLeft

useBoundarySequences

boolean

false

Use extra second order class sequence features when previous is CoNLL boundary, so entity knows it can span boundary.

useTaggySequences

boolean

false

Use first, second, and third order class and tag sequence interaction features

useExtraTaggySequences

boolean

false

Add in sequences of tags with just current class features

useTaggySequencesShapeInteraction

boolean

false

Add in terms that join sequences of 2 or 3 tags with the current shape

strictlyFirstOrder

boolean

false

As an override to whatever other options are in effect, deletes all features other than C and CpC clique features when building the classifier

entitySubclassification

String

"IO"

If set, convert the labeling of classes (but not the background) into one of several alternate encodings (IO, IOB1, IOB2, IOE1, IOE2, SBIEO, with a S(ingle), B(eginning), E(nding), I(nside) 4-way classification for each class). By default, we either do no re-encoding, or the CoNLLDocumentIteratorFactory does a lossy encoding as IO. Note that this is all CoNLL-specific, depends on their way of prefix encoding classes, and is only implemented by the CoNLLDocumentIteratorFactory.

useSum

boolean

false

tolerance

double

1e-4

Convergence tolerance in optimization

printFeatures

String

null

Print out all the features generated by the classifier for a dataset to a file based on this name (starting with "features-", suffixed "-1" and "-2" for train and test). This simply prints the feature names, one per line.

printFeaturesUpto

int

-1

Print out features for only the first this many datums, if the value is positive.

useSymTags

boolean

false

Gives you features (pt, t, nt, c), (t, nt, c), (pt, t, c)

useSymWordPairs

boolean

false

Gives you features (pw, nw, c)

printClassifier

String

null

Style in which to print the classifier. One of: HighWeight, HighMagnitude, Collection, AllWeights, WeightHistogram

printClassifierParam

int

100

A parameter to the printing style, which may give, for example, the number of parameters to print

intern

boolean

false

If true, (String) intern the data read in, the classes, and feature (pre-)names such as substring features

intern2

boolean

false

If true, intern all (final) feature names (if only current word and ngram features are used, these will already have been interned by intern, and this is an unnecessary no-op)

cacheNGrams

boolean

false

If true, record the NGram features that correspond to a String (under the current option settings) and reuse rather than recalculating if the String is seen again.

selfTest

boolean

false

noMidNGrams

boolean

false

Do not include character n-gram features for n-grams that contain neither the beginning nor the end of the word

maxNGramLeng

int

-1

If this number is positive, n-grams above this size will not be used in the model

useReverse

boolean

false

retainEntitySubclassification

boolean

false

If true, rather than undoing a recoding of entity tag subtypes (such as BIO variants), just leave them in the output.

useLemmas

boolean

false

Include the lemma of a word as a feature.

usePrevNextLemmas

boolean

false

Include the previous/next lemma of a word as a feature.

useLemmaAsWord

boolean

false

Include the lemma of a word as a feature.

normalizeTerms

boolean

false

If this is true, some words are normalized: day and month names are lowercased (as for normalizeTimex) and some British spellings are mapped to American English spellings (e.g., -our/-or, etc.).

normalizeTimex

boolean

false

If this is true, capitalization of day and month names is normalized to lowercase

useNB

boolean

false

useTypeSeqs

boolean

false

Use basic zeroth order word shape features.

useTypeSeqs2

boolean

false

Add additional first and second order word shape features

useTypeSeqs3

boolean

false

Adds one more first order shape sequence

useDisjunctive

boolean

false

Include features giving disjunctions of words anywhere in the left or right disjunctionWidth words (preserving direction but not position)

disjunctionWidth

int

4

The number of words on each side of the current word that are included in the disjunction features

useDisjunctiveShapeInteraction

boolean

false

Include features giving disjunctions of words anywhere in the left or right disjunctionWidth words (preserving direction but not position) interacting with the word shape of the current word

useWideDisjunctive

boolean

false

Include features giving disjunctions of words anywhere in the left or right wideDisjunctionWidth words (preserving direction but not position)

wideDisjunctionWidth

int

4

The number of words on each side of the current word that are included in the wide disjunction features

usePosition

boolean

false

Use combination of position in sentence and class as a feature

useBeginSent

boolean

false

Use combination of initial position in sentence and class (and word shape) as a feature. (Doesn't seem to help.)

useDisjShape

boolean

false

Include features giving disjunctions of word shapes anywhere in the left or right disjunctionWidth words (preserving direction but not position)

useClassFeature

boolean

false

Include a feature for the class (as a class marginal). Puts a prior on the classes which is equivalent to how often the feature appeared in the training data.

useShapeConjunctions

boolean

false

Conjoin shape with tag or position

useWordTag

boolean

false

Include word and tag pair features

useLastRealWord

boolean

false

Iff the prev word is of length 3 or less, add an extra feature that combines the word two back and the current word's shape. Weird!

useNextRealWord

boolean

false

Iff the next word is of length 3 or less, add an extra feature that combines the word after next and the current word's shape. Weird!

useTitle

boolean

false

Match a word against a list of name titles (Mr, Mrs, etc.). Doesn't really seem to help.

useTitle2

boolean

false

Match a word against a better list of English name titles (Mr, Mrs, etc.). Still doesn't really seem to help.

useDistSim

boolean

false

Load a file of distributional similarity classes (specified by distSimLexicon) and use it for features

distSimLexicon

String

The file to be loaded for distsim classes.

distSimFileFormat

String

alexclark

Files should be formatted as tab separated rows where each row is a word/class pair. alexclark=word first, terrykoo=class first

useOccurrencePatterns

boolean

false

This is a very engineered feature designed to capture multiple references to names. If the current word isn't capitalized, followed by a non-capitalized word, and preceded by a word with alphabetic characters, it returns NO-OCCURRENCE-PATTERN. Otherwise, if the previous word is a capitalized NNP, then if in the next 150 words you find this PW-W sequence, you get XY-NEXT-OCCURRENCE-XY, else if you find W you get XY-NEXT-OCCURRENCE-Y. Similarly for backwards and XY-PREV-OCCURRENCE-XY and XY-PREV-OCCURRENCE-Y. Else (if the previous word isn't a capitalized NNP), under analogous rules you get one or more of X-NEXT-OCCURRENCE-YX, X-NEXT-OCCURRENCE-XY, X-NEXT-OCCURRENCE-X, X-PREV-OCCURRENCE-YX, X-PREV-OCCURRENCE-XY, X-PREV-OCCURRENCE-X.

useTypeySequences

boolean

false

Some first order word shape patterns.

useGenericFeatures

boolean

false

If true, any features you include in the map will be incorporated into the model with values equal to those given in the file; values are treated as strings unless you use the "realValued" option (described below)

justify

boolean

false

Print out all feature/class pairs and their weight, and then for each input data point, print justification (weights) for active features

normalize

boolean

false

For the CMMClassifier (only) if this is true then the Scorer normalizes scores as probabilities.

useHuber

boolean

false

Use a Huber loss prior rather than the default quadratic loss.

useQuartic

boolean

false

Use a Quartic prior rather than the default quadratic loss.

sigma

double

1.0

epsilon

double

0.01

Used only as a parameter in the Huber loss: this is the distance from 0 at which the loss changes from quadratic to linear

beamSize

int

30

maxLeft

int

2

The number of things to the left that have to be cached to run the Viterbi algorithm: the maximum context of class features used.

maxRight

int

2

The number of things to the right that have to be cached to run the Viterbi algorithm: the maximum context of class features used. The maximum possible clique size to use is (maxLeft + maxRight + 1)

dontExtendTaggy

boolean

false

Don't extend the range of useTaggySequences when maxLeft is increased.

numFolds

int

1

The number of folds to use for cross-validation. CURRENTLY NOT IMPLEMENTED.

startFold

int

1

The starting fold to run. CURRENTLY NOT IMPLEMENTED.

endFold

int

1

The last fold to run. CURRENTLY NOT IMPLEMENTED.

mergeTags

boolean

false

Whether to merge B- and I- tags.

splitDocuments

boolean

true

Whether or not to split the data into separate documents for training/testing

maxDocSize

int

10000

If this number is greater than 0, attempt to split documents bigger than this value into multiple documents at sentence boundaries during testing; otherwise do nothing.
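
To make the map property from the table above concrete, here is a minimal sketch of how a specification such as word=0,tag=1,answer=2 can be turned into a lookup from field name to column index and applied to one tab-separated line. ColumnMapSketch, parseMap and readRow are hypothetical names chosen for this illustration; the actual column reader in the library may work differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper illustrating the "map" property; not library code. */
final class ColumnMapSketch {

  /** Parses a spec like "word=0,answer=1" into field name -> column index. */
  static Map<String, Integer> parseMap(String spec) {
    Map<String, Integer> columns = new LinkedHashMap<>();
    for (String entry : spec.split(",")) {
      String[] kv = entry.trim().split("=");
      if (kv.length != 2) {
        throw new IllegalArgumentException("Bad map entry: " + entry);
      }
      columns.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
    }
    return columns;
  }

  /** Splits one tab-separated line and labels each cell with its field name. */
  static Map<String, String> readRow(String line, Map<String, Integer> columns) {
    String[] cells = line.split("\t");
    Map<String, String> row = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : columns.entrySet()) {
      row.put(e.getKey(), cells[e.getValue()]);
    }
    return row;
  }

  public static void main(String[] args) {
    Map<String, Integer> columns = parseMap("word=0,tag=1,answer=2");
    // prints {word=Stanford, tag=NNP, answer=ORGANIZATION}
    System.out.println(readRow("Stanford\tNNP\tORGANIZATION", columns));
  }
}
```

With the two-column spec word=0,answer=1, the same code would read the word from the first column and the class from the second.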

Note: flags/properties overwrite left to right. That is, the parameter
setting specified last is the one used.
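
The left-to-right rule can be illustrated with a small sketch. It assumes, purely for illustration, that command-line arguments arrive as "-name value" pairs and are copied into a java.util.Properties object in order; LastSettingWinsSketch and argsToProperties are hypothetical names, and the library's own argument parsing handles more cases than this.

```java
import java.util.Properties;

/** Hypothetical illustration of "the setting specified last is the one used". */
final class LastSettingWinsSketch {

  /** Copies "-name value" pairs into Properties, scanning left to right. */
  static Properties argsToProperties(String... args) {
    Properties props = new Properties();
    for (int i = 0; i + 1 < args.length; i += 2) {
      String name = args[i].startsWith("-") ? args[i].substring(1) : args[i];
      // setProperty replaces any earlier value for the same name,
      // so a later occurrence overrides an earlier one.
      props.setProperty(name, args[i + 1]);
    }
    return props;
  }

  public static void main(String[] args) {
    Properties props = argsToProperties(
        "-useNGrams", "true", "-maxNGramLeng", "6", "-useNGrams", "false");
    // prints false
    System.out.println(props.getProperty("useNGrams"));
  }
}
```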