Class NumberNormalizer

Provides functions for converting words to numbers.
Unlike QuantifiableEntityNormalizer that normalizes various
types of quantifiable entities like money and dates,
NumberNormalizer only normalizes numeric expressions
(e.g., one => 1, two hundred => 200.0).
This code is somewhat hacked together and should be reworked.
There is a Perl library for parsing English numbers:
http://blog.cordiner.net/2010/01/02/parsing-english-numbers-with-perl/
TODO: To be merged into QuantifiableEntityNormalizer.
It can be used by QuantifiableEntityNormalizer
to first convert numbers expressed as words
into numeric quantities before figuring
out how to do higher-level combinations
(like one hundred dollars and five cents).
TODO: Known not to handle the following:
oh: two oh one
non-integers: one and a half, one point five, three fifths
funky numbers: pi
TODO: This class is very language dependent.
It should really be AmericanEnglishNumberNormalizer.
TODO: Make things not static

Method Detail

setVerbose

wordToNumber

Fairly generous utility function to convert a string representing
a number (hopefully) to a Number.
Assumes that something else has somehow determined that the string
makes ONE suitable number.
The value of the number is determined by:
0. Breaking up the string into pieces using whitespace
(stuff like "and", "-", "," is turned into whitespace);
1. Determining the numeric value of each piece;
2. Combining the pieces together to form the overall value:
a. Find the largest component and its value (X),
b. Let B = overall value of pieces to the left (recursive),
c. Let C = overall value of pieces to the right (recursive),
d. The overall value = B*X + C.
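
The recursive combination described above can be illustrated with a minimal, self-contained sketch. This is not the actual NumberNormalizer code: the class WordToNumberSketch, its small vocabulary, and the conventions B = 1 when nothing lies to the left and C = 0 when nothing lies to the right are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the real NumberNormalizer implementation.
final class WordToNumberSketch {
    // Assumed toy vocabulary; the real class handles many more words.
    private static final Map<String, Long> VALUES = new HashMap<>();
    static {
        String[] small = {"zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen",
            "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < small.length; i++) VALUES.put(small[i], (long) i);
        String[] tens = {"twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) VALUES.put(tens[i], 20L + 10L * i);
        VALUES.put("hundred", 100L);
        VALUES.put("thousand", 1000L);
        VALUES.put("million", 1000000L);
    }

    static long wordToNumber(String str) {
        // Turn "and", "-", and "," into whitespace, then split into pieces.
        String cleaned = str.toLowerCase()
            .replaceAll("\\band\\b", " ")
            .replaceAll("[-,]", " ")
            .trim();
        // Look up the numeric value of each piece.
        List<Long> pieces = new ArrayList<>();
        for (String word : cleaned.split("\\s+")) {
            Long value = VALUES.get(word);
            if (value == null) {
                throw new NumberFormatException("Unknown number word: " + word);
            }
            pieces.add(value);
        }
        return combine(pieces, 0, pieces.size());
    }

    // Combines pieces[from, to) as B*X + C around the largest component X.
    private static long combine(List<Long> pieces, int from, int to) {
        int max = from;
        for (int i = from + 1; i < to; i++) {
            if (pieces.get(i) > pieces.get(max)) max = i;
        }
        long x = pieces.get(max);
        // Assumption: an empty left side multiplies by 1, an empty right side adds 0.
        long b = max > from ? combine(pieces, from, max) : 1;
        long c = max + 1 < to ? combine(pieces, max + 1, to) : 0;
        return b * x + c;
    }

    public static void main(String[] args) {
        System.out.println(wordToNumber("two hundred"));                            // 200
        System.out.println(wordToNumber("one hundred and five"));                   // 105
        System.out.println(wordToNumber("two thousand, three hundred forty-five")); // 2345
    }
}
```

For "three hundred thousand", the largest component is thousand, the left side evaluates recursively to 300, and the right side is empty, so the sketch returns 300 * 1000 + 0.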

Parameters:

str - The String to convert

Returns:

numeric value of string

getNewEnv

findNumbers

Finds and marks numbers (does not need NumberSequenceClassifier).
Each token is annotated with its numeric value and type:
- CoreAnnotations.NumericTypeAnnotation.class: ORDINAL, UNIT (hundred, thousand,..., dozen, gross,...), NUMBER
- CoreAnnotations.NumericValueAnnotation.class: Number representing the numeric value of the token
(two thousand => 2 1000).
Also tries to separate individual numbers like four five six,
while keeping numbers like four hundred and seven together.
Annotates tokens belonging to each composite number with:
- CoreAnnotations.NumericCompositeTypeAnnotation.class: ORDINAL (1st, 2nd), NUMBER (one hundred)
- CoreAnnotations.NumericCompositeValueAnnotation.class: Number representing the composite numeric value
(two thousand => 2000 2000).
Also returns a list of CoreMaps representing the identified numbers.
The function is overly aggressive in marking possible numbers;
it should either do more checks or be used in conjunction with NumberSequenceClassifier
to avoid marking certain tokens (like second/NN) as numbers.
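
The two annotation layers (per-token value and composite value) can be pictured with a toy sketch that does not use CoreNLP at all. Everything here is hypothetical: the FindNumbersSketch class, the Tok type, the tiny vocabulary, and in particular the splitting rule (two adjacent plain numbers start a new number unless they look like tens followed by ones) are assumptions for illustration, not the rules the real method applies. ORDINAL handling is omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; does not call CoreNLP.
final class FindNumbersSketch {
    enum Type { NUMBER, UNIT }

    // A token with its own value and the value of the composite number it belongs to.
    static final class Tok {
        final String word;
        final Type type;
        final long value;
        long compositeValue;
        Tok(String word, Type type, long value) {
            this.word = word; this.type = type; this.value = value;
        }
    }

    // Assumed toy vocabulary.
    private static final Map<String, Long> VALUES = new HashMap<>();
    static {
        String[] small = {"one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine", "ten"};
        for (int i = 0; i < small.length; i++) VALUES.put(small[i], i + 1L);
        VALUES.put("twenty", 20L);
        VALUES.put("hundred", 100L);
        VALUES.put("thousand", 1000L);
    }

    static List<List<Tok>> findNumbers(String text) {
        List<List<Tok>> numbers = new ArrayList<>();
        List<Tok> current = new ArrayList<>();
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            Tok prev = current.isEmpty() ? null : current.get(current.size() - 1);
            Long v = VALUES.get(w);
            if (v == null) {
                // Assumption: "and" continues a number only right after a unit word.
                if (w.equals("and") && prev != null && prev.type == Type.UNIT) continue;
                close(numbers, current);
                current = new ArrayList<>();
                continue;
            }
            Tok tok = new Tok(w, v >= 100 ? Type.UNIT : Type.NUMBER, v);
            if (prev != null) {
                boolean bothPlain = prev.type == Type.NUMBER && tok.type == Type.NUMBER;
                boolean tensThenOnes = prev.value >= 20 && tok.value < 10;
                // Assumed rule: "four five" splits, "twenty one" stays together.
                if (bothPlain && !tensThenOnes) {
                    close(numbers, current);
                    current = new ArrayList<>();
                }
            }
            current.add(tok);
        }
        close(numbers, current);
        return numbers;
    }

    // Stamps every token of a finished group with the group's composite value.
    private static void close(List<List<Tok>> numbers, List<Tok> group) {
        if (group.isEmpty()) return;
        long composite = combine(group, 0, group.size());
        for (Tok t : group) t.compositeValue = composite;
        numbers.add(group);
    }

    // Same B*X + C recursion that wordToNumber describes.
    private static long combine(List<Tok> g, int from, int to) {
        int max = from;
        for (int i = from + 1; i < to; i++) {
            if (g.get(i).value > g.get(max).value) max = i;
        }
        long b = max > from ? combine(g, from, max) : 1;
        long c = max + 1 < to ? combine(g, max + 1, to) : 0;
        return b * g.get(max).value + c;
    }
}
```

Under these assumptions, "two thousand" yields one group whose tokens carry the values 2 and 1000 and the shared composite value 2000, while "four five six" yields three separate groups.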

Parameters:

annotation - The annotation structure

Returns:

list of CoreMap representing the identified numbers

findAndMergeNumbers

Takes annotation and identifies numbers in the annotation.
Returns a list of tokens (as CoreMaps) with numbers merged.
As a by-product, also marks each individual token with the TokenBeginAnnotation and TokenEndAnnotation;
this is mainly to make it easier for the rest of the code to figure out what the token offsets are.
Note that this copies the annotation, since it modifies token offsets in the original.
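
The effect of merging can be sketched without CoreNLP. The MergeNumbersSketch class and its Chunk type are hypothetical stand-ins for the CoreMaps the real method returns, and the sketch assumes the number spans are already known, sorted, and non-overlapping. Each output chunk records which original tokens it covers as a begin index (inclusive) and an end index (exclusive), which is the role the token offset annotations play.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; does not call CoreNLP.
final class MergeNumbersSketch {
    // A chunk covering original tokens [tokenBegin, tokenEnd).
    static final class Chunk {
        final String text;
        final int tokenBegin;
        final int tokenEnd;
        Chunk(String text, int tokenBegin, int tokenEnd) {
            this.text = text; this.tokenBegin = tokenBegin; this.tokenEnd = tokenEnd;
        }
    }

    // spans: assumed sorted, non-overlapping {begin, end} token index pairs (end exclusive).
    static List<Chunk> merge(List<String> tokens, List<int[]> spans) {
        List<Chunk> merged = new ArrayList<>();
        int i = 0;
        for (int[] span : spans) {
            // Tokens before the number pass through as single-token chunks.
            for (; i < span[0]; i++) {
                merged.add(new Chunk(tokens.get(i), i, i + 1));
            }
            // The tokens of the number collapse into one chunk.
            String text = String.join(" ", tokens.subList(span[0], span[1]));
            merged.add(new Chunk(text, span[0], span[1]));
            i = span[1];
        }
        // Remaining tokens after the last number.
        for (; i < tokens.size(); i++) {
            merged.add(new Chunk(tokens.get(i), i, i + 1));
        }
        return merged;
    }
}
```

For the tokens I, have, two, hundred, apples with a number span covering tokens 2 to 4, the sketch returns four chunks, the third being "two hundred" with begin 2 and end 4.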