FeatureGeneratorUtil.tokenFeature() is too specific for some languages

Details

Type: Improvement

Status: Resolved

Priority: Minor

Resolution: Fixed

Affects Version/s: 1.9.0

Fix Version/s: None

Component/s: None

Labels: None

Description

As I described in OPENNLP-1197, for Japanese NER we usually use only DIGIT, HIRA (あ, い, う, え, お etc.), KATA (ア, イ, ウ, エ, オ etc.), ALPHA and OTHER as token classes. What FeatureGeneratorUtil.tokenFeature() currently provides is too specific. For example, I don't need to distinguish among lc (lowercase alphabet), ac (all capital letters) and ic (initial capital letter).
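To make the coarse classes concrete, here is a minimal sketch of such a classifier. It is an illustration only, not the OpenNLP implementation: the class name CoarseTokenClass and its methods are hypothetical, and it assumes that a token gets a class only when all of its characters share that class, falling back to OTHER otherwise. ALPHA is assumed to mean ASCII letters.

```java
// Hypothetical helper for illustration; not part of OpenNLP.
final class CoarseTokenClass {

    enum CharClass { DIGIT, HIRA, KATA, ALPHA, OTHER }

    // Classify a single code point into one of the five coarse classes.
    static CharClass classify(int cp) {
        if (Character.isDigit(cp)) {
            return CharClass.DIGIT;
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        if (block == Character.UnicodeBlock.HIRAGANA) {
            return CharClass.HIRA;
        }
        if (block == Character.UnicodeBlock.KATAKANA) {
            return CharClass.KATA;
        }
        // Assumption: ALPHA covers ASCII letters only, with no case distinction.
        if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')) {
            return CharClass.ALPHA;
        }
        return CharClass.OTHER;
    }

    // A token gets a class only if every code point agrees; mixed tokens are OTHER.
    static String tokenClass(String token) {
        if (token.isEmpty()) {
            return CharClass.OTHER.name();
        }
        CharClass first = null;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            CharClass current = classify(cp);
            if (first == null) {
                first = current;
            } else if (first != current) {
                return CharClass.OTHER.name();
            }
            i += Character.charCount(cp);
        }
        return first.name();
    }
}
```

With this sketch, "Tokyo", "tokyo" and "TOKYO" all map to ALPHA, so the lc / ac / ic distinction disappears, while "あいう" maps to HIRA and "アイウ" to KATA.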

By way of trial, if I applied the following patch in order to avoid "too specific token class generation":