You can customize this tokenizer's behavior by specifying per-script rule files,
which are compiled by the ICU RuleBasedBreakIterator. See the
ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument containing a
comma-separated list of code:rulefile pairs. Each pair consists of a
four-letter ISO 15924 script code, followed by a colon, then a resource
path. For example, to specify rules for Latin (script code "Latn") and
Cyrillic (script code "Cyrl"):
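
A minimal sketch of how a "rulefiles" value of this shape breaks down into script-code and resource-path pairs. The class name RuleFilesSketch, the method parseRuleFiles, and the two .rbbi file names are hypothetical and chosen only for illustration; this is not the factory's actual parsing code, and the length check is a simplification of real ISO 15924 validation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RuleFilesSketch {

    // Splits a comma-separated list of "code:rulefile" pairs into an
    // ordered map from script code to resource path.
    static Map<String, String> parseRuleFiles(String rulefiles) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in pair: " + pair);
            }
            String scriptCode = pair.substring(0, colon).trim();
            String resourcePath = pair.substring(colon + 1).trim();
            // Simplified check (assumption): ISO 15924 codes are four letters long.
            if (scriptCode.length() != 4) {
                throw new IllegalArgumentException("Not a four-letter script code: " + scriptCode);
            }
            result.put(scriptCode, resourcePath);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical rule file names for the Latin and Cyrillic scripts.
        String rulefiles = "Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi";
        parseRuleFiles(rulefiles)
            .forEach((script, path) -> System.out.println(script + " -> " + path));
    }
}
```

Each script code appears at most once in the resulting map, so a later pair for the same script would replace an earlier one in this sketch.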