4. Look up the probability for the context in the vectors file (see "VECTORS FILE"). If the context is not listed there, use the default value of 0.5.

CONTEXT

In this module, a context is a binary vector, currently 17 elements long. It is calculated for every character of the text, converted to its decimal representation, and then looked up in the vectors file (see "VECTORS FILE"). Each element is the result of a simple function such as _is_latin, _is_digit or _is_bracket applied to the input character and a few characters around it.
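
The following Python sketch illustrates the idea of building a context and converting it to a decimal number. It is not the module's actual code: the predicate bodies, the choice of six elements (the module uses 17) and the use of only the current and the next character are assumptions made for the sake of a short example.

```python
# Hypothetical predicates, loosely modelled on the names mentioned above.
def is_latin(ch):
    return bool(ch) and ch.isascii() and ch.isalpha()

def is_digit(ch):
    return bool(ch) and ch.isdigit()

def is_bracket(ch):
    return bool(ch) and ch in "()[]{}<>"

def context_bits(text, pos):
    """Build a small binary vector for the character at text[pos]."""
    cur = text[pos]
    # Assumption: an empty string stands for "no next character".
    nxt = text[pos + 1] if pos + 1 < len(text) else ""
    checks = [
        is_latin(cur), is_latin(nxt),
        is_digit(cur), is_digit(nxt),
        is_bracket(cur), is_bracket(nxt),
    ]
    return [int(flag) for flag in checks]

def bits_to_decimal(bits):
    """Read the bits as a binary number, most significant bit first."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value

def context_value(text, pos):
    return bits_to_decimal(context_bits(text, pos))

print(context_bits("a1(", 0))   # [1, 0, 0, 1, 0, 0]
print(context_value("a1(", 0))  # 36
```

The decimal value is a compact key: two positions in the text with the same surroundings produce the same number, so they share one entry in the vectors file.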

VECTORS FILE

Contains a list of vectors, each with a probability value showing the chance that a given vector is a token boundary.
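
A minimal Python sketch of the lookup, assuming the vectors have already been loaded into a dictionary keyed by the decimal value of the context. The dictionary contents and the function name are made up for illustration; only the default of 0.5 comes from this document.

```python
DEFAULT_PROBABILITY = 0.5  # used when the context is not in the vectors file

def boundary_probability(vectors, context_value):
    """Return the token-boundary probability for a context."""
    return vectors.get(context_value, DEFAULT_PROBABILITY)

# Made-up sample data: decimal context value -> probability.
vectors = {36: 0.02, 9: 0.97}

print(boundary_probability(vectors, 36))    # 0.02
print(boundary_probability(vectors, 1234))  # 0.5
```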

Built by the OpenCorpora project from a semi-automatically annotated corpus.

HYPHENS FILE

Contains a list of hyphenated Russian words. Used when calculating vectors.

Built by the OpenCorpora project from a semi-automatically annotated corpus.

EXCEPTIONS FILE

Contains a list of character sequences that are not subject to tokenization, that is, sequences that must not be split.

Built by the OpenCorpora project from a semi-automatically annotated corpus.

PREFIXES FILE

Contains a list of common prefixes of compound words.

Built by the OpenCorpora project from a semi-automatically annotated corpus.

NOTE: All files are stored as gzip archives and should not be edited manually.