Detailed Description

Encapsulation of a User Defined Function for performing tokenization of text runs.

You must implement a subclass of this class.

When you install a subclass of LexerUDF as a native plugin, MarkLogic Server can apply your algorithm to perform tokenization of a text run. To activate your tokenization algorithm for a particular language, you must apply the appropriate language customization configuration.
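As a sketch of the shape such a subclass takes, the code below defines an illustrative stand-in interface and a trivial implementation. The `Lexer`, `Token`, and `tokenize` names here are assumptions for illustration only; the actual base class and method signatures come from the MarkLogic PluginAPI headers.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Illustrative stand-in for the plugin interface; the real LexerUDF
// base class and its method signatures are defined by the MarkLogic
// PluginAPI headers and will differ from this sketch.
struct Token {
  enum Kind { Word, Space, Punct } kind;
  std::string text;
};

class Lexer {
public:
  virtual ~Lexer() = default;
  // Produce a complete partition of the input: concatenating the
  // token texts in order must reproduce the input exactly.
  virtual std::vector<Token> tokenize(const std::string& input) = 0;
};

// A trivial subclass that splits on character-class boundaries.
class SimpleLexer : public Lexer {
public:
  std::vector<Token> tokenize(const std::string& input) override {
    std::vector<Token> out;
    size_t i = 0;
    while (i < input.size()) {
      Token::Kind k = classify(input[i]);
      size_t j = i;
      while (j < input.size() && classify(input[j]) == k) ++j;
      out.push_back({k, input.substr(i, j - i)});
      i = j;
    }
    return out;
  }

private:
  static Token::Kind classify(char ch) {
    unsigned char c = static_cast<unsigned char>(ch);
    if (std::isalnum(c)) return Token::Word;
    if (std::isspace(c)) return Token::Space;
    return Token::Punct;
  }
};
```

Note that even whitespace and punctuation are emitted as tokens, so the output accounts for every character of the input; that property is what the partition requirement below demands.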

Your tokenization algorithm will be applied automatically when content in the configured language is loaded into the database or searched. You can also see the effect of your algorithm from XQuery (cts:tokenize) or JavaScript (cts.tokenize).

To use your tokenizer:

1. Deploy the plugin to MarkLogic Server, for example by calling plugin:install-from-zip.
2. Configure the language to use your algorithm, for example by calling lang:language-config-write.

A tokenizer is expected to provide a complete partition of the input array, with no gaps, overlaps, or changes to the content. The contents of documents are stored in tokenized form, so failure to abide by this rule may cause document contents to be changed.
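One way to guard against violating this rule during development is to verify, after tokenizing, that the tokens rejoin into the original input with no gaps or overlaps. The harness below is a development-time sanity check of my own devising, not part of the plugin API; `Span` is a hypothetical (offset, text) pair for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token representation: starting offset plus the exact
// text the token covers.
struct Span {
  size_t offset;
  std::string text;
};

// Returns true only if the tokens form a complete partition of the
// input: each token starts where the previous one ended, its text
// matches the input at that position, and the last token ends at the
// end of the input.
bool isCompletePartition(const std::string& input,
                         const std::vector<Span>& tokens) {
  size_t pos = 0;
  for (const Span& t : tokens) {
    if (t.offset != pos) return false;                 // gap or overlap
    if (input.compare(pos, t.text.size(), t.text) != 0)
      return false;                                    // content changed
    pos += t.text.size();
  }
  return pos == input.size();                          // trailing gap
}
```

A check like this makes a good unit-test invariant for any tokenizer implementation, since a violation silently corrupts stored document contents.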

Tokens may have an optional part of speech, for implementations that need to provide extra information to their stemmer. It is generally preferable to produce multiple alternative stems rather than attempt to be overly precise: tokenization and stemming of query strings need to be able to produce matches for content, and part-of-speech information is unlikely to be accurate for short query strings.
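The preference for alternative stems can be sketched as follows. This is illustrative only, with a hypothetical `candidateStems` helper and a toy lookup table; it is not the MarkLogic stemmer API. For an ambiguous token such as "saw" (noun, or past tense of "see"), every plausible stem is emitted so that a query can match content stemmed either way.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative only: rather than committing to a single part of
// speech for an ambiguous token, emit every plausible stem and let
// the index match any of them.
std::vector<std::string> candidateStems(const std::string& token) {
  static const std::map<std::string, std::vector<std::string>> table = {
      {"saw",  {"saw", "see"}},    // noun vs. verb past tense
      {"left", {"left", "leave"}}, // adjective/noun vs. verb past tense
  };
  auto it = table.find(token);
  if (it != table.end()) return it->second;
  return {token};  // default: the token is its own stem
}
```

Emitting both stems trades a little index precision for recall, which is usually the right trade-off when the query string is too short to disambiguate.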

MarkLogic Server uses the API version to enforce plugin consistency across all hosts in a cluster. The API version against which your plugin is compiled must match the API version supported by the MarkLogic Server instance(s) on which your plugin executes.

For more information, see "Registering a Lexer UDF" in the Application Developer's Guide.