As of Lucene version 3.1, this class implements the Word Break rules from the
Unicode Text Segmentation algorithm, as specified in
Unicode Standard Annex #29.

Many applications have specific tokenizer needs. If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.

You must specify the required Version
compatibility when creating StandardTokenizer:

As of 3.4, Hiragana and Han characters are no longer wrongly split
from their combining characters. If you use a previous version number,
you get the exact broken behavior for backwards compatibility.

As of 3.1, StandardTokenizer implements Unicode text segmentation.
If you use a previous version number, you get the exact behavior of
ClassicTokenizer for backwards compatibility.
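For illustration, a minimal construction sketch against the 3.x-era
StandardTokenizer(Version, Reader) constructor; the class name and input
string are placeholders:

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class VersionExample {
      public static void main(String[] args) {
        // Version.LUCENE_34 selects the fixed Hiragana/Han word-break behavior.
        StandardTokenizer tokenizer = new StandardTokenizer(
            Version.LUCENE_34, new StringReader("some text"));

        // A pre-3.1 constant reproduces ClassicTokenizer behavior instead.
        StandardTokenizer legacy = new StandardTokenizer(
            Version.LUCENE_30, new StringReader("some text"));
      }
    }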

getMaxTokenLength

Returns the current maximum token length. Tokens longer than this limit are
skipped by the tokenizer; see also setMaxTokenLength.

incrementToken

Consumers (e.g., IndexWriter) use this method to advance the stream to
the next token. Implementing classes must implement this method and update
the appropriate AttributeImpls with the attributes of the next
token.
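A minimal consumer loop, sketched against the 3.x API (the class name and
input text are illustrative):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ConsumerExample {
      public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardTokenizer(
            Version.LUCENE_34, new StringReader("Unicode text segmentation"));
        // Obtain the attribute once; each incrementToken() call updates it in place.
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(termAtt.toString());
        }
        ts.end();
        ts.close();
      }
    }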

The producer must make no assumptions about the attributes after the method
has returned: the caller may change them arbitrarily. If the producer
needs to preserve the attribute state for subsequent calls, it can use
AttributeSource.captureState() to create a copy of the current attribute state.

To ensure that filters and consumers know which attributes are available,
the attributes must be added during instantiation. Filters and consumers
are not required to check for availability of attributes in
TokenStream.incrementToken().
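Both points can be illustrated with a hypothetical filter (RepeatFilter is
not a real Lucene class) that emits every token twice: the attribute is
registered at instantiation time, and captureState()/restoreState() carry a
token's attributes across calls to incrementToken():

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    public final class RepeatFilter extends TokenFilter {
      // Added during instantiation, so consumers know the attribute is available.
      private final PositionIncrementAttribute posIncrAtt =
          addAttribute(PositionIncrementAttribute.class);

      private AttributeSource.State saved; // copy of the previous token's attributes

      public RepeatFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (saved != null) {
          // Replay the captured token at the same position.
          restoreState(saved);
          posIncrAtt.setPositionIncrement(0);
          saved = null;
          return true;
        }
        if (!input.incrementToken()) {
          return false; // end of stream
        }
        // The consumer may overwrite the attributes after we return,
        // so keep a private copy for the next call.
        saved = captureState();
        return true;
      }
    }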

end

This method is called by the consumer after the last token has been
consumed, after TokenStream.incrementToken() returned false
(using the new TokenStream API). Streams implementing the old API
should upgrade to use this feature.

This method can be used to perform any end-of-stream operations, such as
setting the final offset of a stream. The final offset of a stream might
differ from the offset of the last token, e.g., when one or more whitespace
characters followed the last token and a WhitespaceTokenizer was used.
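A sketch of that whitespace case, assuming the 3.x API (the trailing spaces
in the input are the point of the example):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    public class FinalOffsetExample {
      public static void main(String[] args) throws Exception {
        // Three trailing spaces: the last token ends at offset 5,
        // but the stream itself ends at offset 8.
        TokenStream ts = new WhitespaceTokenizer(
            Version.LUCENE_34, new StringReader("hello   "));
        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println("token ends at " + offsetAtt.endOffset()); // 5
        }
        ts.end(); // sets the final offset
        System.out.println("final offset: " + offsetAtt.endOffset());   // 8
        ts.close();
      }
    }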