Definition of a language, its lexer and its embedded languages.
It mirrors Language at the SPI level and carries the additional
information that the lexer infrastructure needs in order to operate.
Language hierarchies should be implemented by SPI providers,
and the resulting languages should be made available for public use
(the language hierarchy classes themselves do not need to be public).
A typical situation may look like this:

createTokenIds

Provide a collection of token ids that comprise the language.
If token ids are defined as enums then this method
should simply return EnumSet.allOf(MyTokenId.class).

This method is called only once by the infrastructure
(when the language is constructed), so it does
not need to cache its result.
It is called from within a synchronized section, so if its
implementation uses any synchronization of its own,
care must be taken to prevent deadlocks.
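
As a minimal sketch of the enum-based case described above, the following uses only the JDK so that it runs on its own. MyTokenId and its constants are hypothetical; in a real implementation the enum would also implement the lexer API's TokenId interface, and createTokenIds() would be an override inside a LanguageHierarchy subclass rather than a static method.

```java
import java.util.Collection;
import java.util.EnumSet;

final class CreateTokenIdsSketch {

    // Hypothetical token ids. In the real API this enum would also
    // implement TokenId; that is omitted here to keep the sketch standalone.
    enum MyTokenId { IDENTIFIER, KEYWORD, WHITESPACE, ERROR }

    // Stand-in for the createTokenIds() override. No caching is needed
    // because the infrastructure calls the method only once.
    static Collection<MyTokenId> createTokenIds() {
        return EnumSet.allOf(MyTokenId.class);
    }

    public static void main(String[] args) {
        System.out.println(createTokenIds());
    }
}
```

The method body does not synchronize on anything, which sidesteps the deadlock concern mentioned above.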

createTokenCategories

Provide a map from token category names to the collections of their member token ids.
The result of this method will be merged with the primary-category
information found in the token ids.
This method is called only once by the infrastructure
(when the language is constructed), so it does
not need to cache its result.
It is called from within a synchronized section, so if its
implementation uses any synchronization of its own,
care must be taken to prevent deadlocks.

There is a convention that the category names should only consist
of lowercase letters, numbers and hyphens.

Returns:

mapping of category name to collection of its ids.
It may return null to signal no mappings.
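
The following sketch illustrates one possible shape of such a mapping, together with a check for the naming convention stated above. It is standalone JDK code: MyTokenId, the category names and the helper isConventionalCategoryName are all hypothetical, and in a real implementation createTokenCategories() would be an override inside a LanguageHierarchy subclass.

```java
import java.util.Collection;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class TokenCategoriesSketch {

    // Hypothetical token ids used only for this illustration.
    enum MyTokenId { IF_KEYWORD, WHILE_KEYWORD, PLUS, MINUS, IDENTIFIER }

    // Convention from the text: lowercase letters, numbers and hyphens only.
    private static final Pattern CATEGORY_NAME = Pattern.compile("[a-z0-9-]+");

    // Hypothetical helper; not part of the lexer API.
    static boolean isConventionalCategoryName(String name) {
        return CATEGORY_NAME.matcher(name).matches();
    }

    // Stand-in for the createTokenCategories() override: maps each
    // category name to the token ids that belong to it.
    static Map<String, Collection<MyTokenId>> createTokenCategories() {
        Map<String, Collection<MyTokenId>> categories = new HashMap<>();
        categories.put("keyword",
                EnumSet.of(MyTokenId.IF_KEYWORD, MyTokenId.WHILE_KEYWORD));
        categories.put("arithmetic-operator",
                EnumSet.of(MyTokenId.PLUS, MyTokenId.MINUS));
        return categories;
    }

    public static void main(String[] args) {
        for (String name : createTokenCategories().keySet()) {
            System.out.println(name + " follows convention: "
                    + isConventionalCategoryName(name));
        }
    }
}
```

Token ids that belong to no extra category (IDENTIFIER here) are simply left out of the map.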

embedding

Get the language embedding (if one exists) for a particular token
of the language at this level of the language hierarchy.
This method will only be called if the given token instance
is neither a flyweight token nor a token with custom text:
token.isFlyweight() == false && token.isCustomText() == false
That restriction exists because the child token list is constructed
lazily and the infrastructure needs to access the token's parent token
list, which would not be possible if the token were a flyweight.

Parameters:

token - non-null token for which the language embedding will be resolved.
The token may have a zero length (Token.length() == 0)
in case the language infrastructure performs a poll for all embedded
languages for the

languagePath - non-null language path at which the language embedding
is being created. It may be used for obtaining appropriate information
from inputAttributes.

inputAttributes - input attributes that could affect the embedding creation.
It may be null if there are no extra attributes.

Returns:

language embedding instance or null if there is no language embedding
for this token.

embeddingPresence

Determine whether an embedding may be present for a token with the given token id.
The embedding for a particular token may be never present, always present or sometimes
present (depending on the token's text or properties).
By default the method returns EmbeddingPresence.CACHED_FIRST_QUERY,
so embedding(Token,LanguagePath,InputAttributes)
will be called once (for the first token instance with the given token id)
and, if there is no embedding, embedding creation will not be attempted
for any other token with the same token id. This should be appropriate
for most cases.
This method makes it possible to avoid frequent queries that check
whether a particular token might contain an embedding.
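
To make the default behaviour concrete, here is a toy model of the first-query caching described above. It is not the infrastructure's actual code: the class, the MyTokenId enum and the resolver function (which stands in for embedding(...) and returns an embedded language name or null) are hypothetical. It also assumes that when the first query does find an embedding, later tokens with the same id are still queried; the text above only specifies the no-embedding case.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

final class CachedFirstQuerySketch {

    // Hypothetical token ids used only for this illustration.
    enum MyTokenId { STRING_LITERAL, IDENTIFIER }

    // Outcome of the first query per token id: true if an embedding was found.
    private final Map<MyTokenId, Boolean> firstQueryResult = new EnumMap<>(MyTokenId.class);
    private int resolverCalls = 0;

    // resolver stands in for embedding(...); it returns null for "no embedding".
    String query(MyTokenId id, Function<MyTokenId, String> resolver) {
        Boolean present = firstQueryResult.get(id);
        if (present != null && !present) {
            return null; // first query found nothing, so the resolver is skipped
        }
        resolverCalls++;
        String embedded = resolver.apply(id);
        if (present == null) {
            firstQueryResult.put(id, embedded != null); // remember the first outcome only
        }
        return embedded;
    }

    // Queries the same token id repeatedly and reports how often the resolver ran.
    static int resolverCallsFor(MyTokenId id, int tokenCount) {
        CachedFirstQuerySketch sketch = new CachedFirstQuerySketch();
        // Assumption for the demo: only string literals embed another language.
        Function<MyTokenId, String> resolver =
                t -> t == MyTokenId.STRING_LITERAL ? "embedded-language" : null;
        for (int i = 0; i < tokenCount; i++) {
            sketch.query(id, resolver);
        }
        return sketch.resolverCalls;
    }

    public static void main(String[] args) {
        System.out.println("IDENTIFIER resolver calls: "
                + resolverCallsFor(MyTokenId.IDENTIFIER, 5));
        System.out.println("STRING_LITERAL resolver calls: "
                + resolverCallsFor(MyTokenId.STRING_LITERAL, 5));
    }
}
```

In this model, five IDENTIFIER tokens trigger a single resolver call, which is the saving the default setting is meant to provide.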

isRetainTokenText

This feature is currently not supported: Token.text()
will return null for non-flyweight tokens.

Determine whether the text of a token with the given id should
be retained after the token has been removed from the token list
because the underlying mutable input source was modified.
If so, Token.text() will continue
to return the value that it had right before the token's removal.
This may be useful if the tokens are held directly in parse trees
and the parser queries the tokens for their text.

Retaining text in tokens has performance and memory implications
and should only be done selectively, for tokens where it is desired
(such as identifiers).
The extra performance and memory cost is incurred only when
a token is removed from the token list for the given input.
Token creation performance and memory consumption during
the token's lifetime are unaffected.

Retaining will only work if the input source is capable of providing
the removed text right after the modification has been performed.