10.5 Parsers

Text search parsers are responsible for splitting raw document text
into tokens and identifying each token's type, where
the set of possible types is defined by the parser itself.
Note that a parser does not modify the text at all; it simply
identifies plausible word boundaries. Because of this limited scope,
there is less need for application-specific custom parsers than there is
for custom dictionaries. At present PostgreSQL
provides just one built-in parser, which has been found to be useful for a
wide range of applications.
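
As a minimal sketch of what this looks like in practice, the query below
runs the built-in parser directly over a short string using ts_parse.
The sample text is arbitrary and chosen only for illustration. Each result
row pairs a numeric token type (tokid) with the token text, and every token
is a substring of the input, not a rewritten form.

```sql
-- Run the built-in parser ("default") over an arbitrary sample string.
-- ts_parse returns one row per token: tokid (numeric type) and token (text).
SELECT tokid, token
FROM ts_parse('default', 'PostgreSQL ships a text-search parser');
```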

The built-in parser is named pg_catalog.default.
It recognizes 23 token types:
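
The parser name and its token types can also be inspected on a running
server. The following is a sketch that assumes a stock installation with no
additional parsers; it reads the pg_ts_parser catalog and then calls
ts_token_type for the built-in parser.

```sql
-- List the parsers known to this database, with their schemas.
SELECT n.nspname AS schema_name, p.prsname AS parser_name
FROM pg_ts_parser AS p
JOIN pg_namespace AS n ON n.oid = p.prsnamespace;

-- List the token types the built-in parser can report:
-- numeric id, short alias, and a one-line description.
SELECT tokid, alias, description
FROM ts_token_type('default');
```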

Note: The parser's notion of a “letter” is determined by the database's
locale setting, specifically lc_ctype. Words containing
only the basic ASCII letters are reported as a separate token type,
since it is sometimes useful to distinguish them. In most European
languages, token types word and asciiword
should be treated alike.
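
To see how a given database classifies such words, a query like the
following can help. It is only a sketch: the two sample words are arbitrary,
the simple configuration is used merely to have a configuration to pass, and
the reported token types depend on the database's lc_ctype and encoding, so
no particular output is assumed.

```sql
-- Compare how an ASCII-only word and an accented word are classified.
-- The alias column shows the token type assigned to each token.
SELECT alias, token
FROM ts_debug('simple', 'manana mañana');
```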

The email token type does not support all valid email characters as
defined by RFC 5322. Specifically, the only non-alphanumeric
characters supported in email user names are period, dash, and
underscore.
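
The limitation can be checked with a query along these lines. Both addresses
are hypothetical. The first uses only a period in the user name; the second
uses a plus sign, which RFC 5322 permits but which falls outside the
characters listed above, so it would not be expected to come back as a
single email token. The exact way it is split is not asserted here.

```sql
-- Two hypothetical addresses: one with a period in the user name,
-- one with "+", which the parser does not accept in an email user name.
SELECT alias, token
FROM ts_debug('simple', 'first.last@example.com first+tag@example.com');
```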

It is possible for the parser to produce overlapping tokens from the same
piece of text. As an example, a hyphenated word will be reported both
as the entire word and as each component:
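
A query along the following lines shows the effect. The input
foo-bar-beta1 is just a sample, and the simple configuration is an
arbitrary choice; the token types come from the parser, not from the
configuration's dictionaries.

```sql
-- Inspect the tokens produced for a hyphenated sample word.
-- Expect one row for the whole word plus rows for its components.
SELECT alias, description, token
FROM ts_debug('simple', 'foo-bar-beta1');
```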