Dictionaries are used to eliminate words that should not be
considered in a search (stop words), and
to normalize words so that different
derived forms of the same word will match. A successfully
normalized word is called a lexeme.
Aside from improving search quality, normalization and removal of
stop words reduce the size of the tsvector
representation of a document, thereby improving performance.
Normalization does not always have linguistic meaning and usually
depends on application semantics.

When indexing numbers, for example, we can remove some fractional
digits to reduce the range of possible values, so that
3.14159265359,
3.1415926, and
3.14 will all be the
same after normalization if only two digits are kept after
the decimal point.

A dictionary is a program that accepts a token as input and
returns:

an array of lexemes if the input token is known to the
dictionary (notice that one token can produce more than one
lexeme)

an empty array if the dictionary knows the token, but it
is a stop word

NULL if the dictionary does not
recognize the input token
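
As a quick illustration, ts_lexize shows the first two outcomes
directly (a sketch using the built-in english_stem dictionary;
psql output shown):

SELECT ts_lexize('english_stem', 'stars');
 ts_lexize
-----------
 {star}

SELECT ts_lexize('english_stem', 'a');
 ts_lexize
-----------
 {}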

PostgreSQL provides
predefined dictionaries for many languages. There are also
several predefined templates that can be used to create new
dictionaries with custom parameters. Each predefined dictionary
template is described below. If no existing template is suitable,
it is possible to create new ones; see the contrib/ area of the PostgreSQL distribution for examples.

A text search configuration binds a parser together with a set
of dictionaries to process the parser's output tokens. For each
token type that the parser can return, a separate list of
dictionaries is specified by the configuration. When a token of
that type is found by the parser, each dictionary in the list is
consulted in turn, until some dictionary recognizes it as a known
word. If it is identified as a stop word, or if no dictionary
recognizes the token, it will be discarded and not indexed or
searched for. The general rule for configuring a list of
dictionaries is to place first the most narrow, most specific
dictionary, then the more general dictionaries, finishing with a
very general dictionary, like a Snowball stemmer or simple, which recognizes everything. For example,
for an astronomy-specific search (astro_en configuration) one could bind token type
asciiword (ASCII word) to a synonym
dictionary of astronomical terms, a general English dictionary
and a Snowball English
stemmer:
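
A sketch of such a mapping, assuming that dictionaries named
astrosyn and english_ispell have already been created:

ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword
    WITH astrosyn, english_ispell, english_stem;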

Stop words are words that are very common, appear in almost
every document, and have no discrimination value. Therefore,
they can be ignored in the context of full text searching. For
example, every English text contains words like a and the, so it is
useless to store them in an index. However, stop words do
affect the positions in tsvector, which
in turn affect ranking:
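
For example (output as produced by the built-in english
configuration; the missing positions 1, 2, and 4 are due to
stop words):

SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6

Because the positions differ, ranks calculated for documents
with and without stop words are quite different:

SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'),
                   to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'),
                   to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1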

It is up to the specific dictionary how it treats stop
words. For example, ispell
dictionaries first normalize words and then look at the list of
stop words, while Snowball stemmers
first check the list of stop words. The reason for the
different behavior is an attempt to decrease noise.

The simple dictionary template
operates by converting the input token to lower case and
checking it against a file of stop words. If it is found in the
file then an empty array is returned, causing the token to be
discarded. If not, the lower-cased form of the word is returned
as the normalized lexeme. Alternatively, the dictionary can be
configured to report non-stop-words as unrecognized, allowing
them to be passed on to the next dictionary in the list.

Here is an example of a dictionary definition using the
simple template:
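
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);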

Here, english is the base name of a
file of stop words. The file's full name will be $SHAREDIR/tsearch_data/english.stop, where
$SHAREDIR means the PostgreSQL installation's shared-data
directory, often /usr/local/share/postgresql (use pg_config --sharedir to determine it if you're
not sure). The file format is simply a list of words, one per
line. Blank lines and trailing spaces are ignored, and upper
case is folded to lower case, but no other processing is done
on the file contents.
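
Having created the dictionary, we can test it:

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}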

We can also choose to return NULL,
instead of the lower-cased word, if it is not found in the stop
words file. This behavior is selected by setting the
dictionary's Accept parameter to
false. Continuing the example:
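
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}

(The empty result for YeS is a NULL, meaning the token would be
passed on to the next dictionary, while The is still recognized
as a stop word.)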

With the default setting of Accept
= true, it is only useful to place a
simple dictionary at the end of a list
of dictionaries, since it will never pass on any token to a
following dictionary. Conversely, Accept = false is only
useful when there is at least one following dictionary.

Caution

Most types of dictionaries rely on configuration
files, such as files of stop words. These files
must be
stored in UTF-8 encoding. They will be translated to
the actual database encoding, if that is different,
when they are read into the server.

Caution

Normally, a database session will read a dictionary
configuration file only once, when it is first used
within the session. If you modify a configuration file
and want to force existing sessions to pick up the new
contents, issue an ALTER TEXT
SEARCH DICTIONARY command on the dictionary. This
can be a "dummy" update that
doesn't actually change any parameter values.

This dictionary template is used to create dictionaries that
replace a word with a synonym. Phrases are not supported (use
the thesaurus template (Section
12.6.4) for that). A synonym dictionary can be used to
overcome linguistic problems, for example, to prevent an
English stemmer dictionary from reducing the word 'Paris' to
'pari'. It is enough to have a Paris
paris line in the synonym dictionary and put it before the
english_stem dictionary. For
example:
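
A sketch, assuming the synonym file my_synonyms contains the
line Paris paris:

CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;

SELECT ts_lexize('my_synonym', 'Paris');
 ts_lexize
-----------
 {paris}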

The only parameter required by the synonym template is SYNONYMS, which is the base name of its
configuration file — my_synonyms in
the above example. The file's full name will be $SHAREDIR/tsearch_data/my_synonyms.syn (where
$SHAREDIR means the PostgreSQL installation's shared-data
directory). The file format is just one line per word to be
substituted, with the word followed by its synonym, separated
by white space. Blank lines and trailing spaces are
ignored.
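
For instance, $SHAREDIR/tsearch_data/my_synonyms.syn might
contain lines like these (hypothetical contents):

Paris        paris
indices      index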

The synonym template also has an
optional parameter CaseSensitive,
which defaults to false. When
CaseSensitive is false, words in the synonym file are folded to
lower case, as are input tokens. When it is true, words and tokens are not folded to lower
case, but are compared as-is.

A thesaurus dictionary (sometimes abbreviated as
TZ) is a collection of words
that includes information about the relationships of words and
phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms,
related terms, etc.

Basically a thesaurus dictionary replaces all non-preferred
terms by one preferred term and, optionally, preserves the
original terms for indexing as well. PostgreSQL's current implementation of the
thesaurus dictionary is an extension of the synonym dictionary
with added phrase support. A thesaurus
dictionary requires a configuration file of the following
format:
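
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...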

where the colon (:) symbol acts as a
delimiter between a phrase and its replacement.

A thesaurus dictionary uses a subdictionary (which is specified in the
dictionary's configuration) to normalize the input text before
checking for phrase matches. It is only possible to select one
subdictionary. An error is reported if the subdictionary fails
to recognize a word. In that case, you should remove the use of
the word or teach the subdictionary about it. You can place an
asterisk (*) at the beginning of an
indexed word to skip applying the subdictionary to it, but all
sample words must be
known to the subdictionary.

The thesaurus dictionary chooses the longest match if there
are multiple phrases matching the input, and ties are broken by
using the last definition.

Specific stop words recognized by the subdictionary cannot
be specified; instead use ? to mark
the location where any stop word can appear. For example,
assuming that a and the are stop words according to the
subdictionary:

? one ? two : swsw

matches a one the two and
the one a two; both would be replaced
by swsw.

Since a thesaurus dictionary has the capability to recognize
phrases, it must remember its state and interact with the
parser. A thesaurus dictionary uses its token-type assignments
to check whether it should handle the next word or stop
accumulating a phrase. The
thesaurus dictionary must be configured carefully. For example,
if the thesaurus dictionary is assigned to handle only the
asciiword token, then a thesaurus
dictionary definition like one 7 will
not work since token type uint is not
assigned to the thesaurus dictionary.

Caution

Thesauruses are used during indexing, so any change
in the thesaurus dictionary's parameters requires reindexing. For
most other dictionary types, small changes such as
adding or removing stop words do not force
reindexing.
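
For example, consider the following thesaurus dictionary
definition, which the next paragraphs describe:

CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);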

Here, mythesaurus is the base name
of the thesaurus configuration file. (Its full name will
be $SHAREDIR/tsearch_data/mythesaurus.ths,
where $SHAREDIR means the
installation shared-data directory.)

pg_catalog.english_stem is
the subdictionary (here, a Snowball English stemmer) to
use for thesaurus normalization. Notice that the
subdictionary will have its own configuration (for
example, stop words), which is not shown here.

Now it is possible to bind the thesaurus dictionary
thesaurus_simple to the desired
token types in a configuration, for example:
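
ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_simple;

(The built-in russian configuration is used here only as an
example; any configuration can be altered this way.)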

Now we can see how it works. ts_lexize is not very useful for testing a
thesaurus, because it treats its input as a single token.
Instead we can use plainto_tsquery and to_tsvector, which will break their input
strings into multiple tokens:
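
A sketch, assuming mythesaurus contains the line supernovae
stars : sn and the mapping above is in place:

SELECT plainto_tsquery('russian', 'supernova star');
 plainto_tsquery
-----------------
 'sn'

SELECT to_tsvector('russian', 'supernova star');
 to_tsvector
-------------
 'sn':1

Notice that supernova star matches supernovae stars because the
english_stem subdictionary strips the e and s before the phrase
comparison.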

The Ispell dictionary
template supports morphological
dictionaries, which can normalize many different linguistic
forms of a word into the same lexeme. For example, an English
Ispell dictionary can match
all declensions and conjugations of the search term bank, e.g., banking,
banked, banks, banks', and
bank's.

The standard PostgreSQL
distribution does not include any Ispell configuration files. Dictionaries
for a large number of languages are available from the Ispell
project. Also, some more modern dictionary file formats are
supported: MySpell (OpenOffice versions before 2.0.1) and
Hunspell (OpenOffice 2.0.2 and later). A large list of
dictionaries is available on the OpenOffice Wiki.

To create an Ispell
dictionary, use the built-in ispell
template and specify several parameters:
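
A sketch, assuming the files english.dict and english.affix
(plus english.stop) have been installed in
$SHAREDIR/tsearch_data:

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);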

Here, DictFile, AffFile, and StopWords
specify the base names of the dictionary, affixes, and
stop-words files. The stop-words file has the same format
explained above for the simple
dictionary type. The format of the other files is not specified
here but is available from the above-mentioned web sites.

Ispell dictionaries usually recognize a limited set of
words, so they should be followed by another broader
dictionary; for example, a Snowball dictionary, which
recognizes everything.

Ispell dictionaries support splitting compound words, a
useful feature. Notice that the affix file should specify a
special flag using the compoundwords
controlled statement that marks dictionary words that can
participate in compound formation:
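
compoundwords  controlled z

Here are some examples for the Norwegian language, assuming a
norwegian_ispell dictionary built from files that use such a
flag:

SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}

SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}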

The Snowball dictionary
template is based on a project by Martin Porter, inventor of
the popular Porter's stemming algorithm for the English
language. Snowball now provides stemming algorithms for many
languages (see the Snowball site for more information). Each
algorithm understands how to reduce common variant forms of
words to a base, or stem, spelling within its language. A
Snowball dictionary requires a language parameter to identify which stemmer to
use, and optionally can specify a stopword file name that gives a list of words to
eliminate. (PostgreSQL's
standard stopword lists are also provided by the Snowball
project.) For example, there is a built-in definition
equivalent to
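
CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);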

A Snowball dictionary
recognizes everything, whether or not it is able to simplify
the word, so it should be placed at the end of the dictionary
list. It is useless to have it before any other dictionary
because a token will never pass through it to the next
dictionary.