The pg_trgm module provides
functions and operators for determining the similarity of
alphanumeric text based on trigram matching, as well as index
operator classes that support fast searching for similar
strings.

A trigram is a group of three consecutive characters taken
from a string. We can measure the similarity of two strings by
counting the number of trigrams they share. This simple idea
turns out to be very effective for measuring the similarity of
words in many natural languages.

Note:pg_trgm ignores
non-word characters (non-alphanumerics) when extracting
trigrams from a string. Each word is considered to have two
spaces prefixed and one space suffixed when determining the
set of trigrams contained in the string. For example, the
set of trigrams in the string "cat" is
" c",
" ca",
"cat",
and "at
". The set of trigrams in the string "foo|bar" is
" f",
" fo",
"foo",
"oo ",
" b",
" ba",
"bar",
and "ar
".

Returns a number that indicates how similar the two
arguments are. The range of the result is zero
(indicating that the two strings are completely
dissimilar) to one (indicating that the two strings are
identical).

show_trgm(text)

text[]

Returns an array of all the trigrams in the given
string. (In practice this is seldom useful except for
debugging.)

show_limit()

real

Returns the current similarity threshold used by
the % operator. This sets the
minimum similarity between two words for them to be
considered similar enough to be misspellings of each
other, for example.

set_limit(real)

real

Sets the current similarity threshold that is used
by the % operator. The
threshold must be between 0 and 1 (default is 0.3).
Returns the same value passed in.

Table F-23. pg_trgm
Operators

Operator

Returns

Description

text%text

boolean

Returns true if its
arguments have a similarity that is greater than the
current similarity threshold set by set_limit.

text<->text

real

Returns the "distance"
between the arguments, that is one minus the
similarity() value.

The pg_trgm module provides GiST
and GIN index operator classes that allow you to create an
index over a text column for the purpose of very fast
similarity searches. These index types support the
above-described similarity operators, and additionally support
trigram-based index searches for LIKE,
ILIKE, ~ and
~* queries. (These indexes do not
support equality nor simple comparison operators, so you may
need a regular B-tree index too.)

This will return all values in the text column that are
sufficiently similar to word,
sorted from best match to worst. The index will be used to make
this a fast operation even over very large data sets.

A variant of the above query is

SELECT t, t <-> 'word' AS dist
FROM test_trgm
ORDER BY dist LIMIT 10;

This can be implemented quite efficiently by GiST indexes,
but not by GIN indexes. It will usually beat the first
formulation when only a small number of the closest matches is
wanted.

Beginning in PostgreSQL
9.1, these index types also support index searches for
LIKE and ILIKE, for example

SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

The index search works by extracting trigrams from the
search string and then looking these up in the index. The more
trigrams in the search string, the more effective the index
search is. Unlike B-tree based searches, the search string need
not be left-anchored.

Beginning in PostgreSQL
9.3, these index types also support index searches for
regular-expression matches (~ and
~* operators), for example

SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';

The index search works by extracting trigrams from the
regular expression and then looking these up in the index. The
more trigrams that can be extracted from the regular
expression, the more effective the index search is. Unlike
B-tree based searches, the search string need not be
left-anchored.

For both LIKE and
regular-expression searches, keep in mind that a pattern with
no extractable trigrams will degenerate to a full-index
scan.

The choice between GiST and GIN indexing depends on the
relative performance characteristics of GiST and GIN, which are
discussed elsewhere. As a rule of thumb, a GIN index is faster
to search than a GiST index, but slower to build or update; so
GIN is better suited for static data and GiST for often-updated
data.

Trigram matching is a very useful tool when used in
conjunction with a full text index. In particular it can help
to recognize misspelled input words that will not be matched
directly by the full text search mechanism.

The first step is to generate an auxiliary table containing
all the unique words in the documents:

where documents is a table that
has a text field bodytext that we
wish to search. The reason for using the simple configuration with the to_tsvector function, instead of using a
language-specific configuration, is that we want a list of the
original (unstemmed) words.

Next, create a trigram index on the word column:

CREATE INDEX words_idx ON words USING gin(word gin_trgm_ops);

Now, a SELECT query similar to the
previous example can be used to suggest spellings for
misspelled words in user search terms. A useful extra test is
to require that the selected words are also of similar length
to the misspelled word.

Note: Since the words
table has been generated as a separate, static table, it
will need to be periodically regenerated so that it remains
reasonably up-to-date with the document collection. Keeping
it exactly current is usually unnecessary.