Text Search Data Types

Greenplum Database provides two data types that are designed to support full text search,
which is the activity of searching through a collection of natural-language documents
to locate those that best match a query. The tsvector type
represents a document in a form optimized for text search; the tsquery type
similarly represents a text query. Using Full Text Search provides a detailed explanation of
this facility, and Text Search Functions and Operators summarizes the
related functions and operators.

The tsvector and tsquery types cannot be part of the
distribution key of a Greenplum Database table.

tsvector

A tsvector value is a sorted list of distinct lexemes, which are
words that have been normalized to merge different variants of the same word (see
Using Full Text Search for details). Sorting and
duplicate elimination are done automatically during input.
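
The sorting and duplicate elimination described above can be seen by casting a string literal to tsvector. This is a minimal sketch that assumes a psql session; the sample sentences are illustrative, and the second statement uses dollar quoting so that a lexeme containing whitespace can be wrapped in single quotes.

```sql
-- Words are sorted and repeated words are stored only once.
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
-- 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'

-- A lexeme that contains whitespace or punctuation must be quoted.
SELECT $$the lexeme '    ' contains spaces$$::tsvector;
-- '    ' 'contains' 'lexeme' 'spaces' 'the'
```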

Optionally, integer positions can be attached to lexemes. A position normally indicates the source word's location in the document. Positional
information can be used for proximity ranking. Position values can range from 1 to
16383; larger numbers are silently set to 16383. Duplicate positions for the same lexeme
are discarded.
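
As a sketch of how positions look in practice, a position is written after a lexeme with a colon, and several positions for the same lexeme are collected into one comma-separated list. The input strings below are illustrative only.

```sql
-- Each word carries its position; repeated words accumulate positions.
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
-- 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4

-- A position above the documented maximum is accepted without error;
-- per the limit described above it should be stored as 16383.
SELECT 'cat:99999'::tsvector;
```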

Lexemes that have positions can further be labeled with a weight, which can be
A, B, C, or D.
D is the default and hence is not shown on output.
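
A weight is written as a letter immediately after the position it applies to. The following sketch, with made-up lexemes, shows that the explicit D on the last lexeme is dropped on output because it is the default.

```sql
-- Weights A, B, and C are echoed back; the default weight D is omitted.
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
-- 'a':1A 'cat':5 'fat':2B,4C
```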

Weights are typically used to reflect document structure, for example by marking title
words differently from body words. Text search ranking functions can assign different
priorities to the different weight markers.

It is important to understand that the tsvector type itself does not
perform any normalization; it assumes the words it is given are normalized appropriately
for the application. For example, capitalized or plural words that are cast directly to
tsvector are stored exactly as written.

For most English text search applications such words would be considered
non-normalized, but tsvector does not care. Raw document text should usually be passed
through to_tsvector to normalize the words appropriately for
searching.
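
The difference between a plain cast and to_tsvector can be illustrated as follows. This sketch assumes the built-in english text search configuration is available; the sample phrase is illustrative.

```sql
-- A direct cast keeps the words exactly as given.
SELECT 'The Fat Rats'::tsvector;
-- 'Fat' 'Rats' 'The'

-- to_tsvector lowercases, stems, drops stop words, and records positions.
SELECT to_tsvector('english', 'The Fat Rats');
-- 'fat':2 'rat':3
```

Position 1 is absent from the second result because the stop word "The" is removed after positions have been assigned.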

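tsquery

A tsquery value stores lexemes that are to be searched for, and combines them using
the Boolean operators & (AND), | (OR), and ! (NOT).

Quoting rules for lexemes are the same as described previously for lexemes in
tsvector; and, as with tsvector, any required
normalization of words must be done before converting to the tsquery
type. The to_tsquery function is convenient for performing such
normalization.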
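
The same contrast applies to queries. The sketch below again assumes the english configuration; the query terms are illustrative.

```sql
-- A direct cast to tsquery leaves the terms unnormalized.
SELECT 'Fat & Cats'::tsquery;
-- 'Fat' & 'Cats'

-- to_tsquery normalizes each term before building the query.
SELECT to_tsquery('english', 'Fat & Cats');
-- 'fat' & 'cat'

-- Normalized documents and queries can then be compared with the @@ match operator.
SELECT to_tsvector('english', 'The Fat Rats') @@ to_tsquery('english', 'rat');
-- t
```

Normalizing both sides with the same configuration is what lets the query term rat match the document word Rats.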