NAME

DESCRIPTION

Introduction

SQLite is bundled with an extension module called "FTS" for full-text indexing. Tables with this feature enabled can be efficiently queried to find rows that contain one or more instances of some specified words (also called "tokens"), in any column, even if the table contains many large documents.

The first full-text search modules for SQLite were called FTS1 and FTS2 and are now obsolete. The latest version is FTS4, but it shares many features with the former module FTS3, which is why parts of the API and parts of the documentation still refer to FTS3; from a client point of view, both can be considered largely equivalent. Detailed documentation can be found at http://www.sqlite.org/fts3.html.

where <columns> is a list of column names. Columns may be typed, but the type information is ignored. If no columns are specified, the default is a single column named content. In addition, FTS tables have an implicit column called docid (or also rowid) for numbering the stored documents.

Statements for inserting, updating or deleting records use the same syntax as for regular SQLite tables.

Full-text searches are specified with the MATCH operator, and an operand which may be a single word, a word prefix ending with '*', a list of words, a "phrase query" in double quotes, or a boolean combination of the above.

The builtin function snippet(...) builds a formatted excerpt of the document text, where the words pertaining to the query are highlighted.

There are many more details to building and searching FTS tables, so we strongly invite you to read the full documentation at http://www.sqlite.org/fts3.html.

QUERY SYNTAX

Here are some explanation about FTS queries, borrowed from the sqlite documentation.

Token or token prefix queries

An FTS table may be queried for all documents that contain a specified term, or for all documents that contain a term with a specified prefix. The query expression for a specific term is simply the term itself. The query expression used to search for a term prefix is the prefix itself with a '*' character appended to it. For example:

If a search token (on the right-hand side of the MATCH operator) begins with "^" then that token must be the first in its field of the document : so for example ^lin* matches 'linux kernel changes ...' but does not match 'new linux implementation'.

Column specifications

Normally, a token or token prefix query is matched against the FTS table column specified as the right-hand side of the MATCH operator. Or, if the special column with the same name as the FTS table itself is specified, against all columns. This may be overridden by specifying a column-name followed by a ":" character before a basic term query. There may be space between the ":" and the term to query for, but not between the column-name and the ":" character. For example:

Phrase queries

A phrase query is a query that retrieves all documents that contain a nominated set of terms or term prefixes in a specified order with no intervening tokens. Phrase queries are specified by enclosing a space separated sequence of terms or term prefixes in double quotes ("). For example:

NEAR queries.

A NEAR query is a query that returns documents that contain a two or more nominated terms or phrases within a specified proximity of each other (by default with 10 or less intervening terms). A NEAR query is specified by putting the keyword "NEAR" between two phrase, term or prefix queries. To specify a proximity other than the default, an operator of the form "NEAR/<N>" may be used, where <N> is the maximum number of intervening terms allowed. For example:

More than one NEAR operator may appear in a single query. In this case each pair of terms or phrases separated by a NEAR operator must appear within the specified proximity of each other in the document. Using the same table and data as in the block of examples above:

Set operations

The three basic query types described above may be used to query the full-text index for the set of documents that match the specified criteria. Using the FTS query expression language it is possible to perform various set operations on the results of basic queries. There are currently three supported operations:

The AND operator determines the intersection of two sets of documents.

The OR operator calculates the union of two sets of documents.

The NOT operator may be used to compute the relative complement of one set of documents with respect to another.

The AND, OR and NOT binary set operators must be entered using capital letters; otherwise, they are interpreted as basic term queries instead of set operators. Each of the two operands to an operator may be a basic FTS query, or the result of another AND, OR or NOT set operation. Parenthesis may be used to control precedence and grouping.

The AND operator is implicit for adjacent basic queries without any explicit operator. For example, the query expression "implicit operator" is a more succinct version of "implicit AND operator".

Boolean operations as just described correspond to the so-called "enhanced query syntax" of sqlite; this is the version compiled with DBD::SQLite, starting from version 1.31. A former version, called the "standard query syntax", used to support tokens prefixed with '+' or '-' signs (for token inclusion or exclusion); if your application needs to support this old syntax, use DBD::SQLite::FTS3Transitional (published in a separate distribution) for doing the conversion.

TOKENIZERS

Concept

The behaviour of full-text indexes strongly depends on how documents are split into tokens; therefore FTS table declarations can explicitly specify how to perform tokenization:

CREATE ... USING fts4(<columns>, tokenize=<tokenizer>)

where <tokenizer> is a sequence of space-separated words that triggers a specific tokenizer. Tokenizers can be SQLite builtins, written in C code, or Perl tokenizers. Both are as explained below.

SQLite builtin tokenizers

Under the simple tokenizer, a term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters, the "_" character, and all characters with UTF codepoints greater than or equal to 128. All other characters are discarded when splitting a document into terms. They serve only to separate adjacent terms.

All uppercase characters within the ASCII range (UTF codepoints less than 128), are transformed to their lowercase equivalents as part of the tokenization process. Thus, full-text queries are case-insensitive when using the simple tokenizer.

porter

The porter tokenizer uses the same rules to separate the input document into terms, but as well as folding all terms to lower case it uses the Porter Stemming algorithm to reduce related English language words to a common root.

icu

The icu tokenizer uses the ICU library to decide how to identify word characters in different languages; however, this requires SQLite to be compiled with the SQLITE_ENABLE_ICU pre-processor symbol defined. So, to use this tokenizer, you need edit Makefile.PL to add this flag in @CC_DEFINE, and then recompile DBD::SQLite; of course, the prerequisite is to have an ICU library available on your system.

unicode61

The unicode61 tokenizer works very much like "simple" except that it does full unicode case folding according to rules in Unicode Version 6.1 and it recognizes unicode space and punctuation characters and uses those to separate tokens. By contrast, the simple tokenizer only does case folding of ASCII characters and only recognizes ASCII space and punctuation characters as token separators.

By default, "unicode61" also removes all diacritics from Latin script characters. This behaviour can be overridden by adding the tokenizer argument "remove_diacritics=0". For example:

If a more complex tokenizing algorithm is required, for example to implement stemming, discard punctuation, or to recognize compound words, use the perl tokenizer to implement your own logic, as explained below.

Perl tokenizers

Declaring a perl tokenizer

In addition to the builtin SQLite tokenizers, DBD::SQLite implements a perl tokenizer, that can hook to any tokenizing algorithm written in Perl. This is specified as follows :

CREATE ... USING fts4(<columns>, tokenize=perl '<perl_function>')

where <perl_function> is a fully qualified Perl function name (i.e. prefixed by the name of the package in which that function is declared). So for example if the function is my_func in the main program, write

CREATE ... USING fts4(<columns>, tokenize=perl 'main::my_func')

Writing a perl tokenizer by hand

That function should return a code reference that takes a string as single argument, and returns an iterator (another function), which returns a tuple ($term, $len, $start, $end, $index) for each term. Here is a simple example that tokenizes on words according to the current perl locale

sub locale_tokenizer {returnsub{my$string=shift;uselocale;my$regex=qr/\w+/;my$term_index=0;returnsub{# closure$string=~/$regex/gorreturn;# either match, or no more tokenmy($start,$end)=($-[0],$+[0]);my$len=$end-$start;my$term=substr($string,$start,$len);return($term,$len,$start,$end,$term_index++);}};}

There must be three levels of subs, in a kind of "Russian dolls" structure, because :

the external, named sub is called whenever accessing a FTS table with that tokenizer

the inner, anonymous sub is called whenever a new string needs to be tokenized (either for inserting new text into the table, or for analyzing a query).

the innermost, anonymous sub is called repeatedly for retrieving all terms within that string.

Using Search::Tokenizer

Instead of writing tokenizers by hand, you can grab one of those already implemented in the Search::Tokenizer module. For example, if you want ignore differences between accented characters, you can write :

How to spare database space

By default, FTS stores a complete copy of the indexed documents, together with the fulltext index. On a large collection of documents, this can consume quite a lot of disk space. However, FTS has some options for compressing the documents, or even for not storing them at all -- see http://www.sqlite.org/fts3.html#fts4_options.

In particular, the option for contentless FTS tables only stores the fulltext index, without the original document content. This is specified as content="", like in the following example :

CREATE VIRTUAL TABLE t1 USING fts4(content="", a, b)

Data can be inserted into such an FTS4 table using an INSERT statements. However, unlike ordinary FTS4 tables, the user must supply an explicit integer docid value. For example:

Of course your application will need an algorithm for finding the external resource corresponding to any docid stored within SQLite.

When using placeholders, the docid must be explicitly typed to INTEGER, because this is a "hidden column" for which sqlite is not able to automatically infer the proper type. So the following doesn't work :

It is not possible to UPDATE or DELETE a row stored in a contentless FTS4 table. Attempting to do so is an error.

Contentless FTS4 tables also support SELECT statements. However, it is an error to attempt to retrieve the value of any table column other than the docid column. The auxiliary function matchinfo() may be used, but snippet() and offsets() may not, so if such functionality is needed, it has to be directly programmed within the Perl application.