Testing

Full-Text Search with InnoDB

MySQL's latest InnoDB engine can now do extensive, high-performance, full text search. A quick primer delivers all the goodies

DML (Insert, Update, and Delete)

As mentioned in the last section, the inserting string processing (tokenization) is performed only at commit time. This behavior is consistent with FTS in most transactional database systems, since the DMLs on tables with full-text search indexes are costly operations. Typically, you synchronize the results of DML operations to the FTS index periodically rather than in real time.

A typical example is the Oracle Text component of Oracle Database, which synchronizes table data with the search index periodically or manually. Starting in Oracle Database 11g, it now allows users to specify when data updates are reflected in the full-text index  manually, on commit, or at a regular interval.

For InnoDB Full-Text Search, the FTS index is updated at transaction commit time, when all the document tokenization and inserting into the FTS index (or associated cache) happens. All other DML operations still follow their usual rules about commit/rollback and isolation levels.

Insert. The inserted document is tokenized at commit time and inserted into the FTS Index Cache. This cache has a configurable size (innodb_ft_cache_size) with a default of 32 MB. Once this cache is full, it is synchronized to the on-disk tables that represent the index. During normal server shutdown, any content in the FTS Index Cache is synchronized to the FULLTEXT index on disk. If there is a server crash and the content of FTS Index Cache was not saved to disk, after the server reboot, the data is made consistent when you first query or insert into the FULLTEXT index. Any documents missing from the FULLTEXT index are read from the original table, retokenized, and added to the FTS Index Cache.

Delete. When you delete rows from a table containing an InnoDB FULLTEXT index, InnoDB does not delete the corresponding word entries in the FTS auxiliary tables. At commit time, InnoDB records the deleted Document IDs in another auxiliary table named DELETED. For each query, InnoDB consults this DELETED table to filter out any deleted documents. This design makes the delete operation simple and fast, without the need to update thousands of word entries for each deleted document. Since the word entries are not removed from the FULLTEXT index automatically, you need to perform index optimization periodically (as described in the next section) to keep the index size reasonable.

Update. For updates affecting any FTS-indexed columns, the updates are performed as a combination of Delete + Insert operations. In-place update happens only if the update affects none of the columns referenced by the FULLTEXT index.

Index Optimization

As just discussed, if there is substantial update and delete activity on an InnoDB FULLTEXT index, the index could become bloated, as InnoDB logs the deleted Document IDs in the DELETED auxiliary table. Over time, the FULLTEXT index could grow bigger despite rows being removed from the original table. To resolve this, you can optimize the index. This operation does two things: it removes the deleted Document ID from the word's Document ID/Position pair list (ilist), and it consolidates multiple entries for the same word to one entry if possible by consolidating their Document ID/Position pair list (ilist).

Currently, MySQL runs the optimization operation only as part of the OPTIMIZE TABLE command, when the innodb_optimize_fulltext_only configuration variable is enabled:

As the table could be large, and optimization could take a significant amount of time, you typically should do the optimization in stages. The configuration variable innodb_ft_num_word_optimize specifies how many words to optimize for each OPTIMIZE TABLE command. It defaults to 2000 words. When the next OPTIMIZE TABLE command is issued, the server continues the process from where it left off.

Stopword Handling

Stopwords are common or trivial words, such as "the" and "an," which are omitted from the FULLTEXT index. They are potentially different for each language or application. InnoDB FTS provides two sources of stopwords:

MySQL server predefined stopwords. If no user stopword list is defined, this default list is used. You can view this list by querying the table INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD:

mysql> select * from INNODB_FT_DEFAULT_STOPWORD;

User-defined stopwords.You can define your own stopwords by creating a table with a single column named value, with datatype varchar, and pointing the global variable innodb_ft_server_stopword_table to this table. MySQL loads stopwords from this user table, rather than the default stopword list, when creating the FTS index. For example:

Query Performance

Earlier in this article, we demonstrated ways to make creating an InnoDB FULLTEXT index faster. Preliminary results on a 2.7 GB Wikipedia data set are shown in the following table. The queries used include various combinations of include, exclude, wildcard, and proximity operators, even an intentional misspelling ("preassumtions"). They are identified by the number in the leftmost column:

* FTS Search processing time is the actual time spent by InnoDB full-text index scan.

Limitations

As this version of FTS is fairly new, there are some aspects that have limitations that Oracle will remove in succedding versions. These include:

Ranking: Currently, the FTS uses a very simple ranking mechanism (term frequency, inverse document frequency) for document result ranking. The more the word appears in a document, and the less frequent the word appears in overall documents, the higher the selected document is ranked.

Stemming, which matches alternative forms such as singular, plural, past tense, and other forms derived from the same root word is not supported.

Ideographic languages, Chinese, Japanese and Korean (referred to as CJK), which do not have word delimiters, do not yet have support for N-GRAM parsing. (MyISAM FTS has a similar limitation.)

Single character set: Although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation. (MyISAM FTS has a similar limitation.)

However, the InnoDB full-text search gives InnoDB an important capability in using MySQL to handle text documents in transaction-intensive database applications. Subsequent releases will continue to increase its performance, usability, and feature set.

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!