Menu

InnoDB Full-Text : N-gram Parser

The default InnoDB full-text parser is ideal for latin based languages where whitespace is the token or word separator, but for languages like Chinese, Japanese, and Korean (CJK)—where there is no fixed separators for individual words, and each word can be compromised of multiple characters—we need a different way to handle the word tokens. I’m now very happy to say that in MySQL 5.7.6 we’ve made use of the new pluggable full-text parser support in order to provide you with an n-gram parser that can be used with CJK!

What Exactly Is N-gram?

In full-text search, an n-gram is a contiguous sequence of n characters from a given sequence of text. For example, with n-gram we tokenize the text “abcd” as follows:

N-gram Example

1

2

3

4

N=1:'a','b','c','d';

N=2:'ab','bc','cd';

N=3:'abc','bcd';

N=4:'abcd';

How Can You Use the N-gram Parser with InnoDB?

The new n-gram parser is loaded and enabled by default. To use it, you would simply specify the WITH PARSER ngram clause in your relevant DDL statements. For example, any of the following statements are valid in MySQL 5.7.6 and later:

We also have a new global server variable called ngram_token_size (a token being roughly equivalent to a word, made up of n characters). The default value is 2 (bigram), and the valid range is from 1 to 10. The next natural question then becomes: what token size value should I choose? 2, or bigram, is recommended for general use with CJK, but you can choose any valid value based on this simple rule: set the token size to the largest token you expect to search for. If we only ever want to search for single characters, then we should set ngram_token_size to 1. The smaller the ngram_token_size the smaller the index, and the faster the full-text searches using that index will generally be. The downside, however, is that you’re then limiting the token size and thus what you can search for. For example, ‘Happy Birthday’ in English translates to ‘生日高興’ in traditional Chinese, where ‘Happy’ == ‘高興’ and ‘Birthday’ == ‘生日’. As you can see, each respective word or token is comprised of two individual characters, so in order to search for each token we need to have ngram_token_size set to 2 (the default) or higher.

More on N-gram Tokenization

The n-gram parser differs from the default full-text parser in the following ways:

Stopword handling — we handle stopwords a little differently. Usually if a tokenized word itself (an exact match) is in the stopword table, then it will not be added to the full-text index. With the n-gram parser, however, we check whether the tokenized word contains any word in the stop word table and if so then we do not index that token. The reason for this difference in behavior is that we have so many frequent and meaningless characters, words, and punctuation with CJK. If we use the contains method to match stopwords, then we can eliminate more useless tokens.

White space — white space is always a hard-coded stopword. As an example, this means that ‘my sql’ is always tokenized to ‘my’, ‘y ‘, ‘ s’, ‘sq’, and ‘ql’, and ‘y ‘ and ‘ s’ will not be indexed.

We can query INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE and INFORMATION_SCHEMA.INNODB_FT_TABLE_TABLE to see exactly what tokens have been indexed within a specific full-text index. This is a very useful tool for debugging purposes. For example, if a word doesn’t show up in a full-text search results as expected, then the word may simply not be indexed for one reason or another (stop words, token sizes, etc.) and you can query these tables to be find out for sure. Here’s a simple example:

A Simple Debugging Example

MySQL

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

mysql>INSERTINTOarticles(title)VALUES('my sql');

QueryOK,1rowaffected(0.03sec)

mysql>SET GLOBALinnodb_ft_aux_table="test/articles";

QueryOK,0rowsaffected(0.00sec)

mysql>SELECT*FROMINFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;

+------+--------------+-------------+-----------+--------+----------+

|WORD|FIRST_DOC_ID|LAST_DOC_ID|DOC_COUNT|DOC_ID|POSITION|

+------+--------------+-------------+-----------+--------+----------+

|my|1|1|1|1|0|

|ql|1|1|1|1|4|

|sq|1|1|1|1|3|

+------+--------------+-------------+-----------+--------+----------+

3rowsinset(0.00sec)

More on Search Result Processing with N-gram

Text Searches

In NATURAL LANGUAGE MODE, the text searched for is converted to a union of n-gram values. For example, 'sql' is converted to 'sq ql' (with a default token size of 2 or bigram). Here’s an example:

If the prefix length is equal to or greater than ngram_token_size, then the wildcard search is converted to a phrase search and the wildcard is ignored. For example, 'sq*' is converted to '"sq"', and 'sql*' is equivalent to '"sq ql"':

If you would like to learn more general details about InnoDB full-text search, please see the InnoDB Full-Text Index section of the user manual and Jimmy’s excellent Dr. Dobb’s article. For more details about the N-gram parser specifically, please see the N-gram parser section in the user manual.

We hope that you find this new feature useful! We’re very happy to have improved CJK support throughout MySQL 5.7, and this is a big part of that. If you have any questions please feel free to post them here on the blog post or in a support ticket. If you feel that you have encountered any related bugs, please let us know via a comment here, a bug report, or a support ticket.