Choose a Language When Creating a Full-Text Index

When creating a full-text index, you need to specify a column-level language for the indexed column. The word breaker and stemmers of the specified language will be used by full-text queries on the column. There are a couple of things to consider when choosing the column language when creating a full-text index. These considerations relate to how your text is tokenized and then indexed by Full-Text Engine.

This section provides an introduction to word breakers and stemmers, and discusses how full-text search uses the LCID of the column-level language.

Introduction to Word Breakers and Stemmers

SQL Server 2008 and later versions include a complete new family of word breakers and stemmers that are significantly better than the those previously available in SQL Server.

Note

The Microsoft Natural Language Group (MS NLG) implemented and supports these new linguistic components.

The new word breakers provide the following benefits:

Robustness

Testing has shown that the new word breakers are robust in high-pressure query environments.

Security

The new word breakers are enabled by default in SQL Server thanks to security improvements in linguistic components. We highly recommend that external components such as word breakers and filters be signed to improve the overall security and robustness of SQL Server. You can configure full-text to verify that these components are signed as follows:

How Full-Text Search Uses the Name of the Column-Level Language

When creating a full-text index, you need to specify a valid language name for each column. If a language name is valid but not returned by the sys.fulltext_languages (Transact-SQL) catalog view, full-text search falls back to the closest available language name of the same language family, if any. Otherwise, full-text search falls back to the Neutral word breaker. This fall-back behavior might affect the recall accuracy. Therefore we strongly recommend that you specify a valid and available language name for each column when creating a full-text index.

Note

The LCID is used against all data types eligible for full-text indexing (such as char or nchar). If you have the sort order of a char, varchar, or text type column set to a language setting different from the language identified by the LCID, the LCID is used anyway during full-text indexing and querying of those columns.

A word breaker tokenizes the text being indexed on word boundaries, which are language specific. Therefore, word-breaking behavior differs among different languages. If you use one language, x, to index a number of languages {x, y, and ,z}, some of the behavior might cause unexpected results. For example, a dash (-) or a comma (,) might be a word-break element that will be thrown away in one language but not in another. Also rarely unexpected stemming behavior might occur because a given word might stem differently in different language. For example, in the English language, word boundaries are typically white space or some form of punctuation. In other languages, such as German, words or characters may be combined together. Therefore, the column-level language that you choose should represent the language that you expect will be stored in rows of that column.

Western Languages

For the Western family of languages, if you are unsure which languages will be stored in a column or you expect more than one to be stored, a general workaround is to use the word breaker for the most complex language that might be stored in the column. For instance, you might expect to store English, Spanish and German content in a single column. These three Western languages possess very similar word-breaking patterns, with the German patterns being the most complex. Therefore, a good choice is this case would be to use the German word breaker, which should be able to process English and Spanish text correctly. In contrast, the English word breaker might not process German text perfectly because of the compound words of German.

Note that using the word breaker of the most complex language in a language family does not guarantee perfect indexing of every language in the family. Corner cases might exist in which the most complex word breaker cannot correctly handle text written in another language.

Non Western Languages

For non Western languages (such as Chinese, Japanese, Hindi, and so forth) the above workaround does not necessarily work, for linguistic reasons. For non Western languages, consider one of the following workarounds:

For languages from different families

If a column might contain dramatically different languages, for example, Spanish and Japanese, consider storing the content of different languages in separate columns. This would allow you to use the language-specific word breaker for each column. If you choose this solution and you don't know the query language at query time, you might need to issue the query against both columns to ensure that the query finds the right row or document.

For Binary content (such as Microsoft Word documents)

When the indexed content is of binary type, the full-text search filter that processes the textual content before sending it to the word breaker might honor specific language tags existing within the binary file. In this case, at indexing time, the filter will emit the right LCID for a document or section of a document. The Full-Text Engine will then call the word breaker for the language with that LCID. However, after indexing multi language content, we recommend that you verify that the content was correctly indexed.

For plain text content

When your content is plain text, you can convert it to the xml data type and add language tags that indicate the language corresponding to each specific document or document section. For this to work, however, you need to know the language before full-text indexing.

An additional consideration when choosing your column-level language is stemming. Stemming in full-text queries is the process of searching for all stemmed (inflectional) forms of a word in a particular language. When you use a generic word breaker to process several languages, the stemming process works only for the language specified for the column, not for other languages in the column. For example, German stemmers do not work for English or Spanish (and so forth). This might affect your recall depending of which language you choose at query time.

Another consideration in language choice is related to how the data is represented. For data that is not stored in varbinary(max) column, no special filtering is performed. Rather, the text is generally passed through the word breaking component as-is.

Also, word breakers are designed mainly to process written text. So, if you have any type of markup (such as HTML) on your text, you may not get great linguistic accuracy during indexing and search. In that case, you have two choices—the preferred method is simply to store the text data in varbinary(max) column, and to indicate its document type so it may be filtered. If this is not an option, you may consider using the neutral word breaker and, if possible, adding markup data (such as 'br' in HTML) to your noise word lists.

Note

Language based stemming does not come into play when you specify the neutral language.

By default, in , full-text search will parse the query terms using the language specified for each column that is included in the full-text clause. To override this behavior, specify a nondefault language at query time. For supported languages whose resources are installed, the LANGUAGE language_term clause of a CONTAINS, CONTAINSTABLE, FREETEXT, or FREETEXTTABLE query can be used to specify the language used for word breaking, stemming, thesaurus, and stopword processing of the query terms.