SQL Server Best Practices Article

Summary: The observations in this paper are based on tests run in the SQL Server Customer Lab for a customer who needed to scale up full-text search to a much greater potential volume. The paper describes the customer scenario, provides an overview of SQL Server 2005 full-text concepts that bear on the results, and offers lessons learned and recommendations for using full-text queries on large catalogs.

Introduction

The observations and recommendations in this paper are based on tests run in the Microsoft SQL Server Customer Lab for a customer who needed to scale up full-text search to a much greater potential volume. The next section describes the customer scenario, which is followed by a brief overview of Microsoft® SQL Server™ 2005 full-text concepts that bear on the results, and then lessons learned and recommendations.

Customer Scenario

The customer has a SQL Server 2005 full-text application that required significant scaling to much larger volumes of data. Some of the full-text queries in the application are very broad, returning large sets of rows, and these queries were not performing as well as desired. The goals of the engagement were to determine better ways to optimize the customer's full-text queries and to scale the application for a much greater load.

The customer's application queries a single table that contains a text column that is full-text indexed. Each text value in the table row corresponds to a scanned page of an external document. The following figure shows the relationship.

Each external paper document is scanned, and the text is extracted per page by an OCR process. The resulting text from each document page is placed in a new row. Each ID is associated with an appropriate document ID in a separate association table. The text column in the table is indexed by using the SQL Server 2005 Full-Text Search service. In the database tested in the Customer Lab, the number of rows in the database table, representing the number of externally scanned document pages, was over 13 million. The resulting database was about 420 gigabytes (GB) in size, with a full-text catalog that was approximately 10 GB. Ranking and fuzzy matches were not allowed: the queries had to return each and every page ID that contained the target words.

The customer required that their application scale up to potentially 20 times the tested database size. The goal of the Customer Lab engagement was to test various methods for scaling up or out to achieve that size while maintaining good full-text query performance.

About SQL Server 2005 Full-Text Search

In Microsoft SQL Server 2005, Full-Text Search provides a faster method of querying large amounts of unstructured textual data than the Transact-SQL LIKE predicate does. Full-text queries run faster because the words and phrases of a particular language are stored in a specially constructed catalog that can be searched by individual word, rather than by scanning each individual document. The queries return a list of the IDs of the documents that contain the searched word or phrase.

Full-Text Search consists of a single service for each SQL Server instance called MSFTESQL, the Microsoft Full-Text Engine for SQL Server. MSFTESQL is built on the Microsoft Search service (MSSearch) and is integrated into each SQL Server 2005 instance as a distinct service. SQL Server interfaces with the MSFTESQL service when reading and writing to full-text catalogs. To use full-text indexing on a given SQL Server 2005 instance, the MSFTESQL service for that instance must also be running.

In addition, during the full-text indexing process, an executable called the Microsoft Full-Text Engine Filter Daemon (MSFTEFD) is used to access and filter the textual data from tables that are being indexed. The Filter Daemon applies word breakers to isolate distinct indexable words within the text, and applies filters that allow document types other than plain text to be indexed.

Full-text indexes are stored on disk in full-text catalog files separate from SQL Server database data files. The full-text catalog consists of a number of files in a format that the MSFTESQL engine can access for searching. You can allow full-text indexes to update automatically when new documents are added to your tables, or you can control when and how the indexes are updated.
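For example, the two update modes can be switched with ALTER FULLTEXT INDEX; the table name dbo.ScannedPages is illustrative, not from the customer schema:

-- Let the full-text index update automatically as rows change:
ALTER FULLTEXT INDEX ON dbo.ScannedPages SET CHANGE_TRACKING AUTO

-- Or track changes but apply them only on demand:
ALTER FULLTEXT INDEX ON dbo.ScannedPages SET CHANGE_TRACKING MANUAL
ALTER FULLTEXT INDEX ON dbo.ScannedPages START UPDATE POPULATION

Manual population lets you schedule index maintenance for off-peak hours on very large tables.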

You can create a full-text index on any text column, including all the char and varchar variations, as well as the text and ntext data types. In addition, you can index columns with the xml, image, and varbinary data types. These data types must be associated with a particular filter. SQL Server 2005 has built-in filters for numerous document types, and will automatically use the xml filter on the xml data type. (For more information about applying full-text indexing to the xml data type, see XML Best Practices in SQL Server 2005 Books Online.)
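As an illustrative sketch (the object names are hypothetical): when indexing a varbinary column, you name a companion column that stores each document's file extension so that the engine can select the appropriate filter:

CREATE FULLTEXT CATALOG DocsCatalog

CREATE FULLTEXT INDEX ON dbo.Docs (DocContent TYPE COLUMN DocExtension)
    KEY INDEX PK_Docs
    ON DocsCatalog

The KEY INDEX clause names a unique index on the table; its values become the full-text keys returned by full-text queries.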

Unlike table-based indexes, which use a B-tree structure, the MSFTESQL service builds full-text indexes as an inverted table of full-text phrases, each associated with a tokenized list of internal document IDs that map back to the full-text key of the full-text index. When you query the full-text catalog for a particular string, the query returns a vertical table of matching keys that you can then use to join with the original table. For example, suppose you create a full-text index on the DocumentSummary column in the Production.Document table of the Adventure Works database. You could query it for a particular word such as 'detailed', in the following fashion:

SELECT D.DocumentID, D.DocumentSummary
FROM Production.Document AS D
WHERE CONTAINS(DocumentSummary, 'Detailed')

The CONTAINS predicate implicitly joins keys from the full-text index with the table's primary key. You can work with the returned keys directly and make the join explicit by using the CONTAINSTABLE function:
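For example, the previous query can be restated with CONTAINSTABLE, which returns a KEY column holding the full-text key values (assuming DocumentID is the full-text key of the table):

SELECT D.DocumentID, D.DocumentSummary
FROM Production.Document AS D
INNER JOIN CONTAINSTABLE(Production.Document, DocumentSummary, 'Detailed') AS K
    ON D.DocumentID = K.[KEY]

CONTAINSTABLE also returns a RANK column, which you can use for ordering results when ranking is required.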

Because the full-text catalog stores its data in an inverted and tokenized format, the full-text catalog may not grow at the same rate that a table would. When you add a new row to a table with new text data, many of the indexed words in the new row are already present in the catalog, and the MSFTESQL service only needs to add IDs to the list of those existing words. Only new words or phrases would cause additional rows to be added to the catalog.

Because SQL Server uses the MSFTESQL service to query full-text catalogs, for large catalogs make sure that you provision free memory on your server for file caching by the MSFTESQL service. You cannot configure MSFTESQL's use of memory from within SQL Server 2005. (For more information, see Full-text Configuration Recommendations later in this paper.)

Full-text query performance is roughly proportional to the number of matching keys returned from the catalog. On a large catalog, querying for words that appear only rarely in the indexed column returns fewer keys than querying for words that appear in many rows, and the queries that return fewer matching keys generally perform better.

Because the MSFTESQL service operates outside the database engine, SQL Server 2005 cannot extend predicates on SQL queries into the full-text search. As a result, even though your query may have additional conditions that significantly reduce the final number of rows returned by the query, the MSFTESQL service still returns all the matching keys for the target search.

For more information

To learn more about Microsoft SQL Server 2005 full-text capabilities, see the "Full-Text Search" topics in SQL Server Books Online. Also, SQL Server 2005 comes with a sample full-text search application. For information about general best practices using Full-Text Search in SQL Server 2005, see Managing Internal Documents by Using Microsoft SQL Server 2005 on Microsoft TechNet.

Recommendations

The following recommendations are the result of our testing and design sessions in the Microsoft SQL Server Customer Lab, working with a very large full-text catalog. These recommendations address optimizing full-text queries with broad searches over a large catalog, exploring architectural strategies for scaling to a much greater catalog size, and finding optimal full-text configurations.

Optimizing Full-Text Queries

The application includes one key query where user input of one or more words returns results that identify the pages and documents containing the searched values. When the full-text queries returned a relatively small number of rows (fewer than 10,000), performance was acceptable. But when a full-text query returned over 100,000 rows, due to very broad word searches, query performance was not adequate. We were able to significantly improve the performance of the full-text searches, and we also suggested some architectural changes that would yield further gains.

When testing, we found an important query that used the CONTAINS predicate in a WHERE clause and joined with an association table to return the page IDs and their associated documents that satisfied the search. In a revised version of the query, the resulting page IDs were matched against the association table in a separate step, removing one join from the query. That also allowed us to use the CONTAINSTABLE function instead of the CONTAINS predicate.

The CONTAINSTABLE function performs better on large queries than the CONTAINS predicate. Because the CONTAINS predicate has an embedded search condition, it is translated internally into a join between the base table and a CONTAINSTABLE query. When you only need to return the matching full-text key columns, the CONTAINSTABLE function is more efficient because it has one less join. (Also, if you require ranking, you must use the CONTAINSTABLE function. In this application we did not use ranking, but nevertheless CONTAINSTABLE performed much better than CONTAINS.)
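When only the matching keys are needed, the CONTAINSTABLE call can stand alone, with no join to the base table at all. A minimal sketch, assuming a hypothetical Pages table whose PageText column is full-text indexed with the page ID as the full-text key:

SELECT K.[KEY] AS PageID
FROM CONTAINSTABLE(Pages, PageText, 'searchword') AS K

The returned page IDs can then be matched against the association table in a separate step, as described above.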

The combination of these changes produced significant benefits for all searched words, even those that returned fewer than 100,000 keys.

Consider embedding filter conditions as keywords in the indexed text

Because the MSFTESQL service is separate from SQL Server, filters placed on queries do not apply to the searched text. In this customer scenario, one of the query filters involved an application metadata ID that was associated with a number of document pages. After reviewing the customer's architecture, we recommended that one possible method of improving performance in the future would be to place certain filter conditions into the indexed text. These conditions might include an ID that would tie a particular page back to the application metadata. That way, the full-text query search could include the application metadata ID as well as the user-submitted search words.

For example, suppose that scanned document pages are associated with some entity called a "batch," and it is commonly required to filter the full-text queries based on one or a set of batch IDs. (For example, a particular text column page is associated with a batch ID of 10009.) Currently queries must extract all the matching IDs from the full-text search, and then filter based on a set of batch IDs. The number of returned rows will normally be much higher than the final number of rows that are returned after filtering based on batch IDs. Now assume that a batch ID (such as the string "BATCH10009" in our example) is embedded in the database table's text column for each row. Full-text queries can now pass the batch ID, along with the other search words, to the full-text engine. The full-text service will return fewer rows and therefore the full-text query performance will improve.
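Continuing the example (the Pages table and column names are hypothetical), the embedded batch keyword simply becomes another term in the full-text search condition:

SELECT K.[KEY] AS PageID
FROM CONTAINSTABLE(Pages, PageText, '"BATCH10009" AND "searchword"') AS K

Because the batch filter is now applied inside the full-text engine, only pages from the desired batch are returned as keys.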

Combine text data to reduce the number of keys returned

Another option to achieve a result similar to that described in the previous point would be to merge the searched text. In this case, instead of storing single scanned pages, the text column could contain the text of all pages of the entire scanned document. Although this would increase the size of the text column and the number of words indexed in each row, it would greatly reduce the number of rows returned by the full-text search, because the keys returned would represent documents, not pages. As previously mentioned, for broad searches on large catalogs, full-text query performance is roughly proportional to the number of keys returned. By reducing the number of keys that are returned, full-text query performance would improve. The improvement is proportional to the reduction in the number of rows returned by the full-text query.

Avoid parameterizing queries on the broadest searches

We found that the query plans for queries that returned a large number of keys were different from the plans for queries that returned a smaller number of keys. The customer's application currently submits its full-text queries in an ad hoc fashion because the search screen of the application has a great deal of variability. The optimal query plans for the broader searches differed from the optimal plans for words that returned smaller numbers of keys: the narrower queries performed best with nested loop joins, whereas the broader queries performed best with hash joins, because the number of keys was much higher. When we parameterized the queries, one group of queries ended up using a suboptimal plan.

Therefore, we recommended that the customer not parameterize the full-text queries, but continue to submit ad hoc queries for the full-text searches. This allows SQL Server to cache multiple plans covering both narrow and broad word searches in the procedure cache, and the optimizer can choose the appropriate plan.

Consider a Scale-Out Architecture

In this customer scenario, even with improved full-text query performance, there was no guarantee that high-end servers would be capable of scaling up to much larger full-text catalogs. After analysis, it appeared that scaling out the application would be a better approach than attempting to scale up. The full-text query improvements listed in the previous section could be implemented in the short to medium term. For the longer term, we made the following recommendations.

One way to scale out to much larger catalogs would be to split the full-text indexed table into several tables spread across multiple database servers. That way, the full-text queries could use linked server queries to distribute the query search across multiple databases, each returning a subset of the desired rows, and the query could use a UNION ALL to accumulate the results.
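A sketch of the distributed form, assuming hypothetical names (a local Pages_1 table and a Pages_2 table on a linked server named SearchServer2); OPENQUERY runs the full-text search on the remote server so that only matching keys travel across the link:

SELECT K.[KEY] AS PageID
FROM CONTAINSTABLE(Pages_1, PageText, 'searchword') AS K
UNION ALL
SELECT R.PageID
FROM OPENQUERY(SearchServer2,
    'SELECT K.[KEY] AS PageID
     FROM CONTAINSTABLE(SearchDB.dbo.Pages_2, PageText, ''searchword'') AS K') AS R

This assumes the linked server SearchServer2 has already been defined, and that each server hosts its own subset of the full-text indexed rows.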

In one of our tests, we took a large catalog, duplicated it to four database servers, and then submitted a large number of queries to the system, balancing the load across all the servers. We saw some overall full-text query performance improvement. The important point is that spreading the queries evenly across the database servers was essential, so that no individual database server carried a larger load than another.

However, when we split the large catalog into four smaller catalogs on a single database server, and then used a UNION ALL to join the results, the queries did not perform as well as the original query on the larger single catalog. Using smaller catalogs alone is not enough; they must be spread across multiple database servers.

When spread across multiple servers, queries used linked server connections and then combined the results by using UNION ALL. The queries were distributed randomly across the servers to achieve a degree of load balancing, because remote queries using a UNION ALL cannot be parallelized.

Isolate application users into related groups

Another architectural change we considered is whether the application would permit separating application users into natural groupings and giving each group its own database server. This works when a given group does not require searching another group's text at the same time they are searching their own. This approach effectively reduces the catalog size on each server, allowing the application to scale out as needed. Also, it provides natural security barriers when users from one group should not be allowed to query data belonging to another group. Based on further research, this turned out to be the most promising strategy.

Scaling out is not necessarily a solution that fits every application, however. Its chief advantage is hardware cost: a scale-out configuration can use less expensive servers, whereas scaling up to equivalent capacity generally requires more expensive high-end hardware.

Full-Text Configuration Recommendations

Based on consultation with the SQL Server Full-Text developer team, we configured our full-text options as described in this section.

SQL Server 2005 Service Pack 2 (SP2) contains numerous small enhancements to full-text search, and should be used whenever possible. In particular, SP2 contains a new stored procedure for tuning the query timeout of the Full-Text Search service.

During some of our tests, we saw some very broad full-text queries time out in the MSFTESQL service with the following message:

"Informational: MS Search stop limit reached. The full-text query may have returned fewer rows than it should."

The message is informational only, but the number of rows returned was smaller than it should have been. Because no SQL Server error is raised, some full-text queries could return a smaller set of rows without the application noticing.

To configure a longer timeout, we used the 'remote query timeout' server option on the SQL Server 2005 database server:
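For example (the value of 600 seconds is illustrative; a value of 0 disables the timeout entirely):

EXEC sp_configure 'remote query timeout', 600
RECONFIGURE

Choose a value appropriate for your longest-running full-text queries.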

The MSFTESQL service will use the lower of the two values. When we configured the system to use a higher timeout value, the informational message disappeared and full-text queries resumed returning the correct number of rows.

Configure sufficient memory for full-text search

At the Microsoft Windows Server® level, SQL Server may compete for available memory with the MSFTESQL service. It is important to make sure that the SQL Server data cache does not prevent the Windows® system cache from caching the full-text catalog in memory, and vice versa.

First, make sure that Windows favors the system cache over the application working sets. Windows Server 2003 is normally set to favor the system cache, but you should verify this: right-click My Computer and select Properties, click the Advanced tab, click Settings under Performance, and then click the Advanced tab in the Performance Options dialog box. Under Memory usage, make sure that the System cache option is selected.

After you make sure that the Windows system cache is favored, ensure that sufficient memory is available for the MSFTESQL service to cache the full-text catalog. The Windows Server 2003 file caching can be only indirectly detected, but in our tests we found that large queries performed better when the maximum memory usage for SQL Server was set lower than the available RAM. We set SQL Server maximum cache usage to be the available RAM less the size of the full-text catalog, so that the entire catalog could be placed in the Windows file cache. We did not focus on estimating the exact size or behavior of the file caching, but we did notice that disk I/O to the drive and to the full-text catalog file was very low during our tests when we made sure that the server had enough free memory to cache the full-text catalog.
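As an illustrative sketch (the numbers are assumptions, not measurements from the engagement): on a server with 64 GB of RAM and a full-text catalog of roughly 10 GB, you might cap the SQL Server buffer pool at about 50 GB so that the catalog and the operating system have headroom:

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'max server memory (MB)', 51200
RECONFIGURE

Monitor the system afterward; if disk I/O to the catalog remains high, lower the cap further to leave more room for the Windows file cache.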

Also, ensure that the SQL Server service has the Windows policy, Lock Pages in Memory, enabled. This ensures that SQL Server memory is not swapped out if the file cache grows.

Ensure optimum I/O performance for full-text catalogs

Monitor the performance and take steps to address I/O performance if an issue arises. For large catalogs, place the full-text catalog on its own dedicated volume. This helps ensure that full-text catalog I/O does not compete with other database I/O. In addition, it makes it easier to measure I/O activity to the catalog.

This is a general recommendation, and depends on how much I/O you expect and observe from your full-text catalog searches. Generally, we found that for our large full-text queries, the full-text activity was CPU-intensive and memory-intensive. Because we ensured that there was enough memory for the catalog to be cached by the MSFTESQL service, we did not observe much disk I/O for the full-text catalog. If we had placed the catalog on the database data drives, the MSFTESQL service would have competed for I/O with the database during the initial building of the system cache.

Other configurations

For very large full-text catalogs, configure the full-text service so that building the full-text indexes meets your Service Level Agreement (SLA), which generally requires fast index population. To do this, you can configure the maximum full-text crawl range as well as the full-text service resource usage for better performance.

As a general rule, configure the maximum full-text crawl range to be twice the number of physical processors on the machine. For example, to do this on a database server with eight physical CPU cores, you can execute the following:

EXEC sp_configure 'max full-text crawl range', 16
RECONFIGURE

In addition, it can be helpful to configure the resource usage to the most aggressive setting of five, as in the following:

EXEC sp_fulltext_service 'resource_usage', 5

This will also help improve the full-text crawl performance.

We did experiment with running the SQL Server service and the MSFTESQL service on different processors of the same physical server. This required setting the affinity mask for both SQL Server and MSFTESQL. However, our tests did not show any full-text performance gain over letting both services use the same set of processors.

Conclusion

Tuning full-text queries on very large catalogs requires some experimentation. We observed that full-text query performance is roughly proportional to the number of rows returned for a given set of search words. We improved performance by using the CONTAINSTABLE function rather than the CONTAINS predicate, and by removing unnecessary joins. We also found promising architectural improvements by focusing on scaling out the full-text catalog to multiple database servers.