Inside Search Arguments

One of my favorite SQL Server topics to teach about is the query optimizer, whose job is to determine which query plans SQL Server will use to execute the queries your applications submit for processing. A main optimizer task is to determine which indexes, if any, SQL Server will use to find the relevant data in each table involved in a query. In my July 2001 column, "Are You in Tune?" (InstantDoc ID 21038), I introduced the concept of a SARG, which is my term for a particular form of search argument condition in a query's WHERE clause. In most cases, a column must appear in a SARG before the query optimizer will consider using an index on that column.

A SARG limits a search because it specifies an exact match, a range of values, or a conjunction of two or more items joined by AND. A SARG contains a constant expression (or a variable that resolves to a constant) that SQL Server compares to a column by using an operator. SARGs take the form column inclusive_operator or inclusive_operator column. The column name appears on one side of the operator, and the value appears on the other side. The following operators are considered inclusive operators: =, >, , <=, BETWEEN, and sometimes LIKE (if the wildcard isn't at the beginning of the string). The most important condition is that the column must appear alone on one side of the operator. If the column doesn't appear by itself, the expression isn't a SARG, and SQL Server won't even consider using an index on the column. The sample code in this article illustrates when and how SARGs can be useful.

To run this article's examples, run the script that Listing 1 shows to create a copy of the Northwind database's Orders table. This copy, NewOrders, has an integer IDENTITY column added, and its OrderID column contains character strings. To keep the OrderID values unique even though the same data appears five times, the code concatenates a different digit onto the OrderID string each time it copies the Orders table.

In many cases, you can convert a non-SARG into a SARG so that SQL Server can use indexes to process a query. Consider this query against the newly built NewOrders table:

SELECT * FROM NewOrders
WHERE substring(OrderID, 1, 5) = '10248'

The plan for this query shows that SQL Server will perform a table scan. If you use SET STATISTICS IO ON before executing the query, you should see 106 logical reads—one for each page in the table. To change the query so that it contains a SARG but remains logically equivalent, you'd run the following:

SELECT * FROM NewOrders
WHERE OrderID LIKE '10248%'

The plan for the modified query shows an index seek. STATISTICS IO reveals that only seven logical page reads returned the same five rows of data. You get a better plan and better performance because the SARG directs the optimizer to evaluate the index on OrderID, and the optimizer determines that the index is useful. In cases like this example, you can easily rewrite your query so that it contains a SARG. However, in some queries, the lack of a SARG isn't obvious; these queries can be hard to detect and correct. I'll discuss some examples of such queries later in this article.

Change Is Good...Sometimes

Sometimes changes in the way SQL Server processes your queries internally can mean expressions that seemed to contain SARGs no longer contain any. In my columns "Inside Optimization" (October 2003, InstantDoc ID 39822) and "Inside Optimizer Enhancements" (November 2003, InstantDoc ID 39906), I discussed query optimizer changes in SQL Server 2000, most of which were improvements. However, reader Tim Kehoe told me about one optimizer change that hurt his application's performance. Before SQL Server 2000, in a comparison between a column and a value of a different data type, SQL Server always implicitly converted the value (a constant or expression) to the column's data type. However, the rules for implicit conversion changed in SQL Server 2000. Under the topic "Data Type Precedence," SQL Server Books Online (BOL) states:

"When two expressions of different data types are combined by an operator, the data type precedence rules specify which data type is converted to the other. The data type with the lower precedence is converted to the data type with the higher precedence. If the conversion is not a supported implicit conversion, an error is returned. When both operand expressions have the same data type, the result of the operation has that data type."

The BOL topic goes on to list all the data types in order of precedence. The list shows that integer types have a higher precedence than character types. (For more information about data types and data-type precedence, see Wayne Snyder's May 2001 article, "Problem Children," InstantDoc ID 20014.) Tim had queries like this:

SELECT * FROM NewOrders
WHERE OrderID = 102533

Although OrderID is a character column by definition, this query compares it to a constant, which SQL Server assumes to be an integer. Because SQL Server needs both values to be the same data type, it will convert the OrderID column to an integer type, as if you'd written this query:

SELECT * FROM NewOrders
WHERE CONVERT(int, OrderID) = 102533

In fact, the query plan for the first query includes the conversion, so this query no longer contains a SARG. The solution in this case is to not let SQL Server assume that the data type of the constant is integer but instead to force SQL Server to recognize the constant as a character string. In this case, you simply put quotes around the value. When you run the following two queries, you get significantly different plans and performance:

The first query above uses an index scan and requires 15 logical reads, whereas the second uses an index seek to access the data directly and needs only 3 logical reads to find the data.

The change in the optimizer to have it convert a lower-precedence data type to a higher-precedence type wasn't an improvement for Tim. To help avoid similar problems, always be aware of what data types you're working with. SQL Server 7.0 gave good performance for this query and converted the 102533 into a character string, but assuming that the query would always work turned out to be a mistake. To make totally clear that this value is a character string, put quotes around the value in your code every time. If you leave the data type open to (mis)interpretation, you can get inconsistent results.

For example, if instead of using the equality operator, you use an inequality, the meaning of the following two queries is completely different:

In the first query, SQL Server treats 10249 as an integer and returns all rows in which the OrderID (converted to an integer) is numerically less than 10249. The only such OrderID is 10248. However, when you explicitly represent 10249 as a character string by enclosing it in quotes, SQL Server looks for values of OrderID that are less than 10249 in ASCII sequence. SQL Server returns five values—10248, 102481, 102482, 102483, and 102484—as a match for that comparison.

I can't say which result set is correct for your application, but if you let SQL Server convert the data the way it's designed to, be aware that the results you get might not be what you want, because what you thought was a valid SARG is no longer valid. I recommend that you always represent your constants and expressions as the same data type as the column you're comparing them to. Sometimes, SQL Server might not assume a constant is the data type you think it is. For example, in the clause WHERE bitcolumn = 1, SQL Server assumes the 1 is an int, not a bit. In the clause WHERE bigintcolumn = 30000000, SQL Server assumes the constant's type is numeric, not bigint. If any uncertainty exists, using the CAST() function with the constant is preferable to having SQL Server convert the column to a different data type.

All SARGs Are Not Created Equal

After the optimizer determines that the query contains a SARG, it checks whether any possibly useful indexes exist on the column involved. If so, the optimizer uses the available statistics for the index to estimate how many rows SQL Server will need to access; in this way, it determines whether the index will be useful. The statistics have two parts. For the first column in an index, the statistics contain a histogram, so for a given value, the optimizer can make a very close estimate of how many times that value will appear. For example, look again at the execution plan for this query:

SELECT * FROM NewOrders
WHERE OrderID < '10249'

If you're using the graphical execution plan, you can put your cursor over the index seek icon and see that the estimated number of rows is six. The actual number of rows the query returns during execution is five, but the estimate is pretty close. Note that only the first column in an index has a corresponding histogram even though an index can contain up to 16 columns. For example, if you had a composite index on lastname, firstname in the Employees table, you could issue this query:

— This query is just a sample;
— you can't really run it.
SELECT *
FROM employees
WHERE lastname = 'Clark'
AND firstname = 'Lewis'

Although SQL Server will consider the entire WHERE clause as a SARG, including both conditions, the histogram would give the optimizer an estimate of how many rows contain a last name value of Clark but couldn't tell how many contain both the requested first and last names.

In addition to the histogram, the statistics track density information—the average number of duplicates—for every left-based set of columns. For example, if you had an index on lastname, firstname, country, the statistics would tell you the average number of times each lastname value occurs, the average number of times each lastname/firstname combination occurs, and the average number of times each lastname/firstname/country combination occurs. (For more details about optimizer statistics in SQL Server 2000, see the white paper "Statistics Used by the Query Optimizer in Microsoft SQL Server 2000" at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/html/statquery.asp.)

If you don't give the optimizer a known value to look for, it can't use the histogram, but it can use the density information. For example, consider these two queries against the NewOrders table:

In the first query, OrderID is a character column and the query is comparing it to a character constant, so the optimizer can use the histogram and will come up with a good estimate. The second query contains a SARG (the column appears by itself), but the optimizer won't actually perform the substring operation. It can tell that, during execution, it will be looking for the OrderID value to be less than some specified value, but the optimizer doesn't know beforehand which value to use, so it can't use the histogram. The optimizer will guess how many rows might satisfy the WHERE clause. If you look at the plan details, you'll see that the guess is about 30 percent of the rows in the table. That's too many rows for the index to be useful, so the optimizer chooses a table scan for the second query, even though it returns exactly the same data as the first query.

Each of these queries contains a SARG and returns the same results. However, although the optimizer can't use the statistics for the index on the OrderID column for the second query, it can use the density information because this query contains an equality. The density information tells the optimizer that each value occurs no more than once, so even though the optimizer doesn't know what specific value it will be looking for, it knows that the query will return no more than one row and that using the index on OrderID is a good choice.

Give the Optimizer a Hand

In an ideal world, the query optimizer would have access to all the information it needed, but right now, that isn't the case. Therefore, the more information you can give the optimizer to help keep it from having to guess, the better plan it can devise. Make your data types explicit so that SQL Server doesn't assume you mean a different type and perform an internal conversion. And make sure you provide data in the most explicit form possible; don't make SQL Server and the optimizer determine function results during query processing. Next month, to help you improve your queries, I'll show you a few more queries that contain SARGs but for which the optimizer has to guess and can't take full advantage of the statistics.