More on TOP

Last week I wrote about a special case of the TOP operator known as ROWCOUNT TOP. This week I’ll take a look at some other interesting TOP scenarios. In general, TOP is a fairly mundane operator. It simply counts and returns the specified number of rows. SQL Server 2005 does include two enhancements to TOP that were not present in SQL Server 2000. First, in SQL Server 2000, we can only specify an integer constant for the number of rows to return. In SQL Server 2005, we can specify an arbitrary expression, including an expression containing T-SQL variables or parameters. Second, SQL Server 2000 only permits TOP in a SELECT statement (though it does support ROWCOUNT TOP in INSERT, UPDATE, and DELETE statements). SQL Server 2005 permits TOP with SELECT, INSERT, UPDATE, and DELETE statements.

In this post, I’m going to focus only on some simple examples involving SELECT statements. Let’s create a small table to get started:

TOP is often used in conjunction with ORDER BY. Combining TOP with ORDER BY adds determinism to the set of rows returned. Without the ORDER BY, the set of rows returned depends on the query plan and may even vary from execution to execution. If we have an index to provide order, we continue to get a simple query plan (notice the ORDERED FORWARD keywords):

Note that without an index to find the top 5 rows, SQL Server must scan all 100 rows in the input table. Also note that the sort is actually a “top sort.” A top sort generally uses less memory than a regular sort as it only needs to identify and sort the top few rows rather than sorting the entire input.

Now let’s consider what happens if we request the top 5% of rows. To figure out the actual number of rows to return, SQL Server must count all of the rows and calculate 5%. This makes queries that use TOP PERCENT less efficient than queries that use TOP with an absolute row count.

Like the prior example, SQL Server must scan all 100 rows in the input table. In this example, SQL Server uses an eager spool which reads and counts all of the input rows before returning any rows. The top then requests the row count from the spool, calculates 5%, and proceeds like any other top.

If SQL Server must sort, the sort can also provide the count of input rows. However, only a regular sort can provide this count. A top sort must know the number of rows to return from the start.

TOP WITH TIES is also incompatible with top sort. TOP WITH TIES cannot know for certain how many rows will be returned until the number of ties is determined. For this example, let’s add a tie for the fifth row returned:

Although the above plan does not seem to reflect that we have a TOP WITH TIES, the argument column of SHOWPLAN_ALL or STATISTICS PROFILE includes this information: “TIE COLUMNS:([T].[B])”. This information is also available in the graphical and XML plans on SQL Server 2005. Note that the TOP actually returns one extra row. When a TOP N WITH TIES reaches the Nth row, it keeps a copy of the tie columns for this row (in this example B==4) and compares each subsequent row to that row. As long as there are rows that match, it keeps returning them. Because the top must compare subsequent rows until it finds a non-match to the Nth row, the top retrieves one more row from the sort than it returns.

Finally, there are a couple of special cases. The optimizer knows that TOP 0 and TOP 0 PERCENT never return any rows and replaces any query plan with a TOP 0 with a constant scan:

SELECT TOP 0 * FROM T

|–Constant Scan

The optimizer also knows that TOP 100 PERCENT always returns all rows and removes the top operator from the query plan:

SELECT TOP 100 PERCENT * FROM T

|–Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Both of these optimizations require that the number of rows is a constant. For example, using an expression that includes a T-SQL variable or parameter will prevent the optimization. Both optimizations also work with INSERT, UPDATE, and DELETE statements.

I understand that a Table spool is a physical operator that scans the input and places a copy of each row in a table that is stored in the tempdb database. So in this plan, are we actually creating a copy of the table T and then scanning the Table Spool in Tempdb in order to answer the query: SELECT TOP 5 PERCENT * FROM T? or the server only stores a set of pointers? is there any particular reason why the optimizer chooses this path? what happened if the table is huge? Wouldn’t it be more efficient (memory wise) to remove the Table Spool and have a second Clustered Index Scan for the total of rows we want to retrieve?

The table spool does create a copy of the table though it only copies those columns that are specified in the SELECT list of the query. In some cases, this example is likely one of those cases, it might be faster to perform two clustered index scans. However, if the table has many large columns that are not output, the spool could actually make the query faster by eliminating unnecessary columns from the second scan and, thus, reducing the number of I/Os. More importantly, if the input to the spool is a complex operation such as an aggregation or a join, it could be much faster to spool the results than to recompute them.

The presence of the scan operator does not mean that SQL Server scans the entire table. In fact, as the statistics profile output shows, SQL Server actually only scans 5 rows. SQL Server executes this query plan by requesting one row at a time from the top operator which requests one row at a time from the scan operator. Once the top operator has returned 5 rows, it terminates the scan (regardless of the number of rows in the table) and returns that execution is complete.

Hi there is an issue I am facing with TOP Clause. Please guide me on this. The scenario is:

In my Product table I have a ProductID & ProductPrice column. ProductID column has unique values while in ProductPrice column some values/data are similar. The issue is when using top 1 the query returns the second row and not the first one.

SQL Server uses a full sort for the first query and a top sort for the second query. Top sort is faster than a full sort when you only need a small number of rows. However, the two operators can (as you've observed) produce different orders for duplicate rows. You can resolve this issue by either adding a second column to the sort (e.g., … order by ProductPrice, ProductID) or by requesting duplicates from the top operation (e.g., select top 1 with ties Product ID from …). Note that using top with ties will force a full sort (which may hurt performance) since the actual number of rows needed is unknown.

It means that the query is not able to distinguish the rows having same data for the ordered column. So every time making a query, I always need to bear in mind that both the queries result set will be different 🙂 & query accordingly. Does it make sense from the programmers point of view? As they can't always predict the data whether it will be same or not.

In order to make the query faster why did they forgot that one should receive the correct data/result set evenif using any sql clauses .

I am grateful to you for making the point clear to me. If possible can you send me some link or document that explains exactly how the TOP Clause along with Order By clause works internally.

The results differ because the query is ambiguous as to which row(s) are desired. There is more than one correct answer and no one answer is more or less correct than the others. As I noted, you can eliminate the ambiguity and achieve consistent results by choosing an explicit sort order or by requesting duplicates. I'm not aware of any other resources that describe TOP or TOP Sort in more detail.