Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community.

When using a subquery to find the total count of all prior records with a matching field, the performance is terrible on a table with as little as 50k records. Without the subquery, the query executes in a few milliseconds. With the subquery, the execution time is upwards of a minute.

For this query, the result must:

Include only those records within a given date range.

Include a count of all prior records, not including the current record, regardless of date range.

Records 96 and 95 are excluded from the result, but are included in the PriorCount subquery

Current Query

select
a.*
, ( select count(*)
from Activity
where
Activity.Address = a.Address
and Activity.ActionDate < a.ActionDate
) as PriorCount
from Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30'
order by a.ActionDate desc

Question

What strategies could be used to improve the performance of this query?

Edit 1
In reply to the question of what I can modify on the DB: I can modify the indexes, just not the table structure.

Edit 2
I have now added a basic index on the Address column, but that didn't seem to improve much. I am currently getting much better performance by creating a temp table, inserting the rows without PriorCount, and then updating each row with its specific count.
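For reference, the temp-table workaround can be sketched like this (a rough sketch assuming the Activity table from the question; #Staged is a name I made up):

```sql
-- Sketch of the temp-table approach: stage the date-range rows first,
-- then fill in PriorCount row by row. #Staged is a hypothetical name.
select a.*, cast(0 as int) as PriorCount
into #Staged
from Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30';

update s
set PriorCount =
    ( select count(*)
      from Activity
      where Activity.Address = s.Address
        and Activity.ActionDate < s.ActionDate )
from #Staged s;

select * from #Staged order by ActionDate desc;
```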

Edit 3
The Index Spool that Joe Obbish (accepted answer) found was the issue. Once I added a new nonclustered index [xyz] on [Activity] (Address) include (ActionDate), the query times went down from upwards of a minute to less than a second without using a temp table (see edit 2).
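In DDL form, the index from this edit is (with [xyz] standing in for whatever name you prefer):

```sql
-- The nonclustered index described above; [xyz] is a placeholder name.
create nonclustered index [xyz]
on [Activity] ([Address])
include (ActionDate);
```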

1 Answer

With the index definition that you have for IDX_my_nme, SQL Server is able to seek on the ActionDate column but not on the Address column. The index contains all of the columns needed to cover the subquery, but it likely isn't very selective for that subquery. Suppose that almost all of the data in the table has an ActionDate earlier than '2017-05-30'. A seek on ActionDate < '2017-05-30' will return almost all of the rows from the index, which are filtered down further after each row is fetched. If your outer query returns 200 rows, then you are probably doing almost 200 full index scans of IDX_my_nme, which means you will read around 50,000 * 200 = 10 million rows from the index.

It's likely that seeking on Address will be far more selective for your subquery, although you haven't given us full statistical information about the query, so that's an assumption on my part. Suppose that you create an index on just Address and your table has 10k unique values for Address. With the new index, SQL Server will only need to read about 5 rows (50,000 rows / 10,000 addresses) from the index for each execution of the subquery, so you'll read around 200 * 5 = 1000 rows from the index.

I'm testing against SQL Server 2016 so there might be some minor syntax differences. Below is some sample data in which I made similar assumptions to the above for data distribution:
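The sample-data script isn't reproduced in this excerpt; a minimal stand-in consistent with the assumptions above (roughly 50k rows, 10k distinct Address values, nearly all dates earlier than the queried range) might look like:

```sql
-- Hypothetical sample data matching the assumptions in the answer:
-- ~50k rows, 10k distinct Address values, most ActionDate values
-- earlier than '2017-05-29'. Names and distribution are my guesses.
drop table if exists #Activity;

create table #Activity (
    ActivityID int identity(1,1) primary key,
    [Address] varchar(25) not null,
    ActionDate datetime not null
);

-- Bulk of the data: dates spread over the 300 days before the range.
insert into #Activity ([Address], ActionDate)
select
    'ADDR_' + cast(n % 10000 as varchar(10)),
    dateadd(day, -(n % 300) - 1, '2017-05-29')
from (
    select top (49800) row_number() over (order by (select null)) as n
    from sys.all_objects a cross join sys.all_objects b
) nums;

-- Put ~200 rows inside the queried date range.
insert into #Activity ([Address], ActionDate)
select
    'ADDR_' + cast(n % 10000 as varchar(10)),
    dateadd(hour, n % 24, '2017-05-29')
from (
    select top (200) row_number() over (order by (select null)) as n
    from sys.all_objects
) nums;
```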

I've created your index as described in the question. I'm testing against this query, which returns the same data as the one in the question:
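The exact definition of IDX_my_nme isn't shown in this excerpt; based on the behavior described above (seekable on ActionDate, covering Address), I'm assuming something like:

```sql
-- Assumed shape of IDX_my_nme: keyed on ActionDate, covering Address,
-- so the subquery is covered but seeks are only possible on the date.
create index IDX_my_nme on #Activity (ActionDate) include ([Address]);
```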

select
a.*
, ( select count(*)
from #Activity Activity
where
Activity.[Address] = a.[Address]
and Activity.ActionDate < a.ActionDate
) as PriorCount
from #Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30'
order by a.ActionDate desc;

I get an index spool. At a basic level, that means the query optimizer builds a temporary index on the fly because none of the existing indexes on the table were suitable.

The query still finishes quickly for me. Perhaps you aren't getting the index spool optimization on your system or there's something different about the table definition or the query. For educational purposes I can use an undocumented feature OPTION (QUERYRULEOFF BuildSpool) to disable the index spool. Here's what the plan looks like:
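Concretely, the hint goes on the end of the same query (again, QUERYRULEOFF is undocumented and for demonstration only, not production use):

```sql
-- Same query as before, with the undocumented hint that disables
-- the index spool so the raw cost of the subquery is visible.
select
    a.*
    , ( select count(*)
        from #Activity Activity
        where Activity.[Address] = a.[Address]
          and Activity.ActionDate < a.ActionDate
      ) as PriorCount
from #Activity a
where a.ActionDate between '2017-05-29' and '2017-05-30'
order by a.ActionDate desc
option (queryruleoff BuildSpool);
```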

Don't be fooled by the appearance of a simple index seek. SQL Server reads nearly 10 million rows from the index:

If I'm going to be running the query more than once then it probably doesn't make sense for the query optimizer to create an index each time it runs. I could create an index upfront that would be more selective for this query:
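A sketch of such an index (the name is mine; I'm including ActionDate so the count subquery stays covered):

```sql
-- Index keyed on Address so each subquery execution seeks directly to
-- the ~5 matching rows; INCLUDE keeps the subquery covered.
create index IX_Activity_Address on #Activity ([Address]) include (ActionDate);
```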

With the new index, SQL Server reads only about 1000 rows from the index; 800 of the rows are returned to be counted. The index could be defined to be more selective, but this could be good enough depending on your data distribution.

If you aren't able to define any additional indexes on the table I would consider using window functions. The following appears to work:

SELECT t.*
FROM
(
select
a.*
, -1 + ROW_NUMBER() OVER (PARTITION BY [Address] ORDER BY ActionDate) PriorCount
from #Activity a
) t
where t.ActionDate between '2017-05-29' and '2017-05-30'
order by t.ActionDate desc;

That query does a single scan of the data but does an expensive sort and calculates the ROW_NUMBER() function for every row in the table, so it feels like there's some extra work done here:

However, if you really like that code pattern you could define an index to make it more efficient:
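For example, an index whose key order matches the PARTITION BY / ORDER BY of the window function lets SQL Server compute the ROW_NUMBER without a sort (the index name is mine):

```sql
-- Key order (Address, ActionDate) matches PARTITION BY [Address]
-- ORDER BY ActionDate, so the expensive sort can be eliminated.
create index IX_Activity_Address_Date on #Activity ([Address], ActionDate);
```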

The Index Spool you found was the issue. Once I added in a new nonclustered index [xyz] on [Activity] (Address) include (ActionDate), the query times went down from upwards of a minute to less than a second. +10 if I could. Thanks!
– Metro Smurf, May 14 '17 at 19:16