I have a large SQL Server database with a table of about 45 million records. I am archiving this table and need to remove all entries older than two years. The insert into my archive table works fine, but I'm having efficiency issues with the delete.

My problem lies with the indexes currently on the table. I would like to delete (and archive-insert) in 1000-record chunks. To do this, I need to determine the "top" 1000 records fulfilling the requirement (older than two years). The DateTime stamp on the row is a clustered index, so it is great for grabbing the rows. However, SQL Server 2000 does not allow DELETE TOP 1000..., so I need to do something like:

DELETE FROM <table> WHERE [UniqueID] IN
(SELECT TOP 1000 [UniqueID] FROM <table> WHERE [DateTime] < @TwoYearsAgo)

This would work great if UniqueID were indexed. Since it is not, this takes a very long time (it scans the table for each of the 1000 records to be deleted). There are no other indexes on the table that uniquely identify the records. I am told it would be too costly to build an index on UniqueID, as this is a live DB. Can anyone point out a way to optimize this query?

How on earth did you manage to add a column named UniqueID that identifies the rows but has no index? Don't you have a primary key?
–
Lasse V. Karlsen, Dec 17 '09 at 23:10

I did not design the table; I am a programmer new to the project, charged with archiving it. It is a write-heavy table (used for logging) with very few reads, so creating an additional index (or PK, for that matter) would add unnecessary insertion time. Again, this is not how I would have designed it, but it is what it is.
–
Kevin, Dec 17 '09 at 23:18

The person who told you it was too expensive to index is lying. :-)
–
onupdatecascade, Dec 17 '09 at 23:19

I understand the need for optimal insert performance on a logging table, but that also means you will have problems doing fast queries on it. You can't have your cake and eat it too, something has to give. But test the answers you've been given, with SET ROWCOUNT, and see if that doesn't help.
–
Lasse V. Karlsen, Dec 17 '09 at 23:20

@onupdatecascade, that's oversimplifying it. An index on a table does indeed have a negative performance impact on inserts and updates, and for logging tables inserts are the dominant usage, so a design choice might have been made to disallow indexing on this table to keep overhead low. It might also be a boneheaded decision, but it's a bit too simple to just say that the person is lying. There might be good reasons to do this, even if we may disagree with them.
–
Lasse V. Karlsen, Dec 17 '09 at 23:22

7 Answers

Also note that per the documentation for DELETE, it does support the TOP clause, but apparently only as of SQL Server 2005. I mention this because it sounds like it isn't supported on your server, but have you actually tried it? I don't have access to the SQL Server 2000 documentation, so I'm not sure whether that version supports it; it very well might not.

DELETE TOP (1000) FROM <table> WHERE [DateTime] < @TwoYearsAgo

Note the difference from the way TOP can be written in a SELECT, without the parentheses. For UPDATE, DELETE, and INSERT, the expression must be parenthesized, even if it is only a constant number as above.
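On 2005 and later, the statement above is typically run in a loop so the old rows come off in small batches. A sketch, using the table and variable names assumed from the question (not a real schema):

```sql
-- Batched delete on SQL Server 2005+; each pass removes at most 1000 rows
-- via the clustered index on [DateTime]. MyTable is an assumed name.
DECLARE @TwoYearsAgo DATETIME;
SET @TwoYearsAgo = DATEADD(YEAR, -2, GETDATE());

WHILE 1 = 1
BEGIN
    DELETE TOP (1000)
    FROM MyTable
    WHERE [DateTime] < @TwoYearsAgo;

    IF @@ROWCOUNT = 0 BREAK;  -- nothing older than two years remains
END
```

Keeping each batch small keeps the transaction log and lock footprint per statement small, at the cost of more round trips.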

I am also pushing to move to Server 2008, but we are most likely going to trim the database before we move it to a new instance.
–
Kevin, Dec 17 '09 at 23:21

Yes, I have tried both with and without parentheses, to no avail.
–
Kevin, Dec 17 '09 at 23:37

Please note that according to the MSDN docs, SET ROWCOUNT will no longer affect INSERT, DELETE, and UPDATE statements in the release after SQL Server 2012. So if you want a query that works on all versions, you'd have to use the DELETE FROM ... (SELECT TOP ...) form.
–
ChrisWue, Jul 9 '13 at 3:42
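For completeness, the SET ROWCOUNT technique the comments mention does work on SQL Server 2000, where DELETE TOP is unavailable. A sketch, again with assumed names from the question:

```sql
-- On SQL Server 2000, SET ROWCOUNT caps how many rows a DELETE touches,
-- giving the same chunking effect as DELETE TOP. MyTable is an assumed name.
DECLARE @TwoYearsAgo DATETIME;
SET @TwoYearsAgo = DATEADD(YEAR, -2, GETDATE());

SET ROWCOUNT 1000;
WHILE 1 = 1
BEGIN
    DELETE FROM MyTable
    WHERE [DateTime] < @TwoYearsAgo;  -- range seek on the clustered index

    IF @@ROWCOUNT = 0 BREAK;
END
SET ROWCOUNT 0;  -- restore the default (unlimited)
```

Because the predicate is on the clustered [DateTime] index, each batch should be a cheap range seek rather than a scan; just remember to reset ROWCOUNT afterwards.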

I had to do something similar a while back -- a lightweight insert-and-delete to move old records to an archive table. Although counterintuitive, the fastest and least disruptive solution I found was:

Make a small #temp table holding the IDs of the top (x) rows. If ID really can't be indexed in your scenario, you might use date AND ID together, so the combination of the two can use an index.

begin tran

Insert into archive table where ID and DATE in ( #temp )

Delete from main table where ID and DATE in ( #temp )

commit

Truncate #temp

Repeat

Having the temp table to stage the row identifiers is more total work than a straight delete, but makes the process very lightweight in cases where you want to just chip away a little at a time without blocking.
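The outline above might look something like the following in T-SQL. Table names (LogTable, LogArchive) and the assumption that the two tables share a schema are mine, not the answerer's:

```sql
-- Staged archive loop: stage identifiers in #batch, then copy and delete
-- inside one short transaction per chunk. LogTable/LogArchive are assumed.
DECLARE @TwoYearsAgo DATETIME;
SET @TwoYearsAgo = DATEADD(YEAR, -2, GETDATE());

CREATE TABLE #batch ([UniqueID] INT, [DateTime] DATETIME);

WHILE 1 = 1
BEGIN
    INSERT INTO #batch ([UniqueID], [DateTime])
    SELECT TOP 1000 [UniqueID], [DateTime]
    FROM LogTable
    WHERE [DateTime] < @TwoYearsAgo;      -- seek on the clustered index

    IF @@ROWCOUNT = 0 BREAK;

    BEGIN TRAN;

    INSERT INTO LogArchive
    SELECT t.*
    FROM LogTable t
    JOIN #batch b
      ON t.[UniqueID] = b.[UniqueID] AND t.[DateTime] = b.[DateTime];

    DELETE t
    FROM LogTable t
    JOIN #batch b
      ON t.[UniqueID] = b.[UniqueID] AND t.[DateTime] = b.[DateTime];

    COMMIT;

    TRUNCATE TABLE #batch;
END

DROP TABLE #batch;
```

Joining on both [DateTime] and [UniqueID] lets the clustered index narrow each lookup to a single date before the ID comparison, which is what makes this viable without an index on UniqueID alone.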

Also, I agree with Lasse -- I can't see the point of a unique ID with no index, and therefore no constraint to enforce it.

I tried something similar with a locally declared temp table, but without an indexed unique identifier it didn't help much. I'll try using both the date and UniqueID and see if that gets me anywhere. Thanks!
–
Kevin, Dec 17 '09 at 23:39

I wonder whether you must stick with the 1000-record chunk requirement. If it is there mainly to limit server load and is somewhat arbitrary, you may want to try the following, since you already have a clustered index on [DateTime]:
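One way to lean on that clustered index is to delete by date window rather than by row count, so every pass is a pure range seek with no ID lookups at all. A sketch under that assumption (LogTable and the one-day window size are mine):

```sql
-- Delete old rows one day at a time using only the clustered [DateTime]
-- index; no per-row ID lookup is needed. LogTable is an assumed name.
DECLARE @TwoYearsAgo DATETIME, @From DATETIME, @To DATETIME;
SET @TwoYearsAgo = DATEADD(YEAR, -2, GETDATE());

SELECT @From = MIN([DateTime]) FROM LogTable;

WHILE @From < @TwoYearsAgo
BEGIN
    SET @To = DATEADD(DAY, 1, @From);
    IF @To > @TwoYearsAgo SET @To = @TwoYearsAgo;

    DELETE FROM LogTable
    WHERE [DateTime] >= @From AND [DateTime] < @To;

    SET @From = @To;
END
```

The chunk size becomes "one day of logging" instead of exactly 1000 rows, so you'd want to check how many rows a typical day holds before relying on this for load control.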

For backward compatibility, the parentheses are optional in SELECT statements. We recommend that you always use parentheses for TOP in SELECT statements for consistency with its required use in INSERT, UPDATE, MERGE, and DELETE statements in which the parentheses are required.

USE AdventureWorks;
GO
DELETE TOP (20)
FROM Purchasing.PurchaseOrderDetail
WHERE DueDate < '20120701';
GO