A creative use of IGNORE_DUP_KEY

Let’s say you have a big table with a clustered primary key, and an application that inserts batches of rows into it. The nature of the business is that the batch will inevitably sometimes contain rows that already exist in the table.

The default SQL Server INSERT behaviour for such a batch is to throw error 2627 (primary key violation), terminate the statement, roll back all the inserts (not just the rows that conflicted) and keep any active transaction open:

CREATE TABLE #Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
GO

-- Start with values 1 & 5 in the table
INSERT #Big (pk) VALUES (1), (5);

-- Our batch transaction
BEGIN TRANSACTION;

-- Insert a batch containing some pre-existing rows
INSERT #Big (pk) VALUES (1), (2), (3), (4), (5);

-- Show the contents of the table after the insert statement
SELECT pk FROM #Big;

-- Show the transaction count
SELECT tran_count = @@TRANCOUNT;

-- Rollback
ROLLBACK TRANSACTION;

-- Final table contents
SELECT pk FROM #Big;
GO

-- Tidy up
DROP TABLE #Big;

The output shows error 2627 (violation of PRIMARY KEY constraint PK_Big), a table still containing only the values 1 and 5, and a transaction count of 1: the whole INSERT statement was rolled back, but the transaction was left open.

Ignoring Duplicates

Let us further imagine that the desired behaviour is that new rows in a batch should be inserted, and any duplicates silently rejected. Most importantly, no error messages should be returned to the client. Ideally, this would be achieved without immediate application changes, and without impacting concurrent users of the table (the instance is Enterprise Edition, so online operations are available).

This seems like an ideal use for the IGNORE_DUP_KEY option:

CREATE TABLE dbo.Big
(
    pk int NOT NULL,
    CONSTRAINT PK_Big
        PRIMARY KEY (pk)
        WITH (IGNORE_DUP_KEY = ON)
);
GO

-- Unique values
INSERT dbo.Big (pk)
VALUES (1), (3), (5);
GO

-- key 3 already exists
INSERT dbo.Big (pk)
VALUES (2), (3), (4);

That script executes successfully with just a warning (not an error!) about the duplicate key: "Duplicate key was ignored."

The problem

We would like to add the IGNORE_DUP_KEY option to the existing primary key constraint, but the ALTER TABLE command does not have an ALTER CONSTRAINT clause. We do not want to drop the existing primary key and recreate it with the new option, because the table is large and we want to avoid disrupting concurrent users. Dropping and recreating would result in rebuilding the entire table first as a heap and then as a clustered table again.
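For reference, the disruptive approach we are trying to avoid would look something like this (a sketch only; on a real, large table each step means a full rebuild and significant blocking):

```sql
-- Drop the primary key: the table is rebuilt as a heap
ALTER TABLE dbo.Big DROP CONSTRAINT PK_Big;

-- Recreate it with the option we want: rebuilt as a clustered table again
ALTER TABLE dbo.Big
ADD CONSTRAINT PK_Big PRIMARY KEY (pk)
WITH (IGNORE_DUP_KEY = ON);
```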

Although there is no ALTER CONSTRAINT syntax, we do know that certain constraint modifications can be performed using ALTER INDEX REBUILD on the index used to enforce the constraint. We can, for example, change the type of compression used, the index fill factor, and whether row locking is enabled:

CREATE TABLE dbo.Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
GO

INSERT dbo.Big (pk)
VALUES (1), (2), (3), (4), (5);
GO

ALTER INDEX PK_Big ON dbo.Big
REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);
GO

ALTER INDEX PK_Big ON dbo.Big
REBUILD WITH (FILLFACTOR = 90, ALLOW_ROW_LOCKS = OFF, ONLINE = ON);
GO

DROP TABLE dbo.Big;

Unfortunately, we cannot use the same trick to add the IGNORE_DUP_KEY option to the underlying index for the primary key:
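The natural attempt would be a rebuild specifying the option (a sketch of the failing statement):

```sql
ALTER INDEX PK_Big ON dbo.Big
REBUILD WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);
```

This fails with error 1979, which complains that the ignore_dup_key index option cannot be used to alter an index that enforces a primary key or unique constraint.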

The same error message results even if the ONLINE = ON option is not specified. There will be a range of views about how accurate the error message text is here, but we cannot avoid the fact that the operation we want to perform is not supported by SQL Server.

A creative workaround

One idea is to add a new UNIQUE constraint (or index) on the same columns as the primary key, but with the IGNORE_DUP_KEY option added. This operation can be performed ONLINE:

-- Existing table

-- Existing table
CREATE TABLE dbo.Big (pk int NOT NULL CONSTRAINT PK_Big PRIMARY KEY);
GO

-- Existing data
INSERT dbo.Big (pk) VALUES (1), (2), (3);
GO

-- New constraint (or index) with IGNORE_DUP_KEY, added ONLINE
ALTER TABLE dbo.Big
ADD CONSTRAINT UQ_idk
UNIQUE NONCLUSTERED (pk)
WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);

--CREATE UNIQUE NONCLUSTERED INDEX UQ_idk
--ON dbo.Big (pk)
--WITH (IGNORE_DUP_KEY = ON, ONLINE = ON);
GO

-- Key 3 is a duplicate, just a warning now
INSERT dbo.Big (pk) VALUES (3), (4), (5);
GO

SELECT pk FROM dbo.Big;

The new arrangement results in the correct final state of the database, without throwing an error. The effect is the same as if we had been able to modify the existing primary key constraint to add IGNORE_DUP_KEY:

Execution plan analysis

There are some drawbacks to this idea, which we will explore in some detail. The drawbacks are significant, so adding the extra index with IGNORE_DUP_KEY is only really suitable as a temporary solution, as we will see.

The first INSERT in the previous script (without the extra constraint or index with IGNORE_DUP_KEY) is pretty trivial; it just shows the constants from the VALUES clause being inserted into the clustered index:

The second INSERT (with the IGNORE_DUP_KEY index) is rather more complex (click to enlarge):

An extra index to maintain

One fairly obvious consequence of adding the new index is that the Clustered Index Insert operator now has to maintain the new nonclustered index too. I used Plan Explorer above because it shows the per-row nonclustered index insert more explicitly than SSMS, where you have to dig into the Object node in the Properties window with the relevant graphical operator selected:

Another way to see the nonclustered index maintenance explicitly is to run the query again with undocumented trace flag 8790 to produce a wide update plan:

Besides maintaining the extra index, the IGNORE_DUP_KEY plans seem to be doing a lot of extra work: there are lots of new operators compared with the simple insert. As you would expect, all the new operators are associated with ignoring duplicate keys, and there are two distinct cases to consider.

Rows that already exist

The first case relates to checking for INSERT rows that already exist in the base table. This checking is implemented in the execution plan by the Left Semi Join, Index Seek, and Assert operators:

The Index Seek looks for the key value of the current row we are looking to insert. The Semi Join cannot be an Inner Join because that would reject rows where a match was not found (and we need to insert a row in that case). Nevertheless, the query processor needs to know if an existing row was found with the same key value. The Semi Join sets a value to indicate this, which is stored in the Probe Column:

The Assert operator (known internally as a Stream Check) tests a condition and raises an error if the test fails. Assert operators are often seen in plans that affect columns with CHECK constraints, for example. In this case, however, the Assert does not raise an error, it emits the ‘Duplicate key was ignored.’ message instead, and only passes a row on to its parent (the Sort operator) if the condition passes. The check performed by the Assert is based on the Probe value stored in the expression labelled [Expr1013] above:

The Assert passes rows on where the predicate evaluates to NULL. If a match was found by the Index Seek (so the Probe Column, Expr1013, is not NULL) the Assert does not pass the row on, and raises a warning instead. The following TF 8607 output fragment for the INSERT statement shows the option (there is nothing equivalent in regular showplan output, unfortunately):

Duplicates within the Insert Set

The first check only looks for rows from the insert set that already exist in the base table. The query processor also needs to check for duplicates within the set of values we are inserting. The Segment and Top operators in the plan combine to meet this requirement:

The Segment requires a stream ordered by the index keys, and adds a flag to indicate the start of a new group of key values. The Top operator returns only one row from each group. The overall effect is to remove rows with duplicate index key values.
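Logically, the Segment and Top pair keeps just the first row of each key group. The effect is roughly equivalent to the following (a sketch only; #InsertSet is a hypothetical table standing in for the set of rows being inserted):

```sql
SELECT pk
FROM
(
    -- Number the rows within each key group (Segment's grouping)
    SELECT pk,
           rn = ROW_NUMBER() OVER (PARTITION BY pk ORDER BY pk)
    FROM #InsertSet
) AS s
WHERE s.rn = 1;  -- Top: keep one row per group
```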

The Top and Segment execution operators together implement a single physical query processor operator called a "Group-By Top". I mention this detail because we need to be familiar with the internal name to understand the TF 8607 output, which indicates that this operator also emits a 'Duplicate key was ignored.' warning when it encounters a group containing more than one row:

Locking

Adding the new index means that locks will now be taken on the nonclustered index when checking for existing rows:

It might surprise you to learn that the index seek acquires Range S-U locks on the nonclustered index (extract from the Books Online link below):

Key-range locks like this are only taken under the SERIALIZABLE isolation level, but our INSERT statement uses them whatever isolation level the user runs the query under (for example, range locks are still taken if the session is running under the default READ COMMITTED isolation level).

SQL Server raises the effective isolation level to SERIALIZABLE for the Index Seek (and just that operator) because it needs to be sure that if it does not find a match in the index, that situation remains the same until it inserts the new row. The Index Seek is looking for a key value that does not exist, so a range lock is necessary to prevent a concurrent transaction inserting a row with that key value before we do. There is no existing key in the index to lock (because the row does not exist) so a range lock is the only option. If SQL Server did not do this, our INSERT query might fail with a duplicate key error despite the IGNORE_DUP_KEY setting (we check for a row, don’t find it, someone else inserts it, then we try to).

The optimizer adds a SERIALIZABLE hint for the Index Seek operator (internally, a physical Range operator) as we can see using another extract from the TF 8607 output:

The execution plan is also forced to use the nonclustered index with the IGNORE_DUP_KEY setting for this seek (FORCEDINDEX) and an update lock (UPDLOCK) hint is also given to help prevent conversion deadlocks. Nevertheless, you may find increased deadlocking if you choose to add an extra IGNORE_DUP_KEY index like this.

Permanent solutions

Adding a nonclustered unique index with the IGNORE_DUP_KEY option and the same key as the clustered primary key allowed us to solve an immediate problem without code changes, while keeping the table online, but it does come at a price. The execution plan is much more complex (and expensive) than the original INSERT, and there is a chance of deadlocks. The biggest performance impact of adding the extra index is of course the cost of maintaining it, meaning we need to look at a more permanent solution via a code change.

IGNORE_DUP_KEY

The (trivial) test I am going to run inserts 5000 unique rows from my table of numbers into the table we have been using so far. To establish a baseline, we will first look at the execution plan for an INSERT to the table with a nonclustered IGNORE_DUP_KEY index:

INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000;

This plan is very similar to the one we analysed earlier, though a Merge is used to perform the Left Semi Join, and a wide update plan has been chosen, making the nonclustered index insert easier to see. The other main feature is an Eager Table Spool, required for Halloween Protection. The estimated cost of this execution plan is 0.363269 cost units on my installation.

Modified INSERT

The first code alternative to the extra index is to modify the INSERT statement to check that rows do not already exist in the destination. There are a number of SQL syntaxes for this, each with different characteristics and performance in different circumstances. To keep things simple, and because I only want to make a couple of specific points here, I have chosen my simple example to produce the same execution plans for all the common syntax variants. The first thing to do is to drop our extra constraint:

ALTER TABLE dbo.Big
DROP CONSTRAINT UQ_idk;

Now we can evaluate the modified INSERT statement:

-- I prefer this syntax
INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
AND NOT EXISTS (SELECT * FROM dbo.Big WHERE pk = n);

-- With EXCEPT
INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
EXCEPT
SELECT pk FROM dbo.Big;

-- Not recommended
INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
AND n NOT IN (SELECT pk FROM dbo.Big);

All three produce this execution plan:

This has an estimated cost of 0.0981188 units – much cheaper than the 0.363269 cost seen previously. The Halloween Protection spool still features in the plan, but there is a weakness in our queries that you might have already spotted. We are not doing anything to protect against duplicate key violation errors in case someone else inserts a row after we have checked to see if it exists, but before we insert it. The query optimizer added a SERIALIZABLE hint when it added an existence check, so if avoiding errors is important to us, we need to do the same:
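One way to add the hint is on the existence check itself; a sketch of the idea, using the NOT EXISTS form from above (the hint placement on the inner reference to dbo.Big is the important part):

```sql
INSERT dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000
AND NOT EXISTS
(
    -- SERIALIZABLE ensures the "row not found" result holds
    -- until our insert completes
    SELECT * FROM dbo.Big WITH (SERIALIZABLE)
    WHERE pk = n
);
```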

We do not need the UPDLOCK hint for two reasons. First, the engine automatically takes update locks for us when reading the source table. Second, we are reading from the same index we are inserting to (not reading from a nonclustered index and inserting to the clustered index) so the previous deadlock scenario is not applicable.

Using MERGE

Another option is to use the WHEN NOT MATCHED feature of the Merge statement. This time we will add the necessary SERIALIZABLE hint up front:

MERGE dbo.Big WITH (SERIALIZABLE) AS b
USING (SELECT n FROM dbo.Numbers WHERE n BETWEEN 1 AND 5000) AS s
ON s.n = b.pk
WHEN NOT MATCHED THEN
    INSERT (pk) VALUES (s.n);

This plan has an estimated cost of 0.0950127 units – slightly less than the 0.0981188 units for the modified INSERT plans. Some of this improvement is due to the lack of a Halloween Protection spool, for interesting reasons I cover in depth in a short series of articles to be published shortly (update: part one is out).

These are not meant to be performance tests by any stretch of the imagination. There are any number of subtle factors that will affect the execution plans and run times for different numbers of rows, different distributions, and so on. I should stress that I find MERGE plans perform less well than separate INSERT/UPDATE/DELETE statements more often than not. Anyway, in case you are interested, typical performance results on my machine for this specific test are (INSERT first, then MERGE), timings in milliseconds:

IGNORE_DUP_KEY and Clustered Indexes

When IGNORE_DUP_KEY is specified for a unique clustered index (primary key or otherwise), duplicates are handled by the storage engine rather than the query processor. For example:
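A minimal sketch (illustrative table and values; the second INSERT succeeds with just the 'Duplicate key was ignored.' warning):

```sql
CREATE TABLE #Clustered
(
    pk int NOT NULL,
    CONSTRAINT PK_Clustered
        PRIMARY KEY CLUSTERED (pk)
        WITH (IGNORE_DUP_KEY = ON)
);

INSERT #Clustered (pk) VALUES (1), (2);
INSERT #Clustered (pk) VALUES (2), (3);  -- warning only; just 3 is added

DROP TABLE #Clustered;
```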

The IGNORE_DUP_KEY side of things is handled entirely by the storage engine – if it finds a row in the unique index where it was going to insert one, it flags a warning but does not try to insert (which would cause an error). The estimated cost of this query plan is identical whether the unique clustered index has IGNORE_DUP_KEY or not – there is no extra work for the query processor.

If you are wondering why the same mechanism is not used for a nonclustered unique index with the IGNORE_DUP_KEY option, my understanding is that it is because the clustered index is always updated before the nonclustered ones (even in a narrow plan). By the time the storage engine detected a duplicate in the nonclustered index, it would already have added the row to the base table. So, the query processor handles IGNORE_DUP_KEY for nonclustered indexes.

This is my understanding of the mechanics of IGNORE_DUP_KEY; I don't know for sure, and will happily correct any details I may have wrong. Any storage engine experts reading this, please contact me if so.

Acknowledgement

This post was inspired by a post by Daniel Adeniji and our subsequent discussion of it. My thanks to him for permission to analyse the issue further here, and for allowing me to reference his original. The issue described here may not accurately reflect Daniel’s original problem – I used a certain amount of artistic licence. Even if the specific issue is not particularly interesting to you, I hope you enjoyed some aspects of the analysis and maybe picked up some new information along the way.

The warning on IGNORE_DUP_KEY is raised as an *error* by current SQLOLEDB/ODBC drivers, so it is impossible to use it in client applications. I have never tested the recent NCLI/.NET providers, but possibly the issue is fixed there.

@tobi An INSTEAD OF trigger? I guess so, though the overheads are even worse than the extra operators in the plans I show here. Both INSTEAD OF and AFTER triggers require storage for the inserted and deleted tables.

Neat article. I had to stuff 5 databases into 1 (of course, for a "special" reporting db) and got that same error years ago. Funny, no one could get it to work for a long time, I was told; I found the IGNORE_DUP_KEY option and bingo... everyone in the pool... problem solved.

great article! i've had some interesting adventures with IGNORE_DUP_KEY recently and would like to share the following observations (SQL 2008 R2):

1.) as far as i can tell, the only time you can get IGNORE_DUP_KEY=ON going for a clustered index is during initial PK creation (using the syntax you have above); specifying the IGNORE_DUP_KEY option WITH (DROP_EXISTING=ON, IGNORE_DUP_KEY=ON) will always fail, even if the PK/clustered index was created with IGNORE_DUP_KEY=ON in the first place.

2.) the above is aggravating because if you use DROP_EXISTING=ON to move your clustered index to a new filegroup/scheme/whatever, you will lose the IGNORE_DUP_KEY setting and cannot get it back without dropping to a heap and recreating.

3.) however, if you create an empty, identical table with IGNORE_DUP_KEY=ON for your primary key and ALTER TABLE...SWITCH to it, the setting from the new table will take effect. so as a workaround for huge tables, under the right circumstances you can ALTER TABLE SWITCH to an identical (empty) shell, drop and recreate your original table with IGNORE_DUP_KEY=ON, and then switch back, and in effect the IGNORE_DUP_KEY = ON will now be in force on the original huge table.
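[Ed: a sketch of the steps described in point 3, with illustrative names; ALTER TABLE ... SWITCH has additional requirements (matching structure, same filegroup) not shown here.]

```sql
-- Empty shell with an identical structure
CREATE TABLE dbo.BigShell (pk int NOT NULL CONSTRAINT PK_BigShell PRIMARY KEY);

-- Metadata-only move of all rows into the shell
ALTER TABLE dbo.Big SWITCH TO dbo.BigShell;

-- Recreate the original table with IGNORE_DUP_KEY = ON
DROP TABLE dbo.Big;
CREATE TABLE dbo.Big
(
    pk int NOT NULL,
    CONSTRAINT PK_Big PRIMARY KEY (pk)
    WITH (IGNORE_DUP_KEY = ON)
);

-- Switch back; the target table's IGNORE_DUP_KEY setting now applies
ALTER TABLE dbo.BigShell SWITCH TO dbo.Big;
DROP TABLE dbo.BigShell;
```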

Interesting how many absurd gyrations this requires with MSSQL, when the much used low end database engine "MySQL" can do this with a simple option to the INSERT query itself, in the form of an IGNORE keyword, which instructs the engine to ignore any insert related errors without race conditions or explicit extra locks (and without global database changes affecting regular INSERT statements by any other (or the same) client).

The resulting MySQL statement becomes simply:

INSERT IGNORE dbo.Big (pk)
SELECT n FROM dbo.Numbers
WHERE n BETWEEN 1 AND 5000

(I actually encountered this difference when using a Microsoft sponsored script that translates MySQL statements to MSSQL statements on the fly, but fails to handle the occasional "INSERT IGNORE" statements. The actual queries originate in a 3rd party application, so I have no control over that).

While you can't ALTER INDEX PK_Big ON dbo.Big REBUILD WITH (IGNORE_DUP_KEY = ON) to set the value after-the-fact on a clustered index you can do it using ALTER TABLE dbo.Big REBUILD WITH (IGNORE_DUP_KEY = ON).

Cody, yes that might be a good alternative for a unique clustered index. It can be online as well, more so in later versions that don't have as many restrictions. There might still be good reasons to employ the workaround though. And of course the whole post is just an excuse to talk about some internals.