A friend of mine asked me the other day how I would solve the following problem she got during an interview. Imagine you have a table with several attributes, and there are some records where all attributes except the primary key are the same (duplicates). Write one SQL statement to delete the duplicates and leave just one row for each combination of values.

If the only constraint is that it has to be one statement, and we do not care about performance, the solution is easy. Let’s say we have a table t1 with a primary key id and two attributes, attr1 and attr2.

To delete all duplicate occurrences of (attr1, attr2) we can write the following SQL statement:

DELETE FROM t1 WHERE id NOT IN (
    SELECT min(id) AS id
    FROM t1 GROUP BY attr1, attr2
);

This way only one (minimal) id for each attr1, attr2 combination will be left.
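As a quick sanity check, the statement can be exercised against an in-memory SQLite database via Python (SQLite and the exact DDL are illustrative assumptions; the table layout follows the article's t1):

```python
import sqlite3

# In-memory database with the article's t1(id, attr1, attr2) layout (assumed DDL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, attr1 INTEGER, attr2 TEXT)")
conn.executemany(
    "INSERT INTO t1 (attr1, attr2) VALUES (?, ?)",
    [(1, "a"), (1, "a"), (1, "a"), (2, "b"), (2, "b"), (3, "c")],
)

# The single-statement delete from the article: keep the minimal id per combination.
conn.execute("""
    DELETE FROM t1 WHERE id NOT IN (
        SELECT min(id) FROM t1 GROUP BY attr1, attr2
    )
""")

rows = conn.execute("SELECT id, attr1, attr2 FROM t1 ORDER BY id").fetchall()
print(rows)  # one row survives per (attr1, attr2) combination: ids 1, 4, 6
```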

However, for big tables this is not an efficient way to perform the delete. For big tables we want to become “imperative” and use a cursor with ordering.

In pseudocode (no particular 4GL), it looks like this:

DECLARE
    v_attr1 number := to_number(NULL);
    v_attr2 text := NULL;
BEGIN
    FOR rec IN (SELECT id, attr1, attr2 FROM t1
                ORDER BY attr1, attr2, id) LOOP
        IF v_attr1 IS NOT NULL AND v_attr2 IS NOT NULL
           AND v_attr1 = rec.attr1 AND v_attr2 = rec.attr2
        THEN
            -- same (attr1, attr2) as the previous row: a duplicate
            DELETE FROM t1 WHERE id = rec.id;
        ELSE
            -- new combination: remember it and keep the row
            v_attr1 := rec.attr1;
            v_attr2 := rec.attr2;
        END IF;
    END LOOP;
END;

If the table is really big, you can add a counter and commit after every X records.
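The same ordered-cursor idea with batched commits can be sketched in Python over SQLite (the engine, the helper names, and the batch size are all illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, attr1 INTEGER, attr2 TEXT)")
conn.executemany(
    "INSERT INTO t1 (attr1, attr2) VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (2, "b"), (2, "b"), (3, "c")],
)
conn.commit()

BATCH = 1000   # commit after every BATCH deletes (illustrative value)
prev = None    # last (attr1, attr2) combination seen
pending = 0

# Ordered scan: duplicates are adjacent, so one pass suffices.
for row_id, attr1, attr2 in conn.execute(
        "SELECT id, attr1, attr2 FROM t1 ORDER BY attr1, attr2, id").fetchall():
    if (attr1, attr2) == prev:
        conn.execute("DELETE FROM t1 WHERE id = ?", (row_id,))
        pending += 1
        if pending % BATCH == 0:
            conn.commit()  # intermediate commit keeps the transaction small
    else:
        prev = (attr1, attr2)
conn.commit()

print(conn.execute("SELECT count(*) FROM t1").fetchone()[0])  # 3 rows remain
```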

This is a simple problem, but quite often we need to remove duplicates that have foreign key references pointing to them, and then it becomes a real problem. Suppose t1 is a lookup table and t2 references it.

Not only do we need to remove the duplicates now, but we also need to fix the references. You can do both with a single cursor.

DECLARE
    v_t1_id integer;
    v_attr1 number := to_number(NULL);
    v_attr2 text := NULL;
BEGIN
    FOR rec IN (SELECT id, attr1, attr2 FROM t1
                ORDER BY attr1, attr2, id) LOOP
        IF v_attr1 IS NOT NULL AND v_attr2 IS NOT NULL
           AND v_attr1 = rec.attr1 AND v_attr2 = rec.attr2
        THEN
            -- duplicate: repoint references to the kept row, then delete
            UPDATE t2 SET t1_id = v_t1_id
             WHERE t1_id = rec.id;
            DELETE FROM t1 WHERE id = rec.id;
        ELSE
            -- new combination: remember its id and keep the row
            v_t1_id := rec.id;
            v_attr1 := rec.attr1;
            v_attr2 := rec.attr2;
        END IF;
    END LOOP;
END;

Once again, you may need to commit periodically if the tables are big enough and you do not want to paralyze the work of the system.
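A sketch of the same pass with reference fixing, again over SQLite purely for illustration (the t1/t2 names follow the article; the DDL is assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, attr1 INTEGER, attr2 TEXT)")
conn.execute("CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER REFERENCES t1(id))")
conn.executemany("INSERT INTO t1 (id, attr1, attr2) VALUES (?, ?, ?)",
                 [(1, 1, "a"), (2, 1, "a"), (3, 2, "b")])
conn.executemany("INSERT INTO t2 (t1_id) VALUES (?)", [(1,), (2,), (3,)])

keeper_id = None   # id of the first (kept) row of the current combination
prev = None

for row_id, attr1, attr2 in conn.execute(
        "SELECT id, attr1, attr2 FROM t1 ORDER BY attr1, attr2, id").fetchall():
    if (attr1, attr2) == prev:
        # Repoint references to the kept row, then delete the duplicate.
        conn.execute("UPDATE t2 SET t1_id = ? WHERE t1_id = ?", (keeper_id, row_id))
        conn.execute("DELETE FROM t1 WHERE id = ?", (row_id,))
    else:
        keeper_id = row_id
        prev = (attr1, attr2)
conn.commit()

print(sorted(r[0] for r in conn.execute("SELECT t1_id FROM t2")))  # [1, 1, 3]
```

Note that the reference to the duplicate row id 2 now points at the surviving row id 1, so no row in t2 is left dangling.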


7 responses to “Duplicates removal”

I rewrote the stored proc for SQL Server
Why would my procedure work slower than the procedure in the blog? At the interview I was asked to delete the duplicate records inside the cursor using one delete statement, not one by one; maybe this is an incorrect approach.
CREATE TABLE [dbo].[tb0](
[id] [int] IDENTITY(1,1) NOT NULL,
[trId] [int] NOT NULL,
[datafield1] [varchar](50) NULL,
[datafield2] [varchar](50) NULL,
CONSTRAINT [PK_tb0] PRIMARY KEY CLUSTERED
(
[id] ASC
)
) ON [PRIMARY]

WHILE @@FETCH_STATUS = 0
BEGIN
    IF ( @v_trId IS NOT NULL AND @v_trId = @trId AND
         ( @field1 IS NULL AND @v_field1 IS NULL OR @v_field1 = @field1 ) AND
         ( @field2 IS NULL AND @v_field2 IS NULL OR @v_field2 = @field2 )
       )
        -- this is a duplicate of the previous record, delete it by primary key
        DELETE FROM tb0 WHERE id = @id
    ELSE
    BEGIN
        SET @v_trId = @trId
        SET @v_field1 = @field1
        SET @v_field2 = @field2
    END
    -- (the cursor declaration and the FETCH NEXT were omitted in the original comment)
END

So, the question was: why is my solution faster? Well, there are several reasons. First, let me step back. In many cases the “one SQL statement” solution actually WILL BE faster, exactly because it’s ONE SQL statement.

However, when the table is big, we do not have Enova-size memory, and we need to keep two copies of the table for the whole span of the transaction, this MAY become a problem. The cursor solution allows intermediate commits, thus reducing the potential problem of locking big data volumes and running out of temp space.

But back to the two procedures. In general, the fewer DB operations, the faster. Deleting and later inserting is bad for many reasons. Even if we are not talking about Postgres, an INSERT that follows a DELETE never reuses the same space, so the table size will increase, the data blocks will become sparse, and it will take longer to read the table.

Second, there is an embedded cursor (when you do one more select inside your fetch loop, you open an implicit cursor): one extra read for each field combination.

Third, the delete in your case does not use the PK index, so there is another scan before the delete is executed. In my solution the table is scanned only once.

1. Of course delete by primary key is faster, but I don’t know which is faster: deleting a huge set of records (let’s say 10,000) one by one using the primary key, or deleting them in one statement using a field which is not a key (and, it seems, not an indexed field; I think the answer can be different for indexed and non-indexed fields).
2. The embedded cursor creates an extra read, but the cursor for the first read executes a smaller number of times because it has GROUP BY in it. Is that correct or not?
I am not trying to argue, just trying to get a better understanding.

My process executes 1 select for the cursor and m-n deletes, where m is the total number of rows and n is the number of distinct combinations. The deletes do not really require extra selects, because the record is the “current of” the cursor (if a particular 4GL allows it, you can just DELETE ... WHERE CURRENT OF the cursor).

So the total number of operations is m-n+1, and the deletes are fast.

Now let’s count your number of operations.

Your cursor is 1 select.
The number of rows it returns is n. For each of those n rows you execute one more select and at least one more delete, so you execute 2n+1 operations if there is only one duplicate of each record, i.e. if m=2n. In that case my count is n+1, and your count is 2n+1.

Now let’s say there are more than 2 duplicates. For each occurrence where there is more than one dup, you have an insert in addition to the delete, so the count will be 3n+1.

Which means my procedure will execute more operations than yours only if

m-n+1 > 3n+1
m > 4n

That is, only when each combination occurs more than 4 times on average.

These estimates do not take into account the fact that “my” deletes are fast, since they are executed on the “current” record, while “your” SELECT/DELETE/INSERT actually needs to be executed. Does it make sense?
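The break-even arithmetic above can be checked numerically (a toy comparison; the function names are mine, and the counts m-n+1 and 3n+1 come from the derivation above):

```python
# m = total rows, n = distinct (duplicate-key) combinations.

def cursor_ops(m, n):
    """Operation count of the blog's single-pass cursor: 1 select + (m-n) deletes."""
    return m - n + 1

def grouped_ops(m, n):
    """Operation count of the grouped-cursor procedure: 1 select + n selects + n deletes + n inserts."""
    return 3 * n + 1

# At m = 4n the two counts meet; beyond that the single pass does more operations.
for dup_factor in (2, 4, 6):
    n = 1000
    m = dup_factor * n
    print(dup_factor, cursor_ops(m, n), grouped_ops(m, n))
```

For n = 1000 this prints 1001 vs 3001 at two copies per combination, a tie of 3001 at four copies, and 5001 vs 3001 at six, matching the m > 4n break-even point.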

Thank you, this is the explanation I was looking for!
I have one more question. What if I am the owner of that database? I have two ways to fight the duplicates. Let’s assume that we call duplicates the records with the same trId (one int field, and it is not a primary key).
I can remove the duplicates once, and then make this field a unique index. It will prevent any further duplicates, but it will slow down inserts (is that correct?). Another solution: I can run the procedure to remove duplicates once a day or more often. When should I choose which solution?
I think it is better to have the unique constraint, but I do not know why.

You are correct, it’s better to have a unique constraint. The reason is that this is the correct state of the data; we want it to ALWAYS be correct, not just once a day. The overhead on inserts is minimal, close to invisible, and keeping the data consistent is important.
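The recommendation can be seen in action: with a unique index in place the database rejects the duplicate at insert time, so the data never enters the incorrect state (SQLite is used for illustration; tb0 and trId follow the commenter's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb0 (id INTEGER PRIMARY KEY, trId INTEGER NOT NULL)")
conn.execute("CREATE UNIQUE INDEX ux_tb0_trid ON tb0(trId)")  # enforce at all times

conn.execute("INSERT INTO tb0 (trId) VALUES (42)")
try:
    conn.execute("INSERT INTO tb0 (trId) VALUES (42)")  # duplicate trId
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the constraint keeps the data correct continuously
```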