Friday, December 14, 2012

How to find duplicate records in table is a popular SQL interview
question which has been asked as many times as difference
between truncate and delete in SQL or finding second highest salary of
employee. Both of these SQL queries are must know for any one who is appearing
on any programming
interview where some questions on database and SQL are expected. In order
to find duplicate records in database
table you need to confirm definition of duplicates, for example in below contact table
which is suppose to store name and phone number of contact,
a record is considered to be duplicate if both name and phone number is same
but unique if either of them varies. Problem of duplicates in database arise
when you don't have a primary key or unique key on database and that's why its recommended to have a key
column in table. Anyway its easy to find duplicate records in table by using group
by clause of ANSI SQL. Group by clause is used to group data based upon any
column or a number of columns. Here in order to locate duplicate records we
need to use group by clause on both name and phone as shown
in second SQL SELECT query example. You can see in first query that it listed Ruby as duplicate
record even though both Ruby have different phone number because we only
performed group by on name. Once you have grouped data you can filter out
duplicates by using having clause. Having clause is counter part of where
clause for aggregation queries. Just remember to provide temporary name to count() data in
order to use them in having clause.

SQL Query to find duplicate records in a
table in MySQL

In this section we will see SQL query which can be used to locate
duplicate records in table. As explained in previous section, definition of
duplicate depends upon business rules which must be used in group by clause. In
following query we have used SELECT
query to select all records from Contacts table. Here
James, Johnny, Harry and Ron are duplicated four times.

mysql> select * from Contacts;

+-------+----------+

| name|
phone|

+-------+----------+

| James | 80983243 |

| Johnny | 67543212 |

| Harry | 12341234 |

| Ron| 44446666 |

| James | 80983243 |

| Johnny | 67543212 |

| Harry | 12341234 |

| Ron| 44446666 |

| James | 80983243 |

| Johnny | 67543212 |

| Harry | 12341234 |

| Ron|
44446666 |

| James | 80983243 |

| Johnny | 67543212 |

| Harry | 12341234 |

| Ron|
44446666 |

| Ruby|8965342 |

| Ruby|6888342 |

+-------+----------+

18 rows in set (0.00 sec)

Following SELECT query will only find duplicates records based on name
which might not be correct if two contact of same but different numbers are
stored, as in following result set Ruby is shown as duplicate which is
incorrect.

mysql> select name, count(name) from contacts
group by name;

+-------+-------------+

| name|
count(name) |

+-------+-------------+

| Harry |4 |

| James |4 |

| Johnny |4 |

| Ron|4 |

| Ruby|2 |

+-------+-------------+

5 rows in set (0.00 sec)

This is the correct way of finding duplicate contacts at it look both
name and phone number and only print duplicate if both name and phone is
same.

mysql> select name, count(name) from contacts
group by name, phone;

+-------+-------------+

| name|
count(name) |

+-------+-------------+

| Harry |4 |

| James |4 |

| Johnny |4 |

| Ron|4 |

| Ruby|1 |

| Ruby|1 |

+-------+-------------+

6 rows in set (0.00 sec)

having
clause in SQL query will filter duplicate records from non duplicate
records. As in following query it print all duplicate records and how many
times they are duplicated in table.

mysql> select name, count(name) as times from
contacts group by name, phone having times>1;

+-------+-------+

| name|
times |

+-------+-------+

| Harry |4 |

| James |4 |

| Johnny |4 |

| Ron|4 |

+-------+-------+

4 rows in set (0.00 sec)

That's all on how to find duplicate records in table, These SQL queries
will work on all database like MySQL,
Oracle,
SQL Server and Sybase as it only uses ANSI SQL and doesn't use any database specific
feature. Another interesting SQL query interview question is "How to
delete duplicate records from table" which we will see in another
post.

There are plenty of other, more useful duplicate searches one can do for this dataset:

The example: Duplicate by NameGiven example: SELECT NAME FROM CONTACTS GROUP BY NAME HAVING COUNT(1) > 1Better option SELECT * FROM CONTACTS WHERE NAME IN (SELECT NAME FROM CONTACTS GROUP BY NAME HAVING COUNT(1) > 1)

Any Row Duplicate: SELECT C.NAME, C.PHONE, COUNT(1) AS ROW_DUP FROM CONTACTS GROUP BY C.NAME, C.PHONE HAVING COUNT(1) > 1

If you just want unique data, there's two ways to write it: 1: SELECT DISTINCT C.NAME, C.PHONE FROM CONTACTS 2: SELECT C.NAME, C.PHONE FROM CONTACTS GROUP BY C.NAME, C.PHONEOption two is easier on the server, and better for large queries, although people usually opt for the first option due to the simplicity of writing the code.

I really do feel like the post should be more complete (and documented) about the types and details of hunting duplicates.

I was on the fence about if I wanted to start my own blog about SQL, but this post has confirmed my ambitions to make more informed users.