Navicat Blog

How to Identify Duplicates with Non-unique Keys (Part 2)

December 27, 2017 by Robert Gravelle

The majority of duplicate records fall into one of two categories: Duplicate Meaning and Non-unique Keys. The How to Spot and Delete Values with Duplicate Meaning in MySQL blog dealt with Duplicate Meaning; in today's follow-up, we'll address how to identify Non-unique Keys. That's where two records in the same table have the same key, but may or may not have different values and meanings.

How Does this Happen?

Even a well-designed database can accumulate non-unique key duplicates. It often happens as a result of data that is imported from external sources such as text, csv, or excel files as well as data feeds. Even merging data from two different databases might create duplicate keys if you are combining each in some way to generate a new key – assuming of course that the new key column supports non-unique values. For example, concatenating two numbers to generate a new key could prove problematic:

Key 1

Key 2

New Key

--------------------------

10

25

1025

102

5

1025 !!!

An Example Table

In databases that support complex systems, it isn't always feasible to prevent duplicate keys from occurring. What's important is being able to deal with them quickly and effectively before they taint your data.

Here's the product of amalgamating two data sources of actors. You'll notice that there a couple of duplicated names, specifically “JENNIFER DAVIS” and “NICK WAHLBERG”:

id

first_name

last_name

--------------------------------------

10

PENELOPE

GUINESS

12

NICK

WAHLBERG

14

ED

CHASE

22

JENNIFER

DAVIS

23

JOHNNY

LOLLOBRIGIDA

27

BETTE

NICHOLSON

34

GRACE

MOSTEL

41

NICK

WAHLBERG

39

JOE

SWANK

23

CHRISTIAN

GABLE

22

JENNIFER

DAVIS

Nick Walberg would be an instance of Duplicate Meaning, which we explored in the last blog. JENNIFER DAVIS, on the other hand, appears in two records with the same key of 22. There is also a duplicated key that is associated with two unrelated actors: #23 for “JOHNNY LOLLOBRIGIDA” and “CHRISTIAN GABLE”. With regards to the duplicated keys of 22 and 23, the first is a true duplicate, whereas the second only needs a new key to be generated for one of the records.

Identifying and Counting Duplicates

The following query will identify all of the records of the above table that share a common id. I recommend using the MySQL group_concat() function to format duplicated rows together on one line:

If you ever wanted to find all duplicates - that is Duplicate Meaning and Non-unique Key duplicates - at the same time, you can combine the above query with one that checks for duplicated names using the UNION operator:

Conclusion

Crafting a query to identify duplicate keys in MySQL is relatively simple because you only need to group on the key field and include the “Having COUNT(*) > 1” clause. In a future article, we'll review some different approaches for deleting duplicate rows and updating keys.