Sandvox – PHP & MySQL

When a table has no primary or unique key defined, it is easy to end up with duplicate records in it. Getting the data model right is important, but things do get overlooked. Once it has happened, we need a way to remove the duplicate entries.

First, to identify a duplicate entry we need to decide which fields determine uniqueness. For instance, in a table invoices we could say that the field invoice_number should be unique, and check on just this field. In a lab, we could say that the fields batch_date and batch_number together (where the batch number is reset daily) make a record unique.

To find the values that have duplicate entries, we would issue one of the following commands:

SELECT invoice_number FROM invoices GROUP BY invoice_number HAVING COUNT(*) > 1;
or
SELECT batch_date, batch_number FROM batches GROUP BY batch_date, batch_number HAVING COUNT(*) > 1;

What this says is: let's focus on the values of the unique fields that occur more than once. It is important not to test for a count equal to 2, as a value could appear more than twice.
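
It can help to see how many copies each value has before deciding what to remove. A minimal sketch, reusing the invoices table from the example (the alias copies is only illustrative):

```sql
-- List each duplicated invoice_number together with the number of rows sharing it.
SELECT invoice_number, COUNT(*) AS copies
FROM invoices
GROUP BY invoice_number
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

A value stored three times appears here with copies = 3, which a test for COUNT(*) = 2 would have missed.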

Now that we can easily identify the duplicate records, we need to determine which ones should be removed. If we simply used the unique fields in a DELETE statement, we would remove every copy, including the one we want to keep.

-- Incorrect - this removes every row in each set of duplicates, leaving no copy behind.
DELETE i FROM invoices i INNER JOIN (SELECT invoice_number FROM invoices GROUP BY invoice_number HAVING COUNT(*) > 1) x ON i.invoice_number = x.invoice_number;
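
Before deleting anything, it may be worth comparing how many rows a statement like the one above would remove with how many rows are actually surplus. The following is a rough sketch against the same invoices table; the column aliases are made up for illustration:

```sql
-- rows_in_duplicate_sets: rows the incorrect DELETE would remove.
-- surplus_rows: rows that should go if one copy per invoice_number is kept.
SELECT SUM(copies)     AS rows_in_duplicate_sets,
       SUM(copies - 1) AS surplus_rows
FROM (
    SELECT COUNT(*) AS copies
    FROM invoices
    GROUP BY invoice_number
    HAVING COUNT(*) > 1
) x;
```

The difference between the two figures is the number of invoice numbers that would disappear from the table entirely.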

The removal process is handled in different ways depending on the database server. We will discuss each one:

MySQL

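Older versions of MySQL have a built-in feature, ALTER IGNORE TABLE tablename ADD UNIQUE INDEX (fields), which allows a unique index to be added even when the table contains duplicate entries.

With the IGNORE keyword, the index is created and the duplicates are removed, keeping only one row for each unique value. Note that the IGNORE clause of ALTER TABLE was deprecated in MySQL 5.6.17 and removed in MySQL 5.7.4, so this only works on older servers.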
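
On servers where ALTER IGNORE TABLE is not available, a similar result can usually be reached by copying the rows into a new table that already has the unique index. This is a sketch under assumptions, not a drop-in procedure: the names invoices_dedup, invoices_old and uq_invoice_number are hypothetical, and which copy of a duplicate survives depends on the order in which the rows are read.

```sql
-- Sketch only: invoices_dedup, invoices_old and uq_invoice_number are hypothetical names.

-- 1. Create an empty copy of the table with the same columns and indexes.
CREATE TABLE invoices_dedup LIKE invoices;

-- 2. Enforce uniqueness on the copy before any rows are loaded.
ALTER TABLE invoices_dedup ADD UNIQUE INDEX uq_invoice_number (invoice_number);

-- 3. Copy the rows; IGNORE skips any row that would violate the unique index.
INSERT IGNORE INTO invoices_dedup SELECT * FROM invoices;

-- 4. Swap the tables, keeping the original as a backup until the result is checked.
RENAME TABLE invoices TO invoices_old, invoices_dedup TO invoices;
```

CREATE TABLE ... LIKE does not copy foreign key definitions, and INSERT IGNORE also turns some other errors into warnings, so the row counts of both tables should be compared before invoices_old is dropped.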