Many systems that store data in a structured manner (although this is marked language agnostic, the origin of this question is an MS SQL database) struggle with duplicates.
Taking a physical person/user of a system as an example, I'm trying to figure out how to develop a strategy to reliably identify a new data entry as a duplicate of an existing entity. Years ago, most people owned and operated only one Email account, making the Email address an almost perfect unique key. This is no longer the case, and the same goes for phone numbers and the like. Names aren't any good either: Robert can easily appear as Bob in one record and as Robert Frank Jr. in another.

When a human eye looks at two sets of data, it can almost always identify a duplicate - but how do we do that?

Almost every developer runs into this problem, yet there appears to be very limited literature on how to approach it.
I have consulted other questions here (e.g. Deduplication of complex records / Similarity Detection), on SO and through search engines, but they mostly address a particular implementation, not a strategy for deriving a score from comparing data.

Assuming we have unlimited tools at our disposal - phonetic algorithms, similarity and distance measures, and so on - how would one go about scoring two sets of data against each other to arrive at a single result which, if it exceeds a particular threshold, may then be considered a duplicate?

Let's assume we have a first and last name for a person, up to three phone numbers in unspecified order, an Email address, and a postal address made up of 2 lines, a postcode and city.

This is a very limited set of data to look at; please do not say "well, just compare the last name and figure out whether the phone numbers are similar" - this question is about a more generic approach for deriving a total score over a number of comparisons.

A human would look at Persons 1 and 2 and would almost always identify them as duplicates, but would not identify Person 3 as such, even though some properties are just as "similar".

A person with the same last name and postcode may well be a duplicate, but could just as easily be a parent and child, for example.
Identical phone numbers are an indicator, but not necessarily a deciding one.
If everything matches but the address, our previously registered contact may have moved recently - so again these properties can only be an indicator, not a decisive yes or no.

The end result of this algorithm should really be a level of confidence between low (highly unlikely that the two records represent a duplicate entity), medium (possibly a duplicate, worth a closer look) and high (there is a good chance that these are duplicates).
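To illustrate the shape of the result I'm after (not an implementation), a minimal sketch, assuming some already-computed total score in the 0-1 range; the threshold values are placeholders:

```python
# Purely illustrative: how a combined score might be bucketed into the
# confidence levels described above. Thresholds are placeholders.
def confidence(total_score: float) -> str:
    if total_score >= 0.8:
        return "high"    # good chance these records are duplicates
    if total_score >= 0.5:
        return "medium"  # possibly a duplicate, needs a closer look
    return "low"         # highly unlikely to be a duplicate
```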

If you have read this far, you will probably consider down-voting the question because it is unlikely that there can be any answer other than "it depends". Please consider instead leaving a comment to help me improve this question into one that can be answered.

1 Answer

It is unlikely that you will be able to build a reliable automated system, but you can apply statistical techniques (“machine learning”) to detect possible duplicates and report them for human review.

For example, you could write a collection of various heuristics, each of which outputs a similarity score indicating how likely it is that two entries are equivalent. You might have one heuristic that compares the names of two entries, where that heuristic knows about different ways to spell, abbreviate, and pronounce names. You might have another heuristic to compare email addresses or postal addresses.
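A minimal sketch of what such per-field heuristics could look like (the helper names are illustrative, and Python's difflib.SequenceMatcher stands in for whichever phonetic or edit-distance measure you prefer):

```python
from difflib import SequenceMatcher
import re

def name_similarity(a: str, b: str) -> float:
    # Normalize case and whitespace before comparing; a real heuristic
    # would also expand nicknames ("Bob" -> "Robert") and strip suffixes.
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio()

def phone_similarity(a: str, b: str) -> float:
    # Compare digits only, so "+49 30 1234" can match "030-1234".
    da, db = re.sub(r"\D", "", a), re.sub(r"\D", "", b)
    if not da or not db:
        return 0.0
    if da.endswith(db) or db.endswith(da):
        return 1.0
    return SequenceMatcher(None, da, db).ratio()

def email_similarity(a: str, b: str) -> float:
    # An exact match on the normalized address is a strong signal;
    # otherwise fall back to comparing the local parts.
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return 1.0
    return SequenceMatcher(None, a.split("@")[0], b.split("@")[0]).ratio()
```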

On top of these heuristics, you train a statistical model that weights them appropriately and produces a binary duplicate/non-duplicate classification, ideally as a probability score. The heuristics are the input features/variables for the statistical technique. Perhaps a Naive Bayes Classifier would be appropriate, but more complex models might be better able to make use of feature dependencies and interactions. Typically, Support Vector Machines or Neural Networks are used for multi-feature classification tasks with many data points. Combining different predictors into one model like this is itself an ensemble technique.
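A sketch of that combination step, assuming scikit-learn, the heuristic functions from the previous snippet, and a handful of human-labelled record pairs (the tiny training set below is only there to make the example run):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def features(rec_a: dict, rec_b: dict) -> list:
    # One feature per heuristic; extend with postcode, address lines, etc.
    return [
        name_similarity(rec_a["name"], rec_b["name"]),
        phone_similarity(rec_a["phone"], rec_b["phone"]),
        email_similarity(rec_a["email"], rec_b["email"]),
    ]

# Illustrative training set; in practice these labels come from human review.
labelled_pairs = [
    ({"name": "Robert Frank", "phone": "030 1234", "email": "bob@example.com"},
     {"name": "Bob Frank", "phone": "+49 30 1234", "email": "bob@example.com"}, 1),
    ({"name": "Robert Frank", "phone": "030 1234", "email": "bob@example.com"},
     {"name": "Anna Meier", "phone": "040 9876", "email": "anna@example.org"}, 0),
]

X = np.array([features(a, b) for a, b, _ in labelled_pairs])
y = np.array([label for _, _, label in labelled_pairs])
model = GaussianNB().fit(X, y)

def duplicate_probability(rec_a: dict, rec_b: dict) -> float:
    # predict_proba returns [P(non-duplicate), P(duplicate)] per pair.
    return model.predict_proba([features(rec_a, rec_b)])[0][1]
```

A logistic regression or gradient-boosted trees would slot in the same way; the point is that the classifier, not hand-tuned weights, decides how much each heuristic matters.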

It is important that the human decisions are fed back into the system so that duplicate detection accuracy improves over time (supervised learning). If the individual heuristics are also statistics-based, you might want to feed back decisions to them as well.
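One way to wire in that feedback, reusing the features function and model from above (the storage and retraining cadence here are illustrative assumptions):

```python
# Accumulate human verdicts and periodically refit the classifier on them.
confirmed = []  # (feature_vector, label) pairs confirmed by reviewers

def record_review(rec_a: dict, rec_b: dict, is_duplicate: bool) -> None:
    confirmed.append((features(rec_a, rec_b), 1 if is_duplicate else 0))

def retrain(model):
    # Refit once both classes are represented in the accumulated decisions.
    labels = {lbl for _, lbl in confirmed}
    if len(labels) == 2:
        X = np.array([f for f, _ in confirmed])
        y = np.array([lbl for _, lbl in confirmed])
        model.fit(X, y)
    return model
```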

Note that your problem has similarities to spam filters: it too will have a set of heuristics that must be combined into a single decision.

It is probably not feasible to test all entries against each other for similarity – that would be an O(n²) operation. You might instead want to look at your heuristics and find a way to cluster entries into groups within which the heuristics are likely to match. For example, you might cluster entries by normalized last name and run your detection system over those groups. Then you cluster by normalized postcode and run detection for each postcode. This indexing can speed up detection by many orders of magnitude.
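A sketch of that clustering ("blocking") idea, assuming each record is a dict with id, last_name and postcode fields; the choice of blocking keys is illustrative:

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, key_fn):
    # Group records under a cheap normalized key.
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    return groups

def candidate_pairs(records):
    # Take the union of within-group pairs across several blocking keys:
    # far fewer comparisons than the full O(n^2) all-pairs scan, and a pair
    # only needs to agree on one key to be considered at all.
    seen = set()
    blocking_keys = (
        lambda r: r["last_name"].strip().lower(),
        lambda r: r["postcode"].replace(" ", ""),
    )
    for key_fn in blocking_keys:
        for group in block_by(records, key_fn).values():
            for a, b in combinations(group, 2):
                pair = tuple(sorted((a["id"], b["id"])))
                if pair not in seen:
                    seen.add(pair)
                    yield a, b
```

Each candidate pair then goes through the scoring model above, so the expensive comparison only runs on plausible matches.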