Enterprise Information management, data, data quality

Beware the Ides of Match

If you have heard of the Ides of March, you know you’re supposed to beware them. Why? In ancient Rome, the Ides of March were equivalent to our March 15th. In William Shakespeare’s play, Julius Caesar, a soothsayer warns Caesar to “Beware the Ides of March”, the day on which he would be assassinated by a group of senators including his friend, Brutus.

15 March is just around the corner – what better time to call attention to poorly defined, ineffective or inaccurate match strategies?

Matching (or linking) is a key capability of data quality and master data management. It is the capability to identify duplicate records within a group of, for example, customers, suppliers, products or materials.

Yet, some of the most common match approaches are fraught with problems.

The absolute match. This approach relies on data being identical.

Absolute matches struggle in the real world because real data is frequently dissimilar. Data may be misspelled, incomplete or inconsistent.
Less obviously, data may be identical without referring to the same thing – for example, two empty telephone numbers do not indicate the same contact.
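
Both failure modes can be illustrated with a minimal sketch. The contact records, field names and the `exact_matches` helper below are hypothetical, invented purely for illustration; the comparison is a plain equality test, which is what an absolute match amounts to.

```python
# Hypothetical contact records; names, fields and values are illustrative only.
contacts = [
    {"id": 1, "name": "Jon Smith", "phone": "555-0101"},
    {"id": 2, "name": "John Smith", "phone": "555-0101"},
    {"id": 3, "name": "Mary Jones", "phone": ""},
    {"id": 4, "name": "Peter Brown", "phone": ""},
]


def exact_matches(records, field):
    """Return pairs of record ids whose values for `field` are identical."""
    pairs = []
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            if left[field] == right[field]:
                pairs.append((left["id"], right["id"]))
    return pairs


# A one-letter spelling difference is enough to hide a likely duplicate.
name_pairs = exact_matches(contacts, "name")

# Two empty phone numbers are "identical", so unrelated contacts get linked.
phone_pairs = exact_matches(contacts, "phone")

print(name_pairs)   # []
print(phone_pairs)  # [(1, 2), (3, 4)]
```

Matching on name misses the Jon/John pair entirely, while matching on phone links Mary Jones to Peter Brown only because both values are blank.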

This is why SQL equality comparisons alone can never really solve a match problem.

Overly simplistic match algorithms, such as SOUNDEX

A number of expensive data quality tools rely on overly simplistic match algorithms, such as SOUNDEX.

These algorithms cast too broad a net – linking values that are not in fact similar.

Let us assume, for example, that we have three contacts – GARY, GERRY and GERALD. Using SOUNDEX, the first two contacts would match (both encode to G600) while GERALD encodes to G643, even though it is in fact the last two that should have matched.
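
The codes in this example can be reproduced with a short sketch of the standard American Soundex rules. This is a simplified illustration, not the implementation used by any particular data quality tool; the `soundex` function and variable names are our own.

```python
def soundex(name):
    """Simplified American Soundex: first letter followed by three digits."""
    # Map each coded consonant to its Soundex digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    # Keep letters only, in upper case.
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""

    result = name[0]
    previous = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        # Emit a digit unless it repeats the previous letter's code.
        if digit and digit != previous:
            result += digit
        # H and W do not separate repeated codes; vowels and Y do.
        if ch not in "HW":
            previous = digit

    # Pad with zeros and truncate to four characters.
    return (result + "000")[:4]


encoded = {name: soundex(name) for name in ("GARY", "GERRY", "GERALD")}
print(encoded)  # {'GARY': 'G600', 'GERRY': 'G600', 'GERALD': 'G643'}
```

Because vowels are discarded and the code is capped at four characters, GARY and GERRY collapse to the same value, while the extra consonants in GERALD push it to a different one.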

Overly simple match approaches miss matches that should happen, while creating false positive matches – linking records that should not be linked.