July 19, 2017

Why CRM De-Duplication Doesn’t Deliver

In all my years looking at data, I’ve never found a CRM database without duplicates (assuming a database of more than a few thousand records).

Many CRM systems have easy ways to identify duplicates, but duplicates still remain.

So why is that?

The main reason is that duplicates are far more complex than simply matching names, emails, etc.

Yes, you will match some obvious duplicates this way, but there are many more complex versions. CRM applications assume the data is correct in the first place, that the data is present and that the data is in the correct field. Consider this example of two duplicates:

In this example, you can’t match on the first name as the second record has an alias (many use Bill as a short name for William), the company names are slightly different and only one email exists. (To match on the first name, you would need a list of first name aliases.) With the human eye, we can see these are duplicate records, but they are not so easy to detect automatically.
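As a rough illustration of what that takes, here is a minimal Python sketch of alias matching on first names and fuzzy matching on company names. The alias table, the list of company suffixes and the function names are all assumptions made for this example. A real alias list would be far longer, and the standard library's `difflib` stands in here for whatever fuzzy matcher you prefer.

```python
from difflib import SequenceMatcher

# Hypothetical alias table; a real one would hold many more entries.
FIRST_NAME_ALIASES = {
    "bill": "william",
    "will": "william",
    "bob": "robert",
    "rob": "robert",
}

# Legal suffixes to ignore when comparing company names (illustrative list).
COMPANY_SUFFIXES = {"ltd", "limited", "inc", "plc", "llc"}


def canonical_first_name(name: str) -> str:
    """Map a first name to its canonical form using the alias table."""
    key = name.strip().lower()
    return FIRST_NAME_ALIASES.get(key, key)


def normalise_company(name: str) -> str:
    """Lower-case, replace punctuation with spaces and drop legal suffixes."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    words = [word for word in cleaned.split() if word not in COMPANY_SUFFIXES]
    return " ".join(words)


def company_similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 after normalisation."""
    return SequenceMatcher(None, normalise_company(a), normalise_company(b)).ratio()
```

With this in place, `canonical_first_name("Bill")` and `canonical_first_name("William")` both resolve to the same value, and two hypothetical spellings such as "Acme Ltd." and "ACME Limited" normalise to the same company string.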

Let’s look at a very similar example for another employee of Acme Ltd.:

From a purely data perspective, we have a very similar scenario to the one above. We have the same surname, similar company names, the same job title and only one email. Here, though, there is no link between the first names. One is not an alias of the other; in fact, they are of different genders.

In each example the two records share a job title, but there is a subtle difference between the examples. In an organisation you are likely to have multiple Project Managers, but only one COO. This type of subtle difference is often the key to verifying duplicates. The uniqueness of the COO title adds more weight to the conclusion that the two records sharing it are duplicates.

Techniques such as alias matching, fuzzy matching and probability matching help resolve the above duplicates.
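To make the idea of weighing evidence concrete, here is a small Python sketch of a weighted match score. It is a simple additive scheme rather than a full probabilistic model, and it assumes first names and company names have already been normalised. The weights, the `TYPICAL_HOLDERS` table and the sample records used afterwards are invented for illustration and would need tuning against your own data.

```python
# Illustrative weights only; in practice they would be tuned on your own data.
FIELD_WEIGHTS = {"surname": 2.0, "first_name": 3.0, "company": 2.0}

# Hypothetical guess at how many people typically hold each title in one company.
TYPICAL_HOLDERS = {"coo": 1, "project manager": 10}


def title_weight(title: str) -> float:
    """Rarer titles earn more weight: 3.0 for a unique title, less for common ones."""
    holders = TYPICAL_HOLDERS.get(title.strip().lower(), 5)
    return 3.0 / holders


def match_score(a: dict, b: dict) -> float:
    """Add weight for agreeing fields; a first-name disagreement subtracts weight."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        value_a, value_b = a.get(field), b.get(field)
        if not value_a or not value_b:
            continue  # missing data is treated as neither for nor against
        if value_a.lower() == value_b.lower():
            score += weight
        elif field == "first_name":
            score -= weight
    if a.get("title") and a["title"].lower() == (b.get("title") or "").lower():
        score += title_weight(a["title"])
    return score
```

Two records for a COO with matching first name, surname and company score 10.0 under these weights. Two Project Managers with the same surname and company but different first names score well under 2.0, so the pair would be kept apart. Scoring pairs this way supports both decisions: confirming a duplicate and confirming a non-duplicate.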

It’s just as important to confirm that two records are duplicates as it is to confirm that they are not.

So, you can see, finding duplicates is not a simple process and CRM applications don’t tend to be experts in de-duplication. It’s better left to the experts.

On so many occasions CRM managers expect their internal processes to find and remove duplicates, but, in reality, it’s never the case.

Duplicates are far more complicated than one would expect.

A CRM application won’t have all the clever techniques required to find duplicates.

In fact, for many databases you must create very specific de-duplication processes.

Why is that?

Well, it all depends on what data you are storing and how good the quality is.

The above examples show the impact of poor quality data on detecting duplicates: a missing email or a different company name can create uncertainty.

Let’s look at how the way you store data impacts how duplicates are found.

Let’s say your database contacts have multiple addresses. Over time, your contact has moved home and, as a result, you store their previous address (your CRM has the ability to do this).

Here we have two tables. The blue table is a list of contacts with a link via an address identifier to the green table of addresses.

You can see that if the CRM application matches on current address only, these two versions of ‘Blake Argile’ would not be detected as duplicates. However, if it also matched on the previous address, these two versions would be identified as duplicates.

So, as we have seen, the quality of your data and how you store it can affect how well duplicates are found.
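A short Python sketch shows the difference between the two matching rules. The table layout, field names and addresses below are hypothetical stand-ins for the contact and address tables described above, not the actual data.

```python
# Hypothetical address table: address identifier -> address text.
ADDRESSES = {
    1: "12 High Street, Leeds",
    2: "4 Mill Lane, York",
    3: "4 Mill Lane,  york",
}

# Hypothetical contact table linking to addresses by identifier.
CONTACTS = [
    {"id": "A", "name": "Blake Argile", "current_address_id": 1, "previous_address_ids": [2]},
    {"id": "B", "name": "Blake Argile", "current_address_id": 3, "previous_address_ids": []},
]


def normalise_address(text: str) -> str:
    """Lower-case, drop commas and collapse repeated spaces."""
    return " ".join(text.lower().replace(",", " ").split())


def known_addresses(contact: dict, include_previous: bool) -> set:
    """Collect the normalised addresses to compare for one contact."""
    ids = [contact["current_address_id"]]
    if include_previous:
        ids += contact["previous_address_ids"]
    return {normalise_address(ADDRESSES[address_id]) for address_id in ids}


def share_address(a: dict, b: dict, include_previous: bool) -> bool:
    """True when the two contacts have at least one address in common."""
    return bool(known_addresses(a, include_previous) & known_addresses(b, include_previous))
```

With this sample data, comparing the two contacts on current address only finds nothing in common, while including the address history finds the shared address and flags the pair for review.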

Let’s look at one more common problem most CRM databases must manage.

Here we have two first names in the first name field of one record. It’s clear that ‘Mary’ is the same person in both records, as she has the same address and surname.

So what do we do now?

Do we merge these two duplicates together and have ‘Rob & Mary’ as the single name?

Do we split the first record into two, one for ‘Rob’ and another for ‘Mary’, then merge the two for ‘Mary’?

Do we remove ‘Mary’ from the first record, as she has her own record?

If we split ‘Rob’ and ‘Mary’, how do we connect them, as it looks like they are a couple? Should we create a family link value?
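As one possible answer, and only a sketch, the Python below splits a joint first name into separate records and ties them together with a shared link value. The `household_id` field, the identifier format and the sample surname are assumptions for this example. Whether splitting is the right choice depends on how your CRM is used.

```python
import re

# Split on "&" or the standalone word "and"; names like "Sandy" are left alone.
JOINT_NAME_SEPARATOR = re.compile(r"\s*(?:&|\band\b)\s*", flags=re.IGNORECASE)


def split_joint_first_names(record: dict) -> list:
    """Return one record per person, linked by a shared household_id."""
    names = [name for name in JOINT_NAME_SEPARATOR.split(record["first_name"]) if name.strip()]
    if len(names) < 2:
        return [dict(record)]  # nothing to split; return an unchanged copy
    household_id = f"household-{record['id']}"  # hypothetical family link value
    return [
        {**record, "id": f"{record['id']}-{index}", "first_name": name.strip(), "household_id": household_id}
        for index, name in enumerate(names, start=1)
    ]
```

Applied to a record whose first name field holds ‘Rob & Mary’, this produces one record for ‘Rob’ and one for ‘Mary’ with the same `household_id`. The new ‘Mary’ record can then go through the normal duplicate check against her existing record.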

I hope the few examples above shed light on why finding duplicates is complex, and why it is not surprising that CRM applications won’t find them easily.

There are many more complex scenarios than the ones presented here.

If you would like some help in this area, do contact us. We love solving complex de-duplication problems, matching problems and identity resolution problems.
