Enterprise Information Management, data, data quality


What is MDM – Data Quality (Matching)

The most common issue hampering successful MDM initiatives is a lack of data quality. Many organisations spend tens of millions on fancy new MDM applications only to discover that, once they have taken on the data, they simply have another data source containing duplicated data that is not fit for purpose. They might as well have saved the millions!

Of course, data quality is about much more than matching. But let us accept that, for MDM, the principal goal is to create a single, unique record for each object in our master – be it a Material, Contract, Rental Unit, Client or Supplier. So our objective, from a data quality perspective, must be the same.

When I talk about matching, I am talking about creating a key (or keys) to link all related records. Duplicate records may still exist, but they will be identified as duplicates and will share common keys with the master record.

Four common mistakes are forced by the simple matching strategies imposed by many commercial MDM applications.

Reliance on existing “unique” values. It is tempting to assume that records have been correctly labelled, particularly when it comes to so-called unique identifiers or foreign keys. Invoice number 505 will always be the same invoice, right? Well, what if the two invoice 505s have different issue dates? What if they are issued by different suppliers? What if they are identical except for the invoiced amount? What if the invoice number is not captured at all? Master data frequently comes from different systems or sources, and will inherit and exacerbate any data quality issues that are not addressed during data take-on. In our experience, we have picked up so many duplicated or invalid values in so-called unique identifiers that they are an unreliable indicator of uniqueness. Similarly, manual errors made when mapping foreign keys will cause huge inconsistencies if they are assumed accurate for the purposes of creating the master. The bottom line: any match process that relies on a single attribute is unreliable! Ideally, at least three attributes should be compared.
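The single-attribute trap can be sketched in a few lines of Python. The invoice records and field names below are hypothetical, purely to illustrate why a composite of three attributes is a safer match key than one "unique" number:

```python
# Two genuinely different invoices that happen to share the "unique" number 505.
# (Illustrative records -- the fields are assumptions, not a real schema.)
invoices = [
    {"number": "505", "supplier": "ACME", "issued": "2020-01-15", "amount": 100.0},
    {"number": "505", "supplier": "BetaCo", "issued": "2020-03-02", "amount": 250.0},
]

def naive_key(inv):
    # Relies on a single attribute -- wrongly treats both records as one invoice
    return inv["number"]

def composite_key(inv):
    # Compares at least three attributes before declaring a match
    return (inv["number"], inv["supplier"], inv["issued"])

assert naive_key(invoices[0]) == naive_key(invoices[1])          # false positive
assert composite_key(invoices[0]) != composite_key(invoices[1])  # correctly distinct
```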

The use of absolute matching (e.g. SQL WHERE x = y). Absolute matching assumes that two identical elements are the same and, conversely, that two dissimilar attributes are not the same. So invoice (1) 505 is the same as invoice (2) 505 but different from invoice (3) 000505. And Mr Smith cannot possibly be the same as Smith (Mr). Real data, particularly when captured in different systems, follows different standards and patterns. Of course, data standardisation carried out as part of a data quality initiative can (and should) be used to improve the quality of the data – but not all data can be standardised, and absolute matching will fail. This approach also fails to address the question: what if one (or both) of the attributes being compared is blank or null? Are two empty fields the same?
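A minimal sketch of these failure modes, with Python standing in for the SQL comparison. The leading-zero normalisation in `safer_match` is purely illustrative of the kind of rule absolute matching lacks:

```python
def absolute_match(a, b):
    # Absolute (exact) comparison, equivalent to SQL "WHERE x = y"
    return a == b

# Formatting differences defeat exact matching
assert not absolute_match("505", "000505")
assert not absolute_match("Mr Smith", "Smith (Mr)")

# Two blank fields compare as "equal" -- almost never evidence of a match
assert absolute_match(None, None)

def safer_match(a, b):
    # Treat null/blank as "unknown" (no match), and strip trivial
    # formatting differences such as zero-padding before comparing.
    if a in (None, "") or b in (None, ""):
        return False
    return str(a).lstrip("0") == str(b).lstrip("0")

assert safer_match("505", "000505")
assert not safer_match(None, None)
```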

The use of MAGIC… OK, no one uses magic for matching. But a lot of tools rely on very complex statistical matching algorithms, using weightings assigned to different attributes and match algorithms to generate a probability that two items are the same. Everything over a certain threshold (say 90%) is deemed to be a match. These solutions claim to be able to match data without applying any data cleansing – so data stored to different standards will be matched without any effort. So “Mr Smith” and “Smith Mr” may match – but so may “Smit Mr” and “Mrs Smith”. The problem is that items can reach the threshold for a variety of reasons. Changing the weighting assigned to different attributes can deliver radically different (or inconsistent) results. Even changing the order of the attributes can change the results. This approach is practically guaranteed to falsely match a number of records, and it is very difficult to understand why or to improve the results. Once you have merged “Smit Mr” with “Mrs Smith” it is very difficult to go back – particularly if the MDM application pushes out the new “master data” and overwrites “Mrs Smith” with “Smit Mr”.
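The weight-sensitivity problem can be demonstrated with a toy probabilistic scorer. This is a deliberately simplified sketch (real tools use far more elaborate statistics); the records, weights and the 0.8 threshold are all assumptions for illustration:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Crude string similarity in [0, 1], standing in for a match algorithm
    return SequenceMatcher(None, a, b).ratio()

def weighted_score(rec1, rec2, weights):
    # Probabilistic-style matching: weighted average of per-attribute similarities
    total = sum(weights.values())
    return sum(w * similarity(rec1[f], rec2[f]) for f, w in weights.items()) / total

a = {"name": "Smit Mr", "city": "Cape Town"}
b = {"name": "Mrs Smith", "city": "Cape Town"}

score_name_heavy = weighted_score(a, b, {"name": 3, "city": 1})
score_city_heavy = weighted_score(a, b, {"name": 1, "city": 3})

# The same pair of records falls below or above a match threshold (0.8 here)
# depending only on the weights chosen -- the data has not changed at all.
assert score_name_heavy < 0.8 < score_city_heavy
```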

The fourth mistake is the workaround most MDM applications provide for these errors. Because the results cannot be trusted, all matches are deemed suspect and are pushed through the MDM workflow module for manual validation and verification. Unfortunately, this places an unacceptable workload on the operations team responsible for validating them. Assuming ten minutes per match, it will take twenty-one working days (167 hours) to validate 1,000 matches. Monthly! What if you have millions of matches? The MDM solution implemented by one South African retail bank generates 30,000 exceptions (requiring manual intervention) every month. This option is untenable!
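The arithmetic behind those figures is easy to verify (assuming an eight-hour working day):

```python
# Manual-validation workload: 1,000 suspect matches at ten minutes each
minutes_per_match = 10
matches_per_month = 1000

hours = matches_per_month * minutes_per_match / 60
working_days = hours / 8  # assuming an 8-hour working day

assert round(hours) == 167
assert round(working_days) == 21
```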

So what is the answer? What is required is a human-friendly rules management system that allows the business to isolate particular match rules and test them for accuracy. If the system cannot provide this level of granularity (or relies on magic), then your match process cannot be relied upon to provide consistent, accurate results. Of course, no automated system can provide a 100% accurate match. I would suggest that it is better to leave one or two duplicates than to incorrectly match records that are not the same. The objective should be to get exceptions down to a manageable number, with only a small number being pushed to the workflow process, and with no risk of incorrectly matching a record and overwriting it with the wrong values.

A three-step process is required to ensure match accuracy.

Data should be assessed for accuracy and completeness. Ideally, we need at least three elements to allow us to match. If the ideal attributes are not populated, are inconsistent, or are captured to different standards, then we need to apply data cleansing principles.

Data cleansing principles must be applied to ensure the records are as consistent, complete and accurate as possible. I will discuss this element in more detail in my next post, but in principle: complex attributes need to be broken up into their unique elements; data stored to different standards should be adapted to the agreed master standard; and data should be enriched where possible to fill in missing information (or to remove garbage) critical to matching. Of course, analysis may show that acceptable matches can be derived using a smaller set of attributes.
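As a taste of what "breaking complex attributes into their unique elements" means in practice, here is a deliberately minimal name standardiser. The title list and parsing rules are illustrative assumptions, not a production cleansing routine:

```python
import re

TITLES = {"mr", "mrs", "ms", "dr"}  # illustrative, far from complete

def standardise_name(raw):
    """Break a free-text name into (title, surname), whatever the capture
    order, so that 'Mr Smith' and 'Smith (Mr)' normalise identically."""
    tokens = re.findall(r"[A-Za-z]+", raw)
    title = next((t.upper() for t in tokens if t.lower() in TITLES), None)
    surname = next((t.upper() for t in tokens if t.lower() not in TITLES), None)
    return (title, surname)

# Different capture standards collapse to one master standard...
assert standardise_name("Mr Smith") == standardise_name("Smith (Mr)") == ("MR", "SMITH")
# ...while genuinely different people stay distinct
assert standardise_name("Mr Smith") != standardise_name("Mrs Smith")
```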

A deterministic fuzzy match strategy should be applied. Deterministic matching applies specific, easily understood business rules to match data – the data either matches or it does not. For example, rule 1 may state that clients are the same if they have the same name, the same passport (or social security) number and the same address. Rule 2 may state that the names can be different if the records share the same passport number and the same address. Unlike probabilistic match rules, which rely on hard-to-understand statistical manipulations, each rule can be tested and, if necessary, improved. So you may decide that rule 2, in this example, is not a valid match and exclude it from your results… or push it as an exception to the workflow process. Or you could improve the rule by looking at how similar the names are and, say, match on the same surname (only) if the passport number and address are the same.
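The two example rules above can be written as a small, testable rule cascade. This is a sketch under the assumptions of the example (hypothetical client records; each rule is a plain predicate you can test, tighten or drop individually):

```python
def same_address(a, b):
    return a["address"] == b["address"] and a["address"] != ""

def rule_1(a, b):
    # Rule 1: same name, same passport number, same address
    return (a["name"] == b["name"]
            and a["passport"] == b["passport"]
            and same_address(a, b))

def rule_2(a, b):
    # Rule 2: names may differ, provided passport number and address agree
    return a["passport"] == b["passport"] and a["passport"] != "" and same_address(a, b)

RULES = [("rule_1", rule_1), ("rule_2", rule_2)]

def match(a, b):
    """Return the name of the first rule that fires, or None.
    Each match is traceable to an explicit, auditable rule."""
    for name, rule in RULES:
        if rule(a, b):
            return name
    return None

alice1 = {"name": "A Jones", "passport": "P123", "address": "1 Main Rd"}
alice2 = {"name": "Alice Jones", "passport": "P123", "address": "1 Main Rd"}
bob = {"name": "B Jones", "passport": "P999", "address": "1 Main Rd"}

assert match(alice1, alice2) == "rule_2"  # names differ, so rule 1 does not fire
assert match(alice1, bob) is None         # no rule fires: not a match
```

Because `match` reports which rule fired, a suspect rule (rule 2 here) can be routed to the workflow as an exception, or refined, without touching the rules that are already trusted.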

The bottom line is that this approach delivers fewer exceptions, as each condition can be tested and, once signed off, trusted to produce no false positives. It may require a little more work upfront – which can be reduced by using the off-the-shelf cleansing and matching rules provided by the data quality tool for your data set. In our experience, we have been able to deploy robust matching rules either off the shelf (for South African name and address data) or within a day for other data sets. This small amount of effort radically reduces the operational load, as very few exceptions requiring a manual decision are generated.

In my next post I will discuss how data cleansing affects matching and MDM.