Measuring Data Quality

By: Bryn Davies

The measurement of data quality has lately been a frequent topic in information management discussions. Some see it as a good way to highlight just how bad their data quality is, others merely as a way to prove that it’s not as bad as people are saying it is. As with any quality management programme, ongoing measurement is a crucial component, but it’s a lot easier said than done!

Grab the tape measure

Unfortunately, when assessing data quality, a lot of people tend to focus on the "data" part rather than on the "quality" part: it is relatively straightforward to count the number of defects in a given data set, and perhaps to label them according to typical dimensions such as "completeness", "timeliness" etc. On presenting the result that, for example, the product database’s package_type attribute exhibits 63% inconsistency, the response is frequently "So what?" Indeed, unless the impact of this state of affairs is well understood and articulated, ultimately in money terms, nobody is incentivised to do anything about it at all! Measurement for the sake of measurement is therefore self-defeating, as it delivers very little value. This is why it is the "quality" part that must be focused on. But what does this mean?
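
As a minimal sketch of the difference between counting defects and articulating their impact, the Python snippet below measures the inconsistency of a package_type column and then turns the defect count into a rough money figure. The sample records, the list of allowed values and the cost per defective record are all hypothetical; in practice each organisation would have to establish them for itself.

```python
# Hypothetical sample of product records; only the package_type
# attribute name is taken from the article.
products = [
    {"sku": "A1", "package_type": "BOX"},
    {"sku": "A2", "package_type": "box"},
    {"sku": "A3", "package_type": "Bx"},
    {"sku": "A4", "package_type": "BAG"},
    {"sku": "A5", "package_type": ""},
]

# Assumption: the business has agreed on this reference list of values.
ALLOWED_PACKAGE_TYPES = {"BOX", "BAG", "DRUM"}

# Assumption: average cost (rework, returns, delays) of one defective record.
COST_PER_DEFECT = 12.50


def inconsistency_rate(records, field, allowed):
    """Return (defect_count, rate) for values outside the allowed set."""
    defects = sum(1 for r in records if r.get(field) not in allowed)
    rate = defects / len(records) if records else 0.0
    return defects, rate


defect_count, rate = inconsistency_rate(
    products, "package_type", ALLOWED_PACKAGE_TYPES
)
estimated_cost = defect_count * COST_PER_DEFECT

print(f"package_type inconsistency: {rate:.0%}")
print(f"estimated cost of defects: {estimated_cost:.2f}")
```

The first figure on its own invites the "So what?" response; the second is the one that gives somebody a reason to act, provided the cost assumption behind it has been agreed with the people who bear it.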

How dirty is dirty?

Data is multifaceted, making any attempt to measure its quality a potentially difficult and complex task. However, bearing in mind that the quality of any product is determined by its users helps shift the focus from merely measuring the actual object (in our case data) to measuring the impact of its lack of quality on users (in our case our employees, partners and customers). This provides objectivity to the measurement of data quality, in that it serves as a guide to which aspects of the data should be assessed. This means understanding the information value chains that span our business processes, and the value of quality data to those who depend on it to execute their daily job effectively in serving the needs and goals of the organisation. It also means that what is measured will be different for different data sets, divisions, companies and end users, and might well also be different from one month or year to the next. Bottom line: don’t just measure because it can be measured; measure what matters!

Metres, litres, kilograms or feet?

Given the above discussion, it would be almost impossible to define a set of generic and standard dimensions across which any database’s quality could be measured. Most practitioners speak of typical dimensions such as "completeness", "conformity", "integrity" and the like, but in the end it’s up to an organisation either to follow the guidelines of its chosen data quality methodology, or to define its own meaningful dimensions. Whatever you call them, it is critical that labels, and more importantly their meaning, are used with absolute consistency (no pun intended!). It is counter-productive to have one department speaking about data being "acceptable" whilst another uses the term "valid" to mean the same thing. Believe it – people are incredibly subjective when it comes to the semantics of what would otherwise appear to be simple terminology!

Accuracy and structure

Remember too that, whilst many dimensions can be measured using software tools, there is one that can only be assessed manually. I will use the data quality guru Larry English’s term "accuracy", which is a comparison between the data and whatever it is supposed to represent in the real world. For example, software can tell you whether your customer Mr. Smith’s ID number, as stored in your database, is a valid ID number, but it takes a phone call or an email to check that it is indeed Mr. Smith’s ID number. Also ensure that all measurement is conducted at source, as close as possible to the point of original creation, in order to get a true picture of the issues contributing to poor quality. Finally, data quality measurement should also include an assessment of data models and of metadata, as these aspects of databases can also contribute to poor data quality.
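
The distinction between what software can check and what only a person can confirm might be sketched as follows. The 13-digit ID format, the record layout and the confirmed_with_customer flag are assumptions made purely for illustration; they are not taken from any particular system.

```python
import re

# Assumption for illustration only: an ID number is exactly 13 digits.
ID_PATTERN = re.compile(r"[0-9]{13}")


def is_valid_id(id_number: str) -> bool:
    """Validity: does the stored value conform to the expected format?"""
    return bool(ID_PATTERN.fullmatch(id_number))


def accuracy_status(record: dict) -> str:
    """Accuracy: has a person confirmed the value against the real world?

    Software can only read the (hypothetical) confirmation flag; setting
    it requires the phone call or email described in the text.
    """
    if not is_valid_id(record["id_number"]):
        return "invalid"
    if record.get("confirmed_with_customer"):
        return "valid and confirmed accurate"
    return "valid, accuracy unknown"


smith = {
    "name": "Mr. Smith",
    "id_number": "1234567890123",
    "confirmed_with_customer": False,
}
print(accuracy_status(smith))  # prints "valid, accuracy unknown"
```

A record that passes the format check can still belong to somebody else entirely, which is why the sketch reports "accuracy unknown" rather than treating validity as proof.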

Conclusion

The measurement of the quality of your data is key to many aspects of a data quality programme:

it provides an initial benchmark

it helps to raise awareness of the effects of data non-quality

it facilitates the discovery of defective or poor business processes (the very essence of data quality – see my previous articles)

on a regular, ongoing basis, it confirms that the corrective steps taken to improve data quality are still effective and that quality remains in control

it helps to sustain a culture of quality
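
The benchmark and ongoing-control points in the list above could be sketched roughly as follows. The monthly defect rates and the target rate are invented figures for illustration; the first month is treated as the initial benchmark.

```python
# Hypothetical monthly defect rates (fraction of records failing the
# agreed checks). The first month serves as the initial benchmark.
monthly_defect_rates = {
    "2024-01": 0.21,
    "2024-02": 0.12,
    "2024-03": 0.08,
    "2024-04": 0.15,
}

# Assumption: target rate agreed once corrective steps were in place.
TARGET_RATE = 0.10


def months_out_of_control(rates, target):
    """Return the months after the benchmark whose rate exceeds the target."""
    months = sorted(rates)
    return [m for m in months[1:] if rates[m] > target]


print(months_out_of_control(monthly_defect_rates, TARGET_RATE))
# prints ['2024-02', '2024-04']
```

A month flagged here is a prompt to look at the business process that produced the data, not merely at the data itself.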

Because a data quality initiative is forever, and not just a once-off project, good metrics become a means of consistent communication throughout the organisation, helping everyone to understand and to feel empowered. As the old saying goes: "To measure is to know".