Dirty Data: Dirt Detectors

So far this week we’ve covered common types of dirty data, including duplicates, missing values and inconsistencies. There are, of course, many more ways your data can get dirty! Rather than enumerating every type of dirty data, it helps to have a general strategy for checking your data. One great way to do this is cross-checking.

Conceptually, cross-checking is very simple. You use two or more different sources of data to ensure they all report the same values. For example, you might have data on the number of purchases from your website in both your analytics system (tracking the user actions) and also in your payments system (tracking monetary transactions). By comparing these two metrics, which are really tracking the same thing, you can tell if there are dirty data issues.
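
To make this concrete, here is a minimal sketch of the purchases example in Python. The record layout, the `day` field, the function names and the sample data are all made up for illustration; a real analytics or payments system will expose its data differently. The idea is simply to count purchases per day in each source and report the days where the counts differ.

```python
from collections import Counter


def count_by_day(records, day_key="day"):
    """Count how many records fall on each day (hypothetical 'day' field)."""
    return Counter(record[day_key] for record in records)


def cross_check(analytics_events, payment_transactions):
    """Return {day: (analytics_count, payments_count)} for days that disagree."""
    analytics = count_by_day(analytics_events)
    payments = count_by_day(payment_transactions)
    mismatches = {}
    # Look at every day seen in either source, so a day that is
    # missing entirely from one source is also reported.
    for day in sorted(set(analytics) | set(payments)):
        a = analytics.get(day, 0)
        p = payments.get(day, 0)
        if a != p:
            mismatches[day] = (a, p)
    return mismatches


# Made-up sample data, for illustration only.
analytics_events = [
    {"day": "2024-01-01", "user": "u1"},
    {"day": "2024-01-01", "user": "u2"},
    {"day": "2024-01-02", "user": "u3"},
]
payment_transactions = [
    {"day": "2024-01-01", "amount": 10.0},
    {"day": "2024-01-01", "amount": 25.0},
    {"day": "2024-01-02", "amount": 5.0},
    {"day": "2024-01-02", "amount": 5.0},  # possibly a duplicate charge
]

mismatches = cross_check(analytics_events, payment_transactions)
for day, (a, p) in mismatches.items():
    print(f"{day}: analytics={a}, payments={p}")
```

With this sample data, the check flags 2024-01-02, where analytics shows one purchase and payments shows two. Note that it only tells you the sources disagree on that day, not which one is right.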

In practice, cross-checking is harder than it sounds because of one simple question: if two of your data sources disagree, which one is correct? There is no easy answer; you will need to fall back on some of the techniques we’ve discussed earlier this week. The good news is that you know there is a problem! In short, cross-checking is useful for detecting data issues, but it won’t solve them for you.

There are a number of other techniques for detecting and correcting dirty data that I didn’t have time to cover this week, but I encourage you to read about them: