I hate to say this, but we build models and discover business insights off of faulty data. We like to think that the data we work with is pristine, flawless, perfect, but often it's just plain dirty. The exceptions to this rule are often similar to the Iris Data Set: complete and probably perfect, but minuscule in comparison to the size of modern-day data spaces.

There are three main things that make data dirty:

Missing Values

Incorrect Domain Values

False Data

We can often identify the first two during data exploration. Nulls and instances like a work tenure of 2107 years jump out fairly easily. Sometimes we're able to update these with a ground truth from elsewhere in the data space. Other times we do data cleansing by removing rows or replacing values as appropriate.
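As a rough illustration of how the first two kinds of dirty data surface during exploration, here is a minimal sketch. The records, the field names, and the 60-year upper bound on tenure are all assumptions made up for this example, not rules from any real data set.

```python
# Hypothetical records; field names and values are invented for illustration.
records = [
    {"employee_id": 1, "tenure_years": 4.5},
    {"employee_id": 2, "tenure_years": None},  # missing value
    {"employee_id": 3, "tenure_years": 2107},  # incorrect domain value
    {"employee_id": 4, "tenure_years": 12},
]

MAX_PLAUSIBLE_TENURE = 60  # assumed upper bound, in years


def classify_tenure(value):
    """Label a tenure value as 'missing', 'out_of_domain', or 'ok'."""
    if value is None:
        return "missing"
    if not 0 <= value <= MAX_PLAUSIBLE_TENURE:
        return "out_of_domain"
    return "ok"


def find_dirty_rows(rows):
    """Return (employee_id, problem) pairs for rows that fail a check."""
    problems = []
    for row in rows:
        status = classify_tenure(row["tenure_years"])
        if status != "ok":
            problems.append((row["employee_id"], status))
    return problems


dirty = find_dirty_rows(records)
print(dirty)
```

A check like this only catches values that look wrong. A tenure of 12 years that should have been 2 passes straight through, which is the third kind of dirty data.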

It is much harder to identify or fix the third type of dirty data: data that is non-observably incorrect. This may be fine in the day-to-day life of our data, but it is especially bad when we think about supervised learning problems.

Imagine that we poll 100 people, ask them, "Do you think you're happier than the average person?", and get the distribution below:

We have 71 hypothetical people who see themselves as happier than average and 29 who see themselves as less happy than average. But how many of those 71 were actually just uncomfortable saying they were unhappy in a survey? And are there any among our less-happy-than-average people who are just humble and would be uncomfortable saying they were better than average in a survey?

| What They Said | What They Felt: Yes | What They Felt: No | Total |
|---|---|---|---|
| Yes | ? | ? | 71 |
| No | ? | ? | 29 |
| Total | ? | ? | 100 |
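To see how much the unknown cells in the table above depend on how people misreport, here is a small sketch that fills them in under assumed misreport rates. The two rates (5% of people who felt "yes" saying "no", and 20% of people who felt "no" saying "yes") are invented for illustration; the poll itself tells us nothing about them.

```python
def fill_said_vs_felt(said_yes, said_no, p_humble, p_uncomfortable):
    """Estimate the hidden cells of the said-vs-felt table.

    p_humble: assumed share of people who felt "yes" but said "no".
    p_uncomfortable: assumed share of people who felt "no" but said "yes".
    Both rates are assumptions; they cannot be read off the poll itself.
    """
    total = said_yes + said_no
    denominator = 1 - p_humble - p_uncomfortable
    if denominator <= 0:
        raise ValueError("misreport rates must sum to less than 1")
    # Solve said_yes = felt_yes * (1 - p_humble)
    #                  + (total - felt_yes) * p_uncomfortable for felt_yes.
    felt_yes = (said_yes - total * p_uncomfortable) / denominator
    felt_no = total - felt_yes
    return {
        "said_yes_felt_yes": felt_yes * (1 - p_humble),
        "said_yes_felt_no": felt_no * p_uncomfortable,
        "said_no_felt_yes": felt_yes * p_humble,
        "said_no_felt_no": felt_no * (1 - p_uncomfortable),
        "felt_yes_total": felt_yes,
        "felt_no_total": felt_no,
    }


# Hypothetical rates: 5% humble, 20% uncomfortable.
cells = fill_said_vs_felt(71, 29, p_humble=0.05, p_uncomfortable=0.2)
for name, value in cells.items():
    print(f"{name}: {value:.1f}")
```

Different assumed rates give a very different split for the same 71 and 29, and some combinations of rates produce counts outside the range 0 to 100, which means those rates cannot explain the observed answers.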

We could build a model on this data set that is perfect in prediction and has a high F-score. But depending on how deceptive our poll-takers were feeling, we may end up with a model that is bad at actually predicting happiness. Perhaps that is too existential an example. But in our businesses, do we want to predict the real answers or the answers skewed by false data?

This gives us something to think about in all of our predictive models. When we have a model with recall of 70%, is it that we are missing 30% of the category, or that 30% of the category isn't actually that category? I hypothesize that it's likely somewhere in between.
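A small worked example makes the "somewhere in between" case concrete. Everything here is hypothetical: 100 rows recorded as belonging to the category, an assumed 20 of them mislabeled, and a model that flags 70 of the correctly labeled rows.

```python
def recall(labels, predictions):
    """Share of rows labeled 1 that the model also predicted as 1."""
    true_positives = sum(
        1 for y, p in zip(labels, predictions) if y == 1 and p == 1
    )
    actual_positives = sum(labels)
    return true_positives / actual_positives


# 100 hypothetical rows, all recorded as belonging to the category.
recorded = [1] * 100
# Assumed truth: the last 20 rows were mislabeled and do not belong.
truth = [1] * 80 + [0] * 20
# The model flags the first 70 rows and skips the remaining 30.
predicted = [1] * 70 + [0] * 30

recall_vs_recorded = recall(recorded, predicted)
recall_vs_truth = recall(truth, predicted)
print(recall_vs_recorded, recall_vs_truth)
```

Against the recorded labels the model misses 30 rows. Under the assumed truth, 20 of those misses are rows that never belonged to the category, so only 10 are real misses. In practice we rarely know the truth column, which is why the two explanations are hard to separate.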

So what can we do?

Think about how much you trust your training data set. Do you think most of the values are right? Was anything done by hand when assigning those values? Are there assumptions being made to assign the predictive variable?

Cluster! Without including what you are trying to predict, cluster your data. Then look at how many of each predictive value end up in each cluster. Have a cluster with 90% churn customers and 10% non-churn? Those non-churners are probably worth looking into.

Look at what is going wrong. Digging into instances where your model and the ground truth disagree is a great way to improve the model, improve the ground truth, and discover new business insights.
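The clustering check described above can be sketched as follows. This is only an illustration under stated assumptions: the customers, the two features (monthly usage and support tickets, assumed to be on comparable scales), and the labels are invented, and the hand-rolled k-means stands in for whatever clustering method you would normally use.

```python
from collections import Counter

# Hypothetical customers: ((monthly_usage, support_tickets), recorded label).
customers = [
    ((1.0, 9.0), "churn"),
    ((9.0, 1.0), "stay"),
    ((2.0, 8.0), "churn"),
    ((8.0, 2.0), "stay"),
    ((1.5, 8.5), "churn"),
    ((9.5, 1.5), "stay"),
    ((2.5, 9.5), "churn"),
    ((8.5, 0.5), "stay"),
    ((1.0, 8.0), "stay"),  # sits among the churners but is labeled "stay"
    ((9.0, 2.0), "stay"),
]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means using the first k points as starting centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Cluster on the features only; the label is deliberately left out.
points = [features for features, _ in customers]
labels = [label for _, label in customers]
assignments = kmeans(points, k=2)

# Count how many of each label landed in each cluster.
composition = {c: Counter() for c in range(2)}
for cluster, label in zip(assignments, labels):
    composition[cluster][label] += 1

# Flag rows whose label disagrees with the majority label of their cluster.
suspects = []
for i, (cluster, label) in enumerate(zip(assignments, labels)):
    majority_label, _ = composition[cluster].most_common(1)[0]
    if label != majority_label:
        suspects.append(i)

print(composition)
print(suspects)
```

The flagged rows are not proven errors. They are simply the customers who look like one group and are labeled as another, which makes them a sensible place to start a manual review.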