Dropping missing values is sub-optimal because when you drop observations, you drop information.

The fact that the value was missing may be informative in itself.

Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!

Imputing missing values is sub-optimal because filling in a value hides the fact that it was originally missing. That leads to a loss of information, no matter how sophisticated your imputation method is.

Again, "missingness" is almost always informative in itself, and you should tell your algorithm if a value was missing.

Even if you build a model to impute your values, you’re not adding any real information. You’re just reinforcing the patterns already provided by other features.

Missing data is like missing a puzzle piece. If you drop it, that’s like pretending the puzzle slot isn’t there. If you impute it, that’s like trying to squeeze in a piece from somewhere else in the puzzle.

In short, you should always tell your algorithm that a value was missing because missingness is informative.

So how can you do so?

The right technique depends on whether the feature is categorical or numeric.

Missing categorical data

The best way to handle missing data for categorical features is to simply label the missing values as 'Missing'!

You’re essentially adding a new class for the feature.

This tells the algorithm that the value was missing.

This also gets around the technical requirement of no missing values, since many algorithm implementations will not accept missing values as input.
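As a minimal sketch of this labeling step, assuming the data lives in a pandas DataFrame: the toy data and the column name "exterior" below are made up for illustration and are not from the original guide.

```python
import pandas as pd

# Hypothetical toy data; the column name "exterior" is an assumption.
df = pd.DataFrame({"exterior": ["Brick", None, "Wood", None, "Brick"]})

# Replace every missing entry with an explicit 'Missing' class.
df["exterior"] = df["exterior"].fillna("Missing")

# 'Missing' now shows up as its own category alongside the real ones.
print(df["exterior"].value_counts())
```

After this step, 'Missing' is treated like any other class, so it gets its own column if you later one-hot encode the feature.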

Missing numeric data

For missing numeric data, you should flag and fill the values.

First, flag the observation with an indicator variable that records whether the value was missing.

Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.
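The two steps might look like the following in pandas. This is only a sketch: the toy data and the column names "lot_size" and "lot_size_missing" are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "lot_size" is an assumed column name.
df = pd.DataFrame({"lot_size": [8000.0, np.nan, 12000.0, np.nan]})

# 1. Flag: the indicator is 1 where the value was originally missing, else 0.
df["lot_size_missing"] = df["lot_size"].isna().astype(int)

# 2. Fill: replace the missing values with 0 so no missing values remain.
df["lot_size"] = df["lot_size"].fillna(0)

print(df)
```

The order matters here: the flag has to be created before the fill, because once the values are filled there is nothing left to tell you which ones were missing.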

By using this technique of flagging and filling, you are essentially allowing the algorithm to estimate the optimal constant for missingness, instead of just filling it in with the mean.
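To see what "estimating the constant" can mean in practice, here is a small worked example. It assumes a plain linear regression fit with NumPy least squares, which is a choice made for this illustration only, and the numbers are invented. The feature has already been flagged and filled with 0.

```python
import numpy as np

# Hypothetical toy data: the last two rows had a missing feature value.
x_filled = np.array([1.0, 2.0, 3.0, 4.0, 0.0, 0.0])  # missing values filled with 0
flag = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])      # 1 = value was originally missing
y = np.array([3.0, 5.0, 7.0, 9.0, 6.0, 8.0])         # target

# Design matrix with columns: intercept, filled feature, missingness flag.
X = np.column_stack([np.ones_like(x_filled), x_filled, flag])

# Ordinary least squares fit.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope, flag_coef = coef

print("intercept:", intercept)
print("slope:", slope)
print("flag coefficient:", flag_coef)

# Prediction for any row where the value was missing (feature = 0, flag = 1).
pred_missing = intercept + flag_coef
print("prediction when missing:", pred_missing)
```

In this toy setup, the observed rows follow y = 2x + 1 exactly, so the fit recovers an intercept of about 1 and a slope of about 2. The flag coefficient comes out at about 6, which makes the prediction for rows with a missing value about 7, the average target among those rows. The model chose that adjustment from the data, rather than having it dictated by whatever value you filled in.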

Checkpoint Quiz

After properly completing the Data Cleaning step, you'll have a robust dataset that avoids many of the most common pitfalls.

This can really save you from a ton of headaches down the road, so please don't rush this step.

Here's a quick quiz to check that you got everything:

What are 2 types of unwanted observations to remove from the start?

What are 3 types of structural errors to look out for?

How should you handle missing data?

Why is it sub-optimal to drop observations with missing data or impute missing values?

You can keep track of your answers in the Companion Worksheet, which also has an answer key at the end.