Tag: data imputation pandas

Incomplete data or a missing value is a common issue in data analysis. Systems or humans often collect data with missing values. Actually, we can do data analysis on data with missing values, it means we do not aware of the quality of data. However, it may produce the wrong results because of those missing values. The common approach to deal with missing value is dropping all tuples that have missing values. The problem with this dropping approach is it may generate bias results especially if the rows that contain NaN values are large, while in the end, we have to drop a large number of tuples. This way can be used if the data has a small number of missing values. In the case of data with a large number of missing values, we have to repair those missing values.

There are a lot of proposed imputation methods for repairing missing values. The simplest one is to repair missing values with the mean, median, or mode. It can be the mean of whole data or mean of each column in the data frame.

In this experiment, we will use Boston housing dataset. The Boston data frame has 506 rows and 14 columns. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The Boston house-price data has been used in many machine learning papers that address regression problems. MEDV attribute is the target (dependent variable), where others are independent variables. This dataset is available in the scikit-learn library, so we can just import it directly. As usual, in this experiment, I am going to use Python Jupyter notebook. If you are not familiar with Jupyter Notebook, Pandas, Numpy, and other python libraries, I have a couple of old posts that may useful for you: 1) setup anaconda 2) understand python libraries for data science.

Using the code above, we can replace some values (20%) in Boston dataset to NA. We also can change the percentage of NA by changing the code above(see .2*). Na values are absolutely random with respect to the whole data. Then, now check again is there any missing values in our boston dataset?

boston.isnull().sum()

The result shows that all columns have around 20% NaN values. Then how to replace all those missing values (impute those missing values) based on the mean of each column?

Now, use command boston.head() to see the data. We have fixed missing values based on the mean of each column. We also can impute our missing values using median() or mode() by replacing the function mean().

This imputation method is the simplest one, there are a lot of sophisticated algorithms (e.g., regression, monte carlo, etc) out there that can be used for repairing missing values. Maybe I’ll post it next time.