Tag: data imputation python

In the last post, I wrote about how to deal with missing values in a dataset. Honestly, that post is related to my PhD project. I will not explain the detail of my project but I need to replace a certain of percentage (10,20,…90 %) of my dataset to NaN then impute all those NaN values. In that post, I did experiment using Boston dataset, a quite small dataset (13 dimensions and 506 rows/tuples).

I got problems when I did experiments with a bigger dataset, for instance, Flights dataset. This dataset consists of 11 dimensions and almost one million rows (tuples) which is a quite large number. Let see Figure below:

To replace 80% values to NaN in Flight dataset using Pandas operation, it takes around 469 seconds. It’s really slow. Moreover, in this case, I only work on 8 dimensions (only numerical attributes).

I guess there are some reasons, why it has a slow performance: 1) because of the code itself; 2) due to using Pandas for large number operations, or 3) due to both reasons.

I was trying to find the answer and I found two posts about comparison performance between Numpy and Pandas including when we should use Numpy and Pandas: ([1], [2])

After reading those posts, I decided to use Numpy instead of Pandas in my operation due to my dataset has a large number of tuples (almost one million tuples).

This is how I implement. This function below is a function for replacing values to NaN:

The code below is for missing values imputation. The code below is based on scikit learn example (scikit-learn has a function for imputing missing values). I imputed all numerical missing values with mean and all categorical missing values with the most frequent values:

In Figure above, it can be seen that I converted Pandas data frame to numpy array. Just use this command

data = df.values

, your data frame will be converted to numpy array. Then I run the dropout function when all data in the form of numpy array. In the end, I re-converted again the data to Pandas dataframe after the operations finished.

Using Numpy operation to replace 80% data to NaN including imputing all NaN with most frequent values only takes 4 seconds. Moreover, in this case, I work on 11 dimensions (categorical and numerical attributes).

From this post, I just want to share to you that your choice matters. When we want to deal with a large number of tuples, we may consider choosing numpy instead of pandas. However, another important thing is no one can write optimized code!!

Incomplete data or a missing value is a common issue in data analysis. Systems or humans often collect data with missing values. Actually, we can do data analysis on data with missing values, it means we do not aware of the quality of data. However, it may produce the wrong results because of those missing values. The common approach to deal with missing value is dropping all tuples that have missing values. The problem with this dropping approach is it may generate bias results especially if the rows that contain NaN values are large, while in the end, we have to drop a large number of tuples. This way can be used if the data has a small number of missing values. In the case of data with a large number of missing values, we have to repair those missing values.

There are a lot of proposed imputation methods for repairing missing values. The simplest one is to repair missing values with the mean, median, or mode. It can be the mean of whole data or mean of each column in the data frame.

In this experiment, we will use Boston housing dataset. The Boston data frame has 506 rows and 14 columns. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The Boston house-price data has been used in many machine learning papers that address regression problems. MEDV attribute is the target (dependent variable), where others are independent variables. This dataset is available in the scikit-learn library, so we can just import it directly. As usual, in this experiment, I am going to use Python Jupyter notebook. If you are not familiar with Jupyter Notebook, Pandas, Numpy, and other python libraries, I have a couple of old posts that may useful for you: 1) setup anaconda 2) understand python libraries for data science.

Using the code above, we can replace some values (20%) in Boston dataset to NA. We also can change the percentage of NA by changing the code above(see .2*). Na values are absolutely random with respect to the whole data. Then, now check again is there any missing values in our boston dataset?

boston.isnull().sum()

The result shows that all columns have around 20% NaN values. Then how to replace all those missing values (impute those missing values) based on the mean of each column?

Now, use command boston.head() to see the data. We have fixed missing values based on the mean of each column. We also can impute our missing values using median() or mode() by replacing the function mean().

This imputation method is the simplest one, there are a lot of sophisticated algorithms (e.g., regression, monte carlo, etc) out there that can be used for repairing missing values. Maybe I’ll post it next time.