
Keep calm & cleanse your data first


As Data Analysts, we flaunt our skills: analysing data with different tools, building funky charts from it, predicting the future with it. Yet, to my surprise, we all gloss over the one step that is a must for making any sense of data. Yes, the one big step before analysis, the one that takes the major chunk of our time: DATA CLEANING!

I have never seen anyone flaunt their data-cleaning skills. Data analysts are often said to follow an 80-20 rule: 80% of their time goes into data cleaning, and the remaining 20% into data analysis, visualization, mining, warehousing, integration and every other data-related process.

One Step before Analysis

In a data analyst’s life, the biggest hurdle is data cleaning, because it is the one task that can be neither fully generalized nor automated. Every dataset is unique, and if the data is collected from the general public it becomes even more so. Don’t be surprised; I can prove it to you. Just play a small game: ask any five of your friends to fill in the following details about themselves, each one individually, without seeing how the others have filled theirs in.

You can observe how differently people write their names, phone numbers and states. These are just basic questions, and they already leave so much to clean. Now imagine a full questionnaire with 50 such fields filled in by over 200,000 candidates. That sounds like a big task, doesn’t it? That is why 80% of our time goes into cleaning the data.

Some of you might be wondering: what is the point of spending so much time on something that is not so cool? What if we skip this part and apply our algorithms directly to the data? Will it make any difference? The answer is a big YES. It makes a lot of difference. It is an absolute myth that you can run an algorithm over raw data and have useful insights pop up.

So, if we cannot skip this part, we should at least try to tackle it in a systematic manner. No two data-preparation jobs are the same, which is why automation is hard. Here I have jotted down a few of the techniques used most frequently in data cleaning. I like to call it a checklist for data cleaning.

Checklist for Data Cleaning:

Store your data in a data frame with suitable column names

Each column of your data should be consistent with one data type (like numeric, integer, character)

Converting variables (columns) to a suitable data type (e.g. if an age variable is stored as character, convert it to numeric so that you can apply the relevant mathematical operations to it)
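As a minimal sketch of this step in Python with pandas (the column names and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical survey answers: "age" arrives as strings, one of them a typo.
df = pd.DataFrame({"name": ["Asha", "Ravi", "Mei"],
                   "age": ["23", "31", "thirty"]})

# errors="coerce" turns unparseable entries into NaN instead of raising,
# so the whole column becomes numeric and supports arithmetic.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

print(df["age"].mean())  # mean of the two valid ages: 27.0
```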

Apply Date Conversions as per the requirement

Apply Character manipulations:

Removing leading or trailing white space

Trim strings to a certain width

Transform to upper/lower/proper case

Search for strings containing simple patterns (substrings)

Approximate matching procedures based on string distances

String normalization (transforming strings to a smaller set of string values which are more easily processed)
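The character manipulations above can be sketched in plain Python; the names and the misspelled state below are invented for illustration:

```python
import difflib

# Trim white space, collapse internal spaces, and apply proper case.
def normalize_name(s: str) -> str:
    s = s.strip()              # drop leading/trailing white space
    s = " ".join(s.split())    # collapse runs of internal spaces
    return s.title()           # proper case

raw_names = ["  priya sharma ", "PRIYA  SHARMA", "Priya Sharma"]
cleaned = {normalize_name(s) for s in raw_names}
print(cleaned)  # all three variants collapse to {'Priya Sharma'}

# Approximate matching by string distance: snap a misspelled
# state to the closest value in a known list.
states = ["Karnataka", "Kerala", "Punjab"]
match = difflib.get_close_matches("Karnatka", states, n=1)
print(match)  # ['Karnataka']
```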

Removing missing values/NAs from the dataset
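A minimal pandas sketch of dropping incomplete rows (the data is hypothetical):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ["Asha", "Ravi", None],
                   "age": [23, np.nan, 31]})

# dropna() removes rows with any missing field; pass subset=[...]
# to restrict the check to the columns that really matter.
complete = df.dropna()
print(len(complete))  # only the fully filled row survives: 1
```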

Removing/Trimming irrelevant variables

Adding new variables into the dataset using variables which are already present

Detection of inconsistencies, i.e. checking for information that violates basic logic (like a phone number containing alphabets)
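Such a logic check can be sketched with a regular expression (the phone values are fictitious):

```python
import re

# Flag phone entries that contain alphabetic characters.
phones = ["9876543210", "98765abc10", "012-345-6789"]
invalid = [p for p in phones if re.search(r"[A-Za-z]", p)]
print(invalid)  # entries that need manual review: ['98765abc10']
```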

Selection of the field causing an inconsistency (e.g. if the state is Texas, the country cannot be Canada; in such a case it is not clear whether the state or the country is wrong)

Simple transformations, i.e. finding patterns in the data and transforming them accordingly (a phone number can be written in many formats, such as (172)-64**82, 0172647**82 or +92-172-647**82, but they are all the same number, so we find the patterns and replace them with one standard format)
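One way to sketch this standardization in Python; the number, the trunk prefix and the +91 country code below are assumptions for illustration only:

```python
import re

# Three hypothetical spellings of one fictitious phone number.
variants = ["(172)-6471234", "01726471234", "+91-172-647-1234"]

def standardize_phone(s: str) -> str:
    digits = re.sub(r"\D", "", s)   # strip every non-digit
    digits = digits.lstrip("0")     # drop a leading trunk zero
    if digits.startswith("91"):     # drop the assumed +91 country code
        digits = digits[2:]
    return digits

print({standardize_phone(v) for v in variants})  # one canonical form
```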

Converting to a single unit (e.g. if a height variable holds heights in centimetres, feet and inches, first convert them all to one unit and then proceed with any computations)
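A plain-Python sketch of such a conversion, assuming the three hypothetical entry formats shown below:

```python
import re

# Heights recorded in mixed units (hypothetical entries).
heights = ["170 cm", "5 ft 7 in", "1.65 m"]

def to_cm(s: str) -> float:
    nums = [float(x) for x in re.findall(r"[\d.]+", s)]
    if "ft" in s:                           # feet and inches
        return round(nums[0] * 30.48 + nums[1] * 2.54, 1)
    if "cm" in s:                           # already centimetres
        return nums[0]
    return round(nums[0] * 100, 1)          # metres -> centimetres

print([to_cm(h) for h in heights])  # [170.0, 170.2, 165.0]
```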

Deterministic imputation, i.e. sometimes a missing value of one variable can be derived from other variables (e.g. if the state of an entry is Karnataka and the pin code is 560102, the city must be Bangalore)
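The pin-code example above can be sketched as a simple lookup (the small mapping table is a stand-in for a real pin-code directory):

```python
# Known pin-code -> city pairs; 560102 -> Bangalore, as in the example above.
pin_to_city = {"560102": "Bangalore", "110001": "New Delhi"}

records = [{"state": "Karnataka", "pin": "560102", "city": None},
           {"state": "Delhi", "pin": "110001", "city": "New Delhi"}]

# Fill a missing city deterministically from the pin code.
for r in records:
    if r["city"] is None:
        r["city"] = pin_to_city.get(r["pin"])

print(records[0]["city"])  # Bangalore
```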

Checking for Outliers
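One common convention for this check is the interquartile-range rule, sketched here on made-up ages:

```python
import statistics

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the usual IQR rule).
def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

ages = [21, 22, 23, 24, 25, 26, 27, 99]
print(iqr_outliers(ages))  # [99]
```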

This checklist is definitely not exhaustive and could hold many more points. If you have other data-cleaning techniques, please mention them in the comments. Let’s make this checklist bigger and better so that we can all save our precious time.


